Feature
Software: the Achilles' heel of network processors?
The difficulty of programming network processors and coprocessors has for the past few years tripped up the success of these devices. Has anything significant changed, or is this market fated to continue to only slowly stumble forward?
By Nicholas Cravotta, Technical Editor -- EDN, 4/11/2002

Programming network processors and coprocessors is a difficult task. Many companies admit that their previous-generation parts were too difficult to program and are investing heavily in creating better development tools. The challenge is daunting for several reasons. First, the applications themselves are complex, and their design requires great ingenuity to develop techniques to process packets at wire speed. Another challenge is that the devices differ enough from one another to allow no coherent classification; for example, every "forwarding engine" assumes a different level of function and offers different means for offloading tasks to other devices to accelerate processing. Also, getting these devices to communicate and make full use of each other's unique features can take weeks to months. And, as much as the data plane (wire-speed processing) and control plane (system-level management) are independent, they must work together intimately.
One of the primary bottlenecks to design is learning how to use the various devices available. Each device or software suite takes a different approach to network processing, offering unique methods for solving certain portions of the problem. The programmability of many network processors and coprocessors lets you build on these methods to create your own methods for approaching these problems. This situation means that software plays an increasingly important role.
Development tools and off-the-shelf software are key differentiators of ICs. As the race continues toward simplifying the programming of these devices, many vendors have come to rely on C as a way to avoid the complexities of assembly or microcode by abstracting the problem of understanding the internal workings of a device or the complexities of having two devices communicate with each other. And all vendors are turning more of their R&D dollars toward developing tools beyond assemblers and debuggers.
Those vendors that offer C tools claim that developing in C is significantly easier than developing in assembly or microcode. They claim that their compilers create code that is on par with handwritten code. They also claim that programmers familiar with C will somehow have an advantage over programmers who aren't.
The reality is that few of the vendors actually support C. First, the C they are citing has been stripped of huge blocks of features, such as floating-point instructions or pointers. Internally, the structure is also different. For example, functions are often resolved as inline code. Next, the C may offer a superset of specialized instructions that optimize certain tasks. These instructions may be intrinsics, pragmas, extensions, or macros that give you "direct" access to the hardware architecture or function calls to inline assembly blocks or direct commands to a hardware accelerator. At this point, you have to consider whether this C is a simple extension of C or more of a C**, a new language only somewhat related to C.
Some vendors pitch their versions of C** as languages with "guardrails" that won't let you do things you can't or shouldn't do. However, to understand how to avoid scraping these guardrails when you want your code to scream, you have to understand the programming model the assembly of the network processor expects. Note that this is not the programming model the C compiler expects. If the network processor really expects assembly and wasn't designed to support C-types of commands, the compiler forces you to program in a manner that results in less efficient code. The underlying hardware with its specialized acceleration engines may require a unique method of programming to take full advantage of these engines. If you use intrinsics to access these accelerators, then you're not using C anyway and have probably added a layer of abstraction and inefficiency to your code by not directly using inline assembly.
Using C does not reduce the complexity of the network-processing problem. You can give the task a C syntax, but this approach doesn't reduce the internal complexity. Using C doesn't mean you don't have to understand what you're doing. For example, table-look-up algorithms depend upon the type of hardware acceleration available. Some co-processors offer specialized functions that go beyond mere look-up and address issues, such as lowering power consumption by hashing look-up keys and segmenting large tables. Accessing these engines to their fullest capacity requires access to the engines themselves and understanding how they interact with the rest of the system. Using C cannot shield you from having to learn such details, but it can hinder you from taking full advantage of them. You may even go so far as to try to trick the compiler into doing things you want to do but don't know that you shouldn't be doing with the network processor. From another perspective, consider the differences between Java and C. On the surface, the two languages look similar. Yet, what is good programming in C is not necessarily good programming in Java. C allows the use of pointers, which are fundamental data-management tools. Java, on the other hand, has garbage collection, which colors many design decisions. In some cases, the fact that you know a similar language, C, may actually increase the time it takes you to learn a variant of the language, C**, as you struggle to "unlearn" common structures in C that are cumbersome in the world of C**.
C doesn't easily handle, for example, the basic task of bit manipulation. If you want to do a multi-tuple look-up, you need to extract fields from different parts of a packet to create a single look-up key. Likewise, the result of the look-up may contain several results that you need to extract and store. In such a case, C is not an optimal language. Pulling bits takes about as many lines of code as assembly would, and, because of the abstraction from the underlying hardware, you cannot easily take advantage of bit-manipulation engines. A representative from one vendor claims that bit manipulation is straightforward and simple in C, stating "testing the 15th bit in the second word of an IP header...would compile to no more than five machine instructions." Another vendor talks about using string-copy instructions to rip bits from a tag and compress them. Is this a good approach for programming or for execution?
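To make the bit-pulling concrete, here is a minimal sketch in portable C of the two tasks mentioned above: testing a single bit in the second word of an IP header and gathering a multi-tuple look-up key. The function names, the 13-byte key layout, and the choice to number bits from the most significant end are assumptions made for illustration; they do not come from any vendor's tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout: src IP, dst IP, src port, dst port, protocol. */
#define KEY_LEN 13

/* Read a big-endian 32-bit word from the packet. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Test one bit of the second 32-bit word of an IPv4 header.
 * Assumption: bit 0 is the most significant bit of the word. */
static int ip_word2_bit(const uint8_t *ip, unsigned bit)
{
    assert(bit < 32);
    return (int)((load_be32(ip + 4) >> (31u - bit)) & 1u);
}

/* Gather a 5-tuple key from an IPv4 packet carrying TCP or UDP.
 * Returns 0 on success, -1 if the packet is too short or not IPv4. */
static int build_5tuple_key(const uint8_t *pkt, size_t len,
                            uint8_t key[KEY_LEN])
{
    if (len < 20 || (pkt[0] >> 4) != 4)
        return -1;
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4;  /* header length in bytes */
    if (ihl < 20 || len < ihl + 4)
        return -1;
    memcpy(key, pkt + 12, 8);       /* source and destination addresses */
    memcpy(key + 8, pkt + ihl, 4);  /* source and destination ports */
    key[12] = pkt[9];               /* protocol */
    return 0;
}
```

Even in this small case, the key is assembled with byte copies and shifts on a general-purpose data path. Nothing in the source tells a compiler that a dedicated field-extraction engine could do the same work.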
Bit manipulation is a common enough task to warrant some kind of abstraction. Some vendors provide this abstraction via templates with header constants defined for particular protocols. Some have created microcode libraries optimizing protocol-specific functions, such as stripping fields to create a look-up key. Others offer C functions or pragmas that provide inline assembly for standard, common, protocol-specific functions using on-chip acceleration engines. However, if you want to do something nonstandard, such as creating a hash key for accelerating table look-up, you'll have to drill down to the assembly yourself. Some proprietary languages recognize the frequency of such operations by offering an instruction that generates look-up keys; its arguments are the positions of fields in the packet and the registers you want to store the results in. Such syntax allows you to create a key in one line of code and efficiently uses acceleration engines (see sidebar "Proprietary babble"). Note that none of these approaches may long remain a competitive advantage because other vendors will carry these advances to their next generation of tools as well.
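The idea of a key-generation instruction whose arguments are field positions can be approximated in plain C with a table of field descriptors. The sketch below is only an illustration of that idea: `struct field`, `get_bits`, and `make_key` are hypothetical names, the key is limited to 64 bits, and the loop runs in software rather than on any acceleration engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical field descriptor: where a field sits in the packet. */
struct field {
    uint16_t bit_offset;  /* from the start of the packet, MSB first */
    uint8_t  bit_width;   /* 1 to 32 bits */
};

/* Extract up to 32 bits starting at an arbitrary bit offset, MSB first. */
static uint32_t get_bits(const uint8_t *pkt, unsigned bit_offset,
                         unsigned width)
{
    assert(width <= 32);
    uint32_t v = 0;
    for (unsigned i = 0; i < width; i++) {
        unsigned b = bit_offset + i;
        v = (v << 1) | ((pkt[b / 8] >> (7u - b % 8)) & 1u);
    }
    return v;
}

/* Pack the listed fields back to back into a 64-bit key.
 * Returns 0 on success, -1 if a field is invalid, runs past the
 * packet, or would overflow the key. */
static int make_key(const uint8_t *pkt, size_t len,
                    const struct field *f, size_t nfields, uint64_t *key)
{
    unsigned used = 0;
    uint64_t k = 0;
    for (size_t i = 0; i < nfields; i++) {
        unsigned w = f[i].bit_width;
        if (w == 0 || w > 32 || used + w > 64 ||
            (size_t)f[i].bit_offset + w > len * 8)
            return -1;
        k = (k << w) | get_bits(pkt, f[i].bit_offset, w);
        used += w;
    }
    *key = k;
    return 0;
}
```

With the descriptors declared once, building a key is a single call such as `make_key(pkt, len, fields, 3, &key)`. The bit-by-bit loop underneath is exactly the kind of work a hardware key generator would do in one step.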
The underlying fallacy of the C-language argument is the vendors' claim that, as programmers move to a high-level language, the skill of the programmer has less impact on code than at the assembly level; that is, C evens the playing field. To some degree, abstraction frees programmers from understanding what is happening "under the hood" and enables them to focus on what they want the network processor to do. However, most designers didn't build their network processors with C in mind, and bridging a general-purpose language, such as C, to a highly specialized assembly is bound to result in inefficiencies. This restriction means that when you're trying to squeeze another few cycles from an algorithm, you'll find it more difficult to figure out how to pull bits out in 24 steps instead of 27 if you don't understand what the network processor can do and is doing. If the vendor offers no access or fails to maintain its assembly-level tools, you may be unable to recapture this lost efficiency.
C or assembly: a question of maturity
Despite these limitations, one can say a lot in favor of programming in C. Code optimized for a device can suddenly become less so when the vendor releases the next generation. For example, one of the old x86 variants took one less clock cycle to execute storage through the accumulator rather than directly through a register. When Intel released the next-generation chip, the role of the accumulator changed internally, and the position reversed: The direct code worked faster.
Getting too involved with the underlying architecture presents a temptation to many engineers to build code optimized for that architecture. After all, every engineer knows that a human coder is more efficient than a compiler. But you have to approach this situation from a system-level perspective. Is reducing a core code loop more efficient if it takes a programmer a week to do so? A week is more time than what many coprocessor vendors claim is necessary to construct a shim layer, which adapts the abstracted code to the available hardware without modifying the majority of the abstracted code. The trade-off takes a wider perspective than merely how fast code executes. Which is more important: running a block of code a few instructions faster or evaluating the performance of an additional coprocessor?
Using a higher level language that shields you from the details lets you more quickly sketch out an architecture. Again, using C sometimes results in significant inefficiencies. But these inefficiencies have a lower cost if you still have available overhead in the network processor. The trade-off becomes the ability to create code that leaves room for future products or testing several divergent architectures and profiling system-level inefficiencies to evaluate which overall approach will serve best in the long term (see Web-only sidebar "Migration patterns"). As long as the network-processor vendor has an available assembler, you can optimize key sections of code when you have time to take care of details. Regarding overhead, optimizing core code between generations of products frees up some headroom. Also, don't underestimate the value of next-generation devices (see Web-only sidebar "Silent but deadly"). In the time it takes you to optimize your code, the vendor may have released a chip that offers enough of a performance gain to obviate the optimization. Of course, you have to have faith in a vendor's road map to adopt this course (see Web-only sidebar "How well do you know your vendor?"). The goal to keep in mind is to create a finished product. Great code in a product that isn't finished before the R&D dollars run out is worth nothing (see Web-only sidebar "What's the real cost?").
Abstracting apples and oranges as fruit
Two years ago, many network-processor vendors loudly proclaimed that programming in their network processors' microcode was easy. Today, they are just as loudly proclaiming that vendors that expect you to code in microcode are making things difficult for you. All this marketing hype hides a truth: The single biggest step vendors can take to accelerate software development is to simplify the programming model (see Web-only sidebar "Some questions about software").
One popular method of simplifying programming is to create software abstracts of functions that map to various hardware elements. Thus, the programmer focuses on the application and avoids the perplexing details of hardware implementation. The general idea is that you can reduce any application to some significantly smaller number of core functions, much in the same way that the core instructions of a programming language form the basis of complex applications. For example, Internet Protocol Version 4 routing can be reduced to approximately 30 fundamental functions. One such function might modify a table entry in a forwarding table. Above this subset, the software need not understand how you modified a table—only that you can modify it. To port this application, you need to supply shim code, the code necessary to implement the core functions on the hardware platform. Some network processors supply code or libraries, and APIs access their functions. In such cases, shim code would bridge the two APIs, and you would have to write shim code for both the control- and the data-plane processors. Add a coprocessor, and you also add a shim. You may also need to build a shim for the operating system (see Web-only sidebar "Open-source operating systems"). This approach may leave you with several shim layers to create.
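One common way to structure such a shim in C is a table of function pointers that the portable application calls through. The sketch below assumes a hypothetical two-function interface (`fwd_ops`) and a software table standing in for a device. A real shim would translate the same calls into a vendor's API or coprocessor commands, and none of these names come from an actual product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "core function" interface the application is written to. */
struct fwd_ops {
    int (*set_route)(void *dev, uint32_t prefix, uint8_t len, uint16_t port);
    int (*lookup)(void *dev, uint32_t dst, uint16_t *port);
};

/* One possible shim: a plain software table standing in for hardware. */
#define SOFT_MAX_ROUTES 16
struct soft_table {
    struct { uint32_t prefix, mask; uint16_t port; uint8_t len; }
        r[SOFT_MAX_ROUTES];
    size_t n;
};

static uint32_t mask_of(uint8_t len)
{
    return len ? (uint32_t)(~UINT32_C(0) << (32 - len)) : 0;
}

/* Add a route, or modify the port of an existing identical prefix. */
static int soft_set_route(void *dev, uint32_t prefix, uint8_t len,
                          uint16_t port)
{
    struct soft_table *t = dev;
    if (len > 32)
        return -1;
    uint32_t mask = mask_of(len);
    prefix &= mask;
    for (size_t i = 0; i < t->n; i++) {
        if (t->r[i].len == len && t->r[i].prefix == prefix) {
            t->r[i].port = port;
            return 0;
        }
    }
    if (t->n == SOFT_MAX_ROUTES)
        return -1;
    t->r[t->n].prefix = prefix;
    t->r[t->n].mask = mask;
    t->r[t->n].len = len;
    t->r[t->n].port = port;
    t->n++;
    return 0;
}

/* Longest-prefix match by linear scan; returns -1 if nothing matches. */
static int soft_lookup(void *dev, uint32_t dst, uint16_t *port)
{
    const struct soft_table *t = dev;
    int best = -1;
    for (size_t i = 0; i < t->n; i++) {
        if ((dst & t->r[i].mask) == t->r[i].prefix &&
            (best < 0 || t->r[i].len > t->r[best].len))
            best = (int)i;
    }
    if (best < 0)
        return -1;
    *port = t->r[best].port;
    return 0;
}

static const struct fwd_ops soft_ops = { soft_set_route, soft_lookup };

/* Application code sees only the ops table, never the table layout. */
static int reroute(const struct fwd_ops *ops, void *dev,
                   uint32_t prefix, uint8_t len, uint16_t new_port)
{
    assert(ops && ops->set_route);
    return ops->set_route(dev, prefix, len, new_port);
}
```

Porting to different hardware then means supplying another `fwd_ops` table rather than touching `reroute` or its callers. The cost is an indirect call on every operation, which is why how thin the shim is matters most on the data plane.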
The admitted complexity of shim layers varies from vendor to vendor. Some claim that you can port to their APIs in a week. Others suggest allocating three to six months for the process. The longer figures are probably more accurate in that they consider such issues as the complexity of the other API you are creating the shim for; the learning curve for the tools and software; how much code in kilobytes versus megabytes you have to port; how long it takes to sort through the code and understand all the relevant dependencies, including both the data- and the control-plane shim to the operating system; and testing and validating the shim to the point that you consider it not only working, but also robust. If the vendor still claims that you can write the shim in a week, ask the marketers why the company's engineers haven't yet written it, or make writing the shim part of your purchase deal (see Web-only sidebar "Anatomy of a third-party network-processing-software vendor").
Some vendors offer "device-independent" APIs, meaning that you can use such software regardless of the hardware implementation you choose. This software to some degree shifts the onus of compatibility from software to hardware. Traditionally, software developers have written software to match the hardware. Now, you have to make the hardware match the software. To interface the control and data planes, this matching may mean that you may have to write a shim not on the control plane but on the data plane, where processing is most costly. If the shim is relatively thin (meaning not complex or process-intensive), this extra cost is negligible.
Device independence also challenges the premise that network processing is a system-level problem. Network processors and coprocessors as a rule offer features that differentiate them from other devices. To abstract functions means to limit the visibility of application code to capitalize on these unique features. Some functions, such as "search table," remain relatively constant across implementations. "Relatively" is the key word, however. The special features of each coprocessor can make all the difference in table look-up and management, often key differentiators among products. Some vendors address this issue by writing blocks of modular code and providing the various network processor- and coprocessor-specific shims as well as a wider API that can expose such features in the modular code. Such generalization, however, may add layers of inefficiency unless you can configure the modular code for this purpose. Also, look at the internal partitioning within the software to see how its designers tried to solve the problem; the paradigms they selected to some degree define the limits of your system-level performance.
On the network-processor side, many vendors have tried to simplify the programming model in various ways (see Web-only sidebar "The Network Processor Forum"). Several of the multiprocessor/multithreaded network processors have single-threaded programming models. In other words, you write your code as if your design had only one thread and one processor. The compiler and network processor take care of everything else for you. What sacrifice in performance do you make for this simplicity? There are many internal constraints, including access to memory, to coprocessors, and to internal buses, within a network processor. If the programmer has no idea what these constraints are, the code could challenge a compiler to optimally allocate these resources (see sidebar "Marketing mistruths and other lies").
At the onset of a design, it may be unclear which functions should run on which piece of hardware. Given the variety of coprocessors and kind and amount of processing each does, partitioning functions is difficult and often an after-the-fact experiment. An onboard coprocessor can process security functions, for example, inline. In this approach, all packets pass through an encryption engine before hitting the network processor. Alternatively, the coprocessor could pass these functions to a dedicated encryption "blade," or card. It's also unclear how much of, say, SSL (Secure Sockets Layer) processing should take place in the coprocessor, the control processor, and the network processor. Unfortunately, few tools exist for making system-level evaluations without actually requiring you to design the system. Orders-of-magnitude differences can result from system-level structural changes, so evaluating architectures becomes an intensive process.
By abstracting code, you retain some freedom in how to partition functions, especially if the API has several layers that you can peel back to the level you wish to work at (see sidebar "Customizing code: a serious prospect"). Thus, you have the choice of moving some control code to the data plane. This task is especially important depending on how you implement the data plane and what coprocessors and features each of these offers. In other words, if you spec a device that accelerates a function, does the API become a logical stranglehold preventing you from partitioning that function to the device that should handle it, rather than the device that the software architects decided should handle it?
Development tools
Few vendors still offer solely an assembler/debugger combo. Most vendors offer a development environment with a compiler, a debugger, a simulator, a profiler, and a traffic generator with a variety of reference designs from which to launch designs (see Web-only sidebars, "Simulator tools," "Profiler tools," "Traffic-generator tools," and "Some reference designs are more equal than others"). Those chips that aren't programmable come with the appropriate "configuration" tools to set the device up for an application. A few vendors offer frameworks, encompassing the traditional design environment. Finally, a short list of vendors offers tools that abstract the entire design cycle away from the hardware, topped with code-generation tools.
Tools at this highest level of abstraction map abstractions of functions to hardware or software. During initial modeling, you have the option of trying different mappings to test the efficiency of different hardware and software partitioning, as well as hardware devices. You can develop generic functions and later define the abstraction down to lower levels, such as modeling scheduling across multiple processors, threads, or both; shared-memory resources; and synchronization of elements, to name a few. A back-end compiler/assembler generates code for network processors and coprocessors, building a custom implementation from a generic description. How useful this code is depends upon the resolution with which you define your mapping and how cooperative the processor and coprocessor vendors are in supporting the mapping tool; that is, someone has to write the code or back-end compiler/assembler.
The difference between a framework and a development environment is that the framework is targeted for a specific application. Thus, the framework "understands", to some degree, that you are not just developing a network processing system, but that you're developing, say, an IPv4 system. Common aspects and concepts of the application are blended into the environment. Frameworks often provide a skeleton system, which directs or forms the basis for development, guiding the developer along a known course.
Even if a vendor offers several devices and a development environment that "seamlessly" integrates them, you may want only one part from the vendor and the ability to mix and match it with devices from other vendors. You should be able to use only those parts of the tool suite that apply to the device you choose. For the tools to be most useful, you want to be able to link the tools for the other devices without having to fight with scripts. It's worth checking to see whether you can add extensions to an environment. For example, you may want to analyze data in a manner that the tool doesn't currently support. Finally, let the vendor show you some of the extra features that make its tools a cut above the competition's (see Web-only sidebar "Bells and whistles").
Of course, the most important characteristic of a tool is how it helps you develop your design. High-level abstraction tools may not be useful to you if you've already decided upon your hardware architecture and have a code base to carry over. It could be more work figuring out how to abstract all these elements so that you can map them back to themselves than to simply develop more code.
Developing code for network processors today differs greatly from the process it was even two years ago. Since then, the IC vendors have realized that simply having amazing hardware is too little to go to market with. Difficult-to-use tools and the inefficiencies of poor abstraction threatened to make writing software the bane of network processors. Has much changed?
The answer is yes, on many levels. The tools and software have become surprisingly better, given the short time vendors have had to develop them. Is the change enough? That answer depends on your application and how much performance you need to squeeze from a grain of sand. You can't evaluate a network processor/coprocessor and its tool set by watching a demo. The only way to find out is to get your hands dirty with the tools.
Will the network-processing market fade away if the existing tools aren't enough? Too many companies have spent too much money to accept this possibility. If sufficient tools don't yet exist, then ASICs will continue to dominate the market for another year or so until they do arrive. The ability to spin new product lines without spinning new ASICs is simply too appealing a proposition.
(EDN is updating its Network Processors Web resource.)
| For more information... | ||
| When you contact any of the following manufacturers directly, please let them know you read about their products in EDN. | ||
| Agere Systems 1-800-372-2447 www.agere.com | Arc Cores Inc 1-408-437-3400 www.arc.com | Azanda Network Devices 1-408-720-3100 www.azanda.com |
| Bay Microsystems 1-408-653-2181 www.baymicrosystems.com | Broadcom Corp (SiByte) 1-949-450-8700 www.broadcom.com | ClearSpeed Technology (Pixelfusion) +44 0117 317 2000 www.clearspeed.com |
| Clearwater Networks (XStream Logic) 1-408-376-1500 www.xstreamlogic.com | Cognigine 1-510-743-4900 www.cognigine.com | Consystant 1-425-739-9927 www.consystant.com |
| Cypress Semiconductor (Lara Networks) 1-408-943-2600 www.cypress.com | Effnet Inc 1-650-390-8700 www.effnet.com | EZchip Technologies 1-408-879-7355 www.ezchip.com |
| Fast-Chip 1-408-523-8050 www.fast-chip.com | GlobespanVirata 1-888-855-4562 www.globespanvirata.com | Hifn 1-408-399-3500 www.hifn.com |
| IBM 1-650-694-3007 www.chips.ibm.com/products/wired | Improv Systems 1-978-927-0555 www.improvsys.com | Intel Corp 1-408-765-8080 www.intel.com |
| Internet Machines 1-818-575-2100 www.internetmachines.com | IP Infusion 1-408-794-1500 www.ipinfusion.com | Kawasaki LSI 1-408-570-0555 www.klsi.com |
| Lexra Inc 1-408-573-1890 www.lexra.com | LSI Logic 1-866-574-5741 www.lsilogic.com | LVL7 1-919-865-2700 www.lvl7.com |
| Marvell 1-408-222-2500 www.marvell.com | Micron Technology 1-208-368-4400 www.micron.com/tcam | Mindspeed Technologies 1-949-579-3000 www.mindspeed.com |
| AMCC Networks 1-408-731-1600 www.mmcnetworks.com | Mosaid Technologies 1-613-599-9539 www.mosaid.com | Motorola 1-800-521-6274 www.motorola.com |
| NEC Electronics 1-408-588-6000 www.necel.com | NetLogic Microsystems Inc 1-650-961-6676 www.netlogicmicro.com | NetPlane Systems Inc 1-781-329-3200 www.netplane.com |
| Paxonet Communications 1-510-770-2277 www.paxonet.com | PMC-Sierra 1-604-415-6000 www.pmc-sierra.com | RadiSys Corp 1-503-615-1100 www.radisys.com |
| Radlan 1-408-996-2121 www.radlan.com | SiberCore Technologies 1-613-271-8100 www.sibercore.com | Silicon Access Networks 1-408-545-1100 www.siliconaccess.com |
| Solidum Systems Corp 1-613-724-6004 www.solidum.com | Teja Technologies 1-408-288-2560 www.teja.com | Tensilica 1-408-986-8000 www.tensilica.com |
| Terago 1-408-941-9664 www.terago.com | TranSwitch Corp 1-203-929-8810 www.transwitch.com | Vitesse Semiconductor Corp 1-800-848-3773 www.vitesse.com |
| Wind River 1-510-748-4100 www.windriver.com | Xelerated +46 8 506 257 00 www.xelerated.com | Zettacom 1-408-869-7000 www.zettacom.com |
| Resources | ||
| Network Processing Forum www.npforum.org | EDN Network Processing | |
| Author Information |
You can reach Technical Editor Nicholas Cravotta at 1-510-558-8906, fax 1-510-558-8914, e-mail ednnick@pacbell.net. |