Feature
Narrow the scope
NPUs focus to improve performance.
By Nicholas Cravotta, Technical Editor -- EDN, 6/26/2003

Some NPU (network-processing-unit) vendors blame the economy for their lagging business. Others blame engineers' reluctance to move away from difficult-to-design and expensive-to-spin ASICs. No matter what the reason, the number of R&D dollars for evolving NPUs and their coprocessors dwarfs the paltry revenues vendors have so far seen.
The drive to eliminate the use of ASICs is substantial. If engineers thought a reasonable alternative existed, they would jump on it. The problem is that moving from ASICs to NPUs shifts the design headache from hardware to software. ASIC-design tools are more mature and easier to work with than the scanty network-processing-design tools and off-the-shelf software available. Have you heard the one about how many engineers it takes to make a wire-rate OC-48-line card using an NPU? The punch line's not funny.
The general-purpose-NPU oxymoron
Many NPU start-ups have criticized "general-purpose" NPUs as trying to be everything to everybody, resulting in another layer of legacy chains that, with unmanageable and inappropriate design tools, forces engineers to overcome design-environment constraints rather than focus on the problems at hand. Witness the rise of architectures in which an engineer might have to program and manage nearly 100 simultaneous threads.
Part of the problem is that early NPUs may have too soon brought full programmability to the table. Network processing is a hard engineering problem, and it's only going to get harder as the industry discovers the truly difficult challenges that today's surface challenges mask. Doubt exists about whether NPUs are viable alternatives to ASICs; after you look at the complexity of designing with and programming some NPUs, spinning an ASIC looks easy.
In time, full programmability will have its place. Perhaps it arrived early because various microprocessor vendors thought to solve network-processing problems using more of the existing technology. But more of the same is not the answer. Throwing more processors or cores with the wrong architecture that uses the wrong language with the wrong programming model just might not be the direction to take.
Many recent start-ups have focused on more configurable architectures, but their architects designed them with particular applications in mind. You can program such architectures to some degree, but only within the bounds of the applications their developers targeted. You have to strike a balance among efficiency in processing, efficiency in development, and cost, size, and power consumption. By defining a particular network-processing problem to solve, these vendors have allocated resources more efficiently.
Some vendors have taken these ideas to the extreme. For example, Xelerated's X10 family of NPUs has 200 cores, each executing a four-operation VLIW (very-long-instruction-word) instruction on a packet before handing off the packet to the next core. Comparing these types of processors to processors using traditional methods is difficult.
Partitioning
You still face a number of key considerations when designing with NPUs, no matter how clever the architecture you choose. For example, many vendors tell you that their processors can easily handle an OC-48 load at a certain level of servicing. However, they don't tell you how they expect the control-plane processor to take over some functions when the data plane becomes overloaded.
As your designs push to higher data rates, the amount of control information also increases. Some vendors of control-plane software functions have begun to call upon the data plane to take over functions such as aging and learning that can map well to some NPUs. It's a question of where you partition the control and data plane. You can easily partition your system such that the NPU can handle what you've determined is the data plane and create a control plane that no CPU can handle. For example, major sources of data-plane offloading are exceptions (see sidebar "Security issues"). The NPU handles the typical, common, well-defined protocol cases and hands over the hard, obscure, other-protocol packets to the control plane. Define too narrowly what the NPU can do, and exceptions will bring your system to its knees (see sidebar "Exception overloading"). The profiling of exception traffic is tricky. Also consider that, even if only 10% of your traffic today represents exceptions, that share grows as standards shift and the functions in your NPU do not.
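One rough way to sanity-check a partition is to estimate how many exception packets per second reach the control plane. The sketch below assumes a nominal OC-48 rate of about 2.488 Gbps, a fixed average packet size, and a control-plane capacity figure that you would have to measure yourself; the function names and parameters are illustrative and do not come from any vendor.

```python
OC48_BPS = 2.488e9  # nominal OC-48 line rate in bits per second (assumption)


def exception_pps(line_rate_bps, avg_packet_bytes, exception_fraction):
    """Exception packets per second handed from the NPU to the control plane."""
    total_pps = line_rate_bps / (avg_packet_bytes * 8)
    return total_pps * exception_fraction


def control_plane_overloaded(line_rate_bps, avg_packet_bytes,
                             exception_fraction, cpu_capacity_pps):
    """True if the exception stream exceeds what the control CPU can service."""
    load = exception_pps(line_rate_bps, avg_packet_bytes, exception_fraction)
    return load > cpu_capacity_pps


if __name__ == "__main__":
    # Hypothetical inputs: 64-byte packets, 10% exceptions, 200k pps CPU budget.
    print(exception_pps(OC48_BPS, 64, 0.10))
    print(control_plane_overloaded(OC48_BPS, 64, 0.10, 200_000))
```

Rerunning the estimate with a larger exception fraction shows how quickly a partition that works today can overload the control plane as traffic shifts.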
Interface bottlenecks
As data rates increase, device-interconnection interfaces have become a critical resource requiring careful management and protection. One challenge in budgeting interface bandwidth is "explosive editing," which occurs, for example, when you add a header, such as an MPLS (Multiprotocol Label Switching) header, to a packet. Your ingress might be 2.5 Gbps, but the egress now exceeds 2.5 Gbps.
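The arithmetic behind explosive editing is simple enough to sketch. The example below assumes every packet has the same size and gains the same number of bytes on the way out; the function name is hypothetical. An MPLS label-stack entry is 4 bytes, so small packets see the largest relative growth.

```python
def egress_rate_gbps(ingress_gbps, packet_bytes, added_bytes):
    """Egress rate when each packet grows by added_bytes between ingress and egress."""
    if packet_bytes <= 0:
        raise ValueError("packet_bytes must be positive")
    return ingress_gbps * (packet_bytes + added_bytes) / packet_bytes


if __name__ == "__main__":
    # 40-byte packets at 2.5 Gbps, one 4-byte MPLS label added per packet:
    # the egress side has to carry 10% more than the ingress side.
    print(egress_rate_gbps(2.5, 40, 4))
    # The same label on 1500-byte packets adds well under 1%.
    print(egress_rate_gbps(2.5, 1500, 4))
```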
Partitioning sometimes falls along logical lines, such as separating the ingress, or datapath-in, and the egress, or datapath-out processing. Many NPU vendors suggest using one of their NPUs for ingress processing and one for egress processing. If you partition your system in this way, using two NPUs seems to make sense.
The problem is that, although these two subsystems share some processing, ingress processing is much more complex and thus requires a more powerful NPU. As a result, you can easily overprovision the egress side, wasting money, space, and power. If you use separate chips, you might spec a lower performance NPU for the egress port or use an ASIC. Unfortunately, this cost savings comes at the expense of requiring two NPU-development platforms or an ASIC to maintain. Another problem with separate ingress/egress processing is sharing resources. You must now manage access to memory and search engines.
Some NPUs offer full-duplex processing but have internally separated egress and ingress, meaning that you still pay for egress performance that you won't use. Ideally, you should be able to use resources for either ingress or egress processing, thus reducing the overall resources you have to allocate for the system.
If you anticipate that your design will support explosive editing, you need a speedup mechanism that will allow the internal interfaces to operate faster than 2.5 Gbps. It's also important to consider how much of the raw interface speed you lose to the logical layer and to contention on multiple-point interfaces. For point-to-point interfaces, you need to consider the latency and overhead of a fabric, as well as the use of bandwidth when a device acts as a bridge for another device.
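As a back-of-the-envelope aid, the sketch below turns a payload rate into the raw interface rate needed to carry it. It assumes that logical-layer overhead and contention can each be summarized as a single fraction, which is a simplification; the function and parameter names are illustrative.

```python
def required_raw_gbps(payload_gbps, link_efficiency, share=1.0):
    """Raw interface rate needed to carry payload_gbps.

    link_efficiency: fraction of raw bits left after logical-layer overhead, in (0, 1].
    share: fraction of the interface this device can claim under contention, in (0, 1].
    """
    if not (0 < link_efficiency <= 1 and 0 < share <= 1):
        raise ValueError("link_efficiency and share must be in (0, 1]")
    return payload_gbps / (link_efficiency * share)


if __name__ == "__main__":
    # Hypothetical case: 2.75 Gbps of edited traffic, 80% efficient link,
    # interface shared equally with one other device.
    print(required_raw_gbps(2.75, 0.8, 0.5))
```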
One method for increasing interface efficiency is the use of macros. Several coprocessors and memory controllers now support commands that contain either the information necessary for several operations or a pointer to a preset queue of commands. Another method is for the NPU to internally manage getting results back from a coprocessor. This approach conserves bandwidth because the coprocessor need not send a ready flag and the NPU need not ask for the result. It also saves programming cycles on the NPU because the code need not explicitly retrieve the result.
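The macro idea can be illustrated with a toy model that counts how many commands cross the interface. The class below is purely hypothetical and models no real coprocessor; it only shows that a pointer to a preset queue replaces several per-operation commands with one.

```python
class ToyCoprocessor:
    """Toy coprocessor that counts commands crossing the interface."""

    def __init__(self):
        self.macros = {}            # macro id -> preset queue of operations
        self.commands_received = 0  # per-packet commands seen on the interface

    def define_macro(self, macro_id, operations):
        # Setup-time configuration; not counted as per-packet traffic.
        self.macros[macro_id] = list(operations)

    def execute(self, operation, value):
        # One interface command per operation.
        self.commands_received += 1
        return operation(value)

    def execute_macro(self, macro_id, value):
        # One interface command triggers the whole preset queue.
        self.commands_received += 1
        for operation in self.macros[macro_id]:
            value = operation(value)
        return value


if __name__ == "__main__":
    ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    cp = ToyCoprocessor()
    cp.define_macro("edit", ops)
    print(cp.execute_macro("edit", 5), cp.commands_received)
```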
Proprietary interfaces allow you to maximize specialized performance from devices. Using them, however, can limit the domain in which your device can operate. For example, your device may be able to communicate only with other devices having the same proprietary interface. You can avoid this constraint, however, if your device also supports a bridging function. Many NPUs and coprocessors have embraced the new NPF (Network Processing Forum) interfaces for this reason (see sidebar "The NPF-interface specs").
Remember memory
NPUs often have their own memory controllers, which resolve contention among several devices to preserve bandwidth. These built-in controllers can save you board space by eliminating the need for an external controller, but they also limit the memory types and sizes your device can support. The more specialized the memory, the better the performance you can achieve in some applications. However, this performance increase usually entails higher cost, fewer size options, and fewer overall capabilities, making the memory less flexible, depending upon the direction of growth for your application. You may also encounter difficulties if you want to use multiple NPUs.
However, bandwidth to the memory, rather than the memory itself, is often the problem (see sidebar "System redundancy"). One challenge with interfaces is that, with so much data entering and exiting an NPU, you might want to increase interface speeds by widening the buses. But doing so takes up a lot of pins. One problem with internal NPU-memory controllers is that they may not take advantage of intelligent management techniques that can actually increase bandwidth efficiency. For example, a controller could bundle two nonconsecutive requests in the queue so that two 16-bit values are read in a single 32-bit access.
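The bundling example can be sketched as a small queue-rewriting pass. This is a minimal illustration, assuming that every pending request is a 2-byte-aligned 16-bit read and that two reads can merge only when they fall in the same aligned 32-bit word; the function name is hypothetical and real controllers will apply their own rules.

```python
def coalesce_reads(queue):
    """Merge pairs of 16-bit reads that fall in the same aligned 32-bit word.

    queue: byte addresses of pending 16-bit reads, each 2-byte aligned.
    Returns a list of (address, width_bits) accesses in issue order.
    """
    accesses = []
    merged = set()  # queue positions already folded into an earlier access
    for i, addr in enumerate(queue):
        if i in merged:
            continue
        word = addr & ~0x3  # base address of the enclosing 32-bit word
        partner = None
        for j in range(i + 1, len(queue)):
            same_word = (queue[j] & ~0x3) == word
            if j not in merged and same_word and queue[j] != addr:
                partner = j
                break
        if partner is not None:
            merged.add(partner)
            accesses.append((word, 32))  # one wide read serves both requests
        else:
            accesses.append((addr, 16))
    return accesses


if __name__ == "__main__":
    # Four queued 16-bit reads collapse into three accesses.
    print(coalesce_reads([0x100, 0x200, 0x102, 0x204]))
```

Note that the merged read is issued at the position of the earlier request, so the later request completes sooner than its queue position implies. This reordering is one reason such schemes affect determinism.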
Some memory controllers support quality-of-service queues to improve latency for critical flows. A controller can also place memory in channels to compensate for the random-access nature of packets; because packets leave a line card in priority order, their order into the line card differs greatly from their order out. Instead of using a flat memory, the controller spreads data across several banks to avoid the penalties associated with consecutive bank or row reads. With such a scheme, using more banks potentially means less contention. Additionally, you can reduce the access latency by using several memories in parallel; the bottleneck becomes the interface to the controller. Your memory vendor can probably suggest spreading schemes to use, although engineers can differentiate their products by shaping them to match the nature of the traffic patterns their designs typically encounter.
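A simple way to reason about spreading schemes is to map addresses to banks and count how often consecutive accesses land in the same bank. The sketch below assumes low-order interleaving with a hypothetical 64-byte line size; it is one possible scheme and is not what any particular controller does.

```python
def bank_of(address, num_banks, line_bytes=64):
    """Low-order interleaving: consecutive lines land in consecutive banks."""
    return (address // line_bytes) % num_banks


def back_to_back_conflicts(addresses, num_banks, line_bytes=64):
    """Count consecutive accesses that hit the same bank."""
    banks = [bank_of(a, num_banks, line_bytes) for a in addresses]
    return sum(1 for prev, cur in zip(banks, banks[1:]) if prev == cur)


if __name__ == "__main__":
    trace = [0, 256, 512]  # hypothetical access pattern with a 256-byte stride
    print(back_to_back_conflicts(trace, num_banks=4))  # every access hits bank 0
    print(back_to_back_conflicts(trace, num_banks=8))  # accesses alternate banks
```

Feeding a captured trace of your own traffic through a function like this is one way to compare candidate spreading schemes before committing to one.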
Intelligent memory control gives you greater memory efficiency but potentially destroys the deterministic nature of serial memory access—that is, accesses processed in the order they are received. Now, the worst case for a memory access depends on the accesses before and after the access in question (see sidebar "Worst-case analysis"). If you use a deterministic NPU, it must include a time-out mechanism that guarantees that every access will complete within a set time. However, several low-priority requests could contend with each other, reducing efficiency and exacerbating the problem, possibly missing a deterministic deadline and causing a processor to stall or worse. The best technique for preserving memory bandwidth is to eliminate the need to move and store all data, if possible.
Power down
Another critical resource is power. Most strategies for increasing performance increase power consumption. These strategies include increasing interface and clock speeds, adding processing units, overestimating worst-case scenarios, and so on. Memory and the interfaces that feed data to the NPU, retrieve data from the chip, and communicate with other NPUs and coprocessors all consume power. Adding a second NPU adds the power consumption of the second NPU to the mix. Whenever you pass data between devices, you burn more power than if you kept processing on one device. Hence, it makes more sense to completely process a packet on a single device than to pass it among several devices. Having more devices to place on a board also increases trace lengths and power consumption. For example, a fabric with 15-in. traces that consumes 14.1W may consume 16.1W with 35-in. traces. This scenario can be troublesome if you haven't prepared for it.
Given that you allocate processing resources on a worst-case basis, a typical design will not fully use the processor—that is, the engines will not be working all the time. If the nonoperating state of the processor doesn't reduce power consumption, you burn more power than you need.
One plus one does not equal two
One method of preparing for worst-case processing to guarantee wire speed or as insurance for future upgradability is to use another NPU. Unfortunately, this approach entails more than slapping in an extra chip to double performance.
Adding the second NPU to ensure that you can meet worst-case processing means that one NPU was not enough. Adding a second NPU also creates a new level of complexity in managing the two devices. For example, if you put the NPUs parallel to each other, you may need another device, which also consumes power, to manage which NPU gets which packet. If you place the NPUs inline, you have to divide and then restitch the processing, complicating coding and requiring the creation of a handoff API. Also, multiple processors complicate coordinating interdependencies among tasks, threads, and cores. Two NPUs accessing the same external resources, such as search engines and memory, require additional devices to manage the contention that will arise on these interfaces.
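For the parallel arrangement, one common way to decide which NPU gets which packet is to hash the packet's flow identifier, so that all packets of a flow reach the same NPU and stay in order. The article does not prescribe this technique; the sketch below is an assumption for illustration, and the function name and key format are hypothetical.

```python
import zlib


def pick_npu(src_ip, dst_ip, src_port, dst_port, protocol, num_npus=2):
    """Choose an NPU index from a flow's 5-tuple so a flow always maps to one NPU."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % num_npus


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
    # Repeated packets of the same flow always select the same NPU.
    print(pick_npu(*flow), pick_npu(*flow))
```

Hashing keeps per-flow ordering but does not balance load when a few flows dominate the traffic, so the dispatcher remains a design problem of its own.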
That said, adding an NPU may be the right answer for your application. Overprovisioning allows you to be less careful and accurate in defining your worst case. However, it brings enough new complexity with it that you might first consider other alternatives, such as reducing inefficiencies in your design or redefining the worst case. For example, you could require allocation of a line card solely for exceptions and hand packets to this resource whenever you encounter worst-case processing. This scheme has its own issues, however; solving one problem creates new ones, such as increased latency, longer queues for packet ordering, managing the flow of packets to and from the line card, and having a second such line card for redundancy. Other options include using a less expensive RISC processor to offload processing or leaving a second, unpopulated NPU slot for applications that don't need it or as future insurance.
The company you keep
A surprising number of companies have invested in network-processing technology. Both start-ups and established companies that projected revenues this year and last face the challenge of finding continued financial support as they go deeper into debt. The resulting shakeup has arrived sooner than some companies expected.
One key factor in selecting a vendor is how well the vendor appears to be able to weather the market downturn; you are betting on not only the company's technology, but also its ability to stay in business. If it refuses to show financial statements, consider looking elsewhere. Also, consider whether its funds are secure or if impatient venture capitalists can suddenly pull the plug. In all fairness, expect to show financial results for your own company. Each NPU vendor can hedge only so many bets, and you have to prove you're a good one.
One argument large vendors make is that small companies have trouble maintaining the pace of R&D spending needed to stay competitive in this market and that, because the time to revenue is so long, a small company will go out of business before it can collect on its first purchase order.
However, selecting a large company is also risky. Small start-ups are all-in, and venture capitalists may be willing to spend $5 million more to protect the $50 million that they already invested. Larger companies may change management, and the new managers, in an effort to increase profits and reduce costs, may terminate slow product lines, such as NPUs. Vitesse, for example, in a meeting describing how its switch fabric complemented another vendor's NPU, claimed that teaming with several large players was safer than working with several smaller companies. The irony is that Vitesse has killed not one, but two, NPU-product lines.
When assessing large companies, check their overall commitment to network processing (see sidebar "Can partners keep the ship afloat?"). Are they buying or selling off NPU intellectual property? Did an undue percentage of the last layoff come from NPU divisions? Given the impending shakeup, it becomes increasingly important to have a plan should your vendor go under (see sidebar "What to do when your vendor bites the dust").
The clever start-ups are like small fish that can avoid the crushing net of the economy closing in on them because they have defined relatively niche applications for which they've optimized the efficiency of their devices. The general-purpose NPU vendors are like bigger fish hoping they adapt fast enough to either slip through the net themselves or develop enough momentum to tear through. Many of the general-purpose-NPU vendors have responded by adding dedicated internal engines to self-offload the kinds of functions configurable engines are good at (see sidebar "Other considerations").
General-purpose NPUs do have the advantage of being better able to leverage software across multiple applications than can niched NPUs. However, is the advantage of being able to process many protocols at a variety of depths a strength in its breadth of application or a weakness because a niche device will always be available that can better handle each application?
Just because a lot of companies spend a lot of money developing interesting technology that does everything doesn't mean the technology is going to stick. Remember how many times media processors have come and gone? Chips that do cutting-edge audio, video, and graphics haven't survived, either. Devices that handled one or two of these functions well replaced them. Perhaps the NPU industry can learn from history before it becomes history itself.
See the PDF version of this article for a list of network-processor vendors.
Author Information
You can reach Technical Editor Nicholas Cravotta at 1-530-268-7715, fax 1-617-558-4470, e-mail nick@edn.com.
Acknowledgment
Thanks to Lance Levanthal, PhD, conference consultant for the Network Processors Conference, for his contributions to this article.














