Feature

Narrow the scope

NPUs focus to improve performance.

By Nicholas Cravotta, Technical Editor -- EDN, 6/26/2003

AT A GLANCE
  • Some NPUs can do everything. Others can do only particular tasks, but they do them very well.
  • Interface bottlenecks have made interconnect bandwidth a critical resource to manage.
  • You may find it more efficient to redefine your target application than to add a second NPU to make sure you reach wire speed.
  • The industry shakeup has started. Do you have a backup plan?
Sidebars:
What to do when your vendor bites the dust
System redundancy
Worst-case analysis
The NPF-interface specs
Can partners keep the ship afloat?
Security issues
Other considerations
Exception overloading

Some NPU (network-processing-unit) vendors blame the economy for their lagging business. Others blame engineers' resistance to moving away from difficult-to-design and expensive-to-spin ASICs. No matter what the reason, the number of R&D dollars spent on evolving NPUs and their coprocessors dwarfs the paltry revenues vendors have so far seen.

The drive to eliminate the use of ASICs is substantial. If engineers thought a reasonable alternative existed, they would jump on it. The problem is that moving from ASICs to NPUs shifts the design headache from hardware to software. ASIC-design tools are more mature and easier to work with than the scanty network-processing-design tools and off-the-shelf software available. Have you heard the one about how many engineers it takes to make a wire-rate OC-48-line card using an NPU? The punch line's not funny.

The general-purpose-NPU oxymoron

Many NPU start-ups have criticized "general-purpose" NPUs as trying to be everything to everybody, resulting in another layer of legacy chains that, with unmanageable and inappropriate design tools, forces engineers to overcome design-environment constraints rather than focus on the problems at hand. Witness the rise of architectures in which an engineer might have to program and manage nearly 100 simultaneous threads.

Part of the problem is that early NPUs may have too soon brought full programmability to the table. Network processing is a hard engineering problem, and it's only going to get harder as the industry discovers the truly difficult challenges that today's surface challenges mask. Doubt exists about whether NPUs are viable alternatives to ASICs; after you look at the complexity of designing with and programming some NPUs, spinning an ASIC looks easy.

In time, full programmability will have its place. Perhaps it arrived early because various microprocessor vendors thought to solve network-processing problems using more of the existing technology. But more of the same is not the answer. Throwing more processors or cores at the problem, with the wrong architecture, the wrong language, and the wrong programming model, just might not be the direction to take.

Many recent start-ups have focused on more configurable architectures, designed with particular applications in mind. You can program such architectures to some degree, but they remain shaped by the applications their developers targeted. You have to strike a balance among efficiency in processing; efficiency in development; and cost, size, and power consumption. By defining a particular network-processing problem to solve, these vendors have allocated resources more efficiently.

Some vendors have taken these ideas to the extreme. For example, Xelerated's X10 family of NPUs has 200 cores, each executing a four-operation VLIW (very-long-instruction-word) instruction on a packet before handing off the packet to the next core. Comparing these types of processors to processors using traditional methods is difficult.

Partitioning

You still face a number of key considerations when designing with NPUs, no matter how clever the architecture you choose. For example, many vendors tell you that their processors can easily handle an OC-48 load at a certain level of servicing. However, they don't tell you how they expect the control-plane processor to take over some functions when the data plane becomes overloaded.

As your designs push to higher data rates, the amount of control information also increases. Some vendors of control-plane software functions have begun to call upon the data plane to take over functions such as aging and learning that can map well to some NPUs. It's a question of where you partition the control and data planes. You can easily partition your system such that the NPU can handle what you've determined is the data plane yet leave a control plane that no CPU can handle. For example, exceptions are a major source of data-plane offloading (see sidebar "Security issues"). The NPU handles the typical, common, well-defined protocol cases and hands over the hard, obscure, other-protocol packets to the control plane. Too narrowly define what the NPU can do, and exceptions will bring your system to its knees (see sidebar "Exception overloading"). Profiling exception traffic is tricky. Also consider that, even if exceptions represent only 10% of your traffic today, their share increases as standards shift and the functions in your NPU don't.
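A rough sizing sketch shows how quickly exceptions can swamp a control plane. The 2.5-Gbps line rate, 40-byte packets, and 10% exception share echo figures used elsewhere in this article; the control-plane capacity of 100,000 packets/sec is purely hypothetical and stands in for a number you would have to profile yourself.

```python
def packets_per_second(line_rate_bps: float, packet_bytes: int) -> float:
    """Packet arrival rate when every packet has the same length (framing overhead ignored)."""
    return line_rate_bps / (packet_bytes * 8)


def exception_load(line_rate_bps: float, packet_bytes: int,
                   exception_fraction: float) -> float:
    """Packets per second that the NPU hands to the control plane."""
    return packets_per_second(line_rate_bps, packet_bytes) * exception_fraction


# Assumed inputs: ~2.5 Gbps (OC-48 class), 40-byte packets, 10% exceptions.
load = exception_load(2.5e9, 40, 0.10)

# Hypothetical control-plane capacity; replace with a profiled number.
CONTROL_PLANE_PPS = 100_000

print(f"exception load: {load:,.0f} packets/s")
print(f"control plane oversubscribed: {load > CONTROL_PLANE_PPS}")
```

Under these assumptions, the control plane would see roughly 781,000 exception packets/sec, which is why the partition point deserves as much attention as the NPU's headline data rate.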

Interface bottlenecks

As data rates increase, device-interconnection interfaces have become a critical resource requiring careful management and protection. One challenge in budgeting interface bandwidth is "explosive editing," which occurs when you add a header, such as an MPLS (Multiprotocol Label Switching) label, to a packet. Your ingress might be 2.5 Gbps, but the egress now exceeds 2.5 Gbps.
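The arithmetic behind explosive editing is simple enough to sketch. The example below assumes that every packet gains one 4-byte MPLS label stack entry and that all packets are the same size; the function names and the packet sizes in the loop are illustrative choices, not part of any vendor's tool.

```python
def egress_rate_gbps(ingress_gbps: float, packet_bytes: int, added_bytes: int) -> float:
    """Egress rate needed to keep up when every packet grows by added_bytes."""
    return ingress_gbps * (packet_bytes + added_bytes) / packet_bytes


def required_speedup(packet_bytes: int, added_bytes: int) -> float:
    """Factor by which internal interfaces must outrun the ingress rate."""
    return (packet_bytes + added_bytes) / packet_bytes


MPLS_LABEL_BYTES = 4  # one MPLS label stack entry is 32 bits

# Illustrative packet sizes; small packets suffer the largest relative growth.
for size in (40, 64, 512, 1500):
    rate = egress_rate_gbps(2.5, size, MPLS_LABEL_BYTES)
    speedup = required_speedup(size, MPLS_LABEL_BYTES)
    print(f"{size:5d}-byte packets: {rate:.3f} Gbps egress, speedup {speedup:.3f}x")
```

For a stream of 40-byte packets, a 2.5-Gbps ingress becomes a 2.75-Gbps egress, a 10% increase. The penalty shrinks as packets grow, so the small-packet case is the one to budget for.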

Partitioning sometimes falls along logical lines, such as separating the ingress, or datapath-in, and the egress, or datapath-out processing. Many NPU vendors suggest using one of their NPUs for ingress processing and one for egress processing. If you partition your system in this way, using two NPUs seems to make sense.

The problem is that, although these two subsystems share some processing, ingress processing is much more complex and thus requires a more powerful NPU. As a result, you can easily overprovision the egress side, wasting money, space, and power. If you use separate chips, you might spec a lower performance NPU for the egress port or use an ASIC. Unfortunately, this cost savings comes at the expense of requiring two NPU-development platforms or an ASIC to maintain. Another problem with separate ingress/egress processing is sharing resources. You must now manage access to memory and search engines.

Some NPUs offer full-duplex processing but have internally separated egress and ingress, meaning that you still pay for egress performance that you won't use. Ideally, you should be able to use resources for either ingress or egress processing, thus reducing the overall resources you have to allocate for the system.

If you anticipate that your design will support explosive editing, you need a speedup mechanism that will allow the internal interfaces to operate faster than 2.5 Gbps. It's also important to consider how much of the raw interface speed you lose to the logical layer and to contention on multipoint interfaces. For point-to-point interfaces, you need to consider the latency and overhead of a fabric, as well as the use of bandwidth when a device acts as a bridge for another device.

One method for increasing interface efficiency is the use of macros. Several coprocessors and memory controllers now support commands that contain either the information necessary for several operations or a pointer to a preset queue of commands. Another method is for the NPU to internally manage getting results back from a coprocessor. This approach conserves bandwidth because the coprocessor need not send a ready flag and the NPU need not ask for the result. It also saves programming cycles on the NPU because the code need not explicitly retrieve the result.

Proprietary interfaces allow you to maximize specialized performance from devices. Using them, however, can limit the domain in which your device can operate. For example, you may stipulate that it can communicate only with other devices having the same proprietary interface. You can avoid this constraint, however, if your device also supports a bridging function. Many NPUs and coprocessors have embraced the new NPF (Network Processing Forum) interfaces for this reason (see sidebar "The NPF-interface specs").

Remember memory

NPUs often have their own memory controllers, which resolve contention among several devices to preserve bandwidth. These built-in controllers can save you board space by eliminating the need for an external controller, but they also limit the types and sizes of memory your device can support. The more specialized the memory, the better the performance you can achieve in some applications. However, this performance increase usually entails higher cost, fewer size options, and fewer overall capabilities, making the memory less flexible, depending upon the direction of growth for your application. You may also encounter difficulties if you want to use multiple NPUs.

However, bandwidth to the memory, rather than the memory itself, is often the problem (see sidebar "System redundancy"). With so much data entering and exiting an NPU, you might want to increase interface speeds by widening the buses, but doing so takes up a lot of pins. One problem with internal NPU-memory controllers is that they may not take advantage of intelligent management techniques that can actually increase bandwidth efficiency. For example, a controller could bundle two nonconsecutive requests in the queue so that two 16-bit values are read as a single 32-bit access.
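The bundling idea can be sketched in a few lines. This is a simplified software model, not a description of any real controller: it assumes the queue holds byte addresses of 2-byte-aligned, 16-bit reads and that any two reads falling in the same aligned 32-bit word can share one access. The function name and the sample addresses are hypothetical.

```python
def coalesce_reads(queue: list[int]) -> list[tuple[int, list[int]]]:
    """Group queued 16-bit reads that fall in the same aligned 32-bit word.

    queue holds byte addresses of 16-bit reads (assumed 2-byte aligned).
    Returns one (word_address, [original addresses]) entry per 32-bit access,
    ordered by the first request that touched each word.
    """
    words: dict[int, list[int]] = {}
    for addr in queue:
        word = addr & ~0x3  # align down to a 4-byte boundary
        words.setdefault(word, []).append(addr)
    return list(words.items())


# Hypothetical queue: reads to 0x100/0x102 and 0x208/0x20A are not adjacent
# in the queue but share a 32-bit word.
pending = [0x100, 0x208, 0x102, 0x20A, 0x300]
accesses = coalesce_reads(pending)

print(f"{len(pending)} requests -> {len(accesses)} memory accesses")
for word, addrs in accesses:
    print(hex(word), [hex(a) for a in addrs])
```

Here five queued requests collapse into three accesses. Note that the second request for each word is served earlier than its queue position implies, which is exactly the kind of reordering that complicates worst-case timing.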

Some memory controllers support quality-of-service queues to improve latency for critical flows. A controller can also place memory in channels to compensate for the random-access nature of packets; because packets leave a line card in priority order, their order into the line card differs greatly from their order out. Instead of using a flat memory, the controller spreads data across several banks to avoid the penalties associated with consecutive bank or row reads. With such a scheme, using more banks potentially means less contention. Additionally, you can reduce the access latency by using several memories in parallel; the bottleneck becomes the interface to the controller. Your memory vendor can probably suggest spreading schemes to use, although engineers can differentiate their products by shaping them to match the nature of the traffic patterns their designs typically encounter.

Intelligent memory control gives you greater memory efficiency but potentially destroys the deterministic nature of serial memory access—that is, accesses processed in the order they are received. Now, the worst case for a memory access depends on the accesses before and after the access in question (see sidebar "Worst-case analysis"). If you use a deterministic NPU, it must include a time-out mechanism that guarantees that every access will complete within a set time. However, several low-priority requests could contend with each other, reducing efficiency and exacerbating the problem, possibly missing a deterministic deadline and causing a processor to stall or worse. The best technique for preserving memory bandwidth is to eliminate the need to move and store all data, if possible.

Power down

Another critical resource is power. Most strategies for increasing performance increase power consumption. These strategies include increasing interface and clock speeds, adding processing units, overestimating worst-case scenarios, and so on. Memory and the interfaces that feed data to the NPU, retrieve data from the chip, and communicate with other NPUs and coprocessors all consume power. Adding a second NPU adds the power consumption of that NPU to the mix. Whenever you pass data between devices, you burn more power than if you kept processing on one device. Hence, it makes more sense to completely process a packet on a single device than to pass it among several devices. Having more devices to place on a board also increases trace lengths and power consumption. For example, a fabric with 15-in. traces that consumes 14.1W may consume 16.1W with 35-in. traces. This scenario can be troublesome if you haven't prepared for it.

Given that you allocate processing resources on a worst-case basis, a typical design will not fully use the processor—that is, the engines will not be working all the time. If the nonoperating state of the processor doesn't reduce power consumption, you burn more power than you need.

One plus one does not equal two

One method of preparing for worst-case processing to guarantee wire speed or as insurance for future upgradability is to use another NPU. Unfortunately, this approach entails more than slapping in an extra chip to double performance.

Adding the second NPU to ensure that you can meet worst-case processing means that one NPU was not enough. Adding a second NPU also creates a new level of complexity in managing the two devices. For example, if you put the NPUs parallel to each other, you may need another device, which also consumes power, to manage which NPU gets which packet. If you place the NPUs inline, you have to divide and then restitch the processing, complicating coding and requiring the creation of a handoff API. Also, multiple processors complicate coordinating interdependencies among tasks, threads, and cores. Two NPUs accessing the same external resources, such as search engines and memory, require additional devices to manage the contention that will arise on these interfaces.

That said, adding an NPU may be the right answer for your application. Overprovisioning allows you to be less careful and accurate in defining your worst case. However, it brings enough new complexity with it that you might first consider other alternatives, such as reducing inefficiencies in your design or redefining the worst case. For example, you could require allocation of a line card solely for exceptions and hand packets to this resource whenever you encounter worst-case processing. This scheme has its own issues, however; solving one problem creates new ones, such as increased latency, longer queues for packet ordering, managing the flow of packets to and from the line card, and needing a second such line card for redundancy. Other options include using a less expensive RISC processor to offload processing or leaving a second NPU slot that stays unpopulated in applications that don't need it and serves as future insurance.

The company you keep

A surprising number of companies have invested in network-processing technology. Both start-ups and established companies that projected revenues this year and last face the challenge of finding continued financial support as they go deeper into debt. The resulting shakeup has arrived sooner than some companies expected.

One key factor in selecting a vendor is how well the vendor appears to be able to weather the market downturn; you are betting on not only the company's technology, but also its ability to stay in business. If it refuses to show financial statements, consider looking elsewhere. Also, consider whether its funds are secure or if impatient venture capitalists can suddenly pull the plug. In all fairness, expect to show financial results for your own company. Each NPU vendor can hedge only so many bets, and you have to prove you're a good one.

One argument large vendors make is that small companies have trouble maintaining the pace of R&D spending to stay competitive in this market and that, because the time to revenue is so long, a small company will go out of business before it can collect on its first purchase order.

However, selecting a large company is also risky. Small start-ups are all-in, and venture capitalists may be willing to spend $5 million more to protect the $50 million that they already invested. Larger companies may bring in new managers who, in an effort to increase profits and reduce costs, terminate slow product lines, such as NPUs. Vitesse, for example, in a meeting describing how its switch fabric complemented another vendor's NPU, claimed that teaming with several large players was safer than working with several smaller companies. The irony is that Vitesse has killed not one, but two, NPU-product lines.

When assessing large companies, check their overall commitment to network processing (see sidebar "Can partners keep the ship afloat?"). Are they buying or selling off NPU intellectual property? Did an undue percentage of the last layoff come from NPU divisions? Given the impending shakeup, it becomes increasingly important to have a plan should your vendor go under (see sidebar "What to do when your vendor bites the dust").

The clever start-ups are like small fish that can avoid the crushing net of the economy closing in on them because they have defined relatively niche applications for which they've optimized the efficiency of their devices. The general-purpose NPU vendors are like bigger fish hoping they adapt fast enough to either slip through the net themselves or develop enough momentum to tear through. Many of the general-purpose-NPU vendors have responded by adding dedicated internal engines to self-offload the kinds of functions configurable engines are good at (see sidebar "Other considerations").

General-purpose NPUs do have the advantage of being better able to leverage software across multiple applications than can niched NPUs. However, is the advantage of being able to process many protocols at a variety of depths a strength in its breadth of application or a weakness because a niche device will always be available that can better handle each application?

Just because a lot of companies spend a lot of money developing interesting technology that does everything doesn't mean the technology is going to stick. Remember how many times media processors have come and gone? Chips that do cutting-edge audio, video, and graphics haven't survived, either. Devices that handled one or two of these functions well replaced them. Perhaps the NPU industry can learn from history before it becomes history itself.

See the PDF version of this article for a list of network-processor vendors.


Author Information
You can reach Technical Editor Nicholas Cravotta at 1-530-268-7715, fax 1-617-558-4470, e-mail nick@edn.com.


Acknowledgment
Thanks to Lance Levanthal, PhD, conference consultant for the Network Processors Conference, for his contributions to this article.

 

What to do when your vendor bites the dust

What happens when your NPU (network-processing-unit) vendor goes belly up or sells its NPU division to another company? In many cases, you can get a vendor to agree to release IP (intellectual property) if it goes out of business. Alternatively, if you have enough cash, you can buy the failing company. However, supporting someone else's hardware or software will likely increase your time to market. Some companies just can't get the chips to work; in these cases, you have no IP to scrape off the floor.

Consider the plight of Consystant, whose investors decided to halt operations just as purchase orders were starting to arrive. Now, the company's IP is up for sale. What does this mean for customers? Also, consider the sale of Netplane by Conexant (Mindspeed) to Motorola. If you were committed to using Netplane tools, all of the driving forces have changed. Will Motorola continue to support you, or will the vendor gut Netplane to promote its own C-Port NPU family?

Even if your vendors are doing well, you might discover that you're using a second-rate part. You might be tempted to continue to use the device because you've already invested too much, and you can't afford to change vendors and directions. Try looking, however, at the issues from the perspective of the cost of staying with the device. Developing workarounds may cost you more in the long term than changing devices.

Being prepared means anticipating vendor failure and working the cost of changing parts into your time and cost budgets. When developing NPU code using a coprocessor, for example, abstract your code beyond the coprocessor API or develop according to strict guidelines, which will make it less painful to switch coprocessors. You might even go so far as to employ two development teams working with different devices.
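One way to abstract beyond a coprocessor API is to have application code depend on a small interface of your own and wrap each vendor's part in an adapter. The sketch below is only an illustration of that pattern: both "vendor" engines are hypothetical stand-ins backed by in-memory tables, and production NPU code would use the vendor's own language and driver calls rather than Python.

```python
from typing import Optional, Protocol


class LookupEngine(Protocol):
    """Interface the forwarding code depends on, not any vendor API."""

    def lookup(self, key: bytes) -> Optional[int]:
        ...


class VendorAEngine:
    """Adapter for a hypothetical coprocessor; a dict stands in for its driver."""

    def __init__(self) -> None:
        self._table: dict[bytes, int] = {}

    def add_route(self, key: bytes, next_hop: int) -> None:
        self._table[key] = next_hop

    def lookup(self, key: bytes) -> Optional[int]:
        return self._table.get(key)


class VendorBEngine:
    """Adapter for a second hypothetical part with a different native call style."""

    def __init__(self) -> None:
        self._entries: list[tuple[bytes, int]] = []

    def insert(self, entry: tuple[bytes, int]) -> None:
        self._entries.append(entry)

    def lookup(self, key: bytes) -> Optional[int]:
        for stored_key, next_hop in self._entries:
            if stored_key == key:
                return next_hop
        return None


def forward(engine: LookupEngine, dest: bytes) -> str:
    """Application code: sees only the LookupEngine interface."""
    next_hop = engine.lookup(dest)
    return "exception" if next_hop is None else f"port {next_hop}"


dest = b"\x0a\x00\x00\x01"
engine_a = VendorAEngine()
engine_a.add_route(dest, 3)
engine_b = VendorBEngine()
engine_b.insert((dest, 3))

print(forward(engine_a, dest), forward(engine_b, dest))
```

Because `forward` never calls a vendor-specific method, switching coprocessors means writing one new adapter rather than touching the forwarding logic. The cost is an extra layer of indirection, which you would need to weigh against your cycle budget.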

 

System redundancy

When evaluating how much headroom you have for extra features and future enhancements, don't forget about redundancy and fault tolerance. Some flows and protocols employ extensive state information—that is, persistent data between packets. To create a system that can fail over gracefully, you need to provide a means for mirroring this state information.

In some cases, running a redundant line card and having it process packets in parallel creates much of the state information that a smooth failover requires, but only the active card actually passes packets on. The problem with this scheme is that it requires 1-to-1 redundancy, because each redundant blade can handle the load of only one line card.

Checkpointing, a scheme in which active line cards periodically send a snapshot of state information to a redundant line card, allows for n-to-1 redundancy. Checkpointing, however, requires cycles you may have budgeted for packet servicing. With deterministic architectures, unless you send state information for every packet—a costly proposition in processing cycles—you may have trouble squeezing in this periodic function.

Note that, sometimes, only the active card can create state information, such as the key it creates while initiating a secure session. Thus, you still need a mechanism for passing this information to the redundant blade even with 1-to-1 redundancy. How much of your cycle budget goes to redundancy depends upon how graceful you need failovers to be. You need to consider the impact on interface bandwidth, because updating a checkpoint may eat into this budget. Thus, you need to consider redundancy as an integral part of your worst-case evaluation.

Network monitoring should also be an early consideration. GateD routing software from NextHop, for example, supports XML for monitoring line cards either automatically or via a browser. However, you don't want a network query to shut down traffic processing, so these functions need to execute during free cycles. If you're using an operating system that lacks support for task prioritization, you need to design-in a means for identifying and using free cycles.

 

Worst-case analysis

It's important to define how much the worst case matters to your customer. Consider determining several worst cases, such as a CWC (ceiling worst case) and a WWC (worst worst case). The WWC is what you determine to be the worst operating conditions possible. The CWC takes into account the probability and importance of the WWC, representing a tolerance, or how close to the WWC you want to go. For example, for a WWC analysis, you would assume that every resource request meets with contention. The reality is that even if such a case did occur, it wouldn't last long. The advantage of aiming for a real-world CWC is that, if you try to reach a theoretical WWC, you may be unable to build your device.

For deterministic engines, each packet gets a well-defined window of processing resources. If you have defined your worst case accurately, you can tell whether a deterministic NPU (network-processing unit) will be unable to guarantee wire rate. However, to determine whether it will be able to run at wire rate, you still need to perform a significant engineering task. For example, you may have estimated that you'll need 50 cycles and three memory reads for a task. If your assessment is correct, the deterministic NPU will run at wire speed. If it is wrong, you'll need some errors in your favor on your other estimates to make wire speed.
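A back-of-the-envelope budget check makes the sensitivity of such estimates concrete. The 50 cycles and three memory reads come from the example above, and the 2.5-Gbps rate and 40-byte packets match figures used elsewhere in this article. The 500-MHz clock, the two parallel engines, and the 20 stall cycles per read are hypothetical values chosen only for illustration.

```python
def packet_time_s(line_rate_bps: float, packet_bytes: int) -> float:
    """Arrival interval of back-to-back packets of one size."""
    return packet_bytes * 8 / line_rate_bps


def cycle_budget(line_rate_bps: float, packet_bytes: int,
                 clock_hz: float, engines: int) -> float:
    """Cycles available per packet when packets are spread over parallel engines."""
    return packet_time_s(line_rate_bps, packet_bytes) * clock_hz * engines


def cycles_needed(compute_cycles: int, memory_reads: int, read_stall_cycles: int) -> int:
    """Estimated cost per packet, assuming every read stalls the engine."""
    return compute_cycles + memory_reads * read_stall_cycles


# From the article: 50 cycles and three memory reads, 40-byte packets, ~2.5 Gbps.
# Hypothetical: 500-MHz clock, 2 engines, 20 stall cycles per read.
budget = cycle_budget(2.5e9, 40, 500e6, engines=2)
need = cycles_needed(50, 3, read_stall_cycles=20)

print(f"packet interval: {packet_time_s(2.5e9, 40) * 1e9:.0f} ns")
print(f"budget {budget:.0f} cycles, need {need} cycles, margin {budget - need:.0f}")
```

With these assumed numbers, the task needs 110 of the 128 available cycles. A single extra memory read raises the need to 130 cycles and breaks the budget, which illustrates how little slack an optimistic estimate leaves.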

Deterministic processors achieve wire speed by provisioning enough resources to meet worst-case processing constraints. This approach means that deterministic processors overprovision resources when running typical traffic, meaning that you burn more power, take up more board space, and pay more for the NPU than you need to. Note that the vendor's analysis of its target application space, rather than your application, determines this provisioning.

Nondeterministic processors make better use of resources by allowing packets that require more processing to borrow unused resources from easier-to-process packets. For example, if a deterministic processor has allocated four look-ups per packet, the processor cannot service packets that require five look-ups. A nondeterministic processor, on the other hand, can steal a look-up from a packet that requires three or fewer look-ups. Note that you have to evaluate your worst case in terms of a sustained threshold: how many high-service packets you can take on before borrowing no longer works.
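The difference between the two models, and the sustained threshold, can be illustrated with a toy check. The four-look-up allocation comes from the example above; the sliding-window model of borrowing, the window of four packets, and the two traffic patterns are simplifying assumptions, not a description of how any particular NPU shares resources.

```python
def deterministic_ok(lookups_per_packet: list[int], per_packet_limit: int) -> bool:
    """Every packet must fit within its own fixed allocation."""
    return all(n <= per_packet_limit for n in lookups_per_packet)


def nondeterministic_ok(lookups_per_packet: list[int], per_packet_limit: int,
                        window: int) -> bool:
    """Packets may borrow within a sliding window of `window` packets.

    Simplified model: the total demand of any `window` consecutive packets
    must not exceed window * per_packet_limit.
    """
    pool = window * per_packet_limit
    last_start = max(1, len(lookups_per_packet) - window + 1)
    for start in range(last_start):
        if sum(lookups_per_packet[start:start + window]) > pool:
            return False
    return True


# Hypothetical traffic: look-ups required by each successive packet.
mixed = [3, 5, 3, 4, 3, 5, 3, 4]  # heavy packets offset by light ones
burst = [5, 5, 5, 5, 3, 3, 3, 3]  # sustained run of heavy packets

for name, traffic in (("mixed", mixed), ("burst", burst)):
    print(name,
          "deterministic:", deterministic_ok(traffic, 4),
          "nondeterministic:", nondeterministic_ok(traffic, 4, window=4))
```

In this model, the deterministic check fails on both patterns because each contains five-look-up packets. The nondeterministic check passes the mixed pattern but fails the burst, where four heavy packets in a row exhaust what can be borrowed.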

This flexibility lets you better manage the cost of an NPU but at the expense of working with a device that is more difficult to program and for which it is much more difficult to profile a worst case. Instead of having set resource limits based on a single packet, you have limits based on packets over time. Determining the effect of threads and cores in contention is extremely challenging. How much servicing you can do with packets is a function of how often you encounter typical packets versus your worst case.

Most NPU vendors focus on the 40-byte packet problem in determining the worst case. They state that, when every packet is 40 bytes—an unrealistic situation—the device will process at wire speed. These vendors often overlook the worst case of larger packets that require more servicing than a 40-byte acknowledgment packet. Some vendors base worst cases on a single packet, and others base them on the characteristics of packets over time.

You also need to consider the variable characteristics of the traffic itself. For example, how much of the traffic has become fragmented and needs to be reassembled? How will you work this information into your worst-case analysis? If your design supports high-level servicing, what is the threshold ratio of low- and high-level service packets you can support? Even more challenging, what will your product have to do tomorrow that you haven't even considered yet? Additionally, with so many devices offering special power-saving features, determining the worst-case power consumption and duration becomes a complex calculation as well.

Few tools help you determine the worst case before you've actually finished your design. Vendors suggest that you can just throw off packets as exceptions and use the control plane as a coprocessor, or you can add an NPU.

Do you really need to meet wire rate 100% of the time, or is it OK to lower the bar to 99.9%? Consider the cost of missing a worst case and the probability of encountering it versus the cost of designing to that worst case. You can design a system that operates 100% of the time at wire speed, but will anyone pay for it? In this case, the cost is making products people can use at a price they will pay.

 

The NPF-interface specs

The NPF (Network Processing Forum) has released the LA (look-aside)-1 interface between an NPU (network-processing unit) and a coprocessor and the streaming interface for inline or switch-fabric devices; higher speed versions of these interfaces are in the works. These specs define the physical layer, so, for two devices to be compatible, they still need to share a common logical layer. Thus, an NPU with an LA-1 does not automatically interface with an LA-1-compatible search engine, and an NPU that supports multiple search engines would need drivers. Experts at the NPF say creating such drivers is a minor task.

Arguments exist against the work the NPF has accomplished, however. Some industry participants believe, for example, that the LA-1 interface will be unable to handle 10 Gbps for some applications and processors. However, the critical argument questions the value of narrowing innovation to a common denominator. Much of the value in the various NPUs and coprocessors is in their differences, not their similarities. Proprietary features beyond the specs, such as adding header information or using different encoding schemes, allow these devices to achieve higher performance.

In this regard, several companies, although providing an NPF-standard interface, continue to provide a proprietary interface. If you choose to use the proprietary interface, evaluate how close it is to the NPF-standard interface. The vendor may at some point abandon the proprietary interface, and you will have to rewrite code or develop an API shim to continue to use the part. If the APIs differ too much, this shim will be thick and require more cycles than you might want to give up.

The question is whether the NPU market is finished with unbridled innovation. Standards unite markets, and the NPU industry could use some unity as the market recovers. Standards also help you more easily design-in a second-choice part when the vendor of your first-choice part goes out of business. They let you focus on the processing part of the problem rather than reinventing how you pass information among devices. However, unless a standard encompasses enough true innovation, the market will falter because it doesn't offer enough beyond traditional and entrenched methods.

Can partners keep the ship afloat?

One measure of vendor stability is how many companies are committed to an architecture. For its IXP family, Intel has generated an impressive community of third-party vendors. You might argue that Intel’s third-party strategy makes it a secure choice, because so many companies are riding on its success. Perhaps Intel can afford to weather these times, but if the third-party vendors don’t start seeing revenue, they may need to pull out or die. Additionally, Intel funds many of its partners. Such funding can disappear if Intel needs to divest funds without appearing to waver in its commitment to the NPU (network-processing-unit) market.

When you depend on many companies, you open yourself to certain risks. For example, vendors that go out of business leave you with a hole in your tool chain. You also face the challenge of integrating multiple and disparate vendor products and of getting any of these vendors to accept responsibility when integration is difficult.

When determining the viability of a partner company, consider that third-party software vendors are also more protective of their IP (intellectual property) because they aren’t using the software to promote chip sales. It may be easier to use off-the-shelf software than to hand-code it yourself, but off-the-shelf packages are difficult to profile and integrate. Hardware vendors also pose a challenge. Radisys, not Intel, supplies the IXP evaluation boards. Radisys makes its money from engineers who buy the company’s boards instead of designing their own or paying Radisys to make changes for them. The question then becomes: Who supports the evaluation board? Radisys can’t afford to unless you plan on buying more boards from it.

Thus, in some respects, a large third-party network can be a disadvantage. The tools, software, and hardware are disconnected from the NPU vendor, and the incentives to support your development are complex, based on whether you develop your own designs. However, vendors that provide most or all of their own tools and software provide less extensive spreads than can third-party networks.

 

Security issues

As data rates increase, security becomes an important component in network processing. Most NPUs (network-processing units) treat secure packets as exceptions because of the time it takes to process them. Even with an encryption coprocessor, the overhead to manage security associations on the NPU can be costly, and the NPU needs a second interface to dedicate to the encryption coprocessor if a search engine is also in use. Additionally, encryption latency is too long for deterministic architectures to handle as a look-aside process.

Simply diverting secure packets as exceptions doesn’t solve the problem, however. You need interface bandwidth on the order of four times the packet length to remove the packet from the datapath, pass it to the encryption coprocessor, receive it from the encryption coprocessor, and replace it on the datapath. You consume four times the packet length of memory bandwidth, as well, if you have to store the packet before forwarding it and after receiving it back. These figures do not account for any bus contention or overhead for commands to process the transfers over the interface.
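
To make the arithmetic concrete, the following is a rough sketch in Python. The function names, the 1500-byte packet, and the 2.5-Gbps line rate with 10% secure traffic are illustrative assumptions, not figures from any vendor. Like the estimate above, the sketch ignores bus contention and command overhead.

```python
def lookaside_bytes(packet_bytes, buffered=True):
    """Bytes moved per secure packet on a look-aside encryption path.

    Returns (interface_bytes, memory_bytes).
    """
    # Four interface passes: off the datapath, to the coprocessor,
    # back from the coprocessor, and back onto the datapath.
    interface_bytes = 4 * packet_bytes
    # Four memory passes if the packet is buffered: a write and a read
    # before forwarding it, and a write and a read after it returns.
    memory_bytes = 4 * packet_bytes if buffered else 0
    return interface_bytes, memory_bytes


def secure_interface_gbps(line_rate_gbps, secure_fraction):
    """Interface bandwidth consumed by secure traffic alone (idealized)."""
    return 4 * line_rate_gbps * secure_fraction


# Hypothetical example: 1500-byte packets, 2.5-Gbps line, 10% secure traffic.
print(lookaside_bytes(1500))
print(secure_interface_gbps(2.5, 0.10))
```

Under these assumptions, even a modest share of secure traffic claims interface bandwidth equal to a large fraction of the line rate.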

If the encryption subsystem is local to the line card, your device can consume less bandwidth by sending a pointer to a shared memory. Alternatively, you can place encryption coprocessors inline to handle all security processing before the NPU even sees the packet, although some people remain unconvinced that these coprocessors can handle wire-rate speeds if too many packets require processing. If the coprocessor uses SPI, you may be able to channel incoming data to support multiple inline coprocessors.

Another issue with security coprocessors is that they typically have a single ingress interface and a single egress interface. Thus, if your system has several NPUs, you can connect only one NPU to the ingress port. You have to either direct all secure traffic to this NPU, even though it is difficult to tell whether a packet is secure until after you start processing it, or implement a separate coprocessor for each NPU. If you use multiple coprocessors, you have the problem of managing security associations, because the coprocessors have neither shared memory nor a communication link.

 

Other considerations

TM (traffic management) has become a sufficiently well-defined function to offload from the NPU (network-processing unit). Teradiant, for example, offers the Multiservice TM, part of its TeraPacket chip set, for handling TM functions. The device can also supply added headers, so packets need not access look-up tables on the egress port. However, the egress port must have a traffic manager that supports the added header, which adds 11 bytes. For 40-byte packets, this scenario requires accounting for a 27.5% increase in interface and fabric bandwidth. Also, intelligent fabrics may not understand the extra header.
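
The 27.5% figure follows directly from the ratio of header size to packet size. A quick sketch, in which the function name and the 1500-byte comparison packet are assumptions for illustration:

```python
def header_overhead(packet_bytes, header_bytes=11):
    """Fractional bandwidth increase from prepending a fixed-size header."""
    return header_bytes / packet_bytes


# 40 bytes is the small-packet case from the text; 1500 is an assumed large packet.
for size in (40, 1500):
    print(f"{size}-byte packet: {header_overhead(size):.1%} extra bandwidth")
```

The overhead is worst for the smallest packets and falls below 1% for 1500-byte packets, so the small-packet case is the one to budget for.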

Intelligent traffic managers can reduce the “cell tax” for fixed-length fabrics. Instead of sending an X+1-byte packet using two X-byte fabric transactions and thus wasting X–1 bytes of bandwidth, you can queue multiple packets across transaction boundaries to fully use available bandwidth. Such intelligence comes at the cost of an extra header to define packet boundaries.
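
The trade-off can be sketched numerically. The 64-byte cell, the 2-byte boundary header, and the function names below are hypothetical values chosen for illustration, and the model assumes that packets can be packed back to back with no other per-cell overhead.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def cells_unpacked(packet_sizes, cell_bytes):
    """Each packet starts in a fresh cell, so the tail of its last cell is wasted."""
    return sum(ceil_div(p, cell_bytes) for p in packet_sizes)


def cells_packed(packet_sizes, cell_bytes, boundary_header_bytes):
    """Packets are queued across cell boundaries. Each packet carries a small
    header (an assumed size) so the receiver can find packet boundaries."""
    total = sum(p + boundary_header_bytes for p in packet_sizes)
    return ceil_div(total, cell_bytes)


cell = 64                    # hypothetical fabric cell size, X
packets = [cell + 1] * 4     # four worst-case packets of X+1 bytes each
print(cells_unpacked(packets, cell), cells_packed(packets, cell, 2))
```

In this worst case, packing cuts the cell count from eight to five even after paying for the boundary headers.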

Deterministic NPUs require you to budget for the worst-case number of look-ups per packet. If you use an external search engine, you must therefore set aside many more look-ups than typical traffic actually uses.

Look-up engines no longer compete solely on look-up performance. Many offer advanced power-consumption-reduction features, although using these features often adds complexity, both by consuming table bits to mark entries and by consuming interface bandwidth to pass commands. Other new features include key reuse, which lets you execute multiple look-ups using the same key and so reduces interface-bandwidth consumption; macrolike instructions for configuring and executing multiple look-ups with a single command; statistics support, such as counting the number of matches for an entry; and automatic associated-data look-up, in which the search engine passes a pointer to associated data rather than to a table entry, saving a step for the NPU. Note that automatic associated-data look-up has potential problems that can affect the determinism of the device, such as when the associated data is wider than the bus of the device (32 bits of data on a 16-bit bus, for example) and the device has several such look-ups in a row.
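
The bus-width problem is easy to quantify. A minimal sketch, assuming a hypothetical device that returns each associated-data result over a fixed-width bus in whole transfers; the function names are illustrative only:

```python
def transfers_per_result(data_bits, bus_bits):
    """Bus transfers needed to return one associated-data result."""
    return -(-data_bits // bus_bits)  # integer ceiling division


def burst_transfers(result_widths_bits, bus_bits):
    """Total bus transfers for a run of back-to-back look-up results."""
    return sum(transfers_per_result(w, bus_bits) for w in result_widths_bits)


# Four consecutive look-ups on a 16-bit bus: 32-bit results versus 16-bit results.
print(burst_transfers([32] * 4, 16), burst_transfers([16] * 4, 16))
```

A run of wide results occupies the bus for twice as many transfers as the same number of bus-width results, which is the kind of variation a deterministic design has to absorb in its worst-case budget.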

When engineering for the worst case, you have unused resources when you’re processing typical traffic. At such times, consider using these otherwise-wasted cycles for self-management. You can use flags between threads and cores on the same chip to register how hard the chip is working and whether you can afford to run self-management code. On the data plane, you could analyze and compress information on queues and traffic profiles during heavy loads and send it to the control processor for aggregation or prepare a redundancy checkpoint. On the control plane, you could evaluate the search-engine look-up table and widen gaps before they become problems.

Hardware-evaluation and -development platforms from NPU vendors are essential. Some vendors provide only a board, which means you can’t connect the NPU to a fabric without doing some work. Others supply complete backplanes into which you can plug your own cards. Keep in mind that manufacturers fabricate these backplanes with newer materials and better connectors than the deployed backplanes with which your finished design must actually operate.

Look at the reference platform. Is the design based on the vendor’s APIs, or is it hand-coded with workarounds to be fixed later to improve performance? If the vendor expects you to rely on these tools, it should stand behind them by using them as well.

If you’re using an FPGA to bridge two devices and have some leftover gates, you can potentially do some preprocessing or postprocessing on data passed among the devices to save a few cycles on the receiving end.

High-priority queues can be smaller than low-priority queues because data sits for less time in them.
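
One way to see why is Little's law: average occupancy equals arrival rate multiplied by average time in the queue. The sketch below applies it with hypothetical rates and residence times; the function name and all numbers are assumptions for illustration, and a real design would size for bursts rather than averages.

```python
def queue_depth_bytes(arrival_mbps, residence_us):
    """Average queue occupancy in bytes, from Little's law.

    Mbit/s multiplied by microseconds gives bits; divide by 8 for bytes.
    """
    return arrival_mbps * residence_us // 8


# Same hypothetical 1000-Mbps arrival rate, different assumed residence times.
high_priority = queue_depth_bytes(1000, 10)     # drained within about 10 us
low_priority = queue_depth_bytes(1000, 1000)    # may wait about 1 ms
print(high_priority, low_priority)
```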

Several vendors claim that they have NPUs that can scale from OC-3 to OC-192, but how useful is this ability? The type of servicing on packets—that is, how deep into each packet and protocol you must go at each wire rate—differs; therefore, the processing bottlenecks, at which hand-coding pays the highest dividends, also differ.

Be careful when defining how much secure traffic your system can handle. One argument states that, even though customer demand might now be low, demand will quickly rise once security is a proven function.

When assessing companies, check out their Web sites, but also make a phone call. It can take months for some companies that have gone out of business to bother shutting down their Web sites.

 

Exception overloading

Handing off packets to the control-plane processor as exceptions from an NPU (network-processing unit) may appear to be a panacea for an overloaded data plane. In some respects, however, handing off a packet may actually take up more processing time than having the NPU handle it. Once the control processor processes the exception, the processor must reinsert the exception into the datapath. If you could process the packet on the NPU in the time it takes to hand off the packet and then have it come back through, it might make sense to keep it on the NPU and conserve memory bandwidth. Problems also arise in reordering packets, the memory needed to queue such packets, and the additional latency associated with control-plane processing. On the control plane, the CPU lacks access to a search engine for look-ups, requiring slower look-ups and further reducing exception-processing efficiency.
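
The break-even comparison described above can be written down as a simple check. This is only a sketch: the function name, its parameters, and the microsecond figures are hypothetical, and it ignores the reordering, queuing-memory, and look-up penalties that the text also mentions, all of which count further against the handoff.

```python
def keep_on_npu(npu_process_us, handoff_us, control_process_us, reinsert_us):
    """Return True if processing the packet on the NPU takes no longer than
    the exception round trip through the control-plane processor."""
    round_trip_us = handoff_us + control_process_us + reinsert_us
    return npu_process_us <= round_trip_us


# Hypothetical timings in microseconds.
print(keep_on_npu(npu_process_us=12, handoff_us=5, control_process_us=6, reinsert_us=5))
print(keep_on_npu(npu_process_us=40, handoff_us=5, control_process_us=6, reinsert_us=5))
```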

Profiling your exception capacity is often a case of guesswork. The control-plane-development environment treats the data plane as a black box and vice versa. Currently, to test load balancing between the data plane and the control plane, you need to use real hardware with real traffic. With many NPUs, however, you design with a device that doesn’t yet exist. If you can get hardware, it’s often a version of a chip that represents only a subset of the NPU’s functions, such as one core instead of six cores running on an FPGA. Additionally, you need to code most of your application just to test hardware.

Making a good guess means understanding the details of the NPU architecture and the coprocessors you can attach to it, as well as how much you can count on the control-plane processor. Working in C or another abstracted coding scheme can help. You can usually squeeze a bit more performance out of abstracted code through hand coding. However, estimating how much reserve performance you can squeeze out is difficult, because running one core instead of six shows you whether a single core alone can run at one-sixth the wire rate, but it does not exercise system resources in the way that six cores might.
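
The limitation of a single-core test shows up in a back-of-the-envelope cycle budget. In the sketch below, the 600-MHz core clock and the function names are assumptions for illustration, the OC-48 line rate is taken as 2488.32 Mbps, and SONET framing overhead is ignored.

```python
import math


def packets_per_second(line_rate_bps, packet_bytes):
    """Packet arrival rate for back-to-back packets of one size."""
    return line_rate_bps / (packet_bytes * 8)


def cycles_per_packet(core_hz, cores, line_rate_bps, packet_bytes):
    """Aggregate processor cycles available per packet."""
    return core_hz * cores / packets_per_second(line_rate_bps, packet_bytes)


OC48_BPS = 2_488_320_000   # OC-48 line rate in bits per second
CORE_HZ = 600_000_000      # hypothetical core clock

six_cores_full_rate = cycles_per_packet(CORE_HZ, 6, OC48_BPS, 40)
one_core_sixth_rate = cycles_per_packet(CORE_HZ, 1, OC48_BPS / 6, 40)
print(round(six_cores_full_rate), round(one_core_sixth_rate))
print(math.isclose(six_cores_full_rate, one_core_sixth_rate))
```

The two budgets are identical, which is exactly why the single-core test is incomplete: it validates the cycle count per packet but puts only one-sixth of the load on memory, buses, and the search engine.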

As insurance, inject some inefficiencies into your code to give yourself even more headroom. Throw in extra memory accesses or search-engine look-ups to more thoroughly stress the system. You need to give yourself overhead with all system resources, not just processing cycles.

One approach for guaranteeing wire rate is to define the worst cases for packet processing and design a system that can handle them. Determining worst cases, however, is trading one guesswork problem for another. To understand what kind of traffic or dynamic conditions are pathological to a system, you need to test the system, which usually means that you have to design it. Building a model using abstracted code lets you quickly build a system, but abstraction insulates you from the details you need to understand to be able to determine worst cases.



©1997-2008 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.