Surpassing the bandwidth limitations of cache-based processing architectures

As bandwidth increases and security threats evolve, network infrastructures must support 10 and 40 Gbps of throughput, with deeper intelligence. Traditional cache-based processors don't suit use in data-plane processing once it scales to millions of flows per second. An alternative tightly couples network flow processors and x86 CPU cores.

Daniel Proch, Netronome -- EDN, March 17, 2011

At A Glance

  • Cisco predicts a fourfold increase in global IP (Internet Protocol) traffic by 2014, with video and wireless services leading the charge.
  • Increasingly elaborate network-security approaches require sophisticated packet processing, a high available-instruction-per-packet rate, and stateful flow management, all at 10 Gbps and higher speeds.
  • General-purpose multicore CPUs are ineffective at data-plane processing for networking applications because the data in these applications is rarely spatially or temporally associative, and CPUs’ caches are therefore too small to meaningfully encompass it.
  • Network flow processors use multiple techniques to hide memory latencies, therefore providing more efficient memory-bandwidth usage than that of general-purpose processors.
  • A heterogeneous multicore architecture tightly couples network-flow-processor cores with general-purpose multicore x86 systems over a high-speed virtualized PCIe (Peripheral Component Interconnect Express) datapath.

The amount of network traffic in today’s wired and wireless infrastructures continues to rise at dramatic rates to keep up with the demand for IP (Internet Protocol)-based voice, video, and data services and applications (Figure 1). Cisco estimates that annual global IP traffic will increase fourfold by 2014, growing from 176 exabytes/year to 767 exabytes/year, or about three-quarters of a zettabyte (Reference 1). The primary drivers for this growth are video services and mobile data.

Video, such as TV, video on demand, Internet video, and P2P (peer to peer), will exceed 91% of global consumer network traffic within four years. Internet-based video will grow from 33% to almost 60% of Internet-data traffic, in the process surmounting P2P as the primary video contributor and comprising the equivalent of 12 billion DVDs of data per year. Mobile-data traffic, although still a smaller individual category, will double every year, increasing 39 times in this same four-year period. And P2P traffic, although no longer comprising the most voluminous traffic type by 2014, will still be substantial as a percentage of overall network data.

[Figure 1]

An important implication of this rapid growth in network throughput exists from the perspective of designing and developing products for data-center and carrier environments. The GbE (gigabit-Ethernet) infrastructure that vendors commonly deploy will rapidly saturate, forcing network architects to quickly move to the 10-Gbps Ethernet successor. Vendors will also introduce follow-on 40- and 100-Gbps interfaces in the coming years, which will eventually become commonplace in switching, routing, and network-security products.

Why flows matter

More users and more applications are the fundamental driving forces of this dramatic increase in network throughput. This combination ultimately results in more individual “conversations,” or “flows,” traversing the network at any time. A flow is a unidirectional sequence of packets, all sharing a set of common packet-header values. As few as two packet-header fields, such as a source and destination IP-address pair, or as many as 11 Layer 2 through Layer 4 header values can identify a flow (Figure 2).
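As a minimal sketch of this definition, the following Python groups packets into unidirectional flows by a five-field key. The choice of five fields, the `FlowKey` name, and the dictionary packet layout are illustrative assumptions; real equipment may key on anywhere from two to 11 header values.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple flow identifier (one of several possible field sets)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


def group_into_flows(packets):
    """Group packet dicts by flow key; each direction is its own flow."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)
    return dict(flows)


# Made-up packets: two from client to server, one reply in the other direction.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80,
     "dst_port": 40000, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "length": 52},
]
flows = group_into_flows(packets)
forward = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6)
```

Because a flow is unidirectional, the request and the reply of one conversation land in two separate entries of `flows`.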

Most network equipment employing ASICs or fixed-function network processors, including Ethernet switches and IP routers, processes traffic based solely on the information contained in datagram headers. These devices process traffic packet by packet, keeping no in-memory information, or “state,” of previous packets after each forwarding decision.

Network architects also deploy an array of security applications to protect their critical enterprise and carrier resources. Security-enhancing applications include virus scanning, firewalls, intrusion-detection and -prevention systems, DDoS (distributed-denial-of-service)-mitigation programs, DLP (data-loss-prevention) and test-and-measurement utilities, and network-forensics systems. These applications work almost entirely by implementing DPI (deep-packet inspection) and flow analysis, looking for known network patterns and, upon finding them, blocking or recording them. With the need for application awareness, security processing, and DPI, the amount of processing power for these computationally intensive applications grows exponentially with increases in line rates.

Maintaining the network state on all flows passing through a system is a critical requirement for all of these intelligent applications. Rather than implementing simple packet-based processing, security systems require sophisticated packet and security processing, along with a high available-instruction-per-packet rate and stateful management of flows at 10 Gbps and higher speeds.

Example applications

Considering the evolution of today’s threat landscape, numerous applications would prove ineffective without flow-based stateful processing of network traffic at the line rate. Cyber-security, lawful-interception, and traffic-management applications using DPI and behavioral-analysis techniques must retain a per-flow state because reliable analysis often requires seeing across individual packet boundaries to identify protocols and applications. These applications may also use heuristics or behavioral analysis to reliably detect applications or protocols even if advanced obfuscation or encryption techniques are in use.

As attacks become more sophisticated and attackers become better organized, intrusion-detection and -prevention systems rely on flow processing with many states. Modern attacks use evasion techniques, such as spreading malicious traffic across packet boundaries, payloads, and even IP fragments, to avoid detection. For example, Snort, a popular open-source intrusion-detection and -prevention application, includes a preprocessing module that reassembles an entire TCP (Transmission Control Protocol) flow to run signature-based rules against the entire connection payload, rather than simply examining traffic on a per-packet basis.
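The following Python sketch illustrates why per-packet inspection can miss a signature that stream reassembly catches. It is a simplified illustration and not Snort's implementation: the `reassemble` function, the made-up signature, and the segment data are assumptions, and the sketch ignores sequence-number wraparound and missing segments.

```python
def reassemble(segments):
    """Concatenate (seq, payload) TCP segments in sequence order.

    Simplified: raises on gaps, and drops bytes that overlap data
    already placed in the stream (e.g. retransmissions).
    """
    stream = bytearray()
    next_seq = None
    for seq, payload in sorted(segments, key=lambda s: s[0]):
        if next_seq is None:
            next_seq = seq
        if seq < next_seq:                      # overlap or retransmission
            payload = payload[next_seq - seq:]
            seq = next_seq
        if seq != next_seq:
            raise ValueError("gap in stream")
        stream += payload
        next_seq += len(payload)
    return bytes(stream)


SIGNATURE = b"evil-payload"                     # made-up signature

# The signature is split across two segments that arrive out of order.
segments = [
    (106, b"-payloadyy"),
    (100, b"xxevil"),
]

per_packet_hit = any(SIGNATURE in payload for _, payload in segments)
stream_hit = SIGNATURE in reassemble(segments)
```

Here `per_packet_hit` is false because neither segment contains the whole pattern, whereas `stream_hit` is true once the flow is reassembled. Holding the segments until the stream is complete is exactly the per-flow state that the text describes.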

Network forensics, data-loss prevention, and antivirus applications, whether host- or network-based, terminate connections at the TCP layer, parse the application protocol, such as HTTP (Hypertext Transfer Protocol), SMTP (Simple Mail Transfer Protocol), P2P, and others, and even reassemble entire file attachments to scan for threats and monitor for confidentiality breaches.

The emergence of stateful next-generation firewalls, devices that integrate traditional firewall and network-intrusion-prevention capabilities, has recently caused a major stir in the market. The essential requirements for an effective next-generation firewall include the ability to identify applications regardless of port, protocol, or encryption scheme; to provide visibility and control over applications; and to accurately identify users to provide real-time protection against a variety of threats, including those at the application layer. A next-generation firewall performs application identification and security processing at the beginning of a flow and retains significant attributes of each connection in memory. The firewall then uses that flow state to process the rest of the session as a means of increasing performance.
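The idea of classifying once and then reusing flow state can be sketched as follows. The `FlowStateFirewall` class, the `toy_classifier` function, and the `TOYAPP` marker are hypothetical names for illustration only; a real next-generation firewall typically inspects several packets before reaching a verdict and also ages out idle flows.

```python
class FlowStateFirewall:
    """Toy firewall: expensive classification runs once per flow."""

    def __init__(self, classify):
        self.classify = classify      # expensive slow-path function
        self.flow_table = {}          # flow key -> cached verdict
        self.slow_path_calls = 0

    def process(self, flow_key, payload):
        verdict = self.flow_table.get(flow_key)
        if verdict is None:           # first packet of the flow: slow path
            self.slow_path_calls += 1
            verdict = self.classify(payload)
            self.flow_table[flow_key] = verdict
        return verdict                # later packets: fast path lookup


def toy_classifier(payload):
    """Stand-in for application identification (made-up rule)."""
    return "block" if payload.startswith(b"TOYAPP") else "allow"


fw = FlowStateFirewall(toy_classifier)
flow = ("10.0.0.1", 40000, "10.0.0.2", 6881, 6)
verdicts = [fw.process(flow, p)
            for p in (b"TOYAPP hello", b"\x00\x01data", b"more data")]
```

All three packets receive the verdict reached on the first one, and `fw.slow_path_calls` stays at 1. The cost is one table entry per active flow, which is the memory pressure discussed in the sections that follow.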

Flow challenges

As networks’ traffic and bandwidth increase, processing that traffic becomes an increasingly memory-intensive challenge. Processing huge volumes of traffic at high instruction rates while accurately tracking flows requires large amounts of memory so that state persists across all of the packets in each flow.

Analysis of packet captures of real-world network-backbone links enables further investigation of flow-based forwarding challenges, specifically the relationship between network throughput, packet size, and flow length, in an effort to understand the mean time between packets in a flow (Figure 3). From such information, architects can derive the system memory required for stateful flow processing at 10 Gbps and beyond.

It can be shown that the state required to process flows increases linearly with an increase in traffic in networks with similar traffic profiles. Analysis also reveals that the interpacket time within a flow is almost entirely due to application delay and tributary network speed. Transactional and signaling flows tend to be shorter and have greater application delays than do flows from streaming applications (Reference 2). The phenomenon is not due to network-backbone links. Factors such as network transmission and delay are also statistically irrelevant to interpacket times.

You might expect that packets that are sent back to back would be a few microseconds apart, but the data shows that flows are highly temporally dissociative. Average flow interpacket times can be a second or longer. Analysis also shows that bandwidth has no significant effect on flow interpacket time. Bandwidth does dramatically affect other aspects related to flow processing, however. As throughputs increase, the total number of flows increases linearly with throughput, as does the spatial disassociation of packets within a flow.

Flow considerations

Developers build general-purpose multicore x86 and MIPS CPUs with on-chip hierarchical caches to hide memory latencies and increase performance. For maximum cache usage and efficiency, data and instructions should reside in the on-chip cache memories for rapid access. If the cache does not hold the relevant information, the general-purpose processor must instead access external DRAM. Because external memory is significantly slower than on-chip memory, the processor effectively downshifts to the speed of the external memory when it is off cache.

On-CPU caches typically evict data using an LRU (least recently used) algorithm. Cache-based architectures therefore require that the data they operate on reside close together in space, time, or both to ensure a high cache hit rate. General-purpose multicore CPUs are ineffective at data-plane processing for networking applications because data in these types of applications is rarely spatially or temporally associative. In enterprise and carrier networks, temporal disassociation of packets is evident at all data rates, and traffic is increasingly spatially dissociative as throughputs increase. As bandwidth grows, so does the number of unique flows, making hierarchical cache memories ineffective due to low cache hit rates.
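A small simulation can illustrate the effect. The Python sketch below models a cache of per-flow state with LRU eviction and feeds it packets from interleaved flows. The capacity, flow counts, and strict round-robin arrival order are illustrative assumptions (round-robin is the worst case for LRU), not measurements from real traffic.

```python
from collections import OrderedDict


def lru_hit_rate(flow_sequence, capacity):
    """Fraction of lookups that hit an LRU cache holding `capacity` flow states."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for flow_id in flow_sequence:
        total += 1
        if flow_id in cache:
            hits += 1
            cache.move_to_end(flow_id)        # mark as most recently used
        else:
            cache[flow_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)     # evict least recently used
    return hits / total if total else 0.0


def interleaved(num_flows, packets_per_flow):
    """Packets from num_flows flows arriving in strict round-robin order."""
    return [f for _ in range(packets_per_flow) for f in range(num_flows)]


# Same cache, 10 packets per flow; only the number of concurrent flows changes.
few = lru_hit_rate(interleaved(100, 10), capacity=1000)
many = lru_hit_rate(interleaved(10_000, 10), capacity=1000)
```

With 100 concurrent flows, only the first packet of each flow misses, so `few` is 0.9. With 10,000 concurrent flows, each flow's state is evicted before its next packet arrives, so `many` is 0.0 even though the per-flow traffic pattern is unchanged.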

One potential approach to this architectural issue would be to simply continue increasing memory caches’ sizes, but cache capacity is not keeping pace with the requirements for stateful flow processing at 10 Gbps and higher speeds. Internal processor cache memories are typically orders of magnitude smaller than external memories. For example, the Intel Xeon 5640 processor has 12 Mbytes of cache, compared with the multiple gigabytes of external memory that you can attach to these processors.

Conservatively assuming that a system requires 0.5 kbyte of memory to maintain state information for a single flow, stateful analysis of 1 million flows takes roughly 0.5 Gbyte of state, so a processor’s cache memory would need to be significantly larger than those that are currently available to avoid ever evicting data from the cache. Similarly, assuming a reasonable 0.5-second gap between packets in a flow on a 1-Gbps link, 500 Mbits, or almost 63 Mbytes, would traverse the system in that short time. Assuming an average packet size of 440 bytes, more than 142,000 packets, potentially each from a different flow, would traverse the system before the next packet within that same flow arrives. For cache-based architectures to prove effective in these demanding circumstances, they would require cache memories approaching 1 Gbyte in size, almost 100 times larger than those now available in multicore processors.
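The arithmetic in this estimate can be reproduced with a few lines of Python. The inputs are the figures given in the text; treating 0.5 kbyte as 500 bytes (decimal units) and the function name `traffic_between_packets` are assumptions made for the sketch.

```python
def traffic_between_packets(link_bps, gap_s, avg_packet_bytes):
    """Bytes and packets that cross a fully loaded link during one interpacket gap."""
    bytes_in_gap = link_bps * gap_s / 8
    packets_in_gap = bytes_in_gap / avg_packet_bytes
    return bytes_in_gap, packets_in_gap


FLOW_STATE_BYTES = 500            # 0.5 kbyte per flow, decimal units assumed
CACHE_BYTES = 12 * 1000 ** 2      # 12-Mbyte on-chip cache from the example above

# 1-Gbps link, 0.5 s between packets of one flow, 440-byte average packet.
bytes_1g, packets_1g = traffic_between_packets(1e9, 0.5, 440)

# State needed to track 1 million flows without evicting any of them.
state_1m_flows = 1_000_000 * FLOW_STATE_BYTES

# How many times larger a 1-Gbyte cache is than the 12-Mbyte example.
ratio_1gb_to_cache = 1e9 / CACHE_BYTES

print(f"{bytes_1g / 1e6:.1f} Mbytes and {packets_1g:,.0f} packets per gap")
print(f"{state_1m_flows / 1e6:.0f} Mbytes of state for 1 million flows")
print(f"1 Gbyte is about {ratio_1gb_to_cache:.0f}x a 12-Mbyte cache")
```

Running the same function with `link_bps=10e9` gives 625 Mbytes and about 1.4 million packets per gap, which is one way to see why the required cache size approaches 1 Gbyte at 10 Gbps.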

Hitting the wall

The Achilles’ heel for any processing architecture is poor memory latency. When a processor’s cache memories are full (that is, fully occupied by in-memory data structures), new packets continuously thrash the CPU’s cache. The processor reads data once and then quickly evicts it from cache memory, causing subsequent cache misses and a low overall cache hit rate. The end result is significant memory latency when the CPU must go off-chip to get instructions and data from external memory. When this scenario occurs, the CPU stalls waiting for operations to complete, wasting as many as 200 CPU cycles per transaction for external DDR3 SDRAM.

Purely cache-based architectures struggle to effectively handle high-packet-rate I/O traffic, security processing, and DPI at 10 Gbps and beyond. General-purpose CPUs are ideal for application and control-plane workloads, but they become a networking bottleneck in high-performance designs requiring high packet touch rates and a large number of instructions per packet over an increasing number of flows. Standard methods of hiding memory latencies, such as the use of hierarchical cache architectures, become ineffective.

Flow processors, conversely, use multiple techniques to hide memory latencies to provide more efficient memory-bandwidth usage than that of general-purpose processors. First, multiprocessing through as many as 40 independent networking- and security-optimized microengines can simultaneously process multiple independent streams of network traffic. Further, chip multithreading removes memory latency by allowing some processes to operate as other threads are waiting to complete. Using chip multithreading, memory operations are asynchronous to the processing threads. The processor structures and arranges read- and write-memory operations, effectively hiding one thread’s memory references behind another thread’s computations.
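A simple analytic model, which is not from the article, shows how multithreading hides memory latency. It assumes that each thread alternates a fixed number of compute cycles with one memory reference, that context switches are free, and that memory can serve all outstanding references in parallel. The 200-cycle latency comes from the text above; the 50 compute cycles and the thread counts are hypothetical.

```python
def engine_utilization(threads, compute_cycles, memory_latency_cycles):
    """Fraction of time a multithreaded engine does useful work.

    Each thread computes for compute_cycles, then waits
    memory_latency_cycles on a memory reference; while it waits,
    other ready threads use the engine.
    """
    per_thread = compute_cycles / (compute_cycles + memory_latency_cycles)
    return min(1.0, threads * per_thread)


LATENCY = 200     # cycles per external-memory access (figure from the text)
COMPUTE = 50      # hypothetical cycles of work between memory references

for n in (1, 2, 4, 8):
    print(n, "threads ->", round(engine_utilization(n, COMPUTE, LATENCY), 2))
```

Under these assumptions a single thread keeps the engine busy only 20% of the time, whereas five or more threads keep it fully occupied because one thread's memory wait overlaps the other threads' computation.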

Heterogeneous flows

An effective processing architecture for intelligent network and security applications must account for cache inefficiencies that occur because of the number of instructions that the processor must apply to each packet to support stateful flow processing. Meeting these performance challenges warrants a new approach to the development of the high-performance systems that intelligent networks require. Such systems must be able to analyze traffic at all layers of the OSI (Open Systems Interconnection) model, from Layer 2, the data-link layer, to Layer 7, the application layer, and perform this intelligent processing on all traffic at sustained throughputs of 20 Gbps and higher. Achieving these goals requires specialized and varied processing elements for a specific type of workload computation.

A heterogeneous multicore architecture sets a new performance benchmark for embedded-application development through discrete processing elements for packet classification, stateful flow management, and application and control-plane processing, each with increasingly fine granularity. This architecture tightly couples network-flow-processor cores with general-purpose multicore x86 systems over a 40-Gbps virtualized PCIe (Peripheral Component Interconnect Express) datapath. This architecture can scale from low-end systems to appliances offering hundreds of gigabits per second of packet analysis, stateful flow monitoring, DPI, and application throughput, all with a common software architecture.

Accelerated designs employing this architecture can enable equipment providers to deliver high-performance, flexible systems that are more efficient than systems employing general-purpose x86 processors alone with standard NICs (network-interface cards). This architecture eliminates memory-latency problems and CPU stalls due to cache misses because the network-flow processor provides a level of preprocessing and properly structures data before transmitting it to the x86 CPU cores. Grouping traffic into flows at the network-flow-processor layer and then optionally load-balancing flows across x86 cores, pinning flows to x86 destinations, or both can dramatically increase the probability of cache hits because packets now arrive in a spatially and temporally associative manner.
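One common way to pin flows to cores is to hash the flow identifier and take the result modulo the number of cores. The Python sketch below illustrates that general technique and is not a description of any vendor's implementation. The `core_for_flow` function, the CRC-32 hash, and the choice to map both directions of a conversation to the same core are assumptions made for the example.

```python
import zlib


def core_for_flow(src_ip, src_port, dst_ip, dst_port, protocol, num_cores):
    """Map a flow to a core index in [0, num_cores).

    Endpoints are sorted so both directions of a conversation hash
    to the same core (an assumption that suits stateful processing).
    """
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}|{b[0]}:{b[1]}|{protocol}".encode()
    return zlib.crc32(key) % num_cores


NUM_CORES = 8   # hypothetical x86 core count

forward_core = core_for_flow("10.0.0.1", 40000, "10.0.0.2", 80, 6, NUM_CORES)
reverse_core = core_for_flow("10.0.0.2", 80, "10.0.0.1", 40000, 6, NUM_CORES)
```

Every packet of a given conversation reaches the same core, so that core's cache is more likely to still hold the flow's state when the next packet arrives. A plain modulo mapping reshuffles flows whenever the core count changes, so it is best suited to a fixed set of cores.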



References
  1. “Cisco Visual Networking Index: Forecast and Methodology, 2009-2014,” Cisco, 2010.
  2. Bradshaw, Christopher, “The Effect of Scaling Network Resources on Flows in an Internet Backbone,” 2010.

Author's Biography

Daniel Proch is director of product management at Netronome, where he is responsible for the company’s line of network-flow-engine acceleration cards and flow-management software. He has 14 years of experience in networking and telecommunications, spanning product management, chief-technology-office positions, strategic planning, engineering, and technical support.
© 2012 UBM Electronics. All rights reserved.