Surpassing the bandwidth limitations of cache-based processing architectures
As bandwidth increases and security threats evolve, network infrastructures must support 10 and 40 Gbps of throughput, with deeper intelligence. Traditional cache-based processors are poorly suited to data-plane processing once it scales to millions of flows per second. An alternative tightly couples network flow processors and x86 CPU cores.
Daniel Proch, Netronome -- EDN, March 17, 2011
The amount of network traffic in today’s wired and wireless infrastructures continues to rise at dramatic rates to keep up with the demand for IP (Internet Protocol)-based voice, video, and data services and applications (Figure 1). Cisco estimates that annual global IP traffic will increase fourfold by 2014, growing from 176 exabytes/year to three-quarters of a zettabyte, that is, 767 exabytes (Reference 1). The primary drivers for this growth are video services and mobile data.
Video, such as TV, video on demand, Internet video, and P2P (peer to peer), will exceed 91% of global consumer network traffic within four years. Internet-based video will grow from 33% to almost 60% of Internet-data traffic, in the process surmounting P2P as the primary video contributor and comprising the equivalent of 12 billion DVDs of data per year. Mobile-data traffic, although still a smaller individual category, will double every year, increasing 39 times in this same four-year period. And P2P traffic, although no longer comprising the most voluminous traffic type by 2014, will still be substantial as a percentage of overall network data.

Why flows matter
More users and more applications are the fundamental driving forces of this dramatic increase in network throughput. This combination ultimately results in more individual “conversations,” or “flows,” traversing the network at any time. A flow is a unidirectional sequence of packets, all sharing a set of common packet-header values. A flow can be identified by as few as two packet-header fields, the source and destination IP addresses, or by as many as 11 Layer 2 through Layer 4 header values (Figure 2).
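As a minimal sketch of this definition, the Python fragment below groups packets into flows using either a two-field key or a five-field key. The `Packet` type, the field names, and the sample addresses are illustrative assumptions and are not taken from any particular product.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical parsed header fields; names are illustrative only."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int
    length: int

def two_tuple_key(pkt):
    # Coarsest key: source and destination IP address only.
    return (pkt.src_ip, pkt.dst_ip)

def five_tuple_key(pkt):
    # Finer key: adds the Layer 4 protocol and ports.
    return (pkt.src_ip, pkt.dst_ip, pkt.protocol, pkt.src_port, pkt.dst_port)

def count_flows(packets, key_fn):
    # A flow is the set of packets sharing one key value.
    return len({key_fn(p) for p in packets})

packets = [
    Packet("10.0.0.1", "10.0.0.2", 6, 40000, 80, 440),
    Packet("10.0.0.1", "10.0.0.2", 6, 40000, 80, 1500),
    Packet("10.0.0.1", "10.0.0.2", 6, 40001, 443, 60),
    Packet("10.0.0.2", "10.0.0.1", 6, 80, 40000, 1500),  # reverse direction
]

flows_by_ip_pair = count_flows(packets, two_tuple_key)
flows_by_five_tuple = count_flows(packets, five_tuple_key)
```

Because a flow is unidirectional, the reverse-direction packet forms its own flow under both keys. The finer key also separates the two client connections that the IP-pair key merges.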
Most network equipment employing ASICs or fixed-function network processors, including Ethernet switches and IP routers, processes traffic based solely on the information contained in datagram headers. These devices process traffic packet by packet, keeping no in-memory information, or “state,” of previous packets after each forwarding decision.
Network architects also deploy an array of security applications to protect their critical enterprise and carrier resources. Security-enhancing applications include virus scanning, firewalls, intrusion-detection and -prevention systems, DDOS (distributed-denial-of-service)-mitigation programs, DLP (data-loss-prevention) and test-and-measurement utilities, and network-forensics systems. These applications work almost entirely by implementing DPI (deep-packet inspection) and flow analysis, looking for known network patterns and, upon finding them, blocking or recording them. With the need for application awareness, security processing, and DPI, the amount of processing power for these computationally intensive applications grows exponentially with increases in line rates.
Maintaining the network state on all flows passing through a system is a critical requirement for all of these intelligent applications. Rather than implementing simple packet-based processing, security systems require sophisticated packet and security processing, along with a high available-instruction-per-packet rate and stateful management of flows at 10 Gbps and higher speeds.
Example applications
Considering the evolution of today’s threat landscape, numerous applications would prove ineffective without flow-based stateful processing of network traffic at the line rate. Cyber-security, lawful-interception, and traffic-management applications using DPI and behavioral-analysis techniques must retain a per-flow state because reliable analysis often requires seeing across individual packet boundaries to identify protocols and applications. These applications may also use heuristics or behavioral analysis to reliably detect applications or protocols even if advanced obfuscation or encryption techniques are in use.
As attacks become more sophisticated and attackers become better organized, intrusion-detection and -prevention systems rely on flow processing with many states. Modern attacks use evasion techniques, such as spreading malicious traffic across packet boundaries, payloads, and even IP fragments, to avoid detection. For example, Snort, a popular open-source intrusion-detection and -prevention application, includes a preprocessing module that reassembles an entire TCP (Transmission Control Protocol) flow to run signature-based rules against the entire connection payload, rather than simply examining traffic on a per-packet basis.
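The effect of splitting a pattern across packet boundaries can be illustrated with a short Python sketch. It is not Snort’s implementation. The `reassemble` helper is hypothetical and assumes one direction of one flow, sequence numbers starting at zero, and no gaps or overlapping segments. The signature string is invented.

```python
SIGNATURE = b"EVIL_PAYLOAD"  # invented pattern, for illustration only

def scan(payload: bytes, signature: bytes) -> bool:
    # Naive signature match: a plain substring search.
    return signature in payload

def reassemble(segments):
    """Rebuild the byte stream from (sequence_number, data) pairs.

    Simplifying assumptions: no gaps, no overlaps, no retransmissions.
    """
    stream = bytearray()
    for _seq, data in sorted(segments, key=lambda s: s[0]):
        stream += data
    return bytes(stream)

# The pattern is spread over three segments that arrive out of order.
segments = [
    (10, b"IL_PAYL"),
    (0, b"GET /?q=EV"),
    (17, b"OAD HTTP/1.1"),
]

per_packet_hit = any(scan(data, SIGNATURE) for _seq, data in segments)
stream_hit = scan(reassemble(segments), SIGNATURE)
```

No single segment contains the pattern, so per-packet scanning misses it. Only the reassembled stream matches, which is why the detector must hold per-flow state until enough of the stream has arrived.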
Network forensics, data-loss prevention, and antivirus applications, whether host- or network-based, terminate connections at the TCP layer, parse the application protocol, such as HTTP (Hypertext Transfer Protocol), SMTP (Simple Mail Transfer Protocol), P2P, and others, and even reassemble entire file attachments to scan for threats and monitor for confidentiality breaches.
The emergence of stateful next-generation firewalls, devices that integrate traditional firewall and network-intrusion-prevention capabilities, has recently caused a major stir in the market. The essential requirements for an effective next-generation firewall include the ability to identify applications regardless of port, protocol, or encryption scheme; to provide visibility and control over applications; and to accurately identify users to provide real-time protection against a variety of threats, including those at the application layer. A next-generation firewall retains significant attributes of each connection in memory, in which application identification and security processing happen at the beginning of the flow. The firewalls then use the flow state to process the session as a means of increasing performance.
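The pattern of classifying once at the start of a flow and then reusing the stored state can be sketched as follows. The class name, the toy classifier, and the policy table are all assumptions made for illustration. A real firewall’s application identification is far more involved.

```python
class FlowStateFirewall:
    """Toy model: expensive classification runs only on a flow's first packet."""

    POLICY = {"http": "allow", "unknown": "drop"}  # hypothetical policy

    def __init__(self):
        self.flow_table = {}       # flow key -> per-flow state
        self.classifications = 0   # how often the slow path ran

    def classify(self, payload: bytes) -> str:
        # Stand-in for costly application identification.
        self.classifications += 1
        if payload.startswith((b"GET ", b"POST ")):
            return "http"
        return "unknown"

    def process(self, flow_key, payload: bytes) -> str:
        state = self.flow_table.get(flow_key)
        if state is None:
            # Slow path: first packet of the flow.
            app = self.classify(payload)
            state = {"app": app, "verdict": self.POLICY[app], "packets": 0}
            self.flow_table[flow_key] = state
        # Fast path: later packets reuse the stored verdict.
        state["packets"] += 1
        return state["verdict"]

fw = FlowStateFirewall()
flow = ("10.0.0.1", "10.0.0.2", 6, 40000, 80)
verdicts = [
    fw.process(flow, chunk)
    for chunk in (b"GET / HTTP/1.1\r\n", b"Host: example.com\r\n", b"\r\n")
]
```

The performance gain depends entirely on the flow-table lookup being cheap, which is the memory-access problem the following sections examine.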
Flow challenges
As networks’ traffic and bandwidth increase, building these networks becomes an increasingly memory-intensive challenge. Processing huge volumes of traffic at high instruction rates and maintaining an accurate tracking of flows require large amounts of memory for a state to remain across all of the packets in the flow.
Analysis of packet captures of real-world network-backbone links enables further investigation of flow-based forwarding challenges, specifically the relationship between network throughput, packet size, and flow length in an effort to understand the mean time between packets in a flow (Figure 3). From such information, architects can derive the system memory for stateful flow processing at 10 Gbps and beyond.
It can be shown that the state required to process flows increases linearly with an increase in traffic in networks with similar traffic profiles. Analysis also reveals that the interpacket time within a flow is almost entirely due to application delay and tributary network speed. Transactional and signaling flows tend to be shorter and have greater application delays than do flows from streaming applications (Reference 2). The phenomenon is not due to network-backbone links. Factors such as network transmission and delay are also statistically irrelevant to interpacket times.
You might expect that packets that are sent back to back would be a few microseconds apart, but the data shows that flows are highly temporally dissociative. Average flow interpacket times can be a second or longer. Analysis also shows that bandwidth has no significant effect on flow interpacket time. Bandwidth does dramatically affect other aspects related to flow processing, however. As throughputs increase, the total number of flows increases linearly with throughput, as does the spatial disassociation of packets within a flow.
Flow considerations
Developers build general-purpose multicore x86 and MIPS CPUs with on-chip hierarchical caches to hide memory latencies and increase performance. For maximum cache usage and efficiency, data and instructions should reside in the on-chip cache memories for rapid access. If the cache memory does not store relevant information, the general-purpose processor instead must access external DRAM. Because external memory is significantly slower than on-chip cache, the processor effectively downshifts to the speed of the external memory when it is off cache.
On-CPU cache memories typically evict data using an LRU (least recently used) algorithm. Cache-based architectures require that data flows physically reside tightly together in space, time, or both to ensure a high cache hit rate. Therefore, general-purpose multicore CPUs are ineffective at data-plane processing for networking applications because data in these types of applications is rarely spatially or temporally associative. In enterprise and carrier networks, temporal disassociation of packets is evident at all data rates, and traffic is increasingly spatially dissociative as throughputs increase. As bandwidth grows, so does the number of unique flows, making hierarchical cache memories ineffective due to low cache hit rates.
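A small simulation makes the hit-rate collapse concrete. The sketch below models the cache as an LRU table of per-flow state entries. It assumes each arriving packet belongs to a uniformly random flow, which is a deliberate simplification of real traffic. The function name and parameters are illustrative.

```python
import random
from collections import OrderedDict

def lru_hit_rate(num_flows, cache_entries, num_packets, seed=0):
    """Fraction of packets whose flow state is already in an LRU cache.

    Assumption: every packet picks its flow uniformly at random.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for _ in range(num_packets):
        flow = rng.randrange(num_flows)
        if flow in cache:
            hits += 1
            cache.move_to_end(flow)        # mark as most recently used
        else:
            cache[flow] = True
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict least recently used
    return hits / num_packets

# Same cache size, 100 times more concurrent flows.
few_flows = lru_hit_rate(num_flows=1_000, cache_entries=1_000, num_packets=100_000)
many_flows = lru_hit_rate(num_flows=100_000, cache_entries=1_000, num_packets=100_000)
```

Under this model the steady-state hit rate is roughly the ratio of cache entries to active flows, so a cache that serves a small flow population well misses on nearly every packet once the flow count far exceeds its capacity.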
One potential approach to this architectural issue would be to simply continue increasing memory caches’ sizes, but cache capacity is not keeping pace with the requirements for stateful flow processing at 10 Gbps and higher speeds. Internal processor cache memories are typically orders of magnitude smaller than external memories. For example, the Intel Xeon 5640 processor has 12 Mbytes of cache, compared with the multiple gigabytes of external memory that you can attach to these processors.
Conservatively assuming that a system requires 0.5 kbyte of memory to maintain state information for a single flow implies that, to support stateful analysis of 1 million flows, a processor’s cache memory would need to be significantly larger than those that are currently available to avoid ever evicting data from the cache. Similarly, reasonably assuming a 0.5-second latency between packets in a flow on a 1-Gbps link, 500 Mbits, or almost 63 Mbytes, would have traversed the system in that short time. Assuming an average packet size of 440 bytes, more than 142,000 packets, each from different flows, would have traversed the system before the next packet within that same flow arrives. For cache-based architectures to prove effective in these demanding circumstances would require cache memories approaching 1 Gbyte in size, almost 100 times larger than those now available in multicore processors.
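The arithmetic behind these figures can be reproduced in a few lines of Python. The variable names are illustrative, and 0.5 kbyte is taken as 500 bytes, which is an assumption about the intended unit convention.

```python
link_bps = 1_000_000_000       # 1-Gbps link
interpacket_gap_s = 0.5        # assumed gap between packets of one flow
avg_packet_bytes = 440         # assumed average packet size
state_bytes_per_flow = 500     # 0.5 kbyte, decimal convention assumed
tracked_flows = 1_000_000

# Traffic that crosses the system between two packets of the same flow.
bits_in_gap = link_bps * interpacket_gap_s
bytes_in_gap = bits_in_gap / 8
packets_in_gap = bytes_in_gap / avg_packet_bytes

# Memory needed just to hold per-flow state.
state_table_bytes = tracked_flows * state_bytes_per_flow

print(f"{bits_in_gap / 1e6:.0f} Mbits = {bytes_in_gap / 1e6:.1f} Mbytes in the gap")
print(f"{packets_in_gap:,.0f} packets in the gap")
print(f"{state_table_bytes / 1e6:.0f} Mbytes of flow state")
```

The state table alone comes to 500 Mbytes, dozens of times the 12-Mbyte cache cited above. This counts only per-flow state, so it is a lower bound on the cache size that would be needed.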
Hitting the wall
The Achilles’ heel for any processing architecture is poor memory latency. When a processor’s cache memories are full, that is, when they hold a fully occupied memory-data structure, the CPU’s cache is continuously thrashed as new packets arrive. The processor reads data once and then quickly evicts it from cache memory, causing subsequent cache misses and a low overall cache hit rate. The end result is significant memory latency when the CPU must go off-chip to get instructions and data from external memory. When this scenario occurs, the CPU stalls waiting for operations to complete, wasting as many as 200 CPU cycles per transaction for external DDR3 SDRAM.
Purely cache-based architectures struggle to effectively handle high-packet-rate I/O traffic, security processing, and DPI at 10 Gbps and beyond. General-purpose CPUs are ideal for application and control-plane workloads, but they become a networking bottleneck in high-performance designs requiring high packet touch rates and a large number of instructions per packet over an increasing number of flows. Standard methods of hiding memory latencies, such as the use of hierarchical cache architectures, become ineffective.
Flow processors, conversely, use multiple techniques to hide memory latencies and thereby use memory bandwidth more efficiently than general-purpose processors do. First, multiprocessing through as many as 40 independent networking- and security-optimized microengines can simultaneously process multiple independent streams of network traffic. Further, chip multithreading hides memory latency by allowing some threads to execute while others wait for memory operations to complete. Using chip multithreading, memory operations are asynchronous to the processing threads. The processor structures and arranges read- and write-memory operations, effectively hiding one thread’s memory references behind another thread’s computations.
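A simple analytical model shows why multithreading hides latency. It assumes each thread alternates a fixed burst of computation with a fixed memory stall and that switching threads costs nothing. The 200-cycle stall echoes the DDR3 figure cited earlier. The 50-cycle compute burst is a hypothetical value, and the function names are illustrative.

```python
import math

def engine_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles one engine spends computing.

    Model assumptions: every thread repeats compute_cycles of work followed
    by stall_cycles of memory wait, and thread switches are free.
    """
    busy = threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy)

def threads_to_hide_latency(compute_cycles, stall_cycles):
    # Smallest thread count that keeps the engine busy during every stall.
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)

COMPUTE = 50   # hypothetical cycles of work between memory references
STALL = 200    # external-memory stall, per the figure cited above

single_thread = engine_utilization(1, COMPUTE, STALL)
eight_threads = engine_utilization(8, COMPUTE, STALL)
needed = threads_to_hide_latency(COMPUTE, STALL)
```

With one thread the engine idles for most of each cycle of work and wait. With enough threads the stalls of one are covered by the computation of the others, which is the effect described above.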
Heterogeneous flows
An effective processing architecture for intelligent network and security applications must account for cache inefficiencies that occur because of the number of instructions that the processor must apply to each packet to support stateful flow processing. Meeting these performance challenges warrants a new approach to the development of the high-performance systems that intelligent networks require. Such systems must be able to analyze traffic at all layers of the OSI (Open Systems Interconnection) model, from Layer 2, the data-link layer, to Layer 7, the application layer, and perform this intelligent processing on all traffic at sustained throughputs of 20 Gbps and higher. Achieving these goals requires specialized and varied processing elements for a specific type of workload computation.
A heterogeneous multicore architecture sets a new performance benchmark for embedded-application development through discrete processing elements for packet classification, stateful flow management, and application and control-plane processing, each with increasingly fine granularity. This architecture tightly couples network-flow-processor cores with general-purpose multicore x86 systems over a 40-Gbps virtualized PCIe (Peripheral Component Interconnect Express) datapath. This architecture can scale from low-end systems to appliances offering hundreds of gigabits per second of packet analysis, stateful flow monitoring, DPI, and application throughput, all with a common software architecture.
Accelerated designs employing this architecture can enable equipment providers to deliver high-performance, flexible systems that are more efficient than systems employing general-purpose x86 processors alone with standard NICs (network-interface cards). This architecture eliminates memory-latency problems and CPU stalls due to cache misses because the network-flow processor provides a level of preprocessing and properly structures data before transmitting it to the x86 CPU cores. Grouping traffic into flows at the network-flow-processor layer and optionally load-balancing flows across x86 cores, pinning flows to x86 destinations, or both approaches can dramatically increase the probability of cache hits because packets now arrive in a spatially and temporally associative manner.
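Flow pinning can be sketched as a deterministic hash of the flow key mapped onto a core index. This is an illustration only and not the mechanism of any specific flow processor. `core_for_flow` is a hypothetical helper, and CRC32 is chosen simply because it gives the same result on every run.

```python
import zlib

NUM_CORES = 8  # hypothetical number of x86 cores

def core_for_flow(flow_key, num_cores=NUM_CORES):
    """Map a flow key to a core so all packets of a flow reach the same core."""
    encoded = "|".join(str(field) for field in flow_key).encode()
    return zlib.crc32(encoded) % num_cores

flow_a = ("10.0.0.1", "10.0.0.2", 6, 40000, 80)

# Every packet of flow_a is steered to one core, keeping its state cache-warm.
cores_for_a = {core_for_flow(flow_a) for _ in range(5)}

# Many distinct flows spread across the available cores.
synthetic_flows = [("10.0.0.1", "10.0.0.2", 6, port, 80) for port in range(10_000)]
cores_used = {core_for_flow(f) for f in synthetic_flows}
```

Because the mapping depends only on the flow key, each core sees a stable subset of flows rather than an interleaved sample of all of them, which raises the chance that a flow’s state is still cached when its next packet arrives.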

Daniel Proch is director of product management at Netronome, where he is responsible for the company’s line of network-flow-engine acceleration cards and flow-management software. He has 14 years of experience in networking and telecommunications, spanning product management, chief-technology-office positions, strategic planning, engineering, and technical support.




