Designing a multicore + coprocessor security system

Network and communications processors are moving to heterogeneous processor configurations for efficiency, cost, and best-of-breed flexibility.

John Bromhead, LSI Corp, Semiconductor Solutions Group -- EDN, 7/24/2009

Security in the enterprise is an ever-growing concern. Long gone are the days when a firewall was good enough. Now every bit of every byte of network traffic must be scanned to intercept ever more sophisticated malware. Preventive measures must not only secure access to individual computers but also include strategies to protect virtual private networks and shared resources such as servers and network attached storage. Network attacks need to be stopped at their entry points before they spread.

To completely protect intellectual property and prevent internal threats, corporations need to implement network security measures within the company between departments such as engineering, accounting, and sales. Similarly, a secure system must prevent a virus or a worm brought in by one department from spreading to the entire corporate network within the building, or even to offices of the company located in other parts of the world.

To ensure an organization receives the highest levels of integrated application security, IT managers are increasingly implementing firedoors (a term invented by IDC Research) to supplement the firewalls that surround their internal networks. A firewall's function is to prevent intrusion into a network. Like the fire doors used in building construction to contain a structural fire and delay its spread to adjacent structures, a firedoor compartmentalizes data and reduces the risk of viruses and worms spreading quickly throughout the network. The trade-off is that these intrusion-prevention systems must run at multigigabit speeds so that they don't impede network traffic within the enterprise.

Security system designers trying to solve these problems are immediately drawn to the new range of multicore processors. But with the ever-increasing need to reduce datacenter power consumption, simply throwing cores at the problem may not be the most cost- and power-efficient solution. It is also important to consider that without dedicated coprocessor support, these multicore processors can fall woefully short of the mark. As a rule of thumb, dedicating a whole core of a dual- or quad-core 3-GHz processor to deep packet inspection yields no more than a few hundred megabits per second of throughput. Today's content processors can easily handle 3Gb/s, 6Gb/s, or even 12Gb/s while using less than one-tenth of the CPU cycles. That is up to a 40X performance boost: not 40 percent, but 40 times.
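The rule of thumb above can be worked through as simple arithmetic. The sketch below is illustrative only: the 300-Mb/s per-core figure is an assumed midpoint for "a few hundred megabits per second," and the function names are hypothetical.

```python
import math

# Assumption: the article says "a few hundred megabits per second" per core;
# 300 Mb/s is a hypothetical midpoint chosen only for illustration.
CORE_DPI_MBPS = 300
# Content-processor rates quoted in the text: 3, 6, and 12 Gb/s.
COPROCESSOR_RATES_MBPS = [3_000, 6_000, 12_000]


def cores_needed(target_mbps: int, per_core_mbps: int = CORE_DPI_MBPS) -> int:
    """Whole cores required to reach a target DPI rate in software alone."""
    return math.ceil(target_mbps / per_core_mbps)


def speedup(target_mbps: int, per_core_mbps: int = CORE_DPI_MBPS) -> float:
    """Throughput ratio between a content processor and one dedicated core."""
    return target_mbps / per_core_mbps


for rate in COPROCESSOR_RATES_MBPS:
    print(f"{rate / 1000:g} Gb/s: {cores_needed(rate)} cores, "
          f"{speedup(rate):.0f}X one core")
```

Under that assumption, matching a 12Gb/s content processor in software would take 40 dedicated cores, which is where a figure like 40X comes from.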

For these reasons, packet processing in the multicore era often involves the use of a coprocessor. New dedicated hardware accelerators, appearing alongside or embedded in multicore processors, can offload the cores and dramatically reduce the power consumption of security appliances, switches, and gateways. Designers should take a hard look at partitioning enterprise security applications to take advantage of the latest multicore systems and coprocessors while minimizing power and latency.

Coprocessor offload architecture

A coprocessor is a compute processor used to supplement the functions of the primary processor (the CPU). Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, or, in the case of a security system, deep packet inspection or encryption. By offloading compute-intensive tasks from the main processor, coprocessors can accelerate system performance.

A typical system has a main processor handling the application code while secondary processors handle the security functions (Figure 1). The main processor might concentrate on receiving and processing data while the coprocessor scans for threats to ensure viruses and worms do not infiltrate the enterprise. Removing these tasks from the main processor frees up enough capability for it to handle the main application single-handed and may allow a lower-cost, lower-power main processor to be used. Security system developers are thus faced with the task of selecting the right multicore processor and the right coprocessor to meet their performance, cost, and power requirements.
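The division of labor described above can be sketched in software. In this minimal illustration a worker thread stands in for the coprocessor and a pair of queues stands in for the hardware interface; the signature list and all function names are hypothetical, and a real content processor would be driven through its vendor's driver rather than a Python thread.

```python
import queue
import threading

# Hypothetical signatures; a real device would load a compiled rule database.
SIGNATURES = [b"EVIL", b"WORM"]


def coprocessor(jobs: queue.Queue, verdicts: queue.Queue) -> None:
    """Stand-in for the security coprocessor: scans payloads from a job queue."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            break
        packet_id, payload = job
        verdicts.put((packet_id, any(sig in payload for sig in SIGNATURES)))


def main_processor(packets: list[bytes]) -> dict[int, bool]:
    """Stand-in for the main processor: hands payloads off, collects verdicts."""
    jobs: queue.Queue = queue.Queue()
    verdicts: queue.Queue = queue.Queue()
    worker = threading.Thread(target=coprocessor, args=(jobs, verdicts))
    worker.start()
    for packet_id, payload in enumerate(packets):
        jobs.put((packet_id, payload))  # offload; the main thread moves on
    jobs.put(None)
    worker.join()
    results: dict[int, bool] = {}
    while not verdicts.empty():
        packet_id, is_threat = verdicts.get()
        results[packet_id] = is_threat
    return results


print(main_processor([b"hello", b"xxWORMxx", b"plain text"]))
```

The main processor never runs the scanning loop itself; it only enqueues work and reads back results.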

Choosing the right multicore processor

The multicore technology era is solidly upon us, and nearly every processor vendor is offering or developing multicore products and architectures to support the demand. A multicore processor can combine anywhere from 2 to 16 or more independent CPUs into a single package, typically resulting in significant savings of space, cost, and power.

Processor selection is a crucial milestone in designing any system. There are many processor options to consider, and each has its pros and cons. System developers must realize that adopting multicore technology presents as many challenges as it does benefits. Putting multiple execution cores into a single processor does not guarantee a matching multiple of processing power. Furthermore, there is no assurance that a multicore processor will deliver a dramatic increase in a system’s throughput.

The multicore processor selection process typically starts with the selection of the right instruction set architecture (ISA). This is usually influenced by availability of applications, OS support, and considerations such as ease of porting legacy software. In some cases, proprietary extensions within the ISA to handle certain tasks more efficiently can become a significant consideration. Finally, industry-standard ISAs differ in their level of support for multithreading, virtualization, and similar features, which can also affect this choice.

The next step in the multicore selection process is driven by the single-threaded performance requirement of the security system, usually in terms of DMIPS/MHz of the most granular compute element—either a single core or a single thread of a multithreaded core. This determines whether one can live with a 500-MHz non-superscalar in-order simple CPU core, or whether one needs a 2-GHz superscalar complex CPU core with support for out-of-order execution, sophisticated branch prediction, etc.

The memory subsystem architecture is another key factor that determines the choice of the right multicore processor. Most security applications process packets in a stateful manner; that is, for each packet, they need to look up the flow state (stored in memory) and process the packet accordingly. The performance scalability of multicore processors is often impacted most by the memory subsystem architecture, which comprises the L1, L2, and sometimes L3 cache hierarchies and DRAM.
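To make stateful processing concrete, here is a minimal sketch of a flow table keyed by the usual 5-tuple. The class and field names are hypothetical, and the state kept per flow (packet and byte counters) is deliberately trivial; the point is that every packet triggers a memory lookup before any processing can begin.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FlowKey(NamedTuple):
    """5-tuple identifying a flow (hypothetical field names)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass
class FlowState:
    packets: int = 0
    bytes_seen: int = 0


class FlowTable:
    def __init__(self) -> None:
        self._flows: dict[FlowKey, FlowState] = {}

    def process(self, key: FlowKey, length: int) -> FlowState:
        """Look up (or create) the flow's state, then update it for this packet."""
        state = self._flows.setdefault(key, FlowState())
        state.packets += 1
        state.bytes_seen += length
        return state


table = FlowTable()
key = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")
table.process(key, 1500)
print(table.process(key, 64))
```

With millions of concurrent flows the table no longer fits in cache, so these lookups are what push load onto the L2/L3 caches and DRAM.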

An important consideration that can sometimes become an overriding factor for multicore processor SoC selection is the SoC interconnect architecture. Security systems that need to perform multiple networking and security services concurrently at multigigabit throughput need an interconnect architecture that offers deterministic low latencies to shared memories and coprocessors.

Some multicore processors offer a homogeneous CPU multicore strategy wherein all processing gets done by identical CPU cores, without the use of any special-purpose hardware accelerators or coprocessors. However, it is well established that most security applications that require cryptographic processing or content processing (typically rich regular expression matching) benefit from the use of dedicated function-specific coprocessors. Many multicore processor offerings therefore integrate such security coprocessors or have the ability to work with external security coprocessors.

Table 1 summarizes considerations for choosing a multicore processor.

Topic | Questions
Instruction set: x86, MIPS, PowerPC, or other | What application and operating system support (including device drivers) is available for main processor and coprocessors?
CPU core DMIPS/MHz | What single-threaded performance is required by the application?
Pure multicore (homogeneous) or hybrid (heterogeneous) cores | What dedicated cores can your application take advantage of?
Embedded coprocessors or external | What suitable internal coprocessors are available, or will you use an external device?
Internal bottlenecks | What is the internal bus structure and memory bandwidth? Is there anything that limits throughput?
External bottlenecks | HyperTransport (Is it the correct frequency and bandwidth to connect to external devices?), PCIe (How many lanes and at what generation: Gen 1, 2, or 3?), PCI-X (How easy will it be to connect to external coprocessors?)
Need for exotic external RAMs: SRAM, TCAM, or RLDRAM | Do any internal coprocessors need special memories?
Power consumption | Can you reduce the number of cores and power consumption by using external coprocessor(s)?
Ability to handle TCP termination in hardware | Is there any additional hardware support for this?
Embedded 1G or 10G Ethernet | Is there any additional hardware support for this?
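The PCIe question in the table (how many lanes, at what generation) can be answered with a quick calculation. The sketch below uses the published per-lane signaling rates and line encodings for PCIe Gen 1 through Gen 3; the function names and the list of link widths are illustrative choices, and the results are upper bounds per direction because packet and protocol overhead are ignored.

```python
# Per-lane signaling rate (GT/s) and line-encoding efficiency per generation.
# Gen 1 and Gen 2 use 8b/10b encoding; Gen 3 uses 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}
COMMON_WIDTHS = (1, 2, 4, 8, 16)  # commonly implemented link widths


def lane_gbps(generation: int) -> float:
    """Usable bandwidth of one lane, one direction, before protocol overhead."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency


def min_width(target_gbps: float, generation: int):
    """Narrowest common link width that meets the target, or None if none does."""
    for width in COMMON_WIDTHS:
        if width * lane_gbps(generation) >= target_gbps:
            return width
    return None


for gen in PCIE_GENERATIONS:
    print(f"Gen {gen}: {lane_gbps(gen):.2f} Gb/s per lane, "
          f"x{min_width(12.0, gen)} link for 12 Gb/s")
```

In practice the link should be sized with headroom, since the payload rate seen by a coprocessor is lower than these raw figures.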

Choosing the right security coprocessor

The choice of the security coprocessor, whether embedded or standalone, depends on many different factors, which vary from ease of development (compilers, tool chains), to performance under varying conditions (packet size, flow size, etc.), to scalability (number of content rules, etc.), to cost and power requirements. System product families usually span a range of performance, cost, power, and scalability requirements, so attention needs to be given to the scalability of the coprocessor architecture (both hardware and software) across that entire range.
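Packet size matters because the same line rate translates into very different packet rates. The sketch below computes packets per second and the time available per packet on an Ethernet link. It assumes the standard 20 bytes of per-frame overhead on the wire (preamble, start delimiter, and inter-frame gap) and treats each quoted size as the full frame size; the function names are hypothetical.

```python
# Per-frame overhead on the wire: preamble (7) + start delimiter (1)
# + minimum inter-frame gap (12) = 20 bytes.
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate at a given line rate and frame size."""
    return line_rate_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def per_packet_budget_ns(line_rate_bps: float, frame_bytes: int) -> float:
    """Time available to handle each frame if the link is kept full."""
    return 1e9 / packets_per_second(line_rate_bps, frame_bytes)


for size in (64, 512, 1500):
    pps = packets_per_second(10e9, size)
    budget = per_packet_budget_ns(10e9, size)
    print(f"{size:>5} B: {pps / 1e6:6.2f} Mpps, {budget:7.1f} ns per packet")
```

At 10 Gb/s, minimum-size 64-byte frames arrive at roughly 14.88 million per second, so a coprocessor that looks fast with large packets can still be overwhelmed by per-packet overhead on small ones.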

When choosing an embedded coprocessor offering, designers must carefully assess the strength of the embedded offering versus “best of breed” standalone coprocessors. For example, questions need to be asked about what possible bottlenecks might occur between the multicore general-purpose engines and the coprocessor and what per-flow throughput is supported. Some vendors have added tens of small engines, but per-flow processing capability can be severely limited. Other issues include the necessity of using expensive exotic RAM such as SRAM, TCAM, or RLDRAM within the system design, many of which are now available only from single sources. Designers must also consider second sourcing special RAM devices, and try to second guess whether or not they will even be available for future upgrades to systems that increasingly need to last longer in tough economic times.

When selecting a coprocessor, designers must also carefully consider the application of the security system and the signature database that it could be required to work with. Security applications can vary from a simple scan for viruses, to intrusion prevention, to a complete unified threat management (UTM) system. The UTM space is growing rapidly, with IDC predicting a CAGR of 22% from 2007 to 2012 (IDC, Worldwide Network Security 2008–2012 Forecast, October 2008, IDC #214246). IDC defines the UTM appliance as an "all-in-one" device that consolidates multiple security systems into one appliance for complete attack protection against multiple threat mechanisms. The development of UTM is a major firewall inflection trend based on the need for stronger perimeter security solutions.

When considering signature databases, designers need to look beyond the hardware that they've designed and consider that when a new threat comes out, their system has to be rapidly reprogrammed to protect against it. Network security rule writing, or signature writing, is an art form in itself. Complex rules that correctly identify threats without generating false positives are crucial to prevent unnecessary interruptions to the flow of business within a corporation. Signature databases can be designed alongside the hardware or purchased from outside vendors, and it is important to ensure that the coprocessor for the security system is compatible with them and capable of handling the additional tasks.

The choice between an embedded security coprocessor and a standalone coprocessor should be made once the security system designer has identified the right security coprocessor solution that meets the overall requirements of the supported security applications. Most embedded security coprocessors available today in multicore processors fall short when compared with the best-of-breed standalone security coprocessors, although this could change in the future. If the multicore processor does have the right security coprocessor for your security applications, then the next key consideration is performance scalability. Multicore embedded security coprocessors usually top out at 2–3Gb/s throughput, so one still needs the ability to attach a standalone security coprocessor to get more throughput. Ideally, the multicore processor should have the ability to scale out from an embedded coprocessor to an external coprocessor while maintaining a uniform software architecture and software development environment (tool chain, compilers, debuggers, etc.).
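One way to picture a uniform software architecture is a single scanning interface that the application codes against, with embedded and external back ends behind it. The sketch below is purely illustrative: the class names, the throughput ceilings (3 Gb/s and 12 Gb/s, taken loosely from figures quoted in this article), and the use of Python's `re` module in place of real hardware drivers are all assumptions.

```python
import re
from abc import ABC, abstractmethod


class ContentScanner(ABC):
    """Common interface the application codes against (hypothetical)."""
    max_gbps: float = 0.0

    def __init__(self, rules: dict[str, bytes]) -> None:
        self._rules = {name: re.compile(pat) for name, pat in rules.items()}

    def _match(self, payload: bytes) -> list[str]:
        return [name for name, rx in self._rules.items() if rx.search(payload)]

    @abstractmethod
    def scan(self, payload: bytes) -> list[str]:
        """Return the names of all rules that match the payload."""


class EmbeddedScanner(ContentScanner):
    max_gbps = 3.0  # assumed ceiling for an on-chip engine

    def scan(self, payload: bytes) -> list[str]:
        return self._match(payload)  # a real version would call the on-chip driver


class ExternalScanner(ContentScanner):
    max_gbps = 12.0  # assumed ceiling for a standalone device

    def scan(self, payload: bytes) -> list[str]:
        return self._match(payload)  # a real version would go over PCIe or similar


def select_scanner(required_gbps: float, rules: dict[str, bytes]) -> ContentScanner:
    """Pick the smallest back end whose ceiling covers the requirement."""
    for cls in (EmbeddedScanner, ExternalScanner):
        if required_gbps <= cls.max_gbps:
            return cls(rules)
    raise ValueError("no single coprocessor meets the required throughput")


rules = {"demo-rule": rb"cmd\.exe"}
scanner = select_scanner(6.0, rules)
print(type(scanner).__name__, scanner.scan(b"GET /scripts/cmd.exe"))
```

Because the application only ever calls `scan`, moving a product from the embedded engine to an external device changes the back end selected, not the application code.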

Table 2 provides a summary of considerations for choosing a regular expression coprocessor.

Topic | Questions
Embedded or stand-alone | Application and operating system support (including device drivers)
Internal bottlenecks | Internal bus structure, memory bandwidth in coprocessor
External bottlenecks | HyperTransport, PCIe (How many lanes and at what generation?), PCI-X
Need for exotic external RAMs: SRAM, TCAM, or RLDRAM | Can you utilize DDR2/3 or no external RAM?
Per-flow performance | With network speeds increasing to multigigabit ranges, per-flow performance is becoming increasingly important
Power consumption |
Maximum DPI throughput | Measured at 1500-byte packets and other packet sizes that are important to your application (for example, 64 bytes or 512 bytes)
Signature database support | Will you be creating your own DPI scanning rules or licensing from a third party? Are these supported by the coprocessor?
Cross-packet inspection | Is this handled automatically by the hardware?
Number of rules | How many rules are supported? How many groups of rules from different vendors can be supported?
Types of rules | How sophisticated can the rules be? Are wildcards supported or only literal string matches?
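Cross-packet inspection deserves a closer look, because a signature split across two packets is invisible to a scanner that treats each packet in isolation. The sketch below shows one simple software approach for literal signatures: keep the last few bytes of each flow and prepend them to the next payload. The class and method names are hypothetical, packets are assumed to arrive in order, and hardware content processors may handle this quite differently.

```python
class StreamScanner:
    """Literal-signature scanner that matches across packet boundaries."""

    def __init__(self, signatures: list[bytes]) -> None:
        self._signatures = signatures
        # Longest possible partial match left over from earlier packets.
        self._keep = max(len(s) for s in signatures) - 1
        self._tail: dict[str, bytes] = {}

    def scan(self, flow_id: str, payload: bytes) -> list[bytes]:
        tail = self._tail.get(flow_id, b"")
        window = tail + payload
        hits = []
        for sig in self._signatures:
            # Only accept matches that end inside the new payload, so a
            # signature already reported in the tail is not reported twice.
            start = max(0, len(tail) - len(sig) + 1)
            if window.find(sig, start) != -1:
                hits.append(sig)
        self._tail[flow_id] = window[max(0, len(window) - self._keep):]
        return hits


scanner = StreamScanner([b"WORM", b"AB"])
print(scanner.scan("flow-1", b"xxWO"))  # first half of the signature only
print(scanner.scan("flow-1", b"RMyy"))  # second half completes the match
```

The per-flow tail is exactly the kind of flow state that has to be fetched from memory for every packet, which is one reason hardware support for this feature is worth asking about.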

Minimizing latency

One of the larger issues that security system designers must consider is the impact of latency on their designs. It is no longer just about throughput measured in Gb/s. Latencies of over 100 microseconds can cause havoc with voice-over-IP (VoIP) or video calls.

File transfer and e-mail are relatively immune to latency and can survive many seconds, even minutes, of delay (Figure 2). Web browsing is less forgiving: a person will abandon Web sites if pages take too long to load because they are being scanned for viruses by a security system. Instant messaging is more sensitive still, and voice and video are the most sensitive of all, breaking up when packets are delayed for too long. For the designer, the difficulty is not knowing what sort of traffic an edge router, internal security appliance, or switch will have to process. What designers can be sure of is that richer, more latency-sensitive traffic is on the rise, and they must design accordingly.

Shared memory, typically associated with homogeneous multicore systems, is accessed through a bus and controlled by a locking mechanism to avoid simultaneous access of the same memory by multiple cores. The shared memory structure can become a bottleneck when too many cores try to access it simultaneously. As memory utilization increases, memory latency rises steeply and nonlinearly. This bottleneck also means that the memory architecture doesn't scale well with an increasing number of cores. The resulting cache misses cause pipeline stalls and wasted CPU cycles and, ultimately, a sharp drop in processor efficiency.
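A textbook queueing model gives a feel for how quickly latency climbs as a shared resource fills up. The sketch below uses the single-server M/M/1 formula, in which average time in the system equals the service time divided by (1 - utilization). The model, the 50-ns service time, and the function name are assumptions made for illustration; real memory controllers are considerably more complicated.

```python
def mm1_latency(service_time_ns: float, utilization: float) -> float:
    """Average time in an M/M/1 system: service time / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ns / (1.0 - utilization)


SERVICE_TIME_NS = 50.0  # hypothetical unloaded memory access time

for rho in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    latency = mm1_latency(SERVICE_TIME_NS, rho)
    print(f"utilization {rho:4.0%}: {latency:8.1f} ns "
          f"({latency / SERVICE_TIME_NS:5.1f}x unloaded)")
```

In this model latency merely doubles at 50 percent utilization but is ten times the unloaded value at 90 percent and keeps growing without bound as utilization approaches 100 percent.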

The solution is to optimize the system for high utilization. Designers must create systems that can perform under the worst-case conditions. Interestingly, there may not be just one worst-case condition, or "busy hour." In the example of a cell-phone service provider, systems are most challenged during rush hour as millions of individuals in cars begin making calls. As they travel, the call is handed off from tower to tower, causing control-plane overload. While everyone is in their cars, Internet usage is down, so data transmission is at a minimum. A few hours later, the situation is reversed. People arriving at their offices or homes immediately log on to the Internet, causing a spike in data usage and data-plane overload while cell-phone calls requiring a signal hand-off drop dramatically.

It is important, therefore, to design a system for the worst case. If commuters using cell phones begin to experience dropped calls, they are likely going to consider switching to a new service provider. When designing a security system with a multicore processor, the worst-case usage scenario often lies in the memory systems.

Packet handling

One of the biggest problems for general-purpose processors is the handling of network traffic. Network and communication processors, on the other hand, are designed to store, process, and forward large volumes of data packets at wire speed.

General-purpose processors can perform complex calculations and make logical decisions but fall short when relied on to examine and manipulate packet headers. Security-system designers making use of multigigabit firewalls and firedoors should consider adding a network or communications processor to the front ends of their general-purpose processors or look at the new class of heterogeneous processors, which already include packet-handling coprocessors.

It could be argued that designing an efficient multicore + coprocessor security system requires using the best of all worlds. A possible packet-handling design begins with a front-end packet processor, coupled to a general-purpose multicore processor and connected on the back end to a regular expression coprocessor for deep-packet inspection (Figure 3). With this configuration, designers can make use of the general processor to run the main application and supplement it with additional coprocessors to perform traffic management and scanning data for threats.
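The three-stage arrangement described above can be mocked up in a few lines. In this sketch plain functions stand in for the front-end packet processor, the general-purpose cores, and the regular expression coprocessor; the port list, the rule, and every name are hypothetical, and the policy (drop on any match) is deliberately simplistic.

```python
import re
from dataclasses import dataclass


@dataclass
class Packet:
    dst_port: int
    payload: bytes


# Hypothetical policy: only traffic to these ports is sent for deep inspection.
INSPECTED_PORTS = {25, 80}
# Hypothetical rule set standing in for a compiled signature database.
RULES = {"demo-worm": re.compile(rb"WORM[0-9]+")}


def front_end_classify(packet: Packet) -> bool:
    """Front-end packet processor: header-only decision on whether DPI is needed."""
    return packet.dst_port in INSPECTED_PORTS


def regex_coprocessor_scan(payload: bytes) -> list[str]:
    """Back-end coprocessor: names of all rules matching the payload."""
    return [name for name, rx in RULES.items() if rx.search(payload)]


def application_verdict(packet: Packet) -> str:
    """General-purpose cores: apply policy, calling the coprocessor when needed."""
    if not front_end_classify(packet):
        return "forward"
    return "drop" if regex_coprocessor_scan(packet.payload) else "forward"


traffic = [
    Packet(80, b"GET /index.html"),
    Packet(80, b"xxWORM42xx"),
    Packet(5060, b"WORM42"),
]
print([application_verdict(p) for p in traffic])
```

The third packet is forwarded even though its payload matches, because the front end never selected it for inspection. That illustrates how classification policy, not just scanning speed, determines what the system actually catches.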

As embedded devices have more tasks heaped on them, emerging multicore hardware is looking more heterogeneous. Imagine if a large engineering company had a marketing staff made up of identical people who all performed the same function. Such a homogeneous staff wouldn't work. In order to be effective, a marketing team needs to be made up of heterogeneous people such as product-line managers, public-relations specialists, and graphic artists who all do specific functions well. The same is true in systems design. Some vendors have begun adopting this heterogeneous multicore approach, surrounding general-purpose cores with coprocessor cores that perform specific tasks such as packet handling, network termination, TCP offload, cryptography, and regular expression processing and so serve the system more effectively.

But one approach that works well today is supplementing a general-purpose multicore processor with best-of-breed security coprocessors to optimize overall system performance and security capabilities. Careful up-front consideration of issues such as memory latency, packet-handling techniques, choice of application, signature database, and operating system can help prevent a system designer from being left with the “worst of breed” instead of the desired best of breed.

Author Information
John Bromhead is a product marketing engineer, responsible for managing the Tarari Content Processor product line for LSI, one of five main business units within LSI's Networking Portfolio. He has been with LSI for six years, including five years at Tarari, which LSI acquired in October 2007. Bromhead likes wearing Hawaiian shirts and building computers. He built his first one in 1981, a Z80 Nascom.


©1997-2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.