
Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel

March 6, 2008

The former lords of multiprocessing in the supercomputing realm have a term for problems that are easily decomposed for distribution to multiple processors. They call such problems “embarrassingly parallel,” as though you should be embarrassed when you need not break your back to solve a problem. Some well-known problems such as graphics and network packet processing exhibit this type of parallelism. I submit, for your consideration, two refutations of this concept of “embarrassing parallelism.” First, even embarrassingly parallel problems can require some pretty elegant solutions. Second, there’s a lot more parallelism around than is implied in the term. In fact, my colleague and frequent co-author Grant Martin, Tensilica’s Chief Scientist, coined a new term to describe this situation: “conveniently concurrent.”

Conveniently concurrent problems surround us, and even problems formerly considered embarrassingly parallel are conveniently concurrent. Take network packet processing, for example: an endless stream of incoming packets needs processing (the application of a variety of network rules) and routing. At multi-gigabit/sec speeds, this problem becomes pretty big. Nevertheless, it’s a problem that can be elegantly cracked with a multicore approach, like the one Cisco just announced. Cisco’s ASR (Aggregation Services Router) series of edge routers handles packets flowing on multi-gigabit/sec Ethernet networks and applies a range of rules to these packets: rules relating to firewalls, QoS, Web caching, and network flow. The rule sets must be asymmetrically applied; that is, each packet gets the rules that suit its type. For example, a video packet lives under different rules than a VoIP packet.
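To make the idea of per-type rule sets concrete, here is a minimal Python sketch. It is not Cisco's implementation: the `Packet` class, the rule functions, and the `RULES` table are all hypothetical names invented for illustration. It only shows that a video packet and a VoIP packet can pass through different rule chains while sharing common rules such as a firewall check.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    kind: str                      # hypothetical traffic class, e.g. "video" or "voip"
    payload: bytes = b""
    tags: list = field(default_factory=list)  # records which rules were applied


# Hypothetical rules: each one just tags the packet so the chain is visible.
def firewall(pkt: Packet) -> Packet:
    pkt.tags.append("firewall")
    return pkt


def qos_low_latency(pkt: Packet) -> Packet:
    pkt.tags.append("qos:low-latency")
    return pkt


def qos_high_throughput(pkt: Packet) -> Packet:
    pkt.tags.append("qos:high-throughput")
    return pkt


# Assumed mapping from packet type to its rule chain.
RULES = {
    "voip": [firewall, qos_low_latency],
    "video": [firewall, qos_high_throughput],
}
DEFAULT_RULES = [firewall]


def process(pkt: Packet) -> Packet:
    """Apply the rule chain that matches the packet's type, in order."""
    for rule in RULES.get(pkt.kind, DEFAULT_RULES):
        pkt = rule(pkt)
    return pkt
```

Because each packet carries everything the rules need, `process` can run on any available engine without coordinating with the others, which is what makes the problem convenient.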

Cisco engineers recognized that this routing problem was conveniently concurrent and developed a multicore SOC to crack it. The resulting SOC is called the QuantumFlow Processor, and it’s quite a chip. On chip, there are 40 32-bit, 1.2-GHz packet-processing engines. Each engine works on a packet from birth to death within the Aggregation Services Router. This scheme is similar to the one used in Cisco’s CRS-1 carrier-class router, introduced in mid-2004. However, the 92-terabit/sec CRS-1 is based on a 192-processor SPP (silicon packet processor), where each processor (called a packet-processing engine) works on one packet at a time from birth to death. Cisco’s QuantumFlow Processor (code-named Popeye) has only 40 packet-processing engines, but each multithreaded engine handles four threads (each thread handles one packet at a time), so each QuantumFlow Processor chip can work on 40 × 4 = 160 packets concurrently. (See the well-done Flash video about this processor here.) Cisco’s CRS-1 sells for $450,000 and up. The company’s ASR 1000 series sells for a tenth of that price.
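The run-to-completion scheme described above can be mimicked in a few lines of Python. This is only a software analogy under stated assumptions: a thread pool stands in for the hardware engines, and `handle_packet` is a hypothetical placeholder for the real classify, apply-rules, and route work. The only figures taken from the article are the 40 engines and four threads per engine.

```python
from concurrent.futures import ThreadPoolExecutor

ENGINES = 40                # packet-processing engines on the chip (from the article)
THREADS_PER_ENGINE = 4      # hardware threads per engine (from the article)
MAX_IN_FLIGHT = ENGINES * THREADS_PER_ENGINE  # 160 packets handled at once


def handle_packet(packet_id):
    """Hypothetical stand-in: one worker owns the packet from birth to death."""
    # Real hardware would classify, apply rules, and route here.
    return (packet_id, "routed")


def run(packet_ids):
    """Feed packets to a pool sized like the chip; results keep input order."""
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        return list(pool.map(handle_packet, packet_ids))
```

Calling `run(range(1000))` hands 1,000 packets to at most 160 workers. No worker ever passes a packet to another, so there is no inter-worker communication to design or debug.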

 

 

Cisco’s multicore QuantumFlow Processor contains 40 multithreaded packet-processing engines

There’s nothing embarrassing about Cisco’s approach to packet processing as far as I’m concerned.

At this point, you might reasonably object and say, “Hey Steve! You might not be embarrassed about that design, but it is embarrassingly parallel. My problem doesn’t look anything like Cisco’s packet processors.” Fair enough. However, I’ll wager that there’s still more convenient concurrency in your design problem than you will admit.

For example, take a look at the block diagram of a personal video recorder (PVR) that appears below. Within that block diagram are seven gray blocks. Each of those blocks represents an opportunity to exploit convenient concurrency. In times past, we might have saddled one processor with two or more of these tasks. However, as task complexity has grown, each task may now need several hundred MHz worth of processor bandwidth. Multitasking such high-bandwidth tasks on one fast processor carries a severe penalty in terms of power dissipation and energy consumption. With device geometries as small as they are today, power and energy considerations greatly favor the multicore approach.

 


Personal Video Recorder (PVR) block diagram

Need another example? Sure. Take a look at the Super 3G mobile phone handset block diagram below. In this diagram, there are 18 gray blocks, which again shows ample opportunity to exploit convenient concurrency.

 

 

Super 3G mobile phone handset block diagram

 

Intel and AMD are both pushing multicore processors in the PC space these days. They must, because the clock-rate wars have ended due to excessive power dissipation. However, in the PC space, all of the old rules from supercomputing days are seeping in, and people are searching for compilers that will decompose big problems into processor-sized chunks. We are much more fortunate in the embedded world. Concurrency is abundant, and the problem naturally decomposes into processor-sized chunks. How convenient.

(Note: The SPP and QuantumFlow Processor chips in Cisco’s CRS-1 and ASR routers are both based on Tensilica’s Xtensa processor architecture, but that’s not relevant to the basic claim of this blog entry that convenient concurrency is pervasive. Put another way, convenient concurrency is suitable for use with any processor architecture.)

Posted by Steve Leibson on March 6, 2008 | Comments (12)


March 12, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Cox commented:

Hi Steve, Excellent article! I think you've hit the nail on the head about such architectures (heterogeneous multicore) being the obvious wave of the future in embedded SoC design. As you and Tracy mention, the compositional and pipeline parallelism is obvious to hardware designers, but what is less obvious to hardware folks is that you can get very impressive efficiencies in a programmable solution (performance/$/microwatt). It is true that engineers from the SW and HW realms are increasingly driving toward this common SoC architectural view, but for different reasons. The SW folks are moving this way for the power, efficiency, and simplicity reasons that you state so eloquently. The HW folks are moving in this direction as they realize the benefits of a programmable approach to their design cycle (faster to market, less risk), and the fact that, for many (not all) blocks, a finely tuned ASIP can meet their performance, gate count, and power requirements. Keep it comin'!


March 10, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Leibson commented:

Tracy Hall, I don't take your comment as snarky. It simply means I didn't explain myself well enough. In the PVR and 3G phone block diagrams, I wrote that each gray box represents the opportunity for a processor. It's also an opportunity to use a non-programmable hardware block if that makes more sense. However, in my experience, designers steeped in a "computer-science" frame of mind rarely think of using a processor simply as a replacement for a complex state machine. Yet processors indeed make very good state-machine implementations with firmware upgradeability. And you're right that I'm defining my way out of a problem that's been improperly characterized as difficult to parallelize. People with hardware-design backgrounds immediately see the problem as concurrent. People with a CS background immediately think multitasking or multithreading. There lies madness.


March 10, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Leibson commented:

Tom, Cisco's implementation of the Xtensa core supports multithreading. Cisco has an architectural license. There are 192 cores on the SPP chip, four of which are spares and are used to improve manufacturing yields. The software assumes there are 188 operating processor cores on the chip.


March 10, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Tom in Silicon Valley commented:

Two questions, Steve: 1. Which Xtensa processor core supports multithreading? 2. Does the SPP chip in the Cisco CRS-1 router have 188 or 192 Xtensa cores? If you Google "cisco crs-1 router xtensa site:tensilica.com" you will get two different answers from Tensilica's website.


March 10, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Tracy Hall commented:

Forgive what may appear to be a snarky comment, but the "problem" (3G) shown is only a problem if you start with a processor bias - i.e. *assume* a processor to do the work. From a hardware geek's perspective, the modular approach is "intuitively obvious"; an insight that leads closer to your model is to use processor cores for these modules to allow for design/application flexibility. Hardware/Hardwired solutions are inherently "parallel" in that sense; it feels to me like you are simply defining your way out of the "parallel" problem - of course a homogeneous array of cores may be best suited to "trivial" (no denigration intended) problems - but a heterogeneous array is simply solving the problem as given, rather than fitting the problem to a given solution of parallel cores...


March 7, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Leibson commented:

Dave, sorry I misunderstood. I believe the shared on-chip memory provides the sort of global view you're referring to. It directly ties into the QoS function that's part of this router. The ASR 1000 series is designed to be an edge router, so you're right, it does need to know about packet streams and not just about each packet.


March 7, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Dave J commented:

Steve, I was alluding to the fact that for some kinds of packet processing, the packets are related to each other. For example, under TCP the packets can arrive out of order, and if you want to do anything intelligent with their contents, such as screen for viruses, you need to potentially buffer and reorder them. Therefore, one packet one processor is fine, but the next packet that is *of the same flow* of the first one needs to be routed to the same processor as well, or at least a processor that can access the shared state. This itself requires access to all the potential processors, or at least some shared state that maps flows to processors. This is principally a problem for networking equipment on the edge or near-edge of the network, that need to terminate the protocol. Not relevant for the CRS-1, but I don't know anything about the QuantumFlow or its purpose, hence the question.
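The flow-affinity requirement raised in this comment can be sketched in a few lines of Python. This is one common technique (hashing a flow's 5-tuple to choose a processor), not a description of how the QuantumFlow Processor or CRS-1 actually handle flows; the `flow_key` and `engine_for` names are hypothetical, and the engine count of 40 is borrowed from the article only as an example.

```python
import zlib

NUM_ENGINES = 40  # example value; the article's QuantumFlow engine count


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Build a stable byte string that identifies a flow (its 5-tuple)."""
    return f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()


def engine_for(src_ip, src_port, dst_ip, dst_port, proto, num_engines=NUM_ENGINES):
    """Map every packet of a flow to the same engine index via a CRC32 hash."""
    return zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port, proto)) % num_engines
```

Two packets with the same 5-tuple always yield the same index, so per-flow state such as a TCP reorder buffer can live with one engine. The alternative mentioned in the comment, shared state reachable by all processors, avoids the hashing step at the cost of contention on that shared state.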


March 7, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Leibson commented:

Dave, from an IP packet's perspective inside of a Cisco QuantumFlow Processor, there's no evident parallelism. The packet lives in the realm of one processor from the start of processing to the completion. In this sort of problem, one packet --> one processor (or more accurately, one packet --> one processor thread) makes perfect sense because there are so many packets to deal with.


March 7, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Dave J commented:

Let's not let the supercomputer people get us down. They have to achieve massive parallelism, or else there's not much sense in running their code on machines with thousands of processors. Also, I have a feeling that most scientific codes are quite a bit different from the lot running on, say, a multimedia chip. Inverting lots of fantastically large matrices and other "mesh" stuff is harder on the processor(s) than it is on the programmer(s). There is also typically no real-time component to manage whatsoever. On another topic, can the new Cisco parts manage "flow" based protocols like TCP? The inter-packet state associated with managing such protocols seems to be a major fly in the ointment of such parallel architectures. Not insurmountable, but a headache, making the parallelism a bit less convenient.


March 6, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Steve Leibson commented:

Thanks for the great comments, Louis. One thing to note: Cisco's SPP and QuantumFlow Processor use MIMD processor arrays.


March 6, 2008
In response to: Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
Mapou commented:

Nice article. I also like the informal, easy to read style. And congratulations to Tensilica for its technology being used in Cisco's awesome new dataflow multicore processor. Convenient concurrency may be pervasive but if you want to implement a web server with millions of clients, you're out of luck if all you've got is a bunch of SIMD cores to play with. You gotta use MIMD cores for some situations. But the problem with practically all MIMD multicore CPUs, though, is that they use coarse-grain, thread-based parallelism. I understand the rationale behind the latest trend toward heterogeneous multicore processors but wouldn't it be nice if we had a multicore chip that used a universal computing model that combined the qualities of both MIMD and SIMD without the faults? How come nobody in the industry is pursuing the road to universality? Those hybrid monsters are going to be a pain in the ass to write code for, you know. BTW, did you know that it cost Cisco $250 million, 100 engineers and five years to develop their new router? Wow! Not that it matters much. At $35,000 a pop, I'm sure they'll recoup their investment in no time. :-) Louis Savain

© 2012 UBM Electronics. All rights reserved.