Multicore Philosophy: Convenient Concurrency versus Embarrassingly Parallel
The former lords of multiprocessing in the supercomputing realm have a term for problems that are easily decomposed for distribution to multiple processors. They call such problems “embarrassingly parallel,” as though you should be embarrassed when you need not break your back to solve a problem. Some well-known problems such as graphics and network packet processing exhibit this type of parallelism. I submit, for your consideration, two refutations of this concept of “embarrassing parallelism.” First, even embarrassingly parallel problems can require some pretty elegant solutions. Second, there’s a lot more parallelism around than is implied in the term. In fact, my colleague and frequent co-author Grant Martin, Tensilica’s Chief Scientist, coined a new term to describe this situation: “conveniently concurrent.”
Conveniently concurrent problems surround us, and even problems formerly considered embarrassingly parallel are conveniently concurrent. Take the problem of network packet processing, for example: an endless stream of incoming packets needs processing (the application of a variety of network rules) and routing. At multi-gigabit/sec speeds, this problem becomes pretty big. Nevertheless, it’s a problem that can be elegantly cracked with a multicore approach, like the one Cisco just announced. Cisco’s ASR (Aggregation Services Router) series of edge routers handles packets flowing on multi-gigabit/sec Ethernet networks and applies a range of rules to these packets: rules relating to firewalls, QoS, Web caching, and network flow. The rule sets must be asymmetrically applied to each network packet. For example, a video packet lives under different rules than a VoIP packet.
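To make "asymmetrically applied" concrete, here is a minimal sketch of per-packet rule selection. Everything in it is an assumption made for illustration: the `Packet` class, the rule functions, and the mapping from packet kind to rule list are hypothetical and say nothing about how Cisco's software is actually organized. The point is only that each packet carries enough information to pick its own rule set, so packets can be handled independently of one another.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    kind: str                                    # hypothetical label, e.g. "video" or "voip"
    payload: bytes
    applied: list = field(default_factory=list)  # names of the rules applied so far


# Hypothetical rule functions. Each one only records that it ran.
def firewall(pkt: Packet) -> Packet:
    pkt.applied.append("firewall")
    return pkt


def qos_low_latency(pkt: Packet) -> Packet:
    pkt.applied.append("qos_low_latency")
    return pkt


def qos_bulk(pkt: Packet) -> Packet:
    pkt.applied.append("qos_bulk")
    return pkt


def flow_accounting(pkt: Packet) -> Packet:
    pkt.applied.append("flow_accounting")
    return pkt


# Assumed mapping: different packet kinds live under different rules.
RULES_BY_KIND = {
    "voip": [firewall, qos_low_latency],
    "video": [firewall, qos_bulk, flow_accounting],
}
DEFAULT_RULES = [firewall]


def process(pkt: Packet) -> Packet:
    """Apply the rule list selected by the packet's kind, in order."""
    for rule in RULES_BY_KIND.get(pkt.kind, DEFAULT_RULES):
        pkt = rule(pkt)
    return pkt


if __name__ == "__main__":
    print(process(Packet("voip", b"")).applied)
    print(process(Packet("video", b"")).applied)
```

Because `process` touches nothing but the packet it is given, any number of packets can be in flight at once without coordination between them.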
Cisco engineers recognized that this routing problem was conveniently concurrent and developed a multicore SOC to crack this particular problem. The resulting SOC is called the QuantumFlow Processor, and it’s quite a chip. On chip, there are 40 32-bit, 1.2-GHz packet-processing engines. Each engine works on a packet from birth to death within the Aggregation Services Router. This scheme is similar to the one used in Cisco’s CRS-1 carrier-class router, introduced in mid-2004. However, the 92-terabit/sec CRS-1 is based on a 192-processor SPP (silicon packet processor), where each processor (called a packet-processing engine) works on one packet at a time from birth to death. Cisco’s QuantumFlow Processor (code-named Popeye) has only 40 packet-processing engines, but each multithreaded engine handles four threads (each thread handles one packet at a time), so each QuantumFlow Processor chip can work on 160 packets concurrently. (See the well-done Flash video about this processor here.) Cisco’s CRS-1 sells for $450,000 and up. The company’s ASR 1000 series sells for a tenth of that price.

Cisco’s multicore QuantumFlow Processor contains 40 multithreaded packet-processing engines
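The arithmetic behind the 160-packet figure, and the birth-to-death processing model, can be sketched in a few lines. This is a software analogy under stated assumptions, not a model of the chip: `handle_packet` is a hypothetical stand-in for the real per-packet work, and a Python thread pool merely plays the role of the hardware thread contexts (Python threads do not run CPU-bound work in true parallel).

```python
from concurrent.futures import ThreadPoolExecutor

ENGINES = 40             # packet-processing engines per chip (from the text)
THREADS_PER_ENGINE = 4   # hardware threads per engine (from the text)
MAX_IN_FLIGHT = ENGINES * THREADS_PER_ENGINE  # 40 * 4 = 160 packets at once


def handle_packet(packet_id: int) -> tuple[int, str]:
    """Hypothetical stand-in: one worker owns one packet from start to finish."""
    return packet_id, "routed"


def run(packet_ids):
    # One pool slot per hardware thread context; each packet is processed
    # to completion by whichever slot picks it up.
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        return list(pool.map(handle_packet, packet_ids))


if __name__ == "__main__":
    results = run(range(1000))
    print(MAX_IN_FLIGHT, len(results))
```

No packet depends on any other, so the only scheduling decision is which free slot takes the next packet. That is what makes the problem convenient.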
There’s nothing embarrassing about Cisco’s approach to packet processing as far as I’m concerned.
At this point, you might reasonably object and say, “Hey Steve! You might not be embarrassed about that design, but it is embarrassingly parallel. My problem doesn’t look anything like Cisco’s packet processors.” Fair enough. However, I’ll wager that there’s still more convenient concurrency in your design problem than you will admit.
For example, take a look at the block diagram of a personal video recorder (PVR) below. Within that block diagram are seven gray blocks. Each of those blocks represents an opportunity to exploit convenient concurrency. In times past, we might have saddled one processor with two or more of these tasks. However, as task complexity has grown, each task may now need several hundred MHz worth of processor bandwidth. Multitasking such high-bandwidth tasks on one processor means running that processor at a much higher clock rate, which carries a severe penalty in terms of power dissipation and energy consumption. With device geometries as small as they are today, power and energy considerations greatly favor the multicore approach.

Personal Video Recorder (PVR) block diagram
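A rough back-of-the-envelope calculation shows why the power argument favors several slower processors over one fast one. It uses the standard first-order CMOS relation that dynamic power scales with capacitance times voltage squared times frequency. The clock rates and supply voltages below are hypothetical numbers chosen only to illustrate the shape of the trade-off, and the sketch ignores leakage and the cost of the extra silicon.

```python
def dynamic_power(capacitance: float, voltage: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * freq_hz


C = 1.0  # normalized switched capacitance per core (assumed equal for all cores)

# Assumption: one core multitasking two 300-MHz tasks must run at 600 MHz,
# and needs a higher supply voltage (1.2 V here) to reach that clock rate.
single_core = dynamic_power(C, 1.2, 600e6)

# Assumption: two cores, one task each, run at 300 MHz on a lower supply (0.9 V).
dual_core = 2 * dynamic_power(C, 0.9, 300e6)

ratio = dual_core / single_core

if __name__ == "__main__":
    print(f"dual-core / single-core dynamic power: {ratio:.2f}")
```

With these assumed numbers, the two-core arrangement dissipates a little over half the dynamic power of the single fast core for the same total work. The exact figure depends entirely on how far the supply voltage can drop at the lower clock rate.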
Need another example? Sure. Take a look at the Super 3G mobile phone handset block diagram below. In this diagram, there are 18 gray blocks, which again shows ample opportunity to exploit convenient concurrency.

Super 3G mobile phone handset block diagram
Intel and AMD are both pushing multicore processors in the PC space these days. They must, because the clock-rate wars have ended due to excessive power dissipation. However, in the PC space, all of the old rules from supercomputing days are seeping in, and people are searching for compilers that will decompose big problems into processor-sized chunks. We are much more fortunate in the embedded world. Abundant concurrency is already there in the block diagram, so the problem naturally decomposes into processor-sized chunks. How convenient.
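Here is one way to picture that natural decomposition in software: give each block in the diagram its own worker and connect neighboring blocks with queues. This is a sketch under assumptions, not a description of any real PVR or handset. The stage names (`demux`, `decode`, `scale`) are hypothetical, and threads stand in for what would be separate processor cores on an SOC.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(work, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one block of the diagram: take an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(stages, items):
    # One queue between each pair of neighboring stages, plus input and output.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stage functions; each just tags the frame it handled.
def demux(frame):
    return frame + ["demux"]


def decode(frame):
    return frame + ["decode"]


def scale(frame):
    return frame + ["scale"]


if __name__ == "__main__":
    print(run_pipeline([demux, decode, scale], [["f0"], ["f1"]]))
```

Nobody had to ask a compiler to find the parallelism here. The stage boundaries come straight from the block diagram, and each stage needs to know only about its own input and output queues.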
(Note: The SPP and QuantumFlow Processor chips in Cisco’s CRS-1 and ASR routers are both based on Tensilica’s Xtensa processor architecture, but that’s not relevant to the basic claim of this blog entry that convenient concurrency is pervasive. Put another way, convenient concurrency is suitable for use with any processor architecture.)