Feature
Multicore: the future of SOCs?
Will systems on chips follow server CPUs down the road to having many identical processor cores on a die?
By Ron Wilson, Executive Editor -- EDN, 10/30/2008
From the early days of SOCs (systems on chips), when the devices were simply single-chip integrations of board-level microcomputers, their architectural evolution has followed a single clear path. Architects added memory. They integrated application accelerators to execute specific, clearly defined tasks with greater speed and less energy. They introduced more complex interconnect structures and DRAM controllers to support data flows among the blocks.
Then, Intel announced a change of direction in the server market. Under pressure from the realities of less-than-100-nm processes, Intel shifted its focus from ever-faster CPUs to multiple simpler CPUs on one die. This new way to use the transistor budget delivered Intel from the futile search for greater instruction-level parallelism and saved it from the growing energy cost of higher clock frequencies. It also fit the needs of the server world, in which job mixes often present a rich pool of independent threads to execute on the multiple cores.
Today, some SOC architects view the multicore movement as irrelevant to the embedded, often hard-real-time world of SOCs. But others predict that as we move from 90 nm to 65 and 45 nm and beyond, SOCs will follow server and PC-processor chips into the multicore world. Instead of today’s typical architecture, in which a single CPU core sits at the center of a complex fabric of buses and controls a heterogeneous collection of specialized engines and peripherals, a multicore SOC might look more like a sea of nearly identical CPU cores. Perhaps there would be a few application-specific processor cores, as well, and certainly, there would be a huge portion of on-chip memory—some local to the processor cores and some shared. But in contrast with today’s SOCs, there likely would not be a single controlling processor. Rather, the system would distribute control among the cores, and this distributed kernel would dynamically map tasks onto cores, according to current needs and power constraints.
It sounds radical. But examples now exist of SOCs moving in this direction. If the vision comes to pass, it will change much of today’s conventional wisdom about SOCs and how they work.
The motivation
Why would architects contemplate such a radical change? There are two primary reasons: The first is power, and the second is scalability.
In principle, the attraction of multicore processing for power reduction is compelling. By splitting a set of tasks among multiple processor cores, you reduce the operating frequency necessary for each core, allowing you to reduce the voltage on each core. Because dynamic power is proportional to the frequency and to the square of the voltage, you get a big gain there, even though you may have more cores running. Even static power improves as you turn down supply voltage.
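The arithmetic behind that claim can be sketched in a few lines, assuming the usual first-order model in which dynamic power equals switched capacitance times voltage squared times frequency. The capacitance, clock rate, and supply voltages below are illustrative assumptions, not figures from any vendor, and the sketch assumes the workload splits perfectly across the cores.

```python
def dynamic_power(cap_farads, volts, freq_hz):
    """First-order dynamic power model: P = C * V^2 * f."""
    return cap_farads * volts ** 2 * freq_hz


def relative_power(n_cores, volts_scaled, volts_base=1.0):
    """Dynamic power of n_cores, each at f/n and volts_scaled,
    relative to one core at f and volts_base.
    Assumes the work divides evenly with no parallel overhead."""
    cap, freq = 1e-9, 1e9  # illustrative switched capacitance and clock
    single = dynamic_power(cap, volts_base, freq)
    multi = n_cores * dynamic_power(cap, volts_scaled, freq / n_cores)
    return multi / single


# Hypothetical case: two cores at half the clock, supply lowered from 1.0 V to 0.8 V
ratio = relative_power(2, 0.8)
print(f"{ratio:.2f}")
```

Under these assumed numbers, the two-core configuration draws about 64% of the single core's dynamic power; with no voltage reduction, the ratio is 1.0, which shows that the gain comes from the voltage term rather than from the frequency split alone.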
Clever software can take even better advantage of the situation. “In the ARM architecture, the wait-for-interrupt instruction gates the CPU clocks, so you pay dynamic power only for cores that are actually doing something,” notes ARM’s product manager, Ian Rickards. “And Linux is capable of recognizing when a core is not being used at all and powering it down altogether. An extension of this idea—which I haven’t seen used yet in multicore configurations—is to use dynamic-voltage-frequency scaling, so that the operating system keeps each core running at the minimum voltage and frequency necessary for its current task.”
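The per-core policy Rickards describes can be sketched as a table lookup: pick the lowest operating point that still meets each core's demand, and power down cores with nothing to do. The operating points and function names here are hypothetical; real tables are chip-specific, and a real governor would also account for transition latency.

```python
# Hypothetical operating points: (frequency in MHz, supply voltage in volts),
# sorted from lowest to highest. These values are assumptions for illustration.
OPERATING_POINTS = [(200, 0.8), (400, 0.9), (600, 1.0), (800, 1.1)]


def pick_operating_point(required_mhz, points=OPERATING_POINTS):
    """Return the lowest (MHz, V) pair that meets the demand,
    or the highest point if the demand exceeds what the core can deliver."""
    for mhz, volts in points:
        if mhz >= required_mhz:
            return mhz, volts
    return points[-1]


def plan_cores(demands_mhz):
    """Map each core's demand to an operating point; None means power the core down."""
    return [pick_operating_point(d) if d > 0 else None for d in demands_mhz]


# Four cores with different loads; the second core is idle.
plan = plan_cores([150, 0, 450, 900])
print(plan)
```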
Rickards points out that there are other power gains from the lower frequency. If the cores require a lower maximum frequency, it may be possible to use, for example, an LP process instead of a G process or to use 10-track libraries instead of high-performance, 12-track libraries. All of these gains add up to big savings.
But architects have to weigh these advantages against the inherent energy inefficiencies of a programmed-instruction machine compared with those of a hardwired engine. “Accelerators will always be better on power consumption, especially on well-defined tasks like security calculations or regular-expression matching, where the Harvard architecture is nonoptimal,” observes Dan Bouvier, chief technology officer of the microprocessor division of AMCC (Applied Micro Circuits Corp). The simple fact that a programmed architecture has to fetch and decode instructions—whereas a state machine does not—creates a difference in energy consumption. And general-purpose processors often must use many instructions to accomplish what dedicated hardware can do in a single cycle. Skilled power management and custom instruction hardware can minimize these differences but can’t entirely erase them, especially if you compare a power-optimized CPU core with a dedicated engine that is just as optimized.
As the number of processor cores in the SOC increases, the second factor, scalability, becomes important, as well. “It’s much easier to scale with a uniform architecture,” Bouvier says. “If you use specialized accelerators, the non-uniform-programming model creates havoc.”
Vamsi Boppana, senior director of technology at Open-Silicon, makes the same point from a hardware architect’s point of view. “I’m a huge believer that we are at an inflection point,” Boppana says. “We are seeing designs go through now with several hundred processors on a die. It won’t be long until there are thousands. With that many processors, there is simply no way you could manage the complexity of a heterogeneous architecture with static task assignments. These architectures must use SMP [symmetric multiprocessing].”
Creating the architecture
In the view of these experts, then, the future of SOCs lies in more uniform arrays of general-purpose processors. But how do we get from today’s heterogeneous, statically mapped, CPU-centric designs to those SMP designs? The answer is to go back to the application.
“You start out by looking at the nature of the parallelism the application presents, and the nature of the problem you are trying to solve,” states Tensilica’s Chief Technology Officer Chris Rowen. “At the first order, there is usually a fair amount of parallelism between the different functional subsystems in an SOC. For instance, the layers of the protocol stack and the video processing in an Internet Protocol set-top box can run pretty independently. You see a lot of that kind of potential parallelism in the data plane. It’s a historical artifact from the days when the functions were separate chips.
“That kind of structure gives you a start on spreading an application across multiple processor cores. The hard part is teasing apart the single big, unpipelined tasks that are left after you have done the easy part.”
Rowen says that, at the next level of complexity, creating the infrastructure in which the processor cores operate is equally important. As you divide tasks into separate threads, he says, you will see two categories of the connections between the tasks: data-streaming connections, in which data flows in an ordered way from one task into the other, and complex, shared-data connections, in which the exchange of data is unordered and possibly unpredictable. For these two situations, you need two kinds of connections between the processors: queued interconnect for the data streams and combinations of shared-memory, interprocessor-communications mechanisms and programming models for the more complex situations.
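As a software analogy for the two kinds of connections Rowen distinguishes, the sketch below uses Python threads in place of processor cores. A bounded FIFO stands in for queued interconnect, and a lock-protected dictionary stands in for shared memory with an interprocessor synchronization mechanism. All names are hypothetical, and this illustrates only the programming models, not any hardware design.

```python
import queue
import threading

# Streaming connection: ordered data flows from one task into the next.
stream = queue.Queue(maxsize=4)  # bounded FIFO stands in for a hardware queue
DONE = object()                  # sentinel marking the end of the stream

# Shared-data connection: unordered updates, guarded by a lock.
shared_counts = {}
shared_lock = threading.Lock()


def producer(n_items):
    for i in range(n_items):
        stream.put(i)            # blocks when the queue is full
    stream.put(DONE)


def consumer(results):
    while True:
        item = stream.get()      # items arrive in the order they were sent
        if item is DONE:
            break
        results.append(item * 2)


def updater(key, times):
    for _ in range(times):
        with shared_lock:        # serialize the read-modify-write
            shared_counts[key] = shared_counts.get(key, 0) + 1


results = []
threads = [
    threading.Thread(target=producer, args=(8,)),
    threading.Thread(target=consumer, args=(results,)),
    threading.Thread(target=updater, args=("a", 100)),
    threading.Thread(target=updater, args=("a", 100)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results, shared_counts)
```

The streaming pair needs no explicit locking because the queue imposes the ordering; the two updaters touch the same data in an unpredictable order, so correctness depends on the lock.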
The interconnect
As the processing elements in the SOC become more nearly identical, the question of interconnect and memory architectures becomes more pressing. There is a natural tension here between wanting to keep the array of processors completely symmetric so that the operating-system kernel can move tasks around at will and needing to have specific kinds of interconnections for specific kinds of data sharing.
The simplest approach to this problem is to emulate the server world: Give all the processors local caches, often private at L1 and L2, sitting above a huge shared memory (Figure 1). In this way, the programming model is that all the processors always have access to all the memory.
Most architects are adamant that such an approach must have full hardware-supported coherency. “The minimum hardware configuration would include cache coherency across all the processors,” says Kerry Johnson, director of product management at QNX Software Systems. Although it is possible to run a system without hardware coherency, most feel that doing so is simply too complex a problem and shifts too much overhead to the software.
But shared memory may not be the only logical connection between the processors. AMCC’s Bouvier suggests that the interconnect may also want to support some form of hardware-based message-passing or other direct interprocessor-communications link, as well as hardware for I/O virtualization, so that when the system assigns a task to a new processor, the processor doesn’t lose its connection to its I/O streams. The ability to move operating-system tasks, rather than keep them static, also means that you need to either centralize or uniformly distribute interrupt-control and DMA (direct-memory-access) hardware to maintain symmetry.
So, how does the hardware team implement all these facilities without creating too much overhead? That question is nontrivial. Even just connecting the local caches to the shared memory has proved challenging, driving designers from traditional shared buses to switch-based interconnect to complex multilevel interconnect. Adding the need for direct messaging between processors just makes the problem more interesting.
“One of the key questions is how you feed all the engines while staying within your power envelope,” says Dac Pham, director of platforms at Freescale Semiconductor. “You have to look at both the bandwidth and the latency needs of individual data flows, or you can end up starving your CPUs.”
“The solution is not just one approach or the other,” adds Freescale Senior Systems Architect Steve Cole. “For instance, in a recent design, we have employed three levels of cache, a switch-based interconnect, and a configurable hardware-coherency system that allows the designer to extend coherency over some memory structures but not others.” “It’s important to let the designer avoid the power and latency hits where he doesn’t need coherency, such as in simple flow-through data movement in the data plane,” explains Pham.
Unsolved problems
Clearly, there is a road map from today’s SOCs to symmetric-multicore SOCs. But there are also unsolved problems along the way. Chief among these issues are finding enough parallelism in the application to use all those processors, dealing with hard-real-time constraints, finding a distributed operating system to work in this environment, and debugging the resulting system.
Creating parallelism is a necessary first step in exploiting multicore designs. If the parallelism comes from innate data parallelism—for instance, the nearly independent processing of macroblocks in video compression—then the problem may be relatively easy. If finding parallelism requires finding independent program threads in existing algorithms, however, the problem has no known analytical solution; it takes hard work, genius, and luck in about equal measures.
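The easy case, innate data parallelism, can be sketched as follows: a frame is cut into 16×16 macroblocks, and each block is handed to a pool of workers with no dependencies between blocks. The per-block function here (a mean pixel value) is a stand-in, not a real compression step, and the function names are hypothetical. In CPython, threads will not speed up CPU-bound work like this, so the sketch shows the decomposition rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # macroblock edge in pixels


def split_into_macroblocks(frame, block=BLOCK):
    """Yield (row, col, tile) for each block x block tile of a 2-D list.
    Assumes the frame dimensions are multiples of block."""
    height, width = len(frame), len(frame[0])
    for r in range(0, height, block):
        for c in range(0, width, block):
            tile = [row[c:c + block] for row in frame[r:r + block]]
            yield r, c, tile


def process_macroblock(args):
    """Stand-in for per-block work: compute the mean pixel value."""
    r, c, tile = args
    total = sum(sum(row) for row in tile)
    return (r, c), total / (len(tile) * len(tile[0]))


def process_frame(frame, workers=4):
    """Process every macroblock independently on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_macroblock, split_into_macroblocks(frame)))


# Synthetic 32x32 frame, which yields four macroblocks
frame = [[(x + y) % 256 for x in range(32)] for y in range(32)]
means = process_frame(frame)
print(means)
```

Because no block reads another block's result, the mapping of blocks to workers is free to change from run to run, which is the property that makes this kind of workload a natural fit for a symmetric array of cores.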
“The embedded world will develop multithreaded applications,” insists Cole. “But, more often than not, the code will be developed from scratch or acquired from a start-up, not developed by reworking existing single-threaded code.”
Real-time constraints are another sticking point. Cache-coherent symmetric-multiprocessing systems are inherently nondeterministic because you cannot predict the latency on a load operation without knowing the exact origin of the data. In a system with dynamic task allocation, the response to an interrupt can vary wildly depending on the task mapping and the state of the processors (and their power management) at the instant the interrupt is decoded. The Microsoft approach of declaring a nondeterministic system to be real-time if it runs fast enough may prove inadequate in the SOC world.
These facts lead some architects to believe that there will always be a firm partition between the symmetric and the real-time portions of the hardware, with real-time tasks statically mapped onto specialized processors outside the symmetric-multicore array. “I think you will see processors cluster together,” Bouvier says, “with some processors grouping as an SMP subsystem and others standing alone and running their own instances of a real-time operating system.”
Just where the operating system goes is another work in progress. Many architects believe that bringing SMP into the SOC world will eventually require a genuine distributed operating system, in which there is no one master instance of the kernel running on a particular processor but rather a microkernel on each processor with the system dynamically allocating all the other operating-system threads to processors. Embedded-Linux designers are working to accomplish this goal, but no one yet points to a finished product.
Finally, there is the huge matter of debugging. “Distributed applications will require a totally different way of debugging,” says Open-Silicon’s Boppana. “They need a consistent way to access and probe an application and a reliable way of replaying how things actually happened leading up to an event. And people are asking for more kinds of information, as well—activity monitors and thermal monitors, for instance. There is a lot of research going on in this area now.”
Bouvier agrees. “As we move from one big fat thread to many finer threads, the number of interprocessor dependencies grows. This situation creates a real debugging challenge: How do you run, stop, or trace a multiprocessor system with these dependencies? If you can capture this much data, how do you get it out of the chip?” he asks.
Despite these issues, some SOCs today already employ regions of symmetric-multicore processing. In some applications with a high degree of parallelism, such as network processing, the trend has already advanced to large numbers of processors and threads. And it seems clear that this evolution will continue and spread. As it does, much of the complexity that today resides in the hardware architecture and implementation will move into the software, changing the makeup of an SOC design team.
For more information
AMCC: www.amcc.com
ARM: www.arm.com
Freescale Semiconductor: www.freescale.com
Microsoft: www.microsoft.com
Open-Silicon: www.open-silicon.com
QNX Software Systems: www.qnx.com
Tensilica: www.tensilica.com
Author Information
You can reach Executive Editor Ron Wilson at 1-408-345-4427 and ronald.wilson@reedbusiness.com.
















