IBM Power7 architecture illustrates some issues for the rest of us
IBM's description of the Power7 CPU chip at Hot Chips Wednesday raised three interesting architectural points. None of these is directly important to the commercial microprocessor market, because IBM designed the Power7 chip primarily for use in its own supercomputing server designs. But all three points illustrate architectural issues that will in the future concern the broader computing and embedded communities.
First, some context. Power7 is a single-chip, eight-core cluster of processors. This is quite a leap from previous generations of the Power architecture, in which the largest number of CPU cores on a die was two. The chip is designed to fit into an extended coherent array of processors, scalable up to 32 sockets, or 256 CPUs in a single hardware-coherency fabric. This contrasts with PC or workstation CPUs intended only for much smaller coherency networks. This fundamental difference shows up in all aspects of the chip design.
The first architectural point is the growing importance of memory hierarchy. Many architects now say that managing memory traffic has become a more critical problem in multicore systems than CPU microarchitecture itself. The Power7 certainly reflects this concern, integrating either three or four levels of local memory (depending on how you count them) onto the die.
The problem, according to IBM chief storage hierarchy and SMP architect William Starke, is a dilemma at the systems level. "Our experience with building up to 64-processor computing systems in previous generations has convinced us that symmetric multiprocessing [SMP] is the best way to deliver performance to a computing cloud," Starke said. But he went on to say that in order to keep even latency-tolerant multithreaded CPUs from standing idle, you really need 2 to 4 MBytes of tightly coupled cache on each processor and a shared "memory tank," as he called it, of more than 30 MBytes.
That was not so much a problem when there were two cores on a chip, each with its own local cache. You could construct a fast parallel interface from each chip to a shared pool of external fast memory. But with eight cores on a die, you simply run out of pins to connect all those local caches to the tank.
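A back-of-envelope estimate shows the squeeze; the signaling numbers here are illustrative assumptions, not figures from IBM. Suppose each of eight cores needed a 50-GByte/s path to an off-chip tank, over pins signaling at 5 Gbits/s apiece:

$$\text{data pins} \approx \frac{8 \times 50\ \text{GBytes/s} \times 8\ \text{bits/Byte}}{5\ \text{Gbits/s per pin}} = 640,$$

and that is before counting address, control, clocking, and power pins. On-chip wires face no such budget.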
IBM's solution was to turn to an embedded DRAM process, apparently a variant of the deep-trench DRAM developed with the now-defunct Qimonda. The company has implemented deep-trench embedded DRAM in its 45-nm SOI process, and used it to put a 32-MByte L3 cache down the center of the Power7 die. Thus, each of the private L2 caches directly abuts the L3.
This scheme came in for some fine-tuning. It turns out that IBM's eDRAM design is, to use Starke's word, fluid: the whole 32 MBytes can act as one contiguous memory, but there are regions of the array (fingers reaching out from the central spine) that each CPU can reach at lower latency than the array as a whole. Power7 uses this capability to create, in effect, a local L2.5 cache of up to 4 MBytes for each CPU; eight such regions would account for the entire 32-MByte array. These smaller, lower-latency regions remain part of the shared array, but each CPU can reach its own region at about one-fifth the latency of the full L3. This facility in turn allows the Power7 to get by with a significantly smaller and faster true L2 cache: 256 KBytes with only an eight-clock latency. The primary instruction and data caches in the CPU cores are only 32 KBytes each.
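The resulting hierarchy is easier to grasp in miniature. The sketch below models the arrangement described above; it is not IBM's implementation, and the latency constants are hypothetical, chosen only to preserve the roughly one-fifth ratio Starke described.

```c
/* Toy model of Power7's "fluid" L3: one shared 32-MByte eDRAM array,
   with a 4-MByte region per core that the owning core reaches at
   roughly one-fifth the full-array latency. Hypothetical numbers. */
#include <stdbool.h>
#include <stdint.h>

#define CORES          8
#define L3_BYTES       (32u << 20)          /* 32-MByte shared array  */
#define REGION_BYTES   (L3_BYTES / CORES)   /* 4-MByte "L2.5" regions */

#define L3_FULL_LATENCY   125               /* clocks; illustrative   */
#define L3_LOCAL_LATENCY  (L3_FULL_LATENCY / 5)

/* Does an L3 offset fall inside a given core's low-latency region? */
static bool in_local_region(unsigned core, uint32_t l3_offset)
{
    return l3_offset / REGION_BYTES == core;
}

/* Latency a core sees for a hit at a given offset in the array. */
static unsigned l3_latency(unsigned core, uint32_t l3_offset)
{
    return in_local_region(core, l3_offset) ? L3_LOCAL_LATENCY
                                            : L3_FULL_LATENCY;
}
```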
Keeping the tank on-chip lets the interconnect between the CPUs and their first three levels of cache use wide, fast parallel interfaces rather than much slower off-chip ones. Starke said that a Power7 chip has an aggregate on-chip interconnect bandwidth of over 500 GBytes/s. But even with the on-chip tank there will still be references to main memory, and the eight CPUs still require enormous DRAM bandwidth. Starke said previous experience suggests that for a system's memory to keep pace with its computing speed, each core needs 20 GBytes/s of bandwidth to main memory and about 32 GBytes of DRAM space.
The Power7 addresses this need with a pair of massive DDR3 DRAM controllers: massive in both throughput and complexity. Each controller handles four 6.4-GHz DDR3 channels, so altogether a Power7 chip has a staggering sustained DRAM bandwidth of over 100 GBytes/s. But throughput is far from the only problem with DDR3, as anyone who has looked at the issues is well aware. The DRAM chips require extraordinary care in the sequencing of RAS operations to minimize power, and in the sequencing of page opens and closes to achieve anything close to the theoretical transfer rate. That means that to be effective, a DRAM controller must have at its disposal a rich variety of DRAM requests, and the freedom to reorder them as it sees fit. To this end, each controller has a 16-KByte rescheduling buffer in which to accumulate and reorder memory requests.
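The classic form of such reordering is an open-row-first policy, often called FR-FCFS in the scheduling literature. The sketch below shows the idea in miniature; it is not IBM's actual policy, and the queue depth is a stand-in for the real 16-KByte buffer.

```c
/* Open-row-first (FR-FCFS-style) request pick, sketched in miniature:
   requests that hit an already-open DRAM row are issued ahead of older
   requests that would force a precharge/activate sequence. */
#include <stdint.h>

#define BANKS     8
#define QUEUE_LEN 64                 /* stand-in for the 16-KByte buffer */

typedef struct {
    uint8_t  bank;                   /* assumed < BANKS in this model */
    uint32_t row;
    uint64_t arrival;                /* for oldest-first tie-breaking */
    int      valid;
} Request;

static uint32_t open_row[BANKS];     /* row currently open in each bank */

/* Pick the next request to issue: prefer the oldest row-buffer hit,
   falling back to the oldest request overall (plain FCFS). */
static int pick_next(const Request q[QUEUE_LEN])
{
    int best_hit = -1, best_any = -1;
    for (int i = 0; i < QUEUE_LEN; i++) {
        if (!q[i].valid)
            continue;
        if (best_any < 0 || q[i].arrival < q[best_any].arrival)
            best_any = i;
        if (q[i].row == open_row[q[i].bank] &&
            (best_hit < 0 || q[i].arrival < q[best_hit].arrival))
            best_hit = i;
    }
    return best_hit >= 0 ? best_hit : best_any;   /* -1 if queue empty */
}
```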
The second architectural point is that the superscalar movement is far from over. Early in the quest for superscalar performance there was intense discussion about how much instruction-level parallelism was actually available in real code, and therefore how many independent dispatch paths a CPU really needed. The answer, after considerable discussion, came out at around three. After that conclusion, superscalar architectures stopped adding execution units, except for quite specialized ones, and VLIW machines began to slip out of favor.
But in retrospect, the answer of three may have been a function of the times, of the relatively high cost of transistors, and of the generally horrible code from the Microsoft world that dominated tasks in those days. Clearly the architects of the Power7 cores have come to a different conclusion. Each of the Power7's cores has 12 execution units, including two load/store units, two fixed-point units, four double-precision floating-point units, and a decimal floating-point unit. The core can dispatch six instructions per cycle.
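A toy loop suggests what such a wide core hopes to find in its instruction stream. The four partial sums below carry no dependences on one another, so a core with several floating-point and load/store units can advance all four chains at once; the unrolling factor is illustrative, not tuned for Power7.

```c
#include <stddef.h>

/* Dot product with four independent accumulator chains, exposing
   instruction-level parallelism to a wide superscalar core. */
double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];       /* the four multiply-adds per    */
        s1 += a[i + 1] * b[i + 1];   /* iteration do not depend on    */
        s2 += a[i + 2] * b[i + 2];   /* one another, so they can      */
        s3 += a[i + 3] * b[i + 3];   /* dispatch and execute together */
    }
    for (; i < n; i++)               /* remainder, handled serially   */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```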
The execution pipelines have been shortened and retimed to improve speed. But otherwise, moving rather against the recent trend toward simplification elsewhere, the internals of the core are quite complex. The Power7 implements Power Architecture version 2.06, which expands rather than simplifies the instruction set and unifies the registers into a single set of sixty-four 128-bit registers, into which the integer registers are mapped. The architecture allows for out-of-order execution and employs a distributed recovery function, allowing about a hundred instructions to be in flight between dispatch and recovery at any given time. It is a core designed on the assumption that its compiler will give it a code stream rich in reordering and parallel-execution possibilities. The core also offers hardware support for up to four independent threads, reducing its sensitivity to long main-memory latencies.
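A crude textbook-style model (not IBM data) shows why the multithreading helps: if a single thread would sit stalled on memory a fraction $s$ of the time, and stalls across threads are independent, a core running $T$ threads is idle only when all of them stall at once, for a utilization of roughly $1 - s^T$. With $s = 0.5$, four threads lift utilization from 50 percent to about 94 percent.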
Finally, the third architectural point is the importance of hardware coherency to symmetric multiprocessing. This becomes a crucial issue as users scale up their server clusters from a few sockets to a few dozen, a few hundred, and beyond. The bandwidth demands of coherency algorithms grow more than linearly with the number of CPUs, and as we pass from four-way SMP systems to clouds, controlling this growth becomes a prime architectural problem. Once a cloud outgrows the reach of its hardware coherency support, there is a discontinuity in the cloud that is visible to application software.
Taming the problem requires a combination of elegance and massive bandwidth. Characteristically, the Power architects have supplied both.
On the elegance side, Starke explained that simply scaling up the huge pipes with which Power chips communicate coherence transactions to each other would not work: a 256-core Power7 system would require nearly 2 TBytes/s of coherence-network bandwidth.
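A rough scaling argument (mine, not IBM's arithmetic) shows where numbers like that come from. If each of $N$ chips generates miss traffic $b$ that every other chip must snoop, aggregate coherence bandwidth grows as

$$B_{\text{snoop}} \approx N(N-1)\,b \sim N^2 b,$$

so going from a four-chip group to 32 chips multiplies the broadcast load by roughly $(32 \times 31)/(4 \times 3) \approx 83$.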
Anticipating this, IBM earlier developed a coherence protocol that combines two different mechanisms for broadcasting coherence information: one global, and one speculative and local to a cluster of four chips in the Power7 architecture. In order to use such an approach without simply stopping execution most of the time, the coherence resolution is non-blocking. Starke said that at a given moment, there can be up to about 20,000 coherent store operations in flight within a full-blown 256-core computing system.
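In outline, the two mechanisms compose something like the sketch below. It is a schematic model of the behavior Starke described, not IBM's protocol; the names and data structures are hypothetical.

```c
/* Two-scope coherence resolution, schematically: a store broadcasts
   speculatively within its four-chip group first, and escalates to a
   global broadcast only when the line's scope hint says a copy may
   live outside the group. Hypothetical names throughout. */
#include <stdbool.h>

typedef enum { SCOPE_LOCAL, SCOPE_GLOBAL } Scope;

#define LINES 1024
static Scope scope_hint[LINES];          /* scope hint kept per line   */
static bool  copy_outside_group[LINES];  /* toy stand-in for the state
                                            a real snoop would report  */

/* Local pass: succeeds if no copy of the line exists outside the group. */
static bool snoop_group(unsigned line)  { return !copy_outside_group[line]; }

/* Global pass: reaches every chip, so it always resolves the store. */
static void snoop_system(unsigned line) { copy_outside_group[line] = false; }

static void coherent_store(unsigned line)
{
    /* Speculative local broadcast: cheap, and sufficient whenever the
       line has never been cached outside this four-chip group. */
    if (scope_hint[line] == SCOPE_LOCAL && snoop_group(line))
        return;

    /* Otherwise pay for the global broadcast... */
    snoop_system(line);
    scope_hint[line] = SCOPE_LOCAL;  /* ...after which the line is local */
}
```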
This protocol, in turn, is elegantly folded into the hardware. The cache controllers maintain 13 states, not the usual four or so used in smaller coherent architectures. By cleverly hiding a number of scope-indicating bits in the ECC portion of the cache lines, the hardware designers have minimized the impact of this complexity on the overall cache size.
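One hypothetical way to picture the packing: a 13-state protocol needs only four state bits (16 encodings), leaving room in a 64-bit line-metadata word to tuck a scope hint in alongside the check bits. The field widths below are illustrative guesses, not IBM's layout.

```c
#include <stdint.h>

/* Hypothetical per-line metadata word: spare bits in the ECC region
   carry the scope hint without widening the cache arrays. */
typedef struct {
    uint64_t tag   : 40;   /* address tag                             */
    uint64_t state :  4;   /* 13 coherence states fit in 4 bits       */
    uint64_t ecc   : 18;   /* check bits...                           */
    uint64_t scope :  2;   /* ...with spare bits lent to scope hints  */
} LineMeta;
```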
The massive-bandwidth side of the formula comes in the form of 120-GByte/s coherency links that fuse clusters of four chips into a local group, and reach out to connect groups together, providing for up to 32 chips (eight groups of four) within a single, multi-scope hardware coherency network. These links supply the enormous bandwidth still necessary, after all the elegance, to support the scheme.
The bottom line on all of this intensive design work is a word used often by both Starke and IBM Power7 chief engineer Ron Kalla: balance. By balancing execution speed, memory bandwidth and latency, and coherency overhead, the Power7 architects have attempted to make possible systems that scale up to 256 CPU cores without a sharp drop-off in performance per core. Their success will depend on other factors as well, of course, including the task mix and the quality of the compilers. But as an exercise in CPU chip design, the Power7 suggests what steps hardware architects will have to take. As other kinds of computing problems gravitate toward multicore solutions, embedded-computing architects may find the Power7 an excellent text from which to start their studies.