AMD's Bulldozer And Bobcat: x86 Architectural Innovation Is (Finally!) Once Again Where It's At
AMD rocked the x86 microprocessor world when, in 1999, it unveiled the Athlon microprocessor built on the company’s then-latest K7 microarchitecture (whose design team, as a quick trivia aside, was led by now-CEO Dirk Meyer). K7 delivered performance-versus-power-consumption and performance-versus-die-size punches that knocked Intel back on its heels. Athlon put the not-yet-even-released NetBurst microarchitecture at a clear competitive power consumption disadvantage and forced an Intel product line revamp that didn’t regain its stride until the P6 microarchitecture-derived Pentium M CPUs beginning in 2003, followed by P6-evolved Core microarchitecture-based products three years later.
Since then, however (and if I may be blunt), microarchitecture development has been in ‘coasting mode’ at AMD. The company’s K8-based mobile and desktop computing CPUs clearly trace their lineage back to the K7 (albeit with 64-bit support and other enhancements). Similarly, the K10 microarchitecture used in Phenom-series and Athlon II mobile and desktop processors, along with their server counterparts, is a fairly minor evolution of the K7-then-K8 foundation.
With yesterday afternoon’s announcements at the Hot Chips conference, AMD thankfully got out of the saddle, stopped coasting and started sprinting again. The company actually has three product line vectors en route. First is Llano, the least technically exciting of the three, intended for mainstream desktop and mainstream-and-high-end mobile systems. It’s a Phenom II (i.e. K10-based) derivative, presumably touting among other things a lithography shrink below today’s 45 nm node. It’ll also embed on-chip graphics capabilities per the long-promoted and long-delayed ‘Fusion’ plan, although in such a unified configuration AMD prefers to refer not to a GPU (graphics processing unit) but to an APU (accelerated processing unit), reflective of broader GPGPU functional aspirations.
For entry-level ‘netbook’ and ‘nettop’ designs, AMD finally plans a credible competitor to Intel’s Atom line with the ‘Bobcat’ microarchitecture:
Because Bobcat is a synthesizable core, it can theoretically be ported to a diversity of processes and fabrication sources, as well as potentially licensed to partners, as has long been the foundation of ARM’s business model and as Intel announced a while ago it was planning to do with Atom. Bobcat can be summarized as Atom plus out-of-order execution of both instructions and load/store operations (albeit minus Atom’s Hyper-Threading dual-virtual-thread-per-single-physical-core capability), making it conceptually analogous to VIA’s Nano processor series. Note that it also adds two key features, 64-bit instructions and hardware virtualization ‘hooks’, which were absent from Intel’s initial Silverthorne and Diamondville Atom CPUs but added to the latest-generation Pineview products.
The Bobcat core contains 32 Kbyte L1 and 512 Kbyte L2 caches. It will form the foundation of the first Fusion CPU-plus-GPU single-die combo from AMD, code-named ‘Ontario’ and due to appear some time in 2011. Core-count plans for the products are unknown at this point; such information is particularly important since, earlier this week, Intel expanded its Pineview family to encompass dual-core mobile Atom CPUs (at its January introduction, Pineview included single- and dual-core ‘nettop’ processors but only single-core ‘netbook’ variants). AMD bullishly claims that Bobcat is capable of sub-one-watt power consumption, thanks in part to aggressive and fine-grained clock gating techniques, and that it will deliver an “estimated 90% of the performance of today’s mainstream notebook CPU [editor note: specifically, an AMD mainstream notebook CPU] in half the area.” Here’s a conceptual diagram of a potential single-core die layout:
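A quick back-of-envelope check shows what AMD’s “90% of the performance in half the area” claim implies for area efficiency, taking both figures at face value:

```python
# Back-of-envelope check of AMD's Bobcat claim: ~90% of a mainstream
# notebook core's performance in ~50% of the die area.
perf_ratio = 0.90   # Bobcat performance relative to the mainstream core
area_ratio = 0.50   # Bobcat die area relative to the mainstream core

perf_per_area = perf_ratio / area_ratio
print(f"Performance per unit area vs. mainstream core: {perf_per_area:.1f}x")
# -> 1.8x: if the claim holds, Bobcat delivers 1.8 times the performance
#    per square millimeter of its mainstream sibling.
```

In other words, even while giving up 10% of absolute performance, the design would nearly double performance per unit of silicon, which is the metric that matters for a low-cost netbook part.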
Turning attention to server and high-end desktop designs (inevitably to trickle down into the mainstream over time), AMD also unveiled ‘Bulldozer’ yesterday:
Bulldozer is an intriguing ‘middle path’ between the two historical approaches to supporting multi-threaded software:
- Intra-core SMT (simultaneous multithreading), exemplified by Intel’s Hyper-Threading technology, and
- Inter-core (specifically multi-core) CMP (chip-level multiprocessing), which AMD and Intel have both heavily harnessed in recent years
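The distinction between the two approaches, and Bulldozer’s hybrid between them, can be sketched schematically. The following toy model (resource names are illustrative, not an exhaustive pipeline inventory) simply tallies how many major hardware blocks serve two threads under each scheme:

```python
# Schematic (not cycle-accurate) comparison of how two hardware threads
# map onto execution resources under the three approaches discussed above.
APPROACHES = {
    # SMT (e.g. Hyper-Threading): two threads time-share one core's resources.
    "SMT": {"fetch/decode": 1, "integer cores": 1, "FPU": 1, "L2 cache": 1},
    # CMP (conventional dual-core): every resource is duplicated per thread.
    "CMP": {"fetch/decode": 2, "integer cores": 2, "FPU": 2, "L2 cache": 2},
    # Bulldozer module: integer cores are duplicated; the front end,
    # floating point unit, and L2 cache are shared between them.
    "Bulldozer module": {"fetch/decode": 1, "integer cores": 2,
                         "FPU": 1, "L2 cache": 1},
}

for name, resources in APPROACHES.items():
    total = sum(resources.values())
    print(f"{name}: {total} major resource blocks serving 2 threads")
```

The tally makes the trade-off visible: Bulldozer spends silicon only where integer threads contend hardest (the integer cores themselves), while amortizing the bursty or wide resources across both threads.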
As with Bobcat, AMD strives with Bulldozer to create a core that’s “designed for knee-of-the-curve IPC (instructions per clock)”. Said another way (quoting the Bobcat foilset), AMD’s architects were chartered with “finding the knee of the curve (scrutinizing performance gains against power costs)”. Each Bulldozer ‘module’ contains two independent integer cores (leading to AMD’s likely marketing-influenced plan to refer to single-module CPUs as ‘dual core’), each with a dedicated (and relatively small) 16 Kbyte L1 data cache. But much of the remaining intra-module circuitry is singular in count and therefore shared between the integer cores:
- Instruction fetch and decode logic, along with 64 Kbytes of (again, relatively small) two-way instruction cache
- A common floating point unit, containing a scheduler, dual 128-bit floating point MAC pipelines and dual 128-bit packed integer pipelines, and
- A 16-way unified L2 cache of unreported size
At the chip level, multiple processing modules share a common L3 cache (again, of not-yet-public capacity), along with northbridge logic including an integrated system memory controller. When do shared module resources make sense? The company’s philosophy is summarized in three bullet points:
- When usage is naturally bursty for a single thread
- When there’s little impact on timing and complexity of critical paths, and
- When there’s benefit from increasing amortized bandwidth
Returning to my earlier comments regarding the diminutive L1 caches: AMD believes that optimized out-of-order execution can effectively hide the added latency of accesses to the larger L2 cache. The company is also counting on aggressive data prefetch techniques to mask the even lengthier system memory latencies. And AMD finally added to Bulldozer a competitive response to Intel’s ‘Turbo Boost’ feature, which automatically increases the internal clock speed when overall chip resources are under-utilized, maximizing the performance of the resources in use while not exceeding chip-level thermal dissipation and power consumption thresholds:
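The governing idea behind such boost features can be sketched in a few lines. The model below is a deliberately simplified toy (all wattages, clocks, and the linear power-vs-clock assumption are invented for illustration; real silicon scales super-linearly and uses discrete frequency bins):

```python
# Toy model of a Turbo Boost-style governor: raise the clock of active
# cores when idle cores leave power headroom, never exceeding either the
# chip-level power budget or a maximum boost bin. Numbers are illustrative.
CHIP_POWER_BUDGET_W = 95.0
BASE_CLOCK_GHZ = 3.0
MAX_BOOST_GHZ = 3.6                       # hypothetical top boost bin
POWER_PER_CORE_AT_BASE_W = 95.0 / 4      # four-core chip, budget split evenly

def boosted_clock(active_cores: int) -> float:
    """Scale active cores' clock so total power stays within the budget.

    Assumes (unrealistically) that power scales linearly with clock."""
    if active_cores == 0:
        return 0.0
    budget_per_active_core = CHIP_POWER_BUDGET_W / active_cores
    scale = budget_per_active_core / POWER_PER_CORE_AT_BASE_W
    return min(MAX_BOOST_GHZ, BASE_CLOCK_GHZ * scale)

for n in (4, 2, 1):
    print(f"{n} active core(s): {boosted_clock(n):.2f} GHz")
# With all four cores busy there is no headroom (3.00 GHz); with idle
# cores, the governor spends their unused budget up to the boost ceiling.
```

The key design point is that the ceiling is set by chip-level thermals and power delivery, not per-core limits, which is why single-threaded workloads benefit most.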
In its Hot Chips presentation yesterday, the company claimed that, in comparison to a true dual-core module design, its hybrid Bulldozer module delivers an “estimated average of 80% of the CMP performance with much less area and power”. Earlier, in an embargoed briefing foilset shared with members of the technical press such as myself, AMD touted an even more robust prediction: that a Bulldozer-based CPU “delivers 33% more cores and an estimated 50% increase in throughput in the same power envelope as Magny-Cours” (the company’s current leading-edge server CPU). AMD’s focus on silicon area-optimized performance makes sense, given that Bulldozer will mark the company’s belated first use of the 32 nm silicon-on-insulator process at foundry partner and former internal fabrication resource GlobalFoundries, whereas Intel’s been running high-volume 32 nm production for around a year already.
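It’s worth unpacking what the Magny-Cours comparison implies per core. Taking the 12-core flagship Opteron as the baseline and normalizing its throughput and power to 1.0 (the normalization, like the arithmetic, is mine, not AMD’s):

```python
# What "33% more cores, 50% more throughput, same power envelope" implies,
# using 12-core Magny-Cours as the baseline (throughput and power
# normalized to 1.0).
baseline_cores = 12
bulldozer_cores = round(baseline_cores * 1.33)     # 33% more -> 16 cores

throughput_gain = 1.50
per_core_throughput = throughput_gain / (bulldozer_cores / baseline_cores)
per_core_power = baseline_cores / bulldozer_cores  # same envelope, more cores

print(f"Cores: {bulldozer_cores}")
print(f"Per-core throughput vs. Magny-Cours: {per_core_throughput:.3f}x")
print(f"Per-core power budget vs. Magny-Cours: {per_core_power:.2f}x")
```

If the claim holds, each Bulldozer core would deliver roughly 12.5% more throughput than a Magny-Cours core while living on only three-quarters of the power budget, which is exactly the kind of efficiency jump the shared-module design and the 32 nm shrink are together supposed to buy.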
How much reality is behind AMD’s Bulldozer forecast? We’ll have to wait a while to see, I suspect; all that the company’s currently willing to share is that “Bulldozer will be utilized in client and server designs in 2011”. Does that mean January 1? December 31? AMD had better hope that it’s closer to the former than to the latter, because Intel’s not standing still. By the time Bulldozer appears, Intel’s Sandy Bridge microarchitecture will likely already be available in product form.