Design Feature: September 12, 1996
64-bit chips
Alpha has a 64-bit load/store architecture with 32 integer and 32 floating-point registers. The Alpha chips consist of the first-generation 21064A, the second-generation, performance-focused 21164, and the integrated 21066.
The 9.5 million-transistor 21164 has a seven-stage integer pipeline that contains two integer units. It also includes a nine-stage floating-point unit that can simultaneously issue an add and a multiply in one cycle. A unique feature of the 21164 is the 96-kbyte, on-chip, L2 cache. To help increase loading efficiency, the 21164 contains merge logic that looks ahead to see if more than one reference is being made to the same cache block. The merge logic can allow up to 20 load instructions to be in operation at the same time.
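The effect of the merge logic can be illustrated with a short sketch. It is a simplified model, not the 21164's actual circuitry: the 32-byte block size and the names `merge_loads`, `BLOCK_SIZE`, and `MAX_IN_FLIGHT` are assumptions chosen for illustration. Loads that fall in the same cache block are grouped so that the block needs to be fetched only once.

```python
from collections import OrderedDict

BLOCK_SIZE = 32      # bytes per cache block (assumed for illustration)
MAX_IN_FLIGHT = 20   # loads that may be outstanding, per the text above


def merge_loads(addresses, block_size=BLOCK_SIZE, max_in_flight=MAX_IN_FLIGHT):
    """Group pending load addresses by the cache block they reference.

    Returns an OrderedDict mapping block number -> list of addresses, so
    each block needs one fill however many loads reference it. Loads
    beyond max_in_flight are left out; in hardware they would wait.
    """
    merged = OrderedDict()
    for addr in addresses[:max_in_flight]:
        block = addr // block_size
        merged.setdefault(block, []).append(addr)
    return merged


pending = [0x1000, 0x1008, 0x1010, 0x2000, 0x1018, 0x2004]
merged = merge_loads(pending)
print(len(pending), "loads need", len(merged), "block fills")
```

In this example, six loads touch only two blocks, so two fills satisfy all of them.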
The 21064 has a seven-stage integer pipeline and a 10-stage floating-point pipeline and can issue up to two instructions per cycle. It implements branch-prediction hardware based on a single branch-history bit (2 bits in the 21064A) stored with each instruction in the instruction cache. The first four pipeline stages are static; the last three stages are dynamic and unaffected by any stall in the pipeline. To buffer slower external memory from the high-speed processor, the chip has two direct-mapped caches--one each for instruction and data. Writes are also buffered, holding up to four 32-byte blocks for pending writes. The 21064 also has provisions to accommodate up to 16 Mbytes of secondary cache.
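The difference between one and two history bits shows up on a loop-closing branch. The sketch below uses the generic textbook schemes (a last-outcome bit and a 2-bit saturating counter); it is an assumption that these approximate the 21064 and 21064A hardware, and the function names are hypothetical.

```python
def mispredicts_1bit(outcomes, last_taken=False):
    """Predict that each branch repeats its previous outcome."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken
    return misses


def mispredicts_2bit(outcomes, counter=0):
    """2-bit saturating counter: 0 or 1 predicts not taken, 2 or 3 taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop branch taken four times, then falling through; the loop runs 10 times.
pattern = ([True] * 4 + [False]) * 10
print("1-bit misses:", mispredicts_1bit(pattern))
print("2-bit misses:", mispredicts_2bit(pattern))
```

With one bit, each loop exit causes two mispredictions (the exit itself and the first iteration of the next pass). With two bits, a single exit does not flip the prediction, so only the exit is mispredicted once the counter has warmed up.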
The 21066 integrates on-chip, fully pipelined integer and floating-point processors. An instruction fetch-and-decode unit prefetches two instructions in parallel and determines if the resources are available to execute one or both of the instructions. The CPU also contains a memory controller, graphics accelerator, and PCI I/O controller.
Power management: Alpha chips support thermal and idle-based power management through a programmable internal clock divider. Clock frequencies can be reduced up to eight times.
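The arithmetic of a clock divider is simple enough to sketch. The example assumes integer divisors from 1 through 8 and a hypothetical 300-MHz base clock; neither the set of legal divisors nor the function name `divided_clock_mhz` comes from the vendor's documentation.

```python
def divided_clock_mhz(base_mhz, divisor):
    """Return the internal clock after dividing; divisors 1 through 8 only."""
    if not 1 <= divisor <= 8:
        raise ValueError("divisor must be between 1 and 8")
    return base_mhz / divisor


# Hypothetical 300-MHz part stepped through a few divider settings.
for divisor in (1, 2, 4, 8):
    print(divisor, divided_clock_mhz(300.0, divisor), "MHz")
```

Because dynamic power in CMOS scales roughly linearly with clock frequency, dividing the clock by eight cuts switching power by about the same factor.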
Special instructions: All instructions are 32 bits in length and consist of branch, load/store, integer and floating-point operations, and CALL_PAL (Privileged Architecture Library) types. CALL_PAL instructions vector to a software library that automatically performs both privileged and unprivileged functions, such as handling interrupts and exceptions and maintaining TLBs. This instruction class allows Alpha to accommodate VAX-specific hardware characteristics.
Second sources: Mitsubishi second sources the 21066A. Samsung is also licensed to build and market Alpha µPs.
The MIPS R5000 is the first low-end implementation of the MIPS IV instruction architecture. The R5000 implements a five-stage pipeline similar to that of the R4600/R4700. The pipeline includes: instruction fetch, decode, execute, data cache read, and cache write. In addition, the R5000 provides a dual-issue mechanism to allow a floating-point instruction to be issued simultaneously with any other instruction type.
A dual 48-entry, virtually indexed TLB also contributes to performance by allowing back-to-back TLB accesses. The TLB implementation is compatible with R4xxx implementations to ensure full compatibility with system and user software. The R5000 includes a fast multiplier for floating point, as well as a moderately fast and separate integer multiplier. The CPU also supports an interface to a synchronous L2 cache.
Special instructions: The device is fully compatible with MIPS I, MIPS III and MIPS IV. MIPS IV includes support for four multiply-accumulate/subtract floating-point instructions, useful in graphics and signal processing. It also includes new addressing modes for floating-point operations required by compilers optimized for higher-performance floating-point throughput.
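The four multiply-accumulate/subtract operations can be modeled in a few lines. The operand names (`fr`, `fs`, `ft`) and the function names follow the usual MIPS IV mnemonics as an assumption; rounding behavior and the hardware's internal precision are ignored in this sketch.

```python
def madd(fr, fs, ft):
    """Multiply-add: fs * ft + fr."""
    return fs * ft + fr


def msub(fr, fs, ft):
    """Multiply-subtract: fs * ft - fr."""
    return fs * ft - fr


def nmadd(fr, fs, ft):
    """Negated multiply-add: -(fs * ft + fr)."""
    return -(fs * ft + fr)


def nmsub(fr, fs, ft):
    """Negated multiply-subtract: -(fs * ft - fr)."""
    return -(fs * ft - fr)


def dot(xs, ys):
    """Dot product built from one multiply-add per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = madd(acc, x, y)
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The dot product is the inner loop of many graphics transforms and FIR filters, which is why a single multiply-accumulate instruction per term helps in those applications.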
Second sources: MIPS Technologies is the primary design center for the R5000. Silicon suppliers are IDT, NEC, and NKK (Santa Clara, CA).
All R4xxx processors implement the MIPS III architecture, which provides 64-bit integer registers, instructions that operate on 64-bit data, and 32- or 64-bit addressing for each privilege level. The first device to implement the MIPS III architecture was the high-end R4400. The superpipelined R4400 has an eight-stage pipeline: instruction fetch (first and second halves), register file (access), execute, data cache (first and second halves), tag check, and write-back. The drawback to having a long pipeline is obvious when the processor performs branches or memory references: Branches cause a latency of three internal clocks, and loads incur a two-cycle latency. However, superpipelining increases performance, because each stage can run at twice the system clock.
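A rough stall model shows how these latencies add up. The sketch charges the full branch and load latencies quoted above whenever they are exposed; the instruction counts in the example are hypothetical inputs, not measurements, and the model ignores delay slots that a compiler might fill with useful work.

```python
BRANCH_PENALTY = 3  # internal clocks lost per branch, per the text
LOAD_LATENCY = 2    # internal clocks a dependent instruction waits, per the text


def estimated_cycles(instructions, branches, load_use_stalls):
    """Ideal one-instruction-per-clock issue plus exposed stall cycles.

    load_use_stalls counts loads whose result is needed immediately, so
    the whole load latency is exposed.
    """
    return (instructions
            + BRANCH_PENALTY * branches
            + LOAD_LATENCY * load_use_stalls)


def cycles_per_instruction(instructions, branches, load_use_stalls):
    """Average internal clocks per instruction under the same model."""
    return estimated_cycles(instructions, branches, load_use_stalls) / instructions


# Hypothetical mix: 1000 instructions, 100 branches, 50 exposed loads.
print(estimated_cycles(1000, 100, 50))
print(cycles_per_instruction(1000, 100, 50))
```

Because the internal clock runs at twice the system clock, the superpipelined design can still come out ahead even when a model like this puts the cycles-per-instruction figure well above one.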
The R4700 (and derivatives) represent the low-cost, low-power end of the R4xxx product family. These CPUs include a five-stage pipeline: instruction fetch, decode, execute, data cache read, and cache write. They rely on increasing clock speed to raise performance. The pipeline includes a one-processor-clock delay slot for branch-type instructions.
The R4300i reduced its silicon overhead by overlapping the FPU with the integer unit and executing floating-point instructions in the integer ALU, which simplified the pipeline and eliminated the need for a separate floating-point data path. To accommodate floating-point operations, the R4300i's integer unit has 32 64-bit floating-point registers. In contrast, the R4700 includes a separate FPU and performs integer multiply and divide in the FPU. (The integer MPY-DIV unit has been removed.)
The R4700 does virtual-to-physical address translation in parallel with its cache accesses. The caches are virtually indexed to speed access but are physically tagged for addressing. The R4700 has a large fully associative TLB with 96 entries. The cache entry's physical tag is checked against the TLB physical address. A four-entry write buffer buffers writes to eliminate write stalls.
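A virtually indexed, physically tagged lookup can be sketched as follows. The page size, line size, and cache geometry are assumptions picked so that the index bits fall inside the page offset; they are not the R4700's actual parameters, and the dictionaries merely stand in for the TLB and cache arrays.

```python
PAGE_BITS = 12   # 4-kbyte pages (assumed)
LINE_BITS = 5    # 32-byte cache lines (assumed)
INDEX_BITS = 7   # 128 lines, so index plus line offset fit in the page offset


def lookup(vaddr, tlb, cache):
    """Return cached data on a hit, or None on a cache miss.

    tlb maps virtual page number -> physical page number; a missing
    entry raises KeyError, standing in for a TLB-miss exception.
    cache maps line index -> (physical tag, data).
    """
    # Index the cache with virtual-address bits; no translation needed yet.
    index = (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    entry = cache.get(index)
    # In hardware, this translation proceeds in parallel with the cache read.
    ppn = tlb[vaddr >> PAGE_BITS]
    paddr = (ppn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))
    tag = paddr >> (LINE_BITS + INDEX_BITS)
    # Hit only if the stored physical tag matches the translated address.
    if entry is not None and entry[0] == tag:
        return entry[1]
    return None


tlb = {0x12345: 0x00ABC, 0x54321: 0x00DEF}
cache = {13: (0x00ABC, "line data")}
print(lookup(0x123451A4, tlb, cache))   # same index, matching tag
print(lookup(0x543211A4, tlb, cache))   # same index, different physical tag
```

The second lookup selects the same cache line but fails the tag comparison, which is how physical tagging keeps two virtual pages from being confused even though the index came from a virtual address.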
Power management: All R4xxx processors have the ability to power down any "nonbusy" functional units. The R4300i's power-management features include a reduced-power mode (reduces clock rate to one-fourth of normal) and a power-down mode that writes CPU state to battery-backed RAM before turning off. The CPU has an instruction micro-TLB that caches the last two TLB entries and, thus, minimizes power dissipation. (The main TLB doesn't have to turn on.) Other circuit-design techniques also minimize power dissipation. The R4700 provides a Wait instruction that disconnects the PLL from the CPU clock.
Special instructions: MIPS R4xxx processors implement the MIPS III instruction set. MIPS III's instructions include double-word loads, stores, shifts, and add/subtract. The R4xxx's on-chip FPU performs 32-bit, single-precision and 64-bit, double-precision floating-point operations. Integer multiply and divide are done stepwise in bit pairs and single bits. The chip handles 32- and 64-bit multiplies and divides. It uses 32-bit arithmetic results seamlessly in 64-bit computations; you don't have to track operands and specify conversion.
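The reason 32-bit results mix cleanly with 64-bit computations is that a 32-bit result is held sign-extended in its 64-bit register. The sketch below models that convention with plain integers; the helper names are hypothetical and the code illustrates only the register images, not the instruction encodings.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Return the 64-bit register image of a 32-bit result."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


def add32(a, b):
    """32-bit add whose result is kept sign-extended in a 64-bit register."""
    return sign_extend_32(a + b)


def add64(a, b):
    """64-bit add on full register contents."""
    return (a + b) & MASK64


def as_signed64(reg):
    """Interpret a 64-bit register image as a signed integer."""
    return reg - (1 << 64) if reg >> 63 else reg


r = add32(5, sign_extend_32(-7))   # 32-bit computation: 5 + (-7)
print(hex(r), as_signed64(r))
print(as_signed64(add64(r, 10)))   # 64-bit computation reuses r directly
```

Because the upper 32 bits already carry the sign, the 64-bit add needs no conversion step before consuming the 32-bit result.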
Second sources: MIPS Technologies is a design center for the 64-bit MIPS R4xxx RISC processors, which are licensed to silicon vendors, including IDT, NEC, LSI Logic, NKK (Santa Clara, CA), and Toshiba. QED designed the R4700, the only MIPS III processor not designed by MIPS. LSI Logic uses the R4xxx core to develop custom products.
The R10000 runs unrecompiled R4000 code without a performance degradation. The R10000 can dispatch instructions simultaneously to five functional units--floating-point add, floating-point multiply, two integer units, and a load/store unit--after instructions go through a two-stage instruction fetch-and-decode unit. The pipelined integer unit comprises six stages: fetch, decode, issue, execute, cache access, and write-back. Floating-point instructions use a seventh stage attached to the integer pipe. The execution units can complete instructions and write results out of order.
To support out-of-order execution, the R10000 maintains an instruction-status table to determine which instructions are waiting to graduate and to put those instructions back in program order. When handling out-of-order execution, a completed instruction may never graduate, because an exception or branch could invalidate the results. For this reason, an instruction may complete, but, until graduation, its results are tentative and may be discarded.
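The distinction between completing and graduating can be sketched with a small model of the status table. It is a simplification under stated assumptions: the class `StatusTable` and its methods are hypothetical, only exceptions (not mispredicted branches) are modeled, and the real R10000 bookkeeping is considerably more involved.

```python
from collections import OrderedDict


class StatusTable:
    """Tracks instructions in program order until they graduate."""

    def __init__(self):
        self.entries = OrderedDict()  # tag -> {"done": bool, "fault": bool}

    def issue(self, tag):
        """Record an instruction in program order."""
        self.entries[tag] = {"done": False, "fault": False}

    def complete(self, tag, fault=False):
        """Mark an instruction finished; may be called in any order."""
        self.entries[tag]["done"] = True
        self.entries[tag]["fault"] = fault

    def graduate(self):
        """Retire finished instructions from the oldest end.

        Returns (graduated, discarded). A faulting instruction at the
        head discards itself and every younger instruction.
        """
        graduated, discarded = [], []
        while self.entries:
            tag, state = next(iter(self.entries.items()))
            if not state["done"]:
                break  # oldest instruction still executing; younger ones wait
            if state["fault"]:
                discarded = list(self.entries)
                self.entries.clear()
                break
            graduated.append(tag)
            del self.entries[tag]
        return graduated, discarded


table = StatusTable()
for tag in ("i0", "i1", "i2", "i3"):
    table.issue(tag)
table.complete("i2")                 # finishes early, but cannot graduate yet
print(table.graduate())
table.complete("i0")
print(table.graduate())
table.complete("i3")
table.complete("i1", fault=True)     # exception on an older instruction
print(table.graduate())
```

In the last step, i2 and i3 had both completed, yet their results are thrown away because the older i1 faulted before they could graduate.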
The R10000 also facilitates designing tightly coupled multiprocessing systems. To accomplish this goal, the CPU has a 64-bit cluster-bus configuration that allows direct connection of four R10000 processors. Attaching the R10000 to an external agent, or cluster coordinator, creates a cluster bus that manages the flow of data within the cluster.
Power management: The R10000 can power down any "nonbusy" functional units.
Special instructions: The R10000 processors implement the MIPS IV instruction set.
Second sources: MIPS Technologies (Mountain View, CA) currently licenses the R10000 to NEC and Toshiba.
UltraSPARC-I and UltraSPARC-II are silicon implementations of SPARC V9, a version of the scalable-processor architecture. SPARC V9 maintains upward binary compatibility with SPARC V8 and extends the architecture with support for 64-bit virtual addresses and integer data sizes up to 64 bits; 32 double-precision, floating-point registers (up from 16); and speculative loads, which don't take a fault if accessing an out-of-range variable. V9 also defines a hardware mechanism that uses compiler technologies that streamline the prefetching of data and instructions.
The superscalar processors have nine-stage pipelines, and the first two stages comprise the instruction fetch and decode. Three additional stages have been added to the integer pipe to make it symmetrical with the floating-point pipe. This architecture simplifies pipeline synchronization and exception handling; it also eliminates the need for a floating-point queue. The CPU's pipeline encompasses two integer ALUs, five floating-point/graphics units, and a load/store unit. The vendor also includes a 2-bit dynamic branch-prediction mechanism, which is part of its prefetch unit. As the 16-kbyte instruction cache fills, the CPU uses two extra bits per instruction to tag on information related to the branch prediction for that instruction.
UltraSPARC-I uses data buffers to isolate the L2 cache from the system bus. These buffers enable overlapping of system transactions and perform error detection/correction. The processor contains an on-chip L2-cache controller, and the system bus runs at one-half to one-third the processor frequency. The vendor claims that instructions and data pass between the L2 cache and the CPU at 2.6 Gbytes/sec. UltraSPARC-II can handle multiple outstanding memory requests (three loads/two stores vs one load or store for UltraSPARC-I).
Special instructions: SPARC V9 adds several instructions to the V8 specification: conditional move, 64-bit integer multiply/divide, compare and swap, prefetch, and branch-on register-value instructions. UltraSPARC-I adds graphics instructions (not designated in SPARC V9), referred to as the "Visual Instruction Set." These instructions provide the most common operations related to multimedia compression, imaging, and printing acceleration.
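The compare-and-swap instruction is the usual building block for lock-free updates. The sketch below models only its semantics on a Python list standing in for memory; it is not atomic in Python, and the function names are hypothetical rather than SPARC mnemonics.

```python
def compare_and_swap(memory, addr, expected, new):
    """Store `new` only if the word equals `expected`; return the old word.

    The caller knows the swap happened when the returned value equals
    `expected`. In hardware the whole operation is indivisible.
    """
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


def atomic_increment(memory, addr):
    """Retry loop of the kind typically built on compare and swap."""
    while True:
        seen = memory[addr]
        if compare_and_swap(memory, addr, seen, seen + 1) == seen:
            return seen + 1


memory = [41]
print(atomic_increment(memory, 0))
print(compare_and_swap(memory, 0, 0, 99), memory[0])  # mismatch: no store
```

On a multiprocessor, the retry loop simply tries again whenever another CPU changed the word between the read and the swap.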