
Table 4 EDN µP/µC Directory: 64-Bit Chips
Alpha has a 64-bit load/store architecture with 32 integer and 32 floating-point registers. The Alpha chips consist of the high-end 21164, the superscalar, super-pipelined 21064A, and the integrated 21066.
The 9.3-million-transistor 21164 has a seven-stage integer pipeline that contains two integer units. It also includes a nine-stage floating-point unit that can simultaneously issue an add and a multiply in one cycle. A unique feature of the 21164 is the 96-kbyte on-chip, level-two cache that provides 50% better performance than an external cache. To help increase loading efficiency, the 21164 contains merge logic that looks ahead to see if more than one reference is being made to the same cache block. The merge logic can compress up to 20 loads in a row.
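The idea behind load merging can be pictured with a small sketch. It is not the 21164's actual logic: the 32-byte block size and the function name `merge_loads` are assumptions made only for illustration. The sketch groups a run of pending load addresses by the cache block they fall in, so that each block needs to be referenced only once.

```python
BLOCK_SIZE = 32  # bytes; assumed for illustration, not taken from the text


def merge_loads(addresses, block_size=BLOCK_SIZE):
    """Group pending load addresses by cache block, in first-seen order.

    Returns a dict mapping block number -> list of addresses in that block.
    """
    merged = {}
    for addr in addresses:
        block = addr // block_size
        merged.setdefault(block, []).append(addr)
    return merged


if __name__ == "__main__":
    loads = [0x1000, 0x1008, 0x1010, 0x2000, 0x1018]
    for block, addrs in merge_loads(loads).items():
        print(f"block {block}: {[hex(a) for a in addrs]}")
```

In this toy run, five loads collapse into two block references because four of them touch the same block.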
The 21064 has a seven-stage integer pipeline and a 10-stage floating-point pipeline and can issue up to two instructions per cycle. It implements branch-prediction hardware based on a single branch-history bit (2 bits in the 21064A) stored with each instruction in the instruction cache. The chip's pipelines mix static and dynamic stages: instructions can stall in the first four (static) stages, whereas once issued into the remaining (dynamic) stages they execute through without stalling. To buffer slower external memory from the high-speed processor, the chip has two direct-mapped caches--one each for instructions and data. A write buffer holds up to four 32-byte blocks of pending writes. The 21064 also has provisions to accommodate up to 16 Mbytes of secondary cache.
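To see why a second history bit helps, consider the following sketch. It assumes that the 2-bit scheme behaves as a saturating counter, which the text does not state, and the function names are hypothetical. Each function counts mispredictions for one branch over a sequence of taken/not-taken outcomes.

```python
def one_bit_mispredicts(outcomes, state=True):
    """Predict that the branch repeats its most recent outcome."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # remember only the last outcome
    return misses


def two_bit_mispredicts(outcomes, counter=3):
    """Predict with a 2-bit saturating counter (0..3); >= 2 means 'taken'."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through; run three times.
    pattern = ([True] * 9 + [False]) * 3
    print(one_bit_mispredicts(pattern), two_bit_mispredicts(pattern))
```

With a single bit, every loop exit also causes a miss on the next loop entry; the counter tolerates one anomalous outcome before changing its prediction.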
The 21066 integrates on-chip, fully pipelined integer and floating-point processors. An instruction fetch-and-decode unit prefetches two instructions in parallel and determines whether the resources are available to execute one or both of them. The CPU also contains a memory controller, graphics accelerator, and a PCI I/O controller.
Power management: Alpha chips support thermal and idle-based power management through a programmable internal clock divider. Clock frequencies can be reduced by a factor of up to eight.
Special instructions: All instructions are 32 bits in length and consist of branch, load/store, integer and floating-point operations, and CALL_PAL types. CALL_PAL (Privileged Architecture Library) instructions vector to a software library that atomically performs both privileged and unprivileged functions, such as handling interrupts and exceptions and maintaining TLBs. This instruction class allows Alpha to accommodate VAX-specific hardware characteristics.
Second sources: Mitsubishi (Sunnyvale, CA) second sources the 21066A.
All R4xxx processors implement the MIPS III architecture, which provides 64-bit integer registers, instructions that operate on 64-bit data, and 32- or 64-bit addressing for each privilege level. The first device to implement the MIPS III architecture was the R4000, which was rendered obsolete by the high-end R4400. The super-pipelined R4400 has an eight-stage pipeline: instruction fetch (first and second halves), register file (access), execute, data cache (first and second halves), tag check, and writeback. The drawback of a long pipeline becomes obvious when the processor performs branches or memory references: branches cause a latency of three internal clocks, and loads incur a two-cycle latency. However, super-pipelining increases performance because each stage can run at twice the system clock.
The R4200 and the R4600 (and derivatives) represent the low-cost, low-power end of the R4xxx product family. These CPUs include a five-stage pipeline: instruction fetch, decode, execute, data cache read, and cache write. They rely on increasing clock speed to raise performance. The pipeline includes a one processor-clock delay slot for branch-type instructions.
The R4200 reduces its silicon overhead by folding the FPU into the integer data path and executing FP instructions in the integer ALU; this tactic simplifies the pipeline and eliminates the need for a separate FP data path. To accommodate floating-point operations, the R4200's integer unit contains 32 64-bit floating-point registers. The R4600 includes a separate FPU and performs integer MPY and DIV in the FPU (the integer MPY/DIV unit has been removed).
The R4600 does virtual-to-physical address translation in parallel with its cache accesses. The caches are virtually indexed to speed access but physically tagged for addressing. The R4600 has a large, fully associative TLB with 96 entries. The cache entry's physical tag is checked against the physical address that the TLB supplies. A four-entry write buffer absorbs writes to eliminate write stalls.
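A virtually indexed, physically tagged lookup can be sketched as follows. This is a toy model rather than the R4600's design: the page size, line size, number of sets, direct-mapped organization, and the class name `VIPTCache` are all assumptions chosen for brevity. The point is that the set index comes straight from the virtual address while the tag comparison uses the translated physical address.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 32    # assumed cache line size in bytes
NUM_SETS = 128    # assumed; 128 * 32 = 4096, so index bits lie in the page offset


class VIPTCache:
    """Toy direct-mapped cache: virtual index, physical tag."""

    def __init__(self, tlb):
        self.tlb = tlb                 # virtual page number -> physical page number
        self.tags = [None] * NUM_SETS  # one physical tag per set

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.tlb[vpn] * PAGE_SIZE + offset

    def access(self, vaddr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        # In hardware these two steps proceed in parallel.
        index = (vaddr // LINE_SIZE) % NUM_SETS  # uses virtual address bits only
        paddr = self.translate(vaddr)            # TLB lookup
        ptag = paddr // (LINE_SIZE * NUM_SETS)
        hit = self.tags[index] == ptag           # compare physical tags
        if not hit:
            self.tags[index] = ptag
        return hit


if __name__ == "__main__":
    cache = VIPTCache({0: 5, 1: 9})
    print(cache.access(0x0040), cache.access(0x0040), cache.access(0x1040))
```

The third access selects the same set as the first two but maps to a different physical page, so the physical tag comparison correctly reports a miss.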
Power management: All R4xxx processors have the ability to power down any "nonbusy" functional units. The R4200's power-management features include a reduced-power mode (reduces clock rate to 1/4 normal) and a power-down mode that writes CPU state to battery-backed RAM before turning off. The CPU has an instruction micro-TLB that caches the last two TLB entries and thus minimizes power dissipation (the main TLB doesn't have to turn on). The R4600 provides a WAIT instruction that disconnects the PLL from the CPU clock.
Special instructions: MIPS R4xxx processors implement the MIPS III instruction set. MIPS III's new instructions include double-word loads, stores, shifts, and add/sub. The R4xxx's on-chip FPU performs 32-bit, single-precision and 64-bit, double-precision FP operations. Integer multiply and divide are done stepwise in bit pairs and single bits, respectively. The chip handles 32- and 64-bit multiplies and divides. It uses 32-bit arithmetic results seamlessly in 64-bit computations; you don't have to track operands and specify conversion.
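The phrase "stepwise in bit pairs and single bits" can be illustrated with a short sketch. It is a simplified unsigned model under stated assumptions, not the R4xxx's circuit (real hardware may use recoding techniques not shown here), and both function names are hypothetical. The multiply consumes two multiplier bits per step; the divide produces one quotient bit per step.

```python
def multiply_bit_pairs(a, b, width=32):
    """Unsigned multiply that consumes the multiplier two bits per step.

    Returns (product, number_of_steps). Assumes 0 <= b < 2**width.
    """
    product = 0
    steps = 0
    for shift in range(0, width, 2):
        pair = (b >> shift) & 0b11         # next two multiplier bits
        product += (a * pair) << shift     # add 0, 1, 2, or 3 times a
        steps += 1
    return product, steps


def divide_single_bits(dividend, divisor, width=32):
    """Unsigned restoring division that yields one quotient bit per step.

    Returns (quotient, remainder, number_of_steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    steps = 0
    for bit in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:           # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
        steps += 1
    return quotient, remainder, steps


if __name__ == "__main__":
    print(multiply_bit_pairs(12345, 6789))
    print(divide_single_bits(100, 7))
```

Under these assumptions, a 32-bit multiply takes half as many steps as a 32-bit divide, which is the practical consequence of processing two bits at a time.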
Second sources: MIPS Technologies is the primary design center for the 64-bit MIPS R4xxx RISC processors, which are licensed to silicon vendors, including IDT (Santa Clara, CA), NEC (Mountain View, CA), LSI Logic (Milpitas, CA), NKK (Santa Clara, CA), and Toshiba (Irvine, CA). QED designed the R4600, the only MIPS III processor not designed by MIPS. LSI Logic does not offer any off-the-shelf processors; instead the company uses the R4xxx core to develop custom products.
Although the architecture of the R10000 is new, its designers decided that it would run unrecompiled R4000 code without performance degradation. The R10000 can dispatch instructions simultaneously to five functional units--a floating-point adder, a floating-point multiplier, two integer units, and a load/store unit--after instructions go through a two-stage instruction fetch-and-decode unit. The pipelined integer unit consists of six stages: fetch, decode, issue, execute, cache access, and writeback. Floating-point instructions use a seventh stage attached to the integer pipe. The execution units can complete instructions and write results out of order.
To support out-of-order execution, the R10000 maintains an instruction-status table to determine the instructions waiting to graduate and to put the instructions in order. When handling out-of-order execution, a completed instruction may never graduate because an exception or branch could invalidate the results. For this reason, an instruction may complete, but, until graduation, its results are tentative and may be discarded.
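The distinction between completing and graduating can be sketched in a few lines. This is an illustrative model only: the class name `InstructionStatusTable`, its methods, and the instruction labels are hypothetical and do not describe the R10000's actual structures. Instructions may be marked complete in any order, but they retire strictly from the head of the table, and anything younger than a faulting or mispredicted instruction can be discarded.

```python
from collections import deque


class InstructionStatusTable:
    """Toy model: out-of-order completion, in-order graduation."""

    def __init__(self):
        self.entries = deque()  # program order; each entry is [name, done]

    def issue(self, name):
        self.entries.append([name, False])

    def complete(self, name):
        """Mark an instruction finished; its result is still tentative."""
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = True
                return
        raise KeyError(name)

    def graduate(self):
        """Retire completed instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0][1]:
            retired.append(self.entries.popleft()[0])
        return retired

    def squash_after(self, name):
        """Discard every instruction younger than `name`."""
        names = [entry[0] for entry in self.entries]
        keep = names.index(name) + 1
        discarded = names[keep:]
        self.entries = deque(list(self.entries)[:keep])
        return discarded


if __name__ == "__main__":
    table = InstructionStatusTable()
    for name in ("i1", "br", "i3"):
        table.issue(name)
    table.complete("i3")             # finishes first, but cannot graduate yet
    print(table.graduate())
    table.complete("i1")
    print(table.graduate())
    print(table.squash_after("br"))  # the branch invalidates the tentative i3
```

In the usage above, `i3` completes early yet never graduates, because the older branch ahead of it causes its tentative result to be thrown away.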
The R10000 also facilitates designing tightly coupled multiprocessing systems. To accomplish this goal, the CPU has a 64-bit cluster-bus configuration that allows direct connection of four R10000 processors. Attaching the R10000 to an external agent, or cluster coordinator, creates a cluster bus that manages the flow of data within the cluster.
Power management: The R10000 can power down any "nonbusy" functional units.
Special instructions: The R10000 processors implement the MIPS IV instruction set.
Second sources: MIPS Technologies (Mountain View, CA) currently licenses the R10000 to NEC (Mountain View, CA) and Toshiba (Irvine, CA).
Development tools: Contact MIPS for a list of development tools.
UltraSPARC is the first silicon version of SPARC V9, a version of the scalable processor architecture. SPARC V9 maintains upward binary compatibility with SPARC V8 and extends the architecture with support for 64-bit virtual addresses, integer data sizes up to 64 bits, 32 double-precision floating-point registers (up from 16), and speculative loads, which are loads that don't take a fault when they access an out-of-range variable. V9 also defines a hardware mechanism that compilers can use to streamline the prefetching of data and instructions.
The superscalar processor has a nine-stage pipeline in which the first two stages comprise the instruction fetch and decode. Three additional stages have been added to the integer pipe to make it symmetrical with the floating-point pipe. This architecture simplifies pipeline synchronization and exception handling; it also eliminates the need for a floating-point queue. The CPU's pipeline encompasses two integer ALUs, five floating-point/graphics units, and a load/store unit. Sun also includes a 2-bit dynamic branch-prediction mechanism, which is part of its prefetch unit. As the 16-kbyte instruction cache fills, the CPU uses two extra bits per instruction to tag on information related to the branch prediction for that instruction.
UltraSPARC uses data buffers to isolate the level-two cache from the system bus. These buffers enable overlapping of system transactions and perform error detection/correction. The processor contains an on-chip level-two cache controller, and the system bus can run at one-half to one-third the processor frequency. Sun claims that instructions and data can pass between the level-two cache and the CPU at 2.6 Gbytes/sec.
Special instructions: SPARC V9 adds several instructions to the V8 specification: conditional move, 64-bit integer multiply/divide, compare and swap, prefetch, and branch on register value instructions. UltraSPARC adds graphics instructions (not designated in SPARC V9), referred to as the Visual Instruction Set. These instructions provide the most common operations related to multimedia compression, imaging, and printing acceleration.
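Of the instructions listed, compare and swap is the one whose semantics are easiest to show in code. The sketch below models the general compare-and-swap idea, not the exact SPARC V9 encoding or operands; the `Word` class and `atomic_increment` helper are hypothetical, and a software lock stands in for the atomicity that the hardware instruction provides.

```python
import threading


class Word:
    """A memory word with an emulated atomic compare-and-swap."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word equals `expected`.

        Always returns the value that was in the word beforehand.
        """
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old


def atomic_increment(word):
    """Lock-free style increment: retry until no other update intervenes."""
    while True:
        seen = word.value
        if word.compare_and_swap(seen, seen + 1) == seen:
            return seen + 1


if __name__ == "__main__":
    w = Word(5)
    print(w.compare_and_swap(4, 9), w.value)  # comparison fails, word unchanged
    print(w.compare_and_swap(5, 9), w.value)  # comparison succeeds, word updated
    print(atomic_increment(w))
```

The retry loop in `atomic_increment` is the usual way such an instruction is used to build synchronization primitives without a separate lock.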