
Table 3 EDN µP/µC Directory: 32-Bit Chips
The 29000 (29K) is a difficult processor family to classify because the family comprises three product lines, including three-bus Harvard-architecture processors, two-bus processors, and µCs with on-chip peripheral support. The core of the 29K is built around a simple four-stage pipeline: fetch, decode, execute, and writeback. Processor hardware implements pipeline interlocks.
The 29K has a triple-ported register file of 192 32-bit registers, allowing the CPU to handle two operand-register reads and one register-result write in a single cycle. The register file can also perform on-chip stack operations and register windowing.
Most 29K instructions specify three addresses/two registers as inputs and a register for the result. Eight-bit addresses specify a register in the register file. The addresses divide into local (MSB=1) and global (MSB=0); 128 registers are in the local set.
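As a minimal sketch of the local/global split described above, the following Python function decodes an 8-bit register address. The name `classify_register` and the returned pair are illustrative assumptions; only the MSB rule and the 128-entry local set come from the text, and any windowing offset the hardware applies to local registers is ignored.

```python
def classify_register(addr: int) -> tuple[str, int]:
    """Split an 8-bit 29K register address into (set name, index within that set)."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("register address must fit in 8 bits")
    # MSB = 1 selects the local set, MSB = 0 the global set (per the text).
    kind = "local" if addr & 0x80 else "global"
    # The low seven bits give a position within the selected half (0..127).
    return kind, addr & 0x7F
```

For example, `classify_register(0x82)` reports the third entry of the local set, while any address below 0x80 falls in the global half.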
A two-way, set-associative, branch-target cache (not available in all versions) holds as many as 128 instructions. The first four instructions of a branch target, which are held in the cache, wait for the branch to repeat. When the branch repeats, the cache provides the instructions, thus keeping the branch penalty to one cycle.
The 29K memory interface provides single-cycle burst access for DRAM, page-mode DRAM, or ROM. The interface can pump instructions at the CPU clock rate to deliver cache-like performance. A DRAM controller, the Am29C668, runs both standard and page-mode DRAM for the 29K but requires external control logic and DRAM buffers. The 29200 incorporates a DRAM and ROM controller with flash and SRAM support.
Power management: Only the 29040 supports power-management modes. A sleep mode stops the clock but maintains the internal cache and registers. Snooze mode keeps the clock running but shuts down the remaining internal circuitry.
Special instructions: The µPs can operate in either user or supervisor modes. In user mode, an illegal action by an executing program causes a protection-violation trap to occur. Special compare instructions put the results into a general-purpose register instead of using condition codes in a status register, making processing more general because the register can hold multiple conditions. Assert instructions compare conditions and, if not true, cause a software trap.
The 29K chips support multiprocessing through a LOADSET instruction that implements a binary semaphore: it loads a register, locks memory, and writes back all 1s. Load-and-lock and store-and-lock instructions read or write a memory location with LOCK asserted during the memory access. A CLZ instruction counts leading 0s in a byte/word and can speed bit-map processing by finding the first nonzero bit. CPBYTE compares words by byte and places the Boolean result in a register.
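The behavior of LOADSET and CLZ can be modeled in a few lines of Python. This is a behavioral sketch only: the names `loadset` and `clz32` are made up for illustration, memory is a plain list of 32-bit words, and the bus locking that makes the real instruction atomic is not modeled.

```python
ALL_ONES = 0xFFFFFFFF


def loadset(memory: list, addr: int) -> int:
    """Return the old word at addr and overwrite it with all 1s (test-and-set)."""
    old = memory[addr]
    memory[addr] = ALL_ONES  # the hardware does this with memory locked
    return old


def clz32(word: int) -> int:
    """Count leading zero bits in a 32-bit word; returns 32 for a zero word."""
    word &= ALL_ONES
    return 32 - word.bit_length()
```

Assuming a convention in which 0 means "free," a caller that gets 0 back from `loadset` now owns the semaphore, and any later caller sees all 1s until the owner stores 0 again. In a bit map, `clz32` gives the position of the first set bit counted from the most significant end.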
Advanced RISC Machines (ARM) designs and licenses µPs and µP cores for its partners (including GEC Plessey, Sharp, and VLSI Technology). The ARM 6/7 implements a load/store architecture and a three-stage pipeline (fetch, decode, and execute) to achieve single-cycle instruction execution. The ARM 6/7 has 31 general-purpose registers, with 16 visible at a time. The fast-interrupt mode has seven private registers to minimize state-saving overhead. All registers are general-purpose, including the PC, although a set of conventions called the ARM Procedure Call Standard governs their use for C compatibility.
ARM processors contain an MMU that handles translations between virtual and physical addresses as well as controlling access permissions. The MMU supports page sizes of 4 and 64 kbytes, with access control to page granularity. The ARM µPs support user and supervisor modes for controlling access; they handle four exception-processing modes: interrupt request, fast interrupt request, abort, and undefined. Modes use different register windows to overlay some of the 16 general-purpose registers.
The ARM 6/7 bus clock can be synchronous or asynchronous with respect to the internal cache clock. The ARM 6/7 also provides a write buffer, which lets execution continue while writes are pending. The buffer holds eight words at four independent addresses. ARM processors also incorporate a coprocessor interface (except the 610/710) to allow another CPU to load/store data or request a CPU operation.
ARM has defined a set of 32-bit instructions that, with a special compiler, can be compressed into 16 bits. At runtime, an internal module decompresses the 16-bit instructions back to 32-bit instructions. This architectural extension, called Thumb, helps increase code density and overcome waste associated with using 32-bit fixed-length instructions.
Special instructions: ARM 6/7 has 11 basic types of fixed-length instructions, which execute conditionally (not just branch) and reduce the need for short pipeline-flushing branches. A not-taken instruction executes in one cycle. Taken branches incur a three-cycle delay. The 16 execution condition codes include equal, not equal, always, negative, and overflow. The ARM has no explicit shift instructions; instead, all ALU operations can perform an optional shift operation in the same execution cycle. The processors have block-data-transfer instructions to load and store data from any subset of the 16 general-purpose registers.
ARM processors lack an integer-divide instruction; fast divide and divide-by-10 are provided by support software. However, the chips do have multiply and multiply-and-accumulate (MAC) instructions. The MAC instruction speeds math-intensive applications. A Booth hardware multiplier operates on two operand bits at a time to build a final product. A 32-bit multiply takes 16 cycles; smaller multipliers reduce the number of cycles. The ARM 7M variant contains a faster multiplier, which executes in four cycles and offers 64-bit multiplication. Division and multiplication by a constant can be performed quickly using the barrel shifter (for example, division by four takes one cycle, as does multiplication by five).
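The shift-and-add tricks and the Booth cycle count mentioned above can be sketched in Python. The function names are illustrative, the ARM mnemonics in the comments show one way the operation is commonly written, and `booth_cycles` is a simplified model (two multiplier bits retired per cycle, stopping when no significant bits remain) rather than a cycle-accurate description of any ARM core.

```python
MASK32 = 0xFFFFFFFF


def mul5(x: int) -> int:
    """x * 5 as one add with a shifted operand, e.g. ADD Rd, Rm, Rm, LSL #2."""
    return (x + (x << 2)) & MASK32


def div4(x: int) -> int:
    """Unsigned x / 4 as one logical shift right, e.g. MOV Rd, Rm, LSR #2."""
    return (x & MASK32) >> 2


def booth_cycles(multiplier: int) -> int:
    """Simplified cycle estimate: two multiplier bits consumed per cycle."""
    bits = (multiplier & MASK32).bit_length()
    return max(1, (bits + 1) // 2)
```

Under this model a multiplier that uses all 32 bits needs 16 cycles, which matches the figure in the text, while an 8-bit multiplier finishes in four.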
Fujitsu's MB8693X family (also known as SPARClite), which is based on the V8E spec (SPARC International's embedded specification), features a 32-bit ALU and uses a load/store architecture with a register stack of 136 32-bit registers (the 86933 and 86933H chips have 104). Eight reserved registers hold global values. The remaining registers are arranged into six or eight overlapping register windows, one window for each subroutine. This setup speeds procedure calls and interrupt processing. Multiple contexts can be present concurrently by limiting the number of registers for a task.
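The register counts above follow from simple arithmetic, worked here in Python. The sketch assumes the standard SPARC window layout, which the text does not spell out: each window sees 8 "in," 8 local, and 8 "out" registers, and the "out" registers of one window are the "in" registers of the next, so each window adds 16 unique registers. The names are illustrative.

```python
GLOBAL_REGISTERS = 8      # reserved for global values (per the text)
UNIQUE_PER_WINDOW = 16    # 8 local + 8 shared in/out registers (SPARC convention)


def register_file_size(windows: int) -> int:
    """Total physical registers for a given number of overlapping windows."""
    return GLOBAL_REGISTERS + UNIQUE_PER_WINDOW * windows
```

Eight windows give 8 + 16 x 8 = 136 registers, and six windows give 8 + 16 x 6 = 104, matching the two figures quoted for the family.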
Fujitsu engineers extended the SPARC pipeline to five stages for the 8693X: fetch, decode, execute, memory, and write-back. The memory stage minimizes the effects of load/store operations and reduces a load/store to one-cycle execution. The stage is idle for nonload/store operations.
All 8693X µPs have separate data and instruction caches (the 86933 has instruction cache only). The caches are two-way set-associative and have 16- or 32-byte cache lines. Critical lines can be locked on chip and not swapped out. The µPs also incorporate a debug-support unit and emulator bus, which makes instruction streams visible even in on-chip cache. Debug registers hold data values or addresses for individual and range breakpoints.
The 8693Xs run with DRAM, SDRAM, SRAM, and ROM/EPROM. The memory interface handles page-mode DRAM for low-cost, high-speed access using a 32-byte burst mode. The memory interface includes a refresh generator for DRAMs, programmable wait states for slower memory, and programmable chip selects for memory banking. Boot-up memory interfaces are programmable; most 8693X CPUs can boot up from 8-, 16-, or 32-bit ROM/EPROM.
Power management: Shuts down FPU under control of power-management register.
Special instructions: The 8693X implements the SPARC V8 specification, which includes a full multiply instruction and division via software using divide-by-four. Other special instructions include scan word looking for first changed bit, or first 1 or 0; load/store double word; save/restore caller (uses register windows); tagged add/sub (generates overflow if bits 0 and 1 of an operand are not 0); atomic math and swap; generate trap from conditions.
The SH family of RISC µPs/µCs employs a five-stage pipeline: fetch, decode, execute, memory access, and write back to register. The CPU is built around 25 32-bit registers that are accessed using load/store instructions. These registers consist of 16 general registers (SH3 has eight 32-bit shadow registers for context switching), five control registers, and four system registers. The chips use 32-bit data paths to move data internally, but all versions use a flexible external bus width to help satisfy different system-level budgets.
SH processors use a 16-bit instruction word to achieve fairly compact code, enabling larger on-chip programs. However, you'll face some tradeoffs when employing a 16-bit instruction word. The smaller word size restricts the 16/32-bit address field. Instead of normal three-address, 32-register RISC operations, SH devices are restricted to 16 register ranges (four bits/register field) and to specifying two registers per instruction. These restrictions can lead to larger programs because the hardware may have to do more work than a standard RISC architecture.
The processor can run from external memory or from on-chip program memory. The 16-bit-wide external memory bus can supply the CPU with instructions from SRAM or fast DRAM on each cycle. If the processor is running from external memory, access to external memory takes up to three additional cycles per data access.
Instead of on-chip program memory, the SH2 and SH3 have a four-way, set-associative on-chip cache, a 32-bit-wide memory bus for high CPU memory bandwidth, and a full 32-bit divide unit (replacing the first chip's bit-step-divide function). The cache can be reconfigured as a two-way, set-associative cache and 2 or 4 kbytes of user-configurable RAM. The external memory bus supports multiprocessing; it has bus arbitration for multiple masters.
Power management: All SH devices have two low-power modes: sleep and standby. Sleep discontinues CPU processing but keeps peripherals active. Standby stops everything but maintains register/cache contents. The SH-2 and SH-3 provide several clock modes for reducing power. To reduce power consumption further, software can adjust the clock rate during program operation. The SH-3's unified cache has a special low-power design that dissipates only 100 mW in operation. The cache sense amps are only energized for the set that hits--while the other three sets stay switched off. The sense amps only respond to a 60-mV differential (versus the full 3.3V swing).
Special instructions: A 16x16-bit MAC instruction (42-bit accumulator) in the SH-1 and a 32x32-bit MAC instruction (64-bit accumulator) in the SH2 and SH3 provide a fast DSP function. Although classified as a load/store architecture, some instructions reference memory. Delayed branch instructions minimize pipeline disruption. An instruction swaps upper and lower bytes.
The 386 has all but disappeared from the PC and has developed a strong presence in embedded-PC applications (see EDN, June 22, 1995, pg 36). Register-based, the 80386 architecture has four general-purpose registers and four index/pointer registers, supplemented by six 16-bit segment registers and two 32-bit status and control registers. Intel 8086 designers used 64-kbyte segments to extend addressing to 1 Mbyte. The 80386 also uses segmentation; however, because the registers are now 32 bits (general-purpose and index/pointers), the segment limits are extended to the full 4-Gbyte addressing range, and a segment register references a segment descriptor with a 32-bit base address. These descriptors also carry addressing-range and protection limits to prevent data accesses into code, data being executed as code, and access to inner privilege levels by outer levels.
Hardware-descriptor registers hold segment-access rights along with segment-base address and size limits. In protected-mode addressing, a 16-bit selector points to a segment descriptor and furnishes a base address. The base address adds to the 32-bit effective address, producing a 32-bit linear address, which is then used as a physical address or as a linear-page address.
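The two addressing schemes can be contrasted in a short Python sketch. The selector field layout used here (descriptor index in bits 15 to 3, table indicator in bit 2, requested privilege level in bits 1 to 0) is standard x86 detail that the text does not state, and the names `SegmentDescriptor`, `split_selector`, and `linear_address` are illustrative. Paging, access rights, and privilege checks are left out.

```python
from dataclasses import dataclass


def real_mode_address(segment: int, offset: int) -> int:
    """8086-style address: 16-bit segment shifted left 4 bits, plus a 16-bit offset."""
    return ((segment << 4) + offset) & 0xFFFFF  # 20 address bits = 1 Mbyte


@dataclass
class SegmentDescriptor:
    base: int   # 32-bit segment base address
    limit: int  # highest valid offset within the segment


def split_selector(selector: int) -> tuple[int, int, int]:
    """Return (descriptor index, table indicator, requested privilege level)."""
    return selector >> 3, (selector >> 2) & 1, selector & 3


def linear_address(table: dict, selector: int, offset: int) -> int:
    """Protected-mode translation: descriptor base plus 32-bit effective address."""
    index, _table_indicator, _rpl = split_selector(selector)
    descriptor = table[index]
    if offset > descriptor.limit:
        # Real hardware raises a protection fault; an exception stands in here.
        raise ValueError("offset exceeds segment limit")
    return (descriptor.base + offset) & 0xFFFFFFFF
```

With a hypothetical descriptor table such as `{1: SegmentDescriptor(base=0x10000, limit=0xFFFF)}`, selector 0x08 picks entry 1, and an effective address of 0x1234 yields linear address 0x11234.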
The 386 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access.
Power management: System-management mode (SMM), a power-management mechanism, enables code to control CPU power without having to rewrite or revamp existing operating software. The CPU enters SMM via a hardware-interrupt SMI (system-management interrupt); the SMI code can set SMM operating modes to reduce chip power dissipation. Integrated versions of the 386 (for example, Intel's 386EX) have idle and powerdown modes: Idle discontinues CPU processing but keeps peripherals active, and powerdown shuts down the entire chip. AMD's 386SC300 chip has four power-saving modes: low speed (CPU goes to 0.5 MHz); doze, which stops CPU, system, and DMA clocks; sleep, which stops additional clocks and peripherals; and suspend, which stops everything except RTC and memory.
Special instructions: The 386 instruction set is a superset of the 8086/186. To support SMM, the 386 has seven additional instructions, such as RSM, which causes the processor to resume from SMM mode.
Second sources: AMD (Austin, TX), Cyrix (Richardson, TX), and Vadem (San Jose, CA).
The 486 builds on the 386 architecture by adding a more efficient memory bus, an on-chip FPU, an on-chip unified cache, and a RISC-like implementation for the core load/store instructions. The 80486 is a 32-bit RISC/CISC implementation, retaining the 386's complex instruction set but relying on a pipelined RISC-like implementation to speed execution for simple load/store instructions. The standard 486 microarchitecture has a five-stage pipeline and uses two of those stages (two decoder stages, D1 and D2) to decode the complex instruction set.
The 486 chips utilize variable-length instructions, ranging from 1 to 15 bytes for complex operations. The two decoder stages give the hardware time to delineate and decode the instructions waiting in the instruction queue. The instruction or byte-code queue holds 32 bytes for decoding. By fetching four words at a time from off-chip or local memory, the hardware minimizes contention between data and instruction accesses of the cache. To speed processing, the hardware loads and writes cache lines in four-word bursts.
The DX4 has a unified cache that is four-way set associative and implements a write-through policy: writes to cache pass through to memory, which raises memory bandwidth. The 486's bus and cache implement a bus-snooping protocol for multiprocessor operation. The bus is more efficient than that of the 386 and has a two-clock single read or write. Four-word read bursts take five cycles and constitute the majority of 486 bus accesses. The bus also supports secondary caches for both single and multiprocessor operation.
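The benefit of bursting follows from the clock counts quoted above, worked here in Python. The sketch assumes a burst costs two clocks for the first word and one clock for each word after it, which reproduces the five-clock figure for four words, and it assumes each word is a 32-bit (4-byte) transfer. The function names are illustrative.

```python
WORD_BYTES = 4  # one 32-bit bus transfer (assumption)


def clocks_for_transfer(words: int, burst: bool) -> int:
    """Bus clocks needed to move the given number of words."""
    if words < 1:
        raise ValueError("need at least one word")
    if burst:
        return 2 + (words - 1)  # two clocks for the first word, one for each after
    return 2 * words            # every word is a separate two-clock cycle


def bytes_per_clock(words: int, burst: bool) -> float:
    """Average bus throughput in bytes per clock."""
    return words * WORD_BYTES / clocks_for_transfer(words, burst)
```

Filling a four-word line takes five clocks as a burst versus eight clocks as four single reads, raising average throughput from 2 to 3.2 bytes per clock.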
The 486 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access.
National Semiconductor offers an integrated 486, the NS486SXF, with a three-stage pipeline and a 16-byte prefetch queue. To reduce the core size further, National removed the 486's FPU, virtual memory support, and real-mode functionality.
Power management: The standard 486 employs SMM (system-management mode) for power management (see description under 386 listing). A halt instruction powers down most of the CPU's logic. National's version, although it doesn't support SMM, has several other power-saving features: a power-save mode divides the CPU clock, individual peripherals can be disabled, and an idle mode disables the CPU clock without affecting peripherals.
Special instructions: The 486 instruction set expands that of the 80386, adding instructions such as byte swap, exchange-and-add, compare-and-exchange, invalidate data cache, write-back and invalidate data cache, invalidate TLB entry, processor identification, and SMM resume.
Second sources: AMD (Austin, TX), Cyrix (Richardson, TX), IBM Microelectronics (Fishkill, NY), National Semiconductor (Santa Clara, CA), SGS-Thomson (Phoenix, AZ), and Texas Instruments (Denver, CO).
The range of i960s runs from the new superscalar HA/HD/HT to the 16-bit SA/SB variants. The i960 combines a Von Neumann architecture with a load/store architecture that centers on a core of 32 32-bit general-purpose registers divided into 16 local registers and 16 global registers. An on-chip register cache automatically caches the local register sets to speed context switching. If the cache is full, the oldest cached set moves to memory and the latest set caches. All i960s have multistage pipelines and use resource scoreboarding to track resource usage.
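The spill behavior of the local-register cache can be sketched as a small Python model. The class name, its methods, and the capacity value are hypothetical; the text says only that the oldest cached set moves to memory when the cache is full. The reload on return is an assumption about how a spilled set comes back.

```python
from collections import deque


class RegisterCache:
    """Toy model of an on-chip cache holding whole local-register sets."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cached = deque()  # oldest set on the left, newest on the right
        self.spilled = []      # sets that have been written out to memory

    def call(self, register_set) -> None:
        """Procedure call: cache a new set, spilling the oldest one if full."""
        if len(self.cached) == self.capacity:
            self.spilled.append(self.cached.popleft())
        self.cached.append(register_set)

    def ret(self) -> None:
        """Procedure return: drop the newest set, reloading from memory if needed."""
        self.cached.pop()
        if not self.cached and self.spilled:
            self.cached.append(self.spilled.pop())
```

As long as call depth stays within the cache capacity, no register set touches memory, which is where the context-switching speedup comes from.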
The i960CA upgraded to superscalar operation and five pipeline stages. The key to the Cx is its four-instruction-wide instruction decoder, which decodes up to four instructions per cycle. Current implementations dispatch up to three of these instructions for execution. The i960CF has 128-bit-wide buses to move instructions to the decoder and 128-bit-wide buses to move data between the cache and registers.
Superscalar i960s are built around a six-port register file, with execution units gathered into two groups: register and memory control. These units include integer, floating-point, and interrupt-control units (on the register side); and address-generation and bus-controller units on the memory side. Instructions are cached in a lockable cache; later versions add an instruction cache to supplement the register cache.
Special instructions: The i960 has uninterruptible atomic add and modify instructions. Other instructions flush local registers and provide cache locking control.
Intel designers tackled two problems when designing Pentium: achieving code compatibility with earlier x86 CPUs and attaining third-generation RISC performance. Designers handled both issues by implementing the complex x86 instruction set and emphasizing simple instruction execution over the more complex ones. With Pentium, the simple, RISC-like register-to-register instructions drive the implementation; the microcoded complex instructions are second priority.
Pentium achieves a peak issue rate of two instructions per cycle through two five-stage pipelines (U and V). These pipelines are not symmetric; the U pipe takes precedence over the alternate pipe, V. If the second instruction does not cause interlocks (for example, by using results from the first instruction or by writing into the same register/data), then the second instruction is scheduled for the V pipe.
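The interlock test between the two pipes can be approximated in Python. This is a deliberately simplified sketch: the `Instr` record and `can_pair` function are hypothetical, only register dependencies are checked, and the processor's actual pairing rules cover more cases than the two shown.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str    # register written by the instruction
    srcs: tuple  # registers read by the instruction


def can_pair(first: Instr, second: Instr) -> bool:
    """True if `second` may issue in the V pipe alongside `first` in the U pipe."""
    reads_result = first.dest in second.srcs  # second needs first's result
    same_dest = first.dest == second.dest     # both write the same register
    return not (reads_result or same_dest)
```

Two instructions that touch unrelated registers pair and issue together; if the second reads or overwrites the first instruction's destination, it waits and issues in the U pipe on the next cycle.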
The U and V pipelines feed from a common instruction fetch/align stage that fetches multiple instructions from the cache. The CPU fetches and passes a full line (256 bits) to the instruction decoder. Each pipeline has two decoder stages to decode simple and complex instructions. The wide cache-to-decoder path, coupled with a two-stage decode, enables Pentium to decode the x86's variable-length instructions and deliver competitive performance.
For superscalar dual-instruction load/store operations, the Pentium data TLB and cache tags are dual ported for concurrent pipeline accesses. The data-cache SRAM is eight-way interleaved, allowing concurrent accesses to different memory banks (the cache is actually triple-ported, with an extra port for snooping). Cache hit rates range from 90 to 97%, depending on the application-code mix. The data cache handles both 4-kbyte and 4-Mbyte pages. It has two four-way, set-associative TLBs: one with 64 entries for 4-kbyte pages, and one with eight entries for 4-Mbyte pages. The code cache is also two-way set-associative, with a four-way, set-associative, 32-entry TLB that handles both 4-kbyte and 4-Mbyte pages.
The CPU uses dynamic branch prediction, allowing the CPU to determine which branch to take, as opposed to static branching, where the compiler predetermines potential branches. Pentium's 256-entry branch target buffer (BTB) holds branch target addresses for previously executed branches, unlike some implementations that hold the actual target instructions. The BTB supplies the next instruction address that the last execution of a branch instruction took. Each BTB entry integrates the target address with special history and operation bits. Intel claims that a correctly predicted branch will take a single pipeline cycle and won't cause a pipeline bubble. Simulations show performance increases 25% using the BTB.
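A branch target buffer of this kind can be sketched in Python. The organization below is an assumption made for brevity, not a description of Pentium's hardware: entries are direct-mapped by branch address, and the "history bits" are modeled as a 2-bit saturating counter that starts at weakly taken. Only the 256-entry size and the idea of storing target addresses with history come from the text.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch address to a target address plus a 2-bit history."""

    def __init__(self, entries: int = 256):
        self.entries = entries
        self.table = {}  # slot index -> [branch address, target, counter 0..3]

    def predict(self, pc: int):
        """Return the predicted target, or None to predict fall-through."""
        entry = self.table.get(pc % self.entries)
        if entry and entry[0] == pc and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the actual outcome of the branch at pc."""
        slot = pc % self.entries
        entry = self.table.get(slot)
        if entry is None or entry[0] != pc:
            if taken:  # allocate only for branches that were taken
                self.table[slot] = [pc, target, 2]
            return
        if taken:
            entry[1] = target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)
```

The first execution of a branch misses in the buffer; once it has been taken, later fetches of the same address return the stored target so the pipeline can continue from there without waiting for the branch to resolve.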
Pentium's FPU features an eight-stage pipeline, which shares the first five stages of the U and V pipeline. Data transfers to or from the FPU use a wide 64-bit data path to the data cache to keep the FPU pipeline fed. Pentium adds a write buffer to each pipeline to avoid write contention.
Pentium uses burst reads to fill its 256-bit-wide cache line. It also has burst write-back writes. The memory interface is pipelined, allowing a second bus cycle to set up while the first bus cycle completes. Pentium reads or writes a 64-bit double word each cycle in burst mode.
AMD's K5, although software- and pin-compatible with Intel's Pentium, is a unique 586-class CPU. While issuing up to four instructions per cycle, this µP marks the boundaries of the x86 instructions, enabling multiple x86 instructions to be aligned and assigned issue positions for efficient instruction processing. Every byte of code entering the processor's instruction cache is tagged with bits of associated predecode information to determine how to break the x86 instruction into a number of RISC-like operations (termed ROPs). The K5 also has a dual-ported data cache that allows two cache lines to be accessed simultaneously. The K5 can execute instructions out of order and also has extra registers, allowing the CPU to perform register renaming.
Cyrix also offers a processor that is software- and pin-compatible with Pentium. Referred to as M1, this CPU performs register renaming, multibranch prediction, speculative execution, and out-of-order completion. Realizing that M1's die size would be too large to compete at the 586 level, Cyrix developed a scaled-down M1, called the 5x86, with a 64-bit internal architecture packed into a 486 footprint.
Another of the 586-class µPs is NexGen's Nx586. The company provides a 586 that is x86-code-compatible (not pin-compatible) with Pentium. The lack of pin compatibility has limited the acceptance of NexGen's product because the CPU has limited chip-set support. Similar to AMD's approach, the Nx586 dynamically translates x86 instructions into RISC-like operations. The Nx586 also features register renaming, out-of-order execution, and speculative execution.
Intel's next-generation Pentium, code-named P6, will focus heavily on dynamic execution. The main goal of dynamic execution is to keep the P6's multiple execution units as busy as possible by performing multiple branch prediction, data-flow analysis, and speculative execution. The P6 will also perform out-of-order execution and register renaming.
MIPS processors are built around a set of 32-bit, general-purpose registers in a central register file. To minimize control logic, the instruction set is reduced to 73 instructions, and addressing options are limited. The chip has a three-address, load/store architecture. Similarly, instruction sizes are fixed to one 32-bit word to minimize decoding and speed processing.
MIPS engineers used a five-stage pipeline for the R3000. The pipeline lets up to five instructions execute concurrently, each at a different stage of its instruction cycle, thus giving the effect of single-cycle execution. The pipeline stages are instruction fetch (IF), read operands and decode instruction (RD), execute (ALU), access data memory (MEM), and write-back results (WB). A branch-delay slot minimizes branch effects. The compiler fills the instruction slot following the branch with a NOP or an instruction from the current thread that can be executed before the branch takes effect. Toshiba's R3900, an R3000 derivative, incorporates register scoreboarding to enable nonblocking loads and avoid pipeline stalls when there are no data dependencies in subsequent instructions.
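The overlap that produces apparent single-cycle execution can be shown with a short Python sketch. It assumes an ideal pipeline with one instruction entering per cycle and no stalls; the function names are illustrative, and the stage names are those listed above.

```python
STAGES = ("IF", "RD", "ALU", "MEM", "WB")


def stage_of(instr: int, cycle: int):
    """Stage occupied by instruction `instr` (0-based) in `cycle` (0-based), or None."""
    position = cycle - instr  # instruction i enters IF in cycle i
    return STAGES[position] if 0 <= position < len(STAGES) else None


def pipeline_cycles(count: int) -> int:
    """Cycles to finish `count` instructions in an ideal, stall-free pipeline."""
    return len(STAGES) + count - 1 if count > 0 else 0
```

One instruction alone takes five cycles, but 100 back-to-back instructions finish in 104, so the average approaches one cycle per instruction. In cycle 4, instructions 0 through 4 occupy WB, MEM, ALU, RD, and IF, respectively.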
The R3000 MMU includes a fully associative, 64-entry TLB that translates virtual addresses to 32-bit physical addresses. The µP uses a write-through cache policy and includes an on-chip write buffer. A small on-chip FIFO enables the CPU to refill the cache and execute instructions even while additional instructions are being read from memory, a process called instruction streaming.
Power management: The R3051 core supports a dynamic clock frequency (divide factors of 2 to 128), changed by writing a control register.
Special instructions: The R3000 uses the MIPS-I instruction set. Toshiba's R3900 adds a MAC instruction.
Second sources: IDT (Santa Clara, CA), LSI Logic (Milpitas, CA), NEC (Mountain View, CA), NKK (Santa Clara, CA), and Toshiba (Irvine, CA).
The 68000 serves as a base for the 680x0 and 683xx lines of 32-bit µPs. The 68000 is actually a 16- and 32-bit mix. It has 32-bit registers for easy addressing, a 16-bit data path and ALU to conserve silicon, and 16-bit instructions. Programmers get eight general-purpose, 32-bit data registers, which the CPU can address by bit, BCD, byte, word, or double word. In addition to user and supervisor stack pointers, 68000 chips have seven address registers. Other registers include the 32-bit PC and the 16-bit status register. The status register maintains status for the user and supervisor modes via a user byte and a supervisor byte. The processor's user and supervisor modes are implemented in hardware, which makes it easier for a control kernel or OS to manage multiple application tasks.
The 68000 has two microcode levels: microcode and second-level expanded nanocode. Instruction execution triggers a chain of 10-bit microcode words. Each microcode word can reference another word--for example, a jump in microcode or a string of 70-bit nanocode words that drive the CPU logic directly.
The CPU lacks a memory controller, but the separate address and data buses eliminate the need for buffering addresses. However, the CPU needs logic to generate the required DTACK* signal, which marks the successful completion of a memory cycle. An address decoder is necessary for multiple memory chips, and drivers may be needed to buffer bus address and data lines (integrated versions of the 68000 contain this logic). If DTACK* is late, wait states are generated.
Power management: Only the integrated versions provide variations of sleep and low-power stop modes.
Special instructions: The chip restricts privileged instructions to supervisor mode. These instructions include reset, stop, and moves and operations on the status register. To support user and supervisor modes, the hardware implements separate stacks and pushes and pops PC and status register onto the stack for exceptions. A link instruction lets you build link lists on private stacks. A special instruction lets you move up to 16 registers to or from an effective address, including blocks of data registers to or from address registers.
Second sources: Hitachi (Brisbane, CA), Philips (Santa Clara, CA), SGS-Thomson (Phoenix, AZ), and Toshiba (Irvine, CA).
Motorola's new type of RISC architecture, called ColdFire, evolved from the M68000. This architecture is also known as VL-RISC because, although the core is RISC-like, the instructions are variable length (VL). VL instructions help to attain higher code density. Another advantage of the ColdFire architecture is its reduced size (approximately 55k transistors), which is achieved by eliminating M68000 instructions that were used infrequently and by optimizing the pipeline. ColdFire continues to use the M68000 programmer's model.
ColdFire has a four-stage pipeline that consists of two subpipelines: a two-stage instruction prefetch pipeline (instruction address generation and instruction fetch cycle) and a two-stage operand execution pipeline (decode and select/operand fetch cycle and operand address generation/execute cycle). A 12-byte FIFO instruction buffer decouples the two pipelines. The prefetch pipeline calculates the next instruction address and then fetches 32 bits of instruction data. The operand pipeline has a dual read-ported register file feeding an arithmetic/logic unit.
The CPU core is separated from on-chip peripherals by using a modular, standard bus architecture; the core communicates with on-chip memories using a tightly coupled processor bus.
On-chip debug supports real-time trace, real-time and nonreal-time debug, and access to control registers to define types of memory regions, such as cacheable copyback, write through, and noncacheable.
Power management: A low-power stop instruction (LPSTOP) shuts down active circuits in the processor and halts instruction execution. Processing resumes via reset or valid interrupt.
Special instructions: ColdFire added the following instruction extensions to the 68000 architecture: 32x32-bit integer multiply, register sign-extension instructions, and multiple-word NOPs used by compilers to remove branch instructions.
The 680x0 architecture is built around 16 general registers with a 68000-compatible, orthogonal instruction set. The 680x0 has more registers than the original 68000. Control registers were added to control the MMU and the FPU as well as support additional processing capabilities. For example, the 68040 adds eight 80-bit floating-point registers and 12 control registers, which include a vector-base register (points to interrupt vector table), cache-control register, user and supervisor root pointers, and translation registers.
The superscalar 68060 heads the 680x0 lineup with its dual integer and floating-point pipelines. As instructions enter the CPU, they flow into a four-stage prefetch pipeline: instruction address generation, instruction cycle, instruction early decode, and instruction buffer. In this pipeline, the 68000-compatible, variable-length CISC instructions get converted to a 32-bit fixed-length instruction. Once converted, these instructions enter dual, four-stage integer-execution pipelines that operate synchronously. The four stages of the execution pipeline are decode, effective address calculation, fetch, and integer execution. This pipeline dispatches instructions to the FP and allows for some execution overlap between the integer and FP engines.
A Harvard architecture allows the 68060 to perform simultaneous instruction fetches and data accesses. The on-chip caches are four-way set associative with four-way interleaving to support simultaneous read and write operations. Portions of the caches can be frozen to prevent reallocation.
The 68040 implements a six-stage pipeline (fetch, decode, effective-address calculate, effective-address fetch, execute, and writeback). To speed processing, it has two on-chip 4-kbyte direct-mapped caches and separate data and instruction MMUs, which allow simultaneous address translations. Bus snooping is built into the 040's caches to ensure cache coherency for multiprocessing. Both write-through and copy-back modes are built into the cache. The 68020 and 68030 CISC implementations have smaller caches; the 030 and 040 versions implement burst mode, moving up to 16 bytes in a single addressing block between registers and memory.
The 040 and 060 deliver apparent single-cycle execution for some instructions, mainly register operations such as memory-to-register moves (if the data is in the data cache). A taken branch takes two cycles; a not-taken branch takes three cycles. On the 68060, a 256-entry, four-way, set-associative, on-chip branch cache allows taken and nontaken branches to execute in zero and one clock, respectively. The branch cache unit contains state bits that provide a history of branch executions, which helps to predict branch direction.
Unlike the 020/030, the 040 and 060 do not perform dynamic bus sizing. Instead, they have a highly reliable bus with a high-drive option capable of implementing a synchronous, two-clock R/W protocol. A four-word burst takes five clocks. Multiprocessor bus arbitration is built into the 040 but requires off-chip logic. Externally, the 68060 bus is a superset of the 68040's bus. Additional signals support higher performance system designs, but the processor can easily operate on an existing 68040-based bus. An on-chip MMU with separate instruction and data TLBs allows the 68060 to access up to 4 Gbytes of memory.
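The bus timings above translate into peak transfer rates with simple arithmetic. The sketch below assumes that a "word" in the four-word burst is a 32-bit longword (16 bytes per burst) and uses a hypothetical 40-MHz bus clock; neither figure is a specification of a particular part.

```python
def bus_mbytes_per_s(bytes_moved, clocks, bus_mhz):
    """Peak rate in Mbytes/s for a transfer of bytes_moved in the given clocks."""
    return bytes_moved * bus_mhz / clocks


BUS_MHZ = 40  # hypothetical bus clock, for illustration only

single = bus_mbytes_per_s(4, 2, BUS_MHZ)   # one longword, two-clock protocol
burst = bus_mbytes_per_s(16, 5, BUS_MHZ)   # four longwords in five clocks
```

Under these assumptions the burst moves data 1.6 times as fast as back-to-back single transfers.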
Power management: To support power management, the 68060's functional units respond to dynamically controlled clocking; the caches and execution units power down when not accessed. The static design allows the external clock to be reduced or stopped, and an LPSTOP instruction (low power stop) disconnects most of the chip from the CLK pin.
Special instructions: The CPUs have special instructions for variable-length bit fields, moving 16 registers, and compare and swap, which locks memory for multiprocessing. A scaling option addresses data by item size for table access, FPU, and MMU commands.
The 68060 has a special MOVE16 instruction to perform a 16-byte block move and a PLPA instruction that loads a physical address by translating a logical address. A TBL instruction performs a table lookup and interpolates the data.
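A table lookup with interpolation can be sketched in a few lines. The operand layout used here (high byte selects the table entry, low byte is a fraction in 1/256 steps), the truncating arithmetic, and the function name are assumptions for illustration, not a description of the instruction's exact encoding or rounding.

```python
def table_lookup_interpolate(table, x):
    """Linear interpolation between adjacent table entries.

    x is treated as a 16-bit value: bits 15-8 pick the entry,
    bits 7-0 give the fractional distance toward the next entry.
    """
    index = (x >> 8) & 0xFF
    frac = x & 0xFF
    low = table[index]
    high = table[index + 1]
    # Floor division keeps the result an integer (no rounding modeled).
    return low + ((high - low) * frac) // 256
```

With `table = [0, 100, 400]`, an input of `0x0080` falls halfway between the first two entries, and `0x0140` falls one quarter of the way between the second and third.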
For most of the 683xx family, Motorola combined a stripped-down 68020 core with a 16-bit (32-bit for CPU32+) on-chip InterModule bus, which links the CPU with a device's complex peripherals. The core processor, the CPU32 or CPU32+, is the 68020 CPU stripped down for embedded control (no MMU or FPU interface) combined with a 16- or 32-bit data bus, respectively. The 32-bit processor has eight general-purpose 32-bit registers; seven 32-bit address registers; a 32-bit ALU; and separate user and supervisor modes, each with its own stack and separate address and data spaces. The CPU32 is code-compatible with the 68020 but has enhanced addressing modes, including scaled index, address register indirect with base displacement and index, PC relative, and 32-bit branch displacements. Postincrement and predecrement addressing options simplify iterative code. Peripheral-control registers and I/O are memory mapped; the CPU accesses them as addresses in memory.
Most 683xxs have a system-integration module featuring system configuration, oscillator and clock dividers, reset and powerdown-mode control, chip selects and wait states, parallel I/O with interrupt capability, interrupt configuration/ response, and a software watchdog. The external bus interface has up to 32 address and 16 data lines (32 for CPU32+) and up to 12 programmable chip-select lines.
Power management: The LPSTOP instruction stops the clock. Devices can run at clock rates as low as 131 kHz with a 32-kHz crystal.
Special instructions: 68020 instructions not supported include BCD pack/unpack, bit field, compare and swap, coprocessor, MMU, module call/return (memory indirect addressing also not supported). New instructions include a table look-up and interpolate, as well as the ability to put the chip into a low-power standby mode.
Serving as a base for a family of RISC chips, the PowerPC derives its core architecture from the POWER (performance-optimized with enhanced RISC) architecture. The instruction set supports multiple microarchitecture implementations, including the 32-bit 601, 602, 603, and 604; embedded processors (Motorola's MPC505, MPC860, and MPC821 and IBM's 400 series); and the 64-bit 620.
The PowerPC 620 has a five-stage pipeline: fetch, dispatch, execute, complete, and writeback. Its superscalar design controls six independent execution units: three integer, branch, floating-point, and load/store. Each unit contains two to four reservation units for holding instructions so the CPU can reduce data dependencies and pipeline bubbles.
The 620 can perform both static and dynamic branch prediction. It also performs out-of-order execution and in-order graduation.
The 620 has a 128-bit, level-two cache interface. The unified L2 cache can be clocked at one, one-half, or one-third the processor clock frequency. The 128-bit data-bus interface supports a split-transaction, pipelined snoop bus protocol. An on-chip MMU converts 80-bit virtual addresses to 64-bit physical addresses and uses a 128-entry, two-way, set-associative shared TLB.
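The L2 clock ratios and the TLB organization lend themselves to quick back-of-the-envelope checks. The sketch below assumes one full-width 128-bit transfer per L2 clock and uses a hypothetical 150-MHz processor clock; both are illustrative assumptions, as are the function names.

```python
L2_BUS_BYTES = 128 // 8  # 128-bit L2 interface moves 16 bytes per transfer


def l2_peak_mbytes_per_s(cpu_mhz, divisor):
    """Peak L2 rate, assuming one full-width transfer per L2 clock."""
    if divisor not in (1, 2, 3):
        raise ValueError("L2 clock is 1, 1/2, or 1/3 of the processor clock")
    return L2_BUS_BYTES * cpu_mhz / divisor


def tlb_sets(entries=128, ways=2):
    """Number of sets in a set-associative TLB of the given size."""
    return entries // ways
```

At the assumed 150 MHz, the three ratios give peak rates of 2400, 1200, and 800 Mbytes/s, and the 128-entry, two-way TLB works out to 64 sets.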
The PowerPC 604 can issue up to four instructions per cycle. It has the same types of execution units as the 620; however, each execution unit contains only two reservation units. The 604 uses a six-stage pipeline: fetch, decode, dispatch, execute, completion, and writeback. The 604 also performs dynamic branch prediction and speculative execution and, like the 620, performs out-of-order execution. Similar to the 620, the 604 supports the MESI protocol for cache coherency for multiprocessor systems.
The 603 comprises five parallel execution units: integer execution, floating point, branch, system, and load/store. Combined with a four-stage pipeline (fetch, dispatch, execute, and complete), the 603 can achieve three instructions per clock cycle. During the fetch stage, the 603 uses a six-instruction prefetch queue to hold pending instructions. Unlike other PowerPC derivatives, the 603 supports only static branch prediction. However, the architecture supports out-of-order execution and in-order retirement, similar to other PowerPC devices.
The PowerPC 602 is a cost- and power-reduced implementation of the 603. It has four parallel-execution units: fixed point, FPU, branch processing, and load/store. The scalar design of the 602 limits instruction issue to one instruction per cycle. The 602 performs branch folding to help eliminate branches. Its FPU stores and calculates only single-precision values.
The PowerPC 601 can issue up to three instructions per clock cycle using three major functional units: the instruction unit for integer operations, the FPU, and the branch unit (BU), all of which execute concurrently. The instruction unit fetches the instructions, queues eight instructions for decoding, and then issues them to the execution units.
To minimize TLB exceptions, the 601 has a large 256-entry TLB. It also has four-entry shadow TLBs for fast access to the most recently accessed entry. The 601 has branch prediction and branch folding, eliminating branch instructions where possible. The BU searches the bottom half of the instruction queue for branches and uses static prediction to cause the target instruction thread to be accessed for execution.
The embedded PowerPC processors include Motorola's MPC500 family and IBM's 400 series devices. Compared with other PowerPC devices, these devices have similar, but fewer, execution units and issue only one instruction at a time. The MPC500 uses an InterModule bus (originally developed for Motorola's 683xx devices) as a backplane to connect all system modules. The MPC500 family includes a system-integration unit that enables simple integration with external memories, other CPUs, and peripheral devices.
Power management: The 603 has a dynamic power-management feature that includes clock stopping and reducing signal activity whenever possible. For example, when not in use, the FPU, system unit, load/store unit, or caches are automatically turned off. The CPU has three incremental and automatic power-management modes: doze (functional units are disabled except for time base/decrementer registers and bus snoop logic), nap (disables bus snoop), sleep (disables all internal functional units).
Special instructions: All PowerPC 6xx devices perform single-cycle floating-point MAC. PowerPC 604, 603, and 620 support graphics instructions: SQRT approximation and inverse SQRT approximation. The 403Gx has load/store for multiple registers and byte strings as well as extensive cache-manipulation and semaphore-handling operations.
NEC's V850 µC family is based on the company's proprietary 32-bit RISC architecture, which consists of a five-stage pipeline, 32 general-purpose registers, a 32-bit barrel shifter, and a hardware multiplier. The pipeline stages are fetch, decode, execute, memory access, and write back. Most instructions execute in one clock and are two bytes long, allowing smaller code size. The CPU has a pipeline-stall feature that automatically inserts a bubble into the pipeline to avoid data dependencies and hazards.
A bus-control unit (BCU) generates a prefetch address to prefetch an instruction code from external memory and store it in the four-doubleword prefetch queue. For accesses from internal ROM, instructions go straight to the CPU (that is, not through the prefetch queue). Instruction fetches from internal ROM consume one cycle; data fetches from ROM require three cycles. Therefore, you should shadow lookup tables and fixed data structures to the CPU's internal RAM, where data can be accessed in one clock. The BCU also provides a bus hold function, allowing other devices, such as DMA, to share and take control of the V851's external bus.
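The payoff from shadowing a table can be estimated from the cycle counts above. The sketch below counts only data-access cycles; it assumes a one-cycle RAM write during the copy and ignores instruction overhead, and the function name and table size are hypothetical.

```python
ROM_DATA_CYCLES = 3  # data fetch from internal ROM
RAM_DATA_CYCLES = 1  # data access to internal RAM (write cost is an assumption)


def lookup_cycles(n_lookups, shadowed, table_words=0):
    """Data-access cycles for n_lookups table reads, with or without shadowing."""
    if not shadowed:
        return n_lookups * ROM_DATA_CYCLES
    # One-time copy: read each word from ROM, then write it to RAM.
    copy_cost = table_words * (ROM_DATA_CYCLES + RAM_DATA_CYCLES)
    return copy_cost + n_lookups * RAM_DATA_CYCLES
```

For a 64-word table read 1000 times, the unshadowed version spends 3000 cycles on data fetches, while the shadowed version spends 1256, including the copy.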
Peripherals are accessed as memory-mapped I/O and are connected to the CPU through a 16-bit bus. ROM and RAM communicate to the CPU using a 32-bit bus. Although the first member of this family, the V851, has 32 kbytes of ROM and 1 kbyte of RAM, the V850 architecture allows internal expansion to 1 Mbyte of ROM and 4 kbytes of RAM. Similarly, the external bus of the V851 addresses up to 16 Mbytes (the architecture allows access up to 4 Gbytes on future chips). The V850's memory space divides into 1-Mbyte unit blocks, and wait states can be inserted into a bus cycle for every two blocks.
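The block-based wait-state scheme amounts to decoding the upper address bits. In the sketch below, the pairing of blocks (0 with 1, 2 with 3, and so on) and the function names are assumptions for illustration; the text states only that wait states are set for every two 1-Mbyte blocks.

```python
BLOCK_SIZE = 1 << 20  # 1-Mbyte unit block


def block_index(addr):
    """Which 1-Mbyte block an address falls in."""
    return addr // BLOCK_SIZE


def wait_state_group(addr):
    """Which two-block group supplies the wait-state setting (assumed pairing)."""
    return block_index(addr) // 2
```

With the V851's 16-Mbyte external space, this yields 16 blocks and eight independently configurable wait-state groups.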
Power management: The V851 supports the following power-save or standby modes: halt, idle, software stop, and clock-output inhibit. In halt mode, the clock generator continues to operate but the CPU clock stops, allowing the on-chip peripherals to function. Idle mode stops the CPU clock and internal system clock; however, because the clock generator continues to run, normal operation can resume without having to wait for oscillator and PLL stabilization. In stop mode, everything stops, but register and memory contents stay intact.
Special instructions: NEC's V850 devices support a software trap instruction. The CPUs also perform saturate operations where the CPU stores the maximum values if addition results overflow. For example, if the result exceeds the positive-value 7FFFFFFFh, 7FFFFFFFh is stored in the result registers and the CPU sets the saturation flag.
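The saturating behavior can be modeled as a clamp on the 32-bit signed range. The text describes only the positive case; clamping negative overflow to 80000000h is assumed here by symmetry, and the function name and the returned flag are illustrative rather than a model of the actual status register.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def saturating_add(a, b):
    """Add two signed 32-bit values; return (result, saturated_flag)."""
    result = a + b
    if result > INT32_MAX:
        return INT32_MAX, True   # positive overflow clamps to 7FFFFFFFh
    if result < INT32_MIN:
        return INT32_MIN, True   # negative overflow (assumed symmetric clamp)
    return result, False
```

Adding 1 to 7FFFFFFFh therefore returns 7FFFFFFFh with the flag set instead of wrapping around to a negative value.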
MicroSPARC processors are built around a large, multiported register file that breaks down into a small set of global registers for holding global variables and sets of overlapping register windows. Each 24-register window has a core of eight local registers, supplemented by eight registers shared with the previous window and eight shared with the next. The overlapping registers eliminate the need to save and restore registers on function calls, returns, or context switches between tasks.
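The window overlap is easiest to see as a mapping from register numbers to physical slots. The sketch below follows the SPARC convention of r0-r7 globals, r8-r15 outs, r16-r23 locals, and r24-r31 ins, with a call moving to the next-lower window number. The physical slot layout, the default of eight windows, and the function names are assumptions for illustration, not a description of any particular chip's register file.

```python
def physical_register(cwp, reg, nwindows=8):
    """Map register r0..r31 in window cwp to a (bank, slot) pair."""
    if not 0 <= reg < 32:
        raise ValueError("register number must be 0..31")
    if reg < 8:
        return ("global", reg)            # globals are shared by all windows
    total = nwindows * 16                 # each window adds 16 unique registers
    if reg < 16:                          # outs: shared with the callee's ins
        slot = cwp * 16 + (reg - 8)
    elif reg < 24:                        # locals: private to this window
        slot = cwp * 16 + 8 + (reg - 16)
    else:                                 # ins: the caller's outs
        slot = (cwp + 1) * 16 + (reg - 24)
    return ("windowed", slot % total)


def total_registers(nwindows):
    """Eight globals plus 16 unique registers per window."""
    return 8 + 16 * nwindows
```

In this model a caller in window 3 writes an argument to r8, and the callee in window 2 reads the same physical register as r24, so no copy is needed. With eight windows the file holds 136 registers.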
Sun's MicroSPARC has a five-stage pipeline: fetch, decode, memory access, execute, and write-back. It also has a four-entry write buffer to prevent write stalls. An integrated floating-point unit contains 32x32-bit FP registers, a general-purpose execution unit, and an FP multiplier. A three-instruction-deep queue of FP instructions helps increase concurrency with integer execution.
MicroSPARC's MMU uses three high-order bits of physical address to map eight address spaces. The MMU controls arbitration among I/O, data cache, instruction cache, and TLB references to memory. It contains a 64-entry fully associative TLB and supports 256 contexts.
The processors have a separate 64-bit memory interface that handles up to 128 Mbytes (256 Mbytes for MicroSPARC II) of 16-Mbit DRAM. An on-chip SBus interface and controller handles five SBus slots. (SBus is a 25-MHz, 32-bit synchronous bus.)
Special instructions: MicroSPARC processors comply with the instructions listed within the SPARC V8 specification.
Operations within Sun's SuperSPARC center around the 136-entry, eight-port register file. Registers group into eight global registers and eight overlapping register windows. The register file handles six reads (three two-operand reads) and two writes; the file can perform two reads and two writes concurrently but is time-shared to handle six reads and two writes in one system-clock cycle.
The superscalar SuperSPARC doubles the system clock to run its pipeline stages. The eight stages are grouped into four execution stages (fetch, decode, execute, and write-back) of different lengths. The eight stages are cache access; send matched instructions to scheduler; issue instructions; read address registers/evaluate branch-target address; read operand from register file; first ALU stage; second ALU stage; and write-back result.
The CPU runs eight functional units, which include three integer ALUs, load/store, branch, floating-point multiply, floating-point add, and shift. The adder units are organized so that two can execute concurrently and return results to the register file or feed into the third ALU. That ALU can then operate on the results and return a value to the register file in one pipeline cycle. Thus, SuperSPARC can do three adds in one cycle, where one add is dependent on the first two results.
The multiply and add floating-point units are pipelined; they can accept a new instruction every clock cycle but have a three-cycle latency. The FPU has its own instruction queue and 16 64-bit registers.
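The difference between issue rate and latency shows up when counting cycles for a run of operations. The sketch below assumes an idealized pipeline with no stalls other than data dependencies, and it assumes a dependent operation must wait the full three-cycle latency of its predecessor; the function name is hypothetical.

```python
def fp_cycles(n_ops, latency=3, dependent=False):
    """Cycles to complete n_ops floating-point operations in one pipelined unit."""
    if n_ops == 0:
        return 0
    if dependent:
        # Each operation needs the previous result, so latencies add up.
        return n_ops * latency
    # Independent operations issue back to back; only the first pays full latency.
    return latency + (n_ops - 1)
```

Ten independent multiplies finish in 12 cycles under this model, whereas a chain of ten dependent multiplies takes 30.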
The CPU addresses its instruction and data caches physically. The instruction-cache path is 128 bits to handle superscalar operation. Four instructions are presented simultaneously to the eight-deep prefetch queue. A single TLB supports both caches. It has 64 entries and does two TLB evaluations in one clock cycle.
SuperSPARC runs in stand-alone mode by interfacing to the MBus. The processor can run in cache-controller mode by interfacing to an external cache controller via the VBus, a nonmultiplexed, proprietary bus (CPU clock rate, 36-bit address, 64-bit data). The VBus links to a cache controller and up to 2 Mbytes of unified secondary-cache SRAM. The cache controller can handle multiprocessing (more than one SPARC CPU on an MBus).
Branch-delay slots and a branch-target queue minimize branch penalties. A branch-delay slot following the current set of instructions gives the hardware time to prefetch both the target set and the next sequential set of instructions.
Power management: SuperSPARC does not implement any power-saving features.
Special instructions: SuperSPARC processors comply with the instructions listed within the SPARC V8 specification.