

Design Feature: September 12, 1996

32-BIT CHIPS


AMD 29000


The 29000 (29K) is a difficult processor family to classify, because the family comprises three product lines, including three-bus Harvard-architecture processors, two-bus processors, and µCs with on-chip peripheral support. The core of the 29K is built around a simple four-stage pipeline: fetch, decode, execute, and write-back. Processor hardware implements pipeline interlocks.

The 29K has a triple-ported register file of 192 32-bit registers, allowing the CPU to handle two operand-register reads and one register-result write in a single cycle. The register file can also perform on-chip stack operations and register windowing.

Most 29K instructions specify three addresses: two registers as inputs and a register for the result. Eight-bit addresses specify a register in the register file. The addresses divide into local and global; 128 registers are in the local set.

A two-way, set-associative, branch-target cache (not available in all versions) holds as many as 128 instructions. The first four instructions of a branch target, which are held in the cache, wait for the branch to repeat. When the branch repeats, the cache provides the instructions, thus keeping the branch penalty to one cycle.

The 29K memory interface provides single-cycle burst access for DRAM, page-mode DRAM, or ROM. The interface can pump instructions at the CPU clock rate to deliver cache-like performance. A DRAM controller, the Am29C668, runs both standard and page-mode DRAM for the 29K but requires external control logic and DRAM buffers. The 29200 incorporates a DRAM and ROM controller with flash and SRAM support.

Power management: Only the 29040 supports power-management modes. A sleep mode stops the clock but maintains the internal cache and registers. Snooze mode keeps the clock running but shuts down the remaining internal circuitry.

Special instructions: The µPs can operate in either user or supervisor mode. In user mode, an illegal action by an executing program causes a protection-violation trap. Special compare instructions put their results into a general-purpose register instead of using condition codes in a status register, making processing more general because the register can hold multiple conditions. Assert instructions compare conditions and, if the condition is not true, cause a software trap.

ARM Processors

Advanced RISC Machines (ARM) designs µPs and cores for its licensees (including Atmel, Cirrus Logic, Digital Semiconductor, GEC Plessey, NEC, Oki, Samsung, Sharp, and VLSI). The ARM cores implement a load/store architecture and have 31 general-purpose registers with 16 visible at a time. The fast-interrupt mode has seven private registers to minimize state, which saves on processing overhead. All registers are general-purpose, including the program counter, although a set of conventions, called the ARM Procedure Call Standard, governs their use for C compatibility.

The bus clock for most ARM µPs can be synchronous or asynchronous with respect to the internal cache clock. All ARM µPs contain a write buffer, which lets execution continue while writes are pending. The buffer holds eight words at four independent addresses. ARM µPs also incorporate a coprocessor interface, although, to reduce die size, this interface is not brought out to the pins on ARMx10 models.

The ARM µPs support user and supervisor modes for controlling access; they handle four exception-processing modes: interrupt request, fast interrupt request, abort, and undefined. Modes use different register windows to overlay some of the 16 general-purpose registers.

There are three main architectural variations in the ARM processor family: ARM6/7, ARM8, and StrongARM1. The ARM6 and ARM7 µP cores have a three-stage pipeline (fetch, decode, and execute) to achieve single-cycle instruction execution. Both cores use a Booth hardware multiplier that operates on two operand bits at a time to build a final product. A 32-bit multiply takes up to 16 cycles, though smaller numbers terminate early. The ARM7M variant contains a faster 8-bit Booth multiplier, which executes a 32×32-bit multiply in four cycles or fewer and offers 64-bit multiplication.
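The two-bits-per-step multiply with early termination can be illustrated with a small behavioral model. This is only a sketch of radix-4 Booth recoding under stated assumptions: the function name, the step counter, and the exact termination test are hypothetical and are not taken from ARM's circuit, whose cycle counts may differ.

```python
def booth_radix4_multiply(x, y, bits=32):
    """Multiply signed integers x * y, consuming two multiplier bits per step.

    Returns (product, steps). 'steps' is a stand-in for cycles (an assumption,
    not a measured figure).
    """
    mask = (1 << bits) - 1
    y_u = y & mask          # two's-complement view of the multiplier
    product = 0
    prev = 0                # implicit bit to the right of bit 0
    steps = 0
    for i in range(0, bits, 2):
        b0 = (y_u >> i) & 1
        b1 = (y_u >> (i + 1)) & 1
        digit = prev + b0 - 2 * b1          # Booth digit in {-2, -1, 0, 1, 2}
        product += (digit * x) << i
        prev = b1
        steps += 1
        # Early termination: if every remaining bit equals the last bit seen,
        # all remaining Booth digits are zero and the product is complete.
        upper_bits = bits - (i + 2)
        remaining = y_u >> (i + 2)
        if remaining == ((1 << upper_bits) - 1 if prev else 0):
            break
    return product, steps


print(booth_radix4_multiply(1000, 3))
print(booth_radix4_multiply(-9, 0x55555555))
```

In this model a small multiplier finishes in one or two steps, while a multiplier whose significant bits reach the top of the word needs all 16.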

The new ARM8 and StrongARM cores are implementations of the ARM Version 4 architecture. The ARM8 doubles the performance of the ARM7, and the StrongARM provides a fourfold performance increase over the ARM7. The ARM8 and StrongARM are similar in that they both implement a five-stage pipe (fetch, decode, ALU, cache, and write-back). However, the cores differ in their bus architectures; StrongARM uses a Harvard approach, and the ARM8 uses a von Neumann architecture to save die area. Both cores try to avoid excess pipeline flushes; StrongARM uses early branch execution, and ARM8 uses static-branch prediction (always taking the rearward branch, as in a loop).

The first silicon implementation of StrongARM is the SA-110. The µP has separate instruction and data memory-management units. The translation look-aside buffers (TLBs) have 32 entries that can each map a segment, large page, or small page, and use a round-robin replacement algorithm. The data TLB supports both the flush-all and the flush-single-entry functions, and the instruction TLB supports only the flush-all function.

Special instructions: There are 11 basic types of fixed-length instructions that execute conditionally (not just branch) and reduce the need for short pipeline-flushing branches. A not-taken instruction executes in one cycle. Taken branches incur a three-cycle delay. The 16 execution-condition codes include equal, not equal, always, negative, and overflow. The ARM lacks explicit shift instructions; instead, all ALU operations can perform an optional shift operation in the same execution cycle. The processors have block-data-transfer instructions to load and store data from any subset of the 16 general-purpose registers.
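Conditional execution can be modeled as a predicate on the four status flags. The sketch below uses the standard ARM condition mnemonics and their flag tests; the function names and the dictionary-based flag passing are illustrative choices, not part of any ARM tool.

```python
# The 16 condition mnemonics, indexed by the 4-bit condition field.
CONDITIONS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]


def condition_passed(cond, n, z, c, v):
    """Evaluate a condition mnemonic against the N, Z, C, and V flags."""
    tests = {
        "EQ": z,                    # equal
        "NE": not z,                # not equal
        "CS": c,                    # carry set
        "CC": not c,                # carry clear
        "MI": n,                    # negative
        "PL": not n,                # positive or zero
        "VS": v,                    # overflow
        "VC": not v,                # no overflow
        "HI": c and not z,          # unsigned higher
        "LS": (not c) or z,         # unsigned lower or same
        "GE": n == v,               # signed greater or equal
        "LT": n != v,               # signed less than
        "GT": (not z) and n == v,   # signed greater than
        "LE": z or n != v,          # signed less or equal
        "AL": True,                 # always
        "NV": False,                # never
    }
    return bool(tests[cond])


def conditional_add(cond, flags, rd, rn, operand):
    """Model of ADD<cond> rd, rn, operand: rd is unchanged if cond fails."""
    return rn + operand if condition_passed(cond, **flags) else rd


flags = dict(n=True, z=False, c=False, v=False)
print(conditional_add("MI", flags, rd=1, rn=2, operand=3))
```

A skipped instruction simply leaves its destination untouched, which is why short forward branches around one or two instructions become unnecessary.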

ARM processors lack an integer-divide instruction; fast divide and divide-by-10 are provided by support software. However, the chips do have MAC instructions. The MAC instruction speeds math-intensive applications. Division and multiplication by a constant can be performed quickly using the barrel shifter (for example, division by four takes one cycle, as does multiplication by five).
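The two constant operations mentioned above reduce to a single shifted-operand instruction each. The following sketch models them on 32-bit unsigned values; the helper names are hypothetical, and the assembly in the comments shows one plausible encoding rather than compiler output.

```python
WORD_MASK = 0xFFFFFFFF  # model a 32-bit register


def times_five(x):
    """x * 5 as x + (x << 2), e.g. ADD r0, r0, r0, LSL #2."""
    return (x + (x << 2)) & WORD_MASK


def divide_by_four(x):
    """Unsigned x / 4 as a logical shift right, e.g. MOV r0, r0, LSR #2."""
    return (x & WORD_MASK) >> 2


print(times_five(7), divide_by_four(100))
```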

ARM has developed an architectural extension, Thumb (TDMI), that is primarily a 16-bit subset of the 32-bit instruction set. At runtime, the Thumb module, residing within the instruction pipeline, decompresses the 16-bit instructions back to 32-bit instructions without added delay. Although the Thumb module adds about 6% to the core's die size, it helps increase code density and overcome waste associated with using 32-bit fixed-length instructions.

Fujitsu 8693X

Fujitsu's MB8693X family (also known as SPARClite) is based on the V8E spec (SPARC International's embedded specification). The family features a 32-bit ALU and uses a load/store architecture with a register stack of 136 32-bit registers. (The 86933 and 86933H chips have 104.) Eight reserved registers hold global values. The remaining registers arrange into six or eight overlapping register windows, one window for each subroutine. This setup speeds procedure calls and interrupt processing. Multiple contexts can be present concurrently by limiting the number of registers for a task.
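The register counts follow from the window overlap: each window contributes 16 registers of its own (eight locals and eight outs), and its eight ins are the outs of the neighboring window, so 8 + 8 × 16 = 136 and 8 + 6 × 16 = 104. A minimal sketch of the mapping, assuming the usual SPARC register naming (r0 to r7 globals, r8 to r15 outs, r16 to r23 locals, r24 to r31 ins); the physical numbering and function names are illustrative only.

```python
GLOBALS = 8
REGS_PER_WINDOW = 16  # 8 outs + 8 locals unique to each window


def register_file_size(windows):
    """Total physical registers for a given number of windows."""
    return GLOBALS + windows * REGS_PER_WINDOW


def physical_register(cwp, r, windows=8):
    """Map visible register r (0..31) in window cwp to a physical index."""
    if not 0 <= r < 32:
        raise ValueError("register number out of range")
    if r < GLOBALS:
        return r  # globals are shared by every window
    span = windows * REGS_PER_WINDOW
    # The ins (r24..r31) of window cwp land on the outs of window cwp + 1.
    return GLOBALS + (cwp * REGS_PER_WINDOW + (r - GLOBALS)) % span


print(register_file_size(8), register_file_size(6))
# A call moves to window cwp - 1; the callee's r24 is the caller's r8.
print(physical_register(2, 24) == physical_register(3, 8))
```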

Fujitsu engineers extended the SPARC pipeline to five stages for the 8693X: fetch, decode, execute, memory, and write-back. The memory stage minimizes the effects of load/store operations and reduces a load/store to one-cycle execution. The stage is idle for nonload/store operations.

All 8693X µPs have separate data and instruction caches. (The 86933H has instruction cache only.) The caches are two-way, set-associative and have 16- or 32-byte cache lines. Critical lines can be locked on chip and not swapped out. The µPs also incorporate a debug-support unit and emulator bus, which makes instruction streams visible even in on-chip cache. Debug registers hold data values or addresses for individual and range breakpoints.

The 8693Xs run with DRAM, SDRAM, SRAM, and ROM/EPROM. The memory interface handles a 32-byte burst mode into page-mode DRAM. The memory interface includes a refresh generator for DRAMs, programmable wait states for slower memory, and programmable chip selects for memory banking. Boot-up memory interfaces are programmable; most 8693X CPUs can boot up from 8-, 16-, or 32-bit ROM/EPROM.

Power management: The device shuts down the FPU under control of a power-management register.

Special instructions: The 8693X implements the SPARC V8 specification, which includes a full multiply instruction and division via software using divide-by-four. Other special instructions include scan word, which looks for the first changed bit or the first 1 or 0; load/store double word; save/restore caller (uses register windows); tagged add/sub (generates overflow if bits 0 and 1 of an operand are not 0); atomic math and swap; and generate trap from conditions.

Hitachi SH Series

The SH family of RISC µPs/µCs uses a five-stage pipeline: fetch, decode, execute, memory access, and write-back to register. The CPU is built around 25 32-bit registers that are accessed using load/store instructions. These registers comprise 16 general registers (SH3 has eight 32-bit shadow registers for context switching), five control registers, and four system registers. The chips use 32-bit datapaths to move data internally, but all versions use a flexible external bus width.

SH processors use a 16-bit instruction word to achieve fairly compact code. The 16-bit instruction width limits the number of basic opcodes, handles only 16 general registers, and can address only two operands. Additionally, only 12 bits are available for an immediate offset; jumps must be in 2048 hops. Although these instruction-set restrictions lead to an increased number of instructions per task, overall, there is a significant reduction in code size.

The SH-1 µPs can run from external memory or from on-chip program memory. The 16-bit-wide external memory bus can supply the CPU with instructions from SRAM or fast DRAM on each cycle. If the processor is running from external memory, each data access to external memory may take one to two additional cycles.

Instead of on-chip program memory, the SH2 (SH7604) and SH3 (SH7702, SH7708) have a four-way, set-associative on-chip cache, a 32-bit-wide memory bus for high CPU memory bandwidth (only 16 bits wide for SH7702), and a full 32-bit divide unit (replacing the first chip's bit-step-divide function) on the SH-2. The cache can be reconfigured as a two-way, set-associative cache and 2 (SH7604) or 4 kbytes (SH7708) of user-configurable RAM. The external memory bus supports multiprocessing; the bus has bus arbitration for multiple masters.

Power management: Sleep mode discontinues CPU processing but keeps peripherals active. Standby stops everything but maintains register/cache contents. The SH-2 and SH-3 provide several clock modes for reducing power; software can adjust the clock rate during program operation. The SH-3's unified cache has a special low-power design that dissipates only 100 mW in operation. The cache-sense amps are energized only for the set that hits, while the other three sets stay switched off. The sense amps respond to a differential of only 60 mV (vs the full 3.3V swing).

Special instructions: A 16×16-bit MAC instruction (42-bit accumulator) in the SH-1 and a 32×32-bit MAC instruction (64-bit accumulator) in the SH2 and SH3 provide a fast DSP function. Although classified as a load/store architecture, some instructions reference memory. Delayed branch instructions minimize pipeline disruption. An instruction swaps upper and lower bytes.

hyperstone E1-32

The hyperstone E1-32 combines RISC and DSP technology in a unified core. The E1-32 has a load/store architecture built around a register set that includes 64 general-purpose local and 22 global registers. Local registers are organized into a 64-word, circular register stack to hold function/subroutine stack frames. The stack comprises frames of up to 16 words; current frames are kept on chip and are automatically pushed to off-chip memory as the register stack fills up. Conversely, frames that are off chip are placed back on chip when memory is available. For fast parameter passing, the current stack frame can overlap with the previous one with a variable range.

Instruction length varies among 16, 32, and 48 bits. The variable-length instructions, which the E1-32 automatically prefetches, provide up to 32-bit constants and 32-bit native addresses.

The 4-Gbyte address space divides into four blocks; you can configure each block individually for bus width and timing. The E1-32 integrates a fast-page-mode DRAM controller in one of the block spaces. The other blocks can be used for glueless connection of SRAM, EPROM, or other memory devices, each with its own timing and bus width. A separate I/O address space also allows each I/O device to have its own timing.

The integrated DSP unit, working in parallel with the ALU, can perform DSP-calculations while the ALU is performing loop counts, address calculations, or load/store operations. The ALU executes its instructions during the latency cycles of DSP instructions. The DSP unit shares the E1-32's functional blocks, including the register set; however, it does provide dedicated result registers and 32- and 64-bit hardware accumulators. The DSP unit supports 16- and 32-bit data types.

Power management: In automatic power-down mode, only the interrupt-logic, clock, and DRAM-refresh logic remain active. Sleep mode also disables DRAM refresh.

Special instructions: DSP instructions include multiply and MAC (complex and real), multiply-subtract, and complex add/subtract. Other instructions include test-leading zeros.

Intel 386

The 386 has all but disappeared from the PC but has developed a strong presence in embedded-PC applications (see EDN, June 22, 1995, pg 36). Register-based, the 80386 architecture has four general-purpose registers and four index/pointer registers, supplemented by six 16-bit segment registers and two 32-bit status and control registers. Intel 8086 designers used 64-kbyte segments to extend addressing to 1 Mbyte. The 80386 also uses segmentation; however, because the registers are now 32 bits (general-purpose and index/pointers), the segment limits are extended to the full 4-Gbyte addressing range, and a segment register references a segment descriptor with a 32-bit base address. These descriptors also carry addressing-range and protection limits to prevent data accesses into code, execution of data as code, and access to inner privilege levels by outer levels.

Hardware-descriptor registers hold segment-access rights along with segment-base address and size limits. In protected-mode addressing, a 16-bit selector points to a segment descriptor and furnishes a base address. The base address adds to the 32-bit effective address, producing a 32-bit linear address, which is then used as a physical address or as a linear-page address.
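The two address calculations can be compared side by side. This is a simplified sketch: the class and function names are hypothetical, the descriptor carries only a base and a byte-granular limit, and access rights, privilege checks, and paging are left out.

```python
from dataclasses import dataclass


class ProtectionFault(Exception):
    """Raised when an offset falls outside the segment limit."""


@dataclass
class SegmentDescriptor:
    base: int   # 32-bit segment base address
    limit: int  # highest valid offset within the segment


def real_mode_address(segment, offset):
    """8086 style: 16-bit segment * 16 + 16-bit offset, within 1 Mbyte."""
    return ((segment << 4) + offset) & 0xFFFFF


def linear_address(descriptor, offset):
    """Protected mode: descriptor base + 32-bit effective address."""
    if offset > descriptor.limit:
        raise ProtectionFault(f"offset {offset:#x} exceeds limit")
    return (descriptor.base + offset) & 0xFFFFFFFF


print(hex(real_mode_address(0xF000, 0xFFF0)))
print(hex(linear_address(SegmentDescriptor(base=0x00100000, limit=0xFFFF), 0x1234)))
```

In the real processor the selector first indexes a descriptor table to find the descriptor; here the descriptor is passed in directly to keep the example short.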

The 386 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access.

Power management: System-management mode (SMM), a power-management mechanism, enables code to control CPU power without having to rewrite or revamp existing operating software. The CPU enters SMM via a hardware interrupt, SMI (system-management interrupt); the SMI handler code can set SMM operating modes to reduce chip power dissipation. Integrated versions of the 386 (for example, Intel's 386EX) have idle and power-down modes: Idle discontinues CPU processing but keeps peripherals active, and power down shuts down the entire chip. AMD's 386SC300 chip has four power-saving modes: low speed (CPU goes to 0.5 MHz); doze, which stops CPU, system, and DMA clocks; sleep, which stops additional clocks and peripherals; and suspend, which stops everything except RTC and memory.

Special instructions: The 386 instruction set is a superset of the 8086/186. To support SMM, the 386 has seven additional instructions, such as RSM, which causes the processor to resume from SMM mode.

Second sources: AMD.

Intel 486

The 486 builds on the 386 architecture by adding a more efficient memory bus, an on-chip FPU, an on-chip unified L1 cache, and a RISC-like implementation for the core load/store instructions. The 80486 is a 32-bit RISC/CISC implementation, retaining the 386's complex instruction set but relying on a pipelined RISC-like implementation to speed execution for simple load/store instructions. The standard 486 microarchitecture has a five-stage pipeline and uses two of those stages (two decoder stages, D1 and D2) to decode the complex instruction set.

The 486 chips use variable-length instructions, ranging from 1 to 15 bytes for complex operations. The two decoder stages give the hardware time to delineate and decode the instructions waiting in the instruction queue. The instruction or byte-code queue holds 32 bytes for decoding. By fetching four words at a time from off-chip or local memory, the hardware minimizes contention between data and instruction accesses of the cache. To speed processing, the hardware loads and writes cache lines in four-word bursts.

The DX4 has a unified cache that is four-way, set-associative and implements a write-through policy: Writes to cache pass through to memory, which raises memory-bandwidth demand. The 486's bus and cache implement a bus-snooping protocol for multiprocessor operation. The bus is more efficient than that of the 386 and has a two-clock single read or write. Four-word read bursts take five cycles and constitute the majority of 486 bus accesses. The processors also support secondary caches for both single-processor and multiprocessor operation, as well as write-through/write-back protocols.

The 486 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access.

National Semiconductor offers an integrated 486, the NS486SXF, with a three-stage pipeline and a 16-byte prefetch queue. To reduce the core size further, National removed the 486's FPU, virtual-memory support, and real-mode functionality (precluding DOS support).

Power management: The standard 486 employs SMM for power management. (See description under 386 listing.) A halt instruction powers down most of the CPU's logic. Although it doesn't support SMM, National's version has several other power-saving features: A power-save mode divides the CPU clock, individual peripherals can be disabled, and an idle mode disables the CPU clock without affecting peripherals.

Special instructions: The 486 instruction set builds upon that of the 80386, adding instructions such as byte swap, exchange-and-add, compare-and-exchange, invalidate data cache, write-back and invalidate data cache, invalidate TLB entry, processor identification, and SMM resume.

Second sources: AMD, IBM Microelectronics, National Semiconductor, SGS-Thomson, and Texas Instruments.

Intel i960

The range of i960s runs from the new superscalar HA/HD/HT to the 16-bit SA/SB variants. The i960 combines a Von Neumann architecture with a load/store architecture that centers on a core of 32 32-bit general-purpose registers divided into 16 local registers and 16 global registers. An on-chip register cache automatically caches the local register sets to speed context switching. If the cache is full, the oldest cached set moves to memory and the latest set is cached. All i960s have multistage pipelines and use resource scoreboarding to track resource usage.

The i960CA upgraded to superscalar operation and five pipeline stages. The key to the Cx is its four-instruction-wide instruction decoder, which decodes up to four instructions per cycle. Current implementations dispatch up to three instructions for execution. The i960CF has 128-bit-wide buses to move instructions to the decoder and 128-bit-wide buses to move data between the cache and registers.

Superscalar i960s are built around a six-port register file with execution units gathered into two groups: register and memory control. These units include integer, floating-point, and interrupt-control units (on the register side) and address-generation and bus-controller units on the memory side. Instructions are cached in a lockable cache; later versions add an instruction cache to supplement the register cache.

The i960RP, based on the i960 Jx series processor core, is an I/O processor. The chip is used in server-motherboard and adapter-card applications where it creates an intelligent I/O subsystem. Intel and others have developed an Intelligent I/O (I2O) specification to speed I/O processing and simplify driver development.

Special instructions: The i960 has uninterruptible atomic add and modify instructions. Other instructions flush local registers and provide cache-locking control.

Intel Pentium

Intel designers tackled two problems when designing Pentium: achieving code compatibility with earlier x86 CPUs and attaining third-generation RISC performance. Designers handled both issues by implementing the complex x86 instruction set and emphasizing simple instruction executions over the more complex ones. With Pentium, the simple, RISC-like register-to-register instructions drive the implementation; the microcoded complex instructions are second priority.

Pentium achieves a two-instruction issue peak and has two five-stage pipelines (U and V). These pipelines are not symmetric; the U pipe takes precedence over the alternate pipe, V. If the second instruction does not cause interlocks (for example, by using results from the first instruction or by writing into the same register/data), then the second instruction is scheduled for the V pipe.
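The pairing decision can be sketched as a dependency check between adjacent instructions. The model below is a deliberate simplification that looks only at register read-after-write and write-after-write hazards; the real pairing rules include further restrictions not covered here, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str] = None      # register written, if any
    sources: Tuple[str, ...] = ()   # registers read


def can_pair(first, second):
    """True if 'second' may issue in the V pipe alongside 'first' in U."""
    if first.dest is None:
        return True
    reads_result = first.dest in second.sources   # read-after-write
    same_target = first.dest == second.dest       # write-after-write
    return not (reads_result or same_target)


def issue(instrs):
    """Greedy in-order issue; returns one (U, V) tuple per cycle."""
    cycles, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and can_pair(instrs[i], instrs[i + 1]):
            cycles.append((instrs[i].name, instrs[i + 1].name))
            i += 2
        else:
            cycles.append((instrs[i].name, None))  # V pipe stays empty
            i += 1
    return cycles


program = [
    Instr("mov eax, 1", "eax"),
    Instr("add ebx, eax", "ebx", ("ebx", "eax")),
    Instr("mov ecx, 2", "ecx"),
]
for u, v in issue(program):
    print(u, "|", v)
```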

The U and V pipelines feed from a common instruction fetch/align stage that fetches multiple instructions from the cache. The CPU fetches and passes a full line (256 bits) to the instruction decoder. Each pipeline has two decoder stages to decode simple and complex instructions. The wide cache-to-decoder path, coupled with a two-stage decode, enables Pentium to decode the x86's variable-length instructions and deliver competitive performance.

For superscalar, dual-instruction load/store operations, the Pentium data TLB and cache tags are dual-ported for concurrent pipeline accesses. The data-cache SRAM is eight-way-interleaved, allowing concurrent accesses to different memory banks. (The cache is actually triple-ported, with an extra port for snooping.) Cache hit rates range from 90 to 97%, depending on the application-code mix. The data cache handles both 4-kbyte and 4-Mbyte pages. It has two four-way, set-associative TLBs: one with 64 entries for 4-kbyte pages and one with eight entries for 4-Mbyte pages. The code cache is also two-way set-associative, with a four-way, set-associative, 32-entry TLB that handles both 4-kbyte and 4-Mbyte pages.

The CPU uses dynamic-branch prediction, allowing the CPU to determine which branch to take, as opposed to static branching, in which the compiler predetermines potential branches. Pentium's 256-entry branch-target buffer (BTB) holds branch-target addresses for previously executed branches, unlike some implementations that hold the actual target instructions. The BTB supplies the next instruction address that the last execution of a branch instruction took. Each BTB entry integrates the target address with special history and operation bits. Intel claims that a correctly predicted branch takes a single pipeline cycle and doesn't cause a pipeline bubble. Simulations show performance increases 25% when using the BTB.
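A branch-target buffer that stores target addresses plus history bits can be modeled in a few lines. The sketch assumes a direct-mapped table and a 2-bit saturating counter as the history; neither detail comes from the text above, which states only the 256-entry size and the presence of history bits, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}  # index -> [tag, target address, 2-bit counter]

    def _index_tag(self, pc):
        return pc % self.entries, pc // self.entries

    def predict(self, pc):
        """Return the predicted target, or None to predict fall-through."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry and entry[0] == tag and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record the actual outcome of the branch at pc."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is None or entry[0] != tag:
            if taken:
                self.table[idx] = [tag, target, 2]  # allocate as weakly taken
            return
        if taken:
            entry[1] = target
            entry[2] = min(3, entry[2] + 1)
        else:
            entry[2] = max(0, entry[2] - 1)


btb = BranchTargetBuffer()
btb.update(0x1000, taken=True, target=0x2000)
print(hex(btb.predict(0x1000)))
```

Because the buffer holds addresses rather than instructions, a hit only redirects the fetch; the target instructions still come from the code cache.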

Pentium's FPU features an eight-stage pipeline, which shares the first five stages of the U and V pipeline. Data transfers to or from the FPU use a wide, 64-bit datapath to the data cache to keep the FPU pipeline fed. Pentium adds a write buffer to each pipeline to avoid write contention.

Pentium uses burst reads to fill its 256-bit-wide cache line. It also has burst write-back writes. The memory interface is pipelined, allowing a second bus cycle to set up while the first bus cycle completes. Pentium reads or writes a 64-bit double word each cycle in burst mode.

The Pentium Pro (PPro) barely resembles the Pentium or any other x86 processor. With a single, decoupled 12- to 18-stage pipeline, the PPro trades less work per pipe stage for more stages. The PPro comprises three independent engines: fetch/decode, dispatch/execute, and retire. The fetch/decode engine converts instructions into one or more micro-operations (µops). The µops improve performance by representing fixed-length, fixed-field, easy-to-execute operations. The PPro can schedule the µops individually, which facilitates out-of-order execution of instructions.

After the decoder creates µops, it sends them to a 40-deep reorder buffer (ROB). The µops then await dispatch to the execute portion of the pipeline. At this point, the µops either are ready for execution or are waiting for data from a memory access or a result from a previous µop. To avoid register dependencies, the PPro performs renaming: Extra registers represent the x86's programmer-visible registers. The dispatch/execute engine queues ready-for-execution µops within a 20-entry, distributed-reservation station. The PPro determines the data flow by analyzing which µops are dependent on other µops' results and creates an optimized schedule. The processor dispatches µops from anywhere (or in any order) within the reservation station.

The PPro speculatively executes and returns these µops to the ROB, where the retire engine evaluates them. Although the PPro executes µops (or instructions) out of order, the device must complete the instructions in the original program order. Furthermore, speculative execution implies that the device executes some instructions that never retire. This situation occurs if the device mispredicts a program branch. When the PPro encounters a mispredicted branch, it must flush its deep pipelines and remove µops from the ROB. To minimize the potential of a mispredicted branch, Intel designers increased the BTB to 512 entries and added extra history bits to provide more intelligence to the prediction algorithm.
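Register renaming, as described above, can be illustrated with a simple mapping table. This sketch is not the PPro's mechanism: the class name, the pool of 40 physical registers, and the absence of any register-freeing logic are assumptions made to keep the example short.

```python
class RenameTable:
    def __init__(self, arch_regs, physical_count=40):
        # Each programmer-visible register starts mapped to its own slot.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        # Remaining physical registers are free (never recycled in this sketch).
        self.free = list(range(len(arch_regs), physical_count))

    def rename(self, dest, sources):
        """Rename one micro-operation; returns (new_dest, renamed_sources)."""
        renamed_sources = tuple(self.map[s] for s in sources)
        new_dest = self.free.pop(0)  # fresh register for every write
        self.map[dest] = new_dest
        return new_dest, renamed_sources


table = RenameTable(["eax", "ebx", "ecx"])
first_write, _ = table.rename("eax", ())                   # mov eax, [a]
_, first_use = table.rename("ebx", ("ebx", "eax"))         # add ebx, eax
second_write, _ = table.rename("eax", ())                  # mov eax, [b]
_, second_use = table.rename("ecx", ("ecx", "eax"))        # add ecx, eax
print(first_write, first_use, second_write, second_use)
```

The two writes to eax receive different physical registers, so the second load no longer has to wait for the first add to read the old value, and the two load/add pairs can execute in either order.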

Other 586-class processors: AMD's K5, although software- and pin-compatible with Intel's Pentium, is a unique 586-class CPU. While issuing up to four instructions per cycle, this µP marks the boundaries of the x86 instructions, enabling multiple x86 instructions to be aligned and assigned issue positions for efficient instruction processing. Every byte of code entering the processor's instruction cache is tagged with bits of associated predecode information to determine how to break the x86 instruction into a number of RISC-like operations (termed "ROPs"). The K5 also has a dual-ported data cache that allows two cache lines to be accessed simultaneously. The K5 can execute instructions out of order and also has extra registers, allowing the CPU to perform register renaming.

Cyrix also offers a processor, the 6x86, that is software- and pin-compatible with Pentium. This CPU performs register renaming, multibranch prediction, speculative execution, and out-of-order completion. Realizing that the initial 6x86 die size would be too large to compete at the 586 level, Cyrix developed a scaled-down 6x86, called the 5x86, with a 64-bit internal architecture packed into a 486 footprint.

MIPS R3000

MIPS R3000 processors, primarily used in embedded applications, are built around a set of 32-bit, general-purpose registers in a central register file. To minimize control logic, the instruction set is reduced to 73 instructions, and addressing options are limited. The chip has a three-address, load/store architecture. Similarly, instruction sizes are fixed to one 32-bit word to minimize decoding and speed processing.

MIPS engineers used a five-stage pipeline for the R3000. The pipeline lets up to five instructions execute concurrently, each at a different stage of its instruction cycle, thus giving the effect of single-cycle execution. The pipeline stages are instruction fetch (IF), read operands and decode instruction (RD), execute (ALU), access data memory (MEM), and write-back results (WB). A branch-delay slot minimizes branch effects. The compiler fills the instruction slot following the branch with a NOP or an instruction from the current thread that can be executed before the branch takes effect. Toshiba's R3900, an R3000 derivative, incorporates register scoreboarding to enable nonblocking loads and avoid pipeline stalls when there are no data dependencies in subsequent instructions.
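The overlap of the five stages is easiest to see as a cycle-by-cycle table. The short sketch below prints one for an ideal pipeline with no stalls or branches; the function name is hypothetical.

```python
STAGES = ["IF", "RD", "ALU", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction showing its stage in each cycle."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


for row in pipeline_diagram(5):
    print(" ".join(f"{cell:>3}" for cell in row))
```

Reading down the fifth column shows all five stages busy at once, which is why one instruction completes per cycle once the pipeline is full.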

The R3000 MMU includes a fully associative, 64-entry TLB that translates virtual addresses to 32-bit physical addresses. The µP uses a write-through cache policy. A small on-chip FIFO enables the CPU to refill the cache and execute instructions even when additional instructions are being read from memory. This process is called "instruction streaming."
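A fully associative TLB lookup can be sketched as a table keyed by virtual page number. The example assumes 4-kbyte pages and uses a placeholder oldest-first eviction; the text above does not state the page size or the replacement policy, and the names are hypothetical.

```python
PAGE_BITS = 12  # assumption: 4-kbyte pages


class TLBMiss(Exception):
    """Raised when no entry matches; software or hardware must refill."""


class FullyAssociativeTLB:
    def __init__(self, entries=64):
        self.capacity = entries
        self.entries = {}  # virtual page number -> physical frame number

    def insert(self, vpn, pfn):
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            # Placeholder policy: evict the oldest inserted entry.
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = pfn

    def translate(self, vaddr):
        """Any entry may hold any page, so the lookup compares all tags."""
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn not in self.entries:
            raise TLBMiss(hex(vaddr))
        return (self.entries[vpn] << PAGE_BITS) | offset


tlb = FullyAssociativeTLB()
tlb.insert(0x00400, 0x12345)
print(hex(tlb.translate(0x00400ABC)))
```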

Special instructions: The R3000 uses the MIPS-I instruction set. Toshiba's R3900 adds a MAC instruction.

Second sources: IDT, LSI Logic, NEC, NKK (Santa Clara, CA), and Toshiba.

Mitsubishi M32R/D

The Mitsubishi M32R/D contains a RISC CPU, a 32×16-bit MAC, a bus-interface unit (BIU), and a large on-chip DRAM. A single 128-bit internal bus, which operates at 66 MHz, connects the CPU, the DRAM, the cache, and the BIU.

The M32R/D has 16 32-bit general-purpose registers and supports 83 instructions in 16- and 32-bit-wide formats. The CPU executes most instructions in one clock cycle by using a five-stage pipeline: instruction-fetch, decode, execute, memory-access, and write-back. The decode stage dispatches instructions in order, and the remaining stages execute the instructions out of order to hide memory-access latency. The MAC contains a single-cycle, 32×16-bit multiplier and a 56-bit adder.

The CPU has an instruction queue of two 128-bit entries. The cache is mapped directly to the address space and has two caching modes: one for caching internal DRAM and the other for caching external program ROM area. If a cache miss occurs, the CPU fetches one 128-bit data line in five cycles. The BIU has 128-bit data buffers and supports burst transfers on 128-bit boundary data.

A 16.67-MHz bus clock and a 4× digital PLL generate the internal 66-MHz clock. The PLL contains a digital-frequency multiplier. Four cascaded, 64-tap inverter chains generate four timing edges in half a clock cycle. A phase detector and an up/down counter adjust the pulse width to one-fourth of the half-clock cycle to keep the duty cycle of the 4× clock at 50%. The generated clock is then fed into a digital-phase shifter to reduce the phase difference between the external and internal clocks.

Power management: When entering standby mode, the M32R/D purges the internal cache, sets the internal DRAM to self-refresh mode, and stops the clock and PLL circuit; the M32R/D then consumes current only for DRAM refresh. Recovery from standby mode requires a reset.

Special instructions: DSP-function instructions include move to/from accumulator, multiply half-word, multiply-accumulate, and round accumulator instructions. The move to/from accumulator instructions move 32-bit data to and from the 64-bit accumulator. Multiply half-word instructions multiply 16-bit data with 16- or 32-bit data and save the result in the accumulator. Round accumulator instructions round 64-bit data in the accumulator to 16- or 32-bit data.

Motorola ColdFire

Motorola's ColdFire evolved from the M68000. This architecture is also known as VL-RISC, because, although the core is RISC-like, the instructions are variable length (VL). VL instructions help to attain higher code density. Another advantage of the ColdFire architecture is its reduced size, which is achieved by eliminating M68000 instructions that were used infrequently in embedded applications and by optimizing the pipeline. ColdFire continues to use the M68000 programmer's model.

ColdFire has a four-stage pipeline that consists of two subpipelines: a two-stage instruction-prefetch pipeline (instruction-address generation and instruction-fetch cycle) and a two-stage operand execution pipeline (decode and select/operand-fetch cycle and operand-address generation/execute cycle). A 12-byte FIFO instruction buffer decouples the two pipelines. The prefetch pipeline calculates the next instruction address and then fetches 32 bits of instruction data. The operand pipeline has a dual-read-ported register file feeding an arithmetic/logic unit.
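A toy model can show how the 12-byte FIFO decouples the two subpipelines. The sketch below is a simplification under stated assumptions: instructions are taken to be 2, 4, or 6 bytes long, the operand pipeline issues at most one instruction per cycle, and the function name `run` and the issue-then-fetch ordering within a cycle are hypothetical choices, not ColdFire's documented behavior.

```python
from collections import deque

FIFO_BYTES = 12   # buffer size given in the text
FETCH_BYTES = 4   # 32 bits of instruction data per prefetch


def run(lengths, max_cycles=1000):
    """Return the cycles needed to issue instructions of the given byte lengths.

    Each cycle the operand pipeline first issues one instruction if all of
    its bytes are buffered; the prefetch pipeline then adds up to 4 bytes
    if the FIFO has room for a full fetch.
    """
    pending = deque(lengths)
    buffered = 0                 # bytes currently held in the FIFO
    to_fetch = sum(lengths)      # bytes not yet fetched from memory
    cycles = 0
    while pending and cycles < max_cycles:
        cycles += 1
        if buffered >= pending[0]:
            buffered -= pending.popleft()
        if to_fetch > 0 and FIFO_BYTES - buffered >= FETCH_BYTES:
            chunk = min(FETCH_BYTES, to_fetch)
            buffered += chunk
            to_fetch -= chunk
    return cycles


print(run([2, 2, 2, 2]), run([6, 6, 6, 6]))
```

In this model, short instructions let the buffer run ahead of execution, whereas a stream of 6-byte instructions outpaces the 4-byte fetch rate and occasionally starves the operand pipeline.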

The CPU core is separated from on-chip peripherals by using a modular, standard bus architecture; the core communicates with on-chip memories using a tightly coupled processor bus, the KBus. This bus lets the core perform a 32-bit fetch from internal memory in a single clock cycle by pipelining the address and data. A controller interface on the KBus indirectly attaches the core to user-selectable cache, ROM, and RAM modules. Another ColdFire bus, the MBus, offers centralized arbitration. A special module connects the MBus to the KBus. The SBus interfaces to standard on- and off-chip peripherals and attaches to the MBus through a system-bus controller.

On-chip debug supports real-time trace and real-time and non-real-time debug. The feature also supports access to control registers to define types of memory regions, such as cacheable copyback, write-through, and noncacheable. Real-time trace reflects the processor's status, indicates events such as instruction completion, and monitors change-of-flow target addresses. Real-time debug supports three hardware breakpoints: program counter relative, operand address, and operand data. Non-real-time debug is similar to background-debug mode on current 683xx products. You can use a three-pin serial interface in this mode to read register contents, generate an infinite-priority interrupt, and force the CPU to halt.

Power management: A low-power stop instruction (LPSTOP) shuts down active circuits in the processor and halts instruction execution. Processing resumes via reset or valid interrupt.

Special instructions: ColdFire added the following instruction extensions to the 68000 architecture: 32×32-bit integer multiply, register-sign-extension instructions, and multiple-word NOPs, used by compilers to remove branch instructions.

Motorola 68000

The 68000 serves as a base for the 680x0 and 683xx lines of 32-bit µPs. The 68000 is actually a 16- and 32-bit mix. It has 32-bit registers for easy addressing, a 16-bit datapath and ALU to conserve silicon, and 16-bit instructions. Programmers get eight general-purpose, 32-bit data registers, which the CPU can address by bit, BCD, byte, word, or double word. In addition to user and supervisor stack pointers, 68000 chips have seven address registers. Other registers include the 32-bit program counter and the 16-bit status register. The status register maintains status for the user and supervisor modes via a user byte and a supervisor byte. The processor's user and supervisor modes are implemented in hardware, which eases having a control kernel or OS manage multiple application tasks.
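The split of the status register into a user byte and a supervisor byte can be sketched as a small decoder. The bit positions used here follow Motorola's published 68000 programmer's model rather than anything stated in this article, and the function name `decode_sr` is hypothetical.

```python
def decode_sr(sr: int) -> dict:
    """Split a 16-bit 68000 status-register value into its fields."""
    sr &= 0xFFFF
    return {
        # User byte: condition codes in bits 0-4
        "C": bool(sr & 0x01),   # carry
        "V": bool(sr & 0x02),   # overflow
        "Z": bool(sr & 0x04),   # zero
        "N": bool(sr & 0x08),   # negative
        "X": bool(sr & 0x10),   # extend
        # Supervisor byte: bits 8-15
        "interrupt_mask": (sr >> 8) & 0x7,
        "supervisor": bool(sr & 0x2000),
        "trace": bool(sr & 0x8000),
    }


print(decode_sr(0x2700))
```

For example, `decode_sr(0x2700)` reports supervisor mode with an interrupt mask of 7 and all condition codes clear.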

The 68000 has two microcode levels: microcode and second-level expanded nanocode. Instruction execution triggers a chain of 10-bit microcode words. Each microcode word can reference another word--for example, a jump in microcode or a string of 70-bit nanocode words that drive the CPU logic directly.

The CPU lacks a memory controller, but the separate address and data buses eliminate the need for buffering addresses. However, the CPU needs logic to generate the required DTACK* signal, which marks the successful completion of a memory cycle. An address decoder is necessary for multiple memory chips, and drivers may be needed to buffer bus address and data lines (integrated versions of the 68000 contain this logic). If DTACK* is late, wait states are generated.

Power management: Only the integrated versions provide variations of sleep and low-power stop modes.

Special instructions: The chip restricts privileged instructions to supervisor mode. These instructions include reset, stop, and moves and operations on the status register. To support user and supervisor modes, the hardware implements separate stacks and pushes and pops the PC and status register onto and off the stack for exceptions. A link instruction lets you build linked lists on private stacks. A special instruction lets you move up to 16 registers to or from an effective address, including blocks of data registers to or from address registers.

Second sources: Hitachi, Philips, SGS-Thomson, and Toshiba.

Motorola 680x0

The 680x0 architecture is built around 16 general registers with a 68000-compatible, orthogonal instruction set. The 680x0 has more registers than the original 68000. Control registers were added to control the MMU and the FPU as well as to support additional processing capabilities. For example, the 68040 adds eight 80-bit floating-point registers and 12 control registers, which include a vector-base register (points to interrupt vector table), cache-control register, user and supervisor root pointers, and translation registers.

The superscalar 68060 heads the 680x0 lineup with its dual-integer and floating-point pipelines. As instructions enter the CPU, they flow into a four-stage prefetch pipeline: instruction-address generation, instruction cycle, instruction early decode, and instruction buffer. In this pipeline, the 68000-compatible, variable-length CISC instructions get converted to a fixed-length instruction. Once converted, these instructions enter dual four-stage integer-execution pipelines that operate synchronously. The four stages of the execution pipeline are decode, effective address calculation, fetch, and integer execution. This pipeline dispatches instructions to the floating-point unit and allows for some execution overlap between the integer and floating-point engines.

A Harvard architecture allows the 68060 to perform simultaneous instruction fetches and data accesses. The on-chip caches are four-way set-associative with four-way interleaving to support simultaneous read and write operations. Portions of the caches can be frozen to prevent reallocation.

The 68040 implements a six-stage pipeline (fetch, decode, effective-address calculate, effective-address fetch, execute, and write-back). To speed processing, the 68040 has two on-chip, 4-kbyte direct-mapped caches and separate data and instruction MMUs, which allow simultaneous address translations. Bus snooping is built into the 68040's caches to ensure cache coherency for multiprocessing. Both write-through and copy-back modes are built into the cache. The 68020 and 68030 CISC implementations have smaller caches; the 68030 and 68040 implement burst mode, moving up to 16 bytes in a single addressing block between registers and memory.

The 040 and 060 deliver apparent single-cycle execution for some instructions, mainly register operations, such as memory-to-register moves (if the data is in the data cache). A taken branch takes two cycles; a nontaken branch takes three cycles. On the 68060, a 256-entry, four-way, set-associative, on-chip branch cache allows taken and nontaken branches to execute in zero and one clock, respectively. The branch-cache unit contains state bits that provide a branch-execution history, which helps to predict branch direction.
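How history state bits can steer a prediction is easiest to see in a generic scheme. The sketch below uses a textbook 2-bit saturating counter per branch address; this is an assumed, illustrative scheme, not the 68060's documented state machine, and the names `TwoBitPredictor` and `mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """Generic 2-bit saturating-counter predictor (assumed scheme)."""

    def __init__(self):
        self.state = {}  # branch address -> counter in 0..3

    def predict(self, addr: int) -> bool:
        # Counters of 2 or 3 mean "predict taken"; new branches start at 1.
        return self.state.get(addr, 1) >= 2

    def update(self, addr: int, taken: bool) -> None:
        count = self.state.get(addr, 1)
        self.state[addr] = min(count + 1, 3) if taken else max(count - 1, 0)


def mispredictions(outcomes, addr=0x1000) -> int:
    """Count wrong predictions for one branch over a sequence of outcomes."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# A loop branch taken nine times and then falling through once.
print(mispredictions([True] * 9 + [False]))
```

With this scheme, a loop branch that is taken nine times and then falls through is mispredicted only on its first and last executions.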

Unlike the 68020/030, the 68040 and 68060 do not perform dynamic bus sizing. Instead, they have a highly reliable bus with a high-drive option capable of implementing a synchronous, two-clock R/W protocol. A four-word burst takes five clocks. Multiprocessor bus arbitration is built into the 040 but requires off-chip logic. Externally, the 68060 bus is a superset of the 68040's bus. Additional signals support higher performance system designs, but the processor can easily operate on an existing 68040-based bus. An on-chip MMU with separate instruction and data TLBs allows the 68060 to access up to 4 Gbytes of memory.

Power management: To support power management, the 68060's functional units respond to dynamically controlled clocking; the caches and execution units power down when not accessed. The static design allows the external clock to be reduced or stopped, and an LPSTOP (low power stop) instruction disconnects most of the chip from the CLK pin.

Special instructions: The CPUs have special instructions for variable-length bit fields, for moving 16 registers, and for compare and swap, which locks memory for multiprocessing. A scaling option addresses data by item size for table access and for FPU and MMU commands.

The 68040 and 68060 have a special MOVE16 instruction to perform a 16-byte block move and a PLPA instruction that loads a physical address by translating a logical address. A TBL instruction performs a table look-up and interpolates the data.

Motorola 683xx

For most of the 683xx family, Motorola combined a stripped-down 68020 core with a 16-bit (32-bit for CPU32+) on-chip InterModule bus, which links the CPU with a device's complex peripherals. The core processor, the CPU32 or CPU32+, is the 68020 CPU stripped down for embedded control (no MMU or FPU interface) combined with a 16- or 32-bit data bus, respectively. The 32-bit processor has eight general-purpose, 32-bit registers; seven 32-bit address registers; a 32-bit ALU; separate user and supervisor modes, each with its own stack; and separate address and data spaces. The CPU32 is code-compatible with the 68020 but has enhanced addressing modes, including scaled index, address register indirect with base displacement and index, program counter relative, and 32-bit branch displacements. Postincrement and preincrement/decrement options simplify iterative code. Peripheral-control registers and I/O are memory-mapped; the CPU accesses them as addresses in memory.

All 683xxs have a system-integration module featuring system configuration, oscillator and clock dividers, reset and power-down-mode control, chip selects and wait states, parallel I/O with interrupt capability, interrupt configuration/response, and a software watchdog. The external bus interface has up to 32 address and 16 data lines (32 for the CPU32+) and up to 12 programmable chip-select lines.

Power management: The LPSTOP instruction stops the clock. Devices can run at low frequencies.

Special instructions: The 68020 instructions not supported include BCD pack/unpack, bit field, compare and swap, coprocessor, MMU, and module call/return; memory-indirect addressing is also not supported. New instructions include a table look-up and interpolate instruction, as well as an instruction that puts the chip into a low-power standby mode.
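A table look-up and interpolate operation can be sketched in a few lines. The encoding below is an assumption made for illustration: the upper 8 bits of a 16-bit input select a table entry and the lower 8 bits give the fraction, in 1/256ths, toward the next entry. The function name and the rounding behavior (floor division) are likewise hypothetical rather than taken from Motorola's documentation.

```python
def table_lookup_interpolate(table, x: int) -> int:
    """Linearly interpolate between adjacent table entries.

    Assumed encoding: bits 15-8 of x are the table index and bits 7-0 are
    the fraction (in 1/256ths) between table[index] and table[index + 1].
    """
    index = (x >> 8) & 0xFF
    frac = x & 0xFF
    lo, hi = table[index], table[index + 1]
    return lo + ((hi - lo) * frac) // 256


curve = [0, 100, 200]
print(table_lookup_interpolate(curve, 0x0080))  # halfway between entries 0 and 1
```

Doing the look-up and the interpolation in one instruction lets a control program store a coarse table and still read intermediate values without a separate multiply-and-add sequence.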

Apple/IBM/Motorola PowerPC

Serving as a base for a family of RISC chips, the PowerPC derives its core architecture from the performance-optimized-with-enhanced-RISC (POWER) architecture. The instruction set supports multiple microarchitecture implementations that include the 32-bit 602, 603, and 604; embedded processors (Motorola's MPC505, MPC860, and MPC821 and IBM's 400 series); and the 64-bit 620.

The PowerPC 620 combines six execution units with a five-stage pipeline: fetch, dispatch, execute, complete, and write-back. It uses a superscalar design to control the six independent execution units: three integer units, a branch unit, a floating-point unit, and a load/store unit. Each unit contains two to four reservation units for holding instructions so that the CPU can reduce data dependencies and pipeline bubbles.

The 620 can perform both static- and dynamic-branch prediction. It also performs out-of-order execution and in-order completion.

The 620 has a 128-bit, L2 cache interface. The unified L2 cache can be clocked at one-half, one-third, or at the processor-clock frequency. The 128-bit data-bus interface supports a split transaction, pipeline snoop-bus protocol. An on-chip MMU converts 80-bit virtual addresses to 64-bit physical addresses and uses 128-entry, two-way, set-associative shared instruction and data TLBs.

The PowerPC 604 can issue up to four instructions per cycle. It has the same types of execution units as the 620; however, each execution unit contains only two reservation units. The 604 uses a six-stage pipeline: fetch, decode, dispatch, execute, completion, and write-back. The 604 also performs dynamic-branch prediction and speculative execution and, similar to the 620, performs out-of-order execution. Similar to the 620, the 604 supports the MESI protocol for cache coherency for multiprocessor systems.

The 603 comprises five parallel execution units: integer execution, floating point, branch, system, and load/store. Combined with a four-stage pipeline (fetch, dispatch, execute, and complete), the 603 can achieve three instructions per clock cycle. During the fetch stage, the 603 uses a six-instruction prefetch queue to hold pending instructions. Unlike other PowerPC derivatives, the 603 supports only static-branch prediction. However, the architecture supports out-of-order execution and in-order retirement, similar to other PowerPC devices.

The PowerPC 602 is a cost- and power-reduced implementation of the 603. The 602 has four parallel-execution units: fixed point, FPU, branch processing, and load/store. The scalar design of the 602 limits instruction issue to one instruction per cycle. The 602 performs branch folding to help eliminate branches. Its FPU stores and calculates only single-precision values.

The PowerPC 601 can issue up to three instructions per clock cycle using three major functional units: the instruction unit for integer operations, the FPU, and the branch unit (BU); all execute concurrently. The instruction unit fetches the instructions, queues eight instructions for decoding, and issues them to the execution units.

To minimize TLB exceptions, the 601 has a large, 256-entry TLB. It also has four-entry, shadow TLBs for fast access to the most recently accessed entry. The 601 has branch prediction and branch folding, eliminating branch instructions where possible. The BU searches the bottom half of the instruction queue for branches and uses static prediction to cause the target-instruction thread to be accessed for execution.

The embedded PowerPC processors include Motorola's MPC500 family and IBM's 400 series devices. Compared with other PowerPC devices, these devices have similar--but fewer--execution units and issue only one instruction at a time. The MPC500 uses an Intermodule bus (originally developed for Motorola's 683xx devices) as a backplane to connect all system modules. The MPC500 family includes a system-integration unit that enables simple integration with external memories, other CPUs, and peripheral devices.

Power management: The 603 has a dynamic power-management feature that includes clock stopping and reducing signal activity whenever possible. For example, when not in use, the FPU, system unit, load/store unit, or caches are automatically turned off. The CPU has three incremental and automatic power-management modes: doze (functional units are disabled except for timebase/decrementer registers and bus-snoop logic), nap (disables bus snoop), and sleep (disables all internal functional units).

Special instructions: All PowerPC 6xx devices perform single-cycle floating-point MACs. PowerPC 604, 603, and 620 support graphics instructions: SQRT approximation and inverse SQRT approximation. The 403Gx has load/store for multiple registers and byte strings as well as extensive cache-manipulation and semaphore-handling operations.

NEC V850

NEC's V850 µC family is based on the company's proprietary 32-bit RISC architecture, which comprises a five-stage pipeline, 32 general-purpose registers, a 32-bit barrel shifter, and a hardware multiplier. The pipeline stages are fetch, decode, execute, memory access, and write-back. Most instructions execute in one clock and are two bytes long, allowing smaller code. The CPU has a pipeline-stall feature that automatically inserts a bubble into the pipeline to avoid data dependencies and hazards.

A bus-control unit (BCU) generates a prefetch address to prefetch an instruction code from external memory and store it in the four-double-word prefetch queue. For accesses from internal ROM, instructions go straight to the CPU (that is, not through the prefetch queue). Instruction fetches from internal ROM consume one cycle; data fetches from ROM require three cycles. Therefore, you should shadow look-up tables and fixed data structures to the CPU's internal RAM, where data can be accessed in one clock. The BCU also provides a bus-hold function, allowing other devices, such as DMA, to share and take control of the V851's external bus.

Peripherals are accessed as memory-mapped I/O and are connected to the CPU through a 16-bit bus. ROM and RAM communicate to the CPU using a 32-bit bus. Although the first member of this family, the V851, has 32 kbytes of ROM and 1 kbyte of RAM, the V850 architecture allows internal expansion to 1 Mbyte of ROM and 4 kbytes of RAM. Similarly, the external bus of the V851 addresses up to 16 Mbytes. (The architecture allows access up to 4 Gbytes on future chips.) The V850's memory space divides into 1-Mbyte unit blocks, and wait states can be inserted into a bus cycle for every two blocks.

Power management: In halt mode, the clock generator continues to operate, but the CPU clock stops, allowing the on-chip peripherals to function. Idle mode stops the CPU clock and internal system clock; however, because the clock generator continues to run, normal operation can resume without having to wait for oscillator and PLL stabilization. In stop mode, everything stops, but register and memory contents stay intact.

Special instructions: NEC's V850 devices support a software-trap instruction. The CPUs also perform saturate operations in which the CPU stores the maximum value if an addition result overflows. For example, if the result exceeds the maximum positive value, 7FFFFFFFh, the CPU stores 7FFFFFFFh in the result register and sets the saturation flag.
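The saturate operation can be sketched as a signed 32-bit add that clamps instead of wrapping. The positive limit is the one given in the text; the negative limit of 80000000h and the function name `saturating_add` are assumptions made for symmetry and illustration, and the returned Boolean merely stands in for the saturation flag.

```python
INT32_MAX = 0x7FFFFFFF    # positive limit given in the text
INT32_MIN = -0x80000000   # assumed negative limit (80000000h as two's complement)


def saturating_add(a: int, b: int):
    """Return (result, saturated) for a signed 32-bit saturating add."""
    total = a + b
    if total > INT32_MAX:
        return INT32_MAX, True
    if total < INT32_MIN:
        return INT32_MIN, True
    return total, False


print(saturating_add(0x7FFFFFF0, 0x20))
```

Clamping in this way avoids the sign flip that a wrapping add would produce, which matters in control and signal-processing loops where an overflowed value would otherwise swing from full positive to full negative.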

Sun microSPARC/Ross hyperSPARC

The microSPARC and Ross hyperSPARC processors are built around a large, multiported register file that breaks down into a small set of global registers for holding global variables and sets of overlapping register windows. Each 24-register window has a core of eight registers supplemented by eight registers overlapping the previous and next register windows. The overlapping registers eliminate the need to save and restore registers on function calls, returns, or context switches between tasks.

The microSPARC has a five-stage pipeline: fetch, decode, memory access, execute, and write-back. It also has a four-entry write buffer to prevent write stalls. An integrated FPU contains 32×32-bit floating-point registers, a general-purpose execution unit, and a floating-point multiplier. A three-entry queue of floating-point instructions helps increase concurrency with integer execution.

The hyperSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit, and an FPU. Instructions are fetched in pairs and, when possible, dispatched to functional units in pairs. The FPU contains a four-entry instruction queue, helping to eliminate stalls due to the execution of multiple-cycle floating-point instructions. In addition, when data dependencies exist, data generated by one instruction in the floating-point queue is forwarded to the second instruction without having to go through the register file.

Both microSPARC and hyperSPARC include SPARC-compliant MMUs. The microSPARC's MMU uses three high-order bits of physical address to map eight address spaces. The MMU controls arbitration among I/O, data cache, instruction cache, and TLB references to memory. It contains a 64-entry, fully associative TLB and supports 256 contexts. The hyperSPARC's MMU uses a context register to identify up to 4096 contexts.

The microSPARC µPs have a separate 64-bit memory interface that handles up to 128 Mbytes (256 Mbytes for microSPARC-II) of 16-Mbit DRAM. An on-chip SBus interface and controller handles five SBus slots. (SBus is a 25-MHz, 32-bit synchronous bus.) The hyperSPARC interfaces to the system via the SPARC-standard MBus (a 40-, 50-, or 66-MHz, 64-bit multiplexed synchronous bus). Ross' processors operate synchronously or asynchronously to the MBus clock. The microSPARC-II replaces the SBus interface with a 32-bit, 33-MHz PCI interface.

Special instructions: The microSPARC and hyperSPARC µPs comply with instructions listed within the SPARC V8 specification. Temic Semiconductor's SPARClet µP includes DSP capabilities by extending the SPARC instruction set and accessing special hardware operators using the coprocessor opcode.

Sun SuperSPARC

Operations within Sun's SuperSPARC center around the 136-entry, eight-port register file. Registers group into eight global registers and eight overlapping register windows. The register file handles six reads (three two-operand reads) and two writes; the file can concurrently perform two reads and two writes but is time-shared to handle six reads and two writes in one system-clock cycle.
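The 136-entry figure follows from the window overlap: eight globals plus eight windows that each contribute 16 unique registers. The sketch below maps a window number and an architectural register to a physical slot. The register naming (r8-r15 outs, r16-r23 locals, r24-r31 ins) follows the SPARC V8 convention; the slot numbering and the function name `physical_index` are hypothetical choices made for illustration.

```python
N_GLOBALS = 8
N_WINDOWS = 8
UNIQUE_PER_WINDOW = 16  # 8 ins + 8 locals; outs are shared with a neighbor


def physical_index(window: int, reg: int) -> int:
    """Map architectural register r0..r31 of a window to a physical slot.

    Assumed layout: a window's outs are the ins of window - 1 (modulo
    N_WINDOWS), so a call that moves to the next-lower window sees the
    caller's outs as its own ins.
    """
    if not 0 <= reg < 32:
        raise ValueError("register must be in 0..31")
    if reg < 8:                      # globals, shared by every window
        return reg
    if reg < 16:                     # outs alias the neighbor's ins
        base = ((window - 1) % N_WINDOWS) * UNIQUE_PER_WINDOW
        return N_GLOBALS + base + (reg - 8)
    base = (window % N_WINDOWS) * UNIQUE_PER_WINDOW
    if reg < 24:                     # locals
        return N_GLOBALS + base + 8 + (reg - 16)
    return N_GLOBALS + base + (reg - 24)   # ins


slots = {physical_index(w, r) for w in range(N_WINDOWS) for r in range(32)}
print(len(slots))
```

Because a caller's outs and the callee's ins land on the same physical slots, arguments pass between procedures without any register-to-register copies.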

The superscalar SuperSPARC pipeline has eight stages grouped into four execution stages (fetch, decode, execute, and write-back) of different lengths. The eight stages are cache access, send matched instructions to scheduler, issue instructions, read address registers/evaluate branch-target address, read operand from register file, first and second ALU stages, and write-back result.

The CPU runs eight functional units: three integer ALUs, load/store, branch, floating-point multiply, floating-point add, and shift. The adder units are organized so that two can execute concurrently and return results to the register file or feed into the third ALU. That ALU can then operate on the results and return a value to the register file in one pipeline cycle. Thus, SuperSPARC can do three adds in one cycle, in which one add is dependent on the first two results.

The multiply and add floating-point units are pipelined; they can accept a new instruction every clock cycle but have a three-cycle latency. The FPU has its own instruction queue and 16 64-bit registers.
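The benefit of pipelining these units shows up in a simple cycle count. The sketch below assumes a stream of independent operations with no stalls; the function names are hypothetical.

```python
def pipelined_cycles(n_ops: int, latency: int = 3) -> int:
    """Cycles to finish n independent operations when one issues per cycle."""
    return 0 if n_ops == 0 else latency + (n_ops - 1)


def unpipelined_cycles(n_ops: int, latency: int = 3) -> int:
    """Cycles if each operation had to finish before the next could start."""
    return n_ops * latency


print(pipelined_cycles(10), unpipelined_cycles(10))
```

For ten independent operations, the pipelined unit needs 12 cycles where a non-pipelined unit with the same latency would need 30; dependent operations would not see this gain, because each would wait out the full three-cycle latency.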

The CPU physically addresses its instruction and data caches. The instruction-cache path is 128 bits to handle superscalar operation. Four instructions are presented simultaneously to the eight-deep prefetch queue. A single TLB supports both caches. It has 64 entries and does two TLB evaluations in one clock cycle.

SuperSPARC runs in stand-alone mode by interfacing to the MBus. The processor can run in cache-controller mode by interfacing to an external cache controller via the VBus, a nonmultiplexed, proprietary bus (CPU clock rate, 36-bit address, 64-bit data). The VBus links to a cache controller and up to 2 Mbytes of unified secondary-cache SRAM. The cache controller can handle multiprocessing (more than one SPARC CPU on an MBus).

Branch-delay slots and a branch-target queue minimize branch penalties. A branch-delay slot following the current set of instructions gives the hardware time to prefetch both the target set and the next sequential set of instructions.

Power management: SuperSPARC does not implement any power-saving features.

Special instructions: SuperSPARC processors comply with the instructions listed within the SPARC V8 specification.



Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.