

Design Feature: September 15, 1994

EDN's 21st Annual Microprocessor Directory - 32-bit chips

Marcus Levy,
Technical Editor

32-BIT


AMD 29000

AMD positions the 29000 family for embedded systems including laser printers, X-terminals, graphics, RAID (redundant array of inexpensive disks), telecommunications, and networking. The 29000 is a difficult processor to classify because it has three product lines within the same family: three-bus Harvard-architecture processors, two-bus processors, and microcontrollers with on-chip peripheral support.

The 29000's core is built around a simple four-stage pipeline: fetch, decode, execute, and writeback. Processor hardware implements pipeline interlocks. The pipeline allows the CPU to sustain 1.26 to 1.6 cycles/instruction. What makes the 29000 unique is its large register file of 192 32-bit registers. The triple-ported register file handles two operand-register reads and one register-result write in a single clock. The register file can also do on-chip stack operations and register windowing. Most 29000 instructions specify three register addresses: two registers as inputs and one register for the result. An 8-bit address specifies each register in the register file. The addresses are divided into local (MSB=1) and global (MSB=0); 128 registers are in the local set.
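
The local/global split described above can be sketched in a few lines of Python. This is only an illustration of the MSB test; the function name is hypothetical, and the sketch does not model how the hardware maps a local address onto a physical register.

```python
def decode_register(addr):
    """Split an 8-bit register address into (register set, low seven bits)."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("register address must fit in 8 bits")
    if addr & 0x80:                 # MSB = 1 selects the local set
        return ("local", addr & 0x7F)
    return ("global", addr)         # MSB = 0 selects the global set


local_example = decode_register(0x81)
global_example = decode_register(0x02)
```

With seven bits left after the MSB, the local half of the address space covers exactly the 128 local registers the text mentions.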

Another unique feature is the branch-target cache (not available on all versions). The two-way, set-associative cache holds up to 128 instructions in four-word blocks. The first four instructions of a branch target are held in the cache, waiting for the branch to repeat. When the branch repeats, the cache furnishes the instructions, thus keeping the branch-penalty time to one cycle. The branch-target cache is effective for loops, which pay a penalty for only the first branch pass.
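
A toy model helps show why only the first pass through a loop pays the full branch penalty. The geometry below follows from the figures above (128 instructions in four-word blocks, two ways); the indexing scheme, the LRU replacement, and the class name are assumptions made for the sketch, not details of AMD's design.

```python
WORDS_PER_BLOCK = 4
TOTAL_WORDS = 128
WAYS = 2
SETS = TOTAL_WORDS // WORDS_PER_BLOCK // WAYS   # 32 blocks in 16 sets


class BranchTargetCache:
    def __init__(self):
        # Each set remembers up to WAYS branch targets, most recent last.
        self.sets = [[] for _ in range(SETS)]

    def lookup(self, target):
        """Return True on a hit; on a miss, install the target (LRU assumed)."""
        ways = self.sets[(target // WORDS_PER_BLOCK) % SETS]
        if target in ways:
            ways.remove(target)
            ways.append(target)
            return True
        if len(ways) == WAYS:
            ways.pop(0)             # evict the least recently used target
        ways.append(target)
        return False


btc = BranchTargetCache()
# The same loop-closing branch is taken five times in a row.
hits = [btc.lookup(0x1000) for _ in range(5)]
```

In this model the first lookup misses and every later one hits, matching the loop behavior described above.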

The 29000 memory interface provides single-cycle access for DRAM, page-mode DRAM, or ROM. Some members of the 29000 µP family have a Harvard architecture, and others implement the Von Neumann method. Unlike many RISC processors, the 29000's external memory can pump instructions at the CPU clock rate to deliver cache-like performance. A DRAM controller, the Am29C668, runs both standard and page-mode DRAM for the 29000 but requires some external control logic and DRAM buffers. The 29200 incorporates a DRAM and ROM controller with flash and SRAM support.

The 29000 µPs can operate in either user or supervisor mode. In user mode, an illegal action by an executing program causes a protection-violation trap. Special compare instructions put their results into a general-purpose register instead of using condition codes in a status register, which makes processing more general because registers can hold multiple conditions. Assert instructions test a condition and cause a software trap if it is not true. Asserts include tests for equality/nonequality.

The 29000 supports multiprocessing through a LOADSET instruction that implements a binary semaphore: it loads a register from memory and, with memory locked (LOCK* asserted), writes back all 1s to the same location. Load-and-lock and store-and-lock instructions read or write a memory location with LOCK* asserted during the memory access. A CLZ instruction counts the leading 0s in a word and can speed bit-map processing by finding the first nonzero bit. CPBYTE compares two words byte by byte and puts the Boolean result into a register.
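
The behavior of CLZ and LOADSET can be modeled in Python. This is a sketch, not AMD's definition: the class and function names are hypothetical, and a Python lock stands in for the LOCK* bus signal.

```python
import threading


def clz32(word):
    """Count the leading zeros in a 32-bit word (32 for a zero word)."""
    return 32 - (word & 0xFFFFFFFF).bit_length()


class Memory:
    """Toy memory whose lock stands in for the LOCK* signal."""

    def __init__(self):
        self.words = {}
        self._bus_lock = threading.Lock()

    def loadset(self, addr):
        # Read the old value and write all 1s as one indivisible operation.
        with self._bus_lock:
            old = self.words.get(addr, 0)
            self.words[addr] = 0xFFFFFFFF
            return old


mem = Memory()
first = mem.loadset(0x100)    # semaphore was free; this caller now owns it
second = mem.loadset(0x100)   # semaphore already taken
```

A caller that reads back 0 has acquired the semaphore; a caller that reads back all 1s has to retry.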

VARIATIONS/SPECIAL FEATURES

Advanced Micro Devices developed the 29000 family and is the sole source.
29005/000/050: Three-bus µP; 16 to 40 MHz; 0/0.5/1-kbyte instruction cache[super{1}]; timer facility[super{2}]; MMU (except 29005); on-chip FPU (only 29050). $44/$52/$175.
29030/035/040: Two-bus µP; 16 to 50 MHz with scalable clocking; 4/8-kbyte instruction cache; 4-kbyte data cache and 32x32 integer multiplier; timer facility[super{2}]; MMU[super{3}]; JTAG debug support. $64/$45/$104.
29205/200/245/240/243: 12.5 to 33 MHz; 3.3V operation (only 240/243); 4-kbyte instruction cache and MMU (except 200/205/245); 2-kbyte data cache (240/243 only); timer; ROM/DRAM controllers; one to four DMA channels; 16 I/Os (205 has 8); serial port or UART; two to seven external interrupts; parallel port; JTAG debug support (except 205); two- to six-port PIA[super{5}]. $18/$39/$63/$79/$91.

Notes:

  1. Instruction cache is actually a branch target cache (BTC).
  2. Timer facility provides counter for real-time clock or other software timing functions.
  3. MMU translates a 32-bit virtual byte address to a 32-bit physical byte address.
  4. FPU also handles integer multiply and divide.
  5. Peripheral interface adapter allows for additional system features implemented via external peripheral chips.

Support
HARDWARE: ICEs, logic analyzers, and development boards are available from third-party vendors. Plug-in evaluation boards are available from AMD. PC-form-factor 29000 motherboards are also available for development.

SOFTWARE: C, C++, Ada, Pascal, and Fortran compilers as well as source-code debuggers are available from major software vendors. Cross-development tools run on PCs as well as most Unix-based workstations, including Sun, HP, and DEC. Real-time OS kernels are also available.


ARM 6/7

The ARM (Advanced RISC Machine) microprocessor is directed toward low-power, portable, and embedded applications. Recent ARM processors are built around the ARM7, a small, static RISC core that can be fitted with 8 kbytes of unified cache and an on-chip MMU.

The ARM implements RISC techniques and a three-stage pipeline (fetch, decode, and execute) to achieve single-cycle instruction execution. The ARM also has a load/store architecture. The processor has block-data-transfer instructions to load and store data from any subset of the 16 general-purpose registers.

The ARM has 11 basic types of fixed-length instructions. All instructions (not just branches) execute conditionally, which reduces the need for short pipeline-flushing branches. An instruction whose condition fails executes in one cycle. Taken branches incur a three-cycle delay. There are 16 condition codes, including equal, not equal, always, negative, and overflow. The ARM has no explicit shift instructions; instead, all ALU operations can perform an optional shift operation in the same execution cycle.

The ARM has no integer-divide instruction; fast divide and divide by 10 are provided by support software. It does, however, have multiply and multiply-and-accumulate (MAC) instructions. The MAC instruction speeds math-intensive applications. A Booth hardware multiplier operates on two operand bits at a time to build a final product. A 32-bit multiply takes 16 cycles; smaller multipliers reduce the number of cycles. The ARM7M variant contains a faster multiplier, which executes in four cycles and offers 64-bit multiplication. Division and multiplication by a constant can often be performed quickly using the barrel shifter (eg, division by four takes one cycle, as does multiplication by five).
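
The arithmetic in this paragraph can be checked with a short Python sketch. The function names are hypothetical; the cycle count simply assumes two multiplier bits retired per cycle, as the text states, and ignores any setup overhead.

```python
def booth_cycles(multiplier_bits):
    """Cycles for a multiplier that retires two multiplier bits per cycle."""
    return (multiplier_bits + 1) // 2


def times_five(x):
    # x + 4x: a single add whose second operand passes through the shifter.
    return x + (x << 2)


def divide_by_four(x):
    # Dividing an unsigned value by a power of two is a right shift.
    return x >> 2


full_width = booth_cycles(32)   # 32-bit multiplier
short = booth_cycles(8)         # a smaller multiplier finishes sooner
```

The shift-and-add forms show why constant multiplies and divides cost one ALU operation when the shift is free.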

The ARM700 and 710 include an 8-kbyte, four-way set-associative, unified cache. An MMU handles translations between virtual and physical addresses as well as controlling access permissions. It supports page sizes of 4 and 64 kbytes, with access control to page granularity. The ARM µPs support user and supervisor modes for controlling access; they handle four exception-processing modes: interrupt request, fast interrupt request, abort, and undefined. Modes use different register windows to overlay some of the 16 general-purpose registers.

The ARM has 31 general-purpose registers, with only 16 visible at a time. The fast-interrupt mode has seven private registers to minimize state-saving overhead. All registers are general-purpose, including the PC, although a set of conventions called the ARM Procedure Call Standard governs their use for C compatibility.
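
The register count can be reconciled with a little arithmetic. The seven private fast-interrupt registers come from the text above; the figure of two private registers for each of the other four privileged modes is an assumption drawn from the ARM architecture generally, not from this directory.

```python
VISIBLE = 16   # registers visible in any one mode

# Private (banked) registers per mode. Only the FIQ figure is from the text;
# the two-per-mode entries are an assumption for this sketch.
PRIVATE = {
    "fast interrupt": 7,
    "interrupt": 2,
    "supervisor": 2,
    "abort": 2,
    "undefined": 2,
}

total_registers = VISIBLE + sum(PRIVATE.values())
```

Under that assumption the total comes to 31, which agrees with the count quoted above.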

All ARM processors have nonmultiplexed buses. The ARM6x0/7x0 bus clock can be synchronous or asynchronous with respect to the internal cache clock. The ARM6x0/7x0 also provides a write buffer, which lets execution continue while writes are pending. The buffer holds eight words at four independent addresses. ARM processors also incorporate a coprocessor interface (except the 610/710) to allow another CPU to load/store data or request a CPU operation.

VARIATIONS/SPECIAL FEATURES
ARM, a joint venture of Acorn, Apple, and VLSI Technology, owns the design and licenses it for production. Licensees include Cirrus Logic, GEC Plessey Semiconductors, Samsung (for consumer products), Sharp, TI, and VLSI. Cirrus Logic embeds the ARM core in ICs for computing, communications, and consumer electronics. TI uses the ARM core in customizable DSP devices and incorporates it into its configurable µC library. Sharp and VLSI Technology sell ARM chips and CPU macrocells for ASICs.

ARM6: Static core; 20 to 45 MHz; also available as ASIC core.

ARM60: ARM6 core, up to 40 MHz; JTAG; $12.50.

ARM600/610: ARM6 core; up to 33 MHz (VY86C610C-45); 4-kbyte cache; four-word write buffer; MMU with 32-entry TLB and coprocessor bus (600 only); JTAG port. $17 to $260.

ARM7: Static; 33 MHz; 3.3V operation; also available as ASIC core.

ARM700/710: ARM7 processor core; up to 50 MHz (VY86C710); 8-kbyte unified cache; eight-word write buffer; MMU with 64-entry TLB; coprocessor interface (700 only); static; JTAG boundary scan. $20.

ARM650 (VLSI): General-purpose ARM6-based µC; 50 MHz; DMA controller and buffer; 32-bit timer; video interface; interrupt controller; coprocessor; serial I/O port; incorporates typical functional system blocks useful for ASIC development and evaluation; samples only.

Ruby (VLSI): Communication controller for wireless and networks; ARM6 core; 30 MHz; variable-width memory (8, 16, 32) and wait-state capability with 512x32 zero wait-state internal SRAM; 16-bit counter with three 8-bit timers; contains PCMCIA controller compatible with Intel 365; UART; SCC; PIO and serial controller; SPI. $29.

Support
HARDWARE: Platform-independent development and evaluation cards for ARM chips link to a development host via an RS-232C port. The card includes a boot module that links to a host symbolic debugger. Running on a Sun SPARCstation 10, the instruction-set simulator offers approximately 0.5 to 1 MIPS. You can single-step ARM7-based µPs by manipulating the CPU clock.

SOFTWARE: Cross-development tools are available for PCs and Sun workstations (support for Macintosh is available by arrangement). Tools include a C compiler, an assembler/linker, a library with stand-alone runtime kernel, a source-level debugger, and an instruction-set simulator. The C compiler conforms to ANSI C and Unix PC conventions. A C++ compiler is planned for early 1995. VLSI offers "Jump-start" graphical software-development tools.


Fujitsu 86930

The SPARC RISC architecture targets workstation and server computing. Fujitsu reworked the SPARC V8 design architecture to target embedded systems. The SPARC embedded specification, called V8E, is officially part of SPARC International. Fujitsu's MB8693X family (alias SPARClite), which is based on V8E, adds on-chip data and program caches, a cleaner memory interface, timers, UARTs, an interrupt controller, and a debug unit. Unlike the original SPARC µPs, which required lots of support logic, 86930 µPs minimize design-in time and require little glue logic. The chips have an on-chip, programmable DRAM interface that handles page-mode DRAM for fast, burst-mode accesses. Currently, the 8693Xs are the only single-chip SPARC µPs with on-chip cache dedicated to embedded-system applications.

Following the SPARC V8 definition, 8693X processors feature a 32-bit ALU and use a load/store architecture with a register stack of 136 32-bit registers (86933 and 86933H have 104). Eight registers are set aside to hold global values. The remaining registers arrange into six or eight overlapping register windows, one window for each subroutine. This setup speeds procedure calls and interrupt processing. Multiple contexts can be present concurrently by limiting the number of registers for a task.
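
The register counts follow directly from the window arrangement. In the sketch below, the figure of 16 unique registers per window (eight locals plus eight outs, with the ins shared with the neighboring window) is standard SPARC practice rather than something stated in this directory; the function name is hypothetical.

```python
GLOBALS = 8
UNIQUE_PER_WINDOW = 16   # 8 locals + 8 outs; the ins overlap a neighbor's outs


def register_file_size(windows):
    """Total 32-bit registers for a given number of overlapping windows."""
    return GLOBALS + windows * UNIQUE_PER_WINDOW


eight_window_parts = register_file_size(8)
six_window_parts = register_file_size(6)
```

Eight windows give the 136-register file and six windows give the 104-register file of the 86933 parts.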

The 8693X implements the SPARC V8 specification, which includes a full multiply instruction (unlike earlier SPARCs) and division via software using divide by four. The 8693X also incorporates a 32-bit hardware multiplier. User and supervisor operating modes and a JTAG test-vector interface let you test the chips in and out of a system.

Fujitsu engineers extended the SPARC pipeline to five stages for the 8693X: fetch, decode, execute, memory, and write-back. Designers added the memory stage to minimize the effects of load/store operations. The memory stage reduces a load/store to one-cycle execution. The stage is idle for nonload/store operations.

Most 8693X µPs have on-chip caches to hold critical data and processing routines for faster operation. There are separate data and instruction caches. Cache sizes run from 1 to 8 kbytes of instruction cache and up to 2 kbytes of data cache. The caches are two-way set-associative and have 16-byte (four-word) or 32-byte (eight-word) cache lines. The lines can be locked on chip and not swapped out for critical code or data. Fujitsu engineers also added debug registers to hold data values or addresses for individual and range breakpoints.
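
To make the cache geometry concrete, the sketch below splits an address into tag, set index, and byte offset for a two-way set-associative cache. The default parameters pair a 2-kbyte cache with 16-byte lines purely as an example; which line size goes with which cache size on a given chip is not stated here, and the function name is hypothetical.

```python
def split_address(addr, cache_bytes=2048, ways=2, line_bytes=16):
    """Return (tag, set index, byte offset) for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)   # 64 sets with the defaults
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


fields = split_address(0x12345)
```

Two addresses with the same index but different tags can live in the cache together, one in each way.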

The 8693Xs run with DRAM, SDRAM, SRAM, and ROM/EPROM. The memory interface handles page-mode DRAM for low-cost, high-speed access using a 32-byte burst mode. The memory interface includes a refresh generator for DRAMs, programmable wait states for slower memory, and programmable chip selects for memory banking. One chip, the MB86932, includes an MMU for memory protection and paging and a 16-entry TLB. Boot-up memory interfaces are programmable; most 8693X CPUs can boot up from 8-, 16-, or 32-bit ROM/EPROM.

Special instructions include scan (searches a word for the first changed bit, or the first 1 or 0); load/store double word; save/restore caller (uses register windows); tagged add/sub (generates overflow if bits 0 and 1 of an operand are not 0); and generate trap from conditions.
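
A tagged add can be modeled as an ordinary 32-bit add with an extra overflow condition. The sketch below assumes the SPARC convention that the two least-significant bits of each operand carry the tag; the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def tagged_add(a, b):
    """Return (32-bit sum, overflow flag) for a SPARC-style tagged add."""
    a &= MASK32
    b &= MASK32
    total = (a + b) & MASK32
    # Overflow if either operand has a nonzero tag in bits 1:0 ...
    tag_set = bool((a | b) & 0b11)
    # ... or if the add overflows as a signed operation: the operands share
    # a sign and the result's sign differs from it.
    signed_overflow = bool(~(a ^ b) & (a ^ total) & 0x80000000)
    return total, tag_set or signed_overflow


clean = tagged_add(8, 4)     # both tags are zero
tagged = tagged_add(8, 5)    # second operand carries a nonzero tag
```

Language runtimes can use the tag bits to mark small integers, so the overflow flag catches an add applied to something that is not one.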

VARIATIONS/SPECIAL FEATURES
Fujitsu is the developer and sole supplier of 8693X µPs.

MB86930: 20/30/40 MHz, 2-kbyte instruction cache, 2-kbyte data cache, 16-bit simple timer[super{1}]. $31/$36/$46.

MB86931: 20/40 MHz, 2-kbyte instruction cache, 2-kbyte data cache, four 16-bit timers[super{1}], two USARTs[super{2}], 15-channel interrupt controller. $42/$63.

MB86932: 20/40 MHz, 8-kbyte instruction cache, 2-kbyte data cache, 16-bit timer[super{1}]; 16-entry TLB[super{3}], two-channel DMA[super{4}]. $67/$95.

MB86933: Stripped-down version with 28-bit address bus; 20 MHz, no data cache, 1-kbyte instruction cache; 16-bit timer[super{1}]; four-channel interrupt controller. $19.

MB86934: 30/60 MHz, 3.3V operation; 8-kbyte instruction cache, 2-kbyte data cache, 16-bit timer[super{1}]; two-channel DMA[super{4}], FPU; vector/DSP support FIFOs; SDRAM interface. $90/$100.

MB86940: 40-MHz companion chip providing additional I/O and peripheral support; 4/16-bit timers[super{1}]; two UARTs[super{2}]; 15-channel interrupt controller. $17.

Notes:

  1. Counter/timers with prescaler and compare/capture register. Each can generate periodic interrupts and square waves; each has two watchdog modes.
  2. USART: Asynchronous/synchronous operation; 5- to 8-bit character length selection; parity-bit option; internal or external synchronous-mode options; one or two synchronous character options.
  3. TLB supports 4k, 256k, and 16M pages with subpage protection to 1k.
  4. Both channels can operate concurrently; Fly-By or Flow-Thru modes; block transfer and buffer chaining modes. Transfers occur while processor is executing out of cache.

Support
HARDWARE: Most chips have a debug-support unit, which has six debug registers (two code address, two data address, and two data values). A 10-pin emulation bus for in-circuit testing is available on some family members. Fujitsu provides evaluation boards.

SOFTWARE: Third-party tools are available, including compilers, debuggers, emulators, libraries, and real-time OSs. The 86930s are also compatible with SPARC software. Most chips won't run Unix because they lack an MMU.


Hitachi SH Series

The SH7000, or SH-1, is the first family of a series of 32-bit RISC controllers/processors from Hitachi. The SH7000 is aimed at embedded control applications ranging from process and motor control to fixed-function PDAs and entertainment systems. The chip has up to 64 kbytes of program ROM and a chip full of embedded µC peripherals.

The second generation of this processor family, the SH7600, or SH-2, targets more traditional RISC applications, such as laser printers, X-terminals, and communications servers. The SH7600 builds on the SH7000, having a more standard RISC implementation with an on-chip unified four-way set-associative program/data cache and a 32-bit-wide memory bus.

Both CPUs have internal 32-bit data paths and registers. A 16-bit instruction word helps compact code, allowing ROM to hold more code. The CPUs have sets of 16 32-bit general registers.

The 16-bit instruction word reduces program memory requirements and leads to fairly compact code, enabling larger on-chip programs. However, there are some tradeoffs when employing a 16-bit instruction word. The smaller word size restricts the range of registers an instruction can address as well as the number of registers each instruction can specify. Instead of normal three-address, 32-register RISC operations, the SH7000 is restricted to 16 registers (four bits per register field) and to specifying up to two registers per instruction. These restrictions can offset some of the savings because a task may need more instructions than it would on a standard RISC architecture.
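
The field budget of a 16-bit instruction word can be illustrated with a pack/unpack pair. The layout below (four bits each for an opcode, two register fields, and a function code) is an illustrative assumption, not Hitachi's published encoding, and the function names are hypothetical.

```python
def encode(opcode, rn, rm, func):
    """Pack four 4-bit fields into one 16-bit instruction word."""
    for field in (opcode, rn, rm, func):
        if not 0 <= field <= 0xF:
            raise ValueError("each field is four bits wide")
    return (opcode << 12) | (rn << 8) | (rm << 4) | func


def decode(word):
    """Unpack a 16-bit word into (opcode, rn, rm, func)."""
    return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


word = encode(0x3, 1, 2, 0xC)
fields = decode(word)
```

Two 4-bit register fields already use half the word, which is why there is no room for a third register or for 32-register addressing.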

The SH7000 employs a five-stage pipeline. The processor can run from on-chip program memory and external memory. The 16-bit-wide external memory bus can supply the CPU with instructions for each cycle with SRAM or fast DRAM. The chip has up to 8 kbytes of RAM. If the processor is running from external memory, access to external memory takes up to three additional cycles per data access.

Instead of on-chip program memory, the SH7600 has an on-chip cache, a 32-bit-wide memory bus for high CPU memory bandwidth, and a full 32-bit divide unit (replacing the first chip's bit-step-divide function). The external memory bus supports multiprocessing; it has bus arbitration for multiple masters and does bus snooping to maintain system cache coherency.

The SH-2's clock runs to 28.5 MHz. The chip has a PLL and needs only a 7-MHz input clock. Additionally, the software can adjust clock rate during program operation to reduce power consumption. To reduce power, the unified cache's special low-power design dissipates only 100 mW in operation.

A 16x16-bit MAC instruction (42-bit accumulator) in the SH-1 and a 32x32-bit MAC instruction (64-bit accumulator) in the SH-2 provide a fast DSP function. Although the SH series is classified as a load/store architecture, some instructions reference memory directly.

VARIATIONS/SPECIAL FEATURES
Hitachi is the developer and sole supplier.

SH7032/34 (SH-1): 3.3 to 5V operation; 12.5 to 20 MHz; 64-kbyte ROM/ROMless, OTP, or EPROM (34 only); 4- to 8-kbyte RAM, watchdog timer; five 16-bit timers[super{1}]; SRAM/DRAM controller[super{2}], two-channel SCI, four-channel DMA and I/O controller[super{3}], ADC[super{4}], 40 I/Os; nine external interrupts and 31 internal interrupts. $33/$41.

SH7020/21 (SH-1): 3.3 to 5V operation; 12.5 to 20 MHz; 16/32-kbyte ROM or OTP; 1-kbyte RAM, watchdog timer; five 16-bit timers[super{1}]; SRAM/DRAM controller[super{2}], two-channel SCI, four-channel DMA and I/O controller[super{3}], 32 I/Os; nine external interrupts and 30 internal interrupts. $19.95.

SH7604 (SH-2): 3.3 to 5V operation; 16.6 to 28.5 MHz; 4-kbyte cache, watchdog timer; 16-bit, free-running timer; SRAM/DRAM/SDRAM controller[super{2}], one-channel SCI, two-channel DMA and I/O controller[super{3}]; five external interrupts and 11 internal interrupts. $50.75 (25,000).

Notes:

  1. Complex Timer Unit each with capture/compare registers; PWM generator.
  2. DRAM controller supports direct interface to DRAM and SRAM. Programmable wait states, DRAM refresh, chip selects, selectable bus size (8- or 16-bit), eight memory segments, single-cycle operation with 70-nsec DRAM at 16 MHz.
  3. Takes DMA request from external pins, serial ports, timers (transfer rate to 40 MHz); bypasses the CPU to move data.
  4. Eight input channels, 10-bit conversion (6.7-µsec/channel at 20 MHz), single or scan conversions, internal or external triggers, variable reference voltage.

Support
HARDWARE: SH-series chips have built-in ICE functions, allowing external hardware to set a breakpoint or breakpoint range (with masked address bits). Cycle type, CPU or DMA, R/W, instruction/data, or operand sizes can be specified for breaks. The E7000 development platform for the SH-Series is available from Hitachi.

SOFTWARE: Software packages and support for the C development environment are available from Hitachi, Cygnus (GNU), and Green Hills. Supported platforms include PC, SPARC, and HP.


Intel 386

Introduced in 1985, the 386 once dominated the PC desktop; it still has a future, but as a second-order processor for lower end PCs. Special versions of the 386, such as the 16-bit-bus 386SX and 386SXL, have emerged for lower cost and low-power systems. But the 386 is really building steam as a processor for embedded systems. The 386’s low cost, resulting primarily from competition between Intel and AMD, allows engineers to design 386s, or even PCs, into nondesktop applications. Intel introduced the 386EX, a 386CX core with peripherals designed for embedded applications. AMD’s Am386SC tailors the 386 for the growing handheld-device market.

Register-based, the 80386 doesn’t offer the classic set of general-purpose registers. Instead, the architecture has four general-purpose and four index/pointer registers, supplemented by six 16-bit segment registers and two 32-bit status and control registers.

Intel 8086 designers used 64-kbyte segments to extend addressing to 1 Mbyte. The 80386 also uses segmentation; however, because the registers are now 32 bits, the segment limits are extended to the full 4-Gbyte addressing range, and the segment register references a segment descriptor with a 32-bit base address. These descriptors also carry addressing-range and protection limits to prevent data accesses into code, data being executed as code, and access to inner privilege levels by outer levels.

Software developers have used the 386’s protected-mode privilege-level features. For example, OS/2 relies on privilege levels to create a more fail-safe runtime system. Hardware descriptor registers hold segment-access rights along with segment-base address and size limits. In protected-mode addressing, a 16-bit selector points to a segment descriptor, which furnishes a base address. The base address is added to the 32-bit effective address, producing a 32-bit linear address, which is then used as a physical address or as a linear-page address.
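
The selector-to-linear-address path can be sketched as follows. This is a simplified model: it omits the choice between global and local descriptor tables, all privilege checks, and paging, and the class and function names are hypothetical. It does rely on the x86 convention that the upper 13 bits of a selector index the descriptor table.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    base: int    # 32-bit segment base address
    limit: int   # highest valid offset within the segment


class ProtectionFault(Exception):
    pass


def linear_address(selector, offset, table):
    """Translate selector:offset into a 32-bit linear address."""
    desc = table[selector >> 3]          # upper 13 bits index the table
    if offset > desc.limit:
        raise ProtectionFault("offset beyond segment limit")
    return (desc.base + offset) & 0xFFFFFFFF


table = {1: Descriptor(base=0x00010000, limit=0xFFFF)}
addr = linear_address(0x0008, 0x1234, table)
```

An offset beyond the descriptor's limit raises the fault instead of producing an address, which is the basis of the protection described above.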

Intel introduced a power-management mechanism, SMM (system-management mode), that enables code to control CPU power without having to rewrite or revamp existing operating software. The processor enters SMM via a hardware interrupt, SMI (system-management interrupt); the SMI interrupt code can set SMM operating modes to reduce chip power dissipation.

By dropping in an embedded PC, a 386-based hardware design that functions like a PC, you can reduce hardware and software development costs for many nondesktop applications. Hardware design is minimized, and the hardware runs PC software. An embedded PC won’t work for real-time applications requiring high-speed interrupts but will work for applications requiring fast design turnaround with minimum design effort. An embedded PC also enables developers to use low-cost, highly interactive PC-development tools. Intel and AMD have embedded PC chips, the i386EX and Am386EM, respectively, that combine a low-power static core with a full set of PC-class peripherals, such as an interrupt controller, DMA, timers, and serial I/O. AMD’s Am386SC integrates its 386 static core with a PC peripheral set for handheld applications.

Vendor | µP | Speed (MHz) | Price (10,000) | Comments
AMD | Am386SXL | 25/40 | $19.95 | SX with power management
AMD | Am386SXLV | 25 | $19.95 | SX with power management; cache controller with on-chip tag RAMs; DRAM/SRAM control; 3.3V operation
AMD | Am386DXLV | 25/33 | $28.95 | 3.3V operation
AMD | Am386DX | 33/40 | $29 to $38 | Standard 386
AMD | Am386SC | 33 | $24 | 386SXLV with system-management mode; DRAM controller; PC/AT peripherals; serial controller; PS/2 bidirectional parallel port; PCMCIA controller; LCD controller
AMD | Am386SE/DE | NS | $15/$20 | Stripped-down versions of SX/DX for embedded apps
Intel | i386SX | 25/33/40 | $22.50 to $27 MSRP | 16-bit bus
Intel | i386DX | 25/33 | $25/$45 MSRP | Standard 386
Intel | i386CX | 25/33/40 | $23.50 to $29 | Static 386SX with power management; 26 address lines; 3.3/5.5V operation
Intel | i386EX | 25 | $30 | 3/3.3/5V; two DMA channels; three 16-bit timer/counters; three serial I/O ports; two interrupt controllers; DRAM/SRAM refresh logic and 10 chip selects; watchdog; up to 24 I/O lines; power management
Note: MSRP=manufacturer's suggested retail price; NS=not specified.

VARIATIONS/SPECIAL FEATURES
Intel developed the 386 architecture as part of the X86 family. Other vendors (through licensing, second-source, or clean-room designs) have developed 386-compatible µPs. AMD developed its own static version of the 386 and now dominates the 386 chip market for desktops. Cyrix took a different tack. It implemented the 486 ISA for 386 portable applications by using 16-bit 386SX and 32-bit 386SL pinouts in a static design. The chips have a small 1-kbyte cache to boost performance. Chips & Technologies has two 386 implementations, one with a 512-byte on-chip cache.

Support
HARDWARE: The 386 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access. A set of hardware-development tools, including ICEs and logic analyzers, is available for the processors.

SOFTWARE: The 386 has a huge development and application software base. However, the bulk of this software, including most compilers, is written for 16-bit operations. Thirty-two-bit compilers are now available, including a C compiler from Intel. Embedded PCs are supported with ROMable DOS and Windows, as well as a number of ROMable, real-time OSs.


Intel i80486

Introduced in 1989, the Intel 486 processor now dominates desktop applications, taking the CPU socket for a majority of desktop products. Intel's superscalar Pentium aims to replace the 486, but the 486 will dominate through 1995.

The 486 builds on the 386 architecture; it implements the 386 ISA but adds a more efficient memory bus, an on-chip FPU, an on-chip unified cache, and a RISC-like implementation for the core load/store instructions. Even though it's not in the performance class of SPARC and Mips RISC chips, the 486 delivers respectable throughput, especially at higher clock rates. Intel developed clock-multiplied implementations that take a system-bus clock and double or triple it for internal processor operations: the core runs at two to three times the system bus rate, doubling or tripling performance (at least when operating out of its on-chip cache). The 486's slower execution of complex instructions limits its performance.

The 80486 is a 32-bit RISC/CISC implementation, retaining the 386's complex instruction set but relying on a pipelined RISC-like implementation to speed execution for the simple load/store instructions. The 486 microarchitecture has a five-stage pipeline and uses two of those stages (two decoder stages, D1 and D2) to decode the complex instruction set.

Unlike most RISCs, the 386/486 has variable-length instructions, ranging from 1 to 15 bytes for complex operations. The two decoder stages give the hardware time to delineate and decode the instructions waiting in the instruction queue. The instruction or byte-code queue holds 32 bytes for decoding. By fetching four words at a time from off-chip or local memory, the hardware minimizes contention between data and instruction accesses of the cache. To speed processing, the hardware loads and writes cache lines in four-word bursts.

The 486 is the first of the x86 processors to have an on-chip cache. The DX4 has a 16-kbyte unified cache. The cache is four-way set associative and implements a write-through policy: writes to cache pass through to memory, which raises the demand on memory bandwidth. The 486's bus and cache implement a bus-snooping protocol for multiprocessor operation. The bus is more efficient than that of the 386 and performs a single read or write in two clocks. Four-word read bursts take five cycles and constitute the majority of 486 bus accesses. The bus also supports secondary caches for both single and multiprocessor operation.
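
The bus timings translate into peak transfer rates with simple arithmetic. The sketch below assumes a 33-MHz bus clock and no wait states, both chosen only for illustration; the function name is hypothetical.

```python
def bus_bandwidth_mbytes(clock_mhz, words, clocks, bytes_per_word=4):
    """Peak transfer rate in Mbytes/sec for a transfer of `words` in `clocks`."""
    return clock_mhz * words * bytes_per_word / clocks


single = bus_bandwidth_mbytes(33, words=1, clocks=2)   # two-clock single access
burst = bus_bandwidth_mbytes(33, words=4, clocks=5)    # four-word burst
```

Under these assumptions a burst moves data about 1.6 times faster than back-to-back single accesses, which is why cache-line fills use bursts.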

Vendor | µP | Speed (MHz) | Price (1000) | Features
AMD | Am486DX | 33/40 | $306 | Standard device
AMD | Am486DX2-50 | 50 | $417 | Clock-doubled
AMD | Am486DXLV | 33 | $306 | 3.3V
AMD | Am486SX | 30/40 | $185 | 16-bit bus
AMD | Am486SXLV | 33 | $185 | 3.3V SX with 16-bit bus
Cyrix | Cx486DRX2 | 16/20/25 | $299 to $399 | 386 upgrade, clock-doubled end-user product
Cyrix | Cx486DX2/66 | 66 | $249 | 33-MHz bus, 5/3.3V versions
Cyrix | Cx486DX2/50 | 50 | $194 | 25-MHz bus, 5/3.3V versions
Cyrix | Cx486DX/50 | 50 | $289 | DX bus
Cyrix | Cx486DX/40 | 40 | $184 | 20-MHz DX bus
Cyrix | Cx486DX/33 | 33 | $142 to $158 | 33-MHz DX bus, 5/3.3V versions
Intel | i486DX4 | 75/100 | $429 to $516 | Clock-doubled/tripled, power management
Intel | i486DX2 | 40/50/66 | $204 to $271 | Clock-doubled, power management, 3.3V version
Intel | i486DX | 33/50 | $204 to $281 | 3.3V version (33 MHz), power management
Intel | i486SX2 | 50 | $168 | Clock-doubled, no FPU
Intel | i486SX | 25/33 | $75 to $87 | No FPU, 3.3V version, power management
Intel | i486SL | 25/33 | $190 to $291 | 32-bit bus, 3.3V version, power management
SGS-Thomson | ST486SLC/e | 20/25/33/40 | $65 to $68 | 386SX pinout, 1-kbyte cache, 3.3/5V, external FPU
SGS-Thomson | ST486SLC2 | 50 | $67 | 386SX pinout, 1-kbyte cache, clock-doubled, external FPU
SGS-Thomson | ST486DLC | 25/33/40 | $68 to $70 | 386DX pinout, 1-kbyte cache, external FPU
SGS-Thomson | ST486DX | 25/33/40/50 | $135 to $173 | Power management, 3.3V
SGS-Thomson | ST486DX2 | 40/50 | $192 to $229 | Clock-doubled, power management
Texas Instruments | TI486SLC | 25/33 | $55 | 3/5V, 16-bit bus, 1-kbyte cache
Texas Instruments | TI486DLC | 33/40 | $65 | 3/5V, 1-kbyte cache
Texas Instruments | TI486SXLC2/SXLC | 20/25/40/50 | $80 to $100 | 5/3.3V, power management, 8-kbyte cache
Texas Instruments | TI486SXL2 | 20/25/40/50 | $100 to $125 | 5/3.3V versions, power management, 8-kbyte cache

VARIATIONS/SPECIAL FEATURES
Intel developed and sells 486 chips. Intel has competition for the 486 sockets. Vendors that produce 486-class devices include AMD, Cyrix, IBM, SGS-Thomson, and Texas Instruments. By the middle of next year, SGS-Thomson plans to introduce the 486 as a core for ASICs. TI is introducing a three-chip CPU system, the Rio Grande, for portable computers; the three chips include an integrated CPU, a PCMCIA card controller, and a PCI combo chip.

Support
HARDWARE: The 486 has six debug registers: four code/data breakpoint registers and two control registers. The breakpoint registers can be set with addresses for halting execution on a program or data access. A full set of hardware-development tools, including ICEs and logic analyzers, is available.

SOFTWARE: The 486 has a huge development and application software base. However, the bulk of this software, including most compilers, is written for 16-bit operations. Thirty-two-bit compilers are now available, including a C compiler from Intel. Most programming languages have a compiler for the 486.


Intel i960

Intel's 32-bit i960 RISC µCs suit both military and commercial embedded applications. The family has penetrated the embedded-systems world in data-staging applications such as laser printers, graphics processing, and telecommunications. The range of i960s runs from bare-bones, minimal RISC chips with 16-bit external buses to superscalar CAs and CFs to military-specific processors, such as the MX--with dual external buses, MMUs, and built-in security and multiprocessing features. Superscalar i960s achieve two to three sustained instruction executions per clock cycle.

The i960 is a hard architecture to classify. It combines a Von Neumann architecture, RISC-implementation techniques with superscalar operation, CISC instructions, microcode, complex addressing modes, a register file, register-window caching, a sophisticated I/O controller, and a complex interrupt controller. The load/store architecture centers on a core of 32 32-bit general-purpose registers divided into 16 local and 16 global registers. An on-chip register cache automatically caches the local register sets to speed context switching. If the cache is full, the oldest cached set moves to memory and the newest set takes its place in the cache.
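The spill behavior of the local-register cache can be pictured with a small model. The following is a minimal sketch, not Intel's implementation: the class name, the method names, and the reload-on-return rule are assumptions made for illustration; only the "oldest set moves to memory when the cache is full" policy comes from the description above.

```python
from collections import deque


class LocalRegisterCache:
    """Toy model of an on-chip cache of local register sets (hypothetical API)."""

    def __init__(self, capacity_sets):
        self.capacity = capacity_sets
        self.cached = deque()   # oldest cached set sits on the left
        self.spilled = []       # sets that were moved out to memory, in spill order

    def call(self, frame_id):
        """A call needs a fresh local set; spill the oldest cached set if full."""
        if len(self.cached) == self.capacity:
            self.spilled.append(self.cached.popleft())
        self.cached.append(frame_id)

    def ret(self):
        """A return drops the newest set and returns the caller's frame id.

        Assumption: if the caller's set was spilled, it is reloaded from memory.
        """
        self.cached.pop()
        if not self.cached and self.spilled:
            self.cached.append(self.spilled.pop())
        return self.cached[-1] if self.cached else None


cache = LocalRegisterCache(capacity_sets=4)
for frame in range(6):          # six nested calls overflow a four-set cache
    cache.call(frame)
print(list(cache.cached), cache.spilled)
```

With four cached sets and six nested calls, the two oldest frames end up in memory while the four newest stay on chip, so shallow call chains never touch memory for register saves.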

The 960SA/SB/KA/KB low-end chips have small register and instruction caches; they have a five-stage pipeline and use register scoreboarding to track resource usage and the instructions that can execute. The CPUs allow out-of-order execution, ie, a later instruction can execute while a prior one waits for a critical resource to become free. The i960CA moved up to superscalar operation. The key to the CA is its four-instruction-wide instruction decoder, which decodes up to four instructions per cycle. Current implementations dispatch up to three of these instructions for execution.
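Register scoreboarding can be sketched in a few lines. This is a simplified illustration under stated assumptions: one busy bit per register, a 32-entry register file, and hypothetical method names; the real i960 scoreboard tracks more resources than registers.

```python
class Scoreboard:
    """One busy bit per register; a set bit means a result is still pending."""

    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def can_issue(self, sources, dest):
        # An instruction may execute only if none of its registers is pending.
        return not any(self.busy[r] for r in (*sources, dest))

    def issue(self, dest):
        self.busy[dest] = True      # result not yet available

    def complete(self, dest):
        self.busy[dest] = False     # result written back


sb = Scoreboard()
sb.issue(4)                          # a slow load into r4 is in flight
print(sb.can_issue([4, 6], 5))       # add r5 = r4 + r6 must wait on r4
print(sb.can_issue([8, 9], 7))       # add r7 = r8 + r9 is independent
```

The second add does not touch r4, so it can go ahead while the first one waits, which is the out-of-order behavior the paragraph describes.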

The i960CF has 256-bit-wide buses to move instructions to the decoder and 128-bit-wide buses to move data between the cache and registers. The chip has three pipeline stages: fetch, decode, and execute.

Superscalar i960s are built around a six-ported register file, with execution units gathered into two groups--register and memory control. These units include floating-point integer, floating-point MPY/DIV, integer MPY/DIV, integer, and interrupt control units (on the register side); and address-generation, MMU, and bus-controller units on the memory side. Instructions are cached in a lockable cache; later versions add an instruction cache to supplement the register cache.

Military i960s include an MMU for protection and virtual addressing. Superscalar MX and MM versions opened up the CA's 32-bit multiplexed/nonmultiplexed bus. The controllers implement two buses: a slow 32-bit multiplexed system bus (1/2 system clock) for I/O and a 64-bit backside bus for local-memory access (23-bit address, up to 1 Mbyte).

The CA and later versions have a four-channel DMA controller that also functions as a classic I/O controller to optimize I/O. The controller handles data chaining (gather/scatter), data packing, and unpacking.

Military and first-generation i960s have hardware multitasking/multiprocessing capabilities. Interagent communications define messaging systems for multiple CPUs. The MX CPU queues tasks for execution; communications ports asynchronously exchange parameters between CPUs. Atomic instructions synchronize CPUs.

VARIATIONS/SPECIAL FEATURES
Intel is the developer and sole supplier.

i960KA/KB/MC: 16 to 25 MHz, 512-byte instruction cache, 64x32-bit register cache, 32-bit multiplexed external bus, four-word burst, four external interrupts, and FPU (KB only). MC is a MIL-TEMP-range version. $18 to $25 (excludes MC version).

i960SA/SB: Stripped-down KA/KB. 10 to 16 MHz, 512-byte instruction cache, 64x32-bit register cache, multiplexed external bus (32-bit address, 16-bit data), four-word burst, four external interrupts. SB adds FPU. $9.75 to $16.

i960CA/CF: Superscalar i960. 16 to 40 MHz, up to 4-kbyte instruction cache, 1-kbyte data RAM, 1-kbyte data cache, register cache (up to 15 sets in data RAM), four DMA channels, DMA controller with data chaining, packing, and unpacking. A 32-bit nonmultiplexed external bus has dynamic sizing. Interrupt controller (248 possible interrupts), 16 separately configurable physical-memory regions. $44 to $176.

i960MM/MX: Expanded CA for military only. Superscalar; has FPU and MMU. 25 MHz, 2-kbyte instruction cache, 2-kbyte data cache, MX has 512-byte register cache (eight local sets), two external buses (32-bit system bus, 64-bit backside bus), MIL-TEMP range. MX is also JIAWG-compatible, has multitasking capability; MM available on evaluation board.

i960JA/JF/JD: Latest version. Low-power versions with KA bus interface and clock-doubling capability, 16 to 50 MHz, 4-kbyte instruction cache, 2-kbyte data cache. From $25 to $58.

Support
HARDWARE The i960 has built-in debugging features. The chips have two to four hardware breakpoint registers and two instructions for setting software breakpoints. ICEs and logic analyzers are available for the processors. Intel provides evaluation boards for all i960s. SOFTWARE The i960 has a range of development and operating software, including real-time kernels and operating systems. Compilers include C, Ada, and Fortran. Intel has developed a C compiler that builds a profile of each program and task and then optimizes the code based on the code's operational characteristics.


Intel Pentium Processor

Pentium is Intel's superscalar challenge to workstation RISCs: It bridges the gap between low-end PCs and high-end workstations. Pentium delivers superscalar, second-generation RISC performance. Running at 100 MHz, Pentium benchmarks at 100 SPECint92 and 80.6 SPECfp92, lagging only RISC CPUs such as DEC Alpha and SGI Mips R4400. Pentium also supplements 386 and 486 processors, which is especially timely as both are under heavy assault by x86-chip-clone vendors.

Pentium comes with support chips for the PCI graphics/local bus and multiprocessor implementations. Intel's 82430 PCIset implements an ISA or EISA system bus, with a 132-Mbyte/sec PCI local bus. For basic implementations, Intel supplies a cache controller and special SRAM chips that optimize the interconnect to the CPU and bus controller. Other vendors supply a raft of support chips, including those for local-bus implementations. Pentium is code-compatible with earlier x86 generations, including 386 and 486 CPUs.

Pentium designers faced two problems: code compatibility with earlier x86 CPUs and achieving third-generation RISC performance. Designers handled both issues by implementing the complex x86 instruction set but stressing simple instruction execution over the more complex ones. In Pentium, the simple, RISC-like register-to-register instructions drive the implementation; the microcoded complex instructions are second priority. Intel engineers also raised floating-point performance--multiplies and divides take 3 and 39 clocks, respectively.

Pentium achieves a peak issue rate of two instructions per clock: It has two four-stage pipelines, U and V, one for each instruction. These pipes are front-ended by a common instruction fetch/align stage. The pipelines are not symmetric; the U pipe takes precedence over the alternate pipe (V). If the second instruction does not cause interlocks with the first (for example, using results from the first instruction or writing into the same register/data), the second instruction is scheduled for the V pipe.
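The pairing decision can be illustrated with a dependency check between two adjacent instructions. This is only a sketch of the register-interlock part of the rule: the `Instr` type and `can_pair` function are hypothetical, and the actual Pentium pairing rules include further restrictions (instruction type, encoding, and so on) that are not modeled here.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    text: str
    reads: FrozenSet[str]    # registers the instruction reads
    writes: FrozenSet[str]   # registers the instruction writes


def can_pair(u: Instr, v: Instr) -> bool:
    """True if v may issue in the V pipe alongside u in the U pipe.

    Checks only register interlocks: v must not read a result u produces
    (read after write) and must not write the same register (write after write).
    """
    read_after_write = u.writes & v.reads
    write_after_write = u.writes & v.writes
    return not read_after_write and not write_after_write


mov = Instr("mov eax, ebx", frozenset({"ebx"}), frozenset({"eax"}))
add_indep = Instr("add ecx, edx", frozenset({"ecx", "edx"}), frozenset({"ecx"}))
add_dep = Instr("add ecx, eax", frozenset({"ecx", "eax"}), frozenset({"ecx"}))

print(can_pair(mov, add_indep))  # independent registers
print(can_pair(mov, add_dep))    # needs eax from the first instruction
```

When the check fails, the second instruction simply waits and issues in the U pipe on the next cycle, so correctness never depends on pairing.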

The U and V pipelines feed from a common instruction fetch/align stage that fetches multiple instructions from the cache. A full line (256 bits) fetches and passes to the instruction decoder. Each pipeline has two decoder stages to decode simple and complex instructions. The wide cache-to-decoder path, coupled with a two-stage decode, enables Pentium to decode the x86's variable-length instructions and still deliver competitive performance. Instead of the 486's unified 8-kbyte cache, Pentium has two separate 8-kbyte, two-way set-associative data and instruction caches.
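The cache geometry follows from the figures above: an 8-kbyte, two-way set-associative cache with 256-bit (32-byte) lines has 8192 / (32 x 2) = 128 sets. The sketch below splits a 32-bit address accordingly; the function and constant names are illustrative, and the exact bit assignment inside the chip is an assumption based on the usual offset/index/tag layout.

```python
CACHE_BYTES = 8 * 1024      # 8-kbyte cache
LINE_BYTES = 256 // 8       # 256-bit line = 32 bytes
WAYS = 2                    # two-way set-associative

NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)    # 128 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1        # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1           # 7 bits select the set


def split_address(addr: int):
    """Return (tag, set index, byte offset) for a 32-bit address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```

Two addresses that share an index but differ in tag compete for the same two ways, which is where the two-way associativity helps over a direct-mapped design.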

For superscalar dual-instruction load/store operations, the Pentium data TLB and cache tags are dual ported for concurrent pipeline accesses. The data cache SRAM is eight-way interleaved, allowing concurrent accesses to different memory banks (the cache is actually triple-ported, with an extra port for snooping). Cache hit rates range from 90 to 97%, depending on the application code mix. The data cache handles both 4-kbyte and 4-Mbyte pages. It has two four-way set-associative TLBs: one with 64 entries for 4-kbyte pages, and one with eight entries for 4-Mbyte pages. The code cache is also two-way set-associative, with a four-way set-associative, 32-entry TLB that handles both 4-kbyte and 4-Mbyte pages.

For the first time, x86 designers are using dynamic branch prediction to help speed execution. This feature allows the CPU to determine the branch to take as opposed to static branching, where the compiler predetermines potential branches. In Pentium, a 256-entry branch target buffer (BTB) holds branch target addresses for previously executed branches, unlike some implementations that hold the actual target instruction(s). The BTB supplies the next instruction address that the last execution of a branch instruction took. Each BTB entry integrates the target address with special history and operation bits. Intel claims that a correctly predicted branch will take a single pipeline cycle and won't cause a pipeline bubble. Simulations show performance increases of up to 25% using the BTB.
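A branch target buffer of this kind can be sketched as a small table of target addresses plus history bits. The model below is a sketch under loudly stated assumptions: the 256-entry size comes from the text, but the direct-mapped indexing, the two-bit saturating counter, and all names are illustrative choices, not Intel's documented design.

```python
BTB_ENTRIES = 256


class BranchTargetBuffer:
    """Toy BTB: each entry holds (branch address, target address, 2-bit history)."""

    def __init__(self):
        self.entries = [None] * BTB_ENTRIES

    def predict(self, pc):
        """Return the predicted target, or None to predict fall-through."""
        entry = self.entries[pc % BTB_ENTRIES]
        if entry is not None and entry[0] == pc and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, pc, taken, target):
        """Record the actual outcome of the branch at pc."""
        slot = pc % BTB_ENTRIES
        entry = self.entries[slot]
        if entry is None or entry[0] != pc:
            if taken:                       # allocate only for taken branches
                self.entries[slot] = (pc, target, 2)
            return
        history = min(entry[2] + 1, 3) if taken else max(entry[2] - 1, 0)
        self.entries[slot] = (pc, target if taken else entry[1], history)


btb = BranchTargetBuffer()
mispredicts = 0
for i in range(10):                         # loop branch: taken 9 times, then exits
    taken = i < 9
    predicted_taken = btb.predict(0x1000) is not None
    mispredicts += predicted_taken != taken
    btb.update(0x1000, taken, 0x0F00)
print(mispredicts)
```

In this model a ten-pass loop mispredicts only on the first pass (no entry yet) and on the exit, which is the behavior that makes a BTB pay off for loop-heavy code.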

Pentium's FPU features an eight-stage pipeline, which shares the first five stages of the U and V pipeline. Data transfers to or from the FPU use a wide 64-bit data path to the data cache to keep the FPU pipeline fed. Pentium doubles the 486's memory bus width from 32 to 64 bits. Pentium's bus has a 32-bit address bus and a 64-bit data bus. Pentium adds a write buffer to each pipeline to avoid write contention.

Similar to the 486, Pentium uses burst reads to fill a cache line, in this case 256 bits wide (four 64-bit transfers). It also has burst write-back writes. The memory interface is pipelined, allowing a second bus cycle to set up while the first bus cycle completes. Pentium reads or writes a 64-bit double word each cycle in burst mode. Similar to the 486, most simple instructions appear to execute in a single cycle (pipelined).

VARIATIONS/SPECIAL FEATURES
Intel is the developer and sole supplier of Pentium chips, which sell for $675 and $995 at 60 and 100 MHz, respectively (1000). NexGen's Nx586, although similar to the Pentium, contains a 16-kbyte I cache, a 16-kbyte D cache, and a built-in level-2 cache controller. The Nx586 lacks an FPU but provides a dedicated external bus to attach the optional Nx587 FPU. The 60- and 66-MHz versions cost $299 and $389, respectively; the accompanying FPU (587) sells for $99.

Support
HARDWARE Similar to the 486, the Pentium has on-chip debug features that enable external hardware to set breakpoints to monitor program execution. SOFTWARE The Pentium is code-compatible with earlier 386 and 486 processors. However, for optimized performance, the object code should be compiled with Pentium compilers.


Mips R3000

Developed initially at Stanford University, the 32-bit Mips RISC architecture was one of the first commercial RISC µPs. A classic RISC design, Mips µPs were the speed demons of first-generation RISC processors but burdened system designers with a complex off-chip dual-caching scheme. These µPs are known for their sophisticated design and advanced code optimization that their compilers achieve. Licensed chip vendors have used the Mips core and architecture as a base for embedded processors, many adding on-chip caches and reducing pinouts. R2000/3000 Mips CPUs generally use an off-chip FPU; some R3000 versions have an on-chip FPU.

Mips processors are built around a set of 32-bit, general-purpose registers in a central register file. To minimize control logic, the instruction set is reduced to 73 instructions, addressing options are limited, and the chip has a three-address, load/store architecture. Similarly, instruction sizes are fixed to one 32-bit word to minimize decoding and speed processing.

Like many early RISC µPs, Mips CPUs balanced throughput against complex instructions, including multiply and divide. The R2000 had limited multiply and divide capabilities. Later Mips CPUs added a full multiply but had limited divide capability. The CPUs generated all addresses and handled the memory interface control for up to three additional external coprocessors. Initially, Mips chips used an external FPU but later R3000 versions have brought the FPU on chip.

Mips engineers used a five-stage pipeline for the R3000. The pipeline lets up to five instructions execute concurrently--each at a different stage of its instruction cycle, thus giving the effect of single-cycle execution. The pipeline stages are instruction fetch (IF), read operands and decode instruction (RD), execute (ALU), access data memory (MEM), and write-back results (WB). A branch-delay slot minimizes branch effects. The compiler fills the instruction slot following the branch with an NOP or an instruction from the current thread that can be executed before the branch takes effect.
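The overlap that produces apparent single-cycle execution can be tabulated directly. The sketch below assumes an ideal pipeline with no stalls or branches; the function name and the dictionary layout are illustrative choices.

```python
STAGES = ["IF", "RD", "ALU", "MEM", "WB"]


def pipeline_schedule(num_instructions):
    """For each clock cycle, map stage name -> index of the instruction in it.

    Ideal case only: one instruction enters IF per cycle and nothing stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth           # instruction that entered IF 'depth' cycles ago
            if 0 <= instr < num_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(5)):
    print(cycle, row)
```

Five instructions finish in nine cycles rather than 25, and once the pipeline is full (cycle 4 in the output) one instruction completes every clock.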

Even though the original Mips R3000 processor lacked an on-chip cache, all of the R3000 derivatives include on-chip instruction and data caches. The on-chip caches help exploit the high-speed pipeline while maintaining a cost-effective DRAM-based memory system. Instruction caches vary from 2 to 16 kbytes. Data caches vary from 512 bytes to 8 kbytes. The new R4400 extends this concept with 16-kbyte instruction and 16-kbyte data caches.

VARIATIONS/SPECIAL FEATURES
Mips Technologies Inc, now part of Silicon Graphics, develops Mips chips and licenses them for manufacturing. Licensed vendors include IDT, LSI Logic, NEC, NKK, Siemens, and Toshiba. Some vendors have architectural licenses, allowing them to modify chip designs.

LSI Logic sells the R3000 as an embedded processor or core along with other library cores, targeting low-power or multiple-core designs. The library includes a basic CPU core and building blocks: configurable instruction and data caches (direct mapped or two-way set associative); four-deep write buffer; DRAM controller (supports synchronous DRAM and interleaving); timers/counters; and wait-state generator. LSI also introduced a family for the ATM marketplace, called "Atomizer."

LR33300/33310: Embedded R3000; static design; 20 to 50 MHz, 4- to 8-kbyte instruction cache, 2- to 4-kbyte data cache, three counter/timers, DRAM controller (supports synchronous DRAM and interleaving), four-deep write buffer, configurable DRAM or SRAM, lockable instruction cache entries, nonmultiplexed bus. $30 to $75.

LR33020/33120: Embedded R3000 for X-terminals; reimplemented core; static design; 25 to 40 MHz; 4-kbyte instruction cache; 1-kbyte data cache; graphics coprocessor with bitblt processor and DMA channel. $55 to $90.

LR33050: Embedded R3000, on-chip FPU, 25 to 33 MHz; 4-kbyte instruction cache, 1-kbyte data cache, timer/counters, DRAM controller, four-word write buffer, burst mode, nonmultiplexed external bus. $129 to $135.

Integrated Device Technology's R3041: Embedded R3000, 16 to 33 MHz, programmable memory interface and PROM boot options (8, 16, or 32 bit), 2-kbyte instruction cache, 512-byte data cache; multiplexed bus. $12 to $22.

R3051/52: Embedded R3000, 20 to 40 MHz, 4/8-kbyte instruction cache, 2-kbyte data cache, four-deep R/W buffer to memory, multiplexed external bus. $28 to $52.

R3071/81: Embedded R3000, 20 to 50 MHz, configurable cache (16-kbyte instruction/4-kbyte data or 8-kbyte each), four-deep R/W buffer, multiplexed bus, floating point unit (81 only). $46 to $99.

Support
HARDWARE The Mips design is a straightforward minimal architecture with bare-bones test features. Hardware tools include ICEs or logic analyzers for most Mips chips. Most chip vendors sell evaluation boards for test and development. SOFTWARE Mips provides native- and cross-development tools, including a system simulator, cache-design and optimization tools, and host/target debugger software. Mips compilers effectively optimize code execution on the RISC CPUs. Unix and real-time kernels are available for Mips processors. Application software for embedded designs includes page-description languages, X-servers, and PROM monitors. LSI Logic provides a Coreware program.


Motorola 68000

Introduced in 1979, the 68000 line set new standards for 16/32-bit µPs and was the base for the original Apple Macintosh. Its straightforward register-based architecture, orthogonal instruction set, flexible addressing, and advanced memory interface molded a generation of hardware designers. The 68000 is actually a 16- and 32-bit mix: It has 32-bit registers for easy addressing, a 16-bit data path and ALU to conserve silicon, and 16-bit instructions. The nonmultiplexed external bus drives 24 address lines and addresses up to 16 Mbytes of memory. Motorola uses the core in gate arrays as well as the 68302 and 68306.

The 302 is a communications µC that includes a RISC CPU to process serial line data; the 306 services systems requiring serial communication and a DRAM interface. Low-power CMOS versions of the µP and application-specific chips are available. Toshiba and Philips/Signetics also offer µCs based on the 68000 CPU. The 68000 serves as a base for the 680x0 line of 32-bit µPs, which culminates with the 68060 (see the 68060 entry).

Programmers get eight general-purpose, 32-bit data registers, which the CPU can address by bit, BCD, byte, word, or double word. In addition to user and supervisor stack pointers, there are seven address registers. Other registers include the 32-bit PC and 16-bit status registers. The SR maintains status for the user and supervisor modes via a user byte and a supervisor byte.
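The split of the status register into a user byte and a supervisor byte can be made concrete with a small decoder. The bit positions below are not given in this directory; they follow the layout in Motorola's 68000 programmer's documentation (condition codes in the low byte, interrupt mask, supervisor, and trace bits in the high byte), and the function name is illustrative.

```python
def decode_sr(sr: int) -> dict:
    """Decode a 16-bit 68000 status register value into named fields."""
    return {
        # Supervisor byte (bits 15..8)
        "trace": bool(sr & 0x8000),
        "supervisor": bool(sr & 0x2000),
        "interrupt_mask": (sr >> 8) & 0x7,
        # User byte (bits 7..0): condition codes
        "X": bool(sr & 0x10),   # extend
        "N": bool(sr & 0x08),   # negative
        "Z": bool(sr & 0x04),   # zero
        "V": bool(sr & 0x02),   # overflow
        "C": bool(sr & 0x01),   # carry
    }


# 0x2700: supervisor mode with all maskable interrupts blocked
print(decode_sr(0x2700))
```

Because user-mode code may alter only the low byte, an OS can hand condition-code access to applications without exposing the interrupt mask or the supervisor bit.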

The processor's user and supervisor modes are implemented in hardware, which eases having a control kernel or OS manage multiple application tasks. The supervisor mode handles interrupt servicing and system functions; the user mode handles application processing. The chip restricts privileged instructions to supervisor mode. These instructions include RESET, STOP, and moves and operations on the status register. To support the user and supervisor modes, the hardware implements separate stacks and pushes and pops PC and SR onto the stack for exceptions. A link instruction lets you build linked lists on private stacks.

The CPU has no memory controller, but the separate address and data buses eliminate the need for buffering addresses. However, the CPU needs logic to generate the required DTACK* signal, which marks the successful completion of a memory cycle. An address decoder is necessary for multiple memory chips, and drive buffers may be needed to buffer bus address and data lines (integrated versions of the 68000 contain this logic). If DTACK* is late, wait states are generated. The 68000 (except EC and integrated versions) has three control signals for interfacing synchronous M6800 peripheral devices with the asynchronous MC68000.

A classic CISC machine, the 68000 has two microcode levels: microcode and second-level expanded nanocode. Instruction execution triggers a chain of 10-bit microcode words. Each microcode word can reference another word for, say, a jump in microcode or a string of 70-bit nanocode words that directly drive the CPU logic. A special instruction lets you move up to 16 registers to or from an effective address, including blocks of data registers to or from address registers.

VARIATIONS/SPECIAL FEATURES
Motorola developed and sells the 68000. Second-source licensees include Hitachi, Philips/Signetics, SGS-Thomson, and Toshiba.

Motorola 68HC000: 8 to 20 MHz, $4.50 to $9.

68EC000: 68000 without M6800 support, 8 to 20 MHz, static versions available, $3 to $7.25.

Toshiba 68301: 12 to 16.7 MHz, three UARTs, 16-bit parallel interface, three 16-bit timers, three external interrupts, address decoder with chip selects, $11 (50,000).

68303: UART, five-channel timer with watchdog, three external interrupts, address decoder with external chip selects, three-channel DMA, DRAM controller, programmable wait states, two-channel, four-phase stepper motor controller, $16 (50,000).

68305: Two UARTs, 16-bit timer, four external interrupts, address decoder with external chip selects, two-channel DMA controller, $15 (50,000).

Note:

  1. Serial processor is an integrated RISC processor with three serial communications controllers (SCCs), two serial management controllers, SPI, six serial DMA channels for the SCCs, baud-rate generators, and 1.1-kbyte dual-ported RAM.

Support
HARDWARE The 68000 has tools ranging from evaluation boards to low- and midrange ICEs and logic analyzers. The best book describing 68000 hardware is Motorola's The 68000 Family, Volume One by Werner Hilf and Anton Nausch, Prentice-Hall, Englewood Cliffs, NJ, 1989. SOFTWARE Integrated cross-development environments, assemblers, compilers, debuggers, and simulators are all available. Operating software, ranging from small real-time kernels to sophisticated operating systems, runs on the 68000; many of these packages have networked development tools. The architecture supports more than 6000 32-bit software applications.



Motorola 680x0

Motorola's 68k µPs were early leaders in mid- to high-end µP applications. In the late 80s, the 68k lost workstations and servers to RISC processors, but the 32-bit 680x0 family is now firmly rooted in mid- to high-end embedded monitor/control systems.

Heading the 680x0 lineup is the superscalar 68060 with its dual integer and floating-point pipelines. The 68040, which combines RISC and CISC technologies, has dual 4-kbyte caches, an FPU, and an MMU. Still in the game are the 68020 and 68030 CISC implementations with smaller caches. Motorola has introduced stripped-down 68EC0x0 versions and the 68LC040, both of which have a simpler external bus and no FPU.

The 680x0 architecture, a 32-bit extension of the 16-bit 68000 µP, is built around 16 general registers with a 68000-compatible, orthogonal instruction set. Like the 68000, the family handles multiple register moves, bit and stack processing, and 18 addressing modes.

The processing features the 68020/30/40 add to the 68000 include floating-point processing, memory management, dynamic bus sizing, and on-chip caches. The 68020 and the 68030 rely on FPU coprocessors (68881, 68882); the 68040 integrates an FPU on chip.

The 680x0 has more registers than the original 68000--control registers were added to control the MMU and the FPU as well as support additional processing capabilities. For example, the 68040 adds eight 80-bit floating-point registers and 12 control registers, which include a vector-base register (points to interrupt vector table), cache-control register, user and supervisor root pointers, and translation registers.

The 680x0 is basically a CISC architecture having many complex instructions, and the ALU operates on both register and memory data. The 68040 also has RISC-like implementation features: It has a six-stage pipeline (fetch, decode, effective-address calculate, effective-address fetch, execute, and writeback). To speed processing, it has two on-chip 4-kbyte direct-mapped caches and separate data and instruction MMUs, which allow simultaneous address translations. Bus snooping is built into the 040's caches to ensure cache coherency for multiprocessing. Both write-through and copy-back modes are built into the cache. The 030 and 040 implement burst mode, moving up to 16 bytes in a single addressing block between registers and memory.

The 040 delivers apparent single-cycle execution for some instructions, mainly register operations such as memory-to-register moves (if the data is in the data cache). A taken branch takes two cycles; a not-taken branch takes three. Unlike the 020/030, the 040 doesn't do dynamic bus sizing. Instead, it has a highly reliable bus with a high-drive option and implements a synchronous, two-clock R/W protocol. A four-word burst takes five clocks. Multiprocessor bus arbitration is built into the 040 but requires off-chip logic. The CPUs have special instructions for variable-length bit fields, for moving 16 registers, and for compare and swap, which locks memory for multiprocessing. A scaling option addresses data by item size for table access, FPU, and MMU commands.
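The role of a compare-and-swap instruction in multiprocessing can be shown with a software model. This is a sketch only: the `Memory` class and its method names are hypothetical, and a Python lock stands in for the locked read-modify-write bus cycle that the hardware performs.

```python
import threading


class Memory:
    """Toy shared memory whose compare_and_swap is indivisible."""

    def __init__(self):
        self.cells = {}
        self._bus_lock = threading.Lock()   # stands in for the locked bus cycle

    def compare_and_swap(self, addr, expected, new):
        """If the cell equals 'expected', store 'new'. Return (success, value seen)."""
        with self._bus_lock:
            current = self.cells.get(addr, 0)
            if current == expected:
                self.cells[addr] = new
                return True, current
            return False, current


def atomic_increment(mem, addr):
    """Retry loop: read, compute, and commit only if nobody else got in first."""
    while True:
        old = mem.cells.get(addr, 0)
        success, _ = mem.compare_and_swap(addr, old, old + 1)
        if success:
            return old + 1


mem = Memory()
workers = [
    threading.Thread(target=lambda: [atomic_increment(mem, 0x100) for _ in range(100)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(mem.cells[0x100])
```

The retry loop never holds a lock across the computation; a processor that loses the race simply rereads and tries again, so a stalled CPU cannot block the others.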

VARIATIONS/SPECIAL FEATURES
Motorola developed the 680x0 family and is the sole source.

68020/68EC020: 16 to 25 MHz; 256-byte instruction cache; nonmultiplexed, 32/24-bit data and address bus; dynamic bus sizing; three-clock synchronous bus cycle; coprocessor interface for FPU (68881, 68882). $21 to $33/$9 to $11.

68030/68EC030: 16 to 50/25 to 40 MHz; 256-byte instruction cache; 256-byte data cache; paged MMU (not EC); nonmultiplexed, 32-bit data and address bus; dynamic bus sizing; bus controller supports three-clock asynchronous, two-clock synchronous, and burst mode (to 16 bytes); coprocessor interface for FPU (68881, 68882). $35 to $45/$25 to $35.

68040/68EC040/68LC040/040V: 20 to 40 MHz; 040V is 3.3V and static; Harvard architecture; 4-kbyte instruction cache; 4-kbyte data cache; two-clock synchronous bus cycle; independent MMUs for instruction and data (not EC); nonmultiplexed, 32-bit bus; FPU (only 68040); multimaster/multiprocessor support with bus snooping for cache coherency; selectable output impedance. $154 to $184/$50 to $83/$80 to $128/$62 to $92, respectively.

Support
HARDWARE The 68000 family, including the 680x0 µPs, has built up a large tool base, including ICEs and logic analyzers. Evaluation boards are available from Motorola and third-party vendors. 680x0 workstation/servers are also available for cross development. 680x0 boards are a standard VMEbus platform and are used as both target and host-development systems. SOFTWARE The 680x0 family has a plethora of operating and development software. Unix and other operating systems, as well as real-time kernels, run on 680x0s. PC, workstation, and native development systems, compilers, and simulators are available. Compilers include C, C++, Fortran, Ada, Forth, Pascal, Modula-2, Basic, Cobol, Lisp, PL/I, Prolog, and RPG. Native- and cross-development assemblers and linkers are also available.


Motorola 68060

The newest member in Motorola's continuum of 680x0 devices, the 68060 diverges from the rest of the family by adopting a superscalar architecture that allows it to execute up to two instructions per clock. Although the 68060 was originally targeted at Apple Macintosh and VME board upgrades, Motorola is now focusing it on embedded applications such as networking, telecommunications, and control. The 68060, although maintaining software compatibility with the well-recognized 68000 ISA, delivers up to three times the performance of its predecessor, the 68040.

As instructions enter the CPU, they flow into a four-stage prefetch pipeline: instruction address generation, instruction cycle, instruction early decode, and instruction buffer. It's in this pipeline that the 68000-compatible, variable-length CISC instructions convert to a 32-bit fixed-length instruction. Once converted, these instructions enter dual, four-stage integer execution pipelines that operate synchronously. The four stages of the execution pipeline are decode, effective address calculation, fetch, and integer execution.

To increase the likelihood of a cache hit, the 68060 has doubled the cache size of the 68040--there are separate 8-kbyte instruction and 8-kbyte data caches. The Harvard architecture allows simultaneous instruction fetches and data accesses. The on-chip caches are four-way set associative with four-way interleaving to support simultaneous read and write operations. Portions of the caches can be frozen to prevent reallocation.

A 256-entry, four-way set-associative, on-chip branch cache allows taken and nontaken branches to execute in zero and one clock, respectively. The branch cache unit contains state bits that provide a history of branch executions, which helps to predict branch direction.

The 68060 also contains an IEEE 754-compliant floating-point unit (FPU) that is 100% compatible with the 68040 FPU programming model. The operand-execution pipeline dispatches instructions to the FPU and allows for some execution overlap between the integer and FP engines.

Externally, the 68060 bus is a superset of the 68040's bus. Additional signals support higher-performance system designs, but the processor can easily operate on an existing 68040-based bus. An on-chip MMU with separate instruction and data TLBs allows the 68060 to access up to 4 Gbytes of memory.

To support power management, the 68060's functional units respond to dynamically controlled clocking: the caches and execution units power down when not accessed. The static design allows the external clock to be reduced or stopped. And an LPSTOP instruction (low power stop) disconnects most of the chip from the CLK pin.

The 68060 has a special MOVE16 instruction to perform a 16-byte block move and a PLPA instruction that loads a physical address by doing a logical address translation.

VARIATIONS/SPECIAL FEATURES
Motorola developed the 68060 and is the sole source. All versions operate at 3.3V.

68060: 50/66 MHz. $263.

68LC060: 50/66 MHz; no FPU. $169.

68EC060: 40/50/66 MHz; no FPU; no MMU. $150.

Support
HARDWARE HP and Huntsville Microsystems provide support tools for the 68060 that include an ICE and logic analyzer. As it has done with all its other 680x0 products, Motorola provides an evaluation board. SOFTWARE Compilers and assemblers are available from companies such as Diab Data, Intermetrics Microsystems Software, JMI Software Systems, and Microtec Research/Ready Systems. Microware Systems has ported OS-9 to the 68060.


Motorola 683xx

The 683xx family represents a popular option for embedded µPs: integrating existing CPUs with mixtures of peripherals--some extremely complex. The peripherals handle classic peripheral functions and many also offload the CPU by doing their own processing. For most of the 683xx family, Motorola combined a stripped-down 68020 core with a 16-bit on-chip InterModule bus that links the CPU with a device's complex peripherals. Five family members, the 68302, 306, 307, 322, and 356, use the 16-bit 68EC000 CPU as their core.

The core processor, the CPU32 or CPU32+, is the 68020 CPU stripped down for embedded control--no MMU or FPU interface--combined with a 16- or 32-bit data bus, respectively. The 32-bit processor has eight general-purpose 32-bit registers; seven 32-bit address registers; a 32-bit ALU; separate user and supervisor modes, each with its own SP and address space; and separate address and data spaces. The CPU32 is code-compatible with the 68020 but has enhanced addressing modes, including scaled index, address register indirect with base displacement and index, PC relative, and 32-bit branch displacements. Post- and preincrement/decrement options simplify iterative code. Peripheral-control registers and I/O are memory mapped; the CPU accesses them as addresses in memory.

All 683xxs have a system-integration module featuring system configuration, oscillator and clock dividers, reset and power-down mode control, chip selects and wait states, 8051-compatible external bus interfaces, parallel I/O with interrupt capability, interrupt configuration/response, software watchdog, and a JTAG port (only some). The external bus interface has up to 32 address and 16 data lines (32 for CPU32+) and up to 12 programmable chip-select lines.

The 683xx isn't easily classified as a µP or µC. The 68332 µP with 2 kbytes of on-chip RAM needs external memory, but the 68F333 has 48 kbytes of flash memory and 4 kbytes of RAM, making it a µC. The 68340 has no on-chip memory. Most 683xx chips run from external memory, which classifies them as µPs rather than µCs. The memory interfaces are 8, 16, or 32 bits wide for data and from 24 to 32 bits wide for external addressing lines; furthermore, the chips have dynamically sizable data bus widths.

68020 instructions not supported include BCD pack/unpack, bit field, compare and swap, coprocessor, MMU, module call/return (memory indirect addressing also not supported). New instructions include a table look-up and interpolate and the ability to put the chip into a low-power standby mode.
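A table look-up and interpolate operation can be sketched as follows. The operand encoding used here (upper byte of a 16-bit value selects the table entry, lower byte is a fraction in 1/256 steps) is an assumption for illustration and should be checked against Motorola's CPU32 documentation; the function name is hypothetical, and rounding behavior is not modeled beyond integer floor division.

```python
def table_lookup_interpolate(table, x):
    """Linear interpolation between adjacent table entries.

    x is a 16-bit value: bits 15..8 pick entry n, bits 7..0 are the
    fraction (in 1/256 steps) of the way from entry n to entry n + 1.
    """
    index = (x >> 8) & 0xFF
    fraction = x & 0xFF
    low, high = table[index], table[index + 1]
    return low + ((high - low) * fraction) // 256


# A coarse three-point curve, e.g. a sensor linearization table
curve = [0, 100, 400]
print(table_lookup_interpolate(curve, 0x0080))   # halfway between entries 0 and 1
print(table_lookup_interpolate(curve, 0x0140))   # a quarter of the way from entry 1 to 2
```

Doing this in one instruction lets a controller keep a short table of breakpoints instead of a large dense table, which matters on parts with only a few kbytes of on-chip memory.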

VARIATIONS/SPECIAL FEATURES
Motorola developed the 683xx family and is the sole source.

68302: Protocol-control engine with serial processor module¹, 16 to 25 MHz, 3.3V version available, three multiprotocol serial communication channels, low-power modes, wait-state generator, memory chip select, 28 I/O pins. $15 to $22.

68306: 16.7 MHz, watchdog timer, two UARTs, DRAM interface, eight programmable chip selects, programmable interrupt controller. $9.

68307: 0 to 16 MHz (static), 3.3/5V operation, 16-bit parallel I/O, I²C serial bus, UART. $11.

68322: 16 MHz. $18.

68330: 8 to 25 MHz (static), 3.3/5V. $10.

68331: Queued serial module¹, general-purpose timer (two 16-bit timers). $18.

68332: 2-kbyte SRAM, queued serial module¹, TPU². $25.

68F333: 4-kbyte SRAM, 64-kbyte flash, queued serial module¹, TPU², ADC. $81.25.

68334: 1-kbyte SRAM, TPU, ADC. $26.73.

68340: 16 to 25 MHz (static), watchdog timer², 16-bit timer, two-channel DMA, two USARTs, 16 I/Os. $13.50.

68341: Integrated compact-disk interactive engine, 0 to 16 or 25 MHz, 3.3V, two-channel DMA, queued serial module¹, 16-bit timer/counter, two USARTs, RTC, seven external interrupts, power management. $17.50.

68349: 16 to 25 MHz (static), 3.3/5V, configurable 1-kbyte I cache, 4-kbyte RAM, two-channel DMA, two USARTs. $26.

68356: 0 to 25 MHz, watchdog, communication processor³, PCMCIA support, 16550 emulation block, power management, DSP56K CPU, full-speed memory-expansion port, synchronous serial interface, SCI. $65.

68360: 0 to 25 MHz (static), 2.5-kbyte RAM, DRAM controller, four 16-bit timers, two-channel DMA, seven external interrupts, communications processor module³. $45.

Notes:

  1. Queued serial module provides queued serial peripheral interface (a full-duplex synchronous line) and full-duplex UART. Queued peripheral interface queues and processes data. Module can repeat cycles, providing polling and loop processing. Has 64-byte SRAM for local storage of receive, transmit, and control data.
  2. Time processor unit (TPU) has its own microengine and ROM control store. Runs independently from CPU32. Handles 16 fully independent timer channels (each with its own I/O pin) and lets you program control and timing functions without using CPU resources--including PWM, stepper-motor control, input capture and output compare, pulse accumulation, frequency measurement, and other timer functions. The CPU passes data to the time processor unit via dual-ported SRAM.
  3. Integrated RISC processor with four serial communications controllers (SCCs), two UARTs, SPI, 14 serial DMA channels for the SCCs, four baud-rate generators, two TDM channels, parallel interface port, 2.5-kbyte dual-ported RAM.

Support
HARDWARE The CPU32 and CPU32+ based members of the 683xx family have a background mode that gives a host control via a built-in test port. In background mode, tools can control the processor and read and set register and memory values. Users can also patch in code and have the processor execute a code patch in another location for a given PC value. Essentially, the background mode builds some ICE-like features into the hardware. Some ICEs and logic emulators are available (Motorola has evaluation boards and an ICE for the chips). SOFTWARE Development software includes cross assemblers/linkers, C and Modula-2 compilers, and a simulator and software branch analyzer. Operating software includes at least four real-time kernels, including one full operating system, and an integrated host/target development environment.


Motorola/Apple/IBM PowerPC

Serving as a base for a family of RISC chips, the PowerPC inherits its core architecture from the POWER (Performance Optimized With Enhanced RISC) architecture. The PowerPC is the core for Apple's next-generation Macintosh. Both Motorola and IBM produce PowerPC RISC chips, and both have µC versions such as IBM's 403GA and Motorola's MPC505.

The ISA supports multiple microarchitecture implementations (both 32 and 64 bits). PowerPC 601 implements the 32-bit portion of the PowerPC architecture. Other family members include the 603 for low-power applications; the 32-bit 604 for mainstream workstations; and the upper-end, 64-bit 620 (for high-end servers).

The superscalar PowerPC 601 can issue up to three instructions/clock cycle (four for the 604). Three major functional units make up the 601 CPU: the Integer Unit for integer operations, the Floating Point Unit, and the Branch Unit (BU); they all execute concurrently. The 603 adds load/store and system units, and the 604 adds two additional integer units plus a load/store unit. The Instruction Unit fetches the instructions, queues eight instructions for decoding, and then issues them to the execution units. Supplementing the functional units and the Instruction Unit are an MMU, a 32-kbyte unified I and D cache, and a memory interface.

To minimize TLB exceptions, the 601 has a 256-entry TLB. It also has four-entry shadow TLBs for fast access to the most recently used entries. The 601 has a large on-chip cache; however, it is a single 32-kbyte unified cache that holds both instructions and data (the 603 and 604 are Harvard architectures with divided caches). The 601 gets high performance from this cache by queuing the cache line and minimizing sequential cache accesses. Unlike earlier RISC architectures, the PowerPC has a prefetch instruction that loads the cache with code or data needed downstream. This tactic avoids the heavy performance penalties exacted by cache misses in RISC CPUs.

Each line in the 601's cache comprises two eight-word sectors. The cache supports a write-back cache policy but can be programmed for other policies; it has a bus-snooping port. The memory unit supports single-byte R/Ws, as well as four double-word (64-bit) writes. It can queue up to two reads and three writes. The system also supports the MESI cache-coherency protocol for multiprocessing.
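
For readers unfamiliar with MESI, the following Python sketch models the textbook state transitions for a single cache line (Modified, Exclusive, Shared, Invalid). It is a simplified model under stated assumptions: the event names are invented for this example, and the code does not describe the 601's actual bus signals or any write-back traffic.

```python
# Textbook MESI transitions for one cache line; an illustrative model only.
M, E, S, I = "M", "E", "S", "I"


def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of a cache line.

    Events (names are hypothetical):
      "read"        local processor reads the line
      "write"       local processor writes the line
      "snoop_read"  another bus master reads the line
      "snoop_write" another bus master writes or claims the line
    """
    if event == "read":
        if state == I:
            # A miss loads the line Shared if another cache holds it,
            # otherwise Exclusive.
            return S if others_have_copy else E
        return state
    if event == "write":
        # Any local write leaves the line Modified.
        return M
    if event == "snoop_read":
        # Valid copies drop to Shared; an invalid line stays invalid.
        return I if state == I else S
    if event == "snoop_write":
        return I
    raise ValueError(f"unknown event: {event}")
```

The point of the Exclusive state is that a later local write can move the line to Modified without any bus transaction, because no other cache holds a copy.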

The 601 has both branch prediction and branch folding. The BU searches the bottom half of the instruction queue for branches and uses static prediction to cause the target instruction thread to be accessed for execution. The 604 provides dynamic branch prediction in the fetch, decode, and dispatch stages. The BU does branch folding, eliminating branch instructions where it can. The BU handles unconditional branches directly. Similar to the POWER architecture, the 601 has a condition code register (a 32-bit register that holds eight sets of 4-bit condition codes) that enables code to hold off on branch decisions based on condition codes from one or more operations.
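
To make the condition-register layout concrete, here is a small Python sketch that pulls one 4-bit field out of a 32-bit register value. It assumes that field 0 occupies the most significant four bits and that the bits within a field mean less-than, greater-than, equal, and summary overflow, in that order; those details follow the usual PowerPC convention rather than anything stated in this article, and the function names are hypothetical.

```python
def cr_field(cr, n):
    """Return 4-bit condition field n (0..7) from a 32-bit CR value.

    Assumption: field 0 sits in the most significant four bits.
    """
    if not 0 <= n <= 7:
        raise ValueError("field number must be 0..7")
    return (cr >> (28 - 4 * n)) & 0xF


def decode_field(bits):
    """Name the four bits of one field (assumed order: LT, GT, EQ, SO)."""
    return {
        "LT": bool(bits & 0x8),
        "GT": bool(bits & 0x4),
        "EQ": bool(bits & 0x2),
        "SO": bool(bits & 0x1),
    }


# Field 0 records "equal"; field 7 records "greater than".
cr = 0x20000004
flags0 = decode_field(cr_field(cr, 0))
```

Because each compare can target its own field, a compiler can schedule several compares early and let later branches test whichever field they need.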

The 403GA and MPC505 have similar, but fewer, execution units and only issue one instruction at a time. The two devices save power by not clocking idle functional units. Both have a typical interrupt latency of three to eight clocks, with a maximum of 39 clocks.

VARIATIONS/SPECIAL FEATURES

The PowerPC was jointly developed by Apple, IBM, and Motorola. A 604 version, which will be available in late 1994, has an estimated 160 SPECint92 and 165 SPECfp92 at 100 MHz.
601: 50/66/80 MHz; 32-kbyte unified cache, eight-way set associative; 85 SPECint92 and 105 SPECfp92 at 80 MHz. $154/$186/$280 (20,000).
603: Static; 66/80 MHz; separate instruction and data caches--each 8 kbytes, two-way set associative; dynamic power management with internal clock multiplier from one to four; 75 SPECint92 and 85 SPECfp92 at 80 MHz. $160/$199 (20,000).
MPC505: Static; 3.3V operation; dc to 25 MHz; 4-kbyte instruction cache; 4-kbyte RAM; watchdog; three timers; dynamically scalable clock; interrupt controller; programmable bus sizing; 12 programmable chip selects; JTAG; FPU; BIU controls DRAM, SRAM, ROM, and memory-mapped I/O; full-duplex serial port. $75 (1000).
403GA: 3.3V operation; dc to 25 MHz; 2-kbyte instruction cache; 1-kbyte data cache; programmable bus sizing; JTAG; BIU controls DRAM, SRAM, ROM, and memory-mapped I/O; RS-232C serial communications; four-channel DMA; memory protection unit; 32-bit programmable interval timer; fixed interval timer; watchdog. $49 (1000).

Support
HARDWARE Evaluation boards are available for the 601--Motorola sells one, and Yarc Systems (Newbury Park, CA) has a coprocessor evaluation system that links to PCs (Windows or DOS) and Macintosh computers. IBM has a PowerPC embedded tools program that involves 20 third-party vendors supporting both hardware and software development. SOFTWARE A set of development and application software is available, much of it ported from IBM's RS/6000 POWER-architecture line. Development tools include Ada, C/C++, and Fortran compilers. IBM's AIX (Unix) OS runs on the 601. Windows NT and Sun Solaris OSs also run on the PowerPC.


Ross HyperSPARC

Ross Technology, a subsidiary of Fujitsu Ltd, took the SPARC architecture in a new direction with hyperSPARC, a 32-bit superscalar RISC SPARC µP. The hyperSPARC was designed as a tightly coupled chip set and implemented as an MBus module or 131-pin PGA multidie package. Ross sells hyperSPARC to OEMs for integration into a variety of boards and systems and as an upgrade product for end users of SPARCstation 10, SPARCstation 20, and SPARCserver 600MP systems (MBus modules only).

Each MBus module or multidie package (MDP) has an integrated CPU that has a six-stage integer and six-stage FPU pipeline and can issue up to two instructions per clock cycle. The hyperSPARC follows version eight of the SPARC ISA spec. Operations center on a multiported register file that holds eight global registers and eight overlapping register windows; the file has 136 32-bit registers.

The hyperSPARC pipeline drives four functional units that can execute in parallel: branch/call, integer, load/store, and an FPU. The FPU has its own scheduler and instruction queue, as well as 16 64-bit registers. Floating-point instructions queue and issue to the floating-point add or multiply functional units. These floating-point units are pipelined and can accept a new instruction each cycle.

To minimize branch delays, the hyperSPARC relies on a branch-taken prediction algorithm. All branches are assumed to be taken, and the hardware fetches the branch-target instruction pair next for execution. If the branch is not taken, a two-cycle penalty is incurred. This algorithm works well for loops, where most branches are taken except for the last iteration.
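
The arithmetic behind the loop argument can be sketched in a few lines of Python. The model below charges the two-cycle penalty from the text for every not-taken branch and assumes, for simplicity, that a correctly predicted (taken) branch costs no extra cycles; the function names are hypothetical.

```python
NOT_TAKEN_PENALTY = 2  # cycles lost when an always-taken prediction is wrong


def branch_penalty(outcomes):
    """Total penalty cycles for a sequence of branch outcomes.

    outcomes is an iterable of booleans: True means the branch was taken.
    Assumption: taken branches are free in this simplified model.
    """
    return sum(NOT_TAKEN_PENALTY for taken in outcomes if not taken)


def loop_outcomes(iterations):
    """Outcomes of a loop-closing branch: taken on every pass but the last."""
    if iterations < 1:
        raise ValueError("a loop runs at least once")
    return [True] * (iterations - 1) + [False]


loop_cost = branch_penalty(loop_outcomes(100))  # one misprediction in 100 passes
worst_case = branch_penalty([False] * 100)      # code whose branches never go
```

Under these assumptions a 100-pass loop loses two cycles in total, while 100 branches that always fall through would lose 200.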

The CPU contains an on-chip 8-kbyte, two-way set-associative instruction cache with a four-word line size. A 128- or 256-kbyte off-chip unified secondary cache is an integral part of the integer pipeline; a miss in the on-chip cache costs a one-cycle penalty. The hyperSPARC CPU, cache controller, and cache chips interconnect via the Ross proprietary Intra-Module Bus (IMB), a 3.3V, 64-bit bus that runs at the CPU clock rate. The cache controller acts as an MMU with a 64-entry TLB by translating 32-bit virtual addresses to 36-bit physical addresses. Because the cache controller interfaces to the MBus, it acts as a bridge between the CPU and the system MBus, synchronizing operations between the CPU clock and the MBus clock. The cache controller also provides full symmetric multiprocessing support. Moreover, hyperSPARC's level of integration allows two complete CPU chip sets to fit on a single MBus-standard CPU module. This offers quad-processor performance for systems with two MBus sockets, such as the SPARCstation 10.
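
The address translation mentioned above can be pictured with a short Python sketch. It assumes 4-kbyte pages, so a 32-bit virtual address splits into a 20-bit page number and a 12-bit offset, and a 24-bit physical page number yields a 36-bit physical address. The page size, the dictionary standing in for the TLB, and the function name are assumptions for illustration; the article does not describe the controller's internals.

```python
PAGE_BITS = 12                      # assumed 4-kbyte pages
OFFSET_MASK = (1 << PAGE_BITS) - 1


def translate(tlb, vaddr):
    """Translate a 32-bit virtual address using a TLB modeled as a dict.

    tlb maps a virtual page number to a physical page number.
    Returns the physical address, or None on a TLB miss.
    """
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & OFFSET_MASK
    ppn = tlb.get(vpn)
    if ppn is None:
        return None                 # miss: a real MMU would walk page tables
    return (ppn << PAGE_BITS) | offset


# One mapping: virtual page 0x12345 lives in physical page 0xABCDEF
tlb = {0x12345: 0xABCDEF}
paddr = translate(tlb, 0x12345678)
```

The page offset passes through unchanged; only the page number is replaced, which is why the physical address can be wider than the virtual one.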

The MBus is the standard SPARC system bus: a 40- or 50-MHz, 64-bit, synchronous, multiplexed bus that carries 36-bit addresses and 64-bit data. SBus, a 25-MHz synchronous I/O or mezzanine bus for small peripheral and function cards, supplements MBus. MBus modules offer users an easy way to upgrade systems and provide a tightly controlled environment for CPU and cache layout. MBus modules can hold multiple CPUs, and the MBus Level 2 spec defines using multiple masters and arbitration for shared-memory multiprocessing. The spec also defines the MOESI cache-coherency protocol for multiprocessing.

VARIATIONS/SPECIAL FEATURES

Ross Technology is the sole developer and supplier of hyperSPARC. Currently, Ross offers hyperSPARC to OEMs for system integration and as upgrades for Sun SPARCserver 600 and SPARCstation 10 and 20 systems. The upgrades include an installation kit with software patches (required for Solaris 1.1 only) and a substitute boot PROM.
RT6221K/6224K: MBus module with single CPU and cache controller; 128/256-kbyte secondary cache. $1784/$1982 (1000).
RT6226K: MBus module with two CPUs; two cache controllers; 256-kbyte secondary cache. $3767 (1000).
RT629: Multidie package with single CPU; 256-kbyte second level cache. $1843 (1000).
Upgrade kits at 72 MHz/list price: RTx00S (single CPU), $4190; RTx00D (dual CPU), $7660; RTx00Q, $14,350.

Support
HARDWARE At least three vendors supply VMEbus boards carrying the hyperSPARC CPU. SOFTWARE hyperSPARC complies with the SPARC version eight spec, MBus, and Reference MMU. The CPU runs all SPARC application software. SunSoft's Solaris 1.1 (SunOS 4.1.3), 1.1.1B, 2.3, and 2.4 work with hyperSPARC's upgrade product. Also ported to hyperSPARC are some real-time operating kernels for embedded operation.


SGS-Thomson Transputer

Introduced in 1986, the Transputer was one of the first 32-bit µPs--and one of the last stack-oriented machines. A minimal, microcoded implementation, the Transputer was also the first µP specifically designed for multiprocessing. The Transputer initially had trouble getting a foothold in the market: It was ahead of its time in tackling 32-bit multiprocessing, and it was bundled with Occam, a high-level, parallel-processing language. Since then, the Transputer has moved to standard languages like C and built up a solid applications base, mainly in multiprocessing, that is unmatched by other µPs.

Unlike most µPs, the Transputer is a stack-oriented machine. Instead of using a bank of general-purpose registers to hold local variables and interim processing results, Transputer processing revolves around a three-register stack. A local on-chip SRAM workspace supplements the stack by holding other variables. A workspace register references the workspace.

The Transputer differs from mainstream processors in that it was designed for multiprocessing applications and tasks running concurrently in more than one CPU. Each Transputer has two to four two-wire, serial ports for point-to-point links to other Transputer CPUs or I/O devices. These ports implement a byte-oriented protocol; they ship one serial byte plus a 2-bit header and trailer for each transaction. The second line acknowledges all communications. Each link has a raw bit capacity of 20 Mbps.

The Transputer was designed in conjunction with Occam, a parallel-processing language based on intercommunicating (nonblocked) processes. The CPU hardware schedules and maintains these processes, which communicate via the CPU's channels. The intercommunication between two processes is the same whether both processes run on the same CPU or on different ones. Communication between processes is either built into the high-level languages themselves, like Occam, or provided in libraries for languages like C.

Transputers are microcoded machines designed for high code density. Unlike most 32-bit processors, the basic instruction size is only a byte and is divided into op code and operand. Additional instructions are added using the full eight bits for the instruction op code and extending the instruction to sequential bytes. Also, a prefix byte before the instruction can modify the instruction or build an immediate constant for it. One or more prefix bytes can also build immediate operand constants in the operand. Up to 4 kbytes of on-chip SRAM supplies local fast memory, but the program must manage the memory.
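
The prefix mechanism is easier to follow with a worked example. The Python sketch below assumes the instruction byte splits into a 4-bit function code and a 4-bit operand nibble, and that a prefix function shifts the accumulated operand left four bits so the next byte can add another nibble. The specific codes (`PFIX = 0x2`, `LDC = 0x4`) and the helper names are assumptions drawn from general Transputer documentation, not from this article, and negative operands are deliberately left out.

```python
PFIX = 0x2  # assumed function code for "prefix"
LDC = 0x4   # assumed function code for "load constant"


def encode(function, operand):
    """Encode one instruction, emitting prefix bytes for operands above 15."""
    if operand < 0:
        raise ValueError("negative operands are not modeled in this sketch")
    if operand < 16:
        return [(function << 4) | operand]
    # Emit the upper nibbles through prefix bytes, then the final nibble.
    return encode(PFIX, operand >> 4) + [(function << 4) | (operand & 0xF)]


def decode(code):
    """Decode a byte sequence into (function, operand) pairs."""
    operand = 0
    result = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        operand |= data
        if function == PFIX:
            operand <<= 4          # make room for the next nibble
        else:
            result.append((function, operand))
            operand = 0            # operand register clears after use
    return result


small = encode(LDC, 3)       # fits in one byte
large = encode(LDC, 0x123)   # needs two prefix bytes
```

Under these assumptions a constant below 16 costs a single byte and each additional nibble costs one more, which is where the high code density comes from: small constants and offsets dominate typical programs.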

Later this year, the T9000--a 30-MHz superscalar version of the Transputer--will make its debut. The T9000 keeps the basic Transputer ISA model; T9000 has a five-stage pipeline, four functional units, a process-based memory manager, a 32-word workspace cache for local variables, and a standard 16-kbyte on-chip cache. The functional units are an ALU, two address generators (instruction and data), and a 64-bit FPU. T9000 also adds a programmable external-memory interface with a 64-bit data bus.

The T9000's upgraded serial links change from a byte-oriented protocol to a packet-oriented protocol, which allows dynamic switching and virtual routing. The change essentially provides infinite connectivity. Data packets consist of a header and up to 32 bytes of data; the header provides an address that is used to route the packet to the corresponding Transputer. Links are now four wires, two wires for each direction (data+strobe). Maximum data rate is 20 Mbytes/sec per link.
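
As a minimal sketch of the packet scheme, the Python below splits a message into packets that each carry a routing header and at most 32 bytes of data. Representing a packet as a `(header, payload)` tuple and using an integer as the header are simplifications for illustration; the real link protocol's framing and acknowledgment are not modeled.

```python
MAX_PAYLOAD = 32  # bytes of data per packet, from the text


def packetize(dest, message):
    """Split message (bytes) into (header, payload) packets for dest."""
    return [
        (dest, message[i:i + MAX_PAYLOAD])
        for i in range(0, len(message), MAX_PAYLOAD)
    ]


def reassemble(packets):
    """Concatenate the payloads of packets that arrived in order."""
    return b"".join(payload for _, payload in packets)


message = bytes(range(70))
packets = packetize(dest=5, message=message)  # 32 + 32 + 6 bytes
```

Because every packet names its destination, a routing switch can interleave packets from many logical channels on one physical link, which is what makes virtual routing possible.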

The Transputer's built-in OS kernel schedules and times processes. There are two operation-priority levels, each with its own scheduling timer. Lower-priority processes are scheduled in time slices; higher-priority tasks retain control until they relinquish it or time out. The hardware maintains the tasks as linked lists in memory.

Some CPUs have 2D pixel block moves for graphics applications. The graphics instructions include 2D block copy, 2D block copy nonzero/zero bytes, and zero block.

VARIATIONS/SPECIAL FEATURES
SGS-Thomson developed the Transputer and is its sole supplier.
IMST9000: 50 MHz; 32-bit superscalar Transputer; 16-kbyte cache; pipelined operations with crossbar bus; routed packet links achieve 20-Mbyte/sec data transfer; $450.
IMST805: 20/30 MHz; 32-bit CPU; 64-bit FPU; 4-kbyte SRAM; four links; graphics instruction; DRAM controller; $56.
IMST400: Cost-reduced version; 20 MHz; 32-bit CPU; 2-kbyte SRAM; two links; graphics instruction; DRAM controller; $20.
IMST425: 20/30 MHz; 32-bit CPU; 4-kbyte SRAM; four links; DRAM controller; $33.
IMST225: 20/30 MHz; 16-bit CPU; 4-kbyte SRAM; four links; nonmultiplexed bus: 16-bit address/data; $11.

Support
HARDWARE SGS-Thomson and third-party vendors supply a range of development and evaluation cards for the Transputer. An ICE isn't available, but the Transputer's communication links provide a path to drive and monitor the CPU from an external host. Some CPUs have provisions for debug control. Additionally, many CPUs can boot from a serial link, thus giving an external host control over the CPU for debugging. Support chips include the IMSC011, which converts a serial link to a parallel bus or serial I/O; it links a µP or peripheral to a Transputer serial link. SOFTWARE Several development tool kits are available from both SGS-Thomson and third-party vendors. Languages such as C, C++, and Ada have been grafted onto the Transputer. Parallel-language development tools take advantage of the Transputer's process-oriented structure. These languages build code that can run in multiple CPUs.


Sun Microsystems MicroSPARC I/II

Descending from Berkeley RISC I and II research, the 32-bit SPARC is the leading workstation/server µP. SPARC has a classic RISC architecture: a minimal instruction set, few addressing modes, hard-wired logic, three-address operations, and a load/store architecture. Sun Microsystems and Texas Instruments developed SPARC but opened the architecture to chip and system vendors, hoping to repeat the success of Intel 80x86s in PCs. SPARC International, an industry group that licenses the SPARC architecture, now controls SPARC. Sun's SPARC roadmap details new SPARC implementations until 1998 and ends with UltraSPARC II, which will deliver 1000+ SPECint92. The latest SPARC, microSPARC II, moves up the performance curve from microSPARC I.

Following Berkeley's lead, the SPARC processor is built around a large, multiple-ported register file. The register file breaks down into a small set of global registers for holding global variables and sets of overlapping register windows. Each 24-register window has a core of eight registers supplemented by eight registers that overlap the previous window and eight that overlap the next. The overlapping registers eliminate the need to save and restore registers on function calls, returns, or context switches between tasks.
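
The window overlap can be checked with a little arithmetic. In the Python sketch below, each window contributes 16 registers of its own (eight local plus eight shared with a neighbor), so eight windows plus eight globals give the 136-register files quoted for hyperSPARC and SuperSPARC elsewhere in this directory. The register numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) and the rule that a call moves to the next lower window follow the usual SPARC convention; the physical layout chosen here is just one plausible arrangement, and the function names are hypothetical.

```python
def total_registers(nwindows, nglobals=8):
    """Physical registers: globals plus 16 unique registers per window."""
    return nglobals + 16 * nwindows


def physical_index(cwp, r, nwindows=8):
    """Map architectural register r (0..31) in window cwp to a physical slot.

    Assumed layout: the outs of window w share storage with the ins of
    window w - 1 (the window a call switches to), and the windows wrap
    around circularly.
    """
    if r < 8:
        return r                              # globals are shared by all
    if r < 16:
        offset = cwp * 16 + (r - 8)           # outs
    elif r < 24:
        offset = cwp * 16 + 8 + (r - 16)      # locals
    else:
        offset = (cwp + 1) * 16 + (r - 24)    # ins
    return 8 + offset % (16 * nwindows)


# The caller in window 3 writes its first out register (r8); after a call
# the callee in window 2 reads the same storage as its first in register.
caller_out = physical_index(3, 8)
callee_in = physical_index(2, 24)
```

Passing arguments therefore needs no copying: switching windows renames the caller's out registers as the callee's in registers.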

Initial SPARC implementations have a four-stage pipeline: fetch, decode, execute, and write-back. Later chips, such as Sun's microSPARC, expand to a five-stage pipeline by adding a memory stage to speed loads and stores. Instead of using Sun's earlier workstation technology, SPARC chips now use an on-chip Sun Reference Model MMU.

SPARC International has just released version nine of the SPARC ISA. This version defines a 64-bit architecture with headroom for future versions. Additionally, version nine has full integer multiply and divide instructions as well as data prefetches to minimize cache misses, conditional register move, and branch prediction.

MicroSPARC defines a new direction for SPARC processors: high-integration, low-cost, easy-to-design-in chips. Fitted with at least a 4-kbyte instruction cache and a 2-kbyte data cache, the µP suits low-to-mid-end desktop computing. A separate 64-bit memory interface handles up to 128 Mbytes of 16-Mbit DRAM. The processor also has an on-chip SBus interface and controller that can handle five SBus slots (SBus is a 20-MHz, 32-bit synchronous bus).

Sun also developed MicroSPARC II. Similar to MicroSPARC, MicroSPARC II is a redesign using a basic, five-stage single-issue pipeline. It runs with a 100-MHz clock and delivers 70 SPECint92 and 61 SPECfp92 performance. According to Sun, the design clock rates will move to 125 MHz in the future. To increase performance, the CPU has 16-kbyte instruction and 8-kbyte data caches; loads and stores take only one pipelined cycle. It also has a four-entry write buffer to prevent write stalls. To minimize support chips, the microSPARC has a DRAM controller, an SBus controller, and support for a graphics (AFX) bus. The chip runs at 3.3V; it interfaces to 5.5V chips.

VARIATIONS/SPECIAL FEATURES

Sun Microsystems developed and sells microSPARC I. Sun developed (and is the sole supplier for) microSPARC II. SPARC International licenses SPARC vendors.
STP1010TAB (microSPARC I): 50 MHz, $139.
STP102PGA (microSPARC II): 70/85/100 MHz, $589/$649/$850 (for chip sets).
Weitek sells a clock-doubled SPARC chip, the SPARC Power µP, which is aimed at doubling performance for SPARCstation IIs and IPXs. $12,00 (with installation kit).

Support
HARDWARE ICEs and logic-analyzer pods are available for some SPARC chips. Check with chip vendors for details. SOFTWARE All chips are SPARC ISA compatible and run Solaris operating software from SunSoft as well as several real-time operating systems. You can select from more than 8000 applications (up from 5000 in 1993) for Sun SPARC-based systems, as well as development tools and operating software, including real-time systems and kernels. Windows NT is being ported to SPARC.


Sun SuperSPARC


Sun's SuperSPARC is the first superscalar SPARC implementation; it can issue up to three instructions per clock cycle, moving the selected instructions as a group through the logic. Designed by Sun Microsystems and manufactured by Texas Instruments, the CPU integrates a 20-kbyte instruction cache, a 16-kbyte data cache, an MMU, and on-chip floating-point execution units.

Operations center around the 136-entry, eight-ported register file. Registers are grouped into eight global registers and eight overlapping register windows. The register file handles six reads (three two-operand reads) and two writes (two third operands). Actually, the file can perform two reads and two writes concurrently but is time-shared to handle six reads and two writes in one system-clock cycle.

SuperSPARC doubles the system clock to run its pipeline stages. The eight stages are grouped into four execution stages--fetch, decode, execute, and write-back--of different lengths. The eight stages are cache access; send matched instructions to scheduler; issue instructions; read address registers/evaluate branch-target address; read operands from register file; first and second ALU stages; and write-back result.

The CPU runs eight functional units. These units include three integer ALUs, load/store, branch, floating-point multiply, floating-point add, and shift. The adder units are organized so that two can execute concurrently and return results to the register file or feed into the third ALU. That ALU can then operate on the results and return a value to the register file in one pipeline cycle. Thus, the SuperSPARC can do three adds in one cycle, where one add is dependent on the first two results. The multiply and add floating-point units are pipelined--they can accept a new instruction every clock cycle but have a three-cycle latency. The FPU has its own instruction queue and 16 64-bit registers. It does single- and double-precision IEEE-standard floating-point operations.
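
The difference between latency and throughput in the pipelined floating-point units can be put in numbers. The Python sketch below assumes a stream of independent operations with no stalls, one issued per clock and each taking the three-cycle latency quoted above; the function names are hypothetical.

```python
def pipelined_cycles(n_ops, latency=3):
    """Cycles to finish n independent ops when one can issue every clock."""
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each later
    # operation completes one cycle after its predecessor.
    return latency + (n_ops - 1)


def unpipelined_cycles(n_ops, latency=3):
    """Cycles if each op had to finish before the next could start."""
    return max(n_ops, 0) * latency


overlapped = pipelined_cycles(10)
serialized = unpipelined_cycles(10)
```

With these assumptions ten independent multiplies finish in 12 cycles rather than 30; a chain of dependent operations, by contrast, would see the full three-cycle latency on every step.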

The SuperSPARC has large on-chip caches to boost performance and minimize cache misses. The instruction cache is five-way set associative; the data cache is four-way set-associative. The CPU addresses both caches physically. The cache instruction path is 128 bits (four words wide) to handle superscalar operation. Four instructions are presented simultaneously to the eight-deep prefetch queue. The data cache can operate in either write-back or write-through mode. Access time is 11 nsec, which lets the CPU use a cache hit in the next cycle. A single TLB supports both caches. It has 64 entries and does two TLB evaluations in one clock cycle.

SuperSPARC runs in stand-alone mode by interfacing to the MBus. The processor can run in cache-controller mode by interfacing to an external cache controller via the VBus, a nonmultiplexed, proprietary bus (CPU clock rate, 36-bit address, 64-bit data). The VBus links to a cache controller and up to 2 Mbytes of unified secondary-cache SRAM. The cache controller can handle multiprocessing (more than one SPARC CPU on an MBus).

Branch-delay slots and a branch-target queue minimize branch penalties. A branch-delay slot following the current set of instructions gives the hardware time to prefetch both the target set and the next sequential set of instructions.

VARIATIONS/SPECIAL FEATURES
Sun Microsystems developed and sells the SuperSPARC. Sun will announce a 90-MHz device later this year.
STP1020PGA: SuperSPARC--Superscalar SPARC µP; 50/60 MHz. $529/$849.
STP1090PGA: Cache controller to link module CPU via VBus to MBus; handles up to 1 Mbyte of cache; 50/60 MHz. $475/$590.
STP5010MBus: MBus module with SuperSPARC 50-MHz CPU; cache controller; no cache. $1069.
STP5011MBus: MBus module with SuperSPARC 60-MHz CPU; cache controller; 1-Mbyte external cache. $2269.

Support
HARDWARE Using the JTAG port, you can set breakpoints and single-step execution and monitor or change memory or register data. A breakpoint register matches on a 32- or 36-bit code or data address. Address bits can be masked for larger address ranges. Two 16-bit counters handle instruction and cycle counts. A software instruction lets SuperSPARC enter emulation mode. Special pins detail pipeline operation. Logic analyzers can use the strobe pin. SOFTWARE SuperSPARC is compatible with existing Sun operating software, SPARC development tools, and a number of real-time operating systems. TI supplies a SuperSPARC simulator that simulates instruction execution as well as the effects of cache, MMU, and store buffers. A Verilog HDL model of the chip and cache controller is available, as is the SuperSPARC Scantool for controlling board- and system-level test via the JTAG port. Sun provides a SPARC Builder catalog to developers.



Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.