September 25, 1997 32-BIT
The ARM cores implement a load/store architecture and have 31 general-purpose registers, with 16 simultaneously visible. The fast-interrupt mode has a minimum latency of four processor cycles and uses seven private registers to minimize state-saving overhead. All registers are general-purpose, including the program counter, although a set of conventions, the ARM Procedure Call Standard, governs the registers' use for C compatibility. The bus clock for most ARM µPs can be synchronous or asynchronous with respect to the internal cache clock. All ARM µPs contain a write buffer, which lets execution continue while writes are pending. The buffer holds 8 words at four independent addresses. ARM µPs also incorporate an optional coprocessor interface. The ARM µPs support user and supervisor modes for controlling access; they handle interrupt-request, fast-interrupt-request, abort, and undefined exception-processing modes. Modes use register windows to overlay some of the 16 general-purpose registers. The main architectural variations in the ARM processor family are ARM6/7, ARM8, and StrongARM1, which Digital Semiconductor primarily designed. The ARM6 and ARM7 µP cores have a three-stage, fetch, decode, and execute pipeline to achieve single-cycle instruction execution. Both cores use a Booth hardware multiplier that simultaneously operates on 2 operand bits to build a final product. A 32-bit multiply takes as many as 17 cycles, though smaller numbers terminate early. The ARM 7M variant contains a faster 8-bit Booth multiplier, which executes in five cycles or less for 32×32-bit multiply and offers 64-bit multiplication. The Thumb architectural extension is primarily a 16-bit subset of the 32-bit instruction set. On execution, the Thumb module, residing within the instruction pipeline, decompresses the 16-bit instructions back to 32-bit instructions without added delay. 
The Thumb module adds about 6% to the core's die size but helps increase code density and overcome the waste from using 32-bit fixed-length instructions. The first Thumb-aware processor macrocell, the ARM7TDMI, is an ARM7 processor with a Thumb decompressor, a 64-bit DSP multiplier, and an embedded in-circuit-emulator macrocell. The new ARM8 and StrongARM cores are implementations of the ARM Version 4 architecture. The ARM8 doubles the performance of the ARM7, whereas StrongARM improves performance fourfold over the ARM7. The ARM8 and StrongARM are similar in that they both implement a five-stage, fetch, decode, ALU, cache, and write-back pipeline. However, the cores differ in their bus architectures; StrongARM uses a Harvard approach, and ARM8 uses a von Neumann architecture to save die area. Both cores try to avoid excess pipeline flushes--StrongARM by using early branch execution and ARM8 by using static branch prediction, always taking the backward branch, as in a loop. The first silicon implementation of StrongARM is the SA-110. The µP has separate instruction and data memory-management units. The translation-look-aside buffers (TLBs) have 32 entries that can each map a segment, large page, or small page and use a round-robin replacement algorithm. The data TLB supports both the flush-all and the flush-single-entry function, and the instruction TLB supports only the flush-all function. Special instructions: ARM has 11 basic types of fixed-length instructions; all instructions, not just branches, execute conditionally, which reduces the need for short pipeline-flushing branches. A not-taken instruction executes in one cycle. Taken branches incur a three-cycle delay. The 16 execution-condition codes include equal, not equal, always, negative, and overflow. The ARM lacks explicit shift instructions; instead, all ALU operations can perform an optional shift operation in one execution cycle.
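The round-robin replacement policy described above for the SA-110's TLBs can be sketched in a few lines. This is a minimal model under stated assumptions: the class and method names are hypothetical, and the sketch ignores page sizes, access permissions, and duplicate entries. It shows only that a round-robin pointer evicts entries in insertion order, plus the flush-all and flush-single-entry functions that the text mentions.

```python
class RoundRobinTLB:
    """Toy TLB with round-robin replacement (names are illustrative only)."""

    def __init__(self, entries=32):
        # Each slot holds a (virtual_page, physical_page) pair or None.
        self.slots = [None] * entries
        self.next_victim = 0  # index of the slot the next insert overwrites

    def lookup(self, virtual_page):
        """Return the mapped physical page, or None on a TLB miss."""
        for slot in self.slots:
            if slot is not None and slot[0] == virtual_page:
                return slot[1]
        return None

    def insert(self, virtual_page, physical_page):
        """Overwrite the current victim slot, then advance the pointer."""
        self.slots[self.next_victim] = (virtual_page, physical_page)
        self.next_victim = (self.next_victim + 1) % len(self.slots)

    def flush_all(self):
        """Invalidate every entry."""
        self.slots = [None] * len(self.slots)
        self.next_victim = 0

    def flush_entry(self, virtual_page):
        """Invalidate only the entries that map the given virtual page."""
        self.slots = [
            None if slot is not None and slot[0] == virtual_page else slot
            for slot in self.slots
        ]
```

With a two-entry instance, inserting a third mapping evicts the first one inserted, regardless of how recently it was used. That simplicity is the usual argument for round-robin over least-recently-used replacement in a small hardware table.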
The processors have block-data-transfer instructions to load and store data from any subset of the 16 general-purpose registers. ARM processors lack an integer-divide instruction; support software provides fast divide and divide-by-10. However, the chips do have multiply and multiply-and-accumulate (MAC) instructions. The MAC instruction speeds math-intensive applications. The processor can quickly perform division and multiplication by a constant using the barrel shifter. (For example, division by 4 and multiplication by 5 each take one cycle.) Special on-chip peripherals: ARM has developed a DSP coprocessor module, Piccolo. Piccolo adds a 32-bit DSP instruction set and shares the host µP's memory bus. Piccolo's interface to the µP includes a tagged input-queue structure and an output FIFO buffer. The coprocessor also includes its own instruction cache; four nestable, zero-overhead, looping constructs; a 16×16-bit, single-cycle multiplier; a 32-bit barrel shifter; four 48-bit, extended-precision accumulators; and register-based storage for 64 16-bit data items. Piccolo also has a split ALU that provides single-cycle, dual 16-bit arithmetic and logical operations in 1 instruction word. For more details, see EDN's 1997 DSP-architecture directory (May 8, 1997, pg 42). Development tools: See Web site at www.ednmag.com.
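The constant multiplication and division mentioned above for the ARM barrel shifter reduce to a shift, or a shift combined with an add. The sketch below shows the arithmetic in Python; the function names are illustrative, and Python integers are unbounded, so the 32-bit wraparound of a real register is not modeled.

```python
def times_5(x):
    """Multiply by 5 as x + (x << 2), i.e. one add with a shifted operand."""
    return x + (x << 2)


def divide_by_4(x):
    """Divide by 4 as an arithmetic shift right by two bit positions."""
    return x >> 2
```

For example, `times_5(7)` gives 35 and `divide_by_4(20)` gives 5. Note that a right shift rounds toward negative infinity for negative values, so `divide_by_4(-7)` gives -2 rather than -1.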
Fujitsu engineers extended the SPARC pipeline for the SPARClite family to fetch, decode, execute, memory, and write-back stages. The memory stage minimizes the effects of load/store operations and reduces a load/store to one-cycle execution. The stage is idle for nonload/store operations. All SPARClite µPs have separate data and instruction caches. (The 86933H has an instruction cache only.) The caches are two-way set-associative and have 16- or 32-byte cache lines. You can lock critical cache lines on chip so that they are not swapped out. The µPs also incorporate a debugging-support unit and an emulator bus, which makes instruction streams visible even in on-chip cache. Debugging registers hold data values or addresses for individual and range breakpoints. The SPARClite processors run with DRAM, SDRAM, SRAM, and ROM/EPROM. The memory interface handles page-mode DRAM for low-cost, high-speed access using a 32-byte burst mode. The memory interface includes a refresh generator for DRAMs, programmable wait states for slower memory, and programmable chip selects for memory banking. Boot-up memory interfaces are programmable; most SPARClite CPUs can boot from 8-, 16-, or 32-bit ROM/EPROM. Power management: A power-management register controls shutdown of the FPU. Special instructions: The SPARClite implements the SPARC V8 specification, which includes a multiply instruction and division via software using divide by 4. Other special instructions include scan (which searches a word for the first changed bit or the first one or zero), load/store double word, save/restore caller (uses register windows), tagged add/sub (generates overflow if bits 0 and 1, the two least significant bits, are not 0), atomic math and swap, and generate trap from conditions. Special on-chip peripherals: SPARClite processors come with a 24-bit timer with an 8-bit prescaler and a 16-bit counter. You can program this counter to operate in periodic-interrupt, time-out-interrupt, or square-wave-generator mode.
The SPARClite DMA controller supports both "fly-by" and "flow-through" modes. The µP's debug-and-support unit (DSU) comprises two 4-bit emulator buses for data and status and two control signals that enable and set the breakpoint of an in-circuit emulator for hardware debugging and software development. The SPARClite's DSU has six breakpoint-descriptor registers and supports five hardware-monitoring debugging modes. The SPARClite MB86936 has a video interface with an 8-word-deep, 32-bit-wide FIFO buffer and a 1-word-deep, 32-bit-wide holding buffer. LineWidth, BlockHeight, TopMargin, and LeftMargin registers are all 16 bits. Development tools: SPARClite shares many of the development tools that support the SPARC architecture, including compilers and debuggers. Fujitsu supplies evaluation boards and monitors. Fujitsu works with Wind River Systems (Alameda, CA), Chorus Systems (Campbell, CA), Accelerated Technology (Mobile, AL), Microtec (San Jose, CA), JMI (Spring House, PA), and Lynx (San Jose, CA) for RTOS support. These vendors also supply system calls and library routines, many device drivers, and network protocols. Cygnus (Mountain View, CA), Wind River, and Green Hills (Santa Barbara, CA) development environments also support SPARClite. Step Engineering (Sunnyvale, CA), Yokogawa (Tokyo), and Orion (Sunnyvale, CA) in-circuit emulators support SPARClite-based system development. US Software (Portland, OR) and Log Point (Mountain View, CA) offer floating-point libraries for SPARClite. Second sources: There are no second sources for SPARClite.
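The tagged add instruction listed under special instructions can be illustrated with a short model. This sketch assumes the SPARC V8 behavior in which the two low-order bits of each operand act as a type tag, and the overflow condition is raised if either tag is nonzero or if the 32-bit signed addition itself overflows. The function name and the tuple return value are illustrative choices, not part of any Fujitsu tool.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


def tagged_add(a, b):
    """Return (32-bit sum, overflow flag) for a tagged addition.

    Assumption: bits 0 and 1 of each operand are tag bits that must be zero.
    """
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    # Tag check: any nonzero low-order tag bit in either operand.
    tag_mismatch = ((a | b) & 0b11) != 0
    # Signed overflow: both operands differ in sign from the result.
    arithmetic_overflow = ((a ^ result) & (b ^ result) & SIGN32) != 0
    return result, tag_mismatch or arithmetic_overflow
```

Adding 4 and 8 yields `(12, False)`, whereas adding 4 and 9 yields `(13, True)` because the second operand carries a nonzero tag. Language run-time systems that store small integers with a zero tag can use this check to detect a non-integer operand without a separate test.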
Although the SH-1, 2, 3, and 4 have a similar core, significant differences exist. The major differences between SH-1 and SH-2 are that the SH-2 features on-chip cache memory, higher speeds, and a 32×32-bit multiply-accumulate (MAC) unit. (SH-1's MAC unit is 16×16 bits.) To build the SH-3, Hitachi added to the SH-2 a memory-management unit (MMU), a barrel shifter, and the ability for conditional-branch instructions to enable or disable the pipeline's delay slot. Disabling the delay slot, although decreasing performance, allows the processor to run more deterministically and reduces the effects of pipeline flushes. The 167-MHz, two-way-superscalar SH-4 µP includes a 3-D graphics accelerator that Hitachi claims can perform at 1.2 Gflops. This µP has four 32×32-bit multipliers fed by two 128-bit buses; it also has four adders. You can load the multipliers with eight operands in one cycle; the µP then adds the results in the next cycle. This hardware performs rotations and transformations on 32-bit, single-precision, floating-point vectors. SuperH processors use a 16-bit instruction word to achieve compact code. The instruction width limits the number of basic operation codes, allows only 16 general registers, and addresses only two operands. Additionally, only 12 bits are available for an immediate offset; jumps with immediate data must be in 2048-byte hops. However, the SH-3 supports FAR-relative branches to support position-independent code. Although these restrictions lead to more instructions per task, the overall result is significantly smaller code. The SH-1 µPs can run from external memory or from on-chip program memory. The 16-bit-wide external-memory bus can supply the CPU with instructions from SRAM or fast DRAM on each cycle. If the processor is operating from external memory, each data access to external memory may take an additional one to two cycles.
Instead of on-chip program memory, the SH-2 (SH7604) and SH-3 (SH7702 and SH7708) have a four-way, set-associative on-chip cache (4 kbytes for the SH-2 and 8 kbytes for the SH-3), a 32-bit-wide memory bus for high CPU memory bandwidth (16 bits for the SH7702), and a 32-bit divide unit (replacing the first chip's bit-step-divide function) on the SH-2. You can reconfigure the cache as a two-way, set-associative cache and 2 kbytes (SH7604) or 4 kbytes (SH7708) of user-configurable RAM. The external-memory bus supports multiprocessing; it has bus arbitration for multiple masters. The SH-3 also has a unique RTOS feature: If a task or thread crashes, the operating system can gracefully recover and not have the errant task corrupt other tasks or RTOS environments. Power management: Sleep mode discontinues CPU processing but keeps peripherals active. Standby stops everything but maintains register and cache contents. The SH-2 and 3 provide several clock modes for reducing power; software can adjust the clock rate during program operation. The SH-3's unified cache has a special low-power design that dissipates only 100 mW in operation. The cache sense amps are energized for the cache set that hits while the other three sets stay switched off. The sense amps respond to only a 60-mV differential vs the full 3.3V swing. Special instructions: A 16×16-bit MAC instruction (42-bit accumulator) in the SH-1 and a 32×32-bit MAC instruction (64-bit accumulator) in the SH-2 and SH-3 provide a fast DSP function. Although Hitachi classifies the architecture as load/store, some instructions reference memory. Delayed branch instructions minimize pipeline disruption. An instruction swaps upper and lower bytes. The SH-4 includes a set of 3-D, floating-point instructions. The SH-DSP supports 23 32-bit DSP instructions. Special on-chip peripherals: A version of the SH-2, the SH-DSP, contains a DSP as an "on-chip peripheral." 
This DSP unit shares the five-stage pipeline with the integer unit; the DSP is not a coprocessor. The CPU contains a fetch-and-decode unit, which manages the instruction stream for both the integer and DSP units, routing instructions to the appropriate unit (see EDN's DSP-architecture directory, May 8, 1997, pg 42). Other, more conventional peripherals include memory controllers, complex multifunction timers, and an LCD controller. The SH-3 contains an MMU with a 128-entry translation-look-aside buffer (TLB). The TLB caches virtual-to-physical-address translations from user-created page tables in external memory, providing both data protection and virtual memory. Address translation employs a paging system that supports 1- or 4-kbyte pages. The MMU also handles multitasking by providing multiple virtual-memory modes. Thus, each process has its own virtual memory and cannot access the resources of another process or the OS kernel. Development tools: Hitachi and a number of third-party vendors offer development-tool support for the SuperH. (See Web site at www.ednmag.com for more details.) Second sources: VLSI has licensed ASIC-core versions of the SH7000, SH7700, and SH-4.
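The paged address translation described above for the SH-3's MMU, with 1- or 4-kbyte pages, comes down to splitting a virtual address into a page number and an offset. The sketch below is a simplified model: the dictionary standing in for the page table and the function name are assumptions for illustration, and protection bits and address-space identifiers are omitted.

```python
def translate(virtual_address, page_table, page_size=4096):
    """Map a virtual address to a physical address through a page table.

    page_table is a hypothetical dict of virtual page number -> physical
    page number. A missing entry raises KeyError, standing in for a fault.
    """
    virtual_page, offset = divmod(virtual_address, page_size)
    physical_page = page_table[virtual_page]
    return physical_page * page_size + offset
```

With 4-kbyte pages, address 0x1234 lies in virtual page 1 at offset 0x234, so a table that maps page 1 to page 7 yields 0x7234. With 1-kbyte pages, the same address lies in virtual page 4 at the same offset. Smaller pages waste less memory per mapping but need more TLB entries to cover the same range.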
Instructions are 16, 32, or 48 bits wide. The variable-length instructions, which the E1-32 automatically prefetches, provide constants and native addresses as large as 32 bits. The 4-Gbyte address space divides into four blocks; you can configure each block individually for bus width and timing. The E1-32 integrates a fast-page-mode DRAM controller in one of the block spaces. You can use the other blocks for glueless connection of SRAM, EPROM, or other memory devices, each with its own timing and bus width. A separate I/O-address space also allows each I/O device to have its own timing. The integrated DSP unit, working in parallel with the ALU, can perform DSP calculations while the ALU is performing loop counts, address calculations, or load-and-store operations. The ALU executes its instructions during the latency cycles of DSP instructions. The DSP unit shares all the E1-32's functional blocks, including the register set; however, it provides dedicated result registers and 32- and 64-bit hardware accumulators. The DSP unit supports 16- and 32-bit data types. Power management: In automatic power-down mode only the interrupt-logic, clock, and DRAM-refresh logic remain active. Sleep mode also disables DRAM refresh. Special instructions: DSP instructions include multiply, complex and real multiply-accumulate, multiply-subtract, and complex addition/subtraction. Other special instructions include test-leading zeros. Special on-chip peripherals: hyperstone's E1-32 contains a DRAM controller that allows you to program page size, refresh rate, timing, and access parameters with an internal-memory register. The controller supports fast-page-mode and extended-data-out DRAMs. The µP also contains a memory, I/O and peripheral-interface controller. You can use this controller to set the width and timing of the µP's address areas. 
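The complex multiply-accumulate listed among the E1-32's DSP instructions can be written out as four real multiplications and a few additions. The sketch below shows only the arithmetic; the function name and the tuple representation of complex values are illustrative assumptions, and the fixed-point scaling and accumulator widths of the real hardware are not modeled.

```python
def complex_mac(acc, a, b):
    """Return acc + a * b, with each value given as a (real, imag) pair."""
    acc_re, acc_im = acc
    a_re, a_im = a
    b_re, b_im = b
    # (a_re + j*a_im) * (b_re + j*b_im) expanded into real and imaginary parts.
    return (
        acc_re + a_re * b_re - a_im * b_im,
        acc_im + a_re * b_im + a_im * b_re,
    )
```

Starting from a zero accumulator, `complex_mac((0, 0), (1, 2), (3, 4))` returns `(-5, 10)`. Repeating the call over pairs of samples and coefficients is the inner loop of a complex FIR filter or an FFT butterfly.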
Development tools: hyperstone offers a development starter kit, a PC-based development board, and the hyICE serial connector for stand-alone operation. The company also provides an ANSI C compiler and DSP library, a source- and task-level debugger, a multitasking real-time kernel, an assembler, a linker, a library manager, and a profiler. Eonic Systems (Silver Spring, MD) and Etnoteam provide RTOS support. Visual Tools (Spain) offers a JPEG embedded-image-compression/decompression library for the hyperstone E1-32 µP. The library supports user-defined subsampling for image quality and compression to the desired size. Second sources: LG Semicon is a licensee.
Hardware-descriptor registers hold segment-access rights and segment-base address and size limits. In protected-mode addressing, a 16-bit selector points to a segment descriptor and furnishes a base address. The base address adds to the 32-bit effective address, producing a 32-bit linear address, which the 80386 then uses as a physical or linear-page address. The 386 has four code/data breakpoint registers and two control registers for debugging. You can set the breakpoint registers with addresses for halting execution on a program or data access. Power management: System-management mode (SMM), a power-management mechanism, enables code to control CPU power without rewriting or revamping operating software. The CPU enters SMM via a hardware interrupt, system-management interrupt (SMI); the SMI code can set SMMs to reduce chip power dissipation. Integrated versions of the 386, including Intel's 386EX, have idle and power-down modes: Idle discontinues CPU processing but keeps peripherals active, and power-down shuts down the entire chip. AMD's 386SC300 chip has low-speed mode, during which the CPU goes to 0.5 MHz; doze, which stops the CPU, system, and DMA clocks; sleep, which stops additional clocks and peripherals; and suspend, which stops everything except RTC and memory. Special instructions: The 386 instruction set is a superset of the 8086/186. To support SMM, the 386 has seven additional instructions, such as RSM (resume), which causes the processor to resume from SMM mode. Special on-chip peripherals: Intel's 386EX peripherals include a serial I/O unit, a chip select, a clock generator, a DMA- and bus-arbitrator unit, a DRAM-refresh-control unit, an interrupt-control unit, a memory-management unit, and a parallel-I/O unit. AMD's ElanSC300 combines an Am386 CPU with a PC/AT chip set and essential embedded-PC peripherals. 
The ElanSC300 also includes mobile-computing peripherals, such as PLL clock generators, PCMCIA-card support, LCD-graphics control, a memory controller, DMA and interrupt controllers, a real-time clock, a serial port, and a parallel port. Development tools: Numerous third-party vendors support the 386 architecture. They provide tools that include assemblers, compilers, linkers/locators, remote software debuggers, software simulators, and integrated design environments for software development. In addition, several vendors provide utilities, such as flash-programming, device-driver, and flash-translation-layer implementations. Hardware tools include in-circuit emulators, logic analyzers, evaluation platforms, and single-board computers. Operating-system support includes DOS and windowed OSs and a variety of real-time OSs, from small, royalty-free microkernels to feature-rich graphical-user-interface RTOSs. RTOS vendors include Integrated Systems (Sunnyvale, CA), Microtec (San Jose, CA), Microware (Des Moines, IA), QNX (Kanata, ON, Canada), and Wind River (Alameda, CA). Several BIOS companies, such as Phoenix Technologies (Norwood, MA), provide embedded-specific BIOS implementations, including OEM adaptation products for customizing a BIOS to hardware.
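The protected-mode address formation described earlier in this section, in which a selector picks a descriptor and the descriptor's base adds to the effective address, can be sketched as follows. This is a deliberately reduced model: the descriptor table is a hypothetical dict of index to (base, limit) pairs, the selector's low three bits (table indicator and privilege level) are simply discarded, and access rights and paging are ignored.

```python
def linear_address(selector, effective_address, descriptor_table):
    """Form a 32-bit linear address from a selector and an effective address.

    descriptor_table is an illustrative dict: index -> (base, limit).
    """
    index = selector >> 3  # the upper 13 bits of the selector index the table
    base, limit = descriptor_table[index]
    if effective_address > limit:
        raise ValueError("offset exceeds segment limit")
    return (base + effective_address) & 0xFFFFFFFF
```

For a descriptor with base 0x10000 and limit 0xFFFF at index 1, selector 0x08 and effective address 0x1234 give linear address 0x11234. An offset beyond the limit raises an error in this model, which stands in for the protection fault the hardware would generate.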
The 486 chips use 1- to 15-byte-long instructions for complex operations. The two decoder stages give the hardware time to delineate and decode the instructions waiting in the instruction queue. The instruction or byte-code queue holds 32 bytes for decoding. By fetching 4 words at a time from off-chip or local memory, the hardware minimizes contention between data and instruction accesses of the cache. To speed processing, the hardware loads and writes cache lines in 4-word bursts. The DX4 has a unified cache that is four-way set-associative and implements a write-through policy: Writes to cache pass through to memory, which raises memory bandwidth. The 486's bus and cache implement a bus-snooping protocol for multiprocessor operation. The bus is more efficient than that of the 386 and has a two-clock single read or write; 4-word read bursts take five cycles and constitute most 486 bus accesses. The processors also support secondary Level 2 cache for both single-processor and multiprocessor operation, as well as write-through/write-back protocols. The 486 has four code/data breakpoint registers and two control registers for debugging. You can set the breakpoint registers with addresses for halting execution on a program or data access. National Semiconductor offers a family of integrated 486 products, including the NS486SXF and the NS486SXL, with a three-stage pipeline and a 16-byte prefetch queue. To reduce the core size further, National removes the 486's FPU, virtual-memory support, and real-mode functionality (precluding DOS support). National improves core performance by creating a memory path that provides single-cycle access to DRAMs on page hits. National also implements single-cycle push and pop instructions vs four to six such instructions in the standard 486. National uses this approach to improve performance of stack-oriented languages, such as C and Java. 
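The bus figures quoted above, two clocks for a single read and five clocks for a 4-word burst, make it easy to see why bursts dominate 486 bus traffic. The sketch below works through that arithmetic. It assumes zero wait states and treats any words left over after whole bursts as single reads; both are simplifications for illustration, and the names are hypothetical.

```python
SINGLE_READ_CLOCKS = 2  # two-clock single read, from the text
BURST_READ_CLOCKS = 5   # 4-word burst read, from the text
WORDS_PER_BURST = 4


def read_clocks(n_words, use_burst=True):
    """Estimate bus clocks needed to read n_words under the assumptions above."""
    if not use_burst:
        return n_words * SINGLE_READ_CLOCKS
    bursts, leftover = divmod(n_words, WORDS_PER_BURST)
    return bursts * BURST_READ_CLOCKS + leftover * SINGLE_READ_CLOCKS
```

Reading one 4-word cache line costs 5 clocks as a burst versus 8 clocks as four single reads, a saving of three clocks per line fill.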
Power management: The standard 486 employs system-management mode (SMM) for power management, which enables code to control CPU power without rewriting or revamping operating software. The CPU enters SMM via a hardware interrupt, system-management interrupt (SMI); the SMI code can set SMMs to reduce chip power dissipation. A halt instruction powers down most of the CPU's logic. National's version, although it lacks support for SMM, has several other power-saving features: A power-saving mode divides the CPU clock and can disable individual peripherals, and an idle mode disables the CPU clock without affecting peripherals. Special instructions: See Web site at www.ednmag.com for details. Special on-chip peripherals: AMD's ElanSC400 microcontroller combines an Am486 CPU with a PC/AT chip set and essential embedded-PC peripherals. The ElanSC400 also includes mobile-computing peripherals, such as PLL clock generators, PCMCIA-card support, LCD-graphics control, a memory controller, DMA and interrupt controllers, a real-time clock, a serial port, and a parallel port. National's NS486SXX contains peripherals, including a DRAM controller, an ISA bus, PCMCIA, LCD, DMA, UART, IrDA, Microwire, I2C, two general-purpose timers, one watchdog timer, two interrupt controllers (15 interrupts plus cascade), and a real-time clock. Development tools: Most of the tool support for the 486 is the same as for the 386. AMD, Intel, and National offer evaluation kits for each of their 486 processors. Second sources: There are no pin-compatible second sources for the 80486. AMD, IBM Microelectronics, Intel, National Semiconductor, and SGS-Thomson act as second sources for some implementations.
The i960CA provides superscalar operation and five pipeline stages. The key to the Cx is its four-instruction-wide instruction decoder, which decodes as many as four instructions per cycle. Current implementations dispatch as many as three of these instructions for execution. The i960CF has 128-bit-wide buses to move instructions to the decoder and 128-bit-wide buses to move data between the cache and registers. Intel built the superscalar i960s around a six-port register file with register or memory-control execution units. These units include an integer unit, an FPU, and an interrupt-control unit on the register side and address-generation and bus-controller units on the memory side. The i960s cache instructions in a lockable cache; later versions add an instruction cache to supplement the register cache. Intel based the i960RP I/O processor on the i960 Jx series processor core. The chip targets server-motherboard and adapter-card applications, in which it creates an "intelligent" I/O subsystem. Intel and others have developed an intelligent I/O (I2O) specification to speed I/O processing and simplify driver development. Special instructions: The i960 has uninterruptible atomic add and modify instructions. Other instructions flush local registers and provide cache-locking control. Development tools: Several third-party vendors offer a range of compilers, emulators, evaluation boards, debugging monitors, and real-time operating systems for the i960 family. Second sources: There are no second sources for the i960 family.
Pentium achieves a two-instruction issue peak and has two five-stage pipelines (U and V) for each instruction. On the original Pentiums, such as the P54C, these pipelines are asymmetrical; the U pipe takes precedence over the V pipe. If the second instruction causes no interlocks (using results from the first instruction to write into the same register/data), then the Pentium schedules the second instruction for the V pipe. When Intel redesigned the Pentium, which became the P55C, the company more closely equalized the functionality of the U and V pipes to increase the CPU's efficiency. The U and V pipelines feed from a common instruction fetch/align stage that fetches multiple instructions from the cache. The CPU passes a full 256-bit line to the instruction decoder. Each pipeline has two decoder stages to decode simple and complex instructions. The wide cache-to-decoder path with a two-stage decode enables Pentium to decode the x86's variable-length instructions. The P55C also includes 57 new instructions to support multimedia applications, such as image processing and audio synthesis. More fundamentally, these multimedia-extension (MMX) instructions benefit applications with vectorizable code. To accommodate the new instructions and data types in the x86 architecture, Intel defines eight 64-bit MMX registers, MM0 to MM7. Intel designers obtained these registers by aliasing them with the floating-point registers. Register aliasing eliminates additional silicon for new registers. It also eliminates the need to modify the operating system or system BIOS, which must track these registers. However, aliasing inhibits you from performing routines that combine floating-point and MMX instructions; switching from MMX instructions to floating-point instructions can take as many as 50 clock cycles. Before the CPU can execute a floating-point instruction, you must use the empty-MMX-state instruction to set up the floating-point registers. 
For superscalar, dual-instruction, load/store operations, the dual-ported Pentium data translation-look-aside buffer (TLB) and cache tags provide concurrent pipeline accesses. The eight-way-interleaved data-cache SRAM allows concurrent accesses to memory banks. (The cache is actually triple-ported with an extra port for snooping.) Cache hit rates range from 90 to 97%, depending on the code mix. The data cache handles both 4-kbyte and 4-Mbyte pages. It has two four-way, set-associative TLBs, one with 64 entries for 4-kbyte pages and one with eight entries for 4-Mbyte pages. The two-way set-associative code cache has a four-way set-associative, 32-entry TLB that handles both 4-kbyte and 4-Mbyte pages. Dynamic branch prediction allows the CPU to determine which branch to take. Pentium's 256-entry branch-target buffer (BTB) holds branch-target addresses for previously executed branches. The BTB supplies the next instruction address that the last execution of a branch instruction took. Each BTB entry integrates the target address with history and operation bits. Intel claims that a correctly predicted branch takes one pipeline cycle and doesn't cause a pipeline bubble. Pentium's FPU features an eight-stage pipeline, which shares the first five stages of the U and V pipelines. Data transfers to or from the FPU use a 64-bit-wide datapath to the data cache. Pentium adds a write buffer to each pipeline to avoid write contention. Pentium uses burst reads to fill its 256-bit-wide cache line. It also has burst write-back writes. The pipelined memory interface allows a second bus cycle to set up while the first bus cycle completes. Pentium reads or writes a 64-bit double word each cycle in burst mode. AMD's K5, although software- and pin-compatible with Intel's Pentium, is a unique 586-class CPU. 
Although it issues as many as four instructions per cycle, this µP marks the boundaries of the x86 instructions, enabling multiple x86 instructions to be aligned and assigned issue positions for efficient instruction processing. The K5 tags every byte of code entering the processor's instruction cache with bits of associated predecode information to determine how to break the x86 instruction into RISC-like operations. The K5 also has a dual-ported data cache; the CPU accesses two cache lines simultaneously. The K5 can execute out-of-order instructions and has extra registers for register renaming. Cyrix's 6x86 is software- and pin-compatible with Pentium. This CPU performs register renaming, multibranch prediction, speculative execution, and out-of-order execution. Cyrix's Media GX processor performs all standard north-bridge functions of a PC's core logic. It also performs the functions of the PC's graphics controller, audio chip set, memory controller, and CPU-to-PCI bridge. Rather than using only transistors to perform these functions, Cyrix developed its Virtual System Architecture (VSA). VSA supports the graphics- and audio-hardware functions through software. VSA uses the Media GX's system-management interrupt to capture any accesses to the memory- or I/O-address ranges of the graphics and audio functions. Once the processor enters system-management mode, it executes Cyrix-supplied drivers to perform the appropriate function. Special instructions: See the EDN Web site, www.ednmag.com, for details.
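The branch-target buffer described earlier for the Pentium, with 256 entries that pair a target address with history bits, can be modeled in simplified form. The sketch below is not Intel's design: the direct-mapped indexing, the 2-bit saturating counter, and all names are assumptions chosen to keep the example short. It shows only the general idea of predicting a target from the outcome of earlier executions of the same branch.

```python
class BranchTargetBuffer:
    """Toy BTB: direct-mapped, 2-bit saturating history (assumed, not Intel's)."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}  # index -> (branch_address, target, counter)

    def predict(self, branch_address):
        """Return the predicted target, or None to predict fall-through."""
        entry = self.table.get(branch_address % self.entries)
        if entry is not None and entry[0] == branch_address and entry[2] >= 2:
            return entry[1]
        return None

    def update(self, branch_address, taken, target):
        """Record the actual outcome of a branch once it resolves."""
        index = branch_address % self.entries
        entry = self.table.get(index)
        if entry is None or entry[0] != branch_address:
            # Allocate an entry only when a branch is first seen taken.
            if taken:
                self.table[index] = (branch_address, target, 2)
            return
        _, old_target, counter = entry
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        new_target = target if taken else old_target
        self.table[index] = (branch_address, new_target, counter)
```

A branch that has never been seen predicts fall-through. After it is taken once, the buffer supplies its target, and a single not-taken outcome is enough in this model to drop the prediction back to fall-through until the branch is taken again.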
MIPS engineers use an instruction-fetch, read-operand and decode-instruction, execute, access-data-memory, and write-back-results pipeline for the R3000. The pipeline lets as many as five instructions execute concurrently--each at a different stage of its instruction cycle, thus giving the effect of single-cycle execution. A branch-delay slot minimizes branch effects. The compiler fills the instruction slot, following the branch with a no-operation instruction or an instruction from the current thread that can execute before the branch takes effect. Toshiba's R3900, an R3000 derivative, incorporates register "scoreboarding" to enable nonblocking loads and avoid pipeline stalls when there are no data dependencies in subsequent instructions. The R3000 memory-management unit (MMU) includes a fully associative, 64-entry translation-look-aside buffer that translates virtual addresses to 32-bit physical addresses. The µP uses a write-through cache policy. A small on-chip FIFO buffer enables the CPU to perform instruction "streaming"--refilling the cache and executing instructions even while reading additional instructions from memory. Special instructions: The R3000 uses the MIPS-I instruction set. Several of the MIPS derivatives add a multiply-accumulate (MAC) instruction. Toshiba's TX19 is the first R3xx implementation to use the MIPS16 instruction extensions. (See R4000 for more details.) Special on-chip peripherals: Philips offers the TwoChipPIC, which combines the UCB1200 that interfaces with the company's PR31700 MIPS µP. The TwoChipPIC provides a microsystem on a chip for handheld devices. Integrated modules include a MAC unit, an LCD controller, an infrared controller, PCMCIA-card support, touchscreen control, and audio in/out. Development tools: A range of third-party development tools is available for the MIPS RISC architecture. 
Wind River (Alameda, CA), Integrated Systems (Sunnyvale, CA), Accelerated Technologies (Mobile, AL), Green Hills Software (Santa Barbara, CA), and Microsoft (Redmond, WA) provide embedded operating-system support. Cygnus (Mountain View, CA), Microsoft, Green Hills Software, Metrowerks (Austin, TX), Tasking (Dedham, MA), and Wind River offer development tool chains and compilers. Hewlett-Packard (Colorado Springs, CO) and Corelis (Cerritos, CA) offer debuggers and in-circuit emulators. (You can get further information on development tools for the Rxxxx and other MIPS µPs in the MIPS RISC Resource Catalog from the MIPS Group, Silicon Graphics or at www.mips.com.) Philips supplies the hardware-abstraction layer, device drivers, a reference design, and development board for Windows CE implementation on the TwoChipPIC. Microsoft's Visual C++ tool chain supports TwoChipPIC development. IDT's 33-MHz 79S381 evaluation board allows you to evaluate the 3041, 3052, and 3081 µPs. The board features 2 Mbytes of interleaved DRAM, expandable to 16 Mbytes; 256 kbytes of zero-wait-state SRAM; 512 kbytes of EPROM, expandable to 2 Mbytes; and a 1024-bit serial EEPROM. The company provides the 79S361 evaluation platform for the 79R36100. This board has 1 Mbyte of noninterleaved, zero-wait-state DRAM, expandable to 64 Mbytes. It also contains 2 Mbytes of EPROM, and a slot for 1 Mbyte of zero-wait-state SRAM. IDT offers its kernel-integration tool that includes source- and object-code versions of common routines for CPU design. The company also offers a system-integration monitor that is a ROMable debugging kernel. The monitor includes IDT's micromonitor, which requires only a UART and ROM to perform the initial debugging and integration of new hardware. IDT/C is an ANSI C-compliant Gnu compiler, assembler, linker, and librarian. It includes start-up code, cache, and exception routines. Toshiba offers evaluation boards for its TX39 products. 
These boards feature support for serial, SCSI-II, Ethernet, or VMEbus interfaces. Wind River's VxWorks and Tornado RTOS support these boards. Second sources: The MIPS Group of Silicon Graphics licenses the R3xxx processors to IDT, NKK, Philips, and Toshiba.
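The fully associative, 64-entry translation-look-aside buffer described above can be illustrated with a small model. This is only a sketch under stated assumptions: the 4-kbyte page size, the oldest-first replacement, and the names `FullyAssociativeTLB`, `insert`, and `translate` are illustrative choices, not details taken from the R3000 description.

```python
from collections import OrderedDict

PAGE_SHIFT = 12    # assumption: 4-kbyte pages (not stated in the text)
TLB_ENTRIES = 64   # from the text: 64-entry, fully associative TLB


class FullyAssociativeTLB:
    """Toy model: any virtual page may occupy any of the entries."""

    def __init__(self, entries=TLB_ENTRIES):
        self.entries = entries
        self.map = OrderedDict()  # virtual page number -> physical frame number

    def insert(self, vpn, pfn):
        """Add a translation, evicting the oldest one when the TLB is full."""
        if vpn not in self.map and len(self.map) >= self.entries:
            self.map.popitem(last=False)  # assumption: oldest-first replacement
        self.map[vpn] = pfn

    def translate(self, vaddr):
        """Return the 32-bit physical address, or None on a TLB miss."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        pfn = self.map.get(vpn)
        if pfn is None:
            return None
        return ((pfn << PAGE_SHIFT) | offset) & 0xFFFFFFFF
```

Because the buffer is fully associative, a lookup compares the virtual page number against every entry at once; the dictionary stands in for that parallel comparison.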
The M32R/D has 16 32-bit, general-purpose registers, supports 83 instructions of 16- and 32-bit-wide instruction formats, and has six addressing modes. The CPU executes most instructions in one clock cycle using an instruction fetch, a decode, an execute, a memory access, and a write back. The decode stage dispatches instructions in order, and the remaining stages execute them out of order to hide memory-access latency. The MAC unit contains a single-cycle, 32×16-bit multiplier and a 56-bit adder. The CPU has an instruction queue of two 128-bit entries. The cache maps directly to the address space and has caching modes for internal instruction and data, for internal and external instructions, and for cache off. If a cache miss occurs, the CPU fetches one 128-bit data line in five cycles. The BIU has 128-bit data buffers and supports burst transfers on 128-bit boundary data. A 16.67-MHz bus clock and four digital PLLs generate the internal 66-MHz clock. The PLL contains a digital frequency multiplier. Four cascaded, 64-tap inverter chains generate four timing edges in one-half clock cycle. A phase detector and an up/down counter adjust the pulse width to one-fourth of the one-half clock cycle to keep the duty cycle of the four-times clock at 50%. The generated clock then feeds into a digital phase shifter to reduce the phase difference between the external and internal clocks to 400 psec. Power management: The M32R/D supports sleep and standby modes during which the average power consumption is 260 and 2 mW, respectively. Special instructions: The M32R/D supports MACs of 32×16 and 16×16 bits. It also performs data rounding in the accumulator and block moves. Special on- and off-chip peripherals: The M32R/D integrates no I/O-peripheral functions, but you can use it with a peripheral-function "super-I/O" chip, such as the Mitsubishi M65439FP. 
The M65439FP performs functions such as M32000D bus control, DRAM control supporting two banks and page-mode burst transfers, and chip-select control for as many as five 64-kbyte to 4-Mbyte blocks with one to eight wait states. The super I/O also contains a two-channel DMA controller that can transfer as much as 2 Mbytes using cycle-steal, single-transfer, continuous-burst-transfer or cycle-steal, continuous-transfer mode. An interrupt controller handles 20 sources with priority resolution for as many as seven levels. Other functions of the M65439FP include timers, a two-channel UART, and a two-slot IC-card controller. It sells for $14 (10,000). Development tools: Cygnus (Mountain View, CA) supplies a C and C++ compiler and debugger for the M32R/D. Mitsubishi also supplies a C compiler and an evaluation board. Integrated Systems (Sunnyvale, CA) and Wind River (Alameda, CA) will supply RTOSs for the M32R/D before year-end. Second sources: There are no second sources for the M32R/D.
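To make the M32R/D's multiply-accumulate path concrete, the following sketch models a 32×16-bit multiply feeding a 56-bit accumulator. The signed interpretation of the operands and the wraparound on accumulator overflow are assumptions made for illustration; the text specifies only the operand and adder widths. The function names are hypothetical.

```python
ACC_BITS = 56                    # from the text: 56-bit adder
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, a32, b16):
    """Return acc + a32 * b16, wrapped to 56 bits (wraparound is an assumption)."""
    product = to_signed(a32, 32) * to_signed(b16, 16)
    return to_signed((acc + product) & ACC_MASK, ACC_BITS)


def dot_product(samples, coefficients):
    """Accumulate a sum of products, as a filter kernel would."""
    acc = 0
    for sample, coefficient in zip(samples, coefficients):
        acc = mac(acc, sample, coefficient)
    return acc
```

The extra accumulator width beyond the 48-bit product leaves headroom for long sums before overflow becomes a concern.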
The 68EC000 has microcode and second-level, expanded-nanocode microcode levels. Instruction execution triggers a chain of 10-bit microcode words. Each microcode word can reference another word, such as a jump in microcode or a string of 70-bit nanocode words that directly drive the CPU logic. The CPU lacks a memory controller, but the separate address and data buses eliminate the need for buffering addresses. However, the CPU needs logic to generate the required DTACK* signal, which marks the successful completion of a memory cycle. An address decoder is necessary for multiple memory chips, and drivers may be necessary to buffer bus address and data lines. (Integrated versions of the 68EC000 contain this logic.) If DTACK* is late, the CPU generates wait states. Power management: Only the integrated versions provide variations of sleep and low-power stop modes. Special instructions: The chip restricts privileged instructions--reset, stop, and moves and operations on the status register--to supervisor mode. To support user and supervisor modes, the hardware implements separate stacks and pushes and pops the program counter and status register onto the stack for exceptions. A link instruction lets you build link lists on private stacks. A special instruction lets you move as many as 16 registers to or from an effective address, including blocks of data registers to or from address registers. Development tools: Green Hills Software (Santa Barbara, CA) provides C, C++, Fortran, Pascal, and Ada compilers for the 68K architectures. This company also provides its Multi software-development environment for developing programs from these languages and mixing them into an executable in almost any combination. Hewlett-Packard (Colorado Springs, CO) offers logic analyzers, oscilloscopes, emulators/analyzers, software simulators, debuggers/emulators software, a real-time software-performance analyzer, C compilers, assemblers, linkers, and a debugging utility for RTOSs. 
Huntsville Microsystems (Huntsville, AL) supplies emulators, a $199 background-mode debugger (BMD), and simulators for Motorola devices. The company offers its HMI-200 Series and SPS-2000 Series emulators. Integrated Systems (Sunnyvale, CA), Microtec (San Jose, CA), and Microware (Des Moines, IA) provide RTOSs and other software tools to support hardware and software integration. Intermetrics (Cambridge, MA) offers compilers, assemblers, utilities, debuggers, and royalty-free real-time kernels. Orion Instruments (Sunnyvale, CA) offers in-circuit emulators (ICEs) and HLL source-code debuggers for Windows or Unix platforms. Software Development Systems (Oak Brook, IL) provides C and C++ compilers; assemblers; simulators; target-monitor, BDM, and JTAG debuggers; and interactive development and debugging environments. Wind River Systems (Alameda, CA) provides an RTOS, networking facilities, and cross-development tools. Wind River also provides a diagnostic and analysis tool that provides visibility into the dynamic operation of an embedded system. Yokogawa Digital Corp (Tokyo) sells ICEs. Second sources: Second sources of a few NMOS versions of the 68000 are Hitachi, Philips, SGS-Thomson, and Toshiba.
The superscalar 68060 heads the 680x0 lineup with its dual integer and floating-point pipelines. As instructions enter the CPU, they flow into a four-stage prefetch pipeline: instruction-address generation, instruction cycle, instruction early decode, and instruction buffer. In this pipeline, the 680x0 converts 68000-compatible, variable-length CISC instructions to a fixed-length instruction. These instructions then enter dual, four-stage, synchronously operating, integer-execution pipelines. The decode, effective-address-calculation, fetch, and integer-execution pipeline dispatches instructions to the FPU and allows for some execution overlap between the integer and FPU engines. A Harvard architecture allows the 68060 to perform simultaneous instruction fetches and data accesses. The four-way set-associative, four-way-interleaving, on-chip caches support simultaneous read and write operations. You can freeze portions of the caches to prevent reallocation. The 68040 implements a fetch, decode, effective-address-calculation, effective-address-fetch, execute, and write-back pipeline. To speed processing, it has two 4-kbyte direct-mapped caches and separate data and instruction MMUs, which allow simultaneous address translations. The 040 includes bus snooping to ensure cache coherency for multiprocessing. The cache supports both write-through and copy-back modes. The 68020 and 68030 CISC implementations have smaller caches; the 030 and 040 versions implement burst mode, moving as much as 16 bytes in an addressing block between registers and memory. The 040 and 060 deliver apparent single-cycle execution for some instructions, mainly register operations such as memory-to-register moves if the data is in the data cache. A taken branch takes two cycles; a not-taken branch takes three cycles. On the 68060, a 256-entry, four-way, set-associative, on-chip branch cache allows taken and nontaken branches to execute in zero and one clock, respectively. 
The branch-cache unit contains state bits that provide a history of branch executions, which helps to predict branch direction. Unlike the 020/030, the 040 and 060 perform no dynamic bus sizing. Instead, they have a highly reliable bus with a high-drive option that can implement a synchronous, two-clock read/write protocol. A 4-word burst takes five clocks. The 040 includes multiprocessor-bus arbitration but requires off-chip logic. Externally, the 68060 bus is a superset of the 68040 bus. Additional signals support higher performance system designs, but the processor can easily operate on a 68040-based bus. An on-chip MMU with separate instruction and data translation look-aside buffers allows the 68060 to access as much as 4 Gbytes of memory. Power management: To support power management, the 68060's functional units respond to dynamically controlled clocking; the caches and execution units power down when not accessed. The static design allows you to reduce or stop the external clock, and a low-power-stop (LPSTOP) instruction disconnects most of the chip from the clock pin. Special instructions: The CPUs have special instructions for variable-length bit fields, for moving 16 registers, and for compare and swap, which locks memory for multiprocessing. A scaling option addresses data by item size for table access, FPU, and MMU commands. The 68040 and 68060 have a special move instruction (MOVE16) to perform a 16-byte block move and a PLPA instruction that loads a physical address by translating a logical address. A table instruction performs a table look-up and interpolates the data. Special peripherals: The MC68150 allows the 68040, LC040, and EC040 bus to communicate bidirectionally with 32-, 16-, or 8-bit peripherals and memories. The XC68HC901 multifunction peripheral comprises a one-channel USART and an eight-source interrupt controller. Development tools: The 680x0 shares many of the same tools as the 68EC000. 
Second sources: Toshiba acts as a second source for some versions of the 680x0.
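The compare-and-swap operation listed among the 680x0's special instructions is the usual building block for multiprocessor synchronization. The sketch below shows the general idea only: a Python lock stands in for the locked memory cycle, and the `Cell` class and `atomic_increment` helper are hypothetical names, not a model of the exact instruction semantics.

```python
import threading


class Cell:
    """A memory word whose compare-and-swap is made indivisible by a lock."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for the locked bus cycle

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False


def atomic_increment(cell):
    """Retry loop: reread and try again whenever another writer got in first."""
    while True:
        old = cell.value
        if cell.compare_and_swap(old, old + 1):
            return old + 1
```

A processor that loses the race simply rereads the word and retries, so no update is lost even when several processors modify the same location.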
ColdFire has a four-stage pipeline comprising a two-stage, instruction-prefetch pipeline (address-generation and instruction-fetch cycle) and a two-stage operand execution pipeline (decode and select/operand fetch cycle and operand address generation/execute cycle). A 12-byte FIFO instruction buffer decouples the two pipelines. The prefetch pipeline calculates the next instruction address and then fetches 32 bits of instruction data. The operand pipeline has a dual read-ported register file feeding an ALU. A modular, standard bus architecture separates the CPU core from on-chip peripherals. The core communicates with on-chip memories using the tightly coupled Kbus processor bus. This bus lets the core perform a 32-bit fetch from internal memory in one clock cycle by pipelining the address and data. A controller interface on the Kbus indirectly attaches the core to user-selectable cache, ROM, and RAM modules. Another bus, the Mbus (master bus), offers centralized arbitration. A special module connects the Mbus to the Kbus. The Sbus (slave bus) interfaces to standard on- and off-chip peripherals and attaches to the Mbus through a system-bus controller. On-chip debugging supports real-time trace; real-time and non-real-time debugging; and access to control registers to define types of memory regions, such as cacheable copy-back, write-through, and noncacheable. Real-time trace reflects the processor's status and indicates events such as instruction completion and monitor of change-of-flow target addresses. Real-time debugging supports program-counter-relative, operand-address, operand-data, and non-real-time-debugging hardware breakpoints. Non-real-time debugging is similar to background-debugging mode on current 683xx products. You can use a three-pin serial interface in this mode to read register contents, generate an infinite-priority interrupt, and force the CPU to halt. 
Power management: A low-power-stop instruction (LPSTOP) shuts down active circuits in the processor and halts instruction execution. Processing resumes via reset or valid interrupt. Special instructions: ColdFire added 32×32-bit integer-multiply, register-sign-extension, and multiword no-operation instructions to the 68000 architecture. Compilers use no-operation instructions to remove branch instructions. Special on-chip peripherals: The MCF5200M processor, which Motorola designed with its FlexCore methodology, integrates the ColdFire core, debugging module, and misalignment module with a multiply-accumulate (MAC) unit supporting 16- and 32-bit operations. The MCF5202 supports a 32-bit multiplexed bus with dynamic bus sizing that allows access to 8-, 16-, or 32-bit memory and peripherals. It also has a debugging module that provides serial control and visibility of the processor and memory system. Motorola offers the ColdFire2 and ColdFire2M in the FlexCore library for customer design. Both devices integrate the ColdFire core with a debugging module, a misalignment module, and memory controllers that support as much as 32 kbytes each of RAM, ROM, and instruction cache. The ColdFire2M also incorporates the MAC unit. Development tools: The MCF5102, Motorola's first implementation of the ColdFire processor, takes advantage of the established 68K software base. Compatibility with 68EC040 development tools gives developers access to a range of tool support. Third-party tools for the ColdFire family include in-circuit emulators from Lauterbach (Framingham, MA), Microtec International (Hillsboro, OR), Embedded Support Tools (Canton, MA), and Orion Yokogawa (Sunnyvale, CA). Cygnus (Mountain View, CA), Diab Data (Foster City, CA), Green Hills (Santa Barbara, CA), and Microtec offer C compilers. Wind River (Alameda, CA), Integrated Systems Inc (Sunnyvale, CA), Embedded System Products (Houston), and Accelerated Technology (Mobile, AL) offer ColdFire RTOS products. 
Many of these companies also offer debugging support. Second sources: There are no second sources for ColdFire. For most of the 683xx family, Motorola combined a stripped-down 68020 core with a 16-bit (32-bit for CPU32+) on-chip InterModule Bus (IMB), which links the CPU with a device's complex peripherals. The core processor, the CPU32 or CPU32+, is a 68020-based CPU for embedded control that lacks memory-management-unit (MMU) and FPU interfaces. The CPU32 and CPU32+ have 16- and 32-bit data buses, respectively. The 32-bit processor has eight general-purpose, 32-bit registers; seven 32-bit address registers; a 32-bit ALU; and separate user and supervisor modes, each with its own stack and separate address and data spaces. The CPU32 is code-compatible with the 68020 but has enhanced addressing modes, including scaled index, address-register indirect with base displacement and index, program-counter relative, and 32-bit branch displacements. Postincrement and preincrement/decrement options simplify iterative code. The CPU accesses memory-mapped peripheral-control registers and I/O as addresses in memory. All 683xxs have a system-integration module featuring system configuration, oscillator and clock dividers, reset and power-down-mode control, chip selects and wait states, parallel I/O with interrupt capability, interrupt configuration/response, and a software watchdog timer. The external-bus interface has as many as 32 address and 16 data lines (32 for CPU32+) and as many as 12 programmable chip-selection lines. The single-chip Integration Module II allows users to select 32-kHz or 4-MHz clock crystals. Power management: A low-power-stop (LPSTOP) instruction stops the clock. Devices can run at low frequencies. Special instructions: Unlike the 68020, the CPU32 does not support BCD-pack/unpack, bit-field, compare-and-swap, coprocessor, MMU, module-call/return, or memory-indirect-addressing instructions. 
New instructions include a table look-up and interpolate and the ability to put the chip into a low-power standby mode. Development tools: The 683xx leverages the extensive development-tool support from the 68xxx architecture. These tools include assemblers from 12 vendors, compilers and debuggers from 14, RTOSs from 12, emulators from 13, and evaluation boards from three. Second sources: There are no second sources for the 683xx.
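A table look-up-and-interpolate operation of the kind mentioned above can be sketched in a few lines. The split of a 16-bit operand into an 8-bit table index and an 8-bit fraction is an assumption made for this illustration, as is the function name; the text says only that the instruction looks up a table entry and interpolates.

```python
def table_lookup_interpolate(table, x):
    """Interpolate between adjacent table entries.

    Assumption: the high byte of the 16-bit value x selects the entry,
    and the low byte is the fraction (in 1/256ths) toward the next entry.
    """
    index = (x >> 8) & 0xFF
    frac = x & 0xFF
    lo = table[index]
    hi = table[index + 1]
    # Integer linear interpolation; Python's // rounds toward minus infinity.
    return lo + ((hi - lo) * frac) // 256
```

In control applications, this lets a short table of calibration points stand in for a much larger one, with the hardware filling in the values between entries.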
The PowerPC 750 contains seven parallel-operating execution units: two integer units, a branch-processing unit, a load/store unit, an FPU, a condition-register unit, and a Level 2 (L2) cache-interface unit. The CPU can fetch as many as four instructions per cycle. The 750 processes branches as they enter the instruction buffer and can decode and dispatch two nonbranches in one cycle. Completion logic keeps track of the outstanding instructions and retires them in order. The PowerPC 750 µP can use static or dynamic branch prediction to improve the accuracy of instruction prefetching. For static prediction, the branch-operation codes provide hints to predict whether a branch is taken or not. For dynamic prediction, the CPU uses a 512-entry branch history table and a 64-entry branch-target instruction cache. The CPU permits speculative execution down a predicted path beyond one unresolved branch. The 750 has separate 32-kbyte instruction and data caches. Both eight-way set-associative, lockable caches provide byte-level parity checking. A locked cache typically supplies data on a hit, but cache lines are not replaced on a miss. The 750 contains an on-chip L2 cache controller and back-side L2 bus, which improves system performance by reducing system bus traffic. The L2 cache controller includes 8192 tag entries, which support 256 kbytes, 512 kbytes, or 1 Mbyte of external, two-way set-associative, unified L2 cache. The L2 cache uses standard, commodity SRAMs. The nonblocking L2 cache supports hit-under-miss mode and can simultaneously service as many as four requests. The L2 cache bus can operate at various speeds relative to the processor frequency. The PowerPC 604e contains seven independent execution units: two single-cycle integer units, a multiple-cycle integer unit, a branch-processing unit, a load/store unit, an FPU, and a condition-register unit. 
Instructions execute out of order, and execution results can be immediately available to subsequent instructions through the use of rename registers. The completion unit commits, or "retires," results to floating-point or general-purpose registers. The unit retires as many as four instructions per clock cycle in order, ensuring a precise exception model. The PowerPC 604e µP uses dynamic branch prediction to improve the accuracy of instruction prefetching. This feature and the ability to speculatively execute through two unresolved branches minimize pipeline stalls. The 604e has separate 32-kbyte, four-way set-associative instruction and data caches, both of which provide byte-level parity checking. The 604e and 750 have separate memory-management units (MMUs) for instructions and data. The MMUs support as many as 4 petabytes of virtual memory and 4 Gbytes of physical memory. Access privileges and memory protection are controlled on 128-kbyte to 256-Mbyte blocks and 4-kbyte pages. Translation-look-aside buffers with 128 entries efficiently translate addresses by storing the most recently used page translations. The 604e and 750 support 64-bit data and 32-bit address buses. The interface protocol allows multiple masters to access system resources through a central arbiter. The PowerPC 604e works in multiprocessor systems, and its snooping requires no additional bus cycles. The 604e's on-chip snooping logic maintains cache coherency in multiprocessor systems. The 750 supports snooping but is optimized for uniprocessor systems. It supports no data sharing among caches in different processors. The buses on the 604e and 750 are compatible electrically and in the protocol used. A common chip set supports both processors. The 603 comprises five parallel execution units: integer execution, floating point, branch, system, and load/store. With a four-stage pipeline--fetch, dispatch, execute, and complete--the 603 can achieve three instructions per clock cycle. 
During the fetch stage, the 603 uses a six-instruction prefetch queue to hold pending instructions. Unlike other PowerPC derivatives, the 603 supports only static branch prediction. However, the architecture supports out-of-order execution and in-order retirement, similar to other PowerPC devices. The PowerPC 602, a cost- and power-reduced implementation of the 603, has four parallel-execution units: fixed point, FPU, branch processing, and load/store. The scalar design of the 602 limits instruction issue to one instruction per cycle. The 602 performs branch folding to help eliminate branches and incorporates a single-precision FPU. The embedded PowerPC processors include Motorola's MPC500 and MPC800 families and IBM's 400 series devices. Compared with other PowerPC devices, these devices have similar--but fewer--execution units and issue only one instruction at a time. They include additional integrated peripherals to tune the devices for embedded applications. The MPC500 uses an InterModule Bus, developed for Motorola's 683xx devices, as a backplane to connect all system modules. The MPC500 family includes a system-integration unit that enables simple integration with external memories, other CPUs, and peripheral devices. Memory components attach directly to the 403Gx processors with the programmable-memory interface on the processor's bus-interface unit. The DRAM controller includes the address multiplexer, eliminating the need for an external address multiplexer. You can use software programming to tune the timing for the interface control signals. Power management: See the EDN Web site at www.ednmag.com for more details. Special instructions: See the EDN Web site at www.ednmag.com for more details. Development tools: See the EDN Web site at www.ednmag.com for more details. Second sources: Mitsubishi acts as a second source for IBM's embedded PowerPC µPs.
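Dynamic branch prediction with a branch history table, as described for the PowerPC 750, is commonly built from small saturating counters. The sketch below assumes 2-bit counters indexed by the low bits of the branch address; neither detail comes from the text, which gives only the 512-entry table size. The class and helper names are hypothetical.

```python
BHT_ENTRIES = 512  # from the text: 512-entry branch history table


class BranchHistoryTable:
    """Toy predictor; 2-bit saturating counters are an assumption."""

    def __init__(self, entries=BHT_ENTRIES):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start at weakly not-taken

    def _index(self, pc):
        # PowerPC instructions are 4 bytes long, so drop the two low address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Run a sequence of branch outcomes through the predictor."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses
```

For a loop branch that is taken nine times and then falls through, a predictor of this kind settles into one misprediction per pass through the loop, at the loop exit.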
After the decoder creates µops, it sends them to a 40-deep reorder buffer (ROB). The µops then await dispatch to the execution portion of the pipeline. At this point, the µops either are ready for execution or are waiting for data from a memory access or a result from a previous µop. To avoid register dependencies, the PPro performs renaming: Extra registers represent the x86's programmer-visible registers. The dispatch/execute engine queues ready-for-execution µops within a 20-entry, distributed-reservation station. The PPro determines the data flow by analyzing which µops depend on each other's results. The processor can dispatch µops from anywhere in the reservation station and in any order. The PPro speculatively executes and returns these µops to the ROB, where the retire engine evaluates them. Although the PPro executes µops or instructions out of order, the device completes the instructions in the original program order. Furthermore, speculative execution implies that the device executes some instructions that never retire. This situation occurs if the device mispredicts a program branch. When the PPro encounters a mispredicted branch, it flushes its deep pipelines and removes µops from the ROB. To minimize the possibility of a mispredicted branch, Intel designers increased the branch target buffer (BTB) to 512 entries and added history bits to help the prediction algorithm. The Pentium II is a PPro minus the on-chip Level 2 cache but with the same multimedia-extension (MMX) instructions as Pentium, all contained in a single-edge contact (SEC) cartridge. The cartridge includes a 512-kbyte, L2 cache that runs at half the CPU speed. AMD's K6 with MMX is a six-issue, superscalar µP with a Socket 7-compatible bus interface. It features a decoupled, decode/execution, superscalar design that can simultaneously decode multiple x86 instructions. 
It also performs single-clock RISC operations, out-of-order execution, data forwarding, speculative execution, and register renaming. The AMD-K6 processor, based on a six-stage pipeline, contains parallel decoders, a centralized RISC86 operation scheduler, and seven execution units. Similar to the PPro, the K6 decodes x86 instructions into RISC86 operations that adhere to the RISC-performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The K6 implements branch prediction logic in the form of an 8192-entry branch-history table, a branch-target cache, and a return-address stack. In the K6, x86 instruction decoding begins before the CPU fills the on-chip instruction cache. Predecode logic determines x86-instruction length on a byte-by-byte basis. The K6 stores this predecode information, along with x86 instructions, in the instruction cache, for later use by the decoders. The decoders translate as many as two x86 instructions per clock into RISC86 operations. The scheduler contains the logic needed to manage out-of-order execution, data forwarding, register renaming, simultaneous issue and retirement of multiple RISC86 operations, and speculative execution. The scheduler's RISC86 operation buffer can hold as many as 24 operations. The scheduler can simultaneously issue a RISC86 operation to any available execution unit (store, load, branch, integer, integer/multimedia, or floating point). The scheduler can issue as many as six and retire as many as four RISC86 operations per clock. Unlike the PPro or K6, the Cyrix 6x86MX processor directly executes native x86 instructions, rather than converting x86 instructions into RISC-like instructions. The 6x86MX achieves a dual-x86 instruction issue/execute rate using dual seven-stage pipelines. The CPU performs register renaming, multilevel dynamic branch prediction, speculative execution, and out-of-order completion. 
The 6x86MX has a dual-ported, 64-kbyte cache and a dual-ported, 384-entry translation-look-aside buffer; both support two reads and two writes or one read and one write on every cycle. In addition, the 6x86MX fully supports Intel's MMX instruction set. The instruction-fetch stage of the pipeline fetches 16 instruction bytes per cycle from the instruction cache and feeds the instruction-decode stage. The instruction decoder issues as many as two complex x86 instructions per cycle. During decoding, the decoder examines the resource requirements of the two instructions and chooses the optimal pipeline for each instruction. During these stages, the decoder accesses the 512-entry branch-target buffer and the 1024-entry branch-history table to avoid pipeline bubbles. During the access stages of the pipeline, the CPU performs scoreboard checks, renames registers, and accesses the physical register file. The 6x86MX also calculates one or two linear addresses per cycle for all addressing modes and accesses the translation-look-aside buffer and cache. The ability to fetch as many as two memory operands from the data cache before the instruction-execution stage allows the 6x86MX to execute memory-reference instructions in one cycle.
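The interplay of out-of-order execution and in-order retirement described for the PPro's reorder buffer can be modeled compactly. The sketch below is illustrative only: the entry format and the method names (`allocate`, `complete`, `retire`, `flush_after`) are assumptions, and only the 40-entry depth comes from the text.

```python
from collections import deque

ROB_SIZE = 40  # from the text: 40-deep reorder buffer


class ReorderBuffer:
    """Toy model: µops finish in any order but retire in program order."""

    def __init__(self, size=ROB_SIZE):
        self.size = size
        self.entries = deque()  # program order, oldest first

    def allocate(self, name):
        """Add a decoded µop; return False when the buffer is full (decoder stalls)."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark a µop as executed; execution may finish out of order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return

    def retire(self):
        """Retire finished µops from the head only, preserving program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

    def flush_after(self, name):
        """On a misprediction, drop every µop younger than the named branch."""
        names = [entry["name"] for entry in self.entries]
        if name not in names:
            return
        keep = names.index(name) + 1
        while len(self.entries) > keep:
            self.entries.pop()
```

A µop that finishes early simply waits in the buffer until everything ahead of it has retired, which is also why speculatively executed µops can be discarded without affecting program-visible state.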
The microSPARC has a five-stage pipeline: fetch, decode, memory access, execute, and write back. It also has a four-entry write buffer to prevent write stalls. An FPU contains 32×32-bit floating-point registers, a general-purpose execution unit, and a floating-point multiplier. A three-entry queue of floating-point instructions increases concurrency with integer execution. Ross Technology (Austin, TX) makes microSPARC-compatible hyperSPARC devices, which the company sells as after-market products to Sun-workstation users. The hyperSPARC's pipeline comprises an instruction-fetch unit, two integer ALUs, a load-and-store unit, and an FPU. The device fetches--and when possible dispatches to functional units--instructions in pairs. The FPU contains a four-entry instruction queue, helping to eliminate stalls resulting from the execution of multiple-cycle floating-point instructions. In addition, when data dependencies exist, the FPU forwards data generated by one instruction in the floating-point queue to the second instruction without going through the register file. Both microSPARC and hyperSPARC include SPARC-compliant memory-management units (MMUs). The microSPARC's MMU uses 3 high-order bits of physical address to map eight address spaces. The MMU controls arbitration among I/O, data cache, instruction cache, and translation-look-aside-buffer (TLB) references to memory. The MMU contains a 64-entry, fully associative TLB and supports 256 contexts. The hyperSPARC's MMU uses a context register to identify as many as 4096 contexts. The microSPARC µPs have a separate 64-bit memory interface that handles as much as 128 Mbytes (256 Mbytes for the microSPARC II) of 16-Mbit DRAM. An on-chip, 25-MHz, 32-bit, synchronous Sbus (slave-bus) interface and controller handle five Sbus slots. The hyperSPARC interfaces to the system through the SPARC-standard 40-, 50-, or 66-MHz; 64-bit; multiplexed; synchronous Mbus (master bus). 
Ross' processors operate synchronously or asynchronously with the Mbus clock. The microSPARC-IIep replaces the Sbus interface with a 32-bit, 33-MHz, PCI interface. Special instructions: The microSPARC and hyperSPARC µPs comply with instructions in the SPARC V8 specification. Temic Semiconductor's (San Jose, CA) SPARClet µP includes DSP capabilities by extending the SPARC instruction set and accessing hardware operators using the coprocessor operating code. Special peripherals: The microSPARC-II has an on-chip Sbus interface. Sun provides peripheral ASICs that attach to the Sbus and provide memory and I/O capabilities, such as Ethernet, serial, keyboard, mouse, SCSI, and parallel ports. One such ASIC, the PCIO chip, links the processor and 10/100-Mbit Ethernet; an 8-bit expansion bus links to standard "super-I/O"-like ASICs for connection to keyboards, mice, serial ports, and the like. The microSPARC-IIep contains a PCI interface for using industry-standard peripherals. Development tools: A variety of OSs, each with its own set of development tools, support the microSPARC-IIep. Sun's Solaris OS features the Workshop suite of development tools. Workshop contains a C/C++ compiler and source-code-control, debugging, and profiling tools. Workshop provides a self-hosted development environment allowing programmers to develop software for embedded applications on their desktop development workstations. Wind River's (Alameda, CA) Tornado provides an integrated suite of development tools for a cross-platform, host-target environment. Tornado features graphical host-based tools, a high-performance RTOS, and host-target communication protocols. Chorus Systems (Campbell, CA) features the ClassiX RTOS. You can compile application code on a Solaris host with the Workshop compiler and debug the code with a Gnu-based source-level debugger. 
ClassiX also features a Common Object Request Broker Architecture (CORBA)-compliant Object Request Broker and an Interface Definition Language (IDL) compiler. IDL describes the interface to a routine or function. For example, IDL defines objects in the CORBA distributed-object environment, which describes the services that the object performs and how data passes to the object. IDL stores the definitions in an interface repository that a client application can query to determine which functions, or objects, are available on the object bus. For developers using alternative system software, the Cygnus (Mountain View, CA) GnuPro C/C++ tool kit provides compiling and debugging tools. Second sources: There are no second sources for microSPARC or hyperSPARC devices; however, Sun licenses the microSPARC core to C-Cube Microsystems, Hyundai, Scientific Atlanta, and Zylan.
The V800 architecture comprises a five-stage pipeline (fetch, decode, execute, memory access, and write back); 32 general-purpose registers; a 32-bit barrel shifter; and a hardware multiplier. Most instructions execute in one clock and are 2 bytes long, allowing smaller code. The CPU has a pipeline-stall feature that automatically inserts a bubble in the pipeline to resolve data dependencies and other hazards. Instruction and data accesses occur on separate buses. Interrupt latency from an external source or a peripheral is at least 11 CPU cycles.

A bus-control unit (BCU) generates a prefetch address to prefetch instruction code from external memory and store it in the four-double-word prefetch queue. Instructions fetched from internal ROM go straight to the CPU, bypassing the prefetch queue. Instruction fetches from internal ROM consume one cycle; data fetches from ROM require three cycles. Therefore, you should copy (shadow) look-up tables and fixed data structures into the CPU's internal RAM, which the V800 can access in one clock. The BCU also provides a bus-arbitration function, allowing other devices, such as DMA controllers, to share and take control of the V851's external bus. Programmable wait- and idle-state insertion facilitates interfacing to slow memory.

Most of the microcontrollers provide as much as 16 Mbytes (24 address bits) of linear addressing; the new V850E also provides dynamic bus sizing. The maximum addressing range of the V800 architecture is 4 Gbytes. The V800 accesses peripherals as memory-mapped I/O that connects to the CPU through a 16-bit bus. ROM and RAM communicate with the CPU over a 32-bit bus. Although the first member of this family, the V851, has 32 kbytes of ROM and 1 kbyte of RAM, the V850 architecture allows internal expansion to 1 Mbyte of ROM and 4 kbytes of RAM. Similarly, the external bus of the V851 addresses as much as 16 Mbytes. (The architecture allows access to as much as 4 Gbytes on future chips.)
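The advice above to shadow look-up tables in internal RAM can be sketched in C. This is a minimal illustration, not NEC's code: the table contents and the names `rom_table`, `ram_table`, `shadow_table`, `lookup`, and `cycles_saved` are hypothetical, and the placement of the two arrays in internal ROM and RAM is assumed to be handled by a linker configuration that the sketch does not show. The cycle counts are the ones quoted above (three cycles for a data fetch from ROM, one from RAM).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_LEN 8u

/* Hypothetical look-up table; assumed to be placed in internal ROM by the linker. */
static const uint16_t rom_table[TABLE_LEN] = {0, 1, 4, 9, 16, 25, 36, 49};

/* Working copy; assumed to be placed in internal RAM by the linker. */
static uint16_t ram_table[TABLE_LEN];

/* Data-fetch costs quoted in the text. */
enum { ROM_DATA_CYCLES = 3, RAM_DATA_CYCLES = 1 };

/* Copy the table from ROM to RAM once, typically at start-up. */
void shadow_table(void)
{
    memcpy(ram_table, rom_table, sizeof ram_table);
}

/* Read an entry through the RAM copy instead of the ROM original. */
uint16_t lookup(unsigned index)
{
    assert(index < TABLE_LEN);
    return ram_table[index];
}

/* Data-access cycles saved over n_reads reads, ignoring the one-time copy. */
unsigned long cycles_saved(unsigned long n_reads)
{
    return n_reads * (unsigned long)(ROM_DATA_CYCLES - RAM_DATA_CYCLES);
}
```

Under these assumptions, 1000 table reads save 2000 data-access cycles, set against the one-time cost of the copy, so shadowing pays off for tables that the code reads repeatedly.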
The V850's memory space divides into 1-Mbyte blocks, and you can program bus-cycle wait states separately for each pair of blocks.

Power management: In halt mode, the clock generator continues to operate, but the CPU clock stops, allowing the on-chip peripherals to function. Idle mode stops the CPU clock and the internal system clock; however, because the clock generator continues to run, normal operation can resume without waiting for oscillator and PLL stabilization. In stop mode, everything stops, but register and memory contents stay intact.

Special instructions: NEC's V800 devices support a software-trap instruction. The CPUs also perform saturating operations, which store the maximum representable value when an addition overflows. For example, if the result exceeds the maximum positive value, 7FFFFFFFh, the CPU stores 7FFFFFFFh in the result register and sets the saturation flag. The V850E device also provides single-cycle byte-swapping operations for endian translation of data structures. NEC also includes single instructions that push and pop multiple registers to assist C procedure calls. The net effect is less code in the prologue and epilogue of C functions and a resulting speed increase.

Special on-chip peripherals: An on-chip DRAM controller, synchronous flash controller, and DMA controllers are available in the latest devices.

Development tools: NEC, Green Hills Software (Santa Barbara, CA), and Cygnus Support (Mountain View, CA) offer C-compiler tool chains for the V800. Accelerated Technologies (San Diego), Green Hills Software, NEC, JMI Software (Spring House, PA), and Wind River Systems (Alameda, CA) provide RTOSs. A host of stand-alone evaluation boards, PC ISA-bus evaluation boards, and in-circuit emulators is available from NEC and third-party vendors. NEC works jointly with Synopsys (Mountain View, CA) and Mentor (Wilsonville, OR) to provide simulation tools for V800 Series embedded-core/ASIC development.
NEC's OpenCAD environment supports these tools and is compatible with the standard device-development tools.

Second sources: There are no second sources for the V800.
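The saturating addition and byte swapping described under special instructions can be approximated in portable C. The sketch below is an illustration under stated assumptions, not a definition of the V800 instructions: the names `sat_add32`, `bswap32`, `sat_flag`, and `check_special_ops` are hypothetical, a global variable stands in for the saturation flag, and clamping negative overflow to 80000000h is assumed by symmetry because the text describes only the positive case.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the CPU's saturation flag (hypothetical name). */
int sat_flag = 0;

/* Add two 32-bit values and clamp the result instead of wrapping around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */

    if (wide > INT32_MAX) {                   /* above 7FFFFFFFh */
        sat_flag = 1;
        return INT32_MAX;
    }
    if (wide < INT32_MIN) {                   /* assumed symmetric negative clamp */
        sat_flag = 1;
        return INT32_MIN;
    }
    return (int32_t)wide;
}

/* Reverse the byte order of a 32-bit word for endian translation. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Small usage example exercising both helpers. */
void check_special_ops(void)
{
    sat_flag = 0;
    assert(sat_add32(INT32_MAX, 1) == INT32_MAX);  /* clamps at 7FFFFFFFh */
    assert(sat_flag == 1);
    assert(bswap32(0x12345678u) == 0x78563412u);
}
```

On the hardware, each of these operations is a single instruction; the C versions only show the arithmetic that a compiler would otherwise have to generate as a multi-instruction sequence.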
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.