
April 23, 1998


EDN's 1998 DSP 32-BIT Architecture Directory


Analog Devices SHARC DSP

The 32-bit fixed- and floating-point SHARC DSPs, or ADSP-2106xs, integrate four internal buses, a large on-chip memory, and an I/O controller to offload I/O. Within the CPU core, the ALU, multiplier, and shifter operate in parallel to perform multifunction, single-cycle instructions. SHARC DSPs feature an enhanced Harvard architecture in which the data-memory bus transfers data and the program-memory bus transfers both instructions and data. With its separate program- and data-memory buses and on-chip instruction cache, the processor can simultaneously fetch two operands and an instruction from cache in one cycle. The 32-entry, 48-bit-wide instruction cache is selective--caching only the instructions whose fetches conflict with accesses to program-memory data.

The SHARC DSP uses a general-purpose, 10-port, 32-register data-register file to transfer data between the computation units and the data buses and to store intermediate results. The 48-bit instruction word accommodates a variety of parallel operations for concise programming. For example, the ADSP-2106x DSPs can conditionally execute a multiply, an add, a subtract, and a branch in one instruction.

SHARC DSPs feature two data-address generators (DAGs), which implement circular data buffers. These DAGs contain sufficient registers to allow you to create as many as 16 primary and 16 secondary circular buffers. The buffers may start and end at any memory location, and the DAGs automatically handle address-pointer wraparound.
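
The pointer wraparound that a DAG performs in hardware can be modeled in a few lines. The following is a minimal sketch, assuming a buffer described by a base address, a length, and a modify step; the names dag_post_modify and walk are hypothetical and do not correspond to Analog Devices register names or instruction syntax.

```python
def dag_post_modify(index, modify, base, length):
    """Return the address that follows `index` after adding `modify`,
    wrapped so that it stays inside [base, base + length).

    Illustrative model only; parameter names are assumptions.
    """
    return base + (index - base + modify) % length


def walk(base, length, modify, steps):
    """List the addresses visited by `steps` accesses, starting at `base`."""
    addr = base
    visited = []
    for _ in range(steps):
        visited.append(addr)
        addr = dag_post_modify(addr, modify, base, length)
    return visited
```

Calling walk(100, 4, 1, 6) visits addresses 100 through 103 and then wraps back to 100, which is the behavior a FIR-filter delay line relies on.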

The ADSP-2106x SHARC chips have two high-speed serial ports and a host/parallel port, providing a direct interface to off-chip memory, peripherals, and a host processor. Link ports facilitate interprocessor communication and bus arbitration among as many as six ADSP-2106x chips.

The ADSP-2106x's CPU executes using on- or off-chip memory. Some SHARC chips contain as much as 512 kbytes of on-chip memory organized into two banks of dual-port RAM. You can use this memory to store a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for code and data, data memory for data, and an off-chip load using the chip's I/O controller.

SHARC's I/O controller executes I/O transfers in parallel with CPU execution. The I/O controller offloads reads and writes between on- and off-chip memory, but delays occur when accesses contend for the same data. The I/O controller manages all DMA channels, transferring data among internal and external memory and all peripherals, such as the host port, as many as eight serial ports, and six link ports. DMA operations generally do not interrupt or delay core thread execution. The DMA controller allows you to dynamically control the external-memory-bus width. The synchronous serial ports support time-division-multiplexed serial streams and hardware companding and can transfer data as fast as 40 Mbps. The six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 240 Mbytes/sec.
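
The aggregate link-port figure follows from simple arithmetic. This sketch assumes the 40-MHz clock rate quoted elsewhere in this entry and two 4-bit nibbles per clock cycle per link; the constant and function names are illustrative only.

```python
CLOCK_HZ = 40_000_000      # assumption: 40-MHz clock
NIBBLE_BITS = 4            # each link moves data in 4-bit nibbles
NIBBLES_PER_CYCLE = 2      # two nibbles per cycle give 1 byte/clock cycle
LINKS = 6


def link_throughput_mbytes(links=LINKS, clock_hz=CLOCK_HZ):
    """Peak aggregate link throughput in Mbytes/sec (1 Mbyte = 10**6 bytes)."""
    bytes_per_cycle = NIBBLE_BITS * NIBBLES_PER_CYCLE / 8
    return links * bytes_per_cycle * clock_hz / 1e6
```

With the defaults, the function returns 240.0; a single link contributes 40 Mbytes/sec of that total.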

The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory or cache and permits I/O from on- or off-chip peripherals--all in one cycle.

The 21060 and 21062 provide six communication ports for array multiprocessing. These ports feed through the I/O controller and let you create meshes of DSPs that can access each other's memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports. Transfers pass through the I/O ports to and from internal memory. The I/O controller separates these transfers from mainstream DSP.

A parallel port serves as a direct interface to off-chip memory, peripherals, or a host processor. As many as six ADSP-2106x chips can share this interface with a common system processor. SHARCs offer a unified address space using a 32-bit address bus and a 32- or 48-bit data bus. For a 40-MHz clock, the chip supports a 15-nsec access time with zero-wait-state memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. SHARC treats this host as a memory-mapped device with direct writes or reads to internal memory.

The newest SHARC DSP, the ADSP-21065, also provides a synchronous DRAM (SDRAM) interface that transfers data to and from SDRAM as fast as 240 Mbytes/sec, or twice the clock frequency. The glueless SDRAM interface can access 16- or 64-Mbyte SDRAMs and enables you to connect to any one of four external memory banks.  

Addressing modes

SHARC offers immediate, indexed, bit-reversed, circular-modulo, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)  
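
Bit-reversed addressing exists mainly to reorder the inputs or outputs of a radix-2 FFT. As a rough illustration of the index sequence such a mode produces, here is a software sketch; the function names are hypothetical, and real hardware generates these addresses without a loop.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_order(n):
    """Return indices 0..n-1 in bit-reversed order; n must be a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an eight-point transform, bit_reversed_order(8) yields 0, 4, 2, 6, 1, 5, 3, 7.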

Special instructions

SHARC provides bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution of most instructions. SHARC supports IEEE-754 single-precision floating point (23-bit data, 8-bit exponent, and sign bit) and a 40-bit extended IEEE format for additional accuracy (32-bit data).
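
The single-precision field widths quoted above can be checked on any host. This sketch uses Python's standard struct module to split a value into the sign, 8-bit exponent, and 23-bit fraction fields of the standard 32-bit format; it does not model the 40-bit extended format, and the function name is illustrative.

```python
import struct


def float32_fields(x):
    """Return (sign, biased_exponent, fraction) of x as an IEEE-754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction
```

For example, 1.0 decodes to a sign of 0, a biased exponent of 127, and a zero fraction.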

Support

Analog Devices sells a full-speed, nonintrusive, JTAG-based emulator that uses the ADSP-2106x’s built-in debugging capability. It runs under Windows and supports debugging for multiprocessor systems. The company also supplies an EZ-Lab Development System, a PC plug-in card for multiple 2106x processors, and an EZ-kit lite with a C compiler for $179. Third-party products include PC and VME multiprocessor cards and OSs. Analog Devices supplies a C compiler based on Gnu technology. This compiler supports Numerical C, which extends vector- and matrix-processing capabilities for signal processing. Other tools include an assembler/linker, a simulator, application libraries, a PROM splitter, and a C source-level debugger. The simulator simulates the 2106x core, including the pipeline and instruction cache, the memory subsystem and associated buses, interrupts, and the I/O processor and associated peripherals. The simulator accurately handles aborted pipeline stages, cache misses, and delay cycles associated with interrupts and bus contention and looping. Analog Devices developed the simulation engine as a dynamic-link library, isolating it from the graphical user interface via a series of public application-programming interfaces (APIs). You can use these APIs to connect the simulator with other software models and have them exchange and synchronize signals.


Hyperstone E1-32

The Hyperstone E1-32 combines RISC and DSP technology in a unified core. The integrated DSP unit, working in parallel with the ALU, can perform DSP calculations while the ALU performs loop counts, address calculations, or load/store operations. Hyperstone based the E1-32 on a two-stage pipeline and can only issue one instruction per cycle. The DSP instructions require two or more cycles to complete, and the ALU executes its instructions during the latency cycles of DSP instructions. You or the compiler must arrange your code to take advantage of these latency cycles. And, because the E1-32 supports no separate X and Y memory blocks, you have to perform all loads and stores during the latency cycles to achieve good DSP performance.

The E1-32 has a load/store architecture built around a register set that includes 64 general-purpose local and 22 global registers. Local registers are organized into a 64-word, circular register stack to hold function/subroutine stack frames. The stack is organized into frames of as many as 16 words; the E1-32 keeps current frames on-chip and automatically pushes them to off-chip memory as the register stack fills up (and pops them back from memory as the register stack empties). For fast parameter passing, the current stack frame can overlap with the previous one with a variable range. To minimize silicon overhead, the DSP unit shares all the E1-32's functional blocks, including the register set. However, the DSP unit does provide dedicated result registers and 32- and 64-bit hardware accumulators. The DSP unit supports 16- and 32-bit data types.

Zero-overhead looping on the E1-32 requires you to execute two multiply-accumulate operations per loop and use the latency cycles to perform the address calculations, data loads, and compare instructions. The E1-32's 100-MHz operation helps in this area.
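
The loop structure this implies, two multiply-accumulates per iteration, looks like the following when written in a high-level language. This is only a sketch of the unrolled arithmetic; it does not model E1-32 instruction timing or latency-cycle scheduling, and the function name is hypothetical.

```python
def dot_unrolled_by_two(x, h):
    """Sum of products with two multiply-accumulates per loop iteration."""
    if len(x) != len(h):
        raise ValueError("inputs must have the same length")
    acc = 0
    n = len(x)
    i = 0
    while i + 1 < n:
        acc += x[i] * h[i]          # first MAC of the iteration
        acc += x[i + 1] * h[i + 1]  # second MAC of the iteration
        i += 2
    if i < n:                       # leftover tap when the length is odd
        acc += x[i] * h[i]
    return acc
```

On the real part, the loads, address updates, and the loop compare would be placed in the latency cycles of the two MAC instructions rather than executed as separate steps.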

The 4-Gbyte address space divides into four blocks; you can individually configure each block for bus width and timing. The E1-32 integrates a fast-page-mode DRAM controller in one of the block spaces. You can use the other blocks for glueless connection of SRAM, EPROM, or other memory devices, each with its own timing and bus width. A separate I/O-address space also allows each I/O device to have its own timing.

Special instructions

Instructions can be 16, 32, or 48 bits; this variation helps reduce code size. The variable-length instructions, which the E1-32 automatically prefetches, provide constants and native addresses as large as 32 bits. DSP instructions include multiply, complex and real multiply-accumulate and multiply-subtract, and complex addition and subtraction. Other special instructions include test-leading zeros.
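
A complex multiply-accumulate bundles four real multiplications and four additions or subtractions into one operation. The sketch below shows the arithmetic on (real, imaginary) pairs; the function name and tuple representation are assumptions for illustration and say nothing about the E1-32's operand encoding.

```python
def complex_mac(acc, a, b):
    """Return acc + a * b, where each argument is a (real, imag) pair."""
    acc_re, acc_im = acc
    a_re, a_im = a
    b_re, b_im = b
    return (
        acc_re + a_re * b_re - a_im * b_im,  # real part
        acc_im + a_re * b_im + a_im * b_re,  # imaginary part
    )
```

For instance, accumulating (1 + 2j)(3 + 4j) into a zero accumulator gives (-5, 10).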

Development tools

Hyperstone offers a development starter kit, a PC-based development board, and the hyICE serial connector for stand-alone operation. The $3500 development board comes with as much as 8 Mbytes of DRAM, as much as 512 kbytes of SRAM, and as much as 128 kbytes of EPROM. You can attach your prototype hardware using the board's I/O expansion connectors. The company provides an ANSI C compiler, a source- and task-level debugger, an assembler, a linker, and a software profiler. Hyperstone offers hyDSP, a collection of subroutines that include FFTs, discrete cosine transforms, multidimensional arithmetic, and a variety of digital filters. The hyRTK multitasking, real-time operating system performs pre-emptive task scheduling. The operating system, including an integrated debugging monitor and a floating-point library, takes less than 32 kbytes. Third-party vendors Eonic Systems (www.eonic.com) and Etnoteam (www.etnoteam.com) also provide RTOS support.


Siemens Tricore

Siemens' Tricore architecture represents the industry's trend toward a blurring of the distinction between microcontrollers and DSPs (see "Microprocessor and DSP technologies unite for embedded applications," EDN, March 2, 1998, pg 73). This architecture's functional units and its unified instruction set target microcontroller- and DSP-specific functionality. Tricore is a superscalar core with two primary four-stage pipelines; the first bit of every instruction identifies which pipeline that instruction follows. One pipeline does loops, loads, and address-generation arithmetic; the other pipeline does all the math and branches. The execute unit comprises a multiply-accumulate (MAC) module, an ALU, and a tightly coupled coprocessor interface. A third pipeline performs loop control for zero-overhead looping. Tricore supports a mixture of 16- and 32-bit-wide instructions to help conserve code space; each operation code includes a size bit to improve the efficiency of instruction decoding.

Tricore implements a Harvard architecture with separate address and data buses for program and data memories. Tricore is also a load/store architecture with 16 32-bit general-purpose data registers and 16 32-bit address registers. You can concatenate consecutive even-odd data registers to form eight 64-bit registers for extended precision.
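
Pairing two 32-bit registers into one 64-bit value amounts to a shift and an OR. The sketch below assumes that the even-numbered register supplies the low word and the odd-numbered register the high word; that ordering, like the function names, is an assumption made for illustration.

```python
MASK32 = 0xFFFFFFFF


def join_pair(even_reg, odd_reg):
    """Combine two 32-bit values into one 64-bit value.

    Assumption: the even register is the low word, the odd register the high word.
    """
    return ((odd_reg & MASK32) << 32) | (even_reg & MASK32)


def split_pair(value64):
    """Split a 64-bit value back into (even_reg, odd_reg)."""
    return value64 & MASK32, (value64 >> 32) & MASK32
```

An extended-precision accumulation would keep its running 64-bit sum in such a pair and split it only when storing the result.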

Unlike traditional DSPs, Tricore lacks separate X- and Y-memory spaces, which may require you to perform some loop unrolling to achieve the parallel performance of DSPs. As long as data is available for Tricore's execute unit, it can perform single-cycle MACs. The data side of the core has a 128-bit-wide bus to on-chip DRAM, which you can use to save two data and two address registers in one cycle to the cache.

Addressing modes

Tricore supports the typical addressing modes of a load/store architecture, including absolute, base+offset, preincrement, and postincrement. It also supports circular buffers for DSP filters and bit-reversed indexing for FFTs. You must align the start of the circular buffer to a multiple of the data size, which the instruction using the buffer prescribes. The length of the buffer must also be a multiple of the data size the instruction using the buffer references.
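
The two circular-buffer constraints can be expressed as a simple check. This is a sketch only: the function name is hypothetical, and sizes are assumed to be in bytes.

```python
def valid_circular_buffer(base, length, access_size):
    """Check the alignment rules stated in the text.

    base        -- start address of the buffer (bytes, assumed)
    length      -- buffer length (bytes, assumed)
    access_size -- size of the data type the accessing instruction uses
    """
    start_aligned = base % access_size == 0
    length_ok = length > 0 and length % access_size == 0
    return start_aligned and length_ok
```

A 64-byte buffer at address 0x1000 passes for 4-byte accesses, whereas the same buffer at 0x1002 fails the start-alignment rule.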

Special instructions

The instruction set supports operations on Booleans, bit strings, characters, signed fractions, addresses, signed and unsigned integers, single-instruction multiple-data, and single-precision floating-point numbers. In addition to a plethora of microcontroller-oriented instructions, such as bit manipulation, Tricore supports the traditional DSP instructions, including multiply and MAC, saturate, scaling, and rounding. Tricore also supports packed arithmetic. Conditional add, subtract, and select instructions let the device avoid using conditional jumps.
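
Saturation and packed arithmetic are easy to picture in software. The sketch below clamps sums to the signed 16-bit range and applies that operation to two lanes at once; the lane width and the function names are assumptions chosen for illustration, not Tricore mnemonics.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(value):
    """Clamp an integer to the signed 16-bit range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, value))


def packed_add_sat16(a, b):
    """Add two tuples of 16-bit lanes element by element, with saturation."""
    return tuple(saturate16(x + y) for x, y in zip(a, b))
```

Here packed_add_sat16((30000, -30000), (10000, -10000)) clips both lanes to (32767, -32768) rather than wrapping around, which is the behavior audio and video code usually wants.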

Development tools

Tasking (www.tasking.com) and Green Hills (www.ghs.com) offer C- and C++-compiler, debugger, simulator, and RTOS support for Tricore. Accelerated Technology (www.atinucleus.com) also supplies a Tricore RTOS. Nohau (www.nohau.com), Ashling (www.ashling.com), Hitex (www.hitex.de), and Lauterbach (www.lauterbach.com) supply Tricore in-circuit emulators. Tasking's instruction-accurate simulator allows you to analyze the basic functionality of your program. Siemens is also working on a cycle-accurate simulator, which the company expects to be available this year. The new simulator implements a flexible cache model that provides options, such as defining start and end addresses, the number of ways and cache lines, and the line size and banks. This simulator also includes branch-prediction logic and determines interrupt latency.


Texas Instruments TMS320C3x

TI's TMS320C3x integrates a Von Neumann µP architecture with a high-performance, 32-bit, floating-point DSP multiply-accumulate (MAC) core. The C3x also performs fixed-point math based on a 24-bit-wide mantissa on the inputs. Although most designers use the C3x for its floating-point capability, fixed-point math is occasionally useful for functions such as clipping of image data. On the µP side, the C3x supports a unified, flexible, 24-bit address space (16M×32-bit words). On the DSP side, the C3x processor performs single-cycle MAC processing. The processor receives the next instruction while accessing two data values for the current instruction's MAC cycle.

The C3x family does not support IEEE floating-point formats. The C3x format uses an implied sign bit to increase precision. In most applications, the difference in data format is relevant only if you are passing the data to another processor.

The TMS320C3x DSP comprises memory/access, central-core, and I/O subsystems. The memory/access subsystem comprises separate program, data, and DMA buses, which allow parallel program fetches, data reads and writes, and DMA operations. This internal busing scheme enables programs to access the next instruction and two data values simultaneously and to transfer data to or from the I/O subsystem in one cycle. The data-address buses share a data bus that can make two sequential RAM accesses in one cycle because the buses run at twice the speed of the processor core. Two 32-word, lockable, on-chip caches automatically load as the DSP accesses instructions from external memory. The two 4-kbyte RAM blocks hold parameters and constants for sum-of-products MAC processing, and a 32-kbyte ROM can hold code or coefficients for MAC processing (C30 only).

The central core has its own set of buses to move data and results. These buses move data among internal registers; an integer/floating-point multiplier; a parallel, 32-bit barrel shifter/ALU; and the memory subsystem. The core stores results in extended-precision or auxiliary registers that hold the values. Two address generators in the subsystem generate the addresses to access the data memories. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file. The C3x uses a software stack to support context switching.

The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals. On the C30, the peripheral bus links to an external expansion bus with a 13-bit address and 32-bit data bus.

Addressing modes

The C3x supports register-direct, paged-memory-direct, register-indirect, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.

Special instructions

The C3x performs single- or block-instruction hardware looping (supports nestable block repeats but lacks automatic save and restore of status); standard branches, which empty the pipe; delayed branches, which wait three cycles before changing the program counter; interlocked access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked); computed gotos (dynamic subroutine calls); and conversion of floating-point to integer and vice versa. The C3x can perform bit test. You can specify instructions to execute in parallel.

Support

TI supplies a full-speed in-circuit emulator and an evaluation module. The C3x lacks JTAG support but has a proprietary five-pin emulation interface. TI sells a tool set that includes a C compiler, an assembler/linker, a source-level debugger, a code profiler, a simulator, and an application library. Third-party tools include C and Ada compilers, multiple OS products, filter-design packages, advanced graphical-design tools, and hardware tools.


Texas Instruments TMS320C4x

The C4x has seven internal buses and on-chip memories that help deliver single-cycle execution when walking through X and Y memories for a series of multiply-accumulate (MAC) operations. TI built the C4x around a five-port register file, and, rather than time-sharing a single bus system, the C4x features separate buses for program and two data fetches. Additionally, the C4x has a floating-point-unit multiplier, an ALU, and a barrel shifter for parallel operations. The C4x also performs fixed-point math based on a 24-bit-wide mantissa on the inputs.

A 128-word cache enables the processor to deliver single-cycle pipelined execution and still use slower external memory. (It does not use the cache with internal memory.) Key inner routines fill the cache as they run. The CPU accesses an instruction from external memory and automatically loads the instruction into cache, which is divided into four 32-word segments or lines. The CPU uses a least recently used algorithm to select the cache segment for the new instructions. You can freeze a segment in the cache by setting cache-freeze bits in the CPU-status register.
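
The segment-replacement policy can be illustrated with a small software model. This sketch assumes that segments are identified by address divided by 32 and that each segment tracks which of its words are already loaded; it omits the freeze bits, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class SegmentCache:
    """Toy model: four 32-word segments, least recently used replacement."""

    def __init__(self, num_segments=4, segment_words=32):
        self.num_segments = num_segments
        self.segment_words = segment_words
        # Maps segment tag -> set of word offsets present; oldest entry first.
        self.segments = OrderedDict()

    def fetch(self, address):
        """Fetch one instruction word; return True on a cache hit."""
        tag, offset = divmod(address, self.segment_words)
        if tag in self.segments:
            self.segments.move_to_end(tag)      # mark as most recently used
            present = self.segments[tag]
            hit = offset in present
            present.add(offset)                 # a miss loads the word
            return hit
        if len(self.segments) == self.num_segments:
            self.segments.popitem(last=False)   # evict least recently used
        self.segments[tag] = {offset}
        return False
```

Running a tight loop through this model shows misses on the first pass and hits on every later pass, which is how key inner routines end up resident in the cache.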

Six 8-bit independent communications ports support point-to-point communications with networks of C40s and peripherals. (The C44 has only four ports.) Each port comprises eight data pins and four handshake signals. These ports free the 31-bit local and global external-memory buses for program or data accesses to the processor's 4G-word address space (C40 only). Program and data occupy a unified address space that you can configure according to your memory requirements. The local and global buses have different memory block assignments within each memory space. I/O can also use the external buses.

A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU's sequential threads. Such data movements do not overload the DSP with servicing overhead, although some data contention for memory may slow CPU execution.

Addressing modes

The C4x supports register-direct, paged-memory-direct, register-indirect, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.

Special instructions

The C4x performs single or block instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard/delayed branches, interlocked access for multiprocessing (load/store integer or floating-point value and signal interlocked), conversion of floating point to integer and vice versa, reciprocal and reciprocal square-root seed, and conversion to and from IEEE floating-point formats. It performs bit test. You can specify certain instructions to execute in parallel.
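
A seed instruction returns only a low-precision estimate, which software then refines. A common refinement is Newton-Raphson iteration, sketched here for the reciprocal square root; the seed value is supplied by the caller as a stand-in for the hardware result, and the function name is illustrative.

```python
def refine_rsqrt(a, seed, iterations=2):
    """Refine an estimate of 1/sqrt(a) by Newton-Raphson iteration.

    `seed` stands in for the value a hardware seed instruction would supply.
    Each pass computes x = x * (1.5 - 0.5 * a * x * x).
    """
    x = seed
    for _ in range(iterations):
        x = x * (1.5 - 0.5 * a * x * x)
    return x
```

Starting from a rough seed of 0.4 for a = 4.0, three iterations bring the estimate to within 0.001 of the exact value 0.5; each iteration roughly doubles the number of correct bits.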

Support

The development system includes scan-based emulation via the C4x's JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can string multiple C4x chips on a JTAG circuit for parallel debugging. One processor breakpoint can halt execution in an array of C4x chips, and you can single-step them all in lock step. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Software tools include a C compiler, a source-level debugger for parallel debugging, an assembler/linker, and a simulator. TI has an application library. Third-party support includes the Spox, Parallel C, Virtuoso, and Helios OSs, as well as a variety of hardware tools.




Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.