EDN Access

 

May 8, 1997



Advanced RISC Machines 16-bit, fixed-point Piccolo DSP

A DSP coprocessor module for ARM7 µP cores, Piccolo adds a 32-bit DSP instruction set and uses the ARM coprocessor interface for communicating with the ARM processor core. The µP and the Piccolo DSP share a memory bus.

  Piccolo’s interface includes a tagged input-queue structure and an output FIFO buffer. The input queue, or reorder buffer, enables the ARM µP to preload Piccolo with data before Piccolo requires the data, essentially demultiplexing multiple input-data streams for DSP algorithms. The reorder buffer allows ARM code to fetch DSP data or coefficients from memory in the most convenient order and allows the DSP code to consume the items in the required order. Piccolo automatically and transparently refills its registers from the reorder buffer as Piccolo uses and replaces old data. Piccolo sequentially returns results to the µP through the output FIFO buffer.
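
  The reorder-buffer idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name, the tag format, the capacity, and the stall behavior are all hypothetical and are not taken from ARM's design. It shows the host writing tagged items in its own convenient order while the DSP side pulls them by tag in the order the algorithm needs.

```python
from collections import deque


class ReorderBuffer:
    """Toy model of a tagged input queue (hypothetical API, not ARM's)."""

    def __init__(self, capacity=8):
        self.capacity = capacity  # assumed size, for illustration only
        self.slots = {}           # tag -> value, filled by the host in any order

    def host_write(self, tag, value):
        """Host side deposits a value under a tag; returns False if full."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots[tag] = value
        return True

    def dsp_read(self, tag):
        """DSP side consumes the value for a tag; None models a stall."""
        return self.slots.pop(tag, None)


def run_demo():
    rob = ReorderBuffer()
    # The host fetches all coefficients, then all samples (its convenient order).
    for i, c in enumerate([1, 2, 3]):
        rob.host_write(("coef", i), c)
    for i, x in enumerate([10, 20, 30]):
        rob.host_write(("data", i), x)

    out_fifo = deque()  # results go back to the host strictly in sequence
    acc = 0
    for i in range(3):
        # The DSP consumes interleaved sample/coefficient pairs (its required order).
        acc += rob.dsp_read(("data", i)) * rob.dsp_read(("coef", i))
        out_fifo.append(acc)
    return list(out_fifo)
```

  Calling run_demo() returns [10, 50, 140], the running multiply-accumulate results in the order the output FIFO would deliver them.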

  The ARM µP handles all interrupts and data-address generation. Although the µP operates in parallel with Piccolo, the µP’s performance degrades when the DSP is active, because Piccolo consumes the µP’s bandwidth. In addition, because Piccolo reloads only from the reorder buffer, a data-intensive algorithm may starve the register file. Furthermore, the programmer must ensure that the ARM core responds to the needs of the DSP. In other words, Piccolo cannot interrupt or notify the ARM core when Piccolo needs data or has full output buffers. However, ARM made this trade-off to achieve a smaller DSP core with minimal complexity.

  Piccolo’s other features include a private instruction cache, a 16×16-bit single-cycle multiplier, a 32-bit barrel shifter, four 48-bit extended-precision accumulators, a saturation unit, and register-based storage for 32 16-bit or 16 32-bit data items. Piccolo also has a split ALU that provides single-cycle, dual 16-bit arithmetic and logical operations in one instruction word.
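
  As a rough illustration of what a split ALU does, the Python sketch below adds two pairs of 16-bit values packed into 32-bit words, keeping the halves independent. The function name and the wraparound (nonsaturating) behavior are assumptions chosen for illustration; the description above does not give Piccolo's exact semantics.

```python
def dual_add16(a, b):
    """Add the high and low 16-bit halves of two 32-bit words separately.

    Each half wraps modulo 2**16 and no carry crosses from the low
    half into the high half (assumed behavior, for illustration).
    """
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo
```

  For example, dual_add16(0x0001FFFF, 0x00010001) gives 0x00020000: the low halves wrap to zero without disturbing the high halves.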

  Addressing modes—Piccolo has hardware support for four nestable zero-overhead loop constructs. Leveraging the resources of the ARM µP, Piccolo addressing is limited to accessing data from the input and output buffers.

  Special instructions—Unlike the ARM µP cores, Piccolo instructions are not conditionally executable. Piccolo offers intrinsic support for tasks such as Viterbi and bit manipulation. The repeat instruction is uninterruptible; however, the ARM core handles all interrupts.

  Support—ARM is developing high-level-language support for Piccolo, although the company believes that most programmers will use assembly language. The unified µP and DSP architecture allows one tool chain for development. ARM estimates that 50% of the DSP code in a system will be running on an ARM core.

Analog Devices 16-bit, fixed-point ADSP-2100 DSP family

The ADSP-2100 family’s CPU handles general processing needs and delivers single-cycle instruction execution when executing tight DSP algorithms. The processor can execute multiple operations per cycle. The multiply-accumulate (MAC) unit, ALU, and barrel shifter are separate but cannot execute in parallel. Secondary registers shadow each execution unit’s registers, allowing fast context switching for interrupt processing.

  If you need extended precision, you can address the MAC unit’s 40-bit accumulator (includes 8 guard bits) as two 16-bit registers and one 8-bit register and individually copy the contents to another register. The barrel shifter moves 16-bit inputs left or right into a 32-bit register. The shifter also includes hardware support for exponent detection and normalization, which enables block floating point and increases the effective precision of a 16-bit DSP. Algorithms such as FFTs, in which bits grow from stage to stage, use block floating point. An application can also use the shifter to convert between fixed- and floating-point numbers.
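
  Block floating point can be sketched in Python as follows. This is a software model of the general technique, not the ADSP-2100 instruction sequence; the function names and the 16-bit default width are assumptions. Exponent detection counts how many redundant sign bits a value carries, and normalization shifts the whole block left by the smallest such count so that the largest element uses the full word.

```python
def redundant_sign_bits(value, width=16):
    """Count how far a two's-complement value can shift left without overflow."""
    assert -(1 << (width - 1)) <= value < (1 << (width - 1))
    magnitude = value if value >= 0 else ~value
    return (width - 1) - magnitude.bit_length()


def block_normalize(block, width=16):
    """Shift every element by the headroom of the largest element.

    Returns the scaled block and one shared exponent, so that
    original = scaled * 2**exponent for every element.
    """
    shift = min(redundant_sign_bits(v, width) for v in block)
    return [v << shift for v in block], -shift
```

  For the block [100, -300, 25], the element -300 has the least headroom (6 bits), so the block becomes [6400, -19200, 1600] with a shared exponent of -6.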

  ADSP-2100 family members have X and Y address generators and program and data buses. While executing from the on-chip memory, the buses feed the X- and Y-data values for each MAC cycle. Thus, you can use program memory as data memory to hold constants for single-cycle MAC processing. The program bus is free for MAC use when the CPU executes from on-chip program memory. The dual-ported program memory allows two memory accesses in one cycle. For access to external memory, the ADSP-2100 has a programmable wait-state generator for zero to seven wait states.

  Analog Devices’ designers opted for a 16-bit-wide data word and a 24-bit-wide instruction word. The wider instruction word lets the device use more complex instructions and offers more flexibility than does a 16-bit operation code. The difference in code and data-word sizes requires a Harvard architecture with two memory spaces. But Harvard architectures with separate memory spaces are typical for most DSPs; they enable instruction fetches to occur in parallel with single-cycle MAC operations. For external-memory design, the different memory widths mean that if you share three 8-bit-wide memory chips with program and data, you sacrifice every third byte of the data-memory area.
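
  The cost of sharing a 24-bit-wide external memory between program and data is simple arithmetic, worked through in the Python sketch below. The function name and its defaults are illustrative assumptions; the widths come from the paragraph above.

```python
def shared_memory_bytes(data_words, instr_bits=24, data_bits=16, chip_bits=8):
    """Return (total, used, wasted) bytes when data lives in instruction-width memory."""
    chips = instr_bits // chip_bits        # three byte-wide chips form one 24-bit word
    used_per_word = data_bits // chip_bits # a 16-bit data word fills two of them
    total = data_words * chips
    used = data_words * used_per_word
    return total, used, total - used
```

  For 1024 data words, the three chips supply 3072 bytes, of which 2048 hold data and 1024 (every third byte) go unused.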

  Addressing modes—The ADSP-2100 includes immediate, register-direct, memory-direct, and register-indirect addressing modes. The program sequencer handles code execution during zero-overhead looping. Each address generator supports as many as four circular buffers, each with three registers. The registers define the end, length, and access addresses. One address generator provides bit-reversed addressing for data only.
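
  The two address-generator features named above, circular buffers and bit-reversed addressing, can be modeled in Python as shown below. This is a sketch of the general techniques only: the function names and the base-plus-length parameterization are assumptions and do not mirror the ADSP-2100 register set.

```python
def circular_next(addr, modify, base, length):
    """Post-modify an address and wrap it inside [base, base + length)."""
    return base + (addr - base + modify) % length


def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, as used to reorder FFT data."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result
```

  Stepping by 1 through a four-word buffer at 0x100 visits 0x101, 0x102, 0x103, and then wraps to 0x100. For an eight-point FFT, bit_reverse(i, 3) maps indices 0 through 7 to 0, 4, 2, 6, 1, 5, 3, 7.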

  Special instructions—The ADSP-2100 can conditionally execute most instructions. A do-until command establishes a sequence of instructions that can be arbitrary in length and nested four deep for repeat operations. Because the ADSP-2100 is a nonpipelined machine, it incurs no penalties for jumps and calls.

  Support—Analog Devices supplies an ANSI C compiler and an assembler, a linker, and an interactive simulator. Evaluation boards are available for most ADSP processors. In-circuit emulators are available for hardware-target debugging. Analog Devices has also licensed ADSP-217X and ADSP-218X cores to AMD (Austin, TX), Acer Labs (Santa Clara, CA), and Mentor Graphics (Wilsonville, OR). All cores have built-in test circuitry that requires one dedicated pin.

Variations

  ADSP-2101/3—25 MHz, 38 mA max at 5V (10.24 MHz, 14 mA max at 3.3V for 2103), 2k×24-bit program RAM, 1k×16-bit data RAM, timer, two serial I/O ports, 68-pin PLCC, $19.90/$17.03 (1000).

  ADSP-2104/2104L—20 MHz, 38 mA max at 5V (13.824 MHz, 14 mA max at 3.3V for 2104L), 512×24-bit program RAM, 256×16-bit data RAM, timer, two serial I/O ports, 68-pin PLCC, $5.25 (10,000).

  ADSP-2105—20 MHz, 31 mA max at 5V, 1k×24-bit program RAM, 512×16-bit data RAM, timer, one serial I/O port, 68-pin PLCC, $9.90 (10,000).

  ADSP-2109/2109L—20 MHz, 38 mA max at 5V (13.824 MHz, 14 mA at 3.3V for 2109L), 4k×24-bit program ROM, 512×16-bit program RAM, 256×16-bit data RAM, timer, two serial I/O ports, 68-pin PLCC, $7.50 (10,000).

  ADSP-2111—25 MHz, 38 mA max at 5V, 2k×24-bit program RAM, 1k×16-bit data RAM, timer, two serial I/O ports, host-interface port, 100-pin PQFP, $38 (1000).

  ADSP-2115—25 MHz, 38 mA max at 5V, 1k×24-bit program RAM, 512×16-bit data RAM, timer, two serial I/O ports, 68-pin PLCC, $13.11 (10,000).

  ADSP-2161/2/3/4—16 MHz, 42 mA max at 5V (10 MHz, 20 mA max at 3.3V for 2162 and 2164), 8k×24-bit program ROM (4k×24-bit for 2163 and 2164), 0.5k×16-bit RAM, timer, two serial I/O ports, 68-pin PLCC. The 2161 or 2162, $15.81 (10,000); 2163 or 2164, $9 (10,000).

  ADSP-2165/6—25 MHz, 60 mA max at 5V (16 MHz, 24 mA max at 3.3V for 2166), 12k×24-bit program ROM, 1k×24-bit program RAM, 4k×16-bit data RAM, timer, two serial I/O ports, 80-pin PQFP, $25 (10,000).

  ADSP-2171/2/3—33 MHz, 76 mA max at 5V (20 MHz, 26 mA max at 3.3V for 2173), 8k×24-bit program ROM (2172), 2k×24-bit program RAM, 2k×16-bit data RAM, 8- or 16-bit host interface, two serial I/O ports, 128-pin PQFP and TQFP, $23/$21 (10,000).

  ADSP-2181/3—40 MHz, 70 mA max at 5V (52 MHz, 50 mA max at 3.3V for 2183), 16k×24-bit program RAM, 16k×16-bit data RAM, timer, two serial I/O ports, 16-bit internal DMA port to access on-chip memory, 8-bit DMA port for program and data-memory transfers, 128-pin PQFP and TQFP, $34.50/$47.61 (10,000).

  ADSP-2185/2185L—33 MHz, 63 mA max at 5V (33 MHz, 36 mA at 3.3V for 2185L), 16k×24-bit program RAM, 16k×16-bit data RAM, timer, two serial I/O ports, 16-bit internal DMA, 8-bit DMA for program and memory transfers, 100-pin TQFP, $31 (10,000).

  ADSP-2186/2186L—40 MHz, 65 mA max at 5V (33 MHz, 36 mA at 3.3V for 2186L), 8k×24-bit program RAM, 8k×16-bit data RAM, timer, two serial I/O ports, 16-bit internal DMA, 8-bit DMA for program and memory transfers, 100-pin TQFP, $20 (10,000).

  ADSP-2187L—52 MHz, 50 mA max at 3.3V, 32k×24-bit program RAM, 32k×16-bit data RAM, timer, two serial I/O ports, 16-bit internal DMA, 8-bit DMA for program and memory transfers, 100-pin TQFP, $54.50 (10,000).

  ADSP-21msp58/9—26 MHz, 95 mA max at 5V, 2k×24-bit program RAM (additional 4k×24-bit program ROM for 21msp59), 2k×16-bit data RAM, timer, two serial I/O ports, host-interface port, 16-bit ADC and DAC, 100-pin TQFP, $25 (10,000).

Analog Devices 16-bit, fixed-point ADSP-21cspxx DSP family

Analog Devices based the 16-bit, fixed-point ADSP-21cspxx DSP core on the ADSP-21xx family. The 21cspxx performs concurrent signal processing. To facilitate programming in C, the ADSP-21cspxx has a 16M-word address range and 48 on-chip data registers for local variable storage, computation, and data-address generation. To access multiple signals in real time, 48 additional background registers allow the DSP to perform task switches in one clock. These shadow registers benefit applications that perform algorithms on two unrelated data streams.

  The heart of the ADSP-21cspxx is a fetch-decode-execute pipeline, which performs all computations in one cycle, after the instruction pipe is loaded. Your software can use delayed branches to avoid the two-cycle penalty the pipeline requires to recover from branches. A 16×16-bit multiply-accumulate (MAC) unit uses dual 40-bit accumulators that reduce the register bottleneck normally associated with a single-accumulator MAC. You can interchange the use of accumulators on an instruction-by-instruction basis. The second accumulator has a shared output register with the DSP’s 40-bit barrel shifter.

  Although the ADSP-21cspxx has 16-bit data buses, the device uses a 24-bit instruction word. The wider word can support more operations per instruction and provides more flexible addressing mechanisms, such as preaddress and postaddress modification. A two-bus von Neumann architecture feeds the pipeline and computational units. The two buses allow simultaneous data fetches from the ADSP-21csp01’s unified memory space. Similar to Analog Devices’ Super Harvard architecture (SHARC), a 64-word, two-way set-associative instruction cache automatically caches only those instructions that conflict with memory accesses.

  Two data-address generators (DAGs) each support four simultaneous circular buffers. The DAGs have base registers that allow a programmer to place the circular buffers anywhere in memory. The DAGs can access as many as 16M words of memory. One address generator provides bit-reversed addressing for data only.

  The ADSP-21csp01 contains two bidirectional serial ports that you can program to be multichannel and to transfer data at 25 Mbps. A 16-bit DMA port interfaces the device to other processors and system buses. A DMA controller allows the device to transfer data to and from each serial port and to and from the DMA port without interrupting the processor.

  Addressing modes—The ADSP-21csp01’s addressing modes include immediate, register direct, memory direct, and register indirect. The ADSP-21csp01 also supports zero-overhead address looping in which each address generator supports as many as four execution loops, each with four registers that define the start, end, length, and access addresses.

  Special instructions—Instructions are more orthogonal than those of the ADSP-21xx, and the ADSP-21csp01 can conditionally execute most instructions. A do-until command establishes a sequence of instructions for operations.

  Support—Development tools include an assembler, a simulator, a linker, and a C compiler integrated into a Windows-based design environment from Analog Devices. The company also offers the EZ-ICE in-circuit emulator, which uses the chip’s JTAG interface to monitor and control the target-board processor. An EZ-LAB development system provides a stand-alone configuration or plugs into the PC’s ISA bus for program development. Analog Devices also supplies a DSP runtime library.

Variations

  ADSP-21csp01KS-200/21csp01BS-200—50 MHz, 110 mA max at 5V, 4k×24-bit RAM, 4k×16-bit RAM, 160-pin PQFP, the KS and BS operate from 0 to 70°C and from –40 to +85°C, respectively, $49 (1000).

Array Microsystems 16-bit, complex, fixed-point a66xxx DSP chip set

Array Microsystems’ a66xxx chip set offers chip-, board-, and application-level approaches to performing FFTs, FIR filters, and correlation in the frequency domain. The Digital array Signal Processor (DaSP) and the Programmable array Controller (PaC) are the heart of the a66xxx. The DaSP/PaC chip set can handle a variety of DSP algorithms and system configurations. You can use different memory configurations to enable a variety of system-level architectures to achieve the desired performance. You can also place multiple processors in parallel to increase system performance. These architectures include recursive single memory, cascaded, parallel multichannel, and recursive dual memory. You can program the DaSP to rearrange its on-chip resources to provide the functional units it needs to get a task done.

  The DaSP is a block-floating-point array processor that uses a pipelined architecture. The DaSP’s internal structure comprises several adders; data complementors; rounders; an array of multipliers that allow you to perform single-cycle functions, such as an entire Butterfly filter; and a 20-bit ALU. The function decoder controls the operation of these elements by determining the appropriate internal path. Additionally, the DaSP incorporates a scale-factor generator that automatically implements conditional scaling of input data. Essentially, this generator is a barrel shifter with logic to detect whether the data needs scaling.

  The data-input, auxiliary-data-input, and data-output memory buses on the DaSP handle all data transfers. You can use each of the memory buses to transfer real or complex data. The DaSP accepts and performs a function on two sets of operands every machine cycle. One set of operands is at the data-input port and the other, at the auxiliary (coefficient) input port. Each operand set comprises four 32-bit complex values or eight 16-bit real values. A single DaSP function on these operands may comprise multiple arithmetic operations.

  The PaC internal structure contains processing elements, including address generators/bus multiplexers, a 32-instruction RAM (expandable with external ROM) for holding the user’s DSP program, control registers, a host interface, and a clock- and process-control section that synchronizes the chip with the rest of the system. You can initialize the PaC from a host processor, or you can program the PaC to fetch its own configuration and instruction memory and use external instructions.

  PaC address generators provide the addresses for any DaSP/PaC system. These address generators operate in parallel, producing address sequences for the input and output memories and the three DaSP data memories. Two of the address generators produce a 16-bit address every input- and output-clock cycle. Three address generators produce a 16-bit address every processing-clock cycle. You can access data arrays as large as 64 kbytes using the 16-bit addressing scheme. The generated address goes to the bus multiplexer, which routes the address to the appropriate buses. The PaC supports circular addressing by alternating between data frames; the processor reads from one frame while the system writes to the other. This function is transparent to users, and the PaC automatically switches frames.
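
  The alternating-frame scheme described above is commonly called double (or ping-pong) buffering. The Python sketch below models it under assumptions: the class and method names are hypothetical, and the explicit swap() call stands in for the automatic frame switch that the PaC performs in hardware.

```python
class PingPongFrames:
    """Two frames: the system writes one while the processor reads the other."""

    def __init__(self, frame_len):
        self.frames = [[0] * frame_len, [0] * frame_len]
        self.write_sel = 0  # index of the frame currently being written

    def write_frame(self, samples):
        """System side fills the frame that is open for writing."""
        self.frames[self.write_sel][:] = samples

    def read_frame(self):
        """Processor side sees only the frame that is not being written."""
        return self.frames[1 - self.write_sel]

    def swap(self):
        """Exchange roles; in the chip set this switch is automatic."""
        self.write_sel ^= 1
```

  Because reads and writes never touch the same frame between swaps, new input can arrive while the previous frame is still being processed.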

  The address generators produce address sequences ranging from sequential to FFT-specific. You can use sequential addresses for general array-processing operations and for windowing and magnitude-squared operations. The DSP chip set generates the FFT-specific address sequences for the data and coefficient memories for radix-4 and -2 and for mixed radix-4/2, in-place, decimation-in-frequency FFT algorithms. The chip set generates other FFT-specific sequences for trigonometric recombinations to simultaneously implement two real-data FFTs. These sequences allow the implementation of a 2N, real-point FFT or two parallel-N, real-point FFTs using an N complex-point algorithm.
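
  The recombination that yields two real-data transforms from one complex transform can be sketched in Python. This is the standard textbook identity, not Array Microsystems' implementation; the function names are assumptions, and a direct O(N²) DFT stands in for the chip set's radix-4/2 FFT to keep the example short.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform (stand-in for a fast FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def two_real_dfts(a, b):
    """Transform two real sequences with one complex N-point transform.

    Pack z = a + j*b, transform once, then separate the spectra using
    A[k] = (Z[k] + conj(Z[N-k])) / 2 and B[k] = (Z[k] - conj(Z[N-k])) / (2j).
    """
    n = len(a)
    z = dft([complex(ar, br) for ar, br in zip(a, b)])
    fa, fb = [], []
    for k in range(n):
        zc = z[-k % n].conjugate()  # conj(Z[N-k]), with index N wrapping to 0
        fa.append((z[k] + zc) / 2)
        fb.append((z[k] - zc) / 2j)
    return fa, fb
```

  The separation works because the spectrum of a real sequence is conjugate-symmetric, so the two spectra can be pulled apart from the sum Z[k] = A[k] + jB[k].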

  The recursive, I/O-buffered dual-memory system is the most common architectural arrangement of the chip set. This arrangement has some advantages over the recursive dual-memory core stage. In the core stage, processing resources are idle when the DSP performs I/O, pointing out the need for circular buffers. You can configure a real-time DSP system by using all five PaC address buses in a double-buffered configuration.

  Addressing modes—The DaSP and PaC generate address sequences that you would use to access data memory only for DSP-type algorithms. The PaC provides addressing sequences to allow access of data in sequential order for block-data operations.

  Special instructions—The DaSP instruction set includes many FFT-specific functions as well as several general arithmetic and logical functions. The FFT-specific functions include radix-2/radix-4 Butterfly and others. Unlike traditional DSP µPs, which process data values and produce one data value per instruction, the DaSP uses one instruction to process two arrays of input and auxiliary data to produce one resultant array.

  Support—Array Microsystems provides reference designs for applications performing frequency-domain operations. The company also provides a menu-driven DSP-software-development system comprising program generators and simulators supporting the function- and application-specific DSP products.

Variations—Not available at press time.

  Second source—Atmel

Atmel 16-bit, fixed-point Lode DSP

The Lode DSP is a static core that can form the basis of a communications system by adding on-chip peripherals or user-defined logic. The Lode core has a five-stage pipeline: fetch, decode, read, execute, and write. The deep pipeline could complicate programming and result in several cycles of delay during program branches. Lode uses a Harvard architecture, with two data buses leading to one dual-port RAM address space. It has two linked, 16×16- to 32-bit multiply-accumulate (MAC) units, MAC0 and MAC1. It also has a 40-bit ALU with a barrel shifter. You can use the ALU in parallel with the MACs. The processing units use a set of four 40-bit accumulators that include 8 guard bits. The accumulators can serve as inputs and outputs for the MAC units as well as the ALU.

  The dual MAC units benefit operations such as filtering and autocorrelation, but the MAC units’ inputs limit their flexibility. MAC0’s inputs are flexible and can come from registers, memory, or one of the accumulators. MAC0 shares one of its inputs with MAC1. However, the second input to MAC1 comes from a special register that contains the nonshared input of MAC0 delayed by one instruction. Regardless, the Lode can generate two MAC results each cycle. You can also use the MACs independently as simple adders.
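
  One way to read the delayed-input arrangement is that it suits filters whose adjacent outputs reuse the same samples one step apart. The Python sketch below models that reading for an FIR filter. It is an interpretation, not Atmel's documented data flow: the function name, the zero padding outside the input, and the preloading of the delay register with x[n+1] are all assumptions.

```python
def dual_mac_fir(h, x, n):
    """Compute FIR outputs y[n] and y[n+1] in one pass over the taps.

    MAC0 and MAC1 share the coefficient input. MAC1's other input is
    MAC0's sample input delayed by one step (assumed priming: x[n+1]).
    """
    def sample(i):
        return x[i] if 0 <= i < len(x) else 0  # assume zeros outside the data

    acc0 = acc1 = 0
    delayed = sample(n + 1)        # delay register, primed before the loop
    for k, coef in enumerate(h):
        fresh = sample(n - k)      # MAC0's nonshared input for this step
        acc0 += coef * fresh       # MAC0 accumulates y[n]
        acc1 += coef * delayed     # MAC1 accumulates y[n+1]
        delayed = fresh            # becomes MAC1's input on the next step
    return acc0, acc1
```

  With h = [1, 2, 3] and x = [1, 2, 3, 4, 5], dual_mac_fir(h, x, 2) returns (10, 16), two filter outputs for one sample fetch per tap.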

  You can use the barrel shifter as a preshifter for MAC data and as a normalizer for MAC outputs. In addition, a postshifter lets you extract data from the accumulators into memory.

  In any instruction cycle, the DSP can make two memory accesses: either two reads or a write and a read. The device includes eight 16-bit memory pointers and eight 8-bit memory-pointer modifiers. You can also use the memory-pointer registers as general-purpose, 16-bit registers. Lode uses automatic context switching on some registers during interrupts.

  Lode has many power-saving features. For example, logic circuitry supplies the clock only to those parts of the design that operate in a given cycle, eliminating wasted power.

  Addressing modes—Lode supports direct, indirect with optional postmodification, immediate, and circular memory addressing.

  Special instructions—Lode performs specialized instructions for communications applications, such as Galois arithmetic for CRC calculations, a divide-step instruction, and instructions for Viterbi acceleration and trace-back operations. It provides bit-manipulation instructions, delayed branches, and computed calls from the top of the stack. Two nestable, zero-overhead levels of looping are available: Loops are automatically reinitialized after completion; hence, Lode incurs no overhead within an outer loop when the software invokes the inner loop. A wait instruction disconnects the clock to all but a minimal set of circuitry.
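
  To show what Galois arithmetic means for CRC calculation, the Python sketch below computes a CRC bit by bit as polynomial division over GF(2), where addition is XOR. The polynomial 0x1021 and initial value 0xFFFF (the common CRC-16/CCITT-FALSE parameters) are assumptions chosen for illustration; the text above does not say which polynomials Lode's instructions target or how they operate.

```python
def crc16(data, poly=0x1021, crc=0xFFFF):
    """Bitwise CRC-16, most significant bit first (parameters are assumed)."""
    for byte in data:
        crc ^= byte << 8              # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = (crc << 1) ^ poly   # subtract (XOR) the polynomial
            else:
                crc <<= 1
            crc &= 0xFFFF             # keep the register at 16 bits
    return crc
```

  With these parameters, crc16(b"123456789") returns 0x29B1, the usual check value for this CRC variant. A dedicated instruction would replace the inner shift-and-XOR loop.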

  Support—Hardware support includes an in-circuit-emulator board and on-chip emulation capabilities, as well as HDL testbenches. Software support includes an assembler/linker, a loader, and a simulator. All tools run under SunOS.

DSP Group 16-bit, fixed-point Pine/Oak DSP cores

DSP Group developed the Pine and Oak DSP cores and licenses them to developers. These DSP processors are standard cells in many licensees’ ASIC libraries. Engineers can use the minimal core or, by adding more memory, peripherals, and interrupt- and custom-logic sections, can expand the architecture for higher processing efficiencies. At the system level, the chips comprise a 16-bit DSP core with links to on-chip program ROM, data RAM, and a bus-interface unit. The cores use double-metal CMOS technology. Pine is available in 1-, 0.8-, and 0.6-µm processes; Oak comes in a 0.6-µm process. Both DSPs have built-in power-management features to cut power dissipation, including 2.7V parts and static operation to further reduce power consumption. Internal control can automatically shut off unused functional units’ memory.

  Pine has two data buses and one program bus, two RAM data blocks for X and Y memory, a data-arithmetic-address-generator unit (DAAU), and a multiply-accumulate (MAC) unit. Oak expands on Pine by adding a bit-manipulation unit (BMU), a program-control unit, an expanded instruction set, and a special on-chip-emulation module that provides trace and breakpoint capabilities for real-time debugging. Oak also has an indexed addressing mode and a software stack to improve its usefulness with C programming. (Pine has a limited-level hardware stack.)

  Oak’s MAC unit has two 16-bit input registers. The MAC unit takes two 16-bit, signed or unsigned numbers and delivers a 32-bit 2’s complement product in one cycle. The MAC unit then sign-extends the product to 36 bits through 4 guard bits. The ALU performs arithmetic/logical operations on the data operands. It also performs functions such as normalization, step division, and rounding. The BMU has a 36-bit barrel shifter, a bit-field-operation unit, and two additional 36-bit accumulators. The bit-field-operation unit reads from memory, modifies, and writes back to memory, bypassing the accumulator. Bypassing the accumulator not only frees a critical hardware resource, but also avoids the use of the accumulator’s high-power-consumption circuitry. Cellular phones use bit-field operations to pack the bits before putting the data onto the channel. The two accumulators and a set of shadow registers enable rapid context switches; the accumulators can also evaluate 36-bit exponents. The accumulator optionally saturates out-of-range values as they transfer to 16-bit registers or memory.
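
  The role of guard bits and saturation can be illustrated with a small Python model. It is a sketch under assumptions: the helper names are hypothetical, and because the text does not say which accumulator bits the hardware extracts, the saturate() helper takes the target width as a parameter. A 36-bit accumulator absorbs several full-scale 32-bit products without wrapping, and saturation clamps the result only when it leaves for a narrower destination.

```python
ACC_BITS = 36  # 32-bit product plus 4 guard bits


def wrap(value, bits=ACC_BITS):
    """Keep value in the two's-complement range of a `bits`-wide register."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """Multiply two 16-bit signed values and accumulate in 36 bits."""
    return wrap(acc + a * b)


def saturate(acc, bits=16):
    """Clamp to the signed range of a narrower destination."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, acc))
```

  Accumulating eight products of 32767 × 32767 overflows a 32-bit register but still fits in 36 bits; saturate(acc, 32) then yields 0x7FFFFFFF rather than a wrapped, sign-flipped value.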

  At each cycle, the three buses move X- and Y-memory data to the MAC unit from core X and Y data RAMs while the program-control unit (PCU) fetches a new instruction from on-chip ROM or RAM. The X-data bus also serves as the main CPU data bus by linking the two data RAMs, a status register, a PCU, and a set of general-purpose registers. A single-level cache inside the chip allows for the repeat instruction.

  The DAAU generates X- and Y-memory addresses for each MAC cycle and modifies the pointers after operations, including modulo addressing. It has 10 16-bit pointer registers for addressing: a stack pointer, a base register, and eight other registers that handle different configurations of the DAAU. Oak has four general-purpose, 16-bit registers, including a top-of-stack pointer that references the top of the current software stack for interrupt or subroutine-processing calls. You can define four additional on-chip, general-purpose registers that are not part of the DSP core. These registers can be handy for application-specific hardware.

  Oak supports DMA operations, downloading from data-memory space to program-memory space, an automatic-boot procedure, and the 64k-word X- and Y-data space and the 64k-word program space. The X RAM space can be in internal and external memory; the Y RAM space is in core internal memory only. You can expand the X and Y memories in the core to 2k words. (Licensees can expand memory beyond 2k words, but this procedure requires redesigning portions of the core.) Only the X memory expands externally to 62k words. This limit on memory expansion potentially limits the performance of certain applications that require simultaneous access of X and Y memory. The off-core program memory can expand to 64k words. Oak has a built-in, 16-bit loop counter for repeating instructions or instruction blocks as many as 64,000 times. You can nest a repeat instruction in a loop block with as many as four levels of block nesting.

  Addressing modes—The Pine and Oak DSP cores support direct, register-indirect, relative, index, and immediate addressing modes.

  Special instructions—The devices support conditional subroutine call/return from a subroutine and interruptible- and block-repeat instructions. (Pine has one repeat level; Oak has four levels of nesting.) They also support division step, bit-field test and set (Oak only), compare, square, accumulate/subtract previous product, move data/program memory, conditionally modify accumulator, double-precision calculations, bit-field operations, exponent evaluation, context switching, minimum/maximum calculation, and automatic boot.

  Support—DSP Group supplies in-circuit-emulator and evaluation/development boards and on-chip-emulation capabilities; the company also sells a bond-out chip for emulation and debugging. Software tools include an assembler/linker; a loader; a simulator and a debugger; a C compiler; and the Assyst simulator, which enables users to map their customized logic into the tools. All tools run under Windows; some tools also run under Unix.

  Licensees—Adaptec, Asahi, Atmel, DSP Communications, GEC Plessey, Harris Corp, Hyundai, Integrated Circuit Systems, Kenwood, LSI Logic, NEC, Rohm, Samsung, Siemens, Silicon Systems, Taiwan Semiconductor Manufacturing Co, VLSI Technology, and Xicor.

Hitachi SH-DSP µP with 16-bit fixed-point DSP

Although some µP vendors have added multiply-accumulate units to their devices to allow them to perform basic digital signal processing, Hitachi has added a complete DSP to the SH µP. The SH’s DSP unit shares the five-stage pipeline with the integer unit; the DSP is not a coprocessor. The CPU contains a fetch and decode unit, which manages the single instruction stream for both the integer and DSP units, routing instructions to the appropriate unit.

  The integer unit of the SH-DSP comprises an enhanced SH-2 core that supports the DSP unit (see EDN’s "23rd annual µP/µC directory," Sept 12, 1996, pg 45). One enhancement is the addition of 32-bit DSP instructions; the RISC portion of the SH-DSP still operates only on 16-bit instructions. The SH-2 core uses a von Neumann architecture, and the DSP unit uses a modified Harvard architecture with one address space and a separate bus for instructions and data. Another significant enhancement to the SH-2 core is the replacement of its 8-kbyte cache with separately addressable X and Y memories. The main integer ALU calculates X addresses, and a separate, 16-bit pointer-arithmetic unit calculates Y memory addresses.

  The SH-DSP has four internal buses. The 32-bit internal bus transfers both instructions and data and can access any region of the processor’s 4-Gbyte address space. Separate X and Y buses comprise 15-bit addresses and 16-bit data. The address width is only 15 bits because the X and Y buses can access only aligned word-length data, and the least significant bit is always zero. The SH-DSP also contains an 8-bit peripheral data bus to access memory-mapped registers. A 32-bit external interface provides a direct connection to extended-data-out DRAM, pipeline-burst SRAM, or burst ROM.

  The integer and DSP units can’t execute instructions in parallel, because they share the chip’s internal address and data buses. However, the bus structure allows the DSP to access two data operands and fetch an instruction during one cycle. During that cycle, the SH-DSP can execute one ALU operation and a 16×16-bit multiply.

  The DSP unit has a register file separate from the integer unit’s registers. The DSP unit’s registers comprise six 32-bit registers and two 32-bit accumulators with 8 guard bits. These registers are visible only to the DSP unit and to DSP-extended load/store instructions. The DSP unit contains fetch buffers for storing three or fewer instructions of a tight program loop. In the first iteration of the loop, the processor normally fetches and executes instructions, but it also stores the instructions in the buffer. Subsequent iterations fetch from the buffer, reducing power consumption.

  Addressing modes—The SH supports immediate, PC-relative, indirect-PC, indirect-indexed with automatic pointer updating, direct, and indirect-register addressing. It also provides modulo addressing for the circular buffer. Zero-overhead loop control supports one loop.

  Special instructions—Instructions perform both arithmetic and logical barrel shifting of data in DSP registers. One instruction performs priority encoding to locate the most significant bit of the source operand; you can combine the result of this operation with an arithmetic shift to normalize a value. The SH also offers conditional instruction execution of some ALU and shift operations and support for saturation arithmetic.

  Support—Hardware- and software-cosimulation development tools are available from CardTools (Sunnyvale, CA). Hitachi and Green Hills Software (Santa Barbara, CA) offer a C and C++ compiler/assembler/simulator/graphical-user-interface-based development environment. Hitachi also offers an evaluation board, emulators, and an integrated software-design and -development environment.

Variations

  SH7410—60 MHz; 200 mA typ at 3.3V; 4k×16-bit data RAM; 12k×32-bit program ROM; interrupt controller, three-channel serial I/O; two-channel serial communication interface; 32-bit I/O; three-channel, 16-bit, free-running timer; four-channel DMA controller; 176-lead QFP; $25 (10,000).

Lucent Technologies DSP16xx 16-bit, fixed-point DSP family

Although Lucent Technologies sells its DSP products as part of a modem chip set, the company bases the modem on the DSP16xx core architecture. The main execution unit of the DSP16xx is the data-arithmetic unit, which has a 16×16-bit multiplier and a 36-bit ALU/shifter with 4 guard bits and two accumulators. The dual accumulators are helpful for calculations such as autocorrelation because you can perform the function with half the number of memory accesses. The multiplier and adder operate in parallel, and the multiplier has registered inputs and outputs.

  The multiply-accumulate (MAC) unit has a three-stage pipeline for fetching, multiplying, and accumulating. The simplified MAC unit allows the DSP16xx to run at high frequencies. The MAC unit can shift the multiply result before running it through the ALU/shifter and into one of the accumulators. The instruction-stream pipeline has fetch, decode, and execute stages and runs in parallel with the MAC unit. The shallow pipeline minimizes the impact of branches to two cycles. The DSP16xx has an exposed pipeline, letting a programmer see data at any point. The programmer controls the fetching of data into the ALU and controls the multiply and add. This method minimizes the number of registers to hold temporary data and, therefore, minimizes power consumption and die size.

  The DSP divides internal memory into X and Y memory spaces. The X memory contains both program and coefficients and would typically become a bottleneck for MAC units. However, for fast inner-loop processing, the program can use special instructions to load an inner-loop code block into a 15-instruction cache in the DSP16xx. The other advantage of the instruction cache is in power savings. The DSP16xx uses fixed-point, 2’s complement arithmetic throughout. The bit-manipulation unit has a 36-bit, barrel shifter; two 36-bit accumulators; and four general-purpose 16-bit registers.

  The DSP16xx family with its modified Harvard architecture uses three internal buses to move instructions, coefficients, and data in parallel for high-throughput processing. The DSP defines two 64k-word address spaces—one for program and coefficients and one for data. Both X and Y buses connect to the same dual-port RAM. If references occur simultaneously to both ports of one bank, the chip incurs a one-instruction-cycle penalty and performs the data access first. Memory writes always take two cycles. Analysis of algorithms shows that fewer writes than reads to memory occur for target applications. A special address cycle, the compound addressing mode, allows both a read and a write to memory during MAC operations.

  The DSP has XAAU and YAAU address generators, each with its own internal adders and registers to hold address values and offsets. The XAAU has a 12-bit adder, a 12-bit static-offset register, and four 16-bit pointer registers. The YAAU has eight static registers and an adder. Programmers can access XAAU and YAAU registers. The X side has half as many registers as the Y side, because signal processing typically requires fewer coefficient pointers. (Coefficients are stored on the X side.) Also, the Y side points to memory that requires more pointers.

  Addressing modes—The DSP16xx has register- and memory-direct, register-indirect, and immediate addressing modes; it has no bit-reversed addressing.

  Special instructions—Instructions for the DSP16xx include single/block-instruction hardware looping, conditional subroutine call, compare, compound addressing, exponent detection, and bit-field extraction and replacement. It has no rotate instructions.

  Support—Lucent Technologies supplies a hardware-development system with an in-circuit-emulator pod. Evaluation and demo boards are also available. The company sells software-development tools, including a C compiler, an assembler/linker, a debugger, a simulator, and an application library. Lucent offers a Linkable Functional Simulator, a DSP simulator model that plugs into system-level simulation tools from EDA vendors. This model allows you to develop your application at the system level using building blocks and to determine whether your design has the bandwidth to perform the task. The standard simulator includes a cycle-accurate model of Lucent’s DSP.

Variations

  DSP1604/06—40 MHz, 100 mA typ at 5V (33 MHz, 38 mA typ at 3.3V); 16k- and 24k-word ROM; 512-word, dual-port RAM; three 8-bit I/O ports; DRAM controller; JTAG interface; serial-I/O ports with optional dual-channel mode; dual crystal oscillator; two timers; low-power modes; 80-pin MQFP; 84-pin PLCC, 100-pin TQFP, $10 to $15 (100,000).

  DSP1605—40 MHz, 100 mA typ at 5V; 16k-word ROM; 1k-word, dual-port RAM; 8-bit I/O ports; 8-bit parallel host interface; DRAM controller; serial I/O ports with optional dual-channel mode; two timers, dual crystal oscillator; low-power modes; 80-pin MQFP, $9 (100,000).

  DSP1610—40 MHz, 130 mA typ at 5V, 512-word boot ROM, 4k- or 8k-word dual-port RAM, sleep mode, bit-manipulation and barrel-shifter units, two serial I/O ports, JTAG port, 16-bit timer, four external interrupts, JTAG interface, 132-pin PQFP, $57 (100,000).

  DSP1611—50 MHz, 85 mA typ at 5V (26 MHz, 32 mA typ at 2.7V); 2k-word boot ROM; 124k-word, dual-port RAM; two serial I/O ports; JTAG port; low-power modes; bit-manipulation unit; timer; two external interrupts; 8-bit, parallel host and control I/O interfaces; 100-pin BQFP/TQFP, $45 (100,000).

  POMP1615—20 MHz, 32 mA typ at 3V, 24k-word ROM, 2k-word RAM, five timers, one ADC and one DAC, PLL, DRAM controller, two serial peripheral interfaces, JTAG port, 100-pin BQFP/TQFP, $16 (100,000).

  DSP1616—50 MHz, 32 mA typ at 2.7V, 12k-word ROM, 2k-word RAM, timer, two serial I/O ports, 100-pin BQFP/TQFP, $24 (100,000).

  DSP1617—50 MHz, 100 mA typ at 5V (30 MHz, 32 mA typ at 2.7V), 24k-word ROM; 4k-word, dual-port RAM; two serial I/O ports, power-management modes; mask-programmable clock (internal one or 23 divide); 8-bit host and control I/O interfaces; timer, bit/shift unit; JTAG port; 100-pin PQFP/TQFP; $28 (100,000).

  DSP1618—50 MHz, 75 mA typ at 5V (30 MHz, 34 mA typ at 2.7V); 16k-word ROM (or 12k-word flash); 4k-word, dual-port RAM; two serial I/O ports; power-management modes; 5/3/2.7V operation; mask-programmable clock (internal one or 23 divide), 8-bit host and control I/O interfaces; bit/shift unit; timer; JTAG port; 100-pin BQFP/TQFP, $45 (100,000).

  DSP1620—120 MHz; 116 mA at 3V; 80 MHz; 118 mA typ at 5V; 4k-word ROM; 32-word, dual-port RAM; one DMA port; one standard serial port; power-management modes; PLL; 16-bit host interface with DMA; bit-shift unit; timer; 132-pin BQFP; 144-pin TQFP; $50 (300,000).

  DSP1627—70 MHz; 98 mA typ at 5V (50 MHz, 39 mA typ at 2.7V); 32k-word ROM; 6k-word, dual-port RAM; two serial I/O ports; power-management modes (hardware stop=50 mA); mask-programmable clock (internal one or 23 divide); 8-bit host and control I/O interfaces; bit/shift unit; timer; JTAG port; 100-pin BQFP/TQFP, $45 (10,000).

  DSP1628—78 MHz; 66 mA typ at 2.7V; 48k-word ROM; 16k-word, dual-port RAM; two serial I/O ports; power-management modes (hardware stop=50 mA); PLL; 8-bit host and control I/O interfaces; bit/shift unit; timer; 100-pin BQFP/TQFP; 144-BGA, $24 (10,000).

Motorola DSP561xx 16-bit, fixed-point DSP family

Motorola’s DSP561xx processors build on the basic DSP56000 architecture and add a codec for D/A and A/D voice conversions for digital cellular and voice communications. The DSP561xx delivers single-cycle, two-clock, multiply-accumulate (MAC) operation and offers hardware support for sum of products and vector processing. The core MAC unit has two 40-bit accumulators, including 8 guard bits, with four 16-bit input registers to hold incoming variables and coefficients. The input registers for a MAC instruction must load at the same time as the previous MAC instruction executes; that is, you can load the MAC input registers in parallel with the MAC operation. When storing results to 16-bit memory, limiter circuitry optionally saturates the 40-bit accumulator values to ±1.0, the limits of the 16-bit fractional format.
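
  The limiting step can be pictured with a short sketch. It assumes the 40-bit accumulator holds 8 guard bits above a 32-bit result and that a store writes the upper 16 bits of that 32-bit portion; the function name and this layout are assumptions for illustration, not a description of the DSP561xx datapath.

```python
Q15_MAX = 0x7FFF    # largest positive 16-bit fractional word (just under +1.0)
Q15_MIN = -0x8000   # most negative 16-bit fractional word (-1.0)


def store_with_limiting(acc40: int) -> int:
    """Return the signed 16-bit word written to memory for a signed
    40-bit accumulator value, saturating when the guard bits are in use.

    Assumed layout: 8 guard bits + 32-bit result; the upper 16 bits of
    the 32-bit result are what a store would write.
    """
    if acc40 > 0x7FFFFFFF:       # overflowed into the guard bits
        return Q15_MAX
    if acc40 < -0x80000000:      # underflowed into the guard bits
        return Q15_MIN
    return acc40 >> 16           # arithmetic shift keeps the sign
```

  Without limiting, a value that has grown into the guard bits would lose its upper bits on the store and could even change sign; with limiting, it clamps to the nearest representable extreme.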

  The DSP561xx has on-chip program RAM and dual-ported data RAM; each has its own address and data bus. The dual-ported data RAM allows the address generator to deliver two addresses per pipeline cycle, yielding two data reads or one read and one write. The address generator has 12 16-bit registers, such as address, offset, and modification registers, for sophisticated addressing and holding interim data values. The core can access the address-generator registers via a global data bus that links the address-generator registers to external memory, peripherals, and a functional bit-manipulation unit.

  The chip’s 16-bit external bus multiplexes between 64-kbyte program and data memories. The CPU can perform one access to external memory with no instruction-cycle penalty. When you use slower memory, wait states can be requested externally or set under program control. With a 60-MHz external clock and a 30-MHz basic pipeline cycle, a CPU memory fetch must take less than 33 nsec for single-cycle execution. For MAC processing, portions of the X-memory space supply Y-memory values. The DSP561xx has two data-memory address buses that fetch data from the X-memory RAM and from the external memory for Y-memory values.

  Addressing modes—The DSP561xx supports register-direct, memory-direct, register-indirect (postincrement/decrement by 1 or by an offset, or indexed by an offset), and immediate addressing. The address generator also supports modulo and bit-reversed addressing.
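
  Bit-reversed addressing, which FFT routines use to reorder data, visits buffer indices with their address bits mirrored. The sketch below computes that ordering in software; the function names are hypothetical, and the buffer length is assumed to be a power of two.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Mirror the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(length: int):
    """Visiting order for a buffer whose length is a power of two."""
    if length < 1 or length & (length - 1):
        raise ValueError("length must be a power of two")
    bits = length.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(length)]
```

  For an eight-point buffer, `bit_reversed_order(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`. Hardware support produces this sequence as a side effect of the pointer update, so no extra instructions are spent on it.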

  Special instructions—The DSP561xx provides hardware looping using do loops and a repeat instruction; only the do loops are interruptible. Other special instructions include a conditional exit from a block loop, division iteration, a double-precision multiply step, bit manipulation, and compare.

  Support—Motorola sells the Application Development System with in-circuit-emulation operation using the DSP’s on-chip emulation features. The on-chip emulator port lets external hardware set breakpoints, single-step, and read/modify memory or registers. You can configure the chip to run from external RAM for development. Third-party hardware tools are also available. Motorola supplies a Gnu C compiler and debugger, an assembler/linker, and a simulator. Third-party vendors supply data-acquisition and filter-design packages, as well as OS software.

Variations

  DSP56156—60 MHz, 188 mA max at 5V, 2-kbyte program RAM, 2-kbyte data RAM, 64-byte boot ROM, 8-bit host-processor interface, timer, codec, two serial I/O ports, 27 I/O pins, 112-pin TQFP, $20 (50,000).

  DSP56166—60 MHz, 100 mA max at 5V, 2-kbyte program RAM, 4-kbyte data RAM, 64-byte boot ROM, 8-bit host-processor interface, timer, codec, two serial I/O ports, 25 I/O pins, 112-pin TQFP, $17.90 (50,000).

Motorola DSP568xx 16-bit, fixed-point DSP family

The DSP568xx combines µC functionality with a programmable DSP. The DSP family’s parallel-instruction set controls three concurrent execution units within the 568xx’s three-stage pipeline: the data ALU, the address-generation unit (AGU), and the program controller. Three internal address and four internal data buses support data transfers. The general-purpose µC-style instruction set, with its flexible addressing modes and bit-manipulation instructions, enables you to write control code without worrying about DSP complexities.

  The data ALU provides single-cycle multiplies and multiply-accumulate (MAC) instructions with 36-bit accumulation (4 guard bits), as well as a set of logical and arithmetic operations. The ALU contains three input registers (X0, Y0, and Y1); two accumulators (which can also serve as input registers); a MAC unit; a 16-bit barrel shifter; and automatic saturation logic. You can write ALU results back to either of the accumulators. Additionally, if you don’t expect the ALU result to be 36 bits, then the result can go directly back to one of the three input registers without corrupting an accumulator value.

  The AGU supports DSP and µC addressing modes. The AGU can provide two data-memory addresses with address updates in one cycle. The AGU contains five 16-bit pointer registers (one functioning as a stack pointer), an offset register, a modifier register for circular-buffer support, and two address ALUs (one supporting modulo arithmetic) to fetch two data items from memory every instruction cycle. The stack pointer has several addressing modes, improving compiler performance and supporting structured programming techniques, such as parameter passing and local variables.
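
  Circular-buffer (modulo) addressing wraps a pointer inside a fixed region after each update. The following sketch shows the arithmetic in software; the function name, the parameters, and the example base address are assumptions for illustration and do not reflect how the 568xx modifier register is actually encoded.

```python
def modulo_update(pointer: int, step: int, base: int, length: int) -> int:
    """Advance `pointer` by `step` (which may be negative) and wrap it
    inside the circular buffer [base, base + length)."""
    return base + (pointer - base + step) % length


def walk(base: int, length: int, step: int, count: int):
    """Addresses visited by `count` post-updates starting at `base`."""
    addresses, pointer = [], base
    for _ in range(count):
        addresses.append(pointer)
        pointer = modulo_update(pointer, step, base, length)
    return addresses
```

  With a four-word buffer at address 0x100, `walk(0x100, 4, 1, 6)` visits 0x100, 0x101, 0x102, 0x103, then wraps to 0x100 and 0x101. A filter's delay line can therefore be reused indefinitely without copying samples.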

  The 568xx supports an interruptible hardware do loop on any-sized block of instructions. In a set of nested loops, a programmer generally uses hardware looping for the innermost loop. Then, you can perform the outer loops using software looping and the 568xx’s data ALU register, AGU register, or a memory location to store the loop counter. To improve the performance of software looping, the 568xx supports a decrement instruction that operates directly on X memory and uses a conditional branch operation. Furthermore, Motorola added an addressing mode that doesn’t require an address calculation and allows direct access to the first 64 locations in X memory; this approach makes the access faster than a long immediate access.

  Addressing modes—The 568xx supports register-direct, short and long memory-direct, seven memory-indirect, and immediate addressing modes. It also supports short-branch offset and modulo arithmetic for circular buffers.

  Special instructions—The 568xx performs hardware-do and -repeat looping on one instruction or a block of instructions. Single and dual parallel-move instructions perform memory accesses in parallel with ALU operation, allowing two data-memory accesses while fetching an instruction. The 568xx can perform bit-manipulation operations on any register or memory location, and it can perform single-cycle multiply and MAC with optional rounding, addition, subtraction, and squaring. Using a conditional transfer instruction with a compare instruction implements searching and sorting algorithms. If the specified condition is true, then the DSP performs a transfer from one register to another (for example, to store the array index of the maximum value in an array).
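
  The compare-plus-conditional-transfer pattern for finding the index of a maximum can be sketched as follows. Python's conditional expressions stand in for the conditional register transfer; the function and variable names are hypothetical, and the sketch makes no claim about 568xx instruction timing.

```python
def index_of_max(values):
    """Index of the first maximum in a non-empty sequence.

    Each iteration does one compare, then two conditional transfers
    (value and index), with no branch inside the loop body.
    """
    best_value = values[0]
    best_index = 0
    for i in range(1, len(values)):
        is_greater = values[i] > best_value                   # compare
        best_value = values[i] if is_greater else best_value  # conditional transfer
        best_index = i if is_greater else best_index          # conditional transfer
    return best_index
```

  Because the loop body contains no data-dependent branch, it suits a hardware loop: every pass takes the same number of cycles regardless of the data.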

  Support—The 568xx uses Motorola’s OnCE port for on-chip emulation through a standard JTAG interface. Motorola offers a C compiler, cross-linker-assembler and -simulator package for PC, Sun, and HP platforms. This package includes a graphical user interface, as well as a DSP56811ADS hardware-development system. DSP56L811 evaluation modules for PCs and third-party tools are also available.

Variations

  DSP56L811—40 MHz, static, 20 mA max at 2.7V, 1k×16-bit program RAM, 2k×16-bit data RAM, three general-purpose timers, watchdog timer, real-time timer, two serial peripheral interfaces, PLL, 32 general-purpose I/O pins (eight with interrupt capabilities), 100-pin TQFP, $8.88 (100,000).

NEC 16-bit, fixed-point µPD7701x DSP

NEC’s µPD7701x combines a number of architectural features to ensure fast execution of single-cycle multiply-accumulate (MAC) instructions. These features include three internal buses (X-data, Y-data, and transfer); a pipelined MAC unit; a barrel shifter; and eight general-purpose, 40-bit register/accumulators.

  The µPD7701x has 16-bit data words and 32-bit instruction words. It has dual external memory ports—one for 16-bit data and one for 32-bit programs—with two distinct 16k-word address spaces for data. The 32-bit instruction word helps to increase code efficiency. Memory-read and -write accesses can take one cycle, although instruction pipelining may require an extra cycle for some instructions. An on-chip wait-state generator lets the processor run with slower, less expensive memory. The 7701x supports both programmed and externally requested wait states and has an 8-bit host I/O port.

  The 7701x comprises a data unit, a program unit, and a peripheral set. The data unit contains X- and Y-memory units, each of which has an address generator, a register file, and a MAC-execution unit. The program unit contains the instruction-address unit with built-in loop control, interrupt-control logic, program memory, and instruction-decode/control logic. A transfer bus links the two main units. Each main unit connects to an external data-memory interface: 14-bit addresses and 16-bit data for the data unit.

  The MAC unit has three 40-bit parallel subunits: a multiplier, an ALU, and a barrel shifter. The MAC unit provides 8 guard bits, but, because the 7701x supports no automatic saturation, your program must manipulate the results to retain accuracy before storing into memory. Unlike many other DSP implementations, the MAC subunits do not have dedicated input and output registers. Instead, the MAC tightly integrates with a set of eight general-purpose registers. The core uses the X, Y, and transfer buses to load data into the general-register set; the general-register set provides the data to drive the MAC subunits, which can execute concurrently. In effect, the general-register set, which is basically a multiport register file, serves as the interchange that links the data to the execution side of the processor.

  Two 2k-word RAM data-memory banks supply the X- and Y-data components for each MAC cycle. Each bank has its own address generator with a set of four address-pointer registers. Each unit also has an index-register link to the main data bus. Through this bus, code can load and modify the pointer and modification registers. Each unit also has a modulo register for circular buffering. A special bit-reverse circuit handles bit-reversed addressing for each bank. You can directly load internal RAM under program control. Additionally, the DSP hardware supports automatic interruptible looping with a four-level loop stack that lets code nest loops under hardware control.

  Addressing modes—The 7701x supports memory-direct, register-indirect, and immediate addressing. Hardware supports modulo and bit-reversed addressing for each data memory.

  Special instructions—The 7701x supports conditional operations (to minimize jumps), 1-bit shift-multiply-add, clip result, register-indirect subroutine call, register-indirect jump, single- and block-instruction hardware loop, and repeat. The 7701x lacks bit-manipulation instructions.

  Support—NEC supplies a PC-based plug-in development board that offers in-circuit emulation using the 7701x’s on-chip emulation features. A C compiler for the 7701x is unavailable. NEC supplies an assembler-linker-loader package, and a third-party simulator is also available. The Spox real-time OS is ported to the 7701x DSPs.

Variations

  µPD77016—33 MHz, 140 mA max at 5V, 1.5k×32-bit program RAM, 4k×16-bit data RAM, two serial I/O channels, 160-pin QFP, $22.67 (10,000).

  µPD77015/µPD77017/µPD77018A—33 MHz, 40 mA max at 3.3V, 4k/12k/24k×32-bit program ROM, 256×32-bit program RAM, 4k/8k/24k×16-bit data ROM, 2k/4k/6k×16-bit data RAM, two serial I/O channels, 100-pin TQFP, $17.23/$22.67/$26.67 (10,000).

Oxford Micro Devices parallel-video A236 DSP chip

The A236, primarily targeting real-time video processing, has a single-instruction, multiple-data architecture with four 16-bit vector processors, a 24-bit scalar processor, and a motion-estimation coprocessor. Each vector processor has a triple-port register stack with 64 general-purpose registers and a 16-bit barrel shifter. Software can access the registers in overlapping groups via register windows. The DSP provides functions for common video-processing operations, such as chroma keying. Each vector processor also has a 16-function ALU and a pipelined, 16×16-bit, 2’s complement combinatorial multiplier with a 40-bit accumulator. You can use the scalar processor for scalar arithmetic and Boolean operations; for program control; and for computation of data addresses, program addresses, and loop counts. The scalar processor has a triple-port register stack with 32 registers, a 16-function ALU, and a 16-bit barrel shifter.

  A crossbar switch provides flexible, byte-level addressing and connects all the processors to a 64-bit-wide, 1-kbyte, two-way set-associative data cache. Byte-level addressing is useful for addressing a group of eight 8-bit operands from an arbitrary address for convolution or motion estimation. The motion-estimation coprocessor for video-compression applications couples to the vector processors. The coprocessor computes the sum of the absolute values of the differences between groups of 8-bit pixels.
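
  The sum-of-absolute-differences measure used in motion estimation is easy to state in code. The sketch below computes it for two pixel groups and uses it in a one-dimensional search; the function names and the 1-D search are simplifying assumptions, and nothing here reflects how the A236 coprocessor is programmed.

```python
def sum_of_absolute_differences(block_a, block_b):
    """SAD between two equal-length groups of 8-bit pixel values."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match_offset(block, search_row):
    """Offset within `search_row` where `block` matches best (lowest SAD).
    Ties resolve to the smallest offset."""
    n = len(block)
    scores = [
        sum_of_absolute_differences(block, search_row[o:o + n])
        for o in range(len(search_row) - n + 1)
    ]
    return min(range(len(scores)), key=scores.__getitem__)
```

  A video encoder runs this comparison over a two-dimensional search window for every block, which is why a dedicated coprocessor for the inner sum pays off.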

  A 32-bit instruction unit implements a "superscalar-vector" instruction set. Most instruction words are 32 bits long, with 64-bit-long extended instructions for immediate operands. The chip’s parallel architecture and ability to simultaneously operate on eight parallel operands allow it to typically execute as many as nine instructions for every instruction word. The 1-kbyte instruction cache is 64 bits wide. This width allows the instruction unit to access extended instructions in a cycle.

  One instruction can compute a memory address using the scalar processor, fetch four or eight parallel operands from an arbitrary starting address, and operate on those operands using the vector processors. The vector processors support the storage of 8- and 16-bit parallel data types to maximize memory usage and support common video-data formats. The A236 also supports an interrupt-stack pointer and two 24-bit, software-stack pointers, one for the vector processors and one for the scalar processor. To avoid cache conflicts, software can store scalar parameters and subroutine-calling parameters in the scalar stack, and the vector stack stores vector parameters.

  The A236 has three 16-bit, bidirectional, asynchronous, double-buffered, video-aware, parallel-DMA ports to load data for I/O as fast as 100 Mbytes/sec and to pass information among multiple A236 chips. You can connect common video-decoder and -encoder chips to the A236 without glue logic or frame buffers. You can easily implement multiframe buffers of interlaced and noninterlaced video or other stream-type data.

  A 32-bit-wide memory port with 64-byte bursts provides a 400-Mbyte/sec interface to synchronous DRAMs (SDRAMs). A digital PLL operates the SDRAMs at full-rated speed. The A236 has a port that connects to a serial EEPROM, which contains the chip’s BIOS. Upon reset, the A236 reads the EEPROM and transfers the software to the system’s SDRAMs. You can program the EEPROM with the A236. The serial port can also control other system-level peripherals.

  You can use one of the A236’s 16-bit-wide, bidirectional, asynchronous, double-buffered, parallel-DMA ports as a command port for passing commands and data to and from a host processor. These ports support a packet mode with 32-bit addressing. The A236 also has an RS-232C port that an interrupt-service routine handles.

  Addressing modes—The A236 supports immediate and register-direct addressing.

  Special instructions—Oxford tailored the instruction set, supporting conditional operators, for C programming. The company includes an extensive set of conditional operations for program-flow control, including true and false conditions. The A236 also supports parallel operations on quad 8- and 16-bit words and octal 8-bit words. The basic parallel data structure is the "quad," such as 4 pixels. The four vector processors can simultaneously operate on this quad using one instruction, with the scalar processor computing an address or pointer. A 2-D array of quads describes a video frame. A loop, handling a series of quads, operates on this video frame. The A236 supports packed and interleaved, signed and unsigned, parallel data types to speed composite and monochromatic video-data processing. The A236 also supports saturation and normal arithmetic.
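
  The difference between saturation and normal (wrapping) arithmetic on a quad of 8-bit pixels can be illustrated as follows. The tuple-of-four representation and the function names are assumptions for the example; on the A236 one instruction would process all four lanes at once.

```python
def saturating_add_u8_quad(a, b):
    """Lane-wise add of two quads of unsigned 8-bit values, clamped at 255."""
    return tuple(min(x + y, 255) for x, y in zip(a, b))


def wrapping_add_u8_quad(a, b):
    """Lane-wise add of two quads of unsigned 8-bit values, modulo 256."""
    return tuple((x + y) & 0xFF for x, y in zip(a, b))
```

  Adding `(200, 100, 255, 0)` and `(100, 100, 1, 0)` gives `(255, 200, 255, 0)` with saturation but `(44, 200, 0, 0)` with wrapping. For pixel data the saturated result is usually the desired one, because a bright pixel stays bright instead of wrapping to a dark value.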

  Support—Oxford Micro Devices provides an assembler, a parallel C compiler, a linker, a loader, a simulator, and a debugger, all running under Windows 95. The company also provides a combined hardware- and software-evaluation kit that contains a video-processing system. This system contains an analog video input and output, a video encoder and decoder, the A236, 4 Mbytes of 100-MHz SDRAM, and a PCI interface.

Variations

  A236—40 MHz, 300 mA max at 5V, 208-pin PQFP, $35 (100,000).

SGS-Thomson 16-bit, fixed-point D950 DSP-Core

The D950-Core’s triple-level metal CMOS technology lets you embed the core as a megafunction in a gate array or cell-based device that has a 0.5-µm library containing RAM, ROM, and programmable-logic arrays. By adding optional peripherals, such as interrupt or DMA controllers, a bus-switch control unit, as well as memory for data and instruction, you can customize an application-specific DSP around the core. The key to the core’s performance is its operation parallelism, which allows the core to simultaneously perform multicycle functions along with other processors.

  The D950-Core’s architecture comprises a data-calculation unit (DCU), an address-calculation unit (ACU), a program-control unit (PCU), and an emulation-and-test unit (ETU). SGS-Thomson organized these units around three 16-bit buses—two for data and one for instruction. Each bus is paired with a dedicated 16-bit address bus. Data memory (as many as 128k words of RAM and ROM) maps to the off-core buses. The D950 also maps several noncritical registers in the off-core memory space; this approach allows SGS-Thomson to add more registers without using bits in the instruction opcode. Access to program RAM and ROM (as many as 64k words) is via the instruction bus; data and instruction buses share a bus-control interface.

  The DCU computes operands, which can be 16 or 32 bits, signed or unsigned. The chip’s DCU includes a 16×16-bit parallel multiplier to implement a single-cycle multiply-accumulate (MAC) instruction and other MAC-based functions. Special instructions, called assignments, support these functions and allow you to perform a MAC operation and simultaneously add a register value to the product register. A 40-bit ALU implements a range of functions with two 40-bit accumulators. The core has a 40-bit barrel-shifter unit and a bit-manipulation unit that handles master-controller-unit processing through bit operations. Both 16/32-bit fractional (signed/unsigned) and 16/32-bit integer (signed/unsigned) word formats are available.

  The ACU, with two separate address generators, generates an address for each of the two identical data memories and updates them at each instruction, allowing instruction execution and two register-to-memory moves in one cycle. The ACU also contains two stack pointers, allowing you to perform fast context switching. A D950 instruction uses the two stack registers to save or restore two registers at a time.

  The PCU updates the program counter (PC) according to the current instruction or internal and external events. It performs program-address generation, instruction fetch and decoding, and exception processing and supports three nestable hardware loops without size or location constraints. By default, the PC increments by 1. The ETU contains three independent sections that share an external interface: an emulation port, core-scan registers, and a test port (for production-test purposes). Access to these sections is via dedicated I/O pins, which allow the ETU to interface with an outside JTAG controller or function as the primary access to the final chip.

  You can configure an 8-bit general-purpose parallel port (P0 to P7) as an input or output. A test condition is attached to each bit to test external events; chip control is via interface pins related to interrupt, low-power mode, reset, and miscellaneous functions.

  Addressing modes—The D950 supports direct, indirect-linear, indirect-modulo, indirect-bit-reverse (all with postincrement), indirect-indexed, and immediate addressing modes.

  Special instructions—The D950 supports bit manipulation, double-precision calculations, and specific instructions that support integrated coprocessors.

  Support—SGS-Thomson offers a JTAG PC board with a graphical, windowed, high-level source debugger for emulation. A C compiler, a simulator, and an assembler/linker that run on PC and Sun systems are available, as are VHDL models from Synopsys and Mentor Graphics (Wilsonville, OR). The D950 is integrated into SPW from Alta Group (Sunnyvale, CA), which allows cosimulation with different models of D950 (VHDL, instruction-set simulator, or emulator).

Texas Instruments 16-bit, fixed-point TMS320C1x DSP family

TI’s TMS320C1x has a large collection of application code supporting a range of applications. The chip may require careful assembly-language programming because of a 4k-word program-space limit, but costs have dropped to the point that C1xs compete with many µPs and µCs.

  This DSP family is accumulator-based, with an accumulator that is 32 bits wide for double-precision, 2’s complement arithmetic. Although the accumulator lacks guard bits, the DSP has an automatic saturation mode to handle overflow. The architecture relies on shared resources, such as buses and memories, rather than on implementing multiple sets to speed processing. The modified Harvard architecture separates program and data accesses yet allows transfers between code and data spaces to share resources.

  A multiply-accumulate (MAC) operation takes two cycles and two instructions—first the multiply and then the add instruction. Addressing for the MAC data values is not automatic; a basic MAC pass requires code to specifically address, fetch, and operate on each set of X- and Y-data values.

  The TMS320C1x has one data bus, limiting data reads or writes to one per cycle. For each MAC operation, code must load the multiplier’s T register with one data value and then move the second data value to the multiplier before starting the multiply.
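
  The load-multiply-accumulate sequence described above can be modeled step by step. The sketch below is a behavioral illustration only: the variable names, the read counter, and the 32-bit wrap helper are assumptions, and the code does not reproduce C1x mnemonics or cycle counts.

```python
def wrap32(value: int) -> int:
    """Wrap to a signed 32-bit result, as an accumulator without guard
    bits would when saturation is off."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def sum_of_products(x, h):
    """Accumulate x[i] * h[i] the way a single-data-bus machine must:
    one operand fetch per step. Returns (accumulator, data_bus_reads)."""
    acc = 0
    reads = 0
    for xi, hi in zip(x, h):
        t_register = xi                 # step 1: load T with the first operand
        reads += 1
        product = t_register * hi       # step 2: fetch second operand, multiply
        reads += 1
        acc = wrap32(acc + product)     # step 3: add the product to the accumulator
    return acc, reads
```

  Three taps cost six data-bus reads, two per product. Architectures with two data buses fetch both operands in the same cycle and fold the add into the multiply.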

  The MAC unit’s 16-bit barrel shifter, in parallel with the single-cycle, 16-bit multiplier, feeds into a 32-bit ALU that handles both 16- and 32-bit operations. The accumulator feeds back to the ALU. The accumulator also feeds a simple shifter (shifts of 0, 1, or 4 bits) that, in turn, links to the data bus. A 1-bit shift is useful for removing a redundant sign bit; the 4-bit shift allows you to perform a 32-tap filter without overflowing the accumulator.

  A logic section that has its own program bus and memory handles code fetch and decoding. You can transfer data in the program ROM or EPROM to the data RAM and then use the data as a constant for MAC series expansion.

  Parallel I/O ports—You can use the 16-bit external data bus as an I/O port to connect external peripherals, such as ADCs or DACs. Separate I/O select signals let you use as many as eight I/O ports on the bus (16 with the C14).

  Event manager—The C14 event manager adds a capture-and-compare subsystem to supplement the C14’s two 16-bit timer/counters. The subsystem has six 16-bit compare and six 16-bit capture registers. The system compares compare-register values with the running timers; on a match, the subsystem generates an interrupt or an external signal. A high-precision PWM mode adds 2 bits of resolution for PWM outputs. (Resolution is 40 nsec at 25.6 MHz.) The subsystem can capture events; changes in one of six input lines trigger logic to set the timer/counter value in a capture register. The subsystem has a FIFO stack that buffers as many as four capture values for four capture registers.

  Addressing modes—The C1x supports paged-memory direct addressing; 7 bits in the instruction concatenate with a 9-bit data-page pointer for accessing data RAM (128 words each page). It also supports register-indirect (using 8 bits from one of the auxiliary registers to address data RAM) and immediate addressing. Most members of the C1x family have a limited off-chip addressing range (12 bits); the C16 is the exception with 16-bit addressing. Additionally, external addressing is for program memory only; the C1x does not support off-chip data memory.
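
  Paged direct addressing forms a data address by placing the page pointer above the 7-bit offset carried in the instruction. A minimal sketch, in which the function name and the argument checks are assumptions for illustration:

```python
PAGE_SIZE_WORDS = 128   # 7 offset bits give 128 words per page


def direct_address(data_page: int, offset7: int) -> int:
    """Concatenate the data-page pointer with the 7-bit instruction offset."""
    if not 0 <= offset7 < PAGE_SIZE_WORDS:
        raise ValueError("offset must fit in 7 bits")
    if data_page < 0:
        raise ValueError("data page must be non-negative")
    return (data_page << 7) | offset7
```

  Page 1 with offset 0 maps to word 128, and page 3 with offset 127 maps to word 511. Code that touches data on several pages has to reload the page pointer between accesses, which is the cost of keeping the instruction word short.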

  Special instructions—The C1x offers no hardware looping or bit manipulation but has an instruction to transfer data between program and data memory; it also offers an instruction that tests an input pin and branches if the pin is zero. An instruction moves a value one position higher in memory; this function is the predecessor to the multiply-accumulate-with-data-move (MACD) operation in later architectures.

  Support—An in-circuit emulator and evaluation modules (PC add-in cards) are available from TI. TI furnishes a development tool kit with an assembler/linker, a simulator, and an application library. Many third-party vendors sell hardware- and software-development tools for the C1x.

Variations

  TMS320C10—25.6 MHz, 94 mA max at 5V, 288-byte RAM, 3-kbyte ROM, 16-bit parallel I/O port, 44-pin PLCC, $5 (10,000).

  TMS320C14—25.6 MHz, 50 mA max at 5V; 512-byte RAM; 8-kbyte, one-time-programmable ROM; 16-bit parallel I/O port; two timers; 16-bit watchdog timer; one baud-rate-generation timer; 68-pin PLCC; $7.90 (10,000).

  TMS320C15—25.6 MHz, 85 mA max at 5V (16 MHz, 50 mA max at 3.3V); 512-byte RAM; 8-kbyte, one-time-programmable ROM; 40-pin DIP; 44-pin CLCC; 44-pin PLCC; $5.70 (10,000). Lucent Technologies Microelectronics also sells an ASIC macro for this device.

  TMS320C16—35 MHz, 100 mA max at 5V (25.6 MHz, 65 mA max at 3.3V), 512-byte RAM, 16-kbyte ROM, 64-pin PQFP, $6.30 (10,000).

  TMS320C17—20 MHz, 65 mA max at 5V (14.4 MHz, 22.7 mA max at 3.3V); 512-byte RAM; 8-kbyte, one-time-programmable ROM/EPROM/OTP; two serial I/O ports; one timer; 8/16-bit asynchronous-coprocessor port; companding hardware compresses and expands data for serial or parallel mode; handles both the A- and µ-Law forms, which meet US, Japanese, and European standards; 40-pin DIP; 44-pin PLCC; $5.72 (10,000).

Texas Instruments 16-bit, fixed-point TMS320C2xx DSP family

TI based the TMS320C2xx DSPs on the 320C2xLP core that the company offers as part of its custom DSP capability. The C2xx’s instruction set is a superset of and source-code-compatible with the C2x and is a subset of the C5x.

  The accumulator-based C2xx processor has a central ALU (CALU), which feeds the 32-bit accumulator. The accumulator also acts as one of the inputs to the CALU. The other input to the accumulator comes from either the 16×16-bit multiplier (through a scaling shifter) or the input data-scaling shifter. Software can rotate the contents of the accumulator through the carry bit to perform bit manipulation and testing. For implementing fractional arithmetic or justifying a fractional product, the C2xx processes the product-register output through a product shifter to eliminate the extra bit in a multiplication. The product-scaling shifter allows as many as 128 product accumulates without overflowing the accumulator.
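
  The extra bit that the product shifter removes, and the headroom that product scaling buys, can be worked through in a short Python sketch. It assumes Q15 operands (1 sign bit and 15 fractional bits) and assumes that the 128-accumulate figure corresponds to a 6-bit right shift of the product; neither detail is stated above, and the function names are illustrative only.

```python
MAX_Q15 = 0x7FFF           # largest positive Q15 value, just under 1.0
INT32_MAX = (1 << 31) - 1  # top of a 32-bit signed accumulator


def q15_product(a, b, left_shift=1):
    """Multiply two Q15 integers; a 1-bit left shift turns the Q30 product into Q31."""
    return (a * b) << left_shift


def max_accumulates(right_shift):
    """Count worst-case positive products that fit in the accumulator after scaling.

    The 6-bit shift used below is an assumption, not a figure from the text.
    """
    scaled = (MAX_Q15 * MAX_Q15) >> right_shift
    return INT32_MAX // scaled


half = 0x4000                          # 0.5 in Q15
quarter_q31 = q15_product(half, half)  # 0.25 in Q31
```

  Under these assumptions, max_accumulates(6) evaluates to 128, which agrees with the figure in the text, whereas an unshifted worst-case product leaves room for only two accumulates.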

  The basic multiply-accumulate (MAC) cycle involves multiplying a data-memory value by a program-memory value and adding the result to the accumulator. When the C2xx repeats the MAC, the program counter automatically increments, freeing the program bus to fetch the second operand. This feature allows the MAC to achieve single-cycle execution.

  Similar to the C5x, the C2xx can access 64,000 16-bit parallel I/O ports. The peripherals on C2xx devices, such as serial ports and software wait-state generators, are I/O-mapped in the on-chip I/O space. Your program must use other I/O addresses to access off-chip peripherals. You can use slower external memories using the C2xx’s software wait-state generator or the chip’s Ready pin. Most of the C2xx devices can generate zero to seven wait states.

  The C240 has an onboard event manager to support motor-control applications. The event manager features three up/down timers and nine comparators, which you can couple with waveform-generation logic to create as many as 12 PWM outputs. The event manager supports symmetrical (centered) and asymmetrical (noncentered) PWM-generation capabilities. It also supports a space-vector PWM state machine, which implements a scheme for switching power transistors to yield longer transistor life and lower power consumption. A deadband-generation unit also helps protect power transistors. In addition, the event manager integrates four capture inputs, two of which can serve as direct inputs for optical-encoder quadrature pulses.

  Addressing modes—The C2xx supports immediate addressing and paged-memory-direct addressing, in which 7 bits in an instruction concatenate with a 9-bit data-page pointer to access data RAM. It also supports register-indirect addressing using the 16 bits in one of eight auxiliary registers to access memory. It can automatically postincrement or decrement auxiliary registers. The C2xx offers no circular buffering.
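
  The paged-memory-direct scheme amounts to a simple bit concatenation. The following Python sketch shows the arithmetic; the function name and the range checks are illustrative and do not come from TI documentation.

```python
def direct_address(data_page, offset):
    """Form a 16-bit data address from a 9-bit page pointer and a 7-bit instruction field."""
    if not 0 <= data_page < 512:
        raise ValueError("data-page pointer must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    # The page pointer supplies the upper 9 bits; the instruction supplies the lower 7.
    return (data_page << 7) | offset


example = direct_address(4, 5)  # page 4, word 5 within the page
```

  Each page therefore spans 128 words, and the 512 pages together cover a 64k-word data space.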

  Special instructions—A MAC-with-data-move instruction adds a data move for on-chip RAM blocks to the MAC unit, which is useful for convolution and transversal filtering. The C2xx also offers single-instruction repeat, multiply and accumulate previous product, multiply and subtract previous product, accumulate previous product and move data, multiconditional branches and calls, store long immediate to data-memory locations, rotate accumulator left/right, and block move.
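
  To see why a combined multiply-accumulate and data move suits transversal filters, consider this Python model of one FIR output. It is a behavioral sketch only, not C2xx code; fir_step and the list-based delay line are hypothetical names chosen for illustration.

```python
def fir_step(delay_line, coeffs, new_sample):
    """Compute one FIR output; each tap's MAC also copies its sample one slot higher."""
    delay_line[0] = new_sample
    acc = 0
    # Walk from the oldest tap down so each sample is read before it is overwritten.
    for k in range(len(coeffs) - 1, -1, -1):
        acc += coeffs[k] * delay_line[k]
        if k + 1 < len(delay_line):
            delay_line[k + 1] = delay_line[k]  # the data-move half of the operation
    return acc


coeffs = [1, 2, 3]
line = [0, 0, 0]
impulse_response = [fir_step(line, coeffs, x) for x in [1, 0, 0, 0]]
```

  Because the move rides along with each MAC, the delay line is already shifted when the next sample arrives, and no separate copy loop is needed.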

  Support—TI offers an emulator that supports JTAG scan-based emulation for nonintrusive product test. The company also supplies a C compiler, a source-level C assembler/debugger, a linker, a simulator, a profiler, and an application library. Evaluation modules, prototype cards, emulators, and application algorithms are also available through third parties. Mentor Graphics (Wilsonville, OR) offers an HDL model of the C5x core.

Variations

  TMS320C203—40/57 MHz, 22 mA typ at 3.3V (80 MHz at 5V), 544×16-bit program/data RAM, synchronous serial port, UART, timer, 100-pin TQFP, $5.20/$5.50/$5.80 (10,000).

  TMS320C204—40/57 MHz, 22 mA typ at 3.3V (80 MHz at 5V), 544×16-bit program/data RAM, 4k×16-bit ROM, synchronous serial port, UART, timer, 100-pin TQFP, $6.40/$6.70/$7 (10,000).

  TMS320LC205—40/57/80 MHz, 44 mA typ at 3.3V, 4.5k×16-bit program/data RAM, synchronous serial port, UART, timer, 100-pin TQFP, $7.80/$8.30/$8.70 (10,000).

  TMS320F206/F207—40/57/80 MHz, 22 mA max at 5V, 4.5k×16-bit program/data RAM, 32k×16-bit flash, synchronous serial port (F207 has two), UART, timer, 16 I/O pins (F207 only), 100-pin TQFP, $10.80/$12.40/$13.70 (10,000).

  TMS320C209—40/57 MHz, 54 mA max at 5V, 4.5k×16-bit program/data RAM, 4k×16-bit ROM, timer, 80-pin TQFP, $10.20/$11 (10,000).

  TMS320F/C240—40 MHz, 22 mA typ at 5V, 544×16-bit RAM, 16k×16-bit flash or ROM, event manager, dual 10-bit ADCs, one SPI, one UART, 28 general-purpose I/O ports, three timers, $20.30.

Texas Instruments 16-bit, fixed-point TMS320C54x DSP family

The TMS320C54x DSPs are TI’s highest performance 16-bit, fixed-point DSPs. The C54x DSPs use a modified Harvard architecture that incorporates three buses for data memory and one for program memory; each bus has its own address bus. Two of the data-memory buses are for reads, and one of the buses is for writes from the accumulator output. The C54x can generate as many as two data-memory addresses per cycle using two auxiliary register arithmetic units. The four internal buses and dual address generators enable multiple operand operations and reduce memory bottlenecks.

  The C54x has two 40-bit accumulators. A 40-bit ALU feeds the accumulators, and a separate 40-bit adder is dedicated to multiply-accumulate (MAC) operations. The ALU and two accumulators support eight special parallel instructions that execute in one cycle. The ALU also features a dual 16-bit configuration that enables dual single-cycle operations. The 40-bit adder, at the output of the multiplier, allows unpipelined MAC operations as well as dual addition and multiplication in parallel. The multiplier performs 17×17-bit multiplies to allow 16-bit signed or unsigned multiplication, with rounding and saturation control in one cycle. Single-cycle normalization and exponential encoding support floating-point arithmetic.
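
  A 17-bit signed multiplier can serve both signed and unsigned 16-bit operands because either interpretation fits in 17 signed bits. The Python sketch below illustrates the idea; the function names and the per-operand flags are assumptions made for the example, not a description of the C54x control bits.

```python
def extend_to_17(word, signed):
    """Interpret a 16-bit word as signed or unsigned; the result fits in 17 signed bits."""
    word &= 0xFFFF
    if signed and word & 0x8000:
        return word - 0x10000  # sign-extend a negative two's-complement value
    return word                # zero-extend: the 17th bit stays clear


def multiply_17x17(a_word, b_word, a_signed=True, b_signed=True):
    """Multiply two 16-bit words after extending each to a 17-bit signed value."""
    return extend_to_17(a_word, a_signed) * extend_to_17(b_word, b_signed)


signed_product = multiply_17x17(0xFFFF, 0xFFFF)                  # (-1) * (-1)
unsigned_product = multiply_17x17(0xFFFF, 0xFFFF, False, False)  # 65535 * 65535
mixed_product = multiply_17x17(0xFFFF, 2, True, False)           # (-1) * 2
```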

  The C54x’s instruction set complements the parallelism of the architecture. It supports many two- and three-operand instructions, as well as some 32-bit operands. Eight individually addressable auxiliary registers and a software stack aid a C compiler’s efficiency. The C54x supports two circular buffers of arbitrary length and location.

  A compare-select-store unit contains an accelerator that reduces the Viterbi "butterfly update" to four cycles for Global System for Mobile communications channel decoding.
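
  The butterfly update that this unit accelerates is an add-compare-select step. The Python sketch below shows a generic textbook form, assuming that larger path metrics are better and that the two branches of a butterfly carry metrics of equal magnitude and opposite sign; it does not model the C54x hardware or its instruction sequence.

```python
def acs_butterfly(metric_a, metric_b, branch):
    """Add-compare-select for one butterfly.

    Returns two (survivor metric, decision bit) pairs, one per new state.
    Decision bit 0 means predecessor A survived; 1 means predecessor B.
    """
    # New state 0: predecessor A adds +branch, predecessor B adds -branch.
    cand0 = (metric_a + branch, metric_b - branch)
    # New state 1: the branch metrics swap sign.
    cand1 = (metric_a - branch, metric_b + branch)
    out = []
    for from_a, from_b in (cand0, cand1):
        if from_a >= from_b:
            out.append((from_a, 0))
        else:
            out.append((from_b, 1))
    return out


first = acs_butterfly(10, 4, 3)
second = acs_butterfly(0, 10, 2)
```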

  You can use slower external memories by using the C54x’s software wait-state generator. All C54x devices support on-chip dual-access RAM (DARAM) that you can configure as data or program memory. The C54x can access this DARAM twice per machine cycle. TI based the C54x on a static-CMOS process that supports three power-down modes. A PLL allows you to throttle the clock.

  Addressing modes—The C54x supports single-data-memory-operand addressing that also supports 32-bit operands. It also supports dual-data-memory-operand addressing, which parallel instructions use. It provides immediate, memory-mapped, circular, and bit-reversed addressing.
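
  Bit-reversed addressing matters mainly for FFTs, in which the inputs or outputs appear in bit-reversed index order. As a brief illustration, this Python sketch computes that order in software; the hardware does the equivalent work during address generation, and the function name is illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Access order for an 8-point FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```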

  Special instructions—The C54x performs dedicated-function instructions, such as FIR filters; single and block repeat; eight parallel instructions (for example, parallel store and multiply accumulate); multiply and accumulate and subtract (10 multiply instructions); and eight dual-operand memory moves.

  Support—TI offers an evaluation module and an emulator that supports JTAG scan-based emulation for nonintrusive product test. The company also supplies a C compiler, a source-level C assembler/debugger, a linker, a simulator, a profiler, and an application library. Third-party tools and application algorithms are also available.

Variations

  TMS320C541—80/100/133 MHz, 40 mA max at 3.3V, 5k×16-bit program/data RAM, 28k×16-bit program/data ROM, two synchronous serial ports, timer, 100-pin TQFP, $19 to $23 (10,000).

  TMS320C542/C543—80/100 MHz, 65 mA max at 3.3V, 10k×16-bit program/data RAM, on-chip boot loader, time-division-multiplexed serial port, host-port interface (C542 only), timer, buffered serial port, 128- and 144-pin TQFP, $22 to $25 (10,000).

  TMS320C545/C546—80/100/133 MHz, 40 mA max at 3.3V, 6k×16-bit program/data RAM, 48k×16-bit program/data ROM, synchronous serial port, buffered serial port, host-port interface (C545 only), timer, 128-pin TQFP, $24 to $30 (10,000).

  TMS320LC549—133/166/200 MHz, 40 mA max at 3.3V, 32k×16-bit program/data RAM, 16k×16-bit program/data ROM, time-division multiplexed serial port (TDM), two buffered serial ports, host-port interface, timer, 144-pin µBGA, $31 to $40 (10,000).

Texas Instruments 16-bit, fixed-point TMS320C5x DSP family

The TMS320C5x, source-code-compatible with the C2x, operates as fast as 100 MHz with a 20-nsec instruction cycle. The static CMOS TMS320C5x is both an accumulator- and a register-based processor. It has a fixed-point multiply-accumulate (MAC) circuit with a registered, 16×16-bit multiplier loading a 32-bit product register. The product register, in turn, feeds a 32-bit accumulator without guard bits. The C5x also has two parallel functional units feeding off the data bus: an independent ALU with a register file of eight auxiliary registers and a bit-manipulation, or parallel-logic, unit (PLU). Multiply and accumulate take one cycle each. The basic MAC cycle involves putting a value into a temporary register, fetching a second value, multiplying into a holder register, and accumulating the result in the next cycle. However, if the TMS320C5x executes a MAC instruction within a hardware loop, the DSP can achieve single-cycle execution.

  For single-instruction cycle context switching, the C5x has a separate one-deep shadow-register stack for the major registers (accumulator, accumulator buffer, product and status registers, three temporary registers, index register, and auxiliary compare register). For control applications that need bit manipulation, the PLU runs in parallel with the MAC and ALU circuits. The PLU operations can set, clear, test, or toggle multiple bits in a control/status register or data-memory location without altering the accumulator contents. The C5x also has 0- to 16-bit left- and right-data barrel shifters.

  A power-down mode minimizes power by shutting down the CPU or the CPU and the peripherals. Pulling down the Hold pin can also force the chip into power-down mode. An interrupt brings the chip up to normal run conditions.

  Addressing modes—The C5x supports paged-memory direct addressing, in which 7 bits in the instruction concatenate with a 9-bit data-page pointer for accessing data RAM (128 words per page). It also supports indirect, immediate, dedicated-register, and memory-mapped-register addressing. The processor supports automatic circular-buffer addressing for two buffers. The addressing mechanism wraps the buffer when the address-generation unit lands exactly on the end of the buffer but not when it overshoots the end.
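
  The wraparound restriction is easiest to see with a step size larger than one. The Python sketch below models one reading of the behavior described above, in which the pointer wraps only when it sits exactly on the end address; the function names and the addresses are hypothetical.

```python
def next_address(pointer, step, start, end):
    """Advance a circular-buffer pointer; wrap only when it sits exactly on the end address."""
    if pointer == end:
        return start
    return pointer + step


def walk(pointer, step, start, end, count):
    """Return the first `count` addresses visited from `pointer`."""
    visited = []
    for _ in range(count):
        visited.append(pointer)
        pointer = next_address(pointer, step, start, end)
    return visited


unit_step = walk(100, 1, 100, 103, 6)  # lands on the end address and wraps
double_step = walk(100, 2, 100, 103, 4)  # jumps over the end address and runs on
```

  In this model, a step of two from address 100 skips the end address 103 and never wraps, so buffer lengths and step sizes must be chosen so that the pointer always lands on the end.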

  Special instructions—The C5x supports single and block repeat, load T (multiply) register and accumulate previous product; load T register, accumulate previous product, and move data; multiply and accumulate; multiply and accumulate previous product; square and accumulate; square and subtract previous product; call subroutine indirect; block move (with repeat instruction and program to data, data-to-data memory); table read/write; test and manipulate bit in memory.

  Support—The C5x has a JTAG port for chip test and in-circuit-emulator-like debugging control and monitoring. TI supplies a DSP starter kit, an evaluation module, and an emulator based on the C5x’s built-in emulation logic. TI supplies a C compiler, a source-level C assembler/debugger, an assembler/linker, a simulator, a profiler, and an application library. Third-party hardware and software tools are also available.

Variations

  TMS320C50—57/80 MHz, 60/67/94 mA max at 5V (core portion only), 20-kbyte program/data RAM, 4-kbyte ROM, two serial ports, one timer, 132-pin PQFP, $22/$24 (10,000).

  TMS320C51—57/80/100 MHz, 60/67/94 mA max at 5V (core portion only), 4-kbyte program/data RAM, 16-kbyte ROM, two serial ports, one timer, 132-pin PQFP/100-pin TQFP, $15/$16/$19 (10,000).

  TMS320C52—57/80/100 MHz, 60/67/94 mA max at 5V (core portion only), 2-kbyte program/data RAM, 8-kbyte ROM, one serial port, one timer, 100-pin PQFP/TQFP, $11/$12/$14 (10,000).

  TMS320C53—57/80 MHz, 60/67/94 mA max at 5V (core portion only), 8-kbyte program/data RAM, 32-kbyte ROM, two serial ports, one timer, 132-pin PQFP/100-pin TQFP, $19/$21 (10,000).

  TMS320LC56/LC57/C57S—57/80 MHz, 36/53 mA max at 3.3V (core only), 14-kbyte program/data RAM, 64-kbyte ROM (57S has 4-kbyte ROM), one buffered and one standard serial port, one timer, one host-port interface (LC57 and C57S only), 100-pin TQFP (LC56)/128-pin TQFP (LC57 and C57S), $20 to $37 (10,000).

Texas Instruments 16-bit, VLIW TMS320C6x DSP

TI’s TMS320C6x is the first general-purpose DSP processor based on a very-long-instruction-word (VLIW) architecture, which the company calls "VelociTI." Although TI’s first product based on this architecture, the TMS320C6201, is a fixed-point implementation, the VelociTI architecture will also support floating-point implementations in the future. The C6201 core comprises dual datapaths and dual matching sets of four functional units.

  The eight functional units comprise two 16×16-bit multipliers and six 32-bit arithmetic units, including a 40-bit ALU and a 40-bit barrel shifter. Each functional-unit set has its own bank of 16 32-bit registers but can access the other functional-unit set’s register bank through a single data bus. Register access between the two datapaths supports only one read operation per cycle. However, each functional-unit set can perform as many as two reads and one write per cycle from a register in its own bank. You can also issue multiple writes to a register on the same instruction cycle as long as the instructions have different latencies.

  Unlike most DSPs, the C6x does not support separate X and Y memory spaces. Instead, it provides a single data memory with two 32-bit paths for loading data from memory to the register banks. Two other 32-bit paths store register values to memory. A 32-bit address bus supports these datapaths. A 32-bit address bus also addresses the program memory, but the single datapath is 256 bits wide. This width allows the CPU to fetch, but not necessarily execute, eight 32-bit instructions per cycle. TI calls this group of eight instructions a "fetch packet."

  Keeping all eight functional units busy is the key to squeezing the highest performance from the C6x. In reality, data dependencies, instruction latencies, and resource conflicts limit optimal performance. Therefore, the CPU can execute one to eight instructions per cycle. The compiler and assembly optimizer play a big role in establishing the sequence of instructions for the C6x to execute. The programming tools determine parallelism at compile or assembly time. The programming tools link instructions in a fetch packet by the least significant bit of an instruction. If the bit is set, the C6x executes the instruction in parallel with the subsequent instruction.
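
  The parallel-bit chaining described above can be modeled in a few lines of Python. The sketch groups the eight words of a fetch packet into the sets that execute together, using only the rule stated in the text (bit 0 set means the instruction runs in parallel with the next one). The instruction words are made-up values, and the handling of a set bit on the last word of a packet is an assumption.

```python
def split_execute_packets(fetch_packet):
    """Group 32-bit instruction words into parallel sets using bit 0 as the chain flag."""
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:  # flag clear: the next instruction starts a new cycle
            packets.append(current)
            current = []
    if current:  # assumption: a trailing set flag simply closes the last group
        packets.append(current)
    return packets


# Hypothetical words: upper bits are a dummy opcode, bit 0 is the parallel flag.
flags = [1, 1, 0, 0, 1, 0, 0, 0]
words = [(index << 4) | flag for index, flag in enumerate(flags)]
group_sizes = [len(group) for group in split_execute_packets(words)]
```

  Here the fetch packet takes five cycles to issue. A packet with every flag but the last set would issue in one cycle, and a packet with all flags clear would take eight.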

  The compiler and assembly optimizer are responsible for performing dependency checking and parallelism among instructions. Therefore, the code executes as programmed on independent functional units and eliminates the need for core features, such as out-of-order execution or dependency-checking hardware.

  The C6x lacks a dedicated multiply-accumulate (MAC) unit. Instead, it performs MAC operations by using separate multiply and add instructions. Although this approach requires two instruction cycles, the pipelined effect yields apparent single-cycle execution. Using this design approach, TI engineers simplified the C6x’s functional units, which, in turn, allows them to run the core at 200 MHz.

  Addressing modes—The C6x performs linear and circular addressing. However, unlike most DSPs that have dedicated address-generation units, the C6x calculates addresses using one or more of its functional units.

  Special instructions—The processor conditionally executes all instructions, a method for reducing branching and therefore keeping the 11-stage pipeline flowing.

  Support—The development tools for PC ($2995) and Sun ($4995) host platforms include a C6x C compiler, an assembly optimizer, a simulator, a linker, and a debugger. TI also offers a hardware-emulation board that is compatible with the company’s XDS510 JTAG emulator interface. The assembly optimizer simplifies assembly-language programming and automatically schedules and parallelizes instructions from serial, inline assembly code. The assembler reads straight-line code without regard to registers or functional units and does the resource assignment. Deterministic operation allows the debugger to lock-step through the code. The debugger performs code profiling to determine the amount of time the processor spends in various portions of the code.

Variations

  TMS320C6201—200 MHz; 800 mA typ at 2.5V internal; 300 mA at 3.3V for I/O; 64-kbyte program RAM or cache; 64-kbyte data RAM; 32-bit external memory interface supports SDRAM, synchronous-burst RAM, and SRAM; two enhanced buffered serial ports with direct support for T1/E1 lines; a 16-bit host-access port; four DMA channels with boot-loading capability; PLL; 352-pin BGA; $135 (1000).

Texas Instruments 32-bit TMS320C8x multiprocessor DSP

TI’s TMS320C8x, formerly the Multimedia Video Processor (MVP), integrates as many as four parallel-processing DSPs (PPDSPs) and a 32-bit RISC master processor (MP) on a chip. The C8x processes multiple tasks in parallel, assigning each task to a specific processor and collectively delivering 2 billion-operation/sec performance.

  The C8x’s processors execute independently and concurrently. A high-speed, on-chip crossbar on the C80 connects the CPUs with 25 2-kbyte blocks of dedicated SRAM (11 4-kbyte blocks on the C82), which provides cache and RAM for each CPU. Any processor can access any RAM block. The crossbar handles nine to 15 simultaneous RAM accesses/clock cycle: three per PPDSP, two for the MP, and one for the transfer controller. Peak crossbar bandwidth is 4.2 Gbytes/sec (2.6 Gbytes/sec on the C82). Memory-mapped control registers handle both the transfer and the video controller; the chip has a separate 32-bit datapath between the µP, transfer controller, and video controller (C80 only), which enables the µP to set register values.

  The C8x has high processor throughput: Each CPU executes from its own crossbar-memory instruction cache—2 kbytes for the PPDSPs and 4 kbytes for the µP (4 kbytes for the C82’s parallel processors). The µP also has 4 kbytes of crossbar-memory data cache. Executing from crossbar instruction caches, the CPUs achieve apparent single-cycle execution.

  The PPDSPs have a 64-bit instruction word with three major subfields for controlling the data unit (with a 32-bit ALU and 16-bit multiplier) and each of two independent address units. Each address unit has a 32-bit datapath to the crossbar and also can perform general-purpose math. Three zero-overhead loop controllers support nested looping. The 32-bit ALU can split into two 16-bit ALUs or four 8-bit ALUs; the multiplier performs one 16×16-bit multiply or two 8×8-bit multiplies. Additional hardware supports bit-field and pixel processing.
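
  Splitting an ALU into narrower lanes means blocking the carry at each lane boundary. The following Python sketch models a four-lane, 8-bit packed addition on 32-bit words; it illustrates the general technique and is not a model of the PPDSP instruction set.

```python
def packed_add8(a, b):
    """Add four 8-bit lanes packed in 32-bit words, with no carry between lanes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift  # discard the carry out of each lane
    return result


split_sum = packed_add8(0x01FF10F0, 0x01012020)
plain_sum = (0x01FF10F0 + 0x01012020) & 0xFFFFFFFF
```

  The two results differ (0x02003010 versus 0x03003110) because the ordinary 32-bit addition lets carries ripple from one byte into the next.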

  The µP integrates a 64-bit FPU that shares a 31×32-bit register file with integer processing. A register scoreboard flags registers that are waiting for loads from memory or from the FPU to keep operations in order without unnecessary waiting.

  The µP’s FPU incorporates a single-precision floating-point multiplier and a double-precision floating-point adder. It also supports vector processing with built-in vector operations and four accumulators to hold interim vector results. Both the µP’s main integer path and its FPU units are pipelined. The µP has a three-stage pipeline. Single-precision multiply or double-precision addition are pipelined; the functions, therefore, can finish every cycle. The integer unit triggers FPU operations, which proceed independently. Vector instructions can start a multiply, add, and load or store every cycle, yielding a peak performance of 100 Mflops.

  The C8x is more than just a collection of CPUs and memory; it has its own on-chip I/O and memory controller, the transfer controller, which provides an adaptive memory interface with automatic byte alignment, as well as both linear and X and Y (frame buffer) addressing. It supports SDRAM, DRAM, VRAM, and SRAM, and the C80 also has a video controller to minimize design for video applications. The video controller supports two video frames (all video timing signals). It also has a special serial-register-transfer controller that lets the transfer controller control frame VRAM.

  Addressing modes—The C8x supports indexed, base, immediate, and relative addressing.

  Special instructions—The C8x performs bit and byte manipulation, log base 2, and many types of compare instructions. It supports conditional execution and hardware looping for single and block repeat.

  Support—TI sells a PCI-based software-development board as well as the XDS510 parallel-processing in-circuit emulator, which debugs C8x chips. Ariel (San Diego), Loughborough Sound Images (Loughborough, Leicestershire, UK), and Precision Digital Images (Redmond, WA) sell hardware-development systems. TI sells a software tool set that includes an assembler/linker, a C compiler, a simulator, and a parallel debugger. The debugger includes the µP debugger and multiple PPDSP debuggers. A parallel debug manager handles and coordinates individual CPU debuggers. TI fields a multitasking executive that runs on the µP, interfaces to a host CPU, and issues commands to the C8x’s PPDSPs. Spectron Microsystems’ (Santa Barbara, CA) Spox real-time, multitasking operating system ($12,000) has been ported to the C8x.

Variations

  TMS320C80—50 MHz, 3300 mA max at 3.3V, 305-pin CPGA, $260 (10,000).

  TMS320C82—50 MHz, 352-pin BGA, $90 (10,000).

Zilog 16-bit, fixed-point Z893xx DSP family

The Z893xx has an accumulator-based DSP architecture built around a single-cycle multiply-accumulate (MAC) unit, which includes a 16×16- to 24-bit multiplier with automatic truncation, a 24-bit product register, and a 24-bit accumulator and ALU with no guard bits. The DSP processor runs from a 4k- or 8k-word, one-time-programmable program ROM. (The Z8939x can also access 64k words of external program memory.) Two internal bus sets—a program address/data-bus set and a data-address/data-bus set—allow the processor to access program and data concurrently with a MAC operation.

  Two RAM blocks hold program coefficients and data, which automatically feed directly into the MAC’s input registers each cycle. RAM-block addressing automatically increments or decrements the address, which eliminates the need for data-address-generation code for each MAC cycle. Results of the MAC land in a product register and 24-bit accumulator each cycle. You can treat the product register as a general-purpose register when it is not performing multiplies. Although the Z893xx lacks a barrel shifter, a shifter between the product register and ALU allows you to shift the result right by 3 bits before adding it to the accumulator.

  The basic DSP chip has external program (Z8939x only) and I/O buses. You can use the I/O bus to access peripheral devices, such as an ADC. The DSP chip stores the data you access through the I/O bus in eight external registers, which the DSP core can access. However, because there is no DMA support, the processor must perform the data transfers. An external-memory read/write takes one cycle. You can insert a wait state using software control; you can use the wait pin for additional wait states. Running code from external memory takes one additional cycle for each instruction; the data is read in one cycle but is not available for processing until the next instruction cycle.

  Some Z893xx devices have a codec interface that is compatible with 8-bit PCM, 16-bit codecs, and 16-bit stereo sigma-delta codecs. Many general-purpose, 8- and 16-bit ADCs and DACs are adaptable to this interface. You can also use the interface as a high-speed serial port or general-purpose counter. Z893xx chips also have two 13-bit timers: one dedicated to the codec interface; the other, to general purpose.

  Addressing modes—The Z893xx supports memory-direct addressing for as many as 512 RAM-based words; it also supports register-indirect addressing to RAM or ROM with pointer registers and immediate, short-form direct addressing using 16-bit data registers in RAM. It provides one-cycle, external-peripheral addressing treating the peripheral as a register. Modulo-addressing options include modulo 2 to 256 for data access.

  Special instructions—The Z893xx performs compare register to accumulator, conditional execution of certain instructions, and conditional branching and subroutine calls. The Z893xx does not perform repeat (hardware looping) or bit manipulation.

  Support—Zilog offers a C compiler, an assembler/linker, a simulator, a source-level debugger, and application libraries. The company also has a TMS320-to-Z893xx assembly-code translator. Zilog sells an evaluation board and an in-circuit emulator for the Z893xx.

Variations

  Z89321—20 MHz; 50 mA max at 5V; 512-word RAM; 4k-word, mask ROM; codec interface; 40-pin DIP; 44-pin PLCC/QFP; $3.73 (10,000).

  Z89371—16 MHz; 50 mA max at 5V; 512-word RAM; 4k-word, one-time-programmable ROM; 40-pin DIP; 44-pin PLCC/QFP; $7.72 (10,000).

  Z89391—20 MHz; 50 mA max at 5V; 512-word RAM; ROMless; codec interface; 16-bit, external-instruction bus; 84-pin PLCC; $6.21 (10,000).

  Z89323—20 MHz; 50 mA max at 5V; 512-word RAM; 8k-word mask ROM; four-channel, 8-bit ADC; PLL; two PWMs; two watchdog timers; three timers; SPI; 44/68-pin PLCC; 44/80-pin QFP; $5.34 (10,000).

  Z89373—12 MHz; 50 mA max at 5V; 512-word RAM; 8k-word one-time-programmable ROM; four-channel, 8-bit ADC; PLL; two PWMs; two watchdog timers; three timers; SPI; 44/68-pin PLCC; 44/80-pin QFP; $11.49 (10,000).

  Z89393—20 MHz; 50 mA max at 5V; 512-word RAM; ROMless; four-channel, 8-bit ADC; PLL; two PWMs; two watchdog timers; three timers; SPI; 16-bit external instruction bus; 100-pin QFP; $9.49 (10,000).

Zilog 16-bit, fixed-point Z894xx DSP family

The accumulator-based Z894xx, originating from the Clarkspur core, provides an upward migration path for the Z893xx. Although the code is not binary-compatible, the Z894xx supports most of the Z893xx’s instructions. The Z894xx has a four-stage pipeline that delivers single-cycle multiplies and pipelined multiply-accumulate (MAC) instructions. The hardware multiplier performs a 16×16- to 32-bit multiply and transfers the result to the 32-bit ALU (with 8 guard bits for the MAC) or reiterates the multiplication. The address pointers can simultaneously address the two data RAMs for loading data into the multiplier.

  Zilog’s Z894xx contains a bit-field unit (BFU) with a 32-bit barrel shifter that can manipulate 16- or 32-bit values. The shifter can shift or rotate a 32-bit operand left or right and place the result in the accumulator. In addition, the BFU can extract a source-bit field and mask and merge it with the specified destination contents.
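
  The extract, mask, and merge operations of a bit-field unit reduce to shifts and masks. This Python sketch shows the general technique on 32-bit values; the function names and argument order are illustrative and are not Zilog mnemonics.

```python
def extract_field(source, lsb, width):
    """Return `width` bits of source, starting at bit position `lsb`."""
    return (source >> lsb) & ((1 << width) - 1)


def merge_field(dest, field, lsb, width):
    """Insert `field` into dest at `lsb`, leaving all other destination bits unchanged."""
    mask = ((1 << width) - 1) << lsb
    return (dest & ~mask & 0xFFFFFFFF) | ((field << lsb) & mask)


field = extract_field(0xABCD1234, 8, 8)         # picks out the byte 0x12
merged = merge_field(0xFFFFFFFF, field, 16, 8)  # drops it into bits 16 to 23
```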

  The DSP implements a Harvard architecture, providing independent program- and data-memory spaces that the DSP accesses simultaneously through X and Y buses in parallel operations. The chip contains an internal-data (ID) bus and a multiplier-product (P) bus. The ID bus provides access to RAM, the stack, the program counter, the RAM pointer, and the data-address space. The 32-bit P bus provides access to the ALU, accumulator, multiplier outputs, and BFU. You can treat a 32-bit product register as two 16-bit registers. External interfaces include separate address and data buses for simultaneous access of external program and data memory.

  The Z894xx provides three 12-bit register pointers for each RAM bank. The chip can automatically increment or decrement these pointers to implement circular buffers without software overhead. The Z894xx implements the same type of codec that the Z893xx devices include.

  Addressing modes—The Z894xx supports register, direct, indirect, indirect with bit-reversal (useful for some FFT algorithms), and immediate addressing.

  Special instructions—The Z894xx performs conditional execution of certain instructions, as well as conditional branching. Unlike the Z893xx, the Z894xx performs repeat (hardware looping) and bit test and manipulation. An instruction can zero all bits in the flags except the one of interest and store that value in the accumulator. You can also merge flags into the accumulator without overwriting previous bits.

  Support—Zilog offers an emulator, an assembler, a linker, a C compiler, a simulator/debugger, and an evaluation board. Zilog offers protopacks to accommodate differing packaging options.

Variations

  Z89462—40 MHz, 90 mA max at 5V (20 MHz, 30 mA max at 3.3V), 1k-word program RAM, 1k-word data RAM, dual codec interface, two 16-bit timer/counters, 100-pin VQFP, $11 (10,000).

Butterfly 24-bit, complex, fixed-point BDSP9124/9320 DSP chip set

Butterfly DSP’s chip set with BDSP9124 DSP and BDSP9320 memory manager performs DSP functions, such as digital filtering, image recognition, image compression, spectrum analysis, correlation, convolution, and adaptive filtering in the frequency or time domain.

  The BDSP9124’s quad-port architecture includes two bidirectional data ports, a bidirectional acquisition port, and a bidirectional coefficient port. Its 24-bit-wide, multiport-data-flow structure eliminates the need for external data multiplexing. This structure also allows single-port asynchronous or synchronous memories to serve each bus.

  This design’s bidirectional nature is conducive to the development of recursive, single-processor systems that process algorithms by passing the data through the chip several times. With six onboard butterfly units and two 60-bit accumulators, the BDSP9124 architecture differs from single-multiply, accumulator-based DSPs. When performing the high-level instructions, each BDSP9124 moves more than 10 Gbps through its I/O port.

  The BDSP9320’s memory-management unit provides more than 150 memory-address sequences and system synchronizations. With 20 address bits, the BDSP9320 directly addresses 1M words of memory, permitting very large arrays, 2-D arrays, or support for as many as 32 independent channels. The chip uses a circular-buffer technique with pointers for multiple-channel processing.

  The BDSP9124/9320 supports cascaded; single-instruction, multiple-data; and parallel-processing structures. Two cascaded chips process a complex input stream twice as fast as one. Five cascaded chips perform a 1 million-point FFT at a sustained 50-MHz complex sample rate. The architecture is memory-latency-insensitive.

  Addressing modes—The 9320 generates address sequences that you would use to access data memory only for DSP-type algorithms. Additionally, the 9320 provides addressing sequences that access data in sequential order for block-data operations.

  Special instructions—The architecture minimizes software programming by embedding 26 macro DSP instructions in the silicon. These macros include real and complex FIR-filter and radix-2, -4, and -16 operations. The instructions use the parallelism inherent in DSP algorithms. This capability allows a 50-MHz BDSP9124 to perform radix-16 Butterfly operations in 320 nsec and a 1k-tap, complex, 24-bit FFT in 65 µsec. Three cascaded BDSP9124s perform the same operation in 21 µsec.
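
  For readers unfamiliar with the term, the radix-2 butterfly that such macros build on can be sketched in a few lines of Python. This is the textbook decimation-in-time form, not Butterfly DSP's microcode; the function names are hypothetical, and the sketch uses floating-point complex numbers rather than the chip's 24-bit fixed-point data.

```python
import cmath


def twiddle(k: int, n: int) -> complex:
    """Twiddle factor W = exp(-2*pi*j*k/n) for an n-point transform."""
    return cmath.exp(-2j * cmath.pi * k / n)


def radix2_butterfly(a: complex, b: complex, w: complex):
    """One radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b               # one complex multiply
    return a + t, a - t     # one complex add and one complex subtract


# A 2-point DFT is a single butterfly with a twiddle factor of 1.
upper, lower = radix2_butterfly(1 + 0j, 2 + 0j, 1 + 0j)
```

  Each butterfly costs one complex multiply and two complex additions, which is the kind of parallelism a part with several hardware butterfly units can exploit. For scale, 320 nsec at 50 MHz corresponds to 16 clock cycles.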

  Support—Butterfly DSP provides PC-based software simulators, evaluation boards, and ASIC support. The emulation and software simulators allow real-time debugging. The company also provides C compilers.

Variations

  BDSP9124 DSP—40/50 MHz, 600 mA max at 5V, 352-pin BGA, $1085 (100).

Motorola 24-bit, fixed-point DSP5600x DSP family

The DSP5600x integrates µC buses and µC architectural concepts with a DSP multiply-accumulate (MAC) core and X- and Y-memory blocks. The µC buses allow the DSP to interface to multiple masters and relinquish the DSP bus to another microcontroller. The µC architecture implies a flexible register set and program model.

  Like most other DSPs, the DSP56000 has a versatile external memory bus, standard bit-manipulation capabilities, and the ability to execute directly from external memory with single-instruction-cycle accesses. The chip has no on-chip program ROM, except for a small boot ROM on some versions. However, the DSP56000 can access external memory each instruction cycle with no time penalties.

  In the traditional sense, the DSP56000 is an accumulator-based machine because all math and logic operations go through the accumulator. However, the architecture does allow bit manipulation on registers and memory. It has a single-cycle MAC unit, but the unit has two 56-bit accumulators (8 guard bits); two sets of two 24-bit registers feed the unit. Before you use the data, you must load it into the MAC registers; however, the MAC takes only one cycle (two clocks) for a multiply and an accumulate. Other registers include control and addressing registers. The memory-mapped control registers are discrete but are addressed by memory location.

  Like many other DSPs, the DSP56000 has two identical address generators that automatically access X and Y memories for MAC cycles. Each address generator has a 56-bit ALU and four sets of three registers: Four pointer registers each have an associated offset and modifier register. The modifier registers can specify the type of address-register arithmetic operations, or they can hold data. The modifier registers support a FIFO buffer and bit-reversed addressing.

  The processor combines 16-bit addressing with 24-bit words. It has three internal address- and data-bus pairs that allow an instruction fetch and two data accesses in one cycle and, therefore, avoid the need for an on-chip cache. A fourth bus, the global data bus, is a simple 24-bit logic bus that transfers data to and from on-chip peripherals. You can switch any of the internal address and data buses into the external 16-bit address and 24-bit data bus; external devices can access internal memory via a bus request to the DSP. When the 56000 stores 56-bit values to 24-bit memory or registers, you can deploy an optional 1-bit shift operation and saturate the value to ±1.0. Unlike with other DSPs, the DSP56000’s X and Y memories have their own address spaces, which include on-chip RAM and ROM for the bottom addresses. An internal bus-switch unit handles transfers between internal buses and the single external bus. The bit-manipulation unit performs bit operations on memory values and address, control, and data registers.
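
  The role of the 8 guard bits and the saturating store can be illustrated with a rough Python model. It assumes 24-bit two's-complement fractional operands and a 56-bit accumulator as described above; the names are hypothetical, the store truncates rather than rounds, and the optional 1-bit scaling shift is not modeled.

```python
Q23_MIN, Q23_MAX = -(1 << 23), (1 << 23) - 1   # 24-bit fractional operand range
ACC_MIN, ACC_MAX = -(1 << 55), (1 << 55) - 1   # 56-bit accumulator range


def to_q23(x: float) -> int:
    """Quantize a value in [-1.0, 1.0) to a 24-bit fraction (clamped)."""
    return max(Q23_MIN, min(Q23_MAX, round(x * (1 << 23))))


def mac(acc: int, x: int, y: int) -> int:
    """Return acc + x*y with the fractional product aligned to 47 fraction bits."""
    product = (x * y) << 1          # Q23 * Q23 gives Q46; shift left to Q47
    total = acc + product
    if not ACC_MIN <= total <= ACC_MAX:
        # Modeling choice: flag the case instead of wrapping like hardware would.
        raise OverflowError("sum exceeded the 8 guard bits")
    return total


def store_saturated(acc: int) -> int:
    """Reduce the accumulator to 24 bits, saturating to roughly +/-1.0."""
    return max(Q23_MIN, min(Q23_MAX, acc >> 24))


acc = 0
x = to_q23(0.75)
for _ in range(4):                  # 4 * 0.75 * 0.75 = 2.25, beyond +1.0
    acc = mac(acc, x, x)
print(store_saturated(acc) == Q23_MAX)  # True
```

  The running sum of 2.25 fits in the accumulator because of the guard bits; only the final store to a 24-bit destination clips it to the largest positive fraction.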

  Addressing modes—The 56000 supports register-direct, memory-direct, register-indirect, immediate, and bit-reversed addressing.

  Special instructions—The 56000 performs do/end-do, single- or block-instruction hardware looping, bit manipulation, compare, divide iteration, jump if bit clear/set, conditional jump to subroutine, and move program memory. It performs logic operations only on bits 24 through 47 of the accumulator; these bits represent the most significant part of the data.

  Support—Motorola offers several low-cost DSP5600x evaluation boards as well as a 40-MHz application-development system. Third-party hardware tools are also available. The 56000 uses a proprietary debug interface, OnCE, in lieu of the standard JTAG interface. Motorola supplies a Gnu C compiler and debugger, an assembler/linker, and a simulator. Third-party vendors supply data-acquisition and filter-design packages, as well as OS software.

Variations

  DSP56002/56L002—40/66 MHz; 90 mA max at 5V (56L002: 40 MHz; 50 mA max at 3.3V); 512×24-bit program RAM; two 256×24-bit data RAMs; 64×24-bit boot ROM; two 256×24-bit data ROMs containing sine, logx, and 2x tables; 8-bit host-processor interface with DMA support; 24-bit timer; PLL; power-saving modes; synchronous serial interface; serial communication interface; 24 I/O pins; 132-pin TQFP; $12.10 (100,000).

  DSP56004/56007—50/66 MHz; 90 mA max at 5V; 512×24-bit program RAM; two 256×24-bit data RAMs; 32×24-bit boot ROM; two 256×24-bit data ROMs containing sine, A- and µ-Law tables; serial host-processor interface; PLL; power-saving modes; serial audio interface; four general-purpose I/O pins; in 80-pin PQFP, $8.50 (100,000) (56007: 6400×24-bit program ROM; 3200×24-bit data RAMs; 52×24-bit boot ROM; two 512×24-bit data ROMs; $8/$12.30 (100,000)).

  DSP56005—50 MHz; 125 mA max at 5V; 4608×24-bit program RAM; two 256×24-bit data RAMs; 96×24-bit boot ROM; two 256×24-bit data ROMs containing sine, logx, and 2x tables; external memory expansion with 16-bit address and 24-bit data buses; bootstrap loading from external data bus, host interface, or serial-communication interface; 132-pin PQFP; $24.30 (100,000).

  DSP56009—80 MHz; 150 mA max at 5V; 512×24-bit program RAM; 10k×24-bit program ROM; 8960×24-bit data RAMs (you can switch as much as 2304×24 bits from X and Y to program RAM, giving a total of 2816×24 bits of program RAM); 64×24-bit boot ROM; 4864×24-bit data ROMs; bootstrap loading from external data bus or serial host interface; in 80-pin PQFP, $19.40 (100,000).

  DSP56011—80 MHz; 135 mA typ at 5V; 2.9k×24-bit configurable program/data RAM; 4.2k×24-bit program ROM; 1.8k×24-bit data ROM; serial audio interface; serial and parallel host interfaces; digital audio transmitter supports SPDIF, IEC958, CP-340, and AES/EBU formats; in 100-pin TQFP, $15 (100,000).

Motorola 24-bit, fixed-point DSP563xx DSP family

The 563xx is Motorola’s highest performance fixed-point DSP architecture. The core uses a seven-stage pipeline (two fetches, one decode, two address generations, and two executions) to achieve single-cycle instruction execution. Although a branch penalty is three cycles, the 563xx supports conditional ALU instructions, which often avoid the need to change program flow.

  When the processor executes a single-cycle multiply-accumulate (MAC) operation, the first execute stage does the multiply, and the second stage does the accumulate. The register-based architecture of the 563xx uses an interlocking mechanism that automatically inserts a no-operation (NOP) instruction into the pipe to avoid stalls. This approach permits execution to "catch up" with data dependencies.

  The 563xx is binary-code-compatible with the 56000, but the 563xx also supports addressing modes that include address-register PC relative. This mode is useful for multitasking and position-independent code, which lets a programmer deliver and relocate object modules without relinking to the original code. Motorola expanded addressing on the 563xx to the full 24 bits, up from 16 bits on the 56000 family. Unlike the DSP56000, which has a 16-location stack limit, the DSP563xx implements an overflow mechanism for the on-chip hardware stack to off-chip data memory. Although the mechanism prevents unrecoverable stack overflows, the chip takes a two-clock penalty when externally dumping stack entries.

  The 563xx core integrates a six-channel DMA that operates concurrently with the core’s execution units and has separate address and data buses. The DMA transfers data among memories (P, X, and Y) or among memory and peripherals or the external host buses (PCI or ISA).

  You can convert the device’s flexible program RAM to a mixture of program RAM and a 1024×24-bit, eight-way, fully associative instruction cache that you can lock at the "way" level. The instruction cache is useful for large programs that require partial storage in external memory. The cache uses a least recently used sector-replacement algorithm.

  The DSP runs at 3.3V but has 5V-tolerant I/O. The static core operates from dc to 80 MHz and uses a PLL with a built-in prescaler that allows dynamic clock throttling. For additional power savings, the core automatically powers down unused memories, peripherals, and core logic on every instruction.

  Addressing modes—The 563xx supports register-direct, address-register-indirect, PC-relative, immediate, and absolute addressing.

  Special instructions—The 563xx’s barrel shifter supports multibit-shift instructions in both directions and by any number of bits. The shifter also supports instructions for bit-stream parsing and generation. The device can conditionally execute all ALU instructions, based on conditions that include zero, negative, and overflow. If the condition is false, the processor executes a NOP instruction instead. The 563xx performs 16-bit arithmetic that is useful for handling various compression algorithms, such as LD-CELP (low-delay code-excited linear prediction). Normally, when using a 24-bit architecture for 16-bit arithmetic, performance degrades because you have to round the 24-bit numbers in software.

  Support—Motorola backs the 563xx family with a host of development tools. You can use an application-development system, the DSP5630ADS, to evaluate the chip and debug target systems. The device comes with an assembler, simulator software, and a C compiler. The 563xx’s JTAG-based OnCE port allows you to examine all internal buses in real time and record the last 12 change-of-flow instructions. Domain Technologies (Plano, TX) and Sonitech (Wellesley, MA) offer PC-based emulators that use the DSP563xx’s OnCE port. Momentum Data Systems (Costa Mesa, CA) and Spectrum (Burnaby, BC, Canada) offer 56301-based boards with a PCI-bus interface.

Variations

  DSP56301—66/80/100 MHz; 112 mA max at 3.3V; 4-kbyte program RAM; 4-kbyte data RAM; DRAM controller; expansion bus that connects to DRAMs and to asynchronous and synchronous SRAMs; three 16-bit timers; 42 I/O pins; PCI and ISA interfaces; two synchronous serial interfaces; a serial-communications interface; 208-pin TQFP and 252-pin BGA; $42.20 (100,000).

  DSP56302—66/80 MHz, 112 mA max at 3.3V, 6.7k×24-bit program RAM, 4.7k×24-bit data RAM, parallel host interface, three 16-bit timers, 34 I/O pins, bytewide host interface, two synchronous serial interfaces, serial-communication interface, 144-pin TQFP, $51.80 (100,000).

  DSP56303—66/80/100 MHz, 112 mA max at 3.3V, 4-kbyte program RAM, 4-kbyte data RAM, parallel-host interface, three 16-bit timers, 34 I/O pins, bytewide host interface, two synchronous serial interfaces, a serial communications interface, 144-pin TQFP or 196-pin BGA, $24.20 (100,000).

  DSP56304—66/80 MHz, 112 mA max at 3.3V, 11k×24-bit program ROM, 6k×24-bit data ROM, 2k×24-bit configurable program/data RAM, parallel host interface, three 16-bit timers, 34 I/O pins, bytewide host interface, two synchronous serial interfaces, a serial-communications interface, 144-pin TQFP, $20.50 (100,000).

  DSP56305—66/80/100 MHz, 112 mA max at 3.3V, 2k×24-bit program ROM, 1k×24-bit data ROM, 3.75k×24-bit configurable program/data RAM, DRAM controller, three 16-bit timers, 42 I/O pins, PCI and ISA interfaces, three Global System for Mobile communications coprocessors, 252-pin BGA, $42.20 (100,000).

Analog Devices 32-bit, floating/fixed-point ADSP-21020 DSP

The ADSP-21020 provides the foundation for Analog Devices’ SHARC DSP. As with earlier Analog Devices DSPs, the ADSP-21020 uses a 48-bit instruction word to encode multiple operations per instruction. The most complex instruction can perform three computations, two data moves, and two pointer calculations in one cycle. One disadvantage, however, is that the large instruction word adds to system cost; the 21020 needs access to dual external large memories. The chip’s enhanced Harvard architecture supports two data-address generators (DAGs) and two external buses with programmable wait states: a 48-bit instruction bus and a 40-bit data bus with 24 and 32 bits of addressing, respectively.

  The 21020 lacks on-chip program or data memory. Instead, the CPU achieves single-cycle multiply-accumulate (MAC) instructions by executing the inner-loop instructions from the 21020’s 32-word on-chip cache and bringing the coefficients and data from external memory. The cache caches only the instructions that use program memory for data, yielding virtual three-bus performance. A bit in the DSP lets you freeze cache contents, which helps eliminate the overhead of starting a time-critical loop.

  Unlike earlier DSP designs, the ADSP-21020 is not an accumulator-based design. Operations center on a 32×40-bit, 10-port register file that holds multiple accumulators and registers, providing more flexibility for C compilation and assembly programming. The data registers support fixed- or floating-point formats, depending on how the instructions reference them. The 21020 has 10 ports, with only nine active in one cycle, that link the DSP’s three computational units and the data and program buses to the register file. For fast context switching, the DSP shadows the register file and all DAG registers.

  The ADSP-21020’s three computational units comprise a floating-point multiplier with dual, fixed-point accumulators; a 32-bit barrel shifter; and a floating- and fixed-point ALU that does fixed- or floating-point math. The 80-bit-wide accumulators provide 16 bits of headroom for bit growth, which is especially useful for large MAC strings. The three units can operate in parallel, each accessing inputs from and returning results to the register file. Operations are concurrent unless a conflict results, such as when two units access the same register. Each functional unit executes in one clock cycle.

  The ALU’s flag register holds the results of as many as eight ALU compare operations. The flag-register bits form a right-shift register; when the processor executes an ALU compare operation, these bits shift toward the least significant bit. You can use the accumulated compare flags to implement 2- and 3-D geometrical transforms for graphical clipping operations.
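
  How accumulated compare flags might serve a clipping test can be sketched in Python. The model below assumes an 8-bit field in which each new compare result enters at the top and older results shift toward the least significant bit, as described above; the function names, the choice of comparisons, and the bit assignment are illustrative assumptions rather than the 21020's documented behavior.

```python
def push_compare(flags: int, result: bool) -> int:
    """Shift the 8-bit flag field right by one and insert the new result on top."""
    return ((flags & 0xFF) >> 1) | (0x80 if result else 0)


def outcode(point, low, high) -> int:
    """Run two compares per coordinate and return the accumulated flags."""
    flags = 0
    for coord, lo_bound, hi_bound in zip(point, low, high):
        flags = push_compare(flags, coord < lo_bound)   # outside on the low side
        flags = push_compare(flags, coord > hi_bound)   # outside on the high side
    return flags


def inside_clip_volume(point, low, high) -> bool:
    """A point is inside when no compare in the sequence set a flag."""
    return outcode(point, low, high) == 0


print(inside_clip_volume((0.5, 0.5, 0.5), (0, 0, 0), (1, 1, 1)))  # True
print(inside_clip_volume((0.5, 2.0, 0.5), (0, 0, 0), (1, 1, 1)))  # False
```

  A 3-D point needs six compares, which fit in the eight-entry history, so one test of the accumulated flags replaces six separate conditional branches.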

  The 21020’s two DAGs access X and Y data. Each address generator has eight register sets supporting 16 simultaneous circular buffers; each register set comprises index, modify, base, and length registers. The circular buffers can reside at any memory address and can be any arbitrary length. Circular buffering is critical for managing tap-delay lines in any time-based algorithm or for managing data in time and frequency domain transforms. For example, in an application that performs reverberation in an audio algorithm, you can set pointers in one DAG to reference each surface of the room and pointers in the other DAG to access coefficients that represent the reflectivity of each of those surfaces. Without the 16 register sets, you would have to frequently save and restore register context or perform many pointer-value calculations in software.
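
  The pointer arithmetic behind a circular buffer can be sketched in Python using the four register names mentioned above (index, modify, base, and length). The class is a hypothetical software model of one register set, not Analog Devices' implementation.

```python
class CircularPointer:
    """Software model of one index/modify/base/length register set."""

    def __init__(self, base: int, length: int, modify: int = 1):
        self.base = base        # first address of the buffer
        self.length = length    # number of locations in the buffer
        self.modify = modify    # step applied after each access (may be negative)
        self.index = base       # current address

    def post_modify(self) -> int:
        """Return the current address, then advance it with wraparound."""
        address = self.index
        offset = (self.index - self.base + self.modify) % self.length
        self.index = self.base + offset
        return address


pointer = CircularPointer(base=100, length=5, modify=2)
print([pointer.post_modify() for _ in range(6)])  # [100, 102, 104, 101, 103, 100]
```

  In software, the modulo step costs extra instructions on every access; a hardware address generator performs the same wraparound as a side effect of the memory access, which is why the number of available register sets matters for multitap algorithms.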

  The 21020 minimizes the use of program branches by offering conditional execution of most instructions: The instructions use a preliminary condition test and, if the test is true, execute the main instruction. Delayed branches allow you to hide the latency associated with pipeline flushes. With delayed branching, you can save two cycles by inserting the branch instruction two slots before you want the branch to occur.

  Addressing modes—The 21020 provides immediate with 32 bits, indexed, bit-reversed, circular-modulo, register-direct, and indirect addressing. However, it must use indirect addressing for off-chip memory access.

  Special instructions—The 21020 performs bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution. The ADSP-21020 supports IEEE-754 single-precision, floating-point (23-bit data, 8-bit exponent, and sign bit) instructions; it can also use 32-bit, fixed-point formats, fractional, and integer (2’s complement or unsigned). The program sequencer’s six-level-deep count and loop stacks support six levels of interrupt nesting.
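
  The IEEE-754 single-precision layout mentioned above (sign bit, 8-bit exponent, and 23-bit fraction) can be inspected with a short Python sketch. The function names are hypothetical; the code uses only the standard struct module and handles normalized numbers only in the rebuild step.

```python
import struct


def decode_single(x: float):
    """Split a value, stored as IEEE-754 single precision, into its three fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31                 # 1 bit
    exponent = (word >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = word & 0x7FFFFF        # 23 bits, implied leading 1 not stored
    return sign, exponent, fraction


def rebuild_normalized(sign: int, exponent: int, fraction: int) -> float:
    """Recompute the value of a normalized number from its fields."""
    return (-1.0) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)


print(decode_single(-6.5))  # (1, 129, 5242880)
```

  Here -6.5 equals -1.625 × 2^2, so the stored exponent is 127 + 2 = 129 and the fraction field holds 0.625 × 2^23.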

  Support—Analog Devices provides a tool set that includes an ANSI C compiler, a C compiler with numerical C extensions for math and floating-point applications, a source-level debugger, an assembler/linker, a simulator, application libraries, and a PROM splitter. Analog Devices offers an assembly-language code compactor that you can use on the C-compilation output to parallelize operations. Analog Devices sells a full-speed in-circuit emulator and an evaluation board. Third-party tools include the Spox real-time OS, filter-design packages, a graphical application-development package, and other hardware tools. Analog Devices has also licensed the ADSP-21020 to Temic Semiconductors (Santa Clara, CA).

Variations

  ADSP-21020—33 MHz, 490 mA max at 5V, 32-bit timer, four programmable I/O pins, 223-pin PGA, $106 (1000).

Analog Devices 32-bit, floating/fixed-point SHARC DSP (ADSP-2106x)

In addition to the architectural features of the ADSP-21020, the fixed- and floating-point Super Harvard Architecture Computer (SHARC), or ADSP-2106x, integrates a large on-chip memory and an I/O controller to offload I/O. SHARC chips have two high-speed serial ports and a host/parallel port, both providing a direct interface to off-chip memory, peripherals, a host processor, and bus arbitration and link ports for interprocessor communication among six ADSP-2106x chips in a multiprocessor cluster.

  The ADSP-2106x’s CPU executes using on- or off-chip memory for a range of application code. Some SHARC chips contain as much as 512 kbytes of on-chip memory organized into two banks of dual-port RAM. This RAM holds large chunks of critical code and delivers sustained single-cycle memory accesses. You can use this memory to store a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for operation code and data, data memory for data, and a load from off chip using the chip’s I/O processor.

  SHARC includes an I/O controller that executes I/O transfers in parallel with CPU execution. The I/O controller offloads reads and writes between on- and off-chip memory, but delays occur when accesses contend for the same data. The controller manages 10 DMA channels, transferring data among internal memory; external peripheral devices; and the host, two serial ports, and six link ports. All DMA operations are zero-overhead data transfers that generally do not interrupt or delay core thread execution. The DMA controller allows you to dynamically control the external memory-bus width. The synchronous serial ports can transfer data as fast as 40 Mbps; the six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 240 Mbytes/sec.
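
  The 240-Mbyte/sec figure follows from simple arithmetic, sketched below in Python. The 40-MHz clock is an assumption taken from the fastest ADSP-2106x listed under Variations; the function name is hypothetical.

```python
def link_throughput_mbytes(clock_mhz: int, ports: int = 6, bytes_per_clock: int = 1) -> int:
    """Aggregate link-port throughput in Mbytes/sec."""
    return clock_mhz * ports * bytes_per_clock


print(link_throughput_mbytes(40))  # 240
print(link_throughput_mbytes(33))  # 198
```

  Because each port moves 4-bit nibbles, 1 byte/clock cycle implies two nibble transfers per clock on each link.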

  The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory, cache, and I/O from on- or off-chip peripherals—all in one cycle.

  SHARC provides six communication-link ports for array multiprocessing. These ports feed through the I/O controller and let you create meshes of DSP processors that can access each other’s memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports. Transfers pass through the I/O ports to and from internal memory. The I/O controller separates these transfers from mainstream DSP.

  A parallel port serves as a direct interface to off-chip memory, peripherals, or a host processor. As many as six ADSP-2106x chips can share this bus with a common system processor. SHARC offers a unified address space using a single 32-bit address bus and a single 32- or 48-bit data bus. For a 40-MHz clock, the chip supports a 15-nsec access time with zero-wait-state memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. SHARC treats this host as a memory-mapped device, with direct writes or reads to internal memory.

  Addressing modes—SHARC offers immediate, indexed, bit-reversed, circular-modulo, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)

  Special instructions—SHARC provides bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution. SHARC supports IEEE-754 single-precision, floating-point (23-bit data, 8-bit exponent, and sign bit), and a 40-bit extended IEEE format for additional accuracy (32-bit data).

  Support—Analog Devices sells a full-speed, nonintrusive, JTAG-based emulator that uses the ADSP-2106x’s built-in debugging capability. It runs under Microsoft Windows and supports debugging for multiprocessor systems. The company also supplies an EZ-Lab Development System, a PC plug-in card for multiple 2106x processors as well as an EZ-kit lite with a C compiler for $179. Third-party products include PC and VME multiprocessor cards and OSs. Analog Devices supplies a C compiler based on Gnu technology. This compiler supports Numerical C, which extends vector- and matrix-processing capabilities for signal processing. Other tools include an assembler/linker, a simulator, application libraries, a PROM splitter, and a C source-level debugger.

Variations

  ADSP-21060/21062—40 MHz; 650 mA typ at 5V; 540 mA at 33 MHz; 4- or 2-Mbit, dual-ported SRAM; 32-bit timer; 10 DMA channels; host-processor interface; two synchronous serial ports; 240-pin PQFP; $296/$196 (1000).

  ADSP-21061—33/40 MHz; 1-Mbit, dual-ported SRAM; six 32-bit timers; 10 DMA channels; host-processor interface; two synchronous serial ports; 240-pin PQFP; $71/$77 (1000).

Motorola 32-bit, floating-point DSP96002 DSP

Motorola’s DSP96002 is basically a 32-bit floating-point extension of the 24-bit fixed-point DSP56000. The 96002 has five major internal buses to speed multiple-operation processing. These buses include program-, X-, and Y-memory bus sets. They also include a global data bus for transferring address and local data and a DMA bus that supports two DMA channels. The on-chip DMA controller moves data without disrupting the DSP’s instruction thread.

  The DSP96002 also has two 32-bit external bus interfaces with separate address and data buses with page-mode DRAM support. These external interfaces have built-in multimaster capability. Another DSP96002 or a host processor can use a bus request to take over the bus and use it to access shared external memory or the DSP96002’s internal memory.

  Motorola’s DSP96002 presents a programming model nearly identical to that of the earlier 24-bit DSP56000 fixed-point processor. Motorola engineers extended the instruction set with floating-point instructions and extended the registers, including addressing registers, from 16 to 32 bits.

  Motorola built the register-based DSP96002 around a multiported-register file of 10 96-bit registers. Similar to the DSP56000, the DSP96002 has X and Y RAM and ROM blocks to supply the coefficients and variables for sum-of-product multiply-accumulate (MAC) calculations. MAC operations take input data from the register file; so, although MAC operations aren’t pipelined, the DSP must pipeline the X and Y data accesses to pump data into the register file before the downstream MAC operation uses the data.

  Execution units include a separate multiplier, an adder/subtracter that handles both add and subtract for FFT calculations, a logic unit, and a barrel shifter. These units support integer and floating-point operations with 11-bit exponents and 32-bit mantissas. The DSP96002 meets IEEE standards for single- and double-precision floating-point representations.

  The DSP96002 essentially has the same address-generation unit as that of the earlier DSP56000. This unit comprises two address generators that can operate concurrently. The generators each have three sets of four 32-bit registers: address (address pointers), offset (offset values), and modify registers. You load and access these registers via the global data bus. The DSP96002 has a flexible architecture; many typically hard-wired features are available as programmable options that you can set up via control registers. For example, the DSP chip supports a mix of address spaces ranging from a single unified address space to one that has separate 32-bit spaces for X, Y, and program memory.

  Addressing modes—The DSP96002 supports register-direct, memory-direct, register-indirect (postincrement/decrement by 1 or offset, indexed by offset), and immediate addressing. The address generator also supports modulo and bit-reversed addressing.

  Special instructions—The DSP96002 supports hardware looping with single and block repeat, bit test and change, compare graphics, conditional subroutine call and branch, convert integer to floating point and vice versa, reciprocal seed, and reciprocal square-root seed. The DSP96002 doesn’t support conditional execution instructions.

  Support—Motorola sells the Applications Development Module for evaluating and debugging the DSP96002. The module uses the processor’s on-chip emulation support to set breakpoints, single-step the CPU, and read/modify memory or registers. It does not provide JTAG support. You can configure the chip to run with external RAM for development. Some third-party hardware tools are available. Motorola supplies a Gnu C compiler and tools as well as an assembler/linker, a librarian, an application library, and a behavioral simulator. Third-party tools include C and Ada compilers, graphical development systems, filter-design software, and real-time OSs.

Variations

  DSP96002—33.3/40/60 MHz; 600 mA max at 5V; 1k×32-bit program RAM, which you can configure as an instruction cache; two 512×32-bit data RAMs; two 512×32-bit preprogrammed data ROMs containing sine, A-, and µ-Law tables; two 8- to 32-bit host ports with DMA support; power-saving modes; timer/event counter; 254-pin CQFP; $42 (100,000).

Texas Instruments 32-bit, floating-point TMS320C3x DSP

TI’s TMS320C3x integrates a von Neumann µP architecture with a high-performance, 32-bit, floating-point DSP multiply-accumulate (MAC) core. The C3x also performs fixed-point math based on a 24-bit mantissa width on the inputs. Although most designers use the C3x for its floating-point capability, fixed-point math is occasionally useful for functions such as clipping of image data. On the µP side, the C3x supports a unified, flexible, 24-bit address space (16 Mbytes×32 bits). On the DSP side, the C3x processor performs single-cycle MAC processing. The processor receives the next instruction while accessing two data values for the current instruction’s MAC cycle.

  The C3x family does not support IEEE floating-point formats. The C3x format uses an implied sign bit to increase precision. In most applications, the format distinctions become relevant only if the data is going to another processor.

  The TMS320C3x DSP comprises memory/access, central-core, and I/O subsystems. The memory/access subsystem comprises separate program, data, and DMA buses, which allow parallel program fetches, data reads and writes, and DMA operations. This internal busing scheme enables programs to access the next instruction and two data values simultaneously and to transfer data to or from the I/O subsystem in one cycle. The data-address buses share a data bus that can make two sequential RAM accesses in one cycle because the buses run at twice the speed of the processor core. Two 32-word, lockable, on-chip caches automatically load as the DSP accesses instructions from external memory. The two 4-kbyte RAM blocks hold parameters and constants for sum-of-products MAC processing, and a large ROM can hold code or coefficients for MAC processing (C30 only).

  The central core has its own set of buses to move data and results. These buses move data among internal registers; an integer/floating-point multiplier; a parallel, 32-bit barrel shifter/ALU; and the memory subsystem. The core stores results in extended-precision or auxiliary registers that hold the values. Two address generators in the subsystem generate the addresses to access the data memories. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file. The C3x uses a software stack to support context switching.

  The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals. On the C30, the peripheral bus links to an external expansion bus with a 13-bit address and 32-bit data bus.

  Addressing modes—The C3x supports register-direct, paged-memory-direct, register-indirect, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.

  Special instructions—The C3x performs single- or block-instruction hardware looping (supports nestable block repeats but lacks automatic save and restore of status), standard/delayed branches (standard empties pipe, delayed waits three cycles before changing the program counter), interlocked access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked), computed gotos (dynamic subroutine calls), and conversion of floating point to integer and vice versa. The C3x can also perform bit tests. You can specify instructions to execute in parallel.

  Support—TI supplies a full-speed in-circuit emulator and an evaluation module. The C3x lacks JTAG support but instead has a proprietary five-pin emulation interface. TI sells a tool set that includes a C and C++ compiler, an assembler/linker, a source-level debugger, a code profiler, a simulator, and an application library. Third-party tools include C and Ada compilers, multiple OS products, filter-design packages, advanced graphical-design tools, and various hardware tools.

Variations 

  TMS320C30—33 to 50 MHz, 600 mA max at 5V (excludes 50-MHz version), 4k×32-bit program ROM, 2k×32-bit data RAM, two timers, two serial ports, expansion bus; in 181-pin PGA, 33-MHz version, $135; 40-MHz version, $156; in 208-pin PQFP, 40-MHz version, $66; 50-MHz version, $75 (10,000).

  TMS320C31—33 to 60 MHz, 325 to 475 mA max at 5V, 2k×32-bit data RAM, boot ROM, two timers, serial port; in 132-pin PQFP, $35 to $42 (10,000).

  TMS320LC31—33/40 MHz, 300 mA max at 3.3V (excludes 40-MHz version), 2k×32-bit data RAM, boot ROM, two timers, serial port, two low-power modes; in 132-pin PQFP, $38 to $45 (10,000).

  TMS320C32—40 to 60 MHz; 390 to 475 mA max at 5V; 512×32-bit data RAM; boot ROM; two timers; serial port; two low-power modes; flexible 8-, 16-, or 32-bit-wide system-memory interface; in 144-pin PQFP, $10 (250,000).

 

Texas Instruments 32-bit, floating-point TMS320C4x DSP family

The C4x has seven internal buses and on-chip memories that help deliver single-cycle execution when walking through X and Y memories for a series of multiply-accumulate (MAC) operations. TI built the C4x around a five-port register file, and, rather than time-sharing a single bus system, the C4x features separate buses for program and two data fetches. Additionally, the C4x has a floating-point-unit multiplier, an ALU, and a barrel shifter for parallel operations. The C4x also performs fixed-point math based on a 24-bit mantissa width on the inputs.

  A 128-word cache enables the processor to deliver single-cycle pipelined execution and still use slower external memory. (The processor does not use the cache with internal memory.) Key inner routines fill the cache as they run. The CPU accesses an instruction from external memory and automatically loads the instruction into the cache, which is divided into four 32-word segments, or lines. The CPU uses a least-recently-used (LRU) algorithm to select the cache segment for the new instructions. You can freeze a segment in the cache by setting cache-freeze bits in the CPU-status register.
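  The cache behavior described above can be approximated with a small software model: four 32-word segments, word-by-word filling on misses, and least-recently-used replacement of whole segments. The class and attribute names are hypothetical, and several details are assumptions made for the sketch: a single `frozen` flag stands in for the cache-freeze bits, and any access to a cached segment, hit or miss, counts as a use for LRU ordering.

```python
from collections import OrderedDict

SEGMENT_WORDS = 32   # words per segment, from the text
SEGMENT_COUNT = 4    # four segments give the 128-word total


class InstructionCache:
    """Illustrative model of a segmented instruction cache with LRU replacement."""

    def __init__(self):
        # Maps segment tag -> set of word offsets already loaded.
        # Ordering runs from least to most recently used.
        self.segments = OrderedDict()
        self.frozen = False   # assumption: one flag freezes the whole cache
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        """Return True on a cache hit, False on a miss."""
        tag, offset = divmod(address, SEGMENT_WORDS)
        present = self.segments.get(tag)
        if present is not None:
            self.segments.move_to_end(tag)         # mark as most recently used
            if offset in present:
                self.hits += 1
                return True
        self.misses += 1
        if self.frozen:
            return False                           # a frozen cache is not updated
        if present is None:
            if len(self.segments) == SEGMENT_COUNT:
                self.segments.popitem(last=False)  # evict the LRU segment
            present = self.segments[tag] = set()
        present.add(offset)                        # load the fetched word
        return False
```

  Running a 40-instruction loop three times through this model gives 40 misses on the first pass and 80 hits afterward, which illustrates why inner loops that fit in the cache can run at full speed from slow external memory.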

  Six 8-bit independent communications ports support point-to-point communications with networks of C4xs and peripherals. (The C44 has only four ports.) Each port comprises eight data pins and four handshake signals. These ports free the 31-bit local and global external memory buses for program or data accesses to the processor’s 4G-word address space (C40 only). Program and data occupy a unified address space that you can configure according to your memory requirements. The local and global buses have different memory-block assignments within each memory space. I/O can also use the external buses.
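  Because each communications port has eight data pins, a 32-bit word must cross the port as four byte-wide transfers. The sketch below shows that packing and unpacking in software. The function names are hypothetical, and the byte order (least significant byte first) is an assumption for illustration; the text above does not specify it, and the handshake signals are not modeled.

```python
def word_to_bytes(word):
    """Split a 32-bit word into four bytes, least significant first (assumed order)."""
    if not 0 <= word < 1 << 32:
        raise ValueError("word must fit in 32 bits")
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def bytes_to_word(chunks):
    """Reassemble four bytes, received in the same assumed order, into a word."""
    if len(chunks) != 4:
        raise ValueError("expected exactly four bytes")
    word = 0
    for position, byte in enumerate(chunks):
        word |= (byte & 0xFF) << (8 * position)
    return word
```

  A round trip through both functions returns the original word, so the same pair can serve as a simple model of a sender and a receiver on opposite ends of a link.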

  A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU’s sequential threads. Such data movements do not overload the DSP processor with servicing overhead, although some data contention for memory may slow CPU execution.

  Addressing modes—The C4x supports register-direct, paged-memory-direct, register-indirect, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.
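  Bit-reversed addressing exists mainly to reorder FFT data without extra cycles. The sketch below shows the index mapping in software for a power-of-two buffer length. The function names are hypothetical, and the code illustrates only the arithmetic, not how the C4x computes the reversed address in its address generators.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(data):
    """Return `data` reordered so that element i comes from bit_reverse(i)."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or 1 << bits != n:
        raise ValueError("length must be a power of two")
    return [data[bit_reverse(i, bits)] for i in range(n)]
```

  For an eight-point buffer, index 1 (binary 001) maps to index 4 (binary 100), index 3 (011) maps to 6 (110), and so on, which is the ordering a radix-2 FFT produces or consumes.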

  Special instructions—The C4x performs single- or block-instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard and delayed branches, interlocked-access instructions for multiprocessing (load or store an integer or floating-point value and signal interlocked), conversions between floating point and integer, reciprocal and reciprocal-square-root seeds, and conversions to and from IEEE floating-point formats. The C4x also performs bit tests, and you can specify certain instructions to execute in parallel.
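  A reciprocal seed is typically a low-precision first estimate that software then refines. The sketch below shows one common refinement, Newton-Raphson iteration, in which each step roughly doubles the number of correct bits. The function names are hypothetical, and the 8-bit seed precision is an assumption for illustration; `reciprocal_seed` is a software stand-in that simply truncates a true reciprocal and does not represent the C4x hardware.

```python
import math


def reciprocal_seed(a, bits=8):
    """Stand-in for a hardware seed: 1/a with its mantissa cut to `bits` bits.

    The precision is an assumption for this sketch, not a C4x specification.
    """
    mantissa, exponent = math.frexp(1.0 / a)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def refine_reciprocal(a, iterations=3, bits=8):
    """Refine the seed with Newton-Raphson steps: x <- x * (2 - a * x)."""
    x = reciprocal_seed(a, bits)
    for _ in range(iterations):
        x = x * (2.0 - a * x)   # uses only multiplies and a subtract
    return x
```

  Each iteration needs only multiplies and a subtraction, which is why a seed instruction plus a few multiply steps can replace a division on a processor without a hardware divider.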

  Support—The development system includes scan-based emulation via the C4x’s JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can also string multiple C4x chips on a JTAG circuit for parallel debugging. One processor breakpoint can halt execution in an array of C4x chips, and you can single-step them all in lock step. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Software tools include C, C++, and Ada compilers; a source-level debugger for parallel debugging; an assembler/linker; and a simulator. TI also has an application library. Third-party support includes the Spox, Parallel C, Virtuoso, and Helios OSs, as well as a variety of hardware tools.

Variations

  TMS320C40—40/50/60 MHz, 680/850/1020 mA max at 5V, 4k×32-bit boot ROM, 2k×32-bit data RAM, two timers, 325-pin PGA, $160/$160/$176 (10,000).

  TMS320C44—50/60 MHz, 850/1020 mA max at 5V, 4k×32-bit boot ROM, 2k×32-bit data RAM, two timers, 24-bit external address buses, power-down mode; in 304-pin PQFP, $100/$110; in 388-pin BGA, $90/$99 (10,000).

Markus Levy, Technical Editor

You can reach Markus Levy at (916) 939-1642, fax (916) 939-1650, markuslevy@aol.com.




Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.