April 23, 1998
EDN's 1998 DSP 32-BIT Architecture Directory

Analog Devices SHARC DSP
The SHARC DSP uses a general-purpose, 10-port, 32-register data-register file to transfer data between the computation units and the data buses and to store intermediate results. The 48-bit instruction word accommodates a variety of parallel operations for concise programming. For example, the ADSP-2106x DSPs can conditionally execute a multiply, an add, a subtract, and a branch in one instruction. SHARC DSPs feature two data-address generators (DAGs), which implement circular data buffers. These DAGs contain sufficient registers to allow you to create as many as 16 primary and 16 secondary circular buffers. The buffers may start and end at any memory location, and the DAGs automatically handle address-pointer wraparound.

The ADSP-2106x SHARC chips have two high-speed serial ports and a host/parallel port, providing a direct interface to off-chip memory, peripherals, and a host processor. Link ports facilitate interprocessor communication and bus arbitration among as many as six ADSP-2106x chips.

The ADSP-2106x's CPU executes using on- or off-chip memory. Some SHARC chips contain as much as 512 kbytes of on-chip memory organized into two banks of dual-port RAM. You can use this memory to store a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for code and data, data memory for data, and an off-chip load using the chip's I/O controller.

SHARC's I/O controller executes I/O transfers in parallel with CPU execution. The I/O controller offloads reads and writes between on- and off-chip memory, but delays occur when accesses contend for the same data. The I/O controller manages all DMA channels, transferring data among internal and external memory and all peripherals, such as the host port, as many as eight serial ports, and six link ports. DMA operations generally do not interrupt or delay core thread execution. The DMA controller allows you to dynamically control the external-memory-bus width.
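The address-pointer wraparound that the DAGs perform for circular buffers can be illustrated in software. The following is a minimal Python sketch, not a model of the actual SHARC register set: the class name `CircularAddressGenerator`, its fields, and the example base address are assumptions chosen for illustration. It shows a post-modify update in which the pointer steps by a modify value and wraps so that it always stays inside the buffer.

```python
class CircularAddressGenerator:
    """Toy model of one circular-buffer pointer (hypothetical names)."""

    def __init__(self, base, length):
        if length <= 0:
            raise ValueError("length must be positive")
        self.base = base      # first address of the buffer (any location)
        self.length = length  # number of words in the buffer
        self.index = base     # current address pointer

    def post_modify(self, modify):
        """Return the current address, then step by `modify` with wraparound."""
        address = self.index
        # The offset from the base wraps modulo the buffer length, so the
        # pointer always stays inside [base, base + length).
        offset = (self.index - self.base + modify) % self.length
        self.index = self.base + offset
        return address


# Walk a 5-word buffer that starts at an arbitrary address, stepping by 2.
dag = CircularAddressGenerator(base=0x2003, length=5)
addresses = [dag.post_modify(2) for _ in range(6)]
print([hex(a) for a in addresses])
```

In hardware the wraparound costs no extra instructions; the modulo arithmetic here only imitates the result, not the mechanism.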
The synchronous serial ports support time-division-multiplexed serial streams and hardware companding and can transfer data as fast as 40 Mbps. The six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 240 Mbytes/sec. The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory or cache and permits I/O from on- or off-chip peripherals--all in one cycle. The 21060 and 21062 provide six communication ports for array multiprocessing. These ports feed through the I/O controller and let you create meshes of DSPs that can access each other's memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports. Transfers pass through the I/O ports to and from internal memory. The I/O controller separates these transfers from mainstream DSP. A parallel port serves as a direct interface to off-chip memory, peripherals, or a host processor. As many as six ADSP-2106x chips can share this interface with a common system processor. SHARCs offer a unified address space using a 32-bit address bus and a 32- or 48-bit data bus. For a 40-MHz clock, the chip supports a 15-nsec access time with zero-wait-state memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. SHARC treats this host as a memory-mapped device with direct writes or reads to internal memory. The newest SHARC DSP, the ADSP-21065, also provides a synchronous DRAM (SDRAM) interface that transfers data to and from SDRAM as fast as 240 Mbytes/sec, or twice the clock frequency. 
The glueless SDRAM interface can access 16- or 64-Mbyte SDRAMs and enables you to connect to any one of four external memory banks.

Addressing modes
SHARC offers immediate, indexed, bit-reversed, circular-modulo, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)

Special instructions
SHARC provides bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution of most instructions. SHARC supports IEEE-754 single-precision floating point (23-bit data, 8-bit exponent, and sign bit) and a 40-bit extended IEEE format for additional accuracy (32-bit data).

Support
Analog Devices sells a full-speed, nonintrusive, JTAG-based emulator that uses the ADSP-2106x's built-in debugging capability. It runs under Windows and supports debugging for multiprocessor systems. The company also supplies an EZ-Lab Development System, a PC plug-in card for multiple 2106x processors, and an EZ-Kit Lite with a C compiler for $179. Third-party products include PC and VME multiprocessor cards and OSs. Analog Devices supplies a C compiler based on GNU technology. This compiler supports Numerical C, which extends vector- and matrix-processing capabilities for signal processing. Other tools include an assembler/linker, a simulator, application libraries, a PROM splitter, and a C source-level debugger. The simulator models the 2106x core, including the pipeline and instruction cache, the memory subsystem and associated buses, interrupts, and the I/O processor and associated peripherals. The simulator accurately handles aborted pipeline stages, cache misses, and delay cycles associated with interrupts, bus contention, and looping.
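As a side note on the IEEE-754 single-precision layout mentioned under the SHARC special instructions (sign bit, 8-bit exponent, 23-bit data), the short Python sketch below splits a value into those three fields and rebuilds it. The function names are illustrative assumptions, and the sketch covers only normalized 32-bit values; it does not model the 40-bit extended format.

```python
import struct


def decode_single(value):
    """Split a float into IEEE-754 single-precision sign, exponent, fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1-bit sign
    exponent = (bits >> 23) & 0xFF    # 8-bit biased exponent
    fraction = bits & 0x7FFFFF        # 23-bit stored fraction
    return sign, exponent, fraction


def encode_normal(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its three fields."""
    # Normalized numbers carry an implicit leading 1 and an exponent bias of 127.
    return (-1.0) ** sign * (1.0 + fraction / 2**23) * 2.0 ** (exponent - 127)


fields = decode_single(-6.5)
print(fields, encode_normal(*fields))
```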
Analog Devices developed the simulation engine as a dynamic-link library, isolating it from the graphical user interface via a series of public application-programming interfaces (APIs). You can use these APIs to connect the simulator with other software models and have them exchange and synchronize signals.

Hyperstone E1-32
Zero-overhead looping on the E1-32 requires you to execute two multiply-accumulate operations per loop and use the latency cycles to perform the address calculations, data loads, and compare instructions. The E1-32's 100-MHz operation helps in this area. The 4-Gbyte address space divides into four blocks; you can individually configure each block for bus width and timing. The E1-32 integrates a fast-page-mode DRAM controller in one of the block spaces. You can use the other blocks for glueless connection of SRAM, EPROM, or other memory devices, each with its own timing and bus width. A separate I/O-address space also allows each I/O device to have its own timing.

Special instructions
Instructions can be 16, 32, or 48 bits long; this variation helps reduce code size. The variable-length instructions, which the E1-32 automatically prefetches, provide constants and native addresses as large as 32 bits. DSP instructions include multiply, complex and real multiply-accumulate and multiply-subtract, and complex addition and subtraction. Other special instructions include test leading zeros.

Development tools
Hyperstone offers a development starter kit, a PC-based development board, and the hyICE serial connector for stand-alone operation. The $3500 development board comes with as much as 8 Mbytes of DRAM, as much as 512 kbytes of SRAM, and as much as 128 kbytes of EPROM. You can attach your prototype hardware using the board's I/O expansion connectors. The company provides an ANSI C compiler, a source- and task-level debugger, an assembler, a linker, and a software profiler. Hyperstone offers hyDSP, a collection of subroutines that includes FFTs, discrete cosine transforms, multidimensional arithmetic, and a variety of digital filters. The hyRTK multitasking, real-time operating system performs pre-emptive task scheduling. The operating system, including an integrated debugging monitor and a floating-point library, takes less than 32 kbytes.
Third-party vendors Eonic Systems (www.eonic.com) and Etnoteam (www.etnoteam.com) also provide RTOS support.

Siemens Tricore
Tricore implements a Harvard architecture with separate address and data buses for program and data memories. Tricore is also a load/store architecture with 16 32-bit general-purpose data registers and 16 32-bit address registers. You can concatenate consecutive even-odd data registers to form eight 64-bit registers for extended precision. Unlike traditional DSPs, Tricore lacks separate X- and Y-memory spaces, which may require you to perform some loop unrolling to achieve the parallel performance of DSPs. As long as data is available for Tricore's execute unit, it can perform single-cycle MACs. The data side of the core has a 128-bit-wide bus to on-chip DRAM, which you can use to save two data and two address registers in one cycle to the cache.

Addressing modes
Tricore supports the typical addressing modes of a load/store architecture, including absolute, base+offset, preincrement, and postincrement. It also supports circular buffers for DSP filters and bit-reversed indexing for FFTs. You must align the start of the circular buffer to a multiple of the data size that the instruction using the buffer prescribes. The length of the buffer must also be a multiple of that data size.

Special instructions
The instruction set supports operations on Booleans, bit strings, characters, signed fractions, addresses, signed and unsigned integers, single-instruction multiple-data, and single-precision floating-point numbers. In addition to a plethora of microcontroller-oriented instructions, such as bit manipulation, Tricore supports the traditional DSP instructions, including multiply and MAC, saturate, scaling, and rounding. Tricore also supports packed arithmetic. Conditional add, subtract, and select instructions let the device avoid using conditional jumps.

Development tools
Tasking (www.tasking.com) and Green Hills (www.ghs.com) offer C- and C++-compiler, debugger, simulator, and RTOS support for Tricore.
Accelerated Technology (www.atinucleus.com) also supplies a Tricore RTOS. Nohau (www.nohau.com), Ashling (www.ashling.com), Hitex (www.hitex.de), and Lauterbach (www.lauterbach.com) supply Tricore in-circuit emulators. Tasking's instruction-accurate simulator allows you to analyze the basic functionality of your program. Siemens is also working on a cycle-accurate simulator, which the company expects to be available this year. The new simulator implements a flexible cache model that provides options, such as defining start and end addresses, the number of ways and cache lines, and the line size and banks. This simulator also includes branch-prediction logic and determines interrupt latency.

Texas Instruments TMS320C3x
The C3x family does not support IEEE floating-point formats. The C3x format uses an implied sign bit to increase precision. In most applications, the difference in data format is relevant only if you are passing the data to another processor. The TMS320C3x DSP comprises memory/access, central-core, and I/O subsystems. The memory/access subsystem comprises separate program, data, and DMA buses, which allow parallel program fetches, data reads and writes, and DMA operations. This internal busing scheme enables programs to access the next instruction and two data values simultaneously and to transfer data to or from the I/O subsystem in one cycle. The data-address buses share a data bus that can make two sequential RAM accesses in one cycle because the buses run at twice the speed of the processor core. Two 32-word, lockable, on-chip caches automatically load as the DSP accesses instructions from external memory. The two 4-kbyte RAM blocks hold parameters and constants for sum-of-products MAC processing, and a 32-kbyte ROM can hold code or coefficients for MAC processing (C30 only). The central core has its own set of buses to move data and results. These buses move data among internal registers; an integer/floating-point multiplier; a parallel, 32-bit barrel shifter/ALU; and the memory subsystem. The core stores results in extended-precision or auxiliary registers that hold the values. Two address generators in the subsystem generate the addresses to access the data memories. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file. The C3x uses a software stack to support context switching. The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. 
The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals. On the C30, the peripheral bus links to an external expansion bus with a 13-bit address and 32-bit data bus.

Addressing modes
The C3x supports register-direct, paged-memory-direct, register-indirect, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.

Special instructions
The C3x performs single- or block-instruction hardware looping (it supports nestable block repeats but lacks automatic save and restore of status); standard branches, which empty the pipe; delayed branches, which wait three cycles before changing the program counter; interlocked-access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked); computed gotos (dynamic subroutine calls); and conversion of floating-point to integer and vice versa. The C3x can also perform bit tests. You can specify instructions to execute in parallel.

Support
TI supplies a full-speed in-circuit emulator and an evaluation module. The C3x lacks JTAG support but has a proprietary five-pin emulation interface. TI sells a tool set that includes a C compiler, an assembler/linker, a source-level debugger, a code profiler, a simulator, and an application library. Third-party tools include C and Ada compilers, multiple OS products, filter-design packages, advanced graphical-design tools, and hardware tools.

Texas Instruments TMS320C4x
A 128-word cache enables the processor to deliver single-cycle pipelined execution and still use slower external memory. (It does not use the cache with internal memory.) Key inner routines fill the cache as they run. The CPU accesses an instruction from external memory and automatically loads the instruction into the cache, which is divided into four 32-word segments or lines. The CPU uses a least-recently-used algorithm to select the cache segment for the new instructions. You can freeze a segment in the cache by setting cache-freeze bits in the CPU-status register. Six 8-bit independent communications ports support point-to-point communications with networks of C40s and peripherals. (The C44 has only four ports.) Each port comprises eight data pins and four handshake signals. These ports free the 31-bit local and global external-memory buses for program or data accesses to the processor's 4G-word address space (C40 only). Program and data occupy a unified address space that you can configure according to your memory requirements. The local and global buses have different memory-block assignments within each memory space. I/O can also use the external buses. A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU's sequential threads. Such data movements do not overload the DSP with servicing overhead, although some data contention for memory may slow CPU execution.

Addressing modes
The C4x supports register-direct, paged-memory-direct, register-indirect, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.
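Bit-reversed addressing, mentioned above, exists mainly because radix-2 FFTs consume or produce data in bit-reversed index order. The Python sketch below shows the index permutation itself. The function names are assumptions for illustration, and the code imitates only the result of the addressing mode, not how any particular DSP generates the addresses.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(samples):
    """Return `samples` reordered by bit-reversed index (length must be 2**k)."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or 1 << bits != n:
        raise ValueError("length must be a power of two")
    return [samples[bit_reverse(i, bits)] for i in range(n)]


# For an 8-point transform, the indexes 0..7 are visited in this order.
order = bit_reversed_order(list(range(8)))
print(order)
```

A DSP with this addressing mode performs the reordering as part of the memory access, so an FFT routine needs no separate shuffling pass.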
Special instructions
The C4x performs single or block instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard/delayed branches, interlocked access for multiprocessing (load/store integer or floating-point value and signal interlocked), conversion of floating point to integer and vice versa, reciprocal and reciprocal square-root seed, and conversion to and from IEEE floating-point formats. It also performs bit tests. You can specify certain instructions to execute in parallel.

Support
The development system includes scan-based emulation via the C4x's JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can string multiple C4x chips on a JTAG circuit for parallel debugging. One processor breakpoint can halt execution in an array of C4x chips, and you can single-step them all in lock step. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Software tools include a C compiler, a source-level debugger for parallel debugging, an assembler/linker, and a simulator. TI has an application library. Third-party support includes the Spox, Parallel C, Virtuoso, and Helios OSs, as well as a variety of hardware tools.
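A reciprocal-seed instruction of the kind listed under special instructions usually serves as the starting point for Newton-Raphson refinement, which lets a DSP without a hardware divider compute a quotient with a few multiplies. The Python sketch below illustrates the idea under stated assumptions: `reciprocal_seed` is a hypothetical software stand-in (a linear estimate on the mantissa), not the estimate that any of these chips actually produces, and the sketch handles positive inputs only.

```python
import math


def reciprocal_seed(x):
    """Hypothetical stand-in for a hardware seed: a rough estimate of 1/x."""
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    estimate = 48.0 / 17.0 - (32.0 / 17.0) * mantissa  # linear fit to 1/mantissa
    return math.ldexp(estimate, -exponent)


def reciprocal(x, iterations=4):
    """Refine the seed with Newton-Raphson steps: y <- y * (2 - x * y)."""
    if x <= 0:
        raise ValueError("this sketch handles positive inputs only")
    y = reciprocal_seed(x)
    for _ in range(iterations):
        y = y * (2.0 - x * y)  # each step roughly squares the relative error
    return y


def divide(numerator, denominator):
    """Division expressed as a multiply by the refined reciprocal."""
    return numerator * reciprocal(denominator)


print(reciprocal(3.0), divide(10.0, 4.0))
```

Because each step roughly squares the relative error, a more accurate seed means fewer iterations to reach a given precision.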
Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.