March 2, 1998
Microprocessor and DSP technologies unite for embedded applications
Markus Levy, Technical Editor
Many embedded applications require a mixture of µP and DSP functionality; semiconductor vendors are creating hybrid devices to handle both types of processing. Before you begin your next design with discrete µP and DSP devices, check out the benefits and limitations that these hybrid devices offer.
What's the difference between a µP and a DSP? According to many industry cohorts, there is no difference. Diverse, high-volume applications, including cell phones, disk drives, antilocking brakes, modems, and fax machines, require both µP and DSP capabilities. This requirement has led many µP vendors to build in DSP functionality. In some cases, such as in Siemens' Tricore architecture, the functional merging is so complete that it's difficult to determine whether you should call the device a DSP or a µP. At the other extreme, some vendors claim that their µPs have high-performance DSP capability, when in fact they've added only a "simple" 16×16-bit multiply instruction.
Your key to a successful "DSPless" system design is understanding how much DSP capability you need and determining whether a µP can meet those needs. Most µPs come nowhere near implementing all the standard DSP functions (see box "Features that make DSPs DSPs"). But how much DSP functionality do you need? The answer depends on the trade-offs you're willing to make between having enough DSP horsepower to get the job done on the one hand and practical factors, such as cost, power consumption, ease of development, and board space on the other.
The fundamental choices for implementing DSP in your application range from discrete DSP devices to a DSP core and µP on the same die to a hybrid µP-DSP to a µP with an integrated multiply-accumulate (MAC) unit. To reach the top of the DSP-performance ladder, you must use discrete DSP devices.
But in exchange you accept higher cost and power consumption, more board space, and two development environments (one for the µP and one for the DSP). Putting DSP and µP cores on one die yields even better DSP performance and also helps reduce power consumption and board space (see box "Dueling with dual-core designs").
Hybrid processors represent the next step down the DSP-performance ladder. The prominent architectures in this category include ARM's Piccolo, Hitachi's SH-DSP, Hyperstone's E1-32, and Siemens' Tricore. Although these implementations differ, a hybrid processor is basically a µP tightly coupled to a DSP (or vice versa, depending on your viewpoint). In general, the hybrid approach permits the µP and DSP portions to share on-chip resources, such as memory and other peripherals, thereby helping to eliminate redundancies and to reduce power consumption. Furthermore, a hybrid processor typically allows you to use one set of software-development tools, a single RTOS, and one system task scheduler.
Hyperstone's E1-32, like Tricore, is a load/store architecture with an integrated DSP unit that works in parallel with the ALU: It can perform DSP calculations while the ALU performs loop counts, address calculations, or load/store operations. Hyperstone based the E1-32 on a two-stage pipeline, and the device can issue only one instruction per cycle. The DSP instructions require two or more cycles to complete, and the ALU executes its instructions during the latency cycles of DSP instructions. You or the compiler must arrange your code to take advantage of these latency cycles. And, because the E1-32 does not support separate X and Y memory blocks, you would have to perform all loads and stores during the latency cycles to achieve reasonable DSP performance. Zero-overhead looping on E1-32 requires you to execute two MAC operations per loop and use the latency cycles to perform the address calculations, data loads, and compare instructions. The 100-MHz operation of the E1-32 helps reduce the penalties incurred by having to execute the extra instructions.
The SH-DSP takes a slightly different approach to integrating its DSP functionality. In this case, Hitachi designers grafted the DSP unit onto the RISC's pipeline; the DSP unit shares the five-stage pipeline with the integer unit. Although the SH-DSP is a single core, the instruction stream comprises both integer and DSP instructions. When the CPU fetches and decodes instructions, it routes each instruction to the appropriate unit. Furthermore, each unit has its own set of registers; the DSP unit's registers are visible only to the DSP unit and to DSP extended load/store instructions. The SH-DSP also differs from Tricore and Hyperstone in that its DSP unit has its own data-memory space. Similar to more traditional DSPs, Hitachi's architecture has separately addressable X and Y memories. The main integer ALU calculates X addresses, and a separate, 16-bit pointer-arithmetic unit calculates Y memory addresses.
And, although the integer and DSP units can't execute instructions in parallel because they share the chip's internal address and data buses, the bus structure does allow the DSP to access two data operands and fetch an instruction during one cycle. During that same single cycle, the SH-DSP can also execute one ALU operation and a 16×16-bit multiply. The SH-DSP contains an instruction bus (I-bus), which the CPU also uses to load/store the RISC or DSP registers. If the CPU is loading or storing registers over the I-bus, the DSP cannot simultaneously perform other parallel operations. However, the DSP does allow parallel operations if you are using the X and Y buses for the moves to or from the on-chip X and Y memories, thus yielding sustained single-cycle MAC operations.
On another note, ARM's Piccolo (SP7) is more of a DSP coprocessor module than a hybrid processor. ARM licensees can attach SP7 to an ARM7TDMI core to add DSP functionality to an ARM chip design. SP7 instructions are incompatible with the standard ARM instruction set; the coprocessor executes them in a different pipeline, which means that you must debug two separate instruction streams. SP7 features a 16×16-bit single-cycle multiplier, a 32-bit barrel shifter, four 48-bit extended-precision accumulators, a saturation unit, and other DSP functions, but the coprocessor lacks separate X and Y memories for operand access.
To help SP7 achieve sustained single-cycle MACs, its interface includes a tagged input-queue structure and an output FIFO buffer. The ARM7TDMI core performs all the address generation for the DSP functions. The input queue, or reorder buffer, enables the ARM7TDMI to preload SP7 with data before SP7 requires the data, essentially demultiplexing multiple input-data streams for DSP algorithms. The reorder buffer allows ARM code to fetch DSP data or coefficients from memory and allows the DSP code to consume the items in the required order.
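The idea of a tagged input queue can be sketched in a few lines of C. This is only a software model of the general concept, not Piccolo's actual hardware interface: the names `reorder_buffer`, `rob_push`, and `rob_take`, the eight-slot depth, and the use of a small integer tag to identify a data stream are all assumptions made for illustration. The producer (standing in for the ARM code) pushes items in whatever order memory access is convenient, and the consumer (standing in for the DSP code) takes the oldest item of the stream it needs next.

```c
#include <stdint.h>
#include <string.h>

#define ROB_SLOTS 8   /* hypothetical depth, chosen for the example */

typedef struct {
    uint8_t tag;      /* identifies the stream, e.g. 0 = samples, 1 = coefficients */
    int16_t value;
} rob_entry;

typedef struct {
    rob_entry slot[ROB_SLOTS];  /* entries kept in arrival order */
    unsigned  count;
} reorder_buffer;

/* Producer side: append one tagged item. Returns 0 on success, -1 if full. */
int rob_push(reorder_buffer *rob, uint8_t tag, int16_t value)
{
    if (rob->count == ROB_SLOTS)
        return -1;
    rob->slot[rob->count].tag = tag;
    rob->slot[rob->count].value = value;
    rob->count++;
    return 0;
}

/* Consumer side: remove the oldest item carrying the requested tag.
   Returns 0 and stores the value in *out, or -1 if no such item is queued. */
int rob_take(reorder_buffer *rob, uint8_t tag, int16_t *out)
{
    for (unsigned i = 0; i < rob->count; i++) {
        if (rob->slot[i].tag == tag) {
            *out = rob->slot[i].value;
            /* Close the gap so the remaining entries stay in arrival order. */
            memmove(&rob->slot[i], &rob->slot[i + 1],
                    (rob->count - i - 1) * sizeof(rob_entry));
            rob->count--;
            return 0;
        }
    }
    return -1;
}
```

A return value of -1 from `rob_take` corresponds to the starvation case: the consumer wants data that the producer has not yet supplied, and in this model it is the caller's job to notice that.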
The ARM7TDMI core can transfer as many as 16 16-bit words of data in nine 32-bit bus cycles. SP7 uses a remapping scheme to automatically and transparently refill its registers from the reorder buffer as it uses and replaces old data. Because Piccolo reloads only from the reorder buffer, a data-intensive algorithm may starve the register file. Furthermore, the programmer must ensure that the ARM core responds to the needs of the DSP. In other words, Piccolo cannot interrupt or notify the ARM core when Piccolo needs data or has full output buffers. The ARM core performs bit-reversed addressing in software, but it can accomplish this task while SP7 is executing the first stage of the FFT. ARM claims that this approach eliminates any overhead. Register remapping allows ARM to logically remap physical registers within a loop to other register values. This approach effectively creates small circular buffers in the register banks.
TMS320C6x: hybrid µP or DSP?
Despite the fact that TI is known as a DSP company, its TMS320C6x blurs the line between µPs and DSPs. The C6x, which TI based on a very-long-instruction-word architecture, contains two 16×16-bit multipliers and six 32-bit arithmetic units with a 40-bit ALU and a 40-bit barrel shifter. You can use these functional units for integer and DSP operations. To maintain the flexibility of its functional units, the C6x lacks a dedicated MAC unit, which is fundamental to most DSPs. It instead performs MAC operations by using separate multiply and add instructions; the pipeline allows the C6x to effectively execute MAC operations in one cycle. Also unlike most DSPs, the C6x does not support separate X and Y memory spaces. Instead, it provides a single data memory with two 32-bit paths for loading data from memory to the register banks. The C6x lacks dedicated address-generation units and must calculate addresses, including circular buffers, using one or more of its functional units.
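To make concrete what "bit-reversed addressing in software" involves in the Piccolo discussion above, here is a minimal C sketch of the standard technique for a radix-2 FFT. The function names `bit_reverse` and `bit_reverse_permute` are invented for this example, and the code says nothing about how ARM's own routines are written; it shows only the index arithmetic that a DSP's address generator would otherwise do in hardware.

```c
#include <stdint.h>

/* Reverse the low `bits` bits of `index`.
   For an 8-point FFT (bits = 3), index 1 (001b) maps to 4 (100b). */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Reorder a buffer of 2^bits samples into bit-reversed order, in place. */
void bit_reverse_permute(int16_t *data, unsigned bits)
{
    uint32_t n = (uint32_t)1 << bits;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t j = bit_reverse(i, bits);
        if (j > i) {                 /* swap each pair only once */
            int16_t tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

The per-element loop is the software cost that a hardware bit-reversed addressing mode removes; overlapping it with other work, as the article describes, is one way to hide that cost.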
The third step down the DSP performance ladder is a µP that integrates a MAC unit. Similar to the hybrid approach, this DSP implementation helps to eliminate redundancies; in most cases, the MAC unit becomes part of the µP's pipeline. Depending on the processor supporting the MAC unit, this approach can yield enough DSP performance to handle low-end disk-drive servo-control loops, soft modems, software JPEG compression and decompression, and other types of low-data-rate DSP algorithms. For example, using software from AltoCom (Mountain View, CA; www.altocom.com), Integrated Device Technology's (IDT's) RV4640 µP can run a 56-kbps soft modem, but the µP may limit processing-power head room, according to IDT.
Although an integrated MAC unit lacks the functionality of a hybrid or discrete DSP, vendors play tricks to increase DSP performance. One obvious trick is implementing a high-frequency clock. For example, Digital runs its StrongARM µP at 200 MHz to compensate for the µP's inability to achieve sustained single-cycle MAC operations.
SGS-Thomson's ST10x262 is one of the few 16-bit µCs to include a MAC unit. The 262's MAC unit comprises a 16×16-bit parallel multiplier with a 40-bit accumulator and a 40-bit ALU to help support saturation. The MAC has two 16-bit datapaths to deliver two operands each cycle; these operands come from the 262's 256-word, dual-port RAM. This approach allows the 262 to deliver single-instruction-cycle MACs with two cycles per instruction until it has to reload the RAM. To improve DSP performance, the 262's MAC unit contains an interruptible repeat unit that supports as many as 8196 iterations through a loop. The 262 also includes special MAC instructions comprising multiply and MAC, 32-bit arithmetic, shifts, compares, and transfer instructions. The MAC unit first sign-extends each signed 32-bit product; you can then optionally either negate or round the product before the MAC adds it to the 40-bit accumulator register.
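The data flow just described for the 262's MAC (a signed 16×16-bit multiply, a sign-extended 32-bit product, an optional negate, and accumulation into a saturating 40-bit register) can be modeled in portable C as a rough sketch. The names `sat40` and `mac40` are invented here, the 40-bit accumulator is held in an `int64_t`, and the rounding option is left out because the article does not give the rounding rule; none of this should be read as the ST10's exact instruction semantics.

```c
#include <stdint.h>

/* Limits of a signed 40-bit accumulator, held in a 64-bit variable. */
#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL)
#define ACC40_MIN (-ACC40_MAX - 1)

/* Clamp a value to the signed 40-bit range (saturation). */
int64_t sat40(int64_t v)
{
    if (v > ACC40_MAX) return ACC40_MAX;
    if (v < ACC40_MIN) return ACC40_MIN;
    return v;
}

/* One multiply-accumulate step.
   acc    : current accumulator value, assumed already within 40 bits
   a, b   : signed 16-bit operands
   negate : nonzero to subtract the product instead of adding it */
int64_t mac40(int64_t acc, int16_t a, int16_t b, int negate)
{
    /* The 16x16 product fits in 32 bits; widening to 64 bits sign-extends it. */
    int64_t product = (int64_t)a * (int64_t)b;
    if (negate)
        product = -product;
    return sat40(acc + product);
}
```

The eight guard bits above the 32-bit product are what let a long sum of products run without overflow checks inside the loop; saturation only matters once the running total actually reaches the 40-bit limit.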
At the other end of the µP spectrum, IDT's 64-bit RV4640 supports a two-cycle MAC operation; high-frequency operation allows this chip to achieve good DSP performance. However, you can achieve a sustained two-cycle MAC only by loading operands into the register file during the intervening clock cycle while the CPU executes the MAC instruction. The hardware automatically interlocks. As with many µPs with integrated MAC units, you must also perform loop unrolling to obtain two-cycle MAC operations. Although the RV4640 supports a MAC instruction, the lack of zero-overhead-looping capability makes it impossible for you to write a single-instruction inner loop; instead, you must include a branch with a decrement at the end of the loop. A predicted branch consumes one instruction cycle, but you can make up for this extra cycle by increasing the µP's operating frequency. Another issue common to µPs with integrated MAC units is the lack of circular buffers; you must perform address manipulation in software. For example, using the RV4640's support for register and offset addressing, you can partially unroll a loop and use explicit offsets to an address pointer within the unrolled loop. The bottom of the loop can implement the pointer update and circular addressing.
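The unrolling pattern described above, with explicit offsets inside the loop body and the pointer update and circular wrap at the bottom, might look like the following in C. This is a generic sketch rather than RV4640-specific code: the function name `fir_unrolled`, the unroll factor of four, and the requirement that both the buffer length and the starting index be multiples of the unroll factor (so the wrap can occur only at the bottom of the loop) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL 4  /* hypothetical unroll factor */

/* Sum of coeff[k] * delay[(start + k) % len] for k = 0 .. len-1.
   Assumes len and start are both multiples of UNROLL, so the circular
   wrap never falls in the middle of an unrolled group. */
int64_t fir_unrolled(const int16_t *coeff, const int16_t *delay,
                     size_t len, size_t start)
{
    int64_t acc = 0;
    const int16_t *c = coeff;
    const int16_t *d = delay + start;
    const int16_t *end = delay + len;

    for (size_t n = len / UNROLL; n != 0; n--) {
        /* Explicit offsets from the two pointers: no per-MAC pointer update. */
        acc += (int32_t)c[0] * d[0];
        acc += (int32_t)c[1] * d[1];
        acc += (int32_t)c[2] * d[2];
        acc += (int32_t)c[3] * d[3];

        /* Bottom of the loop: one pointer update and one wrap test
           shared by all four MACs. */
        c += UNROLL;
        d += UNROLL;
        if (d == end)
            d = delay;
    }
    return acc;
}
```

Unrolling spreads the cost of the branch, the pointer updates, and the wrap test over several MACs, which is why it is the usual remedy when a core has neither zero-overhead looping nor hardware circular buffers.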
The ColdFire's MAC unit supports no zero-overhead looping. This lack means that its inner loop is a minimum of three instructions: the MAC, the loop-counter decrement, and the branch. The ColdFire's 90- to 100-MHz operation helps to compensate for some of these performance disadvantages. Circular-buffer support on ColdFire's MAC comprises a mask register with an autoincrement address mode. Without additional overhead, the CPU ANDs the mask register with the contents of the address register to constrain addresses into a circular loop; that is, the CPU prevents carry generation. This addressing scheme uses a 16-bit register, which yields circular buffers as large as 64 kbytes. Another important ColdFire feature for DSP performance, as well as other time-critical applications, is that it lets you store code in internal SRAM or cache. The ColdFire Version 3 core allows you to lock the cache after storing the desired code. This approach minimizes the latency for time-critical algorithms.
NEC's V83x family also uses a pipelined MAC unit, but it runs at twice the internal frequency of the chip to help it maintain an effective single-cycle throughput; a MAC takes a minimum of three clock cycles to run through the pipeline. Again, operand availability is crucial to achieving this level of performance. Because the V83x's MAC operation is a three-operand instruction, the CPU should supply three 32-bit operands during the register-fetch stage of the pipeline. In one clock cycle, the CPU fetches two operands from the chip's dual-ported register file. The third operand must be available through internal forwarding from a previous operation for the MAC to hit its three-cycle latency. Usually, the compiler changes the instruction order to help minimize data dependencies. However, some code does not allow the CPU to supply three operands during one clock cycle. This code imposes a one-clock penalty, bringing the MAC latency up to four clocks.
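The mask-based circular addressing described for ColdFire can be illustrated with a small C sketch. It models the general power-of-two masking technique on array indexes, not ColdFire's exact register behavior: the names `circ_index` and `circ_mac` are invented, the buffer length is assumed to be a power of two, and the mask is assumed to be that length minus one.

```c
#include <stdint.h>

/* AND a free-running index with the mask to keep it inside the buffer.
   With mask = length - 1 and a power-of-two length, the carry out of the
   low bits is discarded, so the index wraps to the start of the buffer. */
uint32_t circ_index(uint32_t addr, uint16_t mask)
{
    return addr & mask;
}

/* Dot product of `taps` coefficients with a circular buffer, starting at
   index `start`. The index simply autoincrements; the mask does the wrap. */
int32_t circ_mac(const int16_t *buf, uint16_t mask, uint32_t start,
                 const int16_t *coeff, uint32_t taps)
{
    int32_t acc = 0;
    uint32_t addr = start;
    for (uint32_t k = 0; k < taps; k++) {
        acc += (int32_t)coeff[k] * buf[circ_index(addr, mask)];
        addr++;   /* no compare-and-reset needed */
    }
    return acc;
}
```

Compared with an explicit wrap test, the mask costs nothing per access when the hardware applies it as part of address generation; the price is that buffer sizes are restricted to powers of two and the buffer must be suitably aligned.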
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.