June 5, 1997

C compilers for DSPs flex their muscles

MARKUS LEVY, TECHNICAL EDITOR

As DSPs have become more popular, so too has programming them in C. EDN investigated the quality of code that DSP compilers produce. So, before you hand-code your next DSP algorithm, check out the results of the EDN compiler benchmark challenge.

Have you ever programmed a DSP in assembly language? If so, then you're probably familiar with instructions such as "macr x0,x0,a x:(r3)+,x0 y:(r6)+,y0." This instruction, which performs a multiply-accumulate (MAC), two data moves, and two pointer updates, demonstrates one of the key attributes of a DSP--the ability to perform multiple operations in one single-cycle instruction. A C compiler generated this instruction from the original source of "y[i] += scaler * x[i]." Obviously, C programs are not usually this simple, and, therefore, DSP compilers face the overwhelming challenge of dissecting C programs and mapping the pieces onto the abstract resources of a DSP.

C compilers for DSPs have been around for many years, but only recently have vendors put serious effort into making their compilers effective tools. This situation is especially true for fixed-point DSPs, which have been notoriously difficult to program in C. One reason for the difficulty, which floating-point DSPs don't share, is that the C language doesn't support fixed-point data types. DSPs also put a strain on C compilers because the processors have small register files, no software stack for passing parameters, nonorthogonal instruction sets, multiple data memories, circular buffers, and limited addressing modes.

But, although DSPs have become more C-friendly and C programming has become more DSP-friendly, DSPs are also becoming less assembly-language friendly. Take Texas Instruments' new C6x DSP, for example. This processor has eight functional units that operate in parallel. Imagine trying to write assembly code that can maximize the efficiency of those units.
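To make the "y[i] += scaler * x[i]" example concrete, here is a minimal sketch of how such a loop might look in portable C on 16-bit data. It is not the benchmark's actual source; the function name `vec_scale_acc`, the Q15 scaling, and the parameter list are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: y[i] += scaler * x[i] on 16-bit fractional (Q15) data.
 * C has no fixed-point type, so the rescaling is spelled out as a shift.
 * Assumes >> on a negative int32_t is an arithmetic shift (true of most
 * compilers, but implementation-defined in C) and does no saturation. */
void vec_scale_acc(int16_t *y, const int16_t *x, int16_t scaler, size_t n)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; i++) {
        /* 16x16 -> 32-bit product (Q30), shifted back down to Q15 */
        int32_t prod = (int32_t)scaler * x[i];
        y[i] = (int16_t)(y[i] + (prod >> 15));
    }
}
```

A compiler that recognizes this pattern has to fold the multiply, the accumulate, both loads, and both pointer increments into one instruction; the explicit shift is one of the things that gets in its way.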
A good C compiler is worth its weight in gold. But what makes a good C compiler? To test the capability of the compilers and associated DSPs, EDN gave a group of DSP- and C-compiler vendors a collection of C-coded, DSP-related functions. These functions ranged in complexity from a simple dot product to an integer math-intensive JPEG DCT. The challenge was to see how efficiently each vendor's compiler could handle these functions. EDN gave specific benchmark instructions to the vendors as the following steps:
The contenders

Vendors submitted results for six fixed-point and two floating-point DSPs. The following key points will help you relate the benchmark results to a specific DSP's architecture:

DSP Group's 16-bit, fixed-point OakDSPCore includes a single-cycle MAC unit, a 36-bit barrel shifter, two 36-bit accumulators, six general-purpose and four user-definable registers, two data buses and one program bus, two RAM data blocks for X and Y memory, and a software stack. At each cycle, the three buses move X- and Y-memory data to the MAC unit while program-control fetches an instruction from on-chip memory. The OakDSPCore has a 16-bit loop counter for repeating as many as 65,536 instructions or blocks. The OakDSPCore supports as many as four levels of block nesting and is available as a "core" for embedding within ASICs.

Hitachi's SH-DSP µP contains a separate 16-bit, fixed-point DSP that uses a modified Harvard architecture and separately addressable X and Y memories. The bus structure allows the DSP to access two data operands and fetch an instruction during one cycle. During that cycle, the SH-DSP can also execute one ALU operation and a 16×16-bit multiply. The DSP unit's registers comprise six 32-bit registers and two 32-bit accumulators. The DSP also supports a circular buffer, a zero-overhead loop, and conditional instruction execution of some ALU and shift operations.

Motorola's 24-bit, fixed-point DSP5600x is accumulator-based but allows bit manipulation on registers and memory. It has a single-cycle MAC unit with two 56-bit accumulators and four 24-bit registers to feed the unit from separate X and Y memories. Three internal address- and data-bus pairs allow an instruction fetch and two data accesses in one cycle. When the 56000 stores 56-bit values to 24-bit memory or registers, you can deploy an optional 1-bit shift operation and saturate the value to ±1.0. The DSP performs do/end-do-, single-, or block-instruction hardware looping.
Motorola's 24-bit, fixed-point DSP563xx uses a seven-stage pipeline to achieve single-cycle instruction execution. The 563xx is a register-based architecture that supports separate X and Y memories and a barrel shifter that supports multibit-shift instructions in both directions and by any number of bits. This DSP implements an overflow mechanism for the on-chip hardware stack to off-chip data memory. The device can also conditionally execute all ALU instructions.

TI's 16-bit, fixed-point TMS320C54x incorporates three buses for data memory and one for program memory. The C54x can generate two data-memory addresses per cycle. The four internal buses enable multiple operand operations. The C54x has two 40-bit accumulators and a 40-bit adder dedicated to MAC operations. The ALU also features a dual 16-bit configuration that enables dual single-cycle operations and a software stack. The C54x performs dedicated-function instructions, such as FIR filters. Other special instructions include repeat and support for eight parallel instructions (for example, parallel store and MAC).

TI's 16-bit TMS320C6x DSP very-long-instruction-word (VLIW) architecture comprises dual datapaths and dual matching sets of four functional units. The functional units comprise two 16×16-bit multipliers and six 32-bit arithmetic units including a 40-bit ALU and 40-bit barrel shifter. Each functional-unit set has 16 32-bit registers but can access the other functional-unit set's register bank. Each set can perform as many as two reads and one write per cycle from a register in its own bank. The processor conditionally executes all instructions--a method that reduces branching and therefore keeps the 11-stage pipeline flowing.

Analog Devices' 32-bit, floating-point SHARC DSP comprises a floating-point multiplier with dual, 80-bit-wide, fixed-point accumulators, a 32-bit barrel shifter, and a floating- and fixed-point ALU. These computational units can operate in parallel.
The most complex instruction can perform three computations, two data moves, and two pointer calculations in one cycle. Operations center around a 32×40-bit, 10-ported register file that holds multiple accumulators and registers. The data registers support either fixed- or floating-point formats. SHARC provides single and block repeat with zero-overhead looping and conditional execution of most instructions.

TI's 32-bit, floating-point TMS320C3x DSP integrates a von Neumann architecture with a 32-bit, single-cycle, floating-point MAC unit and a 32-bit barrel shifter/ALU. The C3x also performs fixed-point math based on a 24-bit mantissa width. The core registers, eight 40-bit registers, auxiliary registers, and key-control registers reside in a multiported register file. The C30's busing scheme enables programs to access the next instruction and two data values and to transfer data to or from the I/O subsystem in one cycle. The C3x performs single- or block-instruction hardware looping and can convert floating point to integer and vice versa.

Benchmarks raise issues

The biggest difficulty of comparing the various DSP compilers is figuring out how to compare them. Ideally, you'd like to compare each compiler's output with handwritten assembly code (assuming that the assembly code is the "best" that it could be). Even then, it is difficult to quantify or generalize how much better assembly code is than C, because the quality level varies with the algorithm or application.

Another difficulty of comparing DSP compilers is that you cannot separate the compiler from the DSP architecture that it supports. To a large extent, the quality of a compiler's output is directly tied to the DSP's resources, including the number of registers, the orthogonality of the instructions, and the availability of a software stack. In an ideal world, for a specific DSP, you could choose from several compilers. However, unlike with many µPs, there is only one compiler for each DSP.
The only exception to this rule is Motorola's fixed-point DSP; compilers from Motorola and Tasking support this device. Although Motorola claims many customers use its compiler, which it gives away for free, the company didn't supply EDN with benchmark results for this compiler.

In lieu of the above approaches, you can analyze the quality of a compiler by examining the assembly code that it generates. Specifically, in most DSP applications, the focus should be on the contents of the inner loops. Obviously, if the compiler produces a single-instruction loop, this coding is as good as any handwritten assembly could possibly be. On the other hand, in more complex applications, you may find it difficult to analyze the output of the compiler and the resulting performance of the DSP. This fact is especially true in the case of DSPs such as TI's C6x or any other heavily pipelined architecture, because the compiler has many degrees of freedom in the way it schedules operations.

The verdict

You can judge a compiler's capability by comparing the cycle-count results of the portable C code to the results achieved after the vendors applied their DSP-specific optimizations (Tables 1A and 1B). In all but a few cases, the unmodified and hand-optimized results differ, indicating that you must perform various levels of "tuning" to tell the compiler exactly what you're trying to accomplish on the DSP hardware. Although tuning your code is a reasonable way to achieve maximum compiled performance, it detracts from your code's portability. On the other hand, optimal, tuned C code is still easier to write and understand than hand-coded assembly. This will become even truer as DSP architectures become more complex.

In general, tuning within this benchmark fell into two categories: DSP extensions and recoding. For most of the DSPs, the lowest-hanging fruit for increasing performance was to use extensions that define the X- and Y-memory spaces.
This step allowed the compiler to use multioperational instructions that access data from X and Y memories while performing a MAC operation. Motorola used extensions, or intrinsics, that redefined the integer data types as fractional data types. This step allowed the compiler to generate inner loops without performing 15-bit shifts (Listing 1). Using intrinsic functions to specify fractional fixed-point operations in C can be clumsy when the algorithm is commonly thought to use normal C operators, such as '+' and '*'. Green Hills solved the fractional data-type issue by implementing C++ classes to act as these fixed-point types. You have to be familiar with only the operators available to the classes; you do not explicitly need intrinsics.

On some of the functions, TI's programmers could not perform any hand-optimizations on either the C54x or the C30; therefore, some of the unmodified and hand-optimized results in Tables 1A and 1B are the same. One reason the C54x code was unmodified is that the compiler supports no intrinsics. (The benefit is more portable code.) Regarding the C30, TI indicated that its code generator is immune to small changes in code or tricks such as changing arrays to pointers. On the other hand, the optimized code from Green Hills for the SH-DSP yields good performance but minimal portability, both because of the company's extensive use of intrinsics.

On all of the benchmarked functions, the performance level of TI's C6x is in a category of its own. This is largely because the C6x's architecture lends itself well to C. Some of the reasons for this behavior include the core's large, relatively orthogonal register set, flexible functional units, and a RISC-like instruction set. These features make it easier for the compiler to schedule operations. However, without using intrinsics, keeping the DSP's eight functional units busy may still be difficult for the C6x compiler.
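The fractional data types discussed above can be emulated in portable C, which shows what a vendor's intrinsics or classes hide from the programmer. The following is a minimal sketch, not any vendor's actual extension: the names `q15_t`, `q15_sat`, `q15_mul`, and `q15_mac` are hypothetical, and the code assumes that right-shifting a negative `int32_t` is an arithmetic shift.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* hypothetical name: 1 sign bit, 15 fraction bits */

#define Q15_MAX 32767
#define Q15_MIN (-32768)

/* Clamp a 32-bit intermediate to the Q15 range [-1.0, 1.0). */
q15_t q15_sat(int32_t v)
{
    if (v > Q15_MAX) return (q15_t)Q15_MAX;
    if (v < Q15_MIN) return (q15_t)Q15_MIN;
    return (q15_t)v;
}

/* Fractional multiply: Q15 * Q15 gives Q30; the 15-bit shift restores Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t prod = (int32_t)a * b;
    return q15_sat(prod >> 15);
}

/* Fractional multiply-accumulate with saturation on the final result. */
q15_t q15_mac(q15_t acc, q15_t a, q15_t b)
{
    int32_t prod = (int32_t)a * b;
    return q15_sat((int32_t)acc + (prod >> 15));
}
```

On a DSP with a fractional multiplier, the shift and the clamp are typically part of the MAC hardware, so a compiler that knows the operands are fractional can drop both from the inner loop. Written with plain integers as above, the compiler has to prove that the shift-and-clamp sequence matches the hardware behavior before it can do the same.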
In this benchmark, unrolling the loop of the JPEG DCT function was one of the most straightforward ways that a compiler vendor recoded the original function to increase performance. Even on TI's C6x, this modification resulted in a 15-times performance improvement over the original code. This improvement resulted mainly from the compiler's inability to handle software pipelining on the triply nested loop. However, once the unrolling reduced the nest to two loops, the compiler was able to pipeline it.

A less straightforward and less portable recoding exists in Analog Devices' version of the vector-multiply function (vec_mpy). In contrast to the original code (Listing 1), this modified code includes a software-pipelined loop and a strength reduction on the x and y variables. It is likely that this code shows you the performance of the chip rather than that of the compiler, because most of the compiler's "thinking" is already done.
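To illustrate the kind of recoding described for vec_mpy, the sketch below shows a straightforward indexed loop next to a version with strength reduction (array indexing replaced by incrementing pointers) and manual software pipelining (the load for the next iteration is issued alongside the multiply for the current one). This is not Analog Devices' code; the function names and the Q15 arithmetic are assumptions, and the pipelined version assumes that x and y do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward form: indexed access, load and multiply in one iteration. */
void vec_mpy_indexed(int16_t *y, const int16_t *x, int16_t scaler, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = (int16_t)(y[i] + (((int32_t)scaler * x[i]) >> 15));
}

/* Recoded form: pointers instead of indices (strength reduction) and a
 * hand-pipelined loop with an explicit prologue and epilogue.
 * Assumes x and y do not alias. */
void vec_mpy_pipelined(int16_t *y, const int16_t *x, int16_t scaler, size_t n)
{
    assert(y != NULL && x != NULL);
    if (n == 0)
        return;

    const int16_t *px = x;
    int16_t *py = y;
    int16_t xv = *px++;                       /* prologue: first load */

    for (size_t i = 1; i < n; i++) {
        int32_t prod = (int32_t)scaler * xv;  /* multiply the value loaded earlier */
        xv = *px++;                           /* load for the next iteration */
        *py = (int16_t)(*py + (prod >> 15));
        py++;
    }

    /* epilogue: finish the last element already held in xv */
    *py = (int16_t)(*py + (((int32_t)scaler * xv) >> 15));
}
```

Both functions produce the same results for non-overlapping arrays. The second one hands the compiler a schedule that is already worked out, which is why results from such code say more about the chip than about the compiler.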
Another benchmark observation is that TI's C6x also stands alone on code size. The C6x compiler generated code two to 20 times larger than the resulting code sizes of the other DSPs, depending on the function. This result can be explained in part because TI's C6x programmer did a little extra inner-loop unrolling, especially on the JPEG DCT function. Another factor explaining its large code is that all instructions in the C6x are 32 bits wide; the RISC-like nature of its instruction set implies that it may take several instructions to complete an operation. Contrast this with Analog Devices' SHARC, which uses a 48-bit instruction word and can encode several operations per instruction.

You can blame a large part of the C6x's resultant code size on the compiler. Although the compiler generates tight inner loops, it also generates lengthy loop prologues and epilogues that support the DSP's software-pipelining capability. TI states that future compiler releases will strive to reduce the prologues and epilogues by as much as 75%. Instead of using prologues and epilogues, the compiler will use the C6x's conditional instruction-execution capability to fill the pipeline within the inner loop.

Portability and coding efficiency are the ultimate goals of using C vs assembly language to program a DSP. Today's DSP compilers, although much better than they were even a year ago, still have a long way to go before they completely reach these goals. Unfortunately, the ANSI C Committee (X3J11) is one of the factors inhibiting the attainment of these goals. DSP-compiler vendors anxiously await the ANSI C Committee's adoption of a common set of DSP extensions that would allow vendors to abandon the use of some of their pragmas and intrinsics.

Some system designers argue that compilers will never completely eliminate assembly programming. They say that it's just too difficult for the compiler to understand complex DSP hardware.
But compilers will become more intelligent, and DSPs will become more C-friendly. And, as long as a compiler delivers a single-cycle loop, who could ask for more?

Acknowledgment

This article took a lot of effort by a long list of people. I especially want to thank Rich Scales, a great application engineer for TI's DSP. Thanks to the TI compiler team. Also thanks to the programmers David Kleidermacher (Green Hills), Vince Del Vecchio (Analog Devices), Yuval Ronen (Motorola), and Chen Sagiv (DSP Group). And who could forget Peter Torelli, the expert C programmer?
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.
Table 1A--Cycle counts for unmodified benchmark results [1]

| DSP | vec_mpy1 | mac | fir | fir_no_red | latsynth | iir | codebook | jpegdct |
|---|---|---|---|---|---|---|---|---|
| DSP Group OakDSPCore [2] | 1225 (36) | 1687 (66) | 18123 (66) | 26837 (198) | 4431 (158) | 11441 (110) | 458 (130) | 4419 (662) |
| Hitachi SH-DSP (Green Hills) [2] | 2560 (38) | 2570 (52) | 30966 (70) | 31175 (134) | 3117 (126) | 3319 (114) | 339 (98) | 7936 (938) |
| Motorola 56002 (Tasking) [3] | 2416 (45) | 2420 (51) | 70916 (84) | 81520 (249) | 6202 (162) | 4712 (144) | 1336 (189) | 19144 (1251) |
| Motorola 56300 (Tasking) [2] | 1661 (45) | 1515 (54) | 35909 (75) | 41762 (204) | 2727 (135) | 2361 (114) | 1255 (138) | 6973 (918) |
| TI C54x [2] | 758 (30) | 1063 (44) | 5808 (54) | 23914 (148) | 2606 (114) | 1764 (102) | 313 (90) | 5043 (622) |
| TI C6x [4] | 319 (204) | 173 (360) | 3658 (312) | 3161 (424) | 353 (500) | 326 (476) | 96 (484) | 3902 (1444) |
| Analog Devices SHARC [5] | 767 (96) | 643 (186) | 5728 (216) | 14548 (420) | 1139 (294) | 1061 (354) | 207 (264) | 2334 (1086) |
| TI C30 [5] | 482 (28) | 794 (52) | 3180 (60) | 9538 (156) | 744 (84) | 694 (108) | 248 (140) | 6320 (1160) |

[1] These results reflect the capability of the DSP's compiler and may not indicate the maximum potential of the DSP. Numbers in parentheses indicate code size in bytes.
[2] 16-bit, fixed-point DSPs.
[3] 24-bit, fixed-point DSP.
[4] 32-bit, very-long-instruction-word (VLIW) DSP.
[5] Floating-point DSPs.
Table 1B--Cycle counts for hand-optimized benchmark results [1]

| DSP | vec_mpy1 | mac | fir | fir_no_red | latsynth | iir | codebook | jpegdct |
|---|---|---|---|---|---|---|---|---|
| DSP Group OakDSPCore [2] | 620 (36) | 327 (46) | 3068 (48) | 5883 (138) | 1439 (120) | 1181 (98) | 441 (130) | 3821 (644) |
| Hitachi SH-DSP (Green Hills) [2] | 477 (68) | 331 (64) | 3273 (100) | 4767 (96) | 757 (148) | 1509 (88) | 262 (120) | 7401 (1058) |
| Motorola 56002 (Tasking) [3] | 914 (27) | 618 (30) | 6114 (48) | 7504 (51) | 2226 (84) | 1512 (57) | 640 (123) | 13132 (1086) |
| Motorola 56300 (Tasking) [2] | 610 (27) | 314 (30) | 3458 (42) | 3853 (51) | 1132 (93) | 759 (48) | 378 (99) | 5821 (849) |
| TI C54x [2] | 758 (30) [6] | 614 (36) | 5808 (54) [6] | 13412 (110) | 1708 (88) | 1764 (102) [6] | 313 (90) [6] | 3787 (1022) |
| TI C6x [4] | 178 (560) | 173 (360) [6] | 1986 (412) | 3161 (420) [6] | 295 (568) | 225 (496) | 96 (484) [6] | 255 (2024) |
| Analog Devices SHARC [5] | 328 (132) | 342 (162) | 2874 (132) | 13051 (426) | 455 (288) | 498 (264) | 207 (264) [6] | 2348 (1092) |
| TI C30 [5] | 334 (40) | 358 (80) | 3180 (60) [6] | 9538 (156) [6] | 554 (140) | 694 (108) [6] | 248 (140) [6] | 2618 (1308) |

[1] These results reflect the capability of the DSP's compiler and may not indicate the maximum potential of the DSP. Numbers in parentheses indicate code size in bytes.
[2] 16-bit, fixed-point DSPs.
[3] 24-bit, fixed-point DSP.
[4] 32-bit, very-long-instruction-word (VLIW) DSP.
[5] Floating-point DSPs.
[6] No improvements could be made by hand-optimizations.
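One way to read Tables 1A and 1B together is to divide each unmodified cycle count by its hand-optimized counterpart. The sketch below does this for a few entries copied from the tables; the struct layout, the names `edn_samples`, `speedup`, and `max_speedup_index`, and the choice of entries are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark entry: cycle counts from Table 1A and Table 1B. */
struct result {
    const char *dsp;
    const char *function;
    long unmodified;   /* Table 1A cycles */
    long optimized;    /* Table 1B cycles */
};

/* A few entries copied from the tables (hypothetical selection). */
const struct result edn_samples[] = {
    { "TI C6x",                   "jpegdct",  3902L,  255L },
    { "DSP Group OakDSPCore",     "mac",      1687L,  327L },
    { "Motorola 56002 (Tasking)", "fir",     70916L, 6114L },
    { "TI C54x",                  "vec_mpy1",  758L,  758L },
};
const size_t edn_sample_count = sizeof edn_samples / sizeof edn_samples[0];

/* Ratio of unmodified to hand-optimized cycles; 1.0 means no improvement. */
double speedup(long unmodified_cycles, long optimized_cycles)
{
    assert(optimized_cycles > 0);
    return (double)unmodified_cycles / (double)optimized_cycles;
}

/* Index of the entry with the largest speedup; n must be at least 1. */
size_t max_speedup_index(const struct result *r, size_t n)
{
    assert(r != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (speedup(r[i].unmodified, r[i].optimized) >
            speedup(r[best].unmodified, r[best].optimized))
            best = i;
    }
    return best;
}
```

For the C6x jpegdct entry, the ratio is 3902/255, or roughly 15.3, which matches the 15-times improvement cited in the text. A ratio of exactly 1.0, as for the C54x vec_mpy1 entry, corresponds to a cell marked with note 6.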