EDN Access

 

June 5, 1997


C compilers for DSPs flex their muscles

MARKUS LEVY, TECHNICAL EDITOR

As DSPs have become more popular, so too has the practice of programming them in C. EDN investigated the quality of code that DSP compilers produce. So, before you hand-code your next DSP algorithm, check out the results of the EDN compiler benchmark challenge.

Have you ever programmed a DSP in assembly language? If so, then you're probably familiar with instructions such as "macr x0,x0,a x:(r3)+,x0 y:(r6)+,y0." This instruction, which performs a multiply-accumulate (MAC), two data moves, and two pointer updates, demonstrates one of the key attributes of a DSP--the ability to perform multiple operations in one instruction cycle. A C compiler generated this instruction from the original source of "y[i]+=scaler×x[i]." Obviously, C programs are not usually this simple, and, therefore, DSP compilers face the overwhelming challenge of dissecting C programs and associating the pieces with the abstract resources of a DSP.

C compilers for DSPs have been around for many years, but only recently have vendors put serious efforts into making their compilers effective tools. This situation is especially true for fixed-point DSPs, which have been notoriously difficult to program in C. One reason for the difficulty is that the C language, which handles the data types of floating-point DSPs natively, doesn't support fixed-point data types. DSPs also put a strain on C compilers because the processors have small register files, no software stack for passing parameters, nonorthogonal instruction sets, multiple data memories, circular buffers, and limited addressing modes.

But, although DSPs have become more C-friendly and C programming has become more DSP-friendly, DSPs are also becoming less assembly-language friendly. Take Texas Instruments' new C6x DSP, for example. This processor has eight functional units that operate in parallel. Imagine trying to write assembly code that can maximize the efficiency of those units. A good C compiler is worth its weight in gold. But what makes a good C compiler?

To test the capability of the compilers and associated DSPs, EDN gave a group of DSP- and C-compiler vendors a collection of C-coded, DSP-related functions. These functions ranged in complexity from a simple dot product to an integer math-intensive JPEG DCT. The challenge was to see how efficiently each vendor's compiler could handle these functions. EDN gave the vendors the following benchmark instructions:

  1. Compile the unmodified, "out-of-the-box" C code using compiler-optimization switches to generate the fastest code without yielding impractically large code. In this step, EDN allowed the vendors to modify the data types to match the DSP architecture. For example, vendors may have converted short variables to int or float variables, depending on the targeted architecture. (Readers can see these modifications by downloading the C files that appear as xxx_uc.txt, where uc=unmodified C code, or xxx_uc.zip to obtain all xxx_uc.txt files, from EDN's Web site, www.ednmag.com. A missing DSP file indicates that the vendor made no changes from the original code. Readers can also see the listings for the compiled code by downloading the assembly files that appear as xxx_ua.txt, where ua=unmodified assembly code, or xxx_ua.zip to obtain all xxx_ua.txt files, from EDN's Web site.)

  2. Crank the compiled code through the corresponding cycle-accurate simulator and record the number of cycles to execute each function. (Readers can view this data, along with the code size per function, in Table 1A.)

  3. Modify the code to achieve the highest performance without changing the functionality of the C function. EDN allowed the vendors to perform a minimal amount of loop unrolling, especially when it was necessary to support software pipelining for inner loops. Modifications also included the use of pragmas and intrinsics but included no inline assembly coding. (Readers can see these modifications by downloading the C files that appear as xxx_oc.txt, where oc=optimized C code, or xxx_oc.zip to obtain all xxx_oc.txt files, from the Web site.)

  4. Compile the modified C code and crank the compiled code through the simulator. (Readers can view this data, along with the code size per function, in Table 1B. The listings for the compiled code are available on the Web site in the files that appear as xxx_oa.txt, where oa=optimized assembly code, or in xxx_oa.zip, which contains all xxx_oa.txt files.) As you review the results, keep in mind that the less the modified code looks like the original, the poorer the performance of the compiler. You should not think of these performance ratings as the raw performance of the DSP.

The contenders

Vendors submitted results for six fixed-point and two floating-point DSPs. The following key points will help you relate the benchmark results to a specific DSP's architecture:

DSP Group's 16-bit, fixed-point OakDSPCore includes a single-cycle MAC unit, a 36-bit barrel shifter, two 36-bit accumulators, six general-purpose and four user-definable registers, two data buses and one program bus, two RAM data blocks for X and Y memory, and a software stack. At each cycle, the three buses move X- and Y-memory data to the MAC unit while program-control fetches an instruction from on-chip memory. The OakDSPCore has a 16-bit loop counter for repeating instructions or blocks as many as 65,536 times. The OakDSPCore supports as many as four levels of block nesting and is available as a "core" for embedding within ASICs.

Hitachi's SH-DSP µP contains a separate 16-bit, fixed-point DSP that uses a modified Harvard architecture and separately addressable X and Y memories. The bus structure allows the DSP to access two data operands and fetch an instruction during one cycle. During that cycle, the SH-DSP can also execute one ALU operation and a 16×16-bit multiply. The DSP unit's registers comprise six 32-bit registers and two 32-bit accumulators. The DSP also supports a circular buffer, a zero-overhead loop, and conditional instruction execution of some ALU and shift operations.

Motorola's 24-bit, fixed-point DSP5600x is accumulator-based but allows bit manipulation on registers and memory. It has a single-cycle MAC unit with two 56-bit accumulators and four 24-bit registers to feed the unit from separate X and Y memories. Three internal address- and data-bus pairs allow an instruction fetch and two data accesses in one cycle. When the 56000 stores 56-bit values to 24-bit memory or registers, you can deploy an optional 1-bit shift operation and saturate the value to ±1.0. The DSP performs do/end-do-, single- or block-instruction hardware looping.

Motorola's 24-bit, fixed-point DSP563xx uses a seven-stage pipeline to achieve single-cycle instruction execution. The 563xx is a register-based architecture that supports separate X and Y memories and a barrel shifter that supports multibit-shift instructions in both directions and by any number of bits. This DSP implements an overflow mechanism for the on-chip hardware stack to off-chip data memory. The device can also conditionally execute all ALU instructions.

TI's 16-bit, fixed-point TMS320C54x incorporates three buses for data memory and one for program memory. The C54x can generate two data-memory addresses per cycle. The four internal buses enable multiple operand operations. The C54x has two 40-bit accumulators and a 40-bit adder dedicated to MAC operations. The ALU also features a dual 16-bit configuration that enables dual single-cycle operations and a software stack. The C54x performs dedicated-function instructions, such as FIR filters. Other special instructions include repeat and support for eight parallel instructions (for example, parallel store and MAC).

TI's 16-bit TMS320C6x DSP very-long-instruction-word (VLIW) architecture comprises dual datapaths and dual matching sets of four functional units. The functional units comprise two 16×16-bit multipliers and six 32-bit arithmetic units including a 40-bit ALU and 40-bit barrel shifter. Each functional-unit set has 16 32-bit registers but can access the other functional-unit set's register bank. Each set can perform as many as two reads and one write per cycle from a register in its own bank. The processor conditionally executes all instructions--a method that reduces branching and therefore keeps the 11-stage pipeline flowing.

Analog Devices' 32-bit, floating-point SHARC DSP comprises a floating-point multiplier with dual, 80-bit-wide, fixed-point accumulators, a 32-bit barrel shifter, and a floating- and fixed-point ALU. These computational units can operate in parallel. The most complex instruction can perform three computations, two data moves, and two pointer calculations in one cycle. Operations center around a 32×40-bit, 10-ported register file that holds multiple accumulators and registers. The data registers support either fixed- or floating-point formats. SHARC provides single and block repeat with zero-overhead looping and conditional execution of most instructions.

TI's 32-bit, floating-point TMS320C3x DSP integrates a von Neumann architecture with a 32-bit, single-cycle, floating-point MAC unit and a 32-bit barrel shifter/ALU. The C3x also performs fixed-point math based on a 24-bit mantissa width. The core registers, eight 40-bit registers, auxiliary registers, and key-control registers reside in a multiported register file. The C30's busing scheme enables programs to access the next instruction and two data values and to transfer data to or from the I/O subsystem in one cycle. The C3x performs single- or block-instruction hardware looping and can convert floating point to integer and vice versa.

Benchmarks raise issues

The biggest difficulty of comparing the various DSP compilers is figuring out how to compare them. Ideally, you'd like to compare each compiler's output with handwritten assembly code (assuming that the assembly code is the "best" that it could be). Regardless, it is difficult to quantify or generalize how much better assembly code is than C, because the quality level varies per algorithm or application.

Another difficulty of comparing DSP compilers is that you cannot separate the compiler from the DSP architecture that it supports. To a large extent, the quality of a compiler's output is directly tied to the DSP's resources, including the number of registers, the orthogonality of the instructions, and the availability of a software stack. In an ideal world, for a specific DSP, you could choose from several compilers. However, unlike with many µPs, there is only one compiler for each DSP. The only exception to this rule is Motorola's fixed-point DSP; compilers from Motorola and Tasking support this device. Although Motorola claims many customers use its compiler, which it gives away for free, the company didn't supply EDN with benchmark results for this compiler.

In lieu of the above approaches, you can analyze the quality of a compiler by examining the assembly code that it generates. Specifically, in most DSP applications, the focus should be on the contents of the inner loops. Obviously, if the compiler produces a single-instruction loop, this coding is as good as any handwritten assembly could possibly be. On the other hand, in more complex applications, you may find it difficult to analyze the output of the compiler and the resulting performance of the DSP. This fact is especially true in the case of DSPs such as TI's C6x or any other heavily pipelined architecture, because the compiler has many degrees of flexibility in the way it schedules operations.

The verdict

You can judge a compiler's capability by comparing the cycle-count results of the portable C code to the results achieved after the vendors applied their DSP-specific optimizations (Tables 1A and 1B). In all but a few cases, the unmodified and hand-optimized results differ, indicating that you must perform various levels of "tuning" to tell the compiler exactly what you're trying to accomplish on the DSP hardware. Although tuning your code is a reasonable way to achieve maximum compiled performance, it detracts from your code's portability. On the other hand, optimal, tuned C code is still easier to write and understand than hand-coded assembly. This will become even truer as DSP architectures become more complex.

In general, tuning within this benchmark fell into two categories: DSP extensions and recoding. For most of the DSPs, the lowest hanging fruit to increase performance was to use extensions that define the X- and Y-memory spaces. This step allowed the compiler to use multioperational instructions that access data from X and Y memories while performing a MAC operation. Motorola used extensions, or intrinsics, that redefined the integer data types as fractional data types. This step allowed the compiler to generate inner loops without performing 15-bit shifts (Listing 1). Using intrinsic functions to specify fractional fixed-point operations in C can be clumsy when the algorithm is commonly thought to use normal C operators, such as '+' and '*'. Green Hills solved the fractional data-type issue by implementing C++ classes to act as these fixed-point types. You have to be familiar with only the operators available to the classes; you do not explicitly need intrinsics.
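To see where the 15-bit shifts come from, consider a minimal sketch in portable C. It assumes the common Q15 convention (a 16-bit integer whose value is read as the integer divided by 32,768) and a compiler that shifts negative values arithmetically. The function names are hypothetical and come neither from Listing 1 nor from any vendor's intrinsics, and the int16_t and int32_t types from <stdint.h> simply stand in for 16- and 32-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 multiply in plain C. The integer product of two Q15 values is in Q30
   format, so a 15-bit right shift is needed to get back to Q15.
   Assumption: >> on a negative value is an arithmetic shift. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product;

    /* -1.0 * -1.0 = +1.0 does not fit in Q15; this sketch rejects that
       case instead of saturating. */
    assert(!(a == INT16_MIN && b == INT16_MIN));

    product = (int32_t)a * (int32_t)b;   /* Q30 */
    return (int16_t)(product >> 15);     /* back to Q15 */
}

/* Dot product with the shift inside the inner loop. A fractional data type
   lets the compiler drop this explicit shift from the loop body. */
int16_t q15_dot(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc += ((int32_t)x[i] * (int32_t)y[i]) >> 15;
    return (int16_t)acc;
}
```

With integer types, the shift is part of the source code, and the compiler has to recognize the whole pattern before it can map it onto a fractional MAC instruction. A fractional type or intrinsic states the intent directly.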

On some of the functions, TI's programmers could not perform any hand-optimizations on either the C54x or the C30; therefore, some of the unmodified and hand-optimized results in Tables 1A and 1B are the same. One reason the C54x code was unmodified is that the compiler supports no intrinsics. (The benefit is more portable code.) Regarding the C30, TI indicated that its code generator is immune to small changes in code or tricks such as changing arrays to pointers. On the other hand, the optimized code from Green Hills for the SH-DSP yields good performance but minimal portability, both because of the company's extensive use of intrinsics.

On all of the benchmarked functions, the performance level of TI's C6x is in a category of its own. This is largely because the C6x's architecture lends itself well to C. Some of the reasons for this behavior include the core's large, relatively orthogonal register set, flexible functional units, and a RISClike instruction set. These features make it easier for the compiler to schedule operations. However, without using intrinsics, keeping the DSP's eight functional units busy may still be difficult for the C6x compiler.

In this benchmark, unrolling the loop of the JPEG DCT function was one of the most straightforward ways that a compiler vendor recoded the original function to increase performance. Even on TI's C6x, this modification resulted in a 15-fold performance improvement over the original code. The original code ran slowly mainly because the compiler could not apply software pipelining to the triply nested loop. However, once unrolling reduced the function to two nested loops, the compiler was able to pipeline it.
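The following sketch shows the kind of recoding involved. It is not the JPEG DCT from the benchmark; the sizes, the data layout, and the function names are hypothetical. Fully unrolling a short innermost loop by hand turns a triply nested loop into a doubly nested one, whose inner loop is then a candidate for software pipelining.

```c
#include <assert.h>

enum { ROWS = 8, COLS = 8, TAPS = 4 };   /* hypothetical sizes */

/* Original form: three nested loops. */
void transform_nested(int *out, const int *in, const int *coef)
{
    int r, c, k;

    assert(out && in && coef);
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++) {
            int sum = 0;
            for (k = 0; k < TAPS; k++)
                sum += in[r * TAPS + k] * coef[k * COLS + c];
            out[r * COLS + c] = sum;
        }
}

/* Recoded form: the innermost loop is unrolled by hand, leaving two loops. */
void transform_unrolled(int *out, const int *in, const int *coef)
{
    int r, c;

    assert(out && in && coef);
    for (r = 0; r < ROWS; r++) {
        const int *row = &in[r * TAPS];
        for (c = 0; c < COLS; c++)
            out[r * COLS + c] = row[0] * coef[0 * COLS + c]
                              + row[1] * coef[1 * COLS + c]
                              + row[2] * coef[2 * COLS + c]
                              + row[3] * coef[3 * COLS + c];
    }
}
```

Both functions compute the same values. The second is longer and tied to one trip count, which is the portability cost of this kind of tuning.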

A less straightforward and less portable recoding exists in Analog Devices' version of the vector-multiply function (vec_mpy). In contrast to the original code (Listing 1), this modified code includes a software-pipelined loop and a strength reduction on the x and y variables. It is likely that this code shows you the performance of the chip rather than that of the compiler, because most of the compiler's "thinking" is already done.

 

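{
    int i;
    float a, b, c;
    float *z = y;          /* z trails y and writes results back in place */

    /* Prologue: start the first two iterations before entering the loop. */
    a = scaler * *x++;
    b = *y++;
    c = b + a;
    a = scaler * *x++;
    b = *y++;

    /* Kernel: store result i, add for i+1, multiply and load for i+2. */
    for (i = 0; i < 148; i++) {
        *z++ = c;
        c = b + a;
        a = scaler * *x++;
        b = *y++;
    }

    /* Epilogue: drain the two iterations still in flight. */
    *z++ = c;
    c = b + a;
    *z++ = c;
}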
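For comparison, the sketch below places a straightforward loop next to a complete function built around the fragment above. Listing 1 is not reproduced here, so the function signatures, the names, and the array length of 150 (implied by the two prologue loads plus 148 loop iterations) are assumptions.

```c
#include <assert.h>

#define N 150   /* assumed length: 2 prologue loads + 148 loop iterations */

/* Straightforward form: one multiply-add per element. */
void vec_mpy_simple(float *y, const float *x, float scaler)
{
    int i;

    assert(y && x);
    for (i = 0; i < N; i++)
        y[i] += scaler * x[i];
}

/* Software-pipelined form, following the fragment above. */
void vec_mpy_pipelined(float *y, const float *x, float scaler)
{
    int i;
    float a, b, c;
    float *z = y;

    assert(y && x);
    a = scaler * *x++;
    b = *y++;
    c = b + a;
    a = scaler * *x++;
    b = *y++;
    for (i = 0; i < N - 2; i++) {
        *z++ = c;
        c = b + a;
        a = scaler * *x++;
        b = *y++;
    }
    *z++ = c;
    c = b + a;
    *z++ = c;
}
```

The pipelined version reads each y element two positions ahead of the element it writes, so both functions read the original input values and produce the same results.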

 

Another benchmark observation is that TI's C6x also stands alone on code size. Depending on the function, the C6x compiler generated code two to 20 times larger than the code for the other DSPs. One explanation is that TI's C6x programmer did a little extra inner-loop unrolling, especially on the JPEG DCT function. Another is that all instructions in the C6x are 32 bits wide, and the RISClike nature of its instruction set implies that it may take several instructions to complete an operation. Contrast this with Analog Devices' SHARC, which uses a 48-bit instruction word and can encode several operations per instruction. You can also blame a large part of the C6x's resultant code size on the compiler. Although the compiler generates tight inner loops, it also generates lengthy loop prologues and epilogues that support the DSP's software-pipelining capability. TI states that future compiler releases will strive to reduce the prologues and epilogues by as much as 75%. Instead of using prologues and epilogues, the compiler will use the C6x's conditional instruction-execution capability to fill the pipeline within the inner loop.

Portability and coding efficiency are the ultimate goals of using C vs assembly language to program a DSP. Today's DSP compilers, although much better than they were even a year ago, still have a long way to go before they completely reach these goals. Unfortunately, the ANSI C Committee (X3J11) is one of the factors inhibiting the attainment of these goals. DSP-compiler vendors anxiously await the ANSI C Committee's adoption of a common set of DSP extensions that would allow vendors to abandon the use of some of their pragmas and intrinsics.

Some system designers argue that compilers will never completely eliminate assembly programming. They say that it's just too difficult for the compiler to understand complex DSP hardware. But compilers will become more intelligent, and DSPs will become more C-friendly. And, as long as a compiler delivers a single-cycle loop, who could ask for more?


Acknowledgment

This article took a lot of effort by a long list of people. I especially want to thank Rich Scales, a great application engineer for TI's DSP. Thanks to the TI compiler team. Also thanks to the programmers David Kleidermacher (Green Hills), Vince Del Vecchio (Analog Devices), Yuval Ronen (Motorola), and Chen Sagiv (DSP Group). And who could forget Peter Torelli, the expert C programmer?


  • DSP-compiler benchmarks that test the pure capability of the compiler are difficult, because you can't separate the compiler from the DSP.

  • To achieve good performance, most DSP compilers require you to use custom pragmas and intrinsics--detracting from the portability of the code.

  • The best DSP compiler requires you to perform the least amount of tuning to your ANSI C-compliant code.

  • The simple compiler benchmarks in this article dug up some surprising results. Interestingly, every vendor admitted that these benchmarks pointed out the flaws or shortcomings of their compilers, and vendors have thus recommended fixes to the compiler architects.

  • Missing from the benchmark results are Analog Devices' 16-bit, fixed-point products, TI's C2xx, and Motorola's 56800 and 56100. These vendors supplied no benchmark data.

Benchmark targets target-independent optimizations

DSP compilers, just like compilers for µPs, support various target-independent, or classical, optimizations. These optimizations are classified into two broad categories: removal of redundancy and simplification. Removal of redundancy involves finding multiple instances in which the program computes the same value and replacing these instances with one computation. An example of such an optimization is common subexpression elimination. Simplification is a broad category of transformations that tries to find more efficient ways to compute the same result expressed by the program. An important example, strength reduction, uses the inductive properties of computations performed in loops to replace array indexing with pointers that increment through the array. There are dozens of other examples of optimizations in both categories.
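The two examples named above can be sketched as before-and-after pairs in C. The function names are hypothetical, and the "after" versions only show by hand what a compiler would do internally; they are not output from any of the benchmarked compilers.

```c
#include <assert.h>

/* Common subexpression elimination: (a + b) appears twice... */
int cse_before(int a, int b, int c, int d)
{
    return (a + b) * c + (a + b) * d;
}

/* ...and is computed once after the transformation. */
int cse_after(int a, int b, int c, int d)
{
    int t = a + b;          /* held in a register for both uses */
    return t * c + t * d;
}

/* Strength reduction: v[i] implies an address computation (base + i * size)
   on every iteration... */
long sum_indexed(const int *v, int n)
{
    long s = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* ...which a pointer that increments through the array avoids. */
long sum_pointer(const int *v, int n)
{
    long s = 0;
    const int *p = v;
    const int *end;

    assert(n >= 0);
    end = v + n;
    while (p < end)
        s += *p++;
    return s;
}
```

The pointer form maps naturally onto the post-increment addressing modes that most DSPs provide.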

One of the central problems in compiler design is that many of the transformations conflict with each other. For example, common subexpression elimination, which precomputes expressions for multiple uses, requires the program to store the computed value in a register for as long as the value is needed. This step results in increased demand for registers, possibly exceeding the target's register capacity. Issues such as these pervade compilers, and typical DSP characteristics, such as small register files, special-purpose instructions and registers, and demanding scheduling requirements exacerbate these problems. Overcoming these challenges requires careful attention to the order of the processing steps, the parameterization of the target to maintain retargetability without sacrificing optimization, and the design of the algorithms themselves.

Conventional compiler-validation suites and benchmarks fail to test many of the optimizations that compilers support. Chip makers, computer-system vendors, and companies with several development seats must understand the optimizations that their compilers (or the ones they purchase) support. Nullstone Corp's Automated Compiler Performance Analysis Tool tests for these optimizations by performing more than 6500 tests covering more than 40 compiler optimizations. To demonstrate the tool's capability, Nullstone provided EDN with the sample test results shown in Table A. For more details, you can view Nullstone's Web site at www.nullstone.com.

Optimization definitions

Branch elimination removes a branch to a branch.

Common subexpression elimination identifies expressions that have the same value, removes the second computation, and replaces it with the value of the first computation, thus avoiding the recomputation of the second expression.

Constant folding evaluates expressions with constant operands at compile time, thus improving runtime performance and reducing code size by avoiding evaluation at runtime.

Constant propagation replaces the use of a variable with the constant that was assigned to that variable in a previous assignment.

Dead-code elimination removes code that is either unreachable or that does not affect the semantics of the program, such as computations that are not used and dead stores.

Tail-recursion elimination replaces a call that a function makes to itself as its final action with a goto or branch, thus avoiding the overhead of the call and return and also reducing stack-space usage.

Integer-multiply optimization replaces expensive multiply instructions with less expensive integer add and shift instructions and is typically done for power-of-two constant multiplicands and other special bit patterns.

Hoisting moves loop-invariant expressions out of loops, thus improving runtime performance by executing the expression only once rather than on each iteration.

Forward store moves an assignment to a variable inside a loop to outside the loop and maintains the value in the loop in a register, thus improving runtime performance by reducing memory-bandwidth requirements.
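Three of these definitions are easier to follow as before-and-after pairs. The sketch below shows hoisting, forward store, and tail recursion written out by hand. All names are hypothetical, and the "after" versions illustrate the effect of each transformation rather than the output of any particular compiler.

```c
#include <assert.h>

/* Hoisting: a * b + 1 does not change inside the loop. */
void scale_before(int *v, int n, int a, int b)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        v[i] *= a * b + 1;
}

void scale_after(int *v, int n, int a, int b)
{
    int i;
    int k = a * b + 1;      /* computed once, outside the loop */

    assert(n >= 0);
    for (i = 0; i < n; i++)
        v[i] *= k;
}

/* Forward store: total lives in memory. The transformed loop keeps the
   running value in a local (a register) and stores it once at the end. */
int total;

void accumulate_before(const int *v, int n)
{
    int i;

    total = 0;
    for (i = 0; i < n; i++)
        total += v[i];      /* memory load and store on every iteration */
}

void accumulate_after(const int *v, int n)
{
    int i;
    int t = 0;

    for (i = 0; i < n; i++)
        t += v[i];
    total = t;              /* single store after the loop */
}

/* Tail recursion: the recursive call is the function's last action... */
unsigned gcd_recursive(unsigned a, unsigned b)
{
    return b == 0 ? a : gcd_recursive(b, a % b);
}

/* ...so it can become a branch back to the top of the function. */
unsigned gcd_loop(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}
```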

Table A--Nullstone performance results

| Optimization | TI C6X | GNU 56k [1] | Tasking 56k |
|---|---|---|---|
| Constant propagation | Partial [2] | Partial [2] | Yes |
| Branch elimination | Yes | Yes | Yes |
| Constant folding (literal integer constants) | Yes | Yes | Yes |
| Constant folding (literal float constants) | Yes | Yes | Yes |
| Constant folding (constant variables) | No | No | No |
| Common subexpression elimination | Yes | Yes | Yes |
| Dead-code elimination | Partial [3] | Partial [4] | Partial [4] |
| Hoisting | Yes | Yes | No [5] |
| Forward store | No | No | No |
| Integer divide | Yes | Partial [6] | Yes |
| Integer modulus | Partial [6] | Partial [6] | Yes |
| Integer multiply | Yes | Yes | Yes |
| Tail recursion | Yes | Yes | No [7] |
| Unswitching | Yes [8] | No | Yes |

[1] Both the GNU 56k and Tasking 56k support Motorola’s 56xxx DSP products. The GNU 56k compiler is free on Motorola’s Web site.
[2] Failed to propagate constants through complex flow graphs.
[3] Eliminates dead stores to autos and statics but can’t eliminate dead stores to structure members.
[4] Eliminates dead stores to autos but can’t eliminate dead stores to statics or structure members.
[5] Documentation claims that the compiler supports hoisting but fails to optimize test cases.
[6] Failed to optimize negative power-of-two constants.
[7] Documentation claims that the compiler supports tail recursion but fails to optimize test cases.
[8] IF statement converts to conditional instructions, thus avoiding the branches inside the loop.

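Table A lists unswitching, which the optimization definitions do not cover. In the usual compiler sense of the term, unswitching moves a loop-invariant conditional out of a loop by duplicating the loop, one copy per branch. The sketch below writes that out by hand; the function names are hypothetical.

```c
#include <assert.h>

/* Before: the test on negate runs on every iteration even though its
   outcome never changes inside the loop. */
void apply_before(int *v, int n, int negate)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        if (negate)
            v[i] = -v[i];
        else
            v[i] = 2 * v[i];
    }
}

/* After unswitching: the test runs once, and each loop body is
   free of branches. */
void apply_after(int *v, int n, int negate)
{
    int i;

    assert(n >= 0);
    if (negate) {
        for (i = 0; i < n; i++)
            v[i] = -v[i];
    } else {
        for (i = 0; i < n; i++)
            v[i] = 2 * v[i];
    }
}
```

The transformation trades code size for a simpler loop body. A processor with conditional instruction execution can instead keep one loop and predicate the two assignments.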
Tips for writing more efficient DSP C code
by Alan Davis, Senior Member Technical Staff,
TI's Software Development Systems Semiconductor Group

Perhaps the biggest problem facing DSP compilers is that the programmer has no direct way to express in C the DSP's expected behavior. Many operations, such as bit-reversed addressing or saturated arithmetic, are simply missing from the language. Others rely on information that there is no way to express (for example, that a pointer points only to data in program space, enabling a table-read operation). Finally, rules defined by the language standard often constrain the compiler. For example, ANSI C requires that unsigned arithmetic operations wrap around on overflow, even if this step is superfluous and detrimental to the algorithm. Another more severe problem is disambiguating memory references: Unless the compiler has enough information to prove otherwise, it must assume that two pointers in a loop may point to the same location and, therefore, must not reorder the pointers' references. Even though the programmer may know that the pointers point to distinct objects, thus making reordering safe, the compiler may be unable to derive this information from the source code.
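As a small illustration of the first point, the sketch below shows what saturated addition looks like when it has to be spelled out in C, next to the wraparound behavior that the language requires for unsigned types. The function names are hypothetical, and the fixed-width types from <stdint.h> stand in for a DSP's 16- and 32-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating 16-bit add, written out by hand because C has no operator
   for it: widen, add, then clamp to the 16-bit range. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;

    if (s > INT16_MAX)
        s = INT16_MAX;
    if (s < INT16_MIN)
        s = INT16_MIN;
    assert(s >= INT16_MIN && s <= INT16_MAX);   /* result fits in 16 bits */
    return (int16_t)s;
}

/* Unsigned arithmetic must wrap around on overflow, whether or not the
   algorithm wants that behavior. */
uint16_t wrap_add16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

A DSP with a saturation mode can do the work of sat_add16 in one instruction, but the compiler has to recognize the widen-add-clamp pattern, or be told through an intrinsic, before it can use that instruction.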

How can a programmer overcome these inherent limitations? One important guideline is making sure that the compiler has enough information. For example, at least three optimizations rely on knowing something about the iteration count of a loop: Many zero-overhead-looping mechanisms must execute the loop at least once, software pipelining depends on knowing the minimum number of iterations, and loop unrolling usually relies on knowing that the count is an even multiple of the unrolling factor. If the loop is written in C as for(i=0; i<n; i++), the variable n determines the iteration count. If you know that n always has certain properties (such as having a known minimum value or, better yet, is a constant), you can take advantage of assert statements or other standard or extended constructs that provide such declarative information.
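A minimal sketch of this guideline follows, with a hypothetical function that processes two elements per iteration. The standard assert macro only checks the condition at run time; whether a given compiler also uses it as optimization information is specific to the toolchain and is treated here as an assumption.

```c
#include <assert.h>

/* Sums n elements, two per iteration. The assert states what the
   programmer knows about n: the loop runs at least once, and the count
   is a multiple of the unrolling factor of two. */
long sum_pairs(const int *v, int n)
{
    long s = 0;
    int i;

    assert(n >= 2 && n % 2 == 0);
    for (i = 0; i < n; i += 2)
        s += v[i] + v[i + 1];
    return s;
}
```

Without such a guarantee, a compiler has to guard the loop against a zero count and add cleanup code for a leftover odd element.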

Along these lines, you can explicitly use named arrays (such as a[i]) rather than autoincremented pointers (such as *p++) to help the compiler disambiguate accesses that cannot possibly overlap.
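The sketch below shows why the pointer form ties the compiler's hands. The names are hypothetical. When a caller passes overlapping buffers, each store feeds the next load, so the result depends on the order of the memory accesses, and the compiler may not reorder them.

```c
#include <assert.h>

/* Pointer form: dst and src may overlap, so the compiler must keep each
   store ahead of the load that follows it. */
void double_ptr(int *dst, const int *src, int n)
{
    assert(n >= 0);
    while (n-- > 0)
        *dst++ = 2 * *src++;
}

/* Named-array form: in_buf and out_buf are distinct objects, so the
   loads and stores cannot overlap and may be freely reordered. */
enum { LEN = 4 };
int in_buf[LEN];
int out_buf[LEN];

void double_arrays(void)
{
    int i;

    for (i = 0; i < LEN; i++)
        out_buf[i] = 2 * in_buf[i];
}
```

Calling double_ptr(buf + 1, buf, 3) on a buffer that starts {1, 0, 0, 0} doubles the value along the array and leaves {1, 2, 4, 8}. Loading all of the source values before storing any of them would give a different result, which is why the compiler must assume the worst for the pointer form.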

Another important consideration is to write the code to match the DSP's capability. It is especially important that you use data types and write expressions that correspond to the data widths and formats of the DSP. For example, if the DSP has a 16×16-bit multiplier that produces a 32-bit result, and the compiler supports data types 'int' as 16 bits and 'long' as 32 bits, you could write an expression that matches the multiplier as long=(long)int1 * (long)int2. You would write a similar operation quite differently for targets with 32-bit or floating-point arithmetic.
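The multiplier example can be sketched as follows. The fixed-width types int16_t and int32_t from <stdint.h> stand in for the 16-bit int and 32-bit long described above, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 16 x 16 -> 32-bit multiply: casting both operands before the multiply
   tells the compiler that the full 32-bit product is wanted. */
int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* The same expression inside a multiply-accumulate loop. */
int32_t mac16(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)y[i];
    return acc;
}
```

On a target where int is 16 bits, omitting the casts would truncate the product to 16 bits before the assignment. Writing the casts on the operands, rather than on the result, is what matches the expression to the hardware multiplier.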

For more information on the DSPs and their compilers
When you contact any of the following manufacturers directly, please let them know you read about their products on EDN's website.
Analog Devices
Norwood, MA
1-617-329-4700
www.analog.com
DSP Group
Santa Clara, CA
1-408-986-4315
www.dspg.com
Green Hills Software
Santa Barbara, CA
1-805-965-6044
www.ghs.com
Hitachi America Ltd
Brisbane, CA
1-415-589-8300
www.hitachi.com
Motorola Inc
Austin, TX
1-512-891-2030
www.motorola-dsp.com
Nullstone Corp
Fremont, CA
1-510-490-6222
www.nullstone.com
Tasking
Dedham, MA
1-617-320-9400
www.tasking.com
Texas Instruments Inc
Dallas, TX
1-800-477-8924, ext 3555
www.micro.ti.com
 
Listing 1--DSP-compiler-challenge C code


Markus Levy, Technical Editor

You can reach Markus Levy at 1-916-939-1642, fax 1-916-939-1650, markuslevy@aol.com.




Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.

Table 1A--Cycle counts for unmodified benchmark results [1]

| DSP | vec_mpy1 | mac | fir | fir_no_red | latsynth | iir | codebook | jpegdct |
|---|---|---|---|---|---|---|---|---|
| DSP Group OakDSPCore [2] | 1225 (36) | 1687 (66) | 18123 (66) | 26837 (198) | 4431 (158) | 11441 (110) | 458 (130) | 4419 (662) |
| Hitachi SH-DSP (Green Hills) [2] | 2560 (38) | 2570 (52) | 30966 (70) | 31175 (134) | 3117 (126) | 3319 (114) | 339 (98) | 7936 (938) |
| Motorola 56002 (Tasking) [3] | 2416 (45) | 2420 (51) | 70916 (84) | 81520 (249) | 6202 (162) | 4712 (144) | 1336 (189) | 19144 (1251) |
| Motorola 56300 (Tasking) [2] | 1661 (45) | 1515 (54) | 35909 (75) | 41762 (204) | 2727 (135) | 2361 (114) | 1255 (138) | 6973 (918) |
| TI C54x [2] | 758 (30) | 1063 (44) | 5808 (54) | 23914 (148) | 2606 (114) | 1764 (102) | 313 (90) | 5043 (622) |
| TI C6x [4] | 319 (204) | 173 (360) | 3658 (312) | 3161 (424) | 353 (500) | 326 (476) | 96 (484) | 3902 (1444) |
| Analog Devices SHARC [5] | 767 (96) | 643 (186) | 5728 (216) | 14548 (420) | 1139 (294) | 1061 (354) | 207 (264) | 2334 (1086) |
| TI C30 [5] | 482 (28) | 794 (52) | 3180 (60) | 9538 (156) | 744 (84) | 694 (108) | 248 (140) | 6320 (1160) |

[1] These results reflect the capability of the DSP's compiler and may not indicate the maximum potential of the DSP. Numbers in parentheses indicate code size in bytes.
[2] 16-bit, fixed-point DSPs.
[3] 24-bit, fixed-point DSP.
[4] 32-bit, very-long-instruction-word (VLIW) DSP.
[5] Floating-point DSPs.

Table 1B--Cycle counts for hand-optimized benchmark results [1]

| DSP | vec_mpy1 | mac | fir | fir_no_red | latsynth | iir | codebook | jpegdct |
|---|---|---|---|---|---|---|---|---|
| DSP Group OakDSPCore [2] | 620 (36) | 327 (46) | 3068 (48) | 5883 (138) | 1439 (120) | 1181 (98) | 441 (130) | 3821 (644) |
| Hitachi SH-DSP (Green Hills) [2] | 477 (68) | 331 (64) | 3273 (100) | 4767 (96) | 757 (148) | 1509 (88) | 262 (120) | 7401 (1058) |
| Motorola 56002 (Tasking) [3] | 914 (27) | 618 (30) | 6114 (48) | 7504 (51) | 2226 (84) | 1512 (57) | 640 (123) | 13132 (1086) |
| Motorola 56300 (Tasking) [2] | 610 (27) | 314 (30) | 3458 (42) | 3853 (51) | 1132 (93) | 759 (48) | 378 (99) | 5821 (849) |
| TI C54x [2] | 758 (30) [6] | 614 (36) | 5808 (54) [6] | 13412 (110) | 1708 (88) | 1764 (102) [6] | 313 (90) [6] | 3787 (1022) |
| TI C6x [4] | 178 (560) | 173 (360) [6] | 1986 (412) | 3161 (420) [6] | 295 (568) | 225 (496) | 96 (484) [6] | 255 (2024) |
| Analog Devices SHARC [5] | 328 (132) | 342 (162) | 2874 (132) | 13051 (426) | 455 (288) | 498 (264) | 207 (264) [6] | 2348 (1092) |
| TI C30 [5] | 334 (40) | 358 (80) | 3180 (60) [6] | 9538 (156) [6] | 554 (140) | 694 (108) [6] | 248 (140) [6] | 2618 (1308) |

[1] These results reflect the capability of the DSP's compiler and may not indicate the maximum potential of the DSP. Numbers in parentheses indicate code size in bytes.
[2] 16-bit, fixed-point DSPs.
[3] 24-bit, fixed-point DSP.
[4] 32-bit, very-long-instruction-word (VLIW) DSP.
[5] Floating-point DSPs.
[6] No improvements could be made by hand-optimizations.