EDN logo


Design Feature: April 28, 1994

Mainstream applications require optimized assembly language for fast DSPs

John P. Sweeney,
AT&T Microelectronics

DSPs are fast, but you have to optimize your code to take advantage of their speed.

As DSP technology moves away from niches and into mainstream applications for consumer products, designers need to optimize DSP-code efficiency to minimize the power consumption and cost of the end product. Efficient code means you must take advantage of every nuance of your chosen DSP architecture. The examples in this article demonstrate how to use architectural features of the AT&T DSP1600 family to augment the computation-intensive inner loops of table search, filter, and convolutional encoder algorithms.

Optimization of DSP code is key to developing competitive products. Consider emerging digital cellular phones. Most prototypes demonstrated thus far require multiple fixed-point DSPs to perform the digital-cellular algorithms. Efficient coding can eliminate the need for one or more DSPs and result in a lower-cost product that requires less power, both crucial to product success in a cellular-phone market.

Three functions that are useful in a variety of communications-oriented DSP applications such as digital cellular phones include: vector-sum excited linear-prediction (VSELP) codebook search, VSELP 10-pole filter, and forward-error-correction (FEC) convolutional encoder.

You can use VSELP and FEC functions to encode digitized speech. The basic search, filter, and convolutional encoder algorithms are also inherent in many other applications.

A computation-intensive function required in many DSP table and codebook search algorithms is:

max {A(i)/B(i)},

where i=1, 2 ... N. The exact meaning of parameters A and B aren't important here. Understand, however, that the function requires a search that takes the elements A(i) and B(i) from calculated vectors of size N, and looks for the value of i that maximizes the expression.

In C-like pseudo code you could express the algorithm as:

It is inefficient to divide when using a DSP. You might want to rewrite the inner loop using an equivalent multiplication operation. You can accomplish this by rearranging the equation to eliminate divisions in favor of multiplications.

The computation-intensive parts of the algorithm are the multiplications and the comparison/selection function. Coding this algorithm efficiently in assembly code requires you to optimize your DSP code, taking advantage of architectural features that can accelerate the signal-coding function. For example, the following assembly-language algorithm for the AT&T DSP1610 or DSP1616 benefits from the conditional "ifc" instructions, which lend themselves to signal-coding applications.

This algorithm assumes RAM precalculates and stores A and B data elements as interleaved entries. The DSP1600 assembly code has a time-stationary C-like syntax. All instructions on one line execute in parallel in a single cycle, and the results are available for use by instructions on the next line (see box, "Reading DSP assembly code").

This algorithm benefits from the fact that the DSP1600 family allows both data (Y) and program/coefficient (X) addressing units to access RAM. Traditional DSPs limit program/coefficient access to ROM. The algorithm also takes advantage of the fact that the 36-bit accumulators and the 32-bit y register can store two 16-bit values or a single 32-bit value. For example, a1 initially stores the settings for A[best] in bits 16 to 31 (referred to as a1H for high) and B[best] in bits 0 to 15 (referred to as a1L for low).

Optimizing DSP code requires that you save every cycle by carefully considering each instruction step—even simple ones. For example, The XOR (exclusive or—a1 a1) instructions near the top of this code example set accumulator a1 to 0 without doing an immediate load. DSP1600 family processors can execute the XOR instruction in a single instruction cycle; an immediate load would require two instruction cycles.

Reading DSP assembly code

For the uninitiated, the DSP code in this article can appear rather cryptic, but the code is actually easy to read once you realize that the structure is analogous to the structure of a DSP. The AT&T DSP1600 family of processors and most other DSPs have four key pipelined elements: an X memory addressing unit typically used to access ROM, a Y memory addressing unit typically used to access RAM, a multiplier, and an ALU. The hallmark of DSPs is their ability to fetch two operands and perform a multiply and an accumulate in a single instruction cycle.

The structure of the DSP assembly code exactly matches the DSP architecture. A single line of code can include four operations. The code is called "time stationary" because all four instructions on a line execute simultaneously. The left-most instruction column includes only instructions relative to the ALU. The second column from the left includes multiplication instructions, and the third and fourth columns from the left are used to load the Y and X operands, respectively. The Y addressing unit has four dedicated pointers (r0, r1, r2, and r3) because you typically use it to access RAM. The X unit has a single dedicated pointer (pt).

The DSP1600 family processors also have three dedicated counters (c0, c1, and c2), two accumulators (a0 and a1) with alternate accumulators (aa0 and aa1), and three auxiliary BMU (bit-manipulation unit) registers (ar0, ar1, and ar2).

While all instructions on a single line execute in parallel in a single instruction cycle, the data flow is in different stages of the processor pipeline. Operands loaded into the X and Y registers in one line would multiply in a subsequent line, and the result would accumulate in a third line. The programmer controls the data flow through the pipeline.

The key to this algorithm's efficiency is cross multiplication, which eliminates processing overhead caused by division and the special conditional execution capabilities of the DSP. The DSP's pipelined DAU (data arithmetic unit) performs the calculations efficiently, just as most general-purpose DSPs would. After the comparison, the assembly code executes the equivalent of the pseudo code "if clause" in a single instruction cycle. The quick execution works because the DSP1610/16 instruction,

ifc CON a1=y,

performs the following operations in a single instruction cycle:

CON can be one of any number of tests on the DAU's logical flags. In the code above the instruction "ifc" pl, a1=y (if plus) tests the prior accumulator instruction a0=a0-p for a positive result. A positive result indicates that a new best value has been located. The nop in the pseudo code simply implies that the path length through the code is consistent, regardless of whether the condition passes or fails. Consistent path lengths allow for accurate prediction of worst-case power requirements—a key to commercial DSP implementation.

Another requirement for fast VSELP operation and for most DSP applications is the ability to perform adaptive synthesis filter algorithms efficiently. The following example demonstrates a 10th-order IIR filter implementation. The adaptation is carried out in the lattice-filter domain, but the implementation is applied in direct form. Because of the wide dynamic range of speech and the fixed-point nature of the DSP1600 family some scaling is required during the application of the filter.

This algorithm assumes that buffer r0 points to the feedback-state variable buffer, r1 points to the input-data buffer, r2 points to the output-data buffer, and pointer pt points to the filter coefficients. The basic function of the algorithm is to perform the dot product of the feedback states and the filter coefficients by multiplying each state/coefficient pair and summing the results.

The first pass of the filter calculates an initial value for the feedback state and stores the value in accumulator a1. By segmenting the code into first-pass and main-filter loops, you also ensure that the main part of the filter routine will fit into the instruction cache on the DSP1610/16, thereby speeding the repetitive multiply/accumulate process.

The first pass also demonstrates another way that you can code even simple instructions to save DSP cycles. The first two accumulator instructions, a0=p and a0=a0-p, actually serve to set accumulator a0=0. An explicit a0=0 instruction would have required two dedicated instruction cycles to perform the immediate load operation. Even an XOR operation requires one dedicated instruction cycle because the instruction can't execute concurrently with pipeline load instructions. The initialization is essentially performed for free (with no instruction cycle overhead). The first two accumulator instructions execute in parallel with the two instructions necessary to load the pipeline before performing the do loop.

This algorithm processes 40 data words in a VSELP speech coder.

Note the following features of the filter algorithm. Cyclic addressing is applied to the Y-space memory accesses. The DSP1600 allows you to define a buffer that is pointed to by r0, r1, r2, or r3 (four pointers that are dedicated to the Y memory space) and bounded by rb and re. In this case, registers rb and re define a 10-word cyclic buffer. The DSP1600 performs the pointer-reset operation with no instruction- cycle overhead. The X memory fetches operate in another RAM buffer, and the reset of the X pointer is reset explicitly using the post modification of the pt register by i.

The efficiency of this filter algorithm comes as a result of the code's use of the instruction cache integrated on the DSP1610 and DSP1616 processors. You can use the DSP1600 "do" operation to explicitly load the instruction cache. The cache holds the repeating section of the filter routine that computes the dot product. The instructions execute in two cycles the first time through the loop. Subsequent iterations of the filter don't require an instruction fetch from memory, and, therefore, execute in a single instruction cycle.

The algorithm works properly because the coefficient fetches from the X memory space are fetches from the internal RAM of the DSP. The internal RAM is dual-ported and accessible by both X and Y addressing units. Therefore, the adaptive coefficients can be read by the X space and written by the Y space. Each loop iteration performs an arbitrary scaling of the denominator before final output rounding. The two accumulators implement a delayed-feedback pipeline to minimize instruction cycles and memory accesses. The first pass of the filter sets up the delayed feedback by loading a1 with an initial state, and each pass through the main routine also passes a value for a1 back to the beginning of the loop.

You may notice that the main part of the filter loop includes the following line:

a0=a0+p p=x*y y=*r0++ x=*pt++

This line is repeated exactly seven times. The code for the first part of the filter used a do construct to simplify these load/multiply/accumulate lines and to take advantage of the instruction cache. The main part of the filter, however, needs the instruction cache to be loaded with instructions that speed iterations of the main loop. Adding a looping construct with a conditional "if" statement would reduce the number of lines of code and would also add instruction-cycle overhead. Explicitly repeating the seven lines ensures single cycle execution due to the use of the instruction cache.

Another computation-intensive algorithm found in many DSP applications is convolution. Convolutional encoders, for example, are used to add structure and redundancy to bit streams that need to be transmitted via a medium that isn't guaranteed error-free. The structure that convolution adds, FEC, allows errors to be corrected at the receiver.

A convolutional encoder uses a generating equation that expresses output bits as a combination of delayed input bits. The amount of delay is specified by an exponential power of the unit-delay operator D. You can combine the bits using an XOR function. For example, consider the expression:

g(D)=1 + D3.

The diagram in Fig 1 represents this generating expression. The output bit depends only on previous inputs and not on previous outputs. Thus, the encoder has a finite impulse response. The maximum delay (the current element plus the total number of delay elements) in the generating polynomial is called the "constraint length" of the encoder. For example, an encoder with a constraint length of six has a maximum of five delay elements and 32 discrete states that it can assume.

The delay elements are essentially single-bit latches. The group of three delays in Fig 1 can be thought of as a 3-bit shift register. The content of the 3 bits at any time constitutes a discrete encoder state out of eight possible discrete states. The algorithm below implements the 1+D3 encoder described in the example above:

The algorithm reads 16-bit words one at a time from the end of a bit-stream buffer. The loaded accumulator is left- shifted by 4 bits into the accumulator guard bits. By shifting the input word into guard bits, the algorithm ensures that the most-significant accumulator bit is not affected by the sign bit during later accumulator operations. The accumulator can be restored after the XOR operation. The delayed bits form in a1 by left-shifting a0 by 3 bits, representing the input-bit stream delayed by 3 bits. The XOR operation generates a 16-bit word that represents 16 bits of the bit stream processed by the polynomial.

Note that the algorithm reads two 16-bit words into accumulator a0 yet generates only a single 16-bit output. The second 16-bit input word is required to generate the proper output bits for the 3 least-significant bits of the first 16-bit word.

Advanced signal coders for applications such as digital cellular use two or more polynomials with several delay elements. A first polynomial adds structure to the bit stream, and a second polynomial adds redundancy. The two polynomials generate 2 output bits (called a di-bit) for each input bit. This is known as a rate ½ convolutional encoder.

You can implement such a convolutional encoder using an efficient parallel algorithm that encodes 16 bits in a single pass. To optimize the implementation, you must store the input-bit stream in discrete 16-bit words and feed it efficiently into the generating equations. The key to efficient implementation is a DSP architecture that includes barrel- shift capabilities and an instruction set that supports arbitrary shift distances.

A convolutional encoder for North American digital cellular creates the two output words using the equations specified by the following expressions:

g0(D) = 1 + D + D3 + D5

g1(D) = 1 + D2 + D3 + D4 + D5.

An efficient way to implement this algorithm is to calculate g0(D) and then calculate g1(D) using the following expression:

g1(D) = g0(D) + D + D2 + D4.

The XOR operation used to sum bits allows the dual occurrence of the D term to cancel each other thus properly removing the D term from the g1(D) expression.

The following algorithm implements an encoder for the above equations:

Note that counter c0 is set to -5 but is expressed as 1-6 for readability; this example makes six passes through the loop starting at ConvLoop. Six passes are necessary to encode each speech segment in a North American digital cellular phone.

Also note that c0 is never explicitly incremented because the DSP1600 "if c0" statement found later in the code automatically increments the counter.

Registers ar0, ar1, and ar2 are auxiliary BMU registers. Using these registers to store shift distances allows the actual shift instructions to execute in a single instruction cycle. A shift instruction that uses immediate operands such as 1, 2, and 4 would require two instruction cycles.

The instruction a1=a0:aa0 performs three operations. The instruction swaps the contents of accumulator a0 with the contents of alternate accumulator aa0. The instruction stores the value that was originally in aa0 in a1, although that isn't important here because a1 is reassigned in the next instruction, and the instruction sets the DAU logical flags based on the value moved into a1.

Calculating g0 and g1 is straightforward, provided that the DSP can efficiently handle arbitrary shifts . The above algorithm generates the delayed terms by left-shifting the 32 bits of data in a1. Delayed terms are accumulated using XOR operations on a0. At this point in the code, alternate accumulator aa0 holds g0 and accumulator a0 holds g1.

Almost any encoding application, however, requires that the di-bits are interleaved for transmission. The following DSP1610/16 code performs the interleave operation, taking advantage of the instruction cache and the ability to swap the contents of aa0 and a0 in a single instruction cycle, eliminating the overhead that traditional DSPs would encounter due to dedicated data-load instructions.

The left shift of a1 makes room for each succeeding bit of the interleaved result. The next instruction swaps g0 and g1 between a0 and aa0 and sets the DAU flags based on the new contents of a0. The "if minus" instruction essentially tests the most-significant bit of a0 from the previous instruction and adds 1 to a1, which holds the result, if the most-significant bit in a0 is a 1. By adding 1 to a1, the instruction sets the least significant bit in a1 to a 1. The final left-shift of a0 prepares the current contents of a0 to test the next bit two iterations of the loop later (because of the swap of a0 and aa0). After 32 iterations of the loop, the 16-bit values that g0 and g1 represent have been interleaved, and the resulting 32 bits have been stored in a buffer pointed to by r1.

Once the instruction-cache do loop completes the interleave for one 16-bit input, the "if c0" instruction tests to see if the counter has reached 0. The instruction automatically increments c0 after each test, so that, after the sixth time through the loop, the test fails, and processing is done.


Author's biography

John P Sweeney is a senior applications engineer with AT&T Microelectronics, where he provides support for system hardware and software used in digital-cellular applications. He holds a BSEE degree from Villanova University, Villanova, PA, and in his spare time enjoys music, high-fidelity equipment, racquetball, tennis, and golf.


| EDN Access | feedback | subscribe to EDN! |
| design features | design ideas |


Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.