
As DSP technology moves away from niches and into mainstream applications for consumer products, designers need to optimize DSP-code efficiency to minimize the power consumption and cost of the end product. Efficient code means you must take advantage of every nuance of your chosen DSP architecture. The examples in this article demonstrate how to use architectural features of the AT&T DSP1600 family to augment the computation-intensive inner loops of table search, filter, and convolutional encoder algorithms.
Optimization of DSP code is key to developing competitive products. Consider emerging digital cellular phones. Most prototypes demonstrated thus far require multiple fixed-point DSPs to perform the digital-cellular algorithms. Efficient coding can eliminate the need for one or more DSPs and result in a lower-cost product that requires less power, both crucial to product success in a cellular-phone market.
Three functions that are useful in a variety of communications-oriented DSP applications such as digital cellular phones include: vector-sum excited linear-prediction (VSELP) codebook search, VSELP 10-pole filter, and forward-error-correction (FEC) convolutional encoder.
You can use VSELP and FEC functions to encode digitized speech. The basic search, filter, and convolutional encoder algorithms are also inherent in many other applications.
A computation-intensive function required in many DSP table and codebook search algorithms is:
where i=1, 2 ... N. The exact meaning of parameters A and B aren't important here. Understand, however, that the function requires a search that takes the elements A(i) and B(i) from calculated vectors of size N, and looks for the value of i that maximizes the expression.
In C-like pseudo code you could express the algorithm as:


This algorithm assumes RAM precalculates and stores A and B data elements as interleaved entries. The DSP1600 assembly code has a time-stationary C-like syntax. All instructions on one line execute in parallel in a single cycle, and the results are available for use by instructions on the next line (see box, "Reading DSP assembly code").

Optimizing DSP code requires that you save every cycle by carefully considering each instruction stepeven simple ones. For example, The XOR (exclusive ora1 a1) instructions near the top of this code example set accumulator a1 to 0 without doing an immediate load. DSP1600 family processors can execute the XOR instruction in a single instruction cycle; an immediate load would require two instruction cycles.
Reading DSP assembly codeThe structure of the DSP assembly code exactly matches the DSP architecture. A single line of code can include four operations. The code is called "time stationary" because all four instructions on a line execute simultaneously. The left-most instruction column includes only instructions relative to the ALU. The second column from the left includes multiplication instructions, and the third and fourth columns from the left are used to load the Y and X operands, respectively. The Y addressing unit has four dedicated pointers (r0, r1, r2, and r3) because you typically use it to access RAM. The X unit has a single dedicated pointer (pt). The DSP1600 family processors also have three dedicated counters (c0, c1, and c2), two accumulators (a0 and a1) with alternate accumulators (aa0 and aa1), and three auxiliary BMU (bit-manipulation unit) registers (ar0, ar1, and ar2). While all instructions on a single line execute in parallel in a single instruction cycle, the data flow is in different stages of the processor pipeline. Operands loaded into the X and Y registers in one line would multiply in a subsequent line, and the result would accumulate in a third line. The programmer controls the data flow through the pipeline.
|
The key to this algorithm's efficiency is cross multiplication, which eliminates processing overhead caused by division and the special conditional execution capabilities of the DSP. The DSP's pipelined DAU (data arithmetic unit) performs the calculations efficiently, just as most general-purpose DSPs would. After the comparison, the assembly code executes the equivalent of the pseudo code "if clause" in a single instruction cycle. The quick execution works because the DSP1610/16 instruction,

Another requirement for fast VSELP operation and for most DSP applications is the ability to perform adaptive synthesis filter algorithms efficiently. The following example demonstrates a 10th-order IIR filter implementation. The adaptation is carried out in the lattice-filter domain, but the implementation is applied in direct form. Because of the wide dynamic range of speech and the fixed-point nature of the DSP1600 family some scaling is required during the application of the filter.
This algorithm assumes that buffer r0 points to the feedback-state variable buffer, r1 points to the input-data buffer, r2 points to the output-data buffer, and pointer pt points to the filter coefficients. The basic function of the algorithm is to perform the dot product of the feedback states and the filter coefficients by multiplying each state/coefficient pair and summing the results.

The first pass also demonstrates another way that you can code even simple instructions to save DSP cycles. The first two accumulator instructions, a0=p and a0=a0-p, actually serve to set accumulator a0=0. An explicit a0=0 instruction would have required two dedicated instruction cycles to perform the immediate load operation. Even an XOR operation requires one dedicated instruction cycle because the instruction can't execute concurrently with pipeline load instructions. The initialization is essentially performed for free (with no instruction cycle overhead). The first two accumulator instructions execute in parallel with the two instructions necessary to load the pipeline before performing the do loop.
This algorithm processes 40 data words in a VSELP speech coder.

The efficiency of this filter algorithm comes as a result of the code's use of the instruction cache integrated on the DSP1610 and DSP1616 processors. You can use the DSP1600 "do" operation to explicitly load the instruction cache. The cache holds the repeating section of the filter routine that computes the dot product. The instructions execute in two cycles the first time through the loop. Subsequent iterations of the filter don't require an instruction fetch from memory, and, therefore, execute in a single instruction cycle.
The algorithm works properly because the coefficient fetches from the X memory space are fetches from the internal RAM of the DSP. The internal RAM is dual-ported and accessible by both X and Y addressing units. Therefore, the adaptive coefficients can be read by the X space and written by the Y space. Each loop iteration performs an arbitrary scaling of the denominator before final output rounding. The two accumulators implement a delayed-feedback pipeline to minimize instruction cycles and memory accesses. The first pass of the filter sets up the delayed feedback by loading a1 with an initial state, and each pass through the main routine also passes a value for a1 back to the beginning of the loop.
You may notice that the main part of the filter loop includes the following line:
Another computation-intensive algorithm found in many DSP applications is convolution. Convolutional encoders, for example, are used to add structure and redundancy to bit streams that need to be transmitted via a medium that isn't guaranteed error-free. The structure that convolution adds, FEC, allows errors to be corrected at the receiver.
A convolutional encoder uses a generating equation that expresses output bits as a combination of delayed input bits. The amount of delay is specified by an exponential power of the unit-delay operator D. You can combine the bits using an XOR function. For example, consider the expression:
The delay elements are essentially single-bit latches. The group of three delays in Fig 1 can be thought of as a 3-bit shift register. The content of the 3 bits at any time constitutes a discrete encoder state out of eight possible discrete states. The algorithm below implements the 1+D3 encoder described in the example above:

Note that the algorithm reads two 16-bit words into accumulator a0 yet generates only a single 16-bit output. The second 16-bit input word is required to generate the proper output bits for the 3 least-significant bits of the first 16-bit word.
Advanced signal coders for applications such as digital cellular use two or more polynomials with several delay elements. A first polynomial adds structure to the bit stream, and a second polynomial adds redundancy. The two polynomials generate 2 output bits (called a di-bit) for each input bit. This is known as a rate ½ convolutional encoder.
You can implement such a convolutional encoder using an efficient parallel algorithm that encodes 16 bits in a single pass. To optimize the implementation, you must store the input-bit stream in discrete 16-bit words and feed it efficiently into the generating equations. The key to efficient implementation is a DSP architecture that includes barrel- shift capabilities and an instruction set that supports arbitrary shift distances.
A convolutional encoder for North American digital cellular creates the two output words using the equations specified by the following expressions:
The following algorithm implements an encoder for the above equations:

Also note that c0 is never explicitly incremented because the DSP1600 "if c0" statement found later in the code automatically increments the counter.



Almost any encoding application, however, requires that the di-bits are interleaved for transmission. The following DSP1610/16 code performs the interleave operation, taking advantage of the instruction cache and the ability to swap the contents of aa0 and a0 in a single instruction cycle, eliminating the overhead that traditional DSPs would encounter due to dedicated data-load instructions.

Once the instruction-cache do loop completes the interleave for one 16-bit input, the "if c0" instruction tests to see if the counter has reached 0. The instruction automatically increments c0 after each test, so that, after the sixth time through the loop, the test fails, and processing is done.
John P Sweeney is a senior applications engineer with AT&T Microelectronics, where he provides support for system hardware and software used in digital-cellular applications. He holds a BSEE degree from Villanova University, Villanova, PA, and in his spare time enjoys music, high-fidelity equipment, racquetball, tennis, and golf.