EDN logo


Design Feature: August 17, 1995

A distributed arithmetic approach to designing scalable DSP chips

Bernie New,
Xilinx Inc

You can use distributed arithmetic to design DSP chips. Using look-up tables, it lets you develop hardware implementations of DSP algorithms in FPGAs. Pipelining lets you trade off speed vs logic resources.

At the heart of most DSP algorithms is a multiply-and-accumulate (MAC) function that you can implement more efficiently with distributed-arithmetic techniques than by using conventional arithmetic methods. Distributed arithmetic makes extensive use of look-up tables (LUTs), which makes it ideal for implementing DSP functions in LUT-based field-programmable gate arrays (FPGAs).

An additional advantage is an easy trade-off between output sample rate and resource utilization. You can use a minimum set of LUTs and adders in a sequential implementation to minimize cost. For higher speed, you replicate the set of LUTs and include more adders. You can vary the number of LUTs used to achieve the desired speed. If enough LUTs are available to process all the incoming data bits simultaneously, the chip can perform MAC operations at the full clock rate.


Distributed arithmetic

Distributed arithmetic differs from conventional arithmetic only in the order in which it performs operations. As an example, Fig 1 is a conventional four-product MAC that uses a sequential shift-and-add technique to multiply four pairs of numbers and sum the results. Each multiplier forms partial products by multiplying the coefficient by 1 bit of data at a time in an AND function. The technique then adds these partial products into an accumulator that shifts 1 bit position each clock cycle, thus appropriately weighting the partial products. The arithmetic then simultaneously performs four multiplications and sums the results when the products are complete. During each clock cycle, the four multipliers create four partial products, all with the same weight, which eventually sum into the output. Distributed arithmetic adds the partial products before, rather than after, weighing and accumulating them.

Fig 2 shows the same MAC, which reorders the operations through the use of distributed arithmetic. The technique reduces the number of shift-and-add circuits to one, but does not change the number of simple adders.

The coefficients in many DSP applications are constants. Consequently, the output of the AND functions and first three adders of Fig 2 depends only on the 4 input bits from the shift registers. Replacing these AND functions and adders with a simple 16-word LUT gives you the final form of a distributed-arithmetic MAC (Fig 3).


LUT definition

The LUT data comprise various sums of the four coefficients. If the 4 data bits that address the LUT are all 1, the output is the sum of all four coefficients. Any data bit that is a 0 eliminates the corresponding coefficient from the sum. Because the address contains all possible combinations of 1 and 0 inputs, the LUT contains all possible sums of the coefficients (Table 1).

In the four-coefficient example, the width of the LUT is typically 2 bits greater than the coefficient width. The additional 2 bits allow for word growth from the addition of the coefficients. The circuit requires at most 2 bits because the circuit is summing no more than four coefficients. Fewer additional bits may be adequate in cases in which the combinations of coefficients result in less word growth. More than four coefficients require more space for word growth. In general, the number of additional bits should be at least log2 (number of coefficients). You can use fewer bits only when you have determined the coefficients and computed the LUT contents to fit into a narrower word.

As the number of coefficients increases, the size of the LUT grows exponentially. You can avoid the problem of large LUTs by partitioning the multiplications into smaller groups and combining the LUT outputs with adders. The adders are less costly than the larger LUT. For example, if an FPGA uses 16-bit LUTs as four-input function generators, the optimum partition is four products per LUT. You can partition a 16-product MAC in this way (Fig 4). Instead of increasing the LUT size by a factor of 16, four LUTs are the same size as the one used in the four-product MAC of Fig 3. Using built-in carry logic, the adders are approximately the same size as the LUTs, and the total growth is a factor of only seven. You can program the small LUTs in the same way as in the four-product MAC.

FIR filters and MACs differ only in how they handle the input data (Fig 5). The MAC has a series of parallel-to-serial converters, but the filter has one long, bitwide shift register that taps into every word and parallel-loads only in its first word. The shift register operates by parallel-loading the new data sample into the first word of the shift register. The shift register then shifts all the data until they move 1 word, presenting each word, 1 bit at a time, to the LUTs. When the shifting is complete, the first word is empty and is ready to again parallel-load. The data already in the shift register position for the appropriate coefficient to multiply the data in the next filter cycle and lose the oldest data word. You can construct such a shift register using flip-flops. A more efficient implementation, however, uses RAM for all words except the first.

In addition to providing ROM LUTs, you can configure the function generators in SRAM-based FPGAs as 16-bit RAMs, which suit the data-storage needs of digital filters. The following example uses a read-modify-write technique for a distributed-arithmetic FIR filter. The example uses Xilinx's XC4000 FPGA family. Individual 1-bit-wide RAMs each hold the 3 bits of a data word at successive addresses (Fig 6). As each bit is read and presented to the LUT, it is rewritten to the same address but in the adjacent RAM. As the multiplication continues, the data simultaneously transfer to the next RAM.

You have to implement only the first data word with flip-flops because of the need to parallel-load new data. As the first word shifts into the multiplier, the multiplier provides the input to the first RAM. All the RAMs operate synchronously and can share an address counter. This approach leads to the FIR filter design shown in Fig 7.


Size and performance

Each 8x1-bit RAM that stores input data uses one-half of a configurable-logic block (CLB) in the XC4000 family. A single 3-bit counter, requiring 1.5 CLBs, addresses all of the RAMs. The parallel-to-serial converter for the newest data word uses four CLBs. You can implement the complete data storage using only 14 CLBs.

The maximum value that an LUT must store is the sum of four coefficients. Consequently, the width of the LUT is at most 2 bits greater than the coefficient width. With 8-bit coefficients and four inputs per LUT, each LUT uses five CLBs. The 16-coefficient filter, thus, uses 20 CLBs to implement the LUTs.

To maintain full arithmetic precision through the filter, the first adder stage that combines the LUT outputs must handle 11 bits, and the second adder stage must handle 12 bits. You can implement all three adders with 21 CLBs. The final accumulator must be able to hold as many as 20 bits; this task requires 10 CLBs. The overall CLB requirement is, at most, 65 CLBs. You may need fewer if some coefficients have low values or if the summation truncates or rounds some data. Even with the maximum CLB count of 65, you can implement the entire filter in two-thirds of an XC4003, the second smallest device in the XC4000 family, with its 10x10 array of CLBs. The same filter takes less than 7% of an XC4025, the largest family member, with its 32x32 CLB array.

Using the CLB's flip-flops, you can pipeline the filter at every stage at no additional cost. If you pipeline with the flip-flops, the speed-determining factor is the widest adder, which is in the final accumulator. The active width of this adder is only 13 bits, thanks to the sequential LSB-first operation. Consequently, you can clock the filter at approximately 40 MHz when you use the XC4000 FPGA family. With eight clock cycles for each data sample, the maximum data rate through the filter is approximately 5 MHz.

The previously described design is the slow-speed/low-cost endpoint in a speed/cost continuum. Storage cost is at its lowest because both data and coefficient stores are at their minimum. Conceivably, you could multiplex the LUTs into a smaller adder tree. With the XC4000 family, however, additional multiplexers are no smaller than the adders they replace. You would be sacrificing performance for little or no saving of FPGA logic resources.

You can improve performance by implementing additional logic on the FPGA. Generating a filter output sample accesses each LUT as many times as there are bits in the data word. Consequently, both the adder-determined clock period and the data-determined number of clocks per output sample limit the throughput.

Simply duplicating the LUTs so that they can simultaneously handle 2 data bits halves the clock-cycle requirement and doubles the throughput. You must correspondingly increase the size of the adder tree, but this increase significantly affects only the filter latency, not the clock period. You can obtain even higher throughput by further replicating the LUTs. Ultimately, you can replicate the LUTs as many times as there are bits in the input data. When you reach this point, all the necessary LUT accesses occur simultaneously, and every clock cycle can produce a filter output. This technique yields a fully parallel distributed-arithmetic filter.

Fig 8 shows part of the parallel version of the FIR filter. The version replicates each of the four LUTs in the sequential design eight times and weights the outputs according to the weight of the input bits. This version of the filter can simultaneously handle all of the data bits. To implement the circuit, you need 160 CLBs, five each for the 32 LUTs. The adder tree that combines the 32 LUT outputs comprises 31 adders. The adders vary in size from 11 bits in the first stage to 16 bits in the final stage. The tree provides a full 20-bit output, but a 16-bit adder is adequate because the 4 LSBs do not participate in the final addition. You can implement all 31 adders in a little more than 200 CLBs. To maximize performance, the flip-flops in these CLBs provide interstage pipelining at no additional expense.

You must also use flip-flops for the input data store. RAM storage is inappropriate because all 128 data bits must be available simultaneously. This store is organized as a 16-word, 8-bit parallel-shift register and uses both available flip-flops in each of 64 CLBs. This store does not use the logic resources in these CLBs, which remain available for other purposes. The store requires approximately 430 CLBs, or 75% of the 24x24 CLB array in the XC4013 FPGA. You can operate the filter at clock frequencies up to 40 MHz, which the settling time of the final 16-bit adder determines. The latency is seven clock cycles with one clock per sample, and 40 MHz is also the maximum sample rate.

This example demonstrates the speed/cost scalability of the distributed-arithmetic technique. Between the sequential design and the fully parallel design is an eightfold performance increase for a sevenfold increase in CLB count. In between, opportunities exist to increase performance by a factor of two or four, with corresponding increases in logic requirements.

The distributed-arithmetic technique is not limited to basic FIR filters; you can also implement the following other common types of filters:

Linear-phase FIR filters: Linear-phase delay is synonymous with constant-time delay, a highly desirable filter characteristic. These filters delay all frequency components of the signal for the same length of time and do not change their phase relationships. Other than the attenuation of some frequencies, the output signal is simply a delayed replica of the input. This filter avoids the dramatic distortion of the waveform that can result if you allow nonlinear phase delay to disturb the phase relationships.

Only FIR filters can provide an ideal linear-phase response and, surprisingly, linear-phase FIR filters are easier to implement than are general FIR filters. The phase response is linear only if the filter coefficients are symmetrical. The first and last coefficients must be equal, the second and the second-to-last must also be equal, and so on. You can add data words that are to be multiplied by equal coefficients before they are multiplied, thus reducing the required number of multiplications. Sequential distributed-arithmetic implementations halve the number of LUTs and correspondingly reduce the size of the adder tree. Because new premultiplication adders operate on serial-data streams, you can use simple 1-bit carry-save adders (Fig 9). These 1-bit adders are inexpensive and require only one CLB each, resulting in a net saving of CLBs.

Decimating FIR filters: Another unique ability of FIR filters is simple, easily implemented decimation. "Decimation" refers to limiting the signal bandwidth of a filter, which reduces the sample rate. Rather than calculating all the output samples and then discarding the unwanted ones, a FIR filter permits only the needed ones to be computed. This simplification greatly reduces the arithmetic bandwidth and is possible because each output is an independent calculation, depending only on input data.

Fig 10 shows how you get decimation by a factor of two by entering alternate input samples into separate shift registers. You then interleave the shift-register outputs into the LUTs. Two input-sample times are available for the filter to produce an output. The sample that arrives during the filter cycle is stored and entered at the same time as the one that arrives at the end of the cycle.

IIR filters: IIR filters involve feedback. Each output sample depends not only on current and prior inputs, but also on prior outputs. Consequently, an IIR filter does not reduce arithmetic because you must compute all outputs for internal filter use, even if the filter's output does not required them. You can only approximate linear-phase response because you cannot achieve the ideally symmetrical impulse response that FIR coefficients guarantee. IIR filters are very efficient in terms of filter quality for a given arithmetic bandwidth, however.

A common IIR implementation technique is to cascade small filters instead of building one large filter. Small filters are more stable and are less prone to arithmetic noise. The most common such building block, is the five-coefficient biquad-section filter (Fig 11). Biquad sections derive each output from three input samples and two previous outputs. You can conveniently implement a biquad section using a single 32-word LUT. Two shift registers, one from the input and the other from the output, provide LUT addresses.


Author's Biography


Bernie New has served at Xilinx (San Jose, CA) for four years as a staff applications engineer. His duties include writing application notes, assisting customers, and participating in new-product design. Before joining Xilinx, he served as product-planning manager at Advanced Micro Devices.


Acknowledgment
I would like to thank Les Mintzer of Momentum Data Systems and Bruce Newgard of Xilinx for their work on distributed arithmetic that led to this article.


Table 1—Contents of a 16-word look-up table
LUT inputLUT contents
Ai
Bi
Ci
Di
0
00000
1000
A
0100
B
1100
A+B
0010
C
1010
A+C
0110
B+C
1110
A+B+C
0001
D
1001
A+D
0101
B+D
1101
A+B+D
0011
C+D
1011
A+C+D
0111
B+C+D
1111
A+B+C+D


| EDN Access | feedback | subscribe to EDN! |
| design features | design ideas |


Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.