EDN logo

Design Features


Implement DSP functions in FPGAs to reduce cost and boost performance

Gregory Ray Goslin, Xilinx Inc


A programmable DSP isn't always the best choice to implement a signal-processing function, such as an FIR filter or Viterbi decoder. By hardwiring specific DSP functions in an FPGA, you can boost system performance and reduce cost.

  Traditionally, you implement DSP algorithms using general-purpose, programmable DSPs for low-data-rate applications or using application-specific, fixed-function DSP chip sets and ASICs for high-data-rate applications. Advancements in FPGAs provide new options for DSP-design engineers. FPGAs maintain the advantages of custom functionality, like an ASIC, but avoid high development costs and the inability to make design modifications after production. When building a DSP system in an FPGA, designers can take advantage of parallel structures and data flow through the FPGA and implement DSP algorithms in a way that minimizes resources and exceeds the performance of systems that use general-purpose DSPs.

  An SRAM-based FPGA is well-suited for arithmetic DSP functions, including multiply and accumulate (MAC) functions, FFTs, convolutions, and other filtering algorithms. You can also reconfigure the FPGA on the fly to perform different DSP functions or other system-level functions. FIR-filter and Viterbi-decoder designs show the efficiency of an FPGA approach. Moreover, you can use these techniques to increase efficiency in any system that relies on programmable processors—even microcontroller-based systems.

Viterbi-decoder design

  A Viterbi decoder (Reference 1) provides a good example of how an FPGA can accelerate a DSP function. You can use Viterbi decoders in a variety of communication applications to provide forward error correction. Viterbi decoders require a noise-cancellation function, and you can implement the computation-intensive portion of that function using an FPGA. In this case, the FPGA works alongside a DSP that implements the remainder of the Viterbi-decoder algorithm.

  Figure 1 shows a block diagram that determines minimum and maximum values from a two-channel incoming data stream. Registering new minimum and maximum values allows a DSP handling noise cancellation to adjust the gain and other parameters in the filters. One way to approach the design is to code the function using a DSP. The problem with doing so is the serial nature of a DSP. A flow chart (Figure 2) illustrates the way a DSP has to sequentially execute a function, even though the lack of data dependencies allows parallel execution.

  Running on a 66-MHz Motorola 56300 DSP, an algorithm used to calculate the "New_1'' and "New_2'' outputs (Figure 1) requires 17 computational clock cycles, plus seven additional clock cycles due to the wait state associated with external SRAM. The algorithm requires 360 nsec (24 clock cycles315 nsec) of processing time. This time does not include the calculation of the "Diff_1'' and "Diff_2'' outputs. These two outputs require an additional seven clock cycles.

  Two factors limit the data throughput for this DSP-based design. First, the wait state associated with the DSP's external SRAM requires two 15-nsec clock cycles for each memory access. Second, each add/subtract and multiplexer stage must be performed sequentially with additional wait states. The add/subtract stages require four additional operations with multiple instructions.

An FPGA-based Viterbi decoder

  An FPGA implementation can leverage the ability to process parallel datapaths and take advantage of the parallel structures of the four add/subtract blocks in the first stage and the two subtract blocks in the second stage. The two multiplexer blocks take advantage of the ability to register and hold the input data until needed with no external memory or additional clock cycles.

  Based on a 66-MHz clock, an FPGA-based implementation requires 135 nsec (nine clock cycles315 nsec) of total processing time. This timing assumes that the three inputs and four outputs must be sequentially read or written to from a common I/O bus. The actual propagation time through the add/subtract and multiplexer blocks is only two clock cycles. The 135-nsec propagation time total includes production of the "Diff_1'' and "Diff_2'' outputs as opposed to the 360 nsec required for the partial output data by the DSP. The FPGA-based implementation replaces a programmable DSP and three SRAM chips and significantly boosts performance.

  You can take advantage of the performance boost to increase data throughput through the noise-cancellation function, or you can minimize the FPGA implementation to make room for other functions. For example, because the FPGA implementation is more than double the speed of the DSP, you can also implement the design in a multiplexed version. The multiplexed implementation minimizes the number of configurable logic blocks (CLBs) in the FPGA that are dedicated to the minimum/maximum function. Minimization is done by observing the symmetrical nature of the design. Note that "Old_1'' and "Old_2'' data follow the same path with only a minor difference in the magnitude of the second stage subblock. This implementation requires a specific ordering for data accesses on the I/O data bus.

  Recognizing the value of an FPGA in the Viterbi-decoder example is relatively simple. Moreover, implementing the function requires a simple combination of adders, multiplexers, and registers. To find such a candidate for an FPGA implementation, you need to recognize only cases where a typical DSP implementation contains many feedback loops or parallel structures (Figure 2). The software code for a DSP algorithm of this type is not efficiently implemented in a general-purpose DSP. Typically, approximately 10 to 30% of the DSP code uses 60 to 80% of the DSP's power. By analyzing the DSP algorithm and partitioning any parallel structures or repetitive loops into multiple datapaths, you can enhance the overall performance of the algorithm.

Distributed arithmetic and filters

  In some cases, however, it may be easier to recognize an algorithm that's inefficient on a DSP than to design an FPGA replacement. You may, in fact, need to rely on alternative ways of approaching math problems. For example, you can use distributed-arithmetic techniques for array multiplication in an FPGA to increase data bandwidth and throughput by several orders of magnitude over DSP implementations (Reference 2). Moreover, you can use distributed-arithmetic techniques to minimize the number of CLBs that an FPGA requires to implement a function, such as a filter.

  Consider the 16-tap, 8-bit, fixed-point FIR filter (Reference 3) in Figure 3. The keystone of this, and most, filter algorithms is a MAC function. Filter designs can require a few to thousands of MACs. As the number of MACs increases, the algorithm becomes increasingly complex for a CPU-based architecture. Hence, the algorithm becomes more computation-intensive for any conventional DSP.

  You can more efficiently implement the MAC function using various distributed-arithmetic techniques than with conventional-arithmetic methods. Although the FPGA has no dedicated multiplier, you can use it to implement a MAC using distributed arithmetic.

  Distributed arithmetic differs from conventional arithmetic only in the order in which it performs operations. To understand how to apply the techniques, consider the problem of designing a four-product MAC (Figure 4) using traditional arithmetic. Then, consider a form of distributed arithmetic called "serial distributed arithmetic" (SDA).

  The four-product MAC employs a conventional sequential shift-and-add technique to multiply four pairs of numbers and sum the results. Each MAC block forms partial products by multiplying the coefficient (Y) by 1 bit of the data (A) at a time, using an AND operation. As shading suggests, you need a 1-bit MAC block for each coefficient bit; a filter with 8-bit coefficients employs eight 1-bit MAC blocks to simultaneously multiply 1 bit of the data (A) with the entire 8-bit coefficient.

 The output of the AND gate feeds an adder and an accumulator. The shift-and-add technique adds partial products into the accumulator that keeps a running total. For each incoming data bit or clock cycle, the technique shifts the accumulator's feedback one bit-position to the right, thereby performing a divide-by-two operation to compensate for the bit-weighting of the ingressing partial product (Figure 5).

  In Figure 4, the four multiplications are performed simultaneously, and the results are then summed when the products are complete. This process requires n clock cycles for a data sample of n bits. Hence, the processing clock rate is equal to the data rate divided by the number of data bits. During each data/clock cycle, the four multipliers simultaneously create four product terms (PA, PB, PC, and PD) that eventually are summed to form the output.

Reordering operations with SDA

  By applying SDA to this multiplication process, you can minimize the FPGA implementation. In fact, the SDA implementation differs from the model in Figure 4 because you add the partial products before, rather than after, the bit-weighted accumulation. You can reorder the operation using SDA (Figure 6). This technique reduces the number of shift-and-add circuits to one but does not change the number of simple adders. The four-to-one reduction of shift-and-add circuits is necessary to fit the design into an FPGA, especially because the width of the coefficients acts as a multiplier in determining the total number of circuits required.

  With careful consideration, you can minimize the FPGA implementation even more. For example, the coefficients in many filtering applications are constants. Consequently, the output of the AND functions and the three adders in Figure 6 depend only on the 4 input bits from the shift registers. Replacing these AND functions and the three adders with a simple look-up table (LUT) gives the final reduced form of an SDA MAC (Figure 7).

  The LUTs in Figure 7 have four 1-bit inputs and store or output 16 1-bit results. As the shading suggests, you need a number of LUTs (or a LUT with depth) to accommodate the width of the coefficients. In fact, you need a minimum of 1 bit of depth per coefficient bit, and in some cases with large coefficients, you may need as many as 2 extra depth bits to allow for word growth from add operations. In other words, the output of the bank of LUTs can be as wide as 10 bits.

  As depicted, the circuit needs no more than two additional LUTs, because the circuit sums no more than four coefficients. More coefficients require more LUTs for word growth. In general, the number of additional LUTs used should be at least Log2 (number of coefficients). You can use less depth only when the specific coefficients are known and the LUT has been computed.

  Although deep LUTs can be more efficiently implemented in an FPGA than in deep AND/adder banks, you still must guard against the implementation's growing too large. As the number of coefficients increases, the size of the LUT grows exponentially. You can avoid a large LUT by partitioning the circuit into smaller groups and combining the LUT outputs with adders. The adders require fewer gates than the larger LUT.

Partitioning LUTs

  In the 16-tap FIR filter, the circuit requires a 16-input LUT bank that produces 16 products. If the FPGA architecture uses four input-function generators, however, the optimum partition is four products per LUT. You can partition the 16-MAC product and sum (Figure 8).

  Rather than enlarge the LUT by a growth factor of 16, use four LUTs the same size as the one in Figure 7. The additional three adders, each occupying approximately the same space as one 4-bit LUT, result in a growth factor of six to seven.

  The only remaining task in the FIR-filter design is efficiently implement the way you handle the input data stream. The MAC circuits in Figures 4, 6, and 7 use a series of parallel-load shift registers, although typical FIR filters use one long, bitwide shift register with taps at each word boundary.

  With the exception of the first parallel-in and serial-out shift register, you can construct the shift registers using traditional registers or by using a synchronous RAM-based shift register. This approach is efficient because the synchronous RAM needs to produce only 1 bit of data each sample (tap) per clock cycle. Such a synchronous-RAM-based shift register offers 16 times the density of a register-based shift register.

Symmetrical filters

  If the filter is symmetrical, it is possible to further reduce the resources required for the design. In a symmetrical filter (Figure 1) two data taps always share the same coefficient. Therefore, you can reduce the number of LUTs (or MACs) by a factor of two by adding the input data with a common coefficient before multiplication.

  A filter design implemented in an FPGA using SDA gives a significant amount of performance and uses only a modest number of CLBs. The SDA-FIR design supports an on-off chip I/O-data rate at more than 8M samples/sec (13.7 nsec3(8 bits+1)/16 taps=7.71 nsec/tap) and occupies 68 CLBs in a Xilinx XC4000E-3 FPGA. Note that increasing the number of MAC blocks or taps has no significant impact on the sample rate when you use SDA. The increase would affect only the number of CLBs required to implement the filter.

  Compare the same 16-tap FIR-filter algorithm implemented in various state-of-the-art, fixed-point DSPs with that of the Xilinx FPGA. A fixed-point DSP rated at 50 MIPS requires at least 20 nsec/tap to implement a 16-tap FIR filter, or a maximum sample rate of 3.125M samples/sec. Moreover, the performance of general-purpose, DSP-based filter devices degrades significantly with each additional MAC or tap.

  You can use other distributed-arithmetic techniques either to further minimize the number of CLBs required to implement the filter or to further boost the data rate through the filter. For example, you can use a serial-sequential-distributed-arithmetic (SSDA) technique at lower data rates to conserve more CLBs. The SSDA process is performed bit-serially and word-sequentially, whereas SDA is performed bit-serially and word-parallel. SSDA requires an internal, synchronous dual-ported RAM or FIFO to time-multiplex each of the data samples, bit serially, through the logic. The performance for SSDA implementations generally equals the maximum data rate for a given number of data bits in an SDA filter, divided by the number of taps. This technique is typically useful for data sample rates below 1 MHz.

  You can also use parallel distributed arithmetic (PDA) to increase the overall performance of SDA. With PDA, the number of bits processed during each clock cycle increases. Increasing the number of bits sampled has a significant effect on the number of CLBs you use for the design. Therefore, the number of parallel bits sampled should be increased only to meet the required performance.

  Increasing the number of bits processed from a 1-bit (in the case of SDA) to a 2-bit PDA results in half the number of processing clock cycles. Hence, 2-bit PDA results in twice the throughput. Using a 2-bit PDA, you can replace the serial-shift registers with two similar 1-bit shift registers at half the bit depth. The two parallel-shift registers are spilt such that one stores the even bits, and the other stores the odd bits. The 2-bit parallel data samples require twice the number of LUTs. You also need to add an adder that adds the two partial sums resulting from each of the two parallel sample bits. You expand the scaling accumulator's input bus to accommodate the larger partial sum, and you change the final scaling accumulator from a 1- to a 2-bit shift (divide by 4) for scaling. These changes double the resources required, compared to the SDA design.

  You can also implement this example using more than 2 bits, scaling all the way to a fully parallel PDA algorithm. In other words, the number of bits processed during each clock cycle can increase until you reach an n-bit PDA implementation, for n-bit data samples. Such a design can process more than 55M samples/sec (17.9 nsec/16 taps=1.12 nsec/tap) and occupies 400 CLBs in an XC4000E-3.


References

  1. Viterbi, AJ, and Omura, JK, "Principles of digital communication and coding," McGraw-Hill, New York, 1965.
  2. Goslin, Gregory R, "Using Xilinx FPGAs to design custom digital-signal-processing devices," DSPX 1995 Technical Proceedings, Jan 12, 1995, pg 595.
  3. Goslin, Gregory R, and Newgard, B, "16-tap, 8-bit FIR filter application guide" Xilinx Publications, Nov 21, 1994.

Author's biography
Greg Goslin is a DSP manager for Xilinx Corp (San Jose, CA), where he develops DSP applications with FPGAs. He holds a BS in physics from the University of Washington (Seattle) and a BSEE from the University of the Pacific (Stockton, CA). In his spare time, he enjoys kayaking, mountaineering, sailing, mountain biking, and designing and building new things.

| EDN Access | feedback | subscribe to EDN! |
| Table of Contents |


Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.