EDN Access

 

August 15, 1997


Integers out of sorts?
Program an FPGA to put them in order

Cory Peichel, Raytheon E Systems

Designing integer-sort algorithms into FPGAs isn't as difficult as you might think. The payoff can be huge: short design time, flexibility, and fast operation.

Engineers have traditionally implemented digital-signal-processing algorithms in dedicated DSPs, on conventional hardware, and even on general-purpose processors. With the proliferation of 100,000-gate FPGAs and the emergence of low-cost, powerful simulation and synthesis tools, implementing DSP algorithms in reconfigurable hardware is becoming a useful alternative. Depending upon the nature of the algorithm, programmable logic can implement an entire design or can function as a DSP coprocessor. Each approach offers advantages and disadvantages in design time, performance, cost, and reusability. Although hardware implementations provide the best performance, they can take a long time to develop. A software approach can get your product to market quickly and can let you reuse much of the code in subsequent products, but performance can suffer. An FPGA implementation often combines the speed of conventional hardware with the flexibility of software.

Generating page after page of schematics is becoming too costly and time-consuming for designers of large FPGAs. Therefore, these designers are turning increasingly to HDLs, such as Verilog and VHDL. HDLs describe the behavior and structure of system and circuit designs better than schematics do, although you may still have to instantiate vendor-specific modules as black boxes. Still, you should find gluing third-party intellectual property into an evolving HDL-based design easier than using a schematic editor. Even though it is more difficult than writing software for conventional processors, using RTL subsets of HDL code to design FPGAs is faster and easier than capturing schematics. Still, RTL code is tightly coupled with its target architecture, so you may need to spend some time understanding the device and the algorithm to ensure efficient resource use.

FPGAs provide an excellent medium in which to implement certain DSP algorithms. FPGAs provide a nice mix of density and speed, thanks to 100,000-gate chips and 2-nsec propagation delays through logic primitives. Also, as delays decrease and densities increase, an algorithm such as a basic numerical sort can grow with an application.

Sort algorithms accept an array of integer data, compare the integers, and then arrange the numbers in ascending or descending order. You can use the algorithm to determine the median, minimum, maximum, and any fractional information, such as the upper or lower quartile. You can use this information for statistical estimation, detection, and classification in nonlinear signal-processing applications, including speech analysis and image enhancement.

Computer-science classes have traditionally taught basic sort algorithms. You can convert many software methods for sorting numerical data into FPGA designs. The various sort algorithms are fairly easy to convert to VHDL, but simple translations do not take advantage of FPGAs' flexibility. Access to every bit in the array allows for parallelism, pipelines, and bit slicing. These techniques can dramatically speed an algorithm.

Sort algorithms

You can sort an array of integers by several methods. These methods vary in performance and complexity from the simple-but-slow bubble-sort algorithm to the fast and complicated Quicksort algorithm. Methods of straight insertion, such as Shell's method, perform better than the bubble sort but not as well as Quicksort.

In the bubble-sort technique, you compare and, possibly, interchange pairs of adjacent values in an array. If you sort the array in ascending order and the second value is lower than the first, you switch the values. This causes the lower values to bubble up toward the top. Bubble sort is the most inefficient sort algorithm and has little practical value on general-purpose computers.

To perform straight insertion, you compare each arriving array value with previously placed elements until you find the new element's position. Once you find the new element's place, the value that had occupied that spot and the values in the following positions move down. Then, the insertion occurs. Shell's method is a variant of straight insertion that you can appreciate from the following example. To sort 16 numbers, n1 through n16, first use the following pattern to sort eight groups of two: sort n1 and n9, n2 and n10...n8 and n16. Next, sort four groups of four and two groups of eight using maximum equal distances between sorted values. Finally, sort the array using the insertion method.

The Quicksort algorithm uses a divide-and-conquer approach. To sort an array, you choose an arbitrary value (the pivot) from the array. Index from the left of the array until you find an element that is greater than or equal to the pivot and index from the right of the array until you find an element that is less than the pivot. Exchange the two values. Continue indexing from both ends and swapping the values until the pointers cross. Place the pivot at this point. After you place the pivot, everything to the left is less than the pivot, and everything to the right is greater than or equal to the pivot. The array is now partitioned in two. Use Quicksort recursively to sort the two partitions and all subsequent partitions until the entire array is sorted. Quicksort's average runtime is very fast, especially for large arrays. In the worst case, though, completing a sort still takes a number of processing cycles equal to the number of array elements squared (N2).

FPGA architectures

Before attempting to implement a sort algorithm in FPGAs, you need to understand the architecture supporting the implementation. Unlike ASICs, FPGA primitives are not simply NAND or NOR gates. Each FPGA family breaks down into basic logic units that contain look-up tables (LUTs), flip-flops, and carry and routing resources.

In an FPGA, the most basic combinatorial logic primitive is the LUT. Its size varies from vendor to vendor. The equations that you can design into these LUTs can contain two to four inputs and produce only one output. Some vendors, such as Altera (San Jose, CA) and Actel (Sunnyvale, CA), also provide relatively large blocks of distributed SRAM, whereas others, such as Xilinx (San Jose, CA), provide code-generated macros to produce smaller banks of SRAM.

With this basic understanding of FPGA architecture, your designs can target the logic on your chosen device. For instance, one-hot encoding is more efficient than other techniques for implementing large state machines. One-hot encoding creates state machines that have one flip-flop per state and only a small amount of combinatorial logic. These requirements suit FPGA architectures that have large numbers of flip-flops and narrow, low fan-in function generators.

FPGA implementations

Once you understand the FPGA's underlying architecture, you can code your design using VHDL or Verilog. You need to be aware of how the code will synthesize and try to write it as efficiently as possible. You should keep your function inputs from extending beyond the width of the LUT, use one-hot encoding, and use parentheses to minimize the number of logic levels. If your design uses a statement such as "X<=A+B+C+D," the function may synthesize to three logic levels because some tools implement it with three cascaded adders. By inserting parentheses to change the statement to "X<=(A+B)+(C+D)," you force the tool to perform two additions in parallel and then add the results. Using parentheses yields a faster design that requires only two logic levels.

After you code the design, you can use synthesis and place-and-route tools to optimize it. You load your VHDL or Verilog design, select the target architecture, choose device options, and compile. The tool synthesizes, maps the part, and provides a file ready for placement and routing.

Many synthesis packages allow you to set timing constraints or make trade-offs between area and delay. To set your optimization goals, you may have to build a constraints file, set up attributes within the VHDL source file, or enter parameters directly from the graphical user interface. These constraints suggest to the synthesis tool which logic models it should choose. If, for example, you place a small clock-to-valid timing constraint on a counter, the synthesis tool chooses the fast counter circuit. If you put a low-gate-count constraint on the counter, the tool chooses the smaller model. You can also optimize your design with vendor-specific place-and-route tools. You provide constraints that force the tools to place the logic efficiently and route critical paths using short lines. You can also manually lay out your design to ensure coherent placement and reduce route-propagation delays.

Keeping in mind the techniques for optimizing FPGA designs, you may attempt to implement the fastest known software algorithm--Quicksort. To implement a complex algorithm such as Quicksort in an FPGA, you would have to store the array, compare and swap array values, and manage and store the pivot and indexing pointers for every partition the recursive algorithm creates. Storing the array and comparing values within it are not difficult, but managing and storing the pointers becomes fairly complex. Also, if you were to complete this design, you still could not guarantee anything better than N2 performance.

Do not become discouraged because implementing the fastest software algorithm in an FPGA is difficult. Sorting an array in software is excessively complex because of the limitations of the hardware in which the array resides. When developing a sort array in an FPGA, you have the option of performing comparisons in a predetermined order, of performing multiple comparisons at one time, and of accessing individual bits within the integers in the array. These options give you flexibility, which allows you to develop very fast algorithms.

The bubble sort

An FPGA can implement a bubble sort in much the same way as a general-purpose processor does. First, an internal or external bank of SRAM stores the array. If you use internal RAM, you must choose a device that has embedded memory or that allows you to build memory within the FPGA. If you use external RAM, you can choose any FPGA or CPLD that is large and fast enough to meet your requirements. For the external-memory implementation, you design the control and comparison logic into the programmable device, and you store the array in external RAM.

14MS2271The bubble-sort routine is fairly simple to implement in hardware (Figure 1). The control logic causes a multiplexer to bring the array into SRAM. After the data is resident, the control logic uses a counter to place the first two samples in the two D-type registers and address the registers. If the first value is greater than the second, the logic writes the values back into the RAM in reverse order. If the second value is less than the first, the control logic moves the second value from REGISTER2 to REGISTER1 and loads the next value from RAM into REGISTER2. The logic processes each remaining pair and then starts again at the beginning of the array. If the logic compares each pair of values in the array without swapping any of the values in RAM, processing ends.

This method is not the most efficient way to sort an array of integers. If the array arrives with the lowest value at the bottom of the list, the number of processing cycles to complete the sort equals the number of samples in the array squared (N2). The algorithm's average performance is not much better. This implementation does efficiently use the resources in each FPGA family, though. The two registers, two multiplexers, and the counter provide an even mix of logic and storage, which reflects the FPGA's similar mix. Also, if the array is small enough, you can internally instantiate the RAM.

An advantage of FPGAs is that they allow you to perform operations in parallel. You can divide the array into integer pairs and perform comparisons on each couple. If some of the pairs are out-of-order, you can immediately swap them. On the next cycle, the algorithm compares and swaps a new set of pairs. The algorithm chooses the pair sets so that the smaller values have a chance to traverse the entire array. The first set of pairs can be n1 and n2, n3 and n4...nLAST­1 and nLAST). The second set can be n2 and n3, n4 and n5...nLAST­2 and nLAST­1.

14MS2272Figure 2's implementation of multiple comparisons brings in the values through the first multiplexer and feeds them down the register pipeline. There are as many registers as there are values in the array. Figure 2 shows comparisons that occur in the first series on odd-numbered cycles and those that occur in the second series on even-numbered cycles. The control logic handles the comparisons and any resulting register enables or multiplexer selections.

The logic in this design compri ses little more than a series of multiplexers, registers, and comparators. Using a schematic editor, you would have to cut and paste these symbols and draw interconnects. Using VHDL, you can create entities for the multiplexer and register and then use a "generate" statement to expand the number of multiplexers and registers to the number of values in the array. By simply changing a constant in a header file you can increase the size of the array that the design can process. Listing 1 shows the MUX_N_REG entity as well as the main SORT entity that generates multiple multiplexers and registers.

The code generates one fewer MUX_N_REG entity than the size of the array because the last register requires no multiplexer. Separate statements handle the irregularities that are typical at the edges of regular structures.

This implementation is efficient but practical only for small arrays. If the array has an even number of elements that arrive in the worst possible order, completing the sort takes a number of processing cycles equal to the number of samples in the array (N) (Table 1). A sort often takes fewer cycles.

The problem with this method is the heavy use of registers and routing resources. The required number of flip-flops equals the number of samples times the sample width. An array of any significant length quickly claims the registers but does not use the function logic.

Sorting in dual RAM banks

Another advantage of FPGAs is that they allow you to perform bit-serial operations. You can perform comparisons on a single bit and determine if the bit meets the sort criteria. With this information, you can pull out all of the integers that have a 0 in the least significant bit (LSB) and then pull out all of the integers that contain a 1 in that position. You can continue this process from the LSB to the most-significant bit (MSB), storing data based on the sort criteria.

14MS2273In Figure 3, the array values enter one of two banks of SRAM. A state machine sequentially reads all values from this bank. Values whose LSB is 0 go into the second bank. The state machine then again sequentially reads the data from the first bank. This time, the data goes to the second bank if the LSB is 1. In the second bank, the new data follows the values in which the LSB is 0. Now, the state machine reads the values from the second bank. If the LSB+1 is 0, the values go to the first bank. Next, the state machine reads the second bank values for a second time. 14MS2274If LSB+1 is 1, the values go to the first bank, where they follow the values in which the LSB+1=0. Reading and writing data from RAM to RAM continues until processing of the MSB is complete. If the integer width is even, the sorted array is in the first bank of RAM. If the width is odd, the sorted array is in the second bank. Figure 4 shows the state machine that implements this algorithm.

If you create more complex logic to sort the array as it enters the first bank of RAM, you can save one instance of moving the data from one bank to the other. If the LSB is a 0, you can write the data starting at the lowest address and work toward the highest. If the LSB is a 1, you can write the data starting at the highest address and work toward the lowest. The increase in the complexity and the required amount of extra logic does not justify this modification, however.

The processing time of the dual-RAM-bank sort is moderately fast compared with the multiple-register sort and is very fast compared with the bubble sort. The resource use is more efficient than the multiple-register sort's and is moderately more efficient than the bubble sort's. An advantage of the bit-serial implementation is that the processing time is always the same. The order in which the data arrives in the array does not affect on the performance of the algorithm. As a result, performance calculations are simple because there is no best or worst order. The algorithm completes the sort in a number of processing cycles equal to two times the integer width times the array length (2·W·L). Also, you can use external RAM to expand this implementation to handle very large arrays.

Code, test, rewrite, optimize...

With sort algorithms, as with any design, you decide on a set of requirements that determine the speed, cost, and schedule. Sometimes, meeting the requirements is easy. Sometimes, you need a great deal of creativity. FPGAs let you code and test models relatively quickly. If the design doesn't quite meet the specification, you rewrite parts of your design or optimize portions with the synthesis and place-and-route tools. If the performance is dismal, you can seek a new approach.

The three sort algorithms de-scribed provide good examples of designs that satisfy different sets of requirements. If you must entirely encapsulate the design in a fairly small device and performance is not an issue, consider the bubble sort. If you need to sort a small array extremely quickly, choose the parallel-register algorithm. If you need the same processing speed on every array, or if you simply want a good mix of speed, device-resource usage, and array depth, the bit-serial approach may be for you.


References

  1. Perry, Douglas L, VHDL, Second Edition, McGraw-Hill, New York, 1993.

  2. Press, William H, SA Teukolsky, WT Vetterling, and BP Flannery, Numerical Recipes in Fortran: The Art of Scientific Computing, The Press Syndicate of the University of Cambridge, New York, 1992.


Listing 1-VHDL code for MUX AND REG and MAIN entities
14MS227L
Table 1--Multiple-register sort
Array Cycles
1 2 3 4 5 6 7 8 9 10
10 4 4 4 4 4 4 3 3 1 1
4 10 5 5 5 5 3 4 1 3 2
8 5 10 6 6 3 5 1 4 2 3
5 8 6 10 3 6 1 5 2 4 4
7 6 8 3 10 1 6 2 5 5 5
6 7 3 8 1 10 2 6 6 6 6
9 3 7 8 1 2 10 7 7 7 7
3 9 1 7 2 8 7 10 8 8 8
2 1 9 2 7 7 8 8 10 9 9
1 2 2 9 9 9 9 9 9 10 10

14M227CP

Author's biography

Cory Peichel is an electronic-systems engineer responsible for DSP hardware and software development at Raytheon E Systems in Garland, TX. He holds a BSEE from the South Dakota School of Mines and Technology, (Rapid City, SD) and has worked at E Systems for five years. His hobbies are skiing, reading, basketball, and camping.


| EDN Access | Feedback | Table of Contents |


Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.