Feature
FROM EDN EUROPE: FPGA-based configurable computing delivers on its promises
In real-world applications, FPGAs are starting to demonstrate immense tangible advantages over traditional high-performance computing solutions - advantages that include increases in processing performance and system flexibility, together with reduced size, weight and power consumption.
By Malachy Devlin, Nallatech -- EDN Europe, 7/7/2005
Proponents of using FPGAs in high-performance embedded applications have for some time considered them to be the technology of the future. Sceptics have a standard jibe in response. Such claims have often been made, they say, and never delivered on: "It's the technology of tomorrow - always has been, always will be." But recent developments have focused this debate firmly in the present.
The key factor behind this breakthrough is simply the growth in capacity of FPGAs. In 1993, devices with 256 logic slices and 128 usable I/Os represented the leading edge of field programmable chips. Five years later, FPGAs offered not only 12000 logic slices and 512 I/Os, but also carried 128 kbits of dedicated RAM blocks. Today's largest programmable device has 89000 logic slices, 960 user I/Os, two orders of magnitude more block memory capacity, and a host of additional features. This makes a substantial impact on the way FPGAs can be used.
Early FPGAs were limited to bit operations and small integer arithmetic. They proved effective for 1-D pipelined data paths and for repetitive and data-intensive processing of basic functions. Next-generation devices with block RAM extended capability into the 2-D realm and proved successful in niche applications such as image processing and digital signal processing. Today's chips offer massively increased potential due to their ability to host complex floating-point, 32-bit and 64-bit processing.
On-chip microprocessors
Where early FPGA developers were limited to state machines for control and simple programmable logic computing functions, today's FPGA applications can host multiple microprocessors on-chip. They can run software on hard-coded processor blocks, or alternatively implement hundreds of soft custom μPs on a single chip. This adds an enormous degree of flexibility and offers risk reduction for the system engineer. Previously, dedicated processors were placed on PCBs. If these "ran out of steam", the fix involved adding or upgrading physical hardware. Once a design is committed to production, or deployed in the field, this can be expensive. With programmable logic, the designer can add processors where they are most advantageous, using the appropriate size of processor. For example, it may be better to deploy four 8-bit μPs rather than a single multitasking 32-bit μP.
Embedded communications adds a whole extra dimension to the FPGA's capability. Real-time high-performance systems succeed or fail by striking the balance between their I/O structure and their computational capability. This balance serves to ensure that the architecture can move data efficiently at high speed while at the same time having a suitable computational engine that can process that data, at just the speed required. Using FPGAs to eliminate complex interfaces, bridges or protocols to access external I/O can have a huge impact on system performance. Today's FPGAs enable the creation of direct interfaces to external devices, ensuring the streamlined transfer of data. Moreover, FPGAs today can handle and process multiple data streams that each exceed 1GB/sec.
Now that FPGAs have achieved the crossover with conventional processor performance, the gap is set to widen. CPU architectures are inherently architected in one dimension: they are serial devices with little true parallelism. Apparent parallel execution of tasks is usually achieved only with the complexity of multi-threaded operation. FPGAs are architected in two dimensions: structures can be replicated to achieve true parallel operation. In this sense, FPGA technology provides a better foundation for continued performance growth.
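The effect of replicating a structure can be illustrated with a toy cycle-count model. The sketch below is illustrative only: the function names, the workload size and the figure of 64 replicated units are assumptions chosen for the example, not measurements from any device, and the model ignores clock-rate differences and I/O limits.

```python
import math


def serial_cycles(n_items: int, cycles_per_item: int) -> int:
    """Cycles needed when a single unit processes the items one after another."""
    return n_items * cycles_per_item


def replicated_cycles(n_items: int, cycles_per_item: int, n_units: int) -> int:
    """Cycles needed when the same structure is replicated n_units times
    and every copy works through its own share of the items in parallel."""
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    items_per_unit = math.ceil(n_items / n_units)
    return items_per_unit * cycles_per_item


# Hypothetical workload: all three values below are assumptions.
N_ITEMS = 1_000_000
CYCLES_PER_ITEM = 4
N_UNITS = 64

serial = serial_cycles(N_ITEMS, CYCLES_PER_ITEM)
parallel = replicated_cycles(N_ITEMS, CYCLES_PER_ITEM, N_UNITS)
print(f"serial: {serial} cycles, replicated: {parallel} cycles, "
      f"ratio: {serial / parallel:.1f}x")
```

In this idealised model the cycle count falls in proportion to the number of replicated units; a real design would also have to feed every unit with data quickly enough to keep it busy.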
One consequence of this square-law growth in performance is that demand for high-level tools has emerged. The traditional FPGA system development route was via basic VHDL compilers; today the range of options is substantially larger, including HDL, C and DSP packages such as MATLAB.
To accelerate adoption of FPGA-based systems, continuing improvement in tool flows and standardisation is critical. This is progressing, leading to greater interoperability between FPGA-based systems and conventional processor-based systems (on platforms such as that shown in Figure 1) as well as between devices in multi-FPGA systems. In the FPGA hardware and software architecture that Nallatech developed, scalable multi-FPGA systems are managed and controlled by system software (called FUSE) which includes a range of APIs (application programming interfaces) that are accessed from standard languages. DIMEtalk software facilitates communication between FPGAs and between processes within FPGAs; this software also performs the necessary FPGA hardware abstraction and provides standard interfaces for connectivity.
Together, DIMEtalk and FUSE represent something analogous to an FPGA operating system, but there are significant differences from traditional processor OSs. The most notable differentiator is that the DIMEtalk/FUSE combination adds reconfigurability and provides a structure which is very scalable.
In a similar way to versions of Linux in which the OS build can be tailored with configurable kernels, designers can include in their compilation only the elements required within DIMEtalk for their specific application. This approach wastes less silicon, and overheads are lower than is possible using conventional approaches to porting computational systems to FPGA. The outcome is a substantial reduction in size, weight and power as well as an increase of the available silicon real-estate for computations. What is more, users can reconfigure the "operating system" as well as the silicon itself—even in the field—to ensure optimal performance.
As with conventional CPU operating systems, the existence of a common platform, independent of its hardware implementation, streamlines the development process. Developers can use not only a range of third-party compilers, but also take advantage of a large resource of builds, libraries and reusable applications in order to shorten the design cycle.
Despite these considerations, the design effort involved should not be underestimated. In many cases, NRE (non-recurring engineering) costs for an FPGA-based system can be five to ten times higher than for the conventional processor-based approach. Development times are still more in keeping with what would be expected when programming a conventional project in assembler rather than a high-level language; and FPGA synthesis times can be hours or even days. Nonetheless, the rewards more than compensate, as some recent applications demonstrate.
High-performance applications
Unmanned airborne vehicles exemplify the leading edge in high-performance embedded computing applications. Their demanding real-time processing requirements, as well as the extraordinary size, weight and power (SWAP) constraints inherent to UAV design, have led developers to look at compact FPGA-based systems in place of conventional processors.
Processing the high-bandwidth, real-time data associated with UAVs would require tens of processors using DSPs or GPPs. Moreover, due to the small size of these vehicles, numerous flight and mission systems must compete for space in the extremely restricted UAV payloads and for the power from the finite on-board electrical power resource. As a result, performance density (GigaFLOPS/cm²) becomes a highly important metric: one where the FPGA has a distinct advantage over traditional processors.
ASIC technology could easily satisfy the UAV's data processing and SWAP requirements. However, the majority of UAV applications in the aerospace and military sector are low-volume designs that use specialised algorithms, many of which may be covered by security restrictions. This most certainly disqualifies ASICs from a commercial perspective; however, FPGA computing fits very neatly into this space. Figure 2 illustrates, qualitatively, some of the trade-offs. In a recent project porting an imaging application from a GPP-based processing platform to one using FPGAs, the new system delivered 36 GigaFLOPS (36 billion floating-point operations per second), approximately 60 times faster than the previous solution. But it is the performance density (the combined value of the SWAP and processing performance) that truly distinguishes this achievement. In the final FPGA-based design, the application ran on a PCI/104 stack under 12 cm square and 20 cm high rather than two 42-in. racks full of processors.
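The figures quoted for this project can be turned into a rough performance-density estimate. The sketch below is a back-of-envelope calculation, not the author's method: it treats the "under 12 cm square and 20 cm high" stack as exactly 12 x 12 x 20 cm (an assumption that makes the densities lower bounds), and the function and constant names are made up for the example.

```python
def performance_density(gflops: float, width_cm: float, depth_cm: float) -> float:
    """GigaFLOPS per square centimetre of footprint."""
    return gflops / (width_cm * depth_cm)


# Figures taken from the article.
FPGA_GFLOPS = 36.0        # performance of the FPGA-based system
SPEEDUP = 60.0            # "approximately 60 times faster"
# Assumption: the stack is taken at its stated maximum dimensions.
STACK_SIDE_CM = 12.0
STACK_HEIGHT_CM = 20.0

# Implied performance of the previous GPP-based platform.
previous_gflops = FPGA_GFLOPS / SPEEDUP

# Density per unit footprint (the GigaFLOPS/cm² metric) and per unit volume.
footprint_density = performance_density(FPGA_GFLOPS, STACK_SIDE_CM, STACK_SIDE_CM)
stack_volume_cm3 = STACK_SIDE_CM * STACK_SIDE_CM * STACK_HEIGHT_CM
volumetric_density = FPGA_GFLOPS / stack_volume_cm3

print(f"previous platform: about {previous_gflops:.1f} GigaFLOPS")
print(f"footprint density: at least {footprint_density:.2f} GigaFLOPS/cm^2")
print(f"volumetric density: at least {volumetric_density:.4f} GigaFLOPS/cm^3")
```

The article gives no full dimensions for the two racks it replaced, so the same calculation cannot be repeated for the GPP-based platform without further assumptions.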
If not yet commonplace, the use of FPGAs is already fully practical for such embedded solutions, where designers are familiar with hardware/software co-design and the implementation of fine-grained parallelism. This is less true in the software-oriented realm of high-performance computing, such as seismic data processing. Here, the quantities of data regularly processed, and the complexity of seismic processing, consistently overrun the capabilities of available processing solutions. Typical approaches using GPP-based cluster computing centres are proving increasingly untenable as power consumption, heat removal and the need to situate these facilities in remote, harsh locations impede further progress.
Here again, ASICs could potentially provide the solution. They deliver the performance levels required but are too expensive and inflexible for this application. Meanwhile, DSPs lack the raw processing efficiency. FPGAs represent the best of both worlds, as porting the Kirchhoff Time Migration algorithm—the computation that takes up the vast majority of CPU cycles consumed by seismic image processing systems—to an FPGA-based platform demonstrates. Besides its inherent complexity, the fact that the algorithm is repeated many billions of times for each dataset makes this application an ideal candidate for FPGA hardware.
Initial results from an FPGA platform implementation show that a single accelerator card fitted in a single workstation can outperform 50 conventionally clustered workstations, while consuming little more than the power of one. Looking ahead, three-card workstations will deliver performance equivalent to nearly 150 workstations, but will have the power consumption of less than two.
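These workstation-equivalent figures imply a large gain in performance per unit of power, which a short calculation makes explicit. The sketch below is a rough estimate only: the function name is hypothetical, the value of 1.2 for "little more than the power of one" workstation is an assumed placeholder, and the three-card case uses the rounded figures of 150 workstations and 2 units of power.

```python
def efficiency_gain(equivalent_workstations: float,
                    power_in_workstations: float) -> float:
    """Performance per unit power, relative to one conventional workstation.

    Both arguments are expressed in 'workstation units', so a plain cluster
    of N workstations always scores 1.0.
    """
    return equivalent_workstations / power_in_workstations


# Single accelerator card: outperforms 50 workstations (from the article).
# Assumption: "little more than the power of one" is taken as 1.2.
single_card_gain = efficiency_gain(50, 1.2)

# Three-card workstation: "nearly 150" workstations at "less than two"
# workstations' power, rounded here to 150 and 2.
three_card_gain = efficiency_gain(150, 2.0)

print(f"single card: roughly {single_card_gain:.0f}x performance per watt")
print(f"three cards: roughly {three_card_gain:.0f}x performance per watt")
```

Whatever exact power figures are substituted, the ratio stays in the tens, which is the point the seismic example makes.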
Extending the benefits
As FPGA capacities continue to grow, benefits like these are likely to extend throughout the high-performance computing arena. Undeniably, developers are becoming more aware of the advantages of FPGAs and are slowly shedding their prejudices. Challenges nevertheless remain. In the absence of industry standards, the lack of true interoperability and hardware independence remains a barrier to widespread take-up. Code development is likely to continue being more time-consuming and complex, at least until the tool flow becomes more mature.
Author Information
Dr Malachy Devlin joined Nallatech in 1996 as CTO, bringing with him his expertise in FPGA technologies. Dr Devlin obtained his PhD in Signal Processing from Strathclyde University. He is a software specialist with several years' experience in various companies including the National Engineering Laboratory, Telia in Sweden and Hughes Microelectronics (now part of Raytheon).