Feature
Architectural-design considerations for implementing hardware acceleration
Exploiting hybrid software/hardware parallelism in algorithm-hardware accelerations can yield significant performance gains over a function-replacement approach.
By Ian Ferguson, QuickLogic -- EDN, 9/29/2005
Across a range of embedded-system applications, the combination of data-processing and system-throughput requirements is increasing to the point at which implementing algorithms purely in software on a single high-powered CPU exposes two challenges. First, system power and cost are forced upward. Besides the obvious battery-life issues for mobile platforms, rising power dissipation increases the need for heat sinks and supplemental cooling. Second, adding value-added functions to a system is difficult when the baseline system functions already fully occupy the CPU's processing capacity, especially when a designer cannot implement the new functions without including additional components.
What options are available? For the purposes of this article, the choices break down into three areas. First, customizing the CPU's instruction set for the application can markedly improve algorithm-processing efficiency. The usability of the development tools for harnessing such cores has improved significantly over the past few years. However, this strategy can bind a designer to a specific implementation, which over time can cause legacy-software issues.
If a designer can segment an application into well-defined, somewhat-independent tasks, then using multiple CPU cores can improve the algorithm-processing efficiency. Unless the application design has the volumes to justify an ASIC development, the only viable options for this approach are using proprietary CPU cores in FPGAs or finding an ASSP (application-specific standard product) that exactly meets the desired system requirements.
Another option, which is the focus of this article, is for the designer to migrate performance-hungry elements of the algorithm into hardware; the implementation could be as an ASIC, a structured ASIC, an ASSP, or an FPGA. To illustrate the process and the challenges associated with this approach, this article draws from a QuickLogic (www.quicklogic.com) project that involved constructing an MPEG-2 decoder as a hybrid combination of hardware and software modules on a programmable SOC (system on chip). Although the results are specific to the MPEG-2 decoder, the process and the system-design considerations apply to a broad set of embedded-system applications.
The first step is to identify the elements of the algorithm that are suitable for acceleration. To accomplish this task, it is necessary to understand where in the code base the microprocessor spends most of its time. You do so by profiling the code. Although inserting profiling code is intrusive and slows the application, it will, as a first approximation, locate the main hot spots in the code. Indeed, newer Linux kernels are starting to build in profiling support that will significantly reduce this problem.
The project team used the GNU C compiler (gcc) to build special statistics-measuring code into the binary executables; gcc's -pg option enables this instrumentation. A program built with profiling support generates a gmon.out file that contains the statistics it gathers while the program is running. The gprof utility, which interprets the gmon.out file, generates text reports that indicate how many times each function executed and the overall time spent in each function.
Grouping the individual functions for MPEG-2 decoding according to high-level function produces a histogram that summarizes the percentage of execution time by function and shows their aggregated percentage of the complete algorithm (Figure 1). For the MPEG-2-decoding algorithm, the top four elements account for more than 99% of execution time.
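The grouping step can be sketched in a few lines of Python. This is only an illustration, assuming the per-function times have already been extracted from a gprof flat profile; the function names, the mapping to decoder stages, and every timing value are hypothetical and are not the project's measurements.

```python
from collections import defaultdict

# Hypothetical per-function self times in seconds (NOT measured data).
flat_profile = {
    "idct_row": 3.0,
    "idct_col": 3.0,
    "form_prediction": 2.0,
    "decode_vlc": 1.0,
    "get_bits": 0.5,
    "parse_headers": 0.5,
}

# Hypothetical mapping from low-level functions to high-level decoder stages.
groups = {
    "idct_row": "iDCT",
    "idct_col": "iDCT",
    "form_prediction": "motion compensation",
    "decode_vlc": "VLD/iQ/iZZ",
    "get_bits": "parse",
    "parse_headers": "parse",
}


def group_percentages(profile, mapping):
    """Sum function times per stage and return (stage, percent), largest first."""
    totals = defaultdict(float)
    for func, seconds in profile.items():
        totals[mapping.get(func, "other")] += seconds
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(stage, 100.0 * t / grand_total) for stage, t in ranked]


histogram = group_percentages(flat_profile, groups)
for stage, percent in histogram:
    print(f"{stage:20s} {percent:5.1f}%")
```

Ranking the stages by aggregate share, rather than ranking individual functions, is what makes the candidates for acceleration stand out.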
Hardware implementation
On its own, a hardware-accelerator block simply takes data in, processes it, and then outputs the result. A particular system environment requires interfaces for the data to feed the accelerator. This module therefore needs to include an interface to the on-chip or external bus where the CPU is located and a DMA engine to feed data into and out from the accelerator block.
Replacing a software function (such as the iDCT shown in Figure 1) that takes 61% of a microprocessor's time with hardware does not translate into a 2.5-times performance improvement. The shortfall stems primarily from the communication overhead between the CPU core and the hardware-accelerator block, from the efficiency with which data is stored to and retrieved from memory, and from where the data resides in the system.
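The 2.5-times figure appears to correspond to the ideal limit from Amdahl's law: if 61% of the run time disappeared entirely, the speedup would be 1/(1 - 0.61), or about 2.56. The sketch below computes that limit and shows how added overhead erodes it. The accelerator speedup of 20 and the overhead of 10% of the original run time are assumptions chosen only for illustration; they are not figures from the project.

```python
def amdahl_speedup(fraction, block_speedup):
    """Overall speedup when `fraction` of the run time is sped up by `block_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / block_speedup)


def speedup_with_overhead(fraction, block_speedup, overhead_fraction):
    """As above, plus transfer/control overhead expressed as a fraction
    of the original, software-only run time."""
    return 1.0 / ((1.0 - fraction) + fraction / block_speedup + overhead_fraction)


IDCT_FRACTION = 0.61  # share of CPU time quoted in the article

# Ideal limit: the accelerated function takes no time at all.
upper_bound = 1.0 / (1.0 - IDCT_FRACTION)

# Assumed values, for illustration only.
finite_block = amdahl_speedup(IDCT_FRACTION, block_speedup=20.0)
with_overhead = speedup_with_overhead(IDCT_FRACTION, block_speedup=20.0,
                                      overhead_fraction=0.10)

print(f"upper bound:    {upper_bound:.2f}x")
print(f"finite block:   {finite_block:.2f}x")
print(f"with overhead:  {with_overhead:.2f}x")
```

Under these assumptions the achievable speedup falls below the ideal limit even though the accelerator itself is fast, because the overhead term is added to the part of the run time that acceleration cannot shrink.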
Systems employing only software for algorithm implementations typically spend most of their time executing mathematical operations within the core algorithm. When you express those core algorithms in hardware, the CPU acts as more of a controller, coordinating and scheduling the operations of the hardware accelerator. Typically, you accomplish this task by setting up DMA operations, configuring register bits, and placing instructions into command queues.
As an example, assume there are three transforms: T1, T2, and T3. Suppose that T1 and T3 each consume 30% of the CPU time and that T2 consumes only 1%. When T1, T2, and T3 pass significant amounts of data between them, excluding T2 from the acceleration may degrade performance far more than its 1% share of the profile suggests.
Suppose that each transform function reads 1 kbyte of data and then writes back 1 kbyte of data. If T1 and T3 are hardware implementations, and T2 is a software implementation, the data flow will result in 6 kbytes of data traffic (Figure 2a). You cannot base your prediction of the resulting performance on the profile data because the exclusion of T2 creates additional system-data movement. As an alternative, you can implement all three transforms in hardware and localize the data transfer between T1/T2 and T2/T3 within the hardware accelerator (Figure 2b); the result is that only 2 kbytes of data traverses the system bus.
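The traffic figures in this example can be reproduced with a small counting model. It assumes, as the example does, that every stage moves one 1-kbyte block, that a hand-off between two adjacent hardware stages stays inside the accelerator, and that any hand-off involving a software stage goes out to system memory and back. The function name and the "hw"/"sw" labels are illustrative only.

```python
def bus_traffic_kbytes(stages, block_kbytes=1):
    """Count kbytes crossing the system bus for a chain of transforms.

    `stages` lists each transform as "hw" or "sw" in data-flow order.
    """
    if not stages:
        return 0
    traffic = block_kbytes  # the first stage reads its input from memory
    for current, following in zip(stages, stages[1:]):
        if current == "hw" and following == "hw":
            continue  # data passes directly between hardware blocks
        traffic += 2 * block_kbytes  # write to memory, then read back
    traffic += block_kbytes  # the last stage writes its output to memory
    return traffic


mixed = bus_traffic_kbytes(["hw", "sw", "hw"])   # T2 left in software
all_hw = bus_traffic_kbytes(["hw", "hw", "hw"])  # T1, T2, T3 in hardware
print(mixed, all_hw)
```

The model makes the point of the example explicit: the extra 4 kbytes come entirely from the two hand-offs around T2, not from T2's own processing.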
In the case of the histogram for the MPEG-2 decoder, after the iDCT and motion-compensation functions, the parse function is the next logical candidate for hardware acceleration. However, because of the data flow of the algorithm, accelerating parse without also accelerating VLD/iQ/iZZ (variable-length decode/inverse quantization/inverse zigzag) is likely to create a data-localization problem. Therefore, VLD/iQ/iZZ proves to be a better candidate for acceleration.
Achieving concurrency
The most straightforward method of partitioning software into a hardware/software hybrid is the creation of a hardware block that exactly mimics the operation of a software function call. This process, function replacement, minimizes the impact on the software because there is no need to redesign the software's upper layers. Additionally, the designer creating the hardware implementation needs no substantial or intimate knowledge of the algorithm.
This approach does not, however, deliver the highest level of system performance and power improvement, because the CPU core remains idle while the hardware accelerator is processing, and only one hardware block is active at a time. Additionally, the data flow is less than optimal, because each function block must communicate back to the system memory rather than follow a direct block-to-block communication. This approach therefore is most effective when you apply it to self-contained functions that perform a lot of processing on a small amount of data.
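A rough timing model shows what this serialization costs. The sketch assumes that every data block needs a fixed amount of CPU time followed by a fixed amount of time in each accelerator block, and it compares the serialized flow with an idealized bound in which all of those units work on different data blocks at once with unlimited buffering between them. The per-block times are hypothetical, in arbitrary units.

```python
def function_replacement_time(cpu_time, accel_times, n_blocks):
    """Serialized flow: the CPU waits on each accelerator call,
    and only one hardware block is active at a time."""
    return n_blocks * (cpu_time + sum(accel_times))


def ideal_overlapped_time(cpu_time, accel_times, n_blocks):
    """Idealized bound: CPU and accelerator blocks all busy on different
    data blocks (unlimited buffering assumed). After the first block
    drains through, one block completes per slowest-stage period."""
    if n_blocks == 0:
        return 0
    stages = [cpu_time] + list(accel_times)
    return sum(stages) + (n_blocks - 1) * max(stages)


# Hypothetical per-block times: CPU = 2, accelerator blocks = 5 and 3.
serial = function_replacement_time(2, [5, 3], 10)
overlapped = ideal_overlapped_time(2, [5, 3], 10)
print(serial, overlapped)
```

In the serialized case the total grows with the sum of all stage times, whereas the idealized bound grows only with the slowest stage. The gap between the two is the time that the CPU and the accelerator blocks spend idle.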
Designers can achieve higher system performance by executing elements of the algorithm concurrently. For this project, the team identified and considered three device architectures for concurrency. The first, CPU and hardware-accelerator concurrency, consists of both operating simultaneously. It can occur when the CPU core initiates hardware operations as soon as the data they depend on is in place and performs other tasks while the hardware accelerator is computing. The second uses multiple hardware-accelerator blocks; this form requires the designer to use acceleration blocks that can execute independently of each other.
Finally, the design can use concurrency within a hardware-accelerator block. An algorithm that partitions naturally into a pipeline of operations lends itself to hardware built as a pipeline of stages that operate concurrently. This approach may require intermediate buffering of data, both to absorb differences in the time that individual pipeline stages take to complete and to keep data synchronized for use by subsequent execution units.
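The control flow for CPU and hardware-accelerator concurrency can be sketched in software. In the example below a single worker thread stands in for the accelerator, and the CPU prepares the next block while the previous one is still being processed. The names cpu_prepare and accelerator_transform, and the arithmetic they perform, are placeholders rather than parts of the MPEG-2 decoder; on real hardware the submit step would be a DMA setup or a command-queue write instead of a thread call.

```python
from concurrent.futures import ThreadPoolExecutor


def cpu_prepare(raw):
    # Placeholder for software work on the CPU (for example, parsing).
    return [x + 1 for x in raw]


def accelerator_transform(block):
    # Placeholder for the work done by a hardware-accelerator block.
    return [x * x for x in block]


def decode(raw_blocks):
    """Overlap CPU preparation of block n with 'accelerator' work on block n-1."""
    results = []
    # One worker models one accelerator that handles one block at a time.
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        pending = None
        for raw in raw_blocks:
            block = cpu_prepare(raw)  # runs while `pending` is in flight
            if pending is not None:
                results.append(pending.result())  # collect block n-1
            # Start the accelerator as soon as its input data is ready.
            pending = accelerator.submit(accelerator_transform, block)
        if pending is not None:
            results.append(pending.result())
    return results


output = decode([[1, 2], [3], [4, 5, 6]])
print(output)
```

Python threads do not deliver true parallel speedup for this kind of computation, so the sketch shows only the ordering of operations: the CPU never waits for a result until it has finished preparing the next input.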
A challenge to implementing concurrency is the additional engineering effort and knowledge of the algorithm that the designer must have. A number of industry-standard tools offer various levels of automation for function replacement, including Mentor's (www.mentor.com) Seamless profiling tool and ASAP hardware-conversion products, as well as offerings from Coware (www.coware.com), Celoxica (www.celoxica.com), and Critical Blue (www.criticalblue.com). If the function-replacement strategy delivers the necessary performance and power, these tools significantly reduce the effort required to deliver a hybrid implementation. Mentor's tools, for example, assist in the development of the hardware wrapper required for the accelerator and rewrite the software driver to use the hardware accelerator. However, these tools struggle with the development of concurrent implementations.
In these cases, developing an optimized implementation requires significant investment from individuals who understand both the device architecture and the intricacies of the algorithm. This additional effort is justifiable only if the function-replacement approach cannot meet the required performance or power improvement.
The QuickLogic MPEG-algorithm project achieved a 100% system-performance improvement over the function-replacement approach by investing the extra effort to create an application-specific hardware module (Figure 3). Overall, the hardware/software implementation provided more than 10 times the performance of a software-only implementation. Additionally, for an implementation with no net increase in power dissipation, system performance increased fivefold. Clearly, though, the additional performance improvement possible depends on the algorithm.
From this work, our key conclusion is that the magnitude of the performance gain that a designer can achieve using a hybrid implementation depends on two primary elements. First, the designer needs to understand the algorithm's characteristics well enough to migrate software to hardware without introducing time-consuming data transfers between modules. Second, for those willing to devote the time to exploit or introduce parallelism inside an algorithm, significant performance gains are possible over a function-replacement approach. Because off-the-shelf tools offer limited help with this more involved task, achieving concurrent processing in the CPU core and the hardware-acceleration logic requires a significant investment in resources.
Author information
Ian Ferguson manages QuickLogic's division of Embedded Standard Products devices, including QuickMIPS. The division focuses on defining next-generation devices in partnerships with complementary-silicon suppliers, such as Intel, Renesas, and Atheros. Ferguson has a bachelor's degree in electrical and electronics engineering from Loughborough University (UK). You can reach him at iferguson@quicklogic.com.















