Autovectorization for the GCC compiler
You can’t go out in public these days without seeing someone talking on the phone, listening to music, taking a digital photograph, or even watching a video. With all of these electronic devices available today, engineers are constantly seeking ways to make them smaller and more efficient. One way to accomplish this goal is to compress any data that must be stored or transmitted, but this task requires a powerful processor.
A common practice is to use a DSP to handle data compression and decompression and a general-purpose processor to handle everything else. The source code targeting the general-purpose processor is often written in C or C++, whereas the source code targeting the DSP is often written in assembler. But most OEMs would prefer not to have two or more processors in their products, so many contemporary, general-purpose processors include SIMD (single-instruction-multiple-data) capabilities. General-purpose processors with SIMD capabilities can offer improved performance when executing complex algorithms, such as those in portable-device applications, and can run general-purpose code, such as a Linux OS.
The basic premise behind a vector/SIMD unit is to allow the processor to simultaneously execute a mathematical operation on multiple pieces of data within special vector registers. The benefit is that a single CPU instruction carries out the operation on all of the elements in parallel. To support more programmers writing in C, developers are improving the GCC (GNU Compiler Collection) compiler to more efficiently take advantage of general-purpose processors with these SIMD capabilities.
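A minimal sketch of the kind of loop a vector/SIMD unit accelerates is an element-wise array operation. The function below is ordinary scalar C; the comment about flags reflects standard GCC behavior (-ftree-vectorize, enabled by -O3), though the exact instruction selection depends on the target.

```c
#include <assert.h>

/* Element-wise addition: the same operation applied across the whole
 * array. Compiled with -O2 -ftree-vectorize (or -O3), GCC can map each
 * group of four 32-bit additions onto one 128-bit vector instruction. */
void vec_add(const int *a, const int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

No source change is needed to vectorize this loop; the compiler does the mapping to vector instructions, which is exactly the point of autovectorization.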
Executing industry-standard EEMBC (Embedded Microprocessor Benchmark Consortium) benchmarks on the IBM PowerPC 970FX, compiled with the GCC compiler, illustrates the results of these compiler optimizations. Specifically, out-of-the-box EEMBC TeleBench scores for the 970FX PowerPC processor yielded more than 150% better results than those the consortium published more than a year ago using the same processor and platform and running at the same speed, but with a different compiler. You can attribute the improved scores to autovectorization and FDPR (feedback-directed program restructuring).
Applying autovectorization to GCC
Autovectorization improves the performance of programs compiled with GCC for processors that have vector/SIMD capabilities. You can download GCC from the GNU Web site. For example, the VMX (vector/SIMD-multimedia-extensions) unit of the IBM 970FX processor provides 32 quad-word (128-bit) vector registers that can hold elements of various sizes, such as signed and unsigned bytes, half-words, full words, and single-precision floating-point numbers. The VMX unit is a complex design, so using it incurs penalties. For example, loading the vector registers causes overhead, and restrictions apply, mainly on the alignment of the data read into those registers. Until now, it has been difficult for compilers to automatically recognize when the performance benefits of VMX outweigh this overhead.
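The alignment restriction can be addressed at the source level by placing data on 16-byte boundaries, which lets the compiler use full-width aligned vector loads and stores instead of generating realignment code. The sketch below uses GCC's `aligned` attribute; the array names and sizes are illustrative.

```c
#include <stdint.h>

/* VMX loads and stores operate on 16-byte-aligned addresses, so giving
 * the compiler aligned data removes the need for realignment code.
 * The attribute is a GCC extension. */
#define VEC_ALIGN __attribute__((aligned(16)))

static int16_t src[256] VEC_ALIGN;
static int16_t dst[256] VEC_ALIGN;

/* Scale an array of half-words; with known-aligned global arrays, GCC
 * can emit aligned vector accesses for the loop body. */
void scale_halfwords(int16_t factor, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (int16_t)(src[i] * factor);
}
```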
GCC is a versatile and readily available compiler that finds wide use in server, desktop, and embedded-system settings. Its optimization technology typically lags that of many commercially available compilers. In autovectorization technology, however, GCC is now among the leaders in the compiler industry. IBM engineers contributed much of this new technology while working on the autovectorization branch of GCC. These changes will be part of the official Version 4.3 release.
Autovectorization targets loops in the program that a vector/SIMD unit can parallelize. The first step is to analyze each loop to see whether it qualifies for parallelization; for this analysis, autovectorization uses the recently added loop-analysis features of GCC's tree-SSA (static-single-assignment) framework. If a loop qualifies, GCC converts the body of the innermost loop into a number of parallel operations and reduces the iteration count by the same factor. For example, if GCC can modify the loop to do two simultaneous operations, the modification halves the number of iterations; four simultaneous operations reduce the iteration count by a factor of four.
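The iteration-count reduction can be made concrete by writing out, by hand, what the vectorizer conceptually produces. The second function below is a sketch of the transformed loop for an assumed vector factor of four, including the scalar epilogue the vectorizer generates for leftover elements; the compiler would use vector registers rather than four scalar accumulators.

```c
/* Scalar form: n iterations, one addition each. */
int sum_scalar(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Sketch of the vectorized form for a vector factor of 4: the loop body
 * does four additions per iteration, so the trip count drops to n/4,
 * and a scalar epilogue handles the remaining 0-3 elements. */
int sum_vf4(const int *a, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {     /* n/4 "vector" iterations */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    int s = s0 + s1 + s2 + s3;       /* reduce partial sums */
    for (; i < n; i++)               /* scalar epilogue */
        s += a[i];
    return s;
}
```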
Architectural differences pose the main difficulty with autovectorization. Currently available vector/SIMD units typically use special-purpose instructions and special vector registers; they have many features that a general-purpose compiler would otherwise leave untapped. Recently, developers have significantly improved the autovectorizer of GCC. Some of these improvements include the ability to handle “strided,” nonconsecutive accesses to memory; the addition of an idiom-recognition engine that includes a widening-multiplication idiom; the ability to handle loops operating on multiple-sized data, including type conversion; the addition of enhanced “if” conversion and store sinking to replace conditional branches; and the addition of efficient loop versioning to resolve aliasing.
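The widening-multiplication idiom, for instance, is a loop that multiplies narrow operands into a wider result. The sketch below multiplies half-words into full words; with idiom recognition, GCC can map such a loop onto vector widening-multiply instructions (on VMX, for example, vmulesh/vmulosh) instead of unpacking every element to 32 bits first.

```c
#include <stdint.h>

/* Widening multiply: 16-bit operands, 32-bit products. The casts make
 * the widening explicit so the full product is kept. */
void widen_mul(const int16_t *a, const int16_t *b, int32_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (int32_t)a[i] * (int32_t)b[i];
}
```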
Another new technology that improves the performance of out-of-the-box code is FDPR-Pro. The FDPR-Pro postlink tool, available on the IBM alphaWorks Web site, analyzes the behavior of the compiled program as it runs a typical workload and then creates a new executable program with the code and data layout modified to increase performance. FDPR-Pro takes an executable program as input (Figure 1). This executable works in the same way as any other program, except that building it requires one extra flag telling the linker to retain relocation data; this flag results in a slightly larger file. The additional relocation information does not load into memory when the program is running.
FDPR-Pro first creates a new, instrumented, executable program by inserting additional code into the original program. This additional code is responsible for collecting profile information—typically, basic block- and control-flow edges’ execution counts. Next, using a representative workload, it runs the instrumented program to create the profile data. Finally, it generates a new, optimized executable program, using the original program and the profile data.
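The three-step workflow can be sketched as a shell session. The --emit-relocs linker flag is a standard GNU ld option for retaining relocations; the fdprpro option names and file names shown here are assumptions for illustration only, so consult the tool's documentation on alphaWorks for the exact syntax on your platform.

```shell
# Step 0: link with relocations retained (GNU ld's --emit-relocs, also
# spelled -q); FDPR-Pro needs them to rearrange code and data.
gcc -O3 -ftree-vectorize -Wl,--emit-relocs -o myprog myprog.c

# The fdprpro invocations below are illustrative, not authoritative.
fdprpro -a instr myprog                  # 1: create instrumented binary
./myprog.instr typical_input             # 2: run representative workload
fdprpro -a opt -f myprog.nprof myprog    # 3: produce optimized binary
```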
FDPR-Pro can optimize any program or shared library by reordering the code to reduce cache and TLB (translation-look-aside-buffer) misses, reduce page faults and branch penalties, and improve branch prediction. It also removes unneeded NOP (no-operation) instructions, sets branch-prediction bits, and applies branch folding when beneficial. IBM engineers used FDPR-Pro to achieve an improvement of as much as 67% for single-kernel execution in the out-of-the-box EEMBC benchmark score for the 970FX.
Benefits of GCC
Implementing effective autovectorizing optimization for a target processor is a difficult task. One of the many hurdles is obtaining test cases that represent real programs. At this point, EEMBC telecommunication benchmarks come into play. IBM engineers running the EEMBC TeleBench kernels on the IBM PowerPC 970FX processor obtained data that they then used to improve the autovectorization capabilities of the GCC compiler. EEMBC’s TeleBench suite comprises kernels that include autocorrelation, convolutional-encoder, bit-allocation, inverse-FFT (fast-Fourier-transform), FFT-benchmark, and Viterbi-decoder tests. These benchmarks represent tasks that can benefit from vector/SIMD execution. For example, the Viterbi decoder computes the most probable transmitted sequence from a convolutionally coded sequence. The most computationally intensive part of Viterbi performs a maximization of a likelihood function through a sequence of add-compare-select operations.
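One add-compare-select step can be sketched as follows. This is an illustration of the ACS pattern, not EEMBC's benchmark code; it uses minimization of path metrics (a common convention), and the function and parameter names are invented. The point is that the same operation applies to every state, which is why the kernel maps well onto vector/SIMD execution.

```c
#include <stdint.h>

/* One add-compare-select (ACS) step of a Viterbi decoder: each new path
 * metric is the better (here: smaller) of two candidates, each formed by
 * adding a branch metric to a predecessor state's path metric. */
void acs_step(const int32_t *prev0, const int32_t *bm0,
              const int32_t *prev1, const int32_t *bm1,
              int32_t *next, int n_states)
{
    for (int s = 0; s < n_states; s++) {
        int32_t cand0 = prev0[s] + bm0[s];        /* add */
        int32_t cand1 = prev1[s] + bm1[s];        /* add */
        next[s] = cand0 < cand1 ? cand0 : cand1;  /* compare-select */
    }
}
```

The compare-select line is exactly the kind of conditional that GCC's enhanced "if" conversion turns into branch-free code the vectorizer can then handle.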
EEMBC publishes two types of scores: out of the box and “full fury,” or optimized. The organization obtains out-of-the-box scores by compiling unchanged source code, and it obtains full-fury scores by changing the source code to improve performance while still following the EEMBC rules. In most cases, the changes that engineers make to the code enable the use of vector/SIMD instructions that compilers were unable to generate automatically. The full-fury scores show an average improvement of more than 1500% over out-of-the-box scores for the same processor running at the same speed (Table 1).
You can achieve many of these optimizations by restructuring the code or rewriting it in assembler, but a compiler that recognizes when using vector/SIMD will be beneficial could automatically deliver a significant portion of these gains. To illustrate this point, compare the IBM 970FX optimized score of 1058.7 with the out-of-the-box score of 56.1 in June 2005 using Green Hills Software’s Multi compiler (Table 2). The percentage of improvement with hand-optimizing is on the same order of magnitude as many of the other processors’ optimized improvements. The out-of-the-box score when using GCC with autovectorization is 141.8. This score is 153% better than the previous out-of-the-box score. Internal testing at IBM shows that the compiler from Green Hills Software is better than the GCC compiler without autovectorization. The tests also show that, when comparing results of autovectorization with results compiled using the previous version of GCC, Version 4.1.1, the gain is closer to 190%.
The first column of Table 2 shows the EEMBC score, in iterations per second, for each benchmark compiled with autovectorization but before running FDPR-Pro. The next two columns show the results of running FDPR-Pro on the autovectorized executable and the resulting performance improvement expressed as a percentage. The next two columns compare the results of compiling the benchmarks with and without autovectorization, ignoring gains from FDPR-Pro in both cases.
As you can see from the “autovectorization-gain” column, autovectorization provided more of a boost to some of the benchmarks, particularly the Viterbi benchmarks. For other benchmarks, such as the convolutional encoder, scalar replacement of aggregates, rather than autovectorization, improved scores. Vectorization can further boost the convolutional encoder, as the optimized TeleMark version demonstrates, but it is more difficult to automate.
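Scalar replacement of aggregates is the optimization that breaks a struct into its individual fields so that they can live in registers instead of memory. The sketch below is a toy shift-register loop in the spirit of a convolutional encoder; the struct, field names, and logic are invented for illustration.

```c
#include <stdint.h>

/* A small aggregate passed by value. After scalar replacement of
 * aggregates, GCC treats r.hi and r.lo as independent scalar
 * temporaries, so the loop never touches memory for them. */
struct shift_reg { uint32_t hi, lo; };

/* Shift a 64-bit register right n times, XORing the outgoing bits. */
uint32_t drain(struct shift_reg r, int n)
{
    uint32_t parity = 0;
    for (int i = 0; i < n; i++) {
        parity ^= r.lo & 1u;
        r.lo = (r.lo >> 1) | (r.hi << 31);
        r.hi >>= 1;
    }
    return parity;
}
```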
The benefit of improved compiler optimizations is not higher benchmark scores, but better performance in real products with less time and effort. Today, OEMs are demanding more performance from embedded processors, and processor vendors are responding by adding execution units that can do simultaneous operations in parallel, such as vector/SIMD engines. Adding capabilities to the CPU will do no good unless the software can take advantage of these features. Until now, the best choice for taking advantage of vector/SIMD functions has been to modify generic code to use these special features of the processor. This method has many disadvantages, including the fact that it binds you to a specific implementation. Now, with GCC 4.3 and beyond, you can benefit from vector/SIMD by simply recompiling your code.