Challenges and design decisions for measuring multicore performance
The type of multicore processor you choose and the type of parallelism you apply in your application code will greatly affect the performance you will achieve. To ensure that you meet your goals, closely examine the benchmark scores produced by an industry-standard suite of multicore benchmarks.
By Markus Levy and Shay Gal-On, EEMBC -- EDN, 9/4/2008
As most of the embedded industry rapidly moves to multicore technology, it is becoming increasingly clear that standard benchmarks are needed to measure the performance of multicore architectures, devices, and platforms. Multicore benchmarks have many uses. You might be trying to decide which multicore platform gives you the best performance and energy tradeoffs. Or perhaps, as a multicore processor vendor, you want to see how your competition stacks up. You can also use benchmarks to debug your system design: knowing the approximate performance to expect lets you check whether the various elements of the design are functioning properly.

The challenges of developing standardized multicore benchmarks
Before we can talk about multicore benchmarks, we need to know what we mean by "multicore." Heterogeneous, homogeneous, SMP, and AMP are some of the multicore "flavors," and each could require a different benchmark structure. Aside from the definition challenge, there are three primary challenges involved in creating a benchmark suite: portability, scalability, and flexibility. An industry-standard multicore benchmark suite must be portable to most, if not all, of the many architectures available in the embedded market.
For a benchmark to be relevant for multiple cores, we must be able to execute the same amount of work regardless of the number of contexts used, and be able to show the performance improvement (or degradation) that results from the number of cores used. The benchmark must be able to utilize any number of computation contexts since it will be used across many different platforms.
There are many approaches to parallel programming and to utilizing the computing power inherent in multiple cores. Different routes may be taken depending on the task, the expected inputs, and the underlying architecture. Since we are looking at the embedded space as a whole, we must produce flexible benchmarks that capture the strengths and weaknesses of any of the methods on the platform being benchmarked.
These are the requirements that underlie MultiBench, a suite of embedded benchmarks that allows processor and system designers to analyze, test, and improve multicore architectures and platforms. MultiBench uses standardized workloads and a test harness that provides compatibility with a wide variety of multicore embedded processors and operating systems.
Some multicore benchmarking terminology
Before we discuss these challenges in detail, here are some important definitions:
- Kernel: An algorithm to be executed (Example: JPEG decompression).
- Work Item: Actual definition of a kernel to be executed on specific data (or data stream) (Example: JPEG decompression of 16 images of varying sizes).
- Workload: Actual definition of a benchmark to be executed that consists of one or more work items, and potentially synchronization requirements between the work items (Example: JPEG decompression of 16 images of varying sizes, rotation of the resulting images, and compression of the rotated images back to JPEG form).
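The three definitions above nest naturally, and a minimal C sketch can make the relationship concrete. The type and function names below (kernel_fn, work_item, workload, workload_run_once, demo_kernel) are hypothetical and chosen for illustration; they are not the MultiBench data structures, and synchronization requirements between work items are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel: an algorithm, modeled here as a function over a data buffer.
   Returns 0 on success. (Hypothetical signature.) */
typedef int (*kernel_fn)(const void *data, size_t size);

/* Work item: a kernel bound to one specific data set. */
typedef struct {
    const char *name;
    kernel_fn   kernel;
    const void *data;
    size_t      size;
} work_item;

/* Workload: one or more work items. */
typedef struct {
    const char      *name;
    const work_item *items;
    size_t           item_count;
} workload;

/* Run every work item once, in order; return how many items failed. */
static size_t workload_run_once(const workload *w)
{
    size_t failures = 0;
    assert(w != NULL);
    for (size_t i = 0; i < w->item_count; i++) {
        const work_item *it = &w->items[i];
        if (it->kernel(it->data, it->size) != 0)
            failures++;
    }
    return failures;
}

/* Stand-in kernel for illustration: succeeds when the buffer is non-empty. */
static int demo_kernel(const void *data, size_t size)
{
    return (data != NULL && size > 0) ? 0 : 1;
}
```

In this sketch the same kernel can appear in many work items, each with different data, which is what lets one algorithm produce workloads of very different sizes.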

Minimal API to establish portability and parallelism
Whether because of accumulated experience, legacy code, or existing tools, C is the de facto standard in the embedded industry and is supported on most, if not all, products in the market. For better or for worse, C does not incorporate any parallel constructs or programming concepts in the language itself. Therefore, at least for the first version of EEMBC MultiBench, we decided to minimize and abstract the API related to parallel programming as much as possible.
This API abstraction has allowed us to use C for coding all the benchmarks, and it also makes it easy to support SMP architectures. Toward that end, the test harness and all kernels in the MultiBench suite use a strictly minimal API of 13 calls and 3 data structures that manage thread creation and destruction as well as synchronization via mutexes or signals. This means that when porting to a new platform, only these 13 calls are needed to allow the framework and all the benchmarks to run and take advantage of parallel execution. Furthermore, many systems support the POSIX threads interface, and the abstraction was carefully chosen so that it maps directly onto the more complex Pthreads interface. This means that if the system already supports Pthreads, no porting is necessary.
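To illustrate what a direct mapping onto Pthreads can look like, here is a minimal sketch of a thin portability layer. The names port_thread, port_mutex, and the port_* functions are assumptions made for this example; they are not the 13 MultiBench calls, only a possible shape for such a layer.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical portability types: on a Pthreads system they simply wrap
   the native handles; another platform would supply its own definitions. */
typedef struct { pthread_t handle; } port_thread;
typedef struct { pthread_mutex_t handle; } port_mutex;

static int port_thread_create(port_thread *t, void *(*entry)(void *), void *arg)
{
    return pthread_create(&t->handle, NULL, entry, arg);
}

static int port_thread_join(port_thread *t)
{
    return pthread_join(t->handle, NULL);
}

static int port_mutex_init(port_mutex *m)    { return pthread_mutex_init(&m->handle, NULL); }
static int port_mutex_lock(port_mutex *m)    { return pthread_mutex_lock(&m->handle); }
static int port_mutex_unlock(port_mutex *m)  { return pthread_mutex_unlock(&m->handle); }
static int port_mutex_destroy(port_mutex *m) { return pthread_mutex_destroy(&m->handle); }

/* Small demonstration written only against the portable calls. */
typedef struct { port_mutex lock; long value; } shared_counter;

static void *add_thousand(void *arg)
{
    shared_counter *c = arg;
    for (int i = 0; i < 1000; i++) {
        port_mutex_lock(&c->lock);
        c->value++;
        port_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Start thread_count threads (at most 16) and return the final count. */
static long run_counter_demo(int thread_count)
{
    port_thread threads[16];
    shared_counter c;
    assert(thread_count > 0 && thread_count <= 16);
    c.value = 0;
    port_mutex_init(&c.lock);
    for (int i = 0; i < thread_count; i++) {
        int rc = port_thread_create(&threads[i], add_thousand, &c);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < thread_count; i++)
        port_thread_join(&threads[i]);
    port_mutex_destroy(&c.lock);
    return c.value;
}
```

Because the demonstration code touches only the port_* names, moving it to a platform without Pthreads would mean reimplementing those few wrappers and nothing else.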
Other abstractions in the MultiBench test harness are similar to what is implemented in prior versions of EEMBC benchmarks. These abstractions are directed at I/O to support functions such as writing to the console and acquiring timing information so the benchmark can time itself. These abstractions are supported by many tool chains through the use of standard libc functions such as printf and require minimal porting effort.
The framework itself takes advantage of the API to create a thread pool that continually executes work items. The thread pool approach hides the latency of thread creation and destruction; latency instead depends on the mutex that guards the queue of items the pool's threads execute. We chose this approach because, on most systems, the overhead of creating and destroying threads is higher than that of simple synchronization.
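The thread pool idea can be sketched in a few lines of Pthreads code. This is a simplified illustration, not the MultiBench framework: the names pool_item, pool_queue, pool_worker, and pool_run are hypothetical, the "queue" is a fixed array with a shared index, and the worker limit of 16 is arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef void (*item_fn)(void *arg);

typedef struct {
    item_fn fn;
    void   *arg;
} pool_item;

/* Shared queue state: a fixed array plus the index of the next item. */
typedef struct {
    pthread_mutex_t  lock;
    const pool_item *items;
    size_t           count;
    size_t           next;
} pool_queue;

/* Each worker repeatedly claims the next item under the mutex,
   then runs it outside the lock so items execute in parallel. */
static void *pool_worker(void *arg)
{
    pool_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        size_t idx = q->next;
        if (idx < q->count)
            q->next++;
        pthread_mutex_unlock(&q->lock);
        if (idx >= q->count)
            break;
        q->items[idx].fn(q->items[idx].arg);
    }
    return NULL;
}

#define POOL_MAX_WORKERS 16

/* Threads are created once, drain the whole queue, and are then joined. */
static void pool_run(const pool_item *items, size_t count, size_t workers)
{
    pthread_t threads[POOL_MAX_WORKERS];
    pool_queue q;
    assert(workers >= 1 && workers <= POOL_MAX_WORKERS);
    q.items = items;
    q.count = count;
    q.next = 0;
    pthread_mutex_init(&q.lock, NULL);
    for (size_t i = 0; i < workers; i++) {
        int rc = pthread_create(&threads[i], NULL, pool_worker, &q);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < workers; i++)
        pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&q.lock);
}

/* Stand-in work item for illustration: square an int in place. */
static void square_in_place(void *arg)
{
    int *v = arg;
    *v = *v * *v;
}
```

The per-item cost in this sketch is one lock and unlock pair, which is the tradeoff the text describes: synchronization overhead replaces thread creation overhead.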

Scalability to stress all hardware resources
The benchmark suite must be able to analyze different platforms with a wide variety of parallel capabilities, so its workloads must be able to scale to take advantage of any number of computing resources. A workload is the actual definition of a benchmark to be executed that consists of one or more work items, and it potentially handles synchronization requirements between the work items. A work item is a kernel to be executed on a specific data set (or data stream).
To address scalability, EEMBC defines the performance metric as workloads per second rather than simply the time taken to accomplish a single workload. To generate this metric, the workload is iterated many times, and each work item from each iteration may be issued to a different computing element depending on the operating system scheduler. For a workload that contains only one work item, this is similar to the "rate" concept utilized by other types of benchmarks. However, since workloads can consist of many different items, the scaling is more realistic. To understand this, we can look at one of the workloads from the first standard subset of MultiBench: rotate16x4Ms1w4. This is a workload that contains 16 image rotation work items, each working on a different 4-Mbyte image (in this particular benchmark, each thread operates on slices one line in size). Each work item also takes advantage of parallel computing resources by using 4 threads to speed up the processing of its image.
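The metric itself is simple arithmetic, sketched below under the assumption that the harness records the number of completed workload iterations and the elapsed wall-clock time. The function names are hypothetical, and the scaling helper is one common way to compare runs with different context counts rather than something the benchmark rules prescribe.

```c
#include <assert.h>

/* Workloads per second: completed iterations divided by elapsed time. */
static double workloads_per_second(unsigned long iterations, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return (double)iterations / elapsed_seconds;
}

/* Scaling factor: throughput with N contexts relative to one context. */
static double scaling_factor(double throughput_n_contexts, double throughput_one_context)
{
    assert(throughput_one_context > 0.0);
    return throughput_n_contexts / throughput_one_context;
}
```

A scaling factor close to the number of contexts indicates near-linear scaling, while a value below 1 indicates that adding contexts degraded performance.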
With a multicore device, if the benchmark framework is only allowed to spawn one work item at a time, at most 4 cores would be utilized to process each image, and the 16 images would be processed sequentially. However, the MultiBench framework can spawn any number of items in parallel thereby allowing the processing of different images in parallel. The pool of executing threads in the framework will launch a new work item as soon as the previous one is done until the workload is completed; then it will repeat this process many times if required. Repeating the process for multiple iterations means we can scale to any number of contexts. Repetition also helps achieve stable results by cancelling out random system effects such as network traffic and other interrupts.
Normally one would worry about load balancing for the different items in the workload, but the framework allows the next work item to be issued as soon as possible—even if the previous iteration of the workload has not yet finished. On workloads that contain work items that are inherently different, that would mean that the slowest work item will execute multiple times in parallel, but this closely resembles the case in the real world, where most of the computing resources may be taken by a few of the multiple tasks that the system needs to complete.
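One way to picture this issue order is as a single running counter over all iterations. The sketch below is an assumption about how such a scheme could be expressed, not a description of the MultiBench internals; issue_slot and slot_for_ticket are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Which iteration and which work item a given issue "ticket" refers to. */
typedef struct {
    size_t iteration;
    size_t item;
} issue_slot;

/* Tickets are handed out in order to whichever thread becomes free.
   Ticket numbers keep counting across iterations, so a free thread can
   start an item of iteration k+1 while items of iteration k still run. */
static issue_slot slot_for_ticket(size_t ticket, size_t items_per_workload)
{
    issue_slot s;
    assert(items_per_workload > 0);
    s.iteration = ticket / items_per_workload;
    s.item = ticket % items_per_workload;
    return s;
}
```

With a 16-item workload, for instance, ticket 16 maps to item 0 of the second iteration, and nothing in the mapping requires the first 16 tickets to have completed before it is issued.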
Flexibility provides many options for parallelism
Perhaps the most challenging aspect of developing multicore benchmarks is that there are an infinite number of work item combinations that can be used to analyze the computing power inherent in parallel execution. Furthermore, as we discussed earlier, a workload that demonstrates the strength of one multicore processor can equally demonstrate the weakness of another. To establish a first version of MultiBench, EEMBC assembled a collection of about 30 workloads. However, to allow processor and system developers to experiment with a wider variety of parameters, EEMBC added a tool called the MultiBench Architect to the benchmark framework. This tool allows users to create new workloads via a simple drag and drop interface. Its possible uses are many; one might be to create workloads with finer levels of granularity, thus allowing the user to determine the exact inflection points in the performance curve.
Additional flexibility for multicore benchmarks can be gained by utilizing methods that aim to capture different aspects of parallel execution. As we've seen, the scalability through multiple items in workloads already captures some of that parallelism—a methodology sometimes known as task decomposition—by running several tasks (or work items) in parallel.
The benchmark suite also addresses other parallelizing methodologies by the nature of the algorithms used. For example, another form of parallelism is achieved through data decomposition. This is where benchmark kernels take control of the portable API to create multiple threads and use those threads to speed up the processing of a single piece of data.
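As an illustration of data decomposition, the sketch below rotates one 8-bit image by 90 degrees clockwise, with each thread handling a band of source rows. It uses Pthreads directly for brevity, and all names (rotate_slice, rotate_slice_worker, rotate90_parallel) are hypothetical; it is not the MultiBench rotation kernel, and the row-band split is just one possible slicing.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One thread's share of the image: source rows [row_begin, row_end). */
typedef struct {
    const unsigned char *src;
    unsigned char       *dst;
    size_t width, height;
    size_t row_begin, row_end;
} rotate_slice;

static void *rotate_slice_worker(void *arg)
{
    const rotate_slice *s = arg;
    for (size_t y = s->row_begin; y < s->row_end; y++)
        for (size_t x = 0; x < s->width; x++)
            /* Clockwise: source (row y, col x) lands at destination
               (row x, col height-1-y); destination rows are height wide. */
            s->dst[x * s->height + (s->height - 1 - y)] = s->src[y * s->width + x];
    return NULL;
}

#define ROTATE_MAX_THREADS 16

/* Split the source rows into bands and rotate the bands in parallel.
   Threads write disjoint destination pixels, so no locking is needed. */
static void rotate90_parallel(const unsigned char *src, unsigned char *dst,
                              size_t width, size_t height, size_t threads)
{
    pthread_t handles[ROTATE_MAX_THREADS];
    rotate_slice slices[ROTATE_MAX_THREADS];
    assert(threads >= 1 && threads <= ROTATE_MAX_THREADS);
    size_t rows_per_thread = (height + threads - 1) / threads;
    for (size_t i = 0; i < threads; i++) {
        size_t begin = i * rows_per_thread;
        if (begin > height)
            begin = height;
        size_t end = begin + rows_per_thread;
        if (end > height)
            end = height;
        slices[i].src = src;
        slices[i].dst = dst;
        slices[i].width = width;
        slices[i].height = height;
        slices[i].row_begin = begin;
        slices[i].row_end = end;
        int rc = pthread_create(&handles[i], NULL, rotate_slice_worker, &slices[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t i = 0; i < threads; i++)
        pthread_join(handles[i], NULL);
}
```

The slice size is the tuning knob here: smaller slices balance load more evenly across threads but increase the amount of bookkeeping per pixel processed.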
Another form of parallelism can be achieved using a methodology sometimes referred to as functional decomposition or pipelining. In effect, a complex task is broken into parts and each part feeds into the next, similar to an assembly line. This approach is usually tailored very closely to a specific architecture and is not scalable by nature. Rather than creating new kernels that would apply only to a very narrow segment, MultiBench will address this methodology by allowing work items to be chained together, feeding the result of one work item to the next. Work items thus remain the building blocks for creating new workloads, including workloads that test how well systems behave when functional decomposition is used.
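The data flow of chained work items can be sketched as a list of stages applied to one buffer in order. This shows only the chaining itself, run sequentially; in a real pipeline the stages would execute concurrently on different data. The names chain_stage, chain_run, stage_increment, and stage_reverse are hypothetical stand-ins, not MultiBench kernels.

```c
#include <assert.h>
#include <stddef.h>

/* A stage transforms the buffer in place and returns 0 on success. */
typedef int (*chain_stage)(int *buf, size_t n);

/* Feed the result of each stage to the next; stop at the first failure.
   Returns the number of stages that completed successfully. */
static size_t chain_run(const chain_stage *stages, size_t stage_count,
                        int *buf, size_t n)
{
    size_t done = 0;
    assert(stages != NULL && buf != NULL);
    while (done < stage_count && stages[done](buf, n) == 0)
        done++;
    return done;
}

/* Stand-in stage: add 1 to every element. */
static int stage_increment(int *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] += 1;
    return 0;
}

/* Stand-in stage: reverse the buffer. */
static int stage_reverse(int *buf, size_t n)
{
    for (size_t i = 0; i < n / 2; i++) {
        int tmp = buf[i];
        buf[i] = buf[n - 1 - i];
        buf[n - 1 - i] = tmp;
    }
    return 0;
}
```

Because every stage shares one signature, a new chain is just a different array of existing stages, which mirrors the idea of reusing work items as building blocks.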
Heterogeneous processors deliver greatest challenge to portability
Heterogeneous computing, the type most commonly found in embedded systems, is another aspect of parallel systems. These systems employ processors with different architectures, typically in the form of specialized hardware that is more cost- and power-efficient because the device is expected to perform mainly the same task repeatedly.
Multicore processors can also consist of some combination of homogeneous and heterogeneous processing elements. Benchmarking these devices in a relevant manner with portable code is a huge challenge, as each part of the system may use a different tool chain, and communication between different parts of the system is not standardized.
There are several options to benchmark such systems. Extending the existing MultiBench framework to support heterogeneous systems requires reliance on some standard to allow the benchmarks to be portable. For example, the Multicore Communications API (MCAPI) from the Multicore Association provides a standardized framework that can be used to partition benchmark code into blocks that use MCAPI to communicate.
Author information
Markus Levy is founder and president of EEMBC. He is also president of The Multicore Association and chairman of Multicore Expo. Levy was previously a senior analyst at In-Stat/MDR and an editor at EDN, focusing in both roles on processors for the embedded industry. Levy began his career in the semiconductor industry at Intel, where he served as both a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products. He is the co-author of Designing with Flash Memory, the one and only technical book on this subject, and received several patents while at Intel for his ideas related to flash-memory architecture and usage as a disk-drive alternative. He is also a volunteer firefighter.
Shay Gal-On is EEMBC's director of software engineering and leader of the EEMBC Technology Center. Before joining EEMBC, he was principal performance analyst in the Microprocessor Products Group at PMC-Sierra, and his career has also included roles as a software engineer for Improv Systems and Intel. A compiler/tools expert, he has devoted considerable effort to analyzing the effects of various compilers on benchmark performance and has ported the EEMBC benchmarks using Wind River Diab, Green Hills MULTI, ARM RVDS, Improv Jazz tools, the Stretch/Tensilica development environment, and many versions of gcc. He has also served as a member representative on the EEMBC Board of Directors and thus is well acquainted with EEMBC processes, having ported and optimized the benchmarks for a wide variety of architectures.















