Feature
Uncovering the truth in benchmarks
Benchmarks are supposed to save you time when analyzing and comparing systems. But getting true value from benchmarks usually means reverse-engineering what each score means and applying those insights to your situation.
By Robert Cravotta, Technical Editor -- EDN, 10/2/2003
A benchmark is a point of reference by which you can consistently measure, quantify, and compare the value and quality of two or more similar alternatives, such as business processes, tools, and embedded processors. For embedded processors, a benchmark is usually a consistent set of software code that you execute on each candidate processor, so you can compare its performance with that of other processor options. Processor benchmarks are not limited to measuring processor architectural efficiency; they can also indicate the efficiency of the compiler compared with hand-optimized coding.
Ideal benchmarks abstract and consolidate the essential performance metrics of a system to a simplified representation or score that, within a specific context, enables you to complete a meaningful apples-to-apples comparison between alternative systems. However, using an ideal benchmark is impractical if it does not save you time, save you money, or lower your risk more than less precise comparison efforts.
For systems or tasks that have a single clear objective, you can often find one performance metric that precisely captures the system's behavior and correlates well when you compare it with alternatives. Comparing the clock rate of two processor devices has been a popular benchmark metric for relative performance, but it has a narrow applicable context; it is useful only when the two devices are nearly identical except for the clock rate. Despite the seductive simplicity of just comparing benchmark scores, correctly interpreting those scores relative to each other requires you to understand the underlying details of the benchmark measurements and their relevance to your application. For embedded designs, the processor architectures that you compare can differ greatly, and using the clock rate as a benchmark metric can be inappropriate.
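The narrow applicability of clock rate follows from the standard relationship: execution time = instruction count × average cycles per instruction ÷ clock rate. The following is a minimal sketch in Python. The two processors and all of their figures are hypothetical, chosen only to show that the faster clock does not guarantee the shorter run time.

```python
def execution_time_s(instruction_count, cycles_per_instruction, clock_hz):
    """Time to run a workload: instructions * CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz

# Hypothetical devices running the same task (all numbers are made up).
# Processor A: higher clock, but needs more instructions and more cycles each.
time_a = execution_time_s(instruction_count=12_000_000,
                          cycles_per_instruction=1.5,
                          clock_hz=400e6)
# Processor B: lower clock, but a denser instruction set and fewer stalls.
time_b = execution_time_s(instruction_count=8_000_000,
                          cycles_per_instruction=1.1,
                          clock_hz=300e6)

print(f"A: {time_a * 1e3:.1f} ms at 400 MHz")
print(f"B: {time_b * 1e3:.1f} ms at 300 MHz")
```

With these assumed figures, the 300-MHz part finishes the task first, because instruction count and cycles per instruction differ between the two architectures.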
In general, embedded designs must simultaneously balance and satisfy many objectives, such as quickly and cost-effectively delivering the correct function with low power, high quality, and flexibility. For these types of situations, performance is multifaceted, and precisely characterizing the system performance as a simplified score for meaningful comparison with other options can be difficult and expensive. One challenge of comparing systems is balancing the ease and cost of capturing and deriving the benchmark scoring with the meaningfulness of ranking those scores among the same scores of dissimilar systems.
Lies and more lies
Many processor vendors use benchmark scores as a marketing tool; however, some of the commonly published benchmark scores, such as MIPS (millions of instructions per second) and DMIPS (Dhrystone MIPS), are meaningless and inappropriate without a specific context (see sidebar "Gaming the benchmarks"). The continued generic use of these types of performance scoring has earned processor benchmarks the reputation of being inaccurate measures of processor performance. A Web search will yield many references to the following statement about standard benchmarks, "In the computer industry, there are three kinds of lies: lies, damn lies, and benchmarks."
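To see why a DMIPS figure needs context, it helps to look at how the number is derived. By convention, a DMIPS score is the measured Dhrystones per second divided by 1757, the Dhrystone rate attributed to the VAX 11/780 reference machine. The sketch below applies that conversion; the measured rates and clock speeds are hypothetical and do not describe any real device.

```python
VAX_11_780_DHRYSTONES_PER_S = 1757  # conventional reference: 1 DMIPS

def dmips(dhrystones_per_s):
    """Convert a raw Dhrystone rate into Dhrystone MIPS."""
    return dhrystones_per_s / VAX_11_780_DHRYSTONES_PER_S

def dmips_per_mhz(dhrystones_per_s, clock_mhz):
    """Normalize by clock rate so parts at different speeds can be compared."""
    return dmips(dhrystones_per_s) / clock_mhz

# Hypothetical measurements, not data for any real device.
part_x = {"dhrystones_per_s": 351_400, "clock_mhz": 200}
part_y = {"dhrystones_per_s": 263_550, "clock_mhz": 100}

for name, part in (("X", part_x), ("Y", part_y)):
    print(name,
          round(dmips(part["dhrystones_per_s"]), 1), "DMIPS,",
          round(dmips_per_mhz(**part), 2), "DMIPS/MHz")
```

In this made-up case, part X posts the higher raw DMIPS while part Y does more work per clock. Neither number records the compiler, optimization settings, or memory configuration used, which is the context the score needs before it means anything.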
A synthetic benchmark usually attempts to measure one or more raw-performance aspects of a system, processor, or compiler by using artificial instruction sequences or by trying to mimic general instruction mixes that you might find in real-world applications. In contrast, a real-world-application benchmark moves a step up from considering the processor's features and attempts to predict and quantify how a processor's architecture and development tools will handle the expected workload for a specific type of application. Application benchmarks may use one or more sets of real application-code blocks to perform representative functions for an application.
The continued practice of publicly disseminating standard synthetic benchmarks, such as MIPS and DMIPS, underscores the challenge of developing and using simple and meaningful metrics. These types of benchmarks are relatively inexpensive and among the easiest to derive, and there is no obvious public-domain replacement that is as easy and inexpensive to perform. Industry-standard benchmarks can provide a basis for comparing competing offerings; however, vendors usually incur a substantial cost to produce benchmark evidence. The BDTI (Berkeley Design Technology Inc), EEMBC (EDN Embedded Microprocessor Benchmark Consortium), and SPEC (Standard Performance Evaluation Corp) benchmark organizations address different aspects of computing by offering benchmark suites that provide an assortment of targeted performance measurements. The various benchmark suites form the basis for consistent testing and support a range of granularity in the performance scores that allows you to better assess a processor's features and system performance under a specific context.
Industry-standard benchmarks can be valuable inputs into your decision process, but they are generally not sufficient as the sole basis for your design decisions. For one thing, it may be difficult or impossible for you to find relevant and comparable benchmark results for every processor platform you should be considering. Another challenge is that when you do find relevant and comparable benchmarks, the benchmark data may have been derived at substantially different points in time, for different generations of the processors you are currently considering, and with memory structures that differ significantly from what you will be using. Ad-hoc or custom benchmarks that you perform as part of your analysis can help bridge the gap between industry-standard benchmarks and your application needs.
Several types of benchmark users exist. The obvious benchmark users are embedded-system designers, or end users, whom marketing organizations target; however, according to several benchmark organizations, end users are not the primary group that uses industry-standard benchmark results. The benchmarks are insufficient as the primary basis for decision-making, because they do not accurately reflect each embedded-system designer's application needs. For example, many benchmarks are kernel-level and do not account for the interactions with the operating systems that designers use in the final application.
When end users use benchmark results, they often work more closely with each processor vendor than with the benchmark organizations to obtain performance data, assistance with duplicating the test results, and an understanding of how the benchmark configuration applies to their design requirements. End users may use benchmarks as an input for paring down a list of processor candidates, but other items, such as the integrated feature set, supported I/O interfaces, development tools, training, documentation, third-party development-support infrastructure, and road-map risk, can play an important role in eliminating processor choices. In general, end-user benchmarking is an ad-hoc process that applies only to the immediate design project.
Many OEMs (original equipment manufacturers) that deal in high-volume and price-sensitive designs rely on their own custom benchmarks to make processor selections. The custom benchmarks can consist of legacy application code but may also include the code from industry-standard benchmarks. Some OEMs become members of the benchmark organizations to gain access to the benchmark source code, so they can incorporate it, in part, into their own custom-benchmark suite. These benchmark suites can be more detailed and comprehensive, because they can address narrow target requirements; industry-standard benchmarks, in an effort to balance relevance, complexity, and cost, address a wider application scope. In general, OEM-benchmark efforts are competitively sensitive and for internal use only.
Benchmarks can help you perform feasibility studies and find a good processor for your application, but they can also help you verify system performance, validate change improvements, and determine processor headroom in your implementation during and after the design cycle. Benchmarks can help you identify performance shortfalls and help you focus your attention on improving performance. Benchmark-test specifics are not limited to approximating typical application workloads; they can include lessons learned that stress the system to validate performance—especially valuable for safety-critical applications.
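Processor headroom can be estimated from benchmark measurements as the fraction of the available cycle budget that the workload leaves unused. The sketch below assumes a periodic task with a fixed deadline; the function name, the cycle count, and the frame period are hypothetical.

```python
def headroom_fraction(measured_cycles, clock_hz, period_s):
    """Fraction of the per-period cycle budget left unused by the workload.

    A negative result means the workload overruns its period.
    """
    budget_cycles = clock_hz * period_s
    return 1.0 - measured_cycles / budget_cycles

# Hypothetical example: a task measured at 1.2 million cycles per frame,
# on a 150-MHz processor, with a 10-ms frame period.
h = headroom_fraction(measured_cycles=1_200_000, clock_hz=150e6, period_s=10e-3)
print(f"headroom: {h:.0%}")
```

Repeating the measurement after each design change gives a simple way to confirm that an intended improvement actually increased the margin.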
Processor vendors and the vendors that provide supporting development tools are the primary users of benchmarks; they use internal and industry-standard benchmark results for architectural profiling and optimizing feedback. For architectural designers, the benchmark results can indicate whether a processor architectural change makes the processor more or less suited for an application. Benchmarks can also demonstrate that an architectural change made no difference for meeting the needs of the target application. For tool designers, the benchmark results can indicate whether the compiler is meaningfully accessing the functional features unique to a processor. This type of closed-loop optimization can lead to the positive evolution of processor architectures and development tools when the benchmark suites adequately capture the necessary types of application functions and exercise the appropriate processor features.
Marketing organizations use benchmark results quite differently from the way the previously mentioned groups do. Marketers look at benchmarks as an opportunity to identify advantages over competing processors that have undergone the same or similar benchmarking tests and whose companies have published the results (see sidebar "Public or private disclosure?"). Relying on marketing material that touts one superior benchmark score over competing processors can be dangerous. Marketing materials often display benchmark scores as a single composite number in a graph; it is incumbent on you to research the test-configuration disclosures of the compared systems.
Modern benchmarks still do not encompass composite scoring that highlights the inclusion of artificial benchmark-specific functional units, differences in device cost and power consumption, or the programmability and flexibility of one processor versus another. Some of the highest benchmark scores belong to processors that perform one function well but perform others less favorably.
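A small numeric example shows how a single composite number can hide this kind of imbalance. The sketch assumes the composite is a geometric mean of per-kernel scores normalized to a reference part, which is one common way to combine results. The kernel names and scores are hypothetical.

```python
import math

def geometric_mean(scores):
    """Geometric mean of positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Hypothetical per-kernel scores, normalized so that 1.0 equals a reference part.
balanced = {"fir_filter": 2.0, "fft": 2.0, "control_code": 2.0, "bit_manip": 2.0}
specialist = {"fir_filter": 16.0, "fft": 4.0, "control_code": 0.5, "bit_manip": 0.5}

for name, kernels in (("balanced", balanced), ("specialist", specialist)):
    composite = geometric_mean(kernels.values())
    print(f"{name}: composite {composite:.2f}, worst kernel {min(kernels.values())}")
```

Both hypothetical parts produce the same composite, yet the specialist runs two of the four kernels at half the speed of the reference. Only the per-kernel breakdown reveals that.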
Using benchmarks
To get the most from any benchmark, you first need to understand the characteristics of your application, so that you can assess the relevance of a benchmark result to your situation. When using benchmarks to select a platform, vendor, or processor architecture, you should start by listing and ranking important capabilities and performance thresholds. Identify the contexts that your application will operate under and what tasks and algorithms it must be able to complete. Will you use hand-optimized assembly, or will all of your software be compiled? Will you be using an operating system? What is your power budget? How much memory will your program code and data require? Can you assume that all memory access will be from fast memory? What peripherals and I/O interfaces do you need? How much performance and I/O headroom do you need?
In addition to considering the performance and feature requirements, you should identify and rank in the same list project and business requirements. How important is using a development tool suite to avoid learning a new tool or operating system or to support legacy code? How much application expertise do you need access to from the processor vendor's development-support infrastructure? Do you need application notes, sample code, or reference designs? What are your cost targets? Do you need to be able to work within a specific test environment? How important is access to training? What is your vendor-road-map risk tolerance?
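One way to organize such a ranked list is a weighted decision matrix with hard thresholds: candidates that miss a must-have limit are eliminated, and the rest are ordered by weighted score. The sketch below is illustrative only. The criteria, weights, thresholds, and candidate ratings are all hypothetical placeholders for your own requirements.

```python
# Relative importance of each requirement (hypothetical weights summing to 1.0).
WEIGHTS = {"performance": 0.4, "tools": 0.3, "vendor_support": 0.2, "roadmap": 0.1}

# Hard limits a candidate must meet before it is scored (hypothetical values).
MAX_POWER_MW = 500
MAX_UNIT_COST_USD = 12.0

# Ratings on a 0-10 scale from your own evaluation (made-up numbers).
CANDIDATES = {
    "cpu_a": {"power_mw": 450, "unit_cost_usd": 10.0,
              "ratings": {"performance": 9, "tools": 5,
                          "vendor_support": 6, "roadmap": 7}},
    "cpu_b": {"power_mw": 300, "unit_cost_usd": 11.5,
              "ratings": {"performance": 7, "tools": 9,
                          "vendor_support": 8, "roadmap": 6}},
    "cpu_c": {"power_mw": 700, "unit_cost_usd": 8.0,
              "ratings": {"performance": 10, "tools": 8,
                          "vendor_support": 7, "roadmap": 8}},
}

def meets_thresholds(candidate):
    """True if the candidate satisfies every hard limit."""
    return (candidate["power_mw"] <= MAX_POWER_MW
            and candidate["unit_cost_usd"] <= MAX_UNIT_COST_USD)

def weighted_score(ratings):
    """Sum of each rating multiplied by the weight of its criterion."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)

def rank(candidates):
    """Drop candidates that fail a threshold, then sort best-first by score."""
    viable = {n: c for n, c in candidates.items() if meets_thresholds(c)}
    return sorted(viable, key=lambda n: weighted_score(viable[n]["ratings"]),
                  reverse=True)

print(rank(CANDIDATES))
```

In this made-up data set, cpu_c has the best ratings but is excluded by the power budget, and cpu_b outranks cpu_a despite a lower performance rating because tools and vendor support carry weight.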
The most useful benchmarks accurately predict or characterize how well a processor and vendor will support your application and project needs for the least amount of cost, time, and effort. If an industry-standard benchmark closely correlates with your application, use it, but be sure to practice due diligence and verify the details of each benchmark test. Does the test setup aim to demonstrate calculation performance, prove sustained I/O efficiency, or show low power consumption? And is the bias in the setup relevant to your situation? Are the datapaths and memory structures realistic for your application? Benchmarks can characterize processor performance, but they currently do not capture project and business requirements. For this reason, benchmarks should serve only as an input into your processor-selection analysis.
Be sure to understand whether the benchmarks you use appropriately reflect processor features and compiler efficiency. Compiler efficiency is irrelevant to applications that rely heavily on handcrafted, assembly-language routines, such as those in the inner loops of high-performance signal-processing algorithms. Likewise, processor-feature performance is irrelevant if you are using C code and the compiler you use does not take advantage of the processor features you want to use.
When standard benchmarks do not closely enough approximate your application requirements, you can still gain value from benchmarks by applying ad-hoc and custom benchmarks with the assistance of the processor vendor or a benchmark service provider. When you are using benchmark data for analysis, always ask the processor vendor for the latest benchmark scores and full disclosure of how it generated those scores. This data may not be public, but full disclosure is critical to your understanding what the scores mean, how to duplicate them, and their relevance to your situation. You may decide to try to duplicate those results on the vendor's development tools, such as on an evaluation board, or, in some cases, you may give the processor vendor your own benchmark code to test and report back to you. In any case, the vendor should be able to discuss the results of all of the testing with specific comments relevant to your situation.
It takes time and resources for you to analyze and understand the benchmarks; however, trying to duplicate a vendor's benchmarks can uncover some valuable insights. Processor vendors are likely to make benchmarking efforts and share those results only if the investment is likely to pay off in sales. The more the processor vendor understands your application, the better the benchmark support it can provide. A vendor's benchmarks can highlight and provide you insight into tool and processor features that standard benchmarks cannot capture. You may discover, with assistance from the vendor, compiler optimizations and specifics of the processor-architecture features that challenge your implementation assumptions for better or for worse. The vendor should be able to help you map your code to the target device and provide coding examples on how to exploit the special features. Some vendors admit that a customer's benchmark-duplication effort highlighted key features that the customer did not fully understand, and the improved insight was critical to the customer's analysis. It is more beneficial to make these discoveries during the selection process than after you complete an implementation.
It's about saving time
The primary purposes of benchmarks are to save you time and money and lower your overall risk. If you did not care about time and money, you could develop your application on multiple systems and then compare the end results before making your final decisions. In some situations, this approach may be appropriate, but for most situations, it is an impossible luxury. When you apply them appropriately, benchmarks should help you predict your success with each candidate without incurring the cost of a full design implementation, but only if the benchmarks correlate with your application workloads.
Benchmark organizations are modifying their benchmark suites to better reflect changes in the market and to minimize the risk of vendors' gaming stable benchmark test suites. Those planning future benchmarks are struggling with how to better reflect the changing workloads of today's and tomorrow's applications, including how to incorporate power consumption into the scoring. Power consumption depends on many system issues that make it difficult for designers to capture benchmark-like scoring in a standard and meaningful way. Another challenging area comes from the trend of applications converging in a single product or device; it forces future benchmarks to consider how to capture workloads that cross application boundaries.
If you can afford it, you should invest the time it takes to analyze the full disclosure of the system configuration and "reverse-engineer" a vendor's benchmark score; the insights you gain during this process can prove invaluable. You should be able to discuss the details of all benchmark tests with the vendor. The contrived circumstances of the test environment may invalidate the usefulness of the scores but may highlight a useful feature that you would otherwise have missed.
Ultimately, benchmarks can help you select a processor based on performance. But alone, benchmarks are insufficient to help you choose between two comparable processor systems. Going through the benchmarks in detail with the vendor may give you a qualitative sense of how well the vendor and accompanying development-support infrastructure will meet your needs. Once you have adequate processor candidates, factors such as a vendor's technical responsiveness, the ease with which you can port your code, the amount of software porting you need to perform, the quality of the documentation and training, and the likelihood that the processor road-map will cover your future needs will take on a larger priority in your final decision.
| For more information... | ||
| For more information on products such as those discussed in this article, contact any of the following manufacturers directly, and please let them know you read about their products in EDN. | ||
| Adelante Technologies +31-40-2353-100 www.adelantetech.com | AMD 1-408-749-4000 www.amd.com | Analog Devices 1-800-262-5643 www.analog.com |
| ARC 1-408-437-3400 www.arc.com | ARM 1-408-579-2200 www.arm.com | BDTI (Berkeley Design Technology Inc) 1-510-665-1600 www.bdti.com |
| EEMBC (EDN Embedded Microprocessor Benchmark Consortium) 1-530-672-9113 www.eembc.org | Green Hills Software 1-805-965-6044 www.ghs.com | LSI Logic 1-866-574-5741 www.lsilogic.com |
| Motorola 1-512-895-2000 www.motorola.com | NEC 1-408-588-6000 www.necelam.com | QuickLogic 1-408-990-4000 www.quicklogic.com |
| Renesas 1-408-382-7500 www.renesas.com | SPEC (Standard Performance Evaluation Corp) 1-540-349-7878 www.spec.org | STMicroelectronics 1-718-861-2650 www.st.com |
| Texas Instruments 1-800-336-5236 www.ti.com | ||
Author Information
Technical Editor Robert Cravotta will be at the Microprocessor Forum in San Jose on October 14 and 15. You can reach him at 1-661-296-5096 and via e-mail at rcravotta@edn.com.