Feature

Uncovering the truth in benchmarks

Benchmarks are supposed to save you time when analyzing and comparing systems. But getting true value from benchmarks usually means reverse-engineering what each score means and applying those insights to your situation.

By Robert Cravotta, Technical Editor -- EDN, 10/2/2003

AT A GLANCE
  • Benchmarks attempt to abstract and simplify complex systems so you can better perform apples-to-apples comparisons.
  • Benchmarks must be accompanied by a full disclosure to be meaningful.
  • It is incumbent on you to analyze benchmark disclosures to determine a given score's relevance to your situation.
  • Benchmarks should be only a data point in your total decision process; other qualitative factors, such as vendor-development support and platform flexibility, can override high benchmark scores.
Sidebars:
Gaming the benchmarks
Public or private disclosure?

A benchmark is a point of reference by which you can consistently measure, quantify, and compare the value and quality of two or more similar alternatives, such as business processes, tools, and embedded processors. For embedded processors, benchmarks are usually a consistent set of software code that you execute on candidate processors, so you can compare each processor's performance with that of the other options. Processor benchmarks are not limited to measuring processor architectural efficiency; they can indicate the efficiency of the compiler compared with hand-optimized coding.

Ideal benchmarks abstract and consolidate the essential performance metrics of a system to a simplified representation or score that, within a specific context, enables you to complete a meaningful apples-to-apples comparison between alternative systems. However, using an ideal benchmark is impractical if it does not save you time, save you money, or lower your risk more than less precise comparison efforts.

For systems or tasks that have a single clear objective, you can often find one performance metric that precisely captures the system's behavior and correlates well when you compare it with alternatives. Comparing the clock rate of two processor devices has been a popular benchmark metric for relative performance, but it has a narrow applicable context; it is useful only when the two devices are nearly identical except for the clock rate. Despite the seductive simplicity of just comparing benchmark scores, correctly interpreting those scores relative to each other requires you to understand the underlying details of the benchmark measurements and their relevance to your application. For embedded designs, the processor architectures that you compare can differ greatly, and using the clock rate as a benchmark metric can be inappropriate.
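One way to see why clock rate alone can mislead is the textbook relation: execution time equals instruction count times cycles per instruction, divided by clock rate. The sketch below applies it to two hypothetical processors; every number is invented for illustration and describes no real device.

```python
def execution_time_s(instruction_count, cycles_per_instruction, clock_hz):
    """Textbook estimate: time = instructions * CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz

# Two hypothetical devices running the same task (illustrative numbers only).
# Device A: higher clock, but it needs more instructions and more cycles for each.
time_a = execution_time_s(instruction_count=2_000_000,
                          cycles_per_instruction=1.5,
                          clock_hz=400e6)
# Device B: lower clock, but each instruction accomplishes more work.
time_b = execution_time_s(instruction_count=1_200_000,
                          cycles_per_instruction=1.0,
                          clock_hz=300e6)

print(f"A at 400 MHz: {time_a * 1e3:.2f} ms")
print(f"B at 300 MHz: {time_b * 1e3:.2f} ms")
```

With these assumed figures, the device with the slower clock finishes first, which is why the clock-rate comparison holds only when the two architectures are otherwise nearly identical.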

In general, embedded designs must simultaneously balance and satisfy many objectives, such as quickly and cost-effectively delivering the correct function with low power, high quality, and flexibility. For these types of situations, performance is multifaceted, and precisely characterizing the system performance as a simplified score for meaningful comparison with other options can be difficult and expensive. One challenge of comparing systems is balancing the ease and cost of capturing and deriving the benchmark scoring with the meaningfulness of ranking those scores among the same scores of dissimilar systems.

Lies and more lies

Many processor vendors use benchmark scores as a marketing tool; however, some of the commonly published benchmark scores, such as MIPS (millions of instructions per second) and DMIPS (Dhrystone MIPS), are meaningless and inappropriate without a specific context (see sidebar "Gaming the benchmarks"). The continued generic use of these types of performance scoring has earned processor benchmarks the reputation of being inaccurate measures of processor performance. A Web search will yield many references to the following statement about standard benchmarks, "In the computer industry, there are three kinds of lies: lies, damn lies, and benchmarks."

A synthetic benchmark usually attempts to measure one or more raw-performance aspects of a system, processor, or compiler by using artificial instruction sequences or by trying to mimic general instruction mixes that you might find in real-world applications. In contrast, a real-world-application benchmark moves a step up from considering the processor's features and attempts to predict and quantify how a processor's architecture and development tools will handle the expected workload for a specific type of application. Application benchmarks may use one or more sets of real application-code blocks to perform representative functions for an application.

The continued practice of publicly disseminating standard synthetic benchmarks, such as MIPS and DMIPS, underscores the challenge of developing and using simple and meaningful metrics. These types of benchmarks are relatively inexpensive and among the easiest to derive, and there is no obvious public-domain replacement that is as easy and inexpensive to perform. Industry-standard benchmarks can provide a basis for comparing competing offerings; however, vendors usually incur a substantial cost to produce benchmark evidence. The BDTI (Berkeley Design Technology Inc), EEMBC (EDN Embedded Microprocessor Benchmark Consortium), and SPEC (Standard Performance Evaluation Corp) benchmark organizations address different aspects of computing by offering benchmark suites that provide an assortment of targeted performance measurements. The various benchmark suites form the basis for consistent testing and support a range of granularity in the performance scores that allows you to better assess a processor's features and system performance under a specific context.

Industry-standard benchmarks can be valuable inputs into your decision process, but they are generally not enough for you to base your design decisions solely upon. For one thing, it may be difficult or impossible for you to find relevant and comparable benchmark results for every processor platform you should be considering. Another challenge is that when you do find relevant and comparable benchmarks, the benchmark data may have been derived for different generations of the processors you are currently considering at substantially different points in time, and it may have employed significantly different memory structures from what you will be using. Ad-hoc or custom benchmarks that you perform as part of your analysis can help bridge the gap between industry-standard benchmarks and your application needs.

Several types of benchmark users exist. Obvious benchmark users are embedded-system designers, or end users, whom marketing organizations target; however, according to several benchmark organizations, end users are not the primary group that uses industry-standard benchmark results. The benchmarks are insufficient as the primary basis for decision-making, because they do not accurately reflect each embedded-system designer's application needs. For example, many benchmarks are kernel-level and do not account for the interactions with the operating systems that designers use in the final application.

When end users use benchmark results, they often work more closely with each processor vendor than with the benchmark organizations to obtain performance data, assistance with duplicating the test results, and an understanding of how the benchmark configuration applies to their design requirements. End users may use benchmarks as an input for paring down a list of processor candidates, but other items, such as the integrated feature set, supported I/O interfaces, development tools, training, documentation, third-party development-support infrastructure, and road-map risk can play an important role in eliminating processor choices. In general, end-user benchmarking is an ad-hoc process that applies only to the immediate design project.

Many OEMs (original equipment manufacturers) that deal in high-volume and price-sensitive designs rely on their own custom benchmarks to make processor selections. The custom benchmarks can consist of legacy application code but may also include the code from industry-standard benchmarks. Some OEMs become members of the benchmark organizations to gain access to the benchmark source code, so they can incorporate it, in part, into their own custom-benchmark suite. These benchmark suites can be more detailed and comprehensive, because they can address narrow target requirements; industry-standard benchmarks, in an effort to balance relevance, complexity, and cost, address a wider application scope. In general, OEM-benchmark efforts are competitively sensitive and for internal use only.

Benchmarks can help you perform feasibility studies and find a good processor for your application, but they can also help you verify system performance, validate change improvements, and determine processor headroom in your implementation during and after the design cycle. Benchmarks can help you identify performance shortfalls and help you focus your attention on improving performance. Benchmark-test specifics are not limited to approximating typical application workloads; they can include lessons learned that stress the system to validate performance—especially valuable for safety-critical applications.
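As a rough illustration of the headroom idea, the sketch below treats headroom as the fraction of a cycle budget left unused by a measured worst-case workload. The function name, the clock rate, the deadline, and the cycle count are all assumptions made up for this example.

```python
def headroom_fraction(worst_case_cycles, clock_hz, deadline_s):
    """Fraction of the cycle budget left over after the worst-case workload."""
    budget_cycles = clock_hz * deadline_s
    if budget_cycles <= 0:
        raise ValueError("cycle budget must be positive")
    return 1.0 - worst_case_cycles / budget_cycles

# Hypothetical case: 100-MHz processor, 10-ms frame deadline,
# 650,000 cycles measured for the worst-case frame.
headroom = headroom_fraction(worst_case_cycles=650_000,
                             clock_hz=100e6,
                             deadline_s=0.010)
print(f"headroom: {headroom:.0%}")
```

A negative result would mean that the measured workload already exceeds the budget for that deadline.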

Processor vendors and the vendors that provide supporting development tools are the primary users of benchmarks; they use internal and industry-standard benchmark results for architectural profiling and optimizing feedback. For architectural designers, the benchmark results can indicate whether a processor architectural change makes the processor more or less suited for an application. Benchmarks can also demonstrate that an architectural change made no difference for meeting the needs of the target application. For tool designers, the benchmark results can indicate whether the compiler is meaningfully accessing the functional features unique to a processor. This type of closed-loop optimization can lead to the positive evolution of processor architectures and development tools when the benchmark suites adequately capture the necessary types of application functions and exercise the appropriate processor features.

Marketing organizations use benchmark results much differently from the way that previously mentioned groups do. Marketers look at benchmarks as an opportunity to identify advantages over competing processors that have undergone the same or similar benchmarking tests and whose companies have published the results (see sidebar "Public or private disclosure?"). Relying on marketing material that touts one superior benchmark score over competing processors can be dangerous. Marketing materials often display benchmark scores as a single composite number in a graph; it is incumbent on you to research the test-configuration disclosures of the compared systems.

Modern benchmarks still do not encompass composite scoring that highlights the inclusion of artificial benchmark-specific functional units, the difference between the device costs, power consumption, or the programmability and flexibility of one processor versus another. Some of the highest benchmark scores belong to processors that perform one function well but perform others less favorably.

Using benchmarks

To get the most from any benchmark, you first need to understand the characteristics of your application, so that you can assess the relevance of a benchmark result to your situation. When using benchmarks to select a platform, vendor, or processor architecture, you should start by listing and ranking important capabilities and performance thresholds. Identify the contexts that your application will operate under and what tasks and algorithms it must be able to complete. Will you use hand-optimized assembly, or will all of your software be compiled? Will you be using an operating system? What is your power budget? How much memory will your program code and data require? Can you assume that all memory access will be from fast memory? What peripherals and I/O interfaces do you need? How much performance and I/O headroom do you need?

In addition to considering the performance and feature requirements, you should identify and rank in the same list project and business requirements. How important is using a development tool suite to avoid learning a new tool or operating system or to support legacy code? How much application expertise do you need access to from the processor vendor's development-support infrastructure? Do you need application notes, sample code, or reference designs? What are your cost targets? Do you need to be able to work within a specific test environment? How important is access to training? What is your vendor-road-map risk tolerance?
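One way to organize the ranked list that these questions produce is a weighted-score matrix with hard thresholds. The sketch below is only an illustration: the criteria names, weights, thresholds, candidate names, and ratings are all hypothetical placeholders that you would replace with your own.

```python
# Hypothetical criteria and weights (a higher weight means more important).
WEIGHTS = {"benchmark_score": 3, "power": 4, "tool_familiarity": 2, "vendor_support": 3}

# Hard thresholds a candidate must meet before it is ranked at all (0-10 ratings).
MINIMUMS = {"power": 5}

# Hypothetical candidates, each rated 0-10 per criterion.
CANDIDATES = {
    "cpu_x": {"benchmark_score": 9, "power": 4, "tool_familiarity": 6, "vendor_support": 7},
    "cpu_y": {"benchmark_score": 7, "power": 8, "tool_familiarity": 8, "vendor_support": 6},
    "cpu_z": {"benchmark_score": 6, "power": 7, "tool_familiarity": 5, "vendor_support": 9},
}

def meets_minimums(ratings):
    """True if the candidate satisfies every hard threshold."""
    return all(ratings[name] >= floor for name, floor in MINIMUMS.items())

def weighted_score(ratings):
    """Sum of each rating multiplied by the weight of its criterion."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)

def rank(candidates):
    """Drop candidates that miss a threshold, then sort best-first by weighted score."""
    eligible = {name: r for name, r in candidates.items() if meets_minimums(r)}
    return sorted(eligible, key=lambda name: weighted_score(eligible[name]), reverse=True)

ranking = rank(CANDIDATES)
print(ranking)
```

In this made-up data, the candidate with the best benchmark rating drops out because it misses the power threshold, so the benchmark score acts as one input among several.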

The most useful benchmarks accurately predict or characterize how well a processor and vendor will support your application and project needs for the least amount of cost, time, and effort. If an industry-standard benchmark closely correlates with your application, use it, but be sure to practice due diligence and verify the details of each benchmark test. Does the test setup aim to demonstrate calculation performance, prove sustained I/O efficiency, or show low power consumption? And is the bias in the setup relevant to your situation? Are the datapaths and memory structures realistic for your application? Benchmarks can characterize processor performance, but they currently do not capture project and business requirements. For this reason, benchmarks should serve only as an input into your processor-selection analysis.

Be sure to understand whether the benchmarks you use appropriately reflect processor features and compiler efficiency. Compiler efficiency is irrelevant to applications that rely heavily on handcrafted, assembly-language routines, such as those in the inner loops of high-performance signal-processing algorithms. Likewise, processor-feature performance is irrelevant if you are using C code and the compiler you use does not take advantage of the processor features you want to use.

When standard benchmarks do not closely enough approximate your application requirements, you can still gain value from benchmarks by applying ad-hoc and custom benchmarks with the assistance of the processor vendor or a benchmark service provider. When you are using benchmark data for analysis, always ask the processor vendor for the latest benchmark scores and full disclosure of how it generated those scores. This data may not be public, but full disclosure is critical to your understanding what the scores mean, how to duplicate them, and their relevance to your situation. You may decide to try to duplicate those results on the vendor's development tools, such as on an evaluation board, or, in some cases, you may give the processor vendor your own benchmark code to test and report back to you. In any case, the vendor should be able to discuss the results of all of the testing with specific comments relevant to your situation.

It takes time and resources for you to analyze and understand the benchmarks; however, trying to duplicate a vendor's benchmarks can uncover some valuable insights. Processor vendors are likely to make benchmarking efforts and share those results only if the investment is likely to pay off in sales. The more the processor vendor understands your application, the better the benchmark support it can provide. A vendor's benchmarks can highlight and provide you insight into tool and processor features that standard benchmarks cannot capture. You may discover, with assistance from the vendor, compiler optimizations and specifics of the processor-architecture features that challenge your implementation assumptions for better or for worse. The vendor should be able to help you map your code to the target device and provide coding examples on how to exploit the special features. Some vendors admit that a customer's benchmark-duplication effort highlighted key features that the customer did not fully understand, and the improved insight was critical to the customer's analysis. It is more beneficial to make these discoveries during the selection process than after you complete an implementation.

It's about saving time

The primary purposes of benchmarks are to save you time and money and lower your overall risk. If you did not care about time and money, you could develop your application on multiple systems and then compare the end results before making your final decisions. In some situations, this approach may be appropriate, but for most situations, it is an impossible luxury. When you apply them appropriately, benchmarks should help you predict your success with each candidate without incurring the cost of a full design implementation, but only if the benchmarks correlate with your application workloads.

Benchmark organizations are modifying their benchmark suites to better reflect changes in the market and to minimize the risk of vendors' gaming stable benchmark test suites. Plans for future benchmarks are struggling with how to better reflect the changing workloads of today's and tomorrow's applications, including how to incorporate power consumption into the scoring. Power consumption depends on many system issues that make it difficult for designers to capture benchmark-like scoring in a standard and meaningful way. Another challenging area comes from the trend of applications converging in a single product or device; it forces future benchmarks to consider how to capture workloads that cross application boundaries.

If you can afford it, you should invest the time it takes to analyze the full disclosure of the system configuration and "reverse-engineer" a vendor's benchmark score; the insights you gain during this process can prove invaluable. You should be able to discuss the details of all benchmark tests with the vendor. The contrived circumstances of the test environment may invalidate the usefulness of the scores but may highlight a useful feature that you would otherwise have missed.

Ultimately, benchmarks can help you select a processor based on performance. But alone, benchmarks are insufficient to help you choose between two comparable processor systems. Going through the benchmarks in detail with the vendor may give you a qualitative sense of how well the vendor and accompanying development-support infrastructure will meet your needs. Once you have adequate processor candidates, factors such as a vendor's technical responsiveness, the ease with which you can port your code, the amount of software porting you need to perform, the quality of the documentation and training, and the likelihood that the processor road-map will cover your future needs will take on a larger priority in your final decision.


For more information...
For more information on products such as those discussed in this article, contact any of the following manufacturers directly, and please let them know you read about their products in EDN.
Adelante Technologies
+31-40-2353-100
www.adelantetech.com
AMD
1-408-749-4000
www.amd.com
Analog Devices
1-800-262-5643
www.analog.com
ARC
1-408-437-3400
www.arc.com
ARM
1-408-579-2200
www.arm.com
BDTI (Berkeley Design Technology Inc)
1-510-665-1600
www.bdti.com
EEMBC
(EDN Embedded Microprocessor Benchmark Consortium)
1-530-672-9113
www.eembc.org
Green Hills Software
1-805-965-6044
www.ghs.com
LSI Logic
1-866-574-5741
www.lsilogic.com
Motorola
1-512-895-2000
www.motorola.com
NEC
1-408-588-6000
www.necelam.com
QuickLogic
1-408-990-4000
www.quicklogic.com
Renesas
1-408-382-7500
www.renesas.com
SPEC (Standard Performance Evaluation Corp)
1-540-349-7878
www.spec.org
STMicroelectronics
1-718-861-2650
www.st.com
Texas Instruments
1-800-336-5236
www.ti.com
  


Author Information
Technical Editor Robert Cravotta will be at the Microprocessor Forum in San Jose on October 14 and 15. You can reach him at 1-661-296-5096 and via e-mail at rcravotta@edn.com.

 

Gaming the benchmarks

Simple, standard benchmark scores can be valuable marketing tools for processor vendors, especially if they can boast better scores than their competitors. One reason processor vendors use benchmarks such as MIPS (millions of instructions per second) and DMIPS (Dhrystone MIPS) is that they are relatively easy and inexpensive to derive. As a designer, unless you practice due diligence to completely understand how a vendor derives a benchmark score, you can fall victim to using meaningless information in your trade study. The benchmark scores themselves are not necessarily inaccurate; rather, the conditions under which they are true may be irrelevant to your application-design needs. It is incumbent on you to ensure the scores you use to make decisions are relevant to your situation.

Comparing MIPS-performance scores better distinguishes between single-cycle and multi-cycle instruction-processor architectures than comparing system-clock rates. A processor's MIPS score is an indication of the amount of work it can perform; however, the amount of work each instruction accomplishes is not standard. Generically reporting the MIPS scores for superscalar, multi-instruction-issue systems further dilutes the value of comparing MIPS scores. The reported MIPS score is a synthetic number—a theoretical maximum that assumes that you can keep every datapath and execution unit active all of the time. This theoretical maximum-performance level can be an unreasonable assumption for the general case. The limiting performance factor may not be the processor architecture or the programmer but rather the inability to make your algorithms parallel enough to keep all of the execution units active.
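To make the gap between a reported peak figure and achievable throughput concrete, the sketch below computes peak MIPS as clock rate times issue width and then scales it by the fraction of issue slots that the code actually fills. This is a simplified model, not any vendor's reporting method, and the clock rate, issue width, and utilization values are invented.

```python
def peak_mips(clock_mhz, issue_width):
    """Theoretical maximum: every issue slot filled on every cycle."""
    return clock_mhz * issue_width

def sustained_mips(clock_mhz, issue_width, slot_utilization):
    """Scale the peak by the fraction of issue slots the code really fills (0.0 to 1.0)."""
    if not 0.0 <= slot_utilization <= 1.0:
        raise ValueError("slot_utilization must be between 0 and 1")
    return peak_mips(clock_mhz, issue_width) * slot_utilization

# Hypothetical four-issue processor at 200 MHz whose code fills 35% of the slots.
peak = peak_mips(200, 4)
sustained = sustained_mips(200, 4, 0.35)

print(f"peak: {peak} MIPS, sustained: {sustained:.0f} MIPS")
```

Under these assumptions, the algorithm's limited parallelism keeps the delivered figure far below the number on the data sheet.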

DMIPS, another commonly reported benchmark score for processors, indicates the amount of functional work a processor could deliver. It differs from a MIPS score in that it does not rely on a theoretical maximum; instead, you derive the Dhrystone benchmark score by running a standard, "representative" program on the target processor and measuring how much work it completed. You can obtain good Dhrystone scores on devices if they have large enough caches and properly optimized architectures, because the benchmark does not exercise these features.
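By common convention, a DMIPS figure is the measured number of Dhrystone iterations per second divided by 1757, the score usually cited for the VAX 11/780 reference machine. The sketch below applies that conversion; the measured rate and clock frequency are invented, and the function names are hypothetical.

```python
# Commonly cited Dhrystones-per-second score of the VAX 11/780 reference machine.
VAX_11_780_DHRYSTONES_PER_S = 1757

def dmips(dhrystones_per_second):
    """Convert a measured Dhrystone rate into DMIPS."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_S

def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Normalize DMIPS by clock rate so devices at different clocks can be compared."""
    return dmips(dhrystones_per_second) / clock_mhz

# Hypothetical measurement: 351,400 Dhrystones per second at a 160-MHz clock.
score = dmips(351_400)
normalized = dmips_per_mhz(351_400, 160)

print(f"{score:.1f} DMIPS, {normalized:.2f} DMIPS/MHz")
```

The result still depends on the compiler, its settings, and the memory configuration used for the run, so the number means little without the accompanying disclosure.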

Reinhold P Weicker, PhD, of Siemens AG, created the Dhrystone benchmark in 1984; the current version of the benchmark, Dhrystone 2.1, was born in 1988. The world of computing has changed significantly since then, but the benchmark has not evolved to account for these changes. Modern processor offerings can include integrated floating-point units in the main processors, superscalar or multiple-execution-unit architectures, VLIW (very-long-instruction-word) architectures, large on-chip memory subsystems, branch prediction, and speculative execution. The types of applications today's processors support have also expanded since 1988 and include multimedia and communications-intensive applications.

These types of large architectural and application changes are significant and can invalidate the relevance of a benchmark, especially when the benchmark is not evolving to accommodate the changes. When benchmarks don't evolve, you risk an environment that invites gaming the benchmarks. Such gaming can take the form of dedicated, no-to-low-value, benchmark-specific functional units integrated into a device for the sole purpose of improving the benchmark performance, or compiler enhancements that can recognize the benchmark code and substitute optimized benchmark-specific code.

As with any benchmark, full disclosure of the testing configuration is critical to understanding what the scores really tell you about the processor. You should consider a disclosure full only if it addresses all of the relevant components, such as the processor architecture, memory subsystem, peripherals, benchmark specification, and settings for the compiler and tools. Benchmark organizations such as BDTI (Berkeley Design Technology Inc), EEMBC (EDN Embedded Microprocessor Benchmark Consortium), and SPEC (Standard Performance Evaluation Corp) regularly modify their benchmarks to reflect changes in processor architectures and how designers are using those processors in end applications. These benchmark organizations also require access to a full disclosure of the testing configuration as an integral requisite for publicly releasing a benchmark score.

The efforts of these benchmark organizations do not eliminate your analysis and validation efforts when using benchmark scores for comparison, but they do help bring consistency to the available data that could be prohibitively expensive to obtain on a project-by-project basis. The onus is still on you, as a design-decision maker, to practice due diligence with the available data, so you can perform a useful analysis and avoid inappropriate interpretations of the benchmark data.

 

Public or private disclosure?

Performing comprehensive benchmark testing is resource- and time-intensive. Why then would a company incur the cost of benchmarking a processor and then refrain from publicly disclosing the results? Based on how much a vendor's marketing effort emphasizes comparing its good benchmark scores with other "less capable" offerings, it may seem reasonable for you to assume that, when a company does not publish a benchmark for a given processor, the device did not score well and is a poor candidate for your consideration.

If you make this kind of assumption without following up with the processor vendor, you are making a mistake—similar to taking all benchmark scores at face value and basing your processor choice solely on them. Just because a vendor does not publicly disclose benchmarks for a given processor doesn't mean the vendor doesn't have them or is unwilling to disclose them to you under an NDA (nondisclosure agreement). So why wouldn't a processor vendor publish benchmarks unless they were not good, and why would it disclose those benchmarks under an NDA?

Standard benchmark suites approximate the workload and stresses an application places on a processor and its supporting infrastructure. Because they use approximations, benchmark suites may not adequately cover and correlate with the performance drivers for a target application. In this case, the added complexity of trying to publicly expose how the benchmark test suite does and does not map to a target application can increase the opportunity for misinterpretation and out-of-context competitive misuse of the scoring. It is less complicated for a vendor to privately engage a potential customer with the details of such benchmarks in the context of the proposed application. But any benchmark discussion, private or public, should include full disclosure of the testing environment, so you can assess the accuracy and relevance of the data to your situation.

 



©1997-2008 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.