EDN Access


May 21, 1998


When Dhrystone leaves you high and dry

Daniel Mann, AMD, and Paul Cobb, QED

A worthwhile evaluation of processor performance requires more insight into the underlying issues than a Dhrystone benchmark test or any other single program can provide. The results of three processors running Dhrystone and a larger piece of embedded code reveal just how misleading Dhrystone results can be.

The sheer range of processors vying for a share of the embedded-system market is overwhelming, and choosing between the µPs can be difficult and time-consuming. The resulting and understandable demand by engineers to analyze actual performance numbers led to the idea of µP benchmarks, such as Dhrystone--small, portable programs that provide a convenient numeric output. This output usually reveals how long a µP takes to execute a code sequence, which you can also express as the number of times the code sequence executes per second.

Unfortunately, one program can't provide the performance insights embedded-system designers need. First, complex interactions between advanced architectural features, instruction sets, and sophisticated compilers ultimately determine the system-level performance of today's embedded processors. Second, there's no such thing as a "typical" embedded system, making it impossible for any one program--especially a small, portable program--to represent all such interactions.

Thus, it is dangerously misleading to project or compare system-level performance on the basis of small synthetic benchmarks unless you can show that they indeed represent the CPU behavior on larger programs of practical interest. Tests of processors running both the Dhrystone benchmark and a larger piece of embedded code show just how misleading benchmark results can be. This problem, as well as many other criticisms, is motivating the development of more dependable benchmark methods; the EDN Embedded Microprocessor Benchmark Consortium (EEMBC) hopes to better inform embedded-system designers about a processor's performance capability (see box "The EEMBC solution").

In 1984, the Dhrystone benchmark's authors wrote the initial version, 1.0, in Ada and have subsequently updated the benchmark and converted it to C (Reference 1). The most recent version, 2.1, is free via the Internet. The benchmark has no governing body or institution that reviews and approves results before publication.

Some benchmarks began life as useful programs to perform a practical computing task; their use as tools for performance measurement came later. In contrast, the Dhrystone program performs no directly useful action. Instead, it belongs to "synthetic" benchmarks--code with particular behavioral characteristics rather than programs that implement algorithms.

The authors of the Dhrystone benchmark characterized the behavior of various high-level-language programs in statement-type, operand-type, and operand-locality--global, local, parameter, or constant--characteristics. The authors then constructed the benchmark program itself to show similar distributions, according to these measures. For example, statement types contain assignment, control, and procedure/function-call statements.

The Dhrystone benchmark is readily portable, making it possible to obtain results for a processor through a relatively small investment of time and effort. This convenience is seductive and has led to widespread over-reliance on Dhrystone. In all too many cases, it's the most important, sometimes even the only, quantitative performance measurement that designers use to select the processor for a new embedded-system project.

Dhrystone is not without value, but many people in the industry frequently quote and use the benchmark's results in ways that don't withstand scrutiny. This misuse does a great disservice to all concerned, including the benchmark's authors.

Version 2.1 of Dhrystone contains 103 high-level statements within the main loop, which executes repeatedly during a benchmark run. The user chooses the number of iterations at runtime. If there are several iterations, the effect of the code before and after the main loop becomes negligible as a proportion of overall runtime. The benchmark does some self-checking to ensure that it compiles and runs as intended.

At the end of a benchmark run, Dhrystone prints the absolute time required per iteration; the number of iterations per second through the main loop; and the performance, measured in iterations per second, relative to a baseline machine. The baseline machine is Digital Equipment's VAX 11/780, which was in wide use when the authors created the benchmark.

This list of benchmark parameters indicates that Dhrystone requires the target system to have a timer and that Dhrystone measures only one performance parameter. The number of iterations per second is simply the reciprocal of the time per iteration, and you obtain the relative performance figure by simply scaling the same result.

Most processor vendors quote either the number of iterations per second through the main loop, or the VAX-relative performance in iterations per second, or both. Both of these parameters are easy to understand: The higher the value, the better. However, you should consider what these numbers really say.  

For any CPU, the Dhrystone rating follows from this simple relationship:

Dhrystone iterations per second = (CPU clock cycles per second) / (CPU clock cycles per Dhrystone iteration).

You may not be surprised that a slow memory system, for which each access incurs wait states at the CPU, constrains the Dhrystone rating. However, you may not know that the choice of compiler and the way you use it can also have a significant effect on the Dhrystone result, because the compiler also can affect the number of cycles each iteration requires.

For example, suppose you compile the benchmark and find that on the system you use to run the benchmark, each iteration takes 1000 machine-clock cycles. On a 10-MHz processor, each iteration therefore takes 100 µsec (1000×100 nsec=100 µsec), meaning that this platform can perform 10,000 Dhrystone iterations per second.
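The arithmetic in this example can be sketched in a few lines of Python. The function names are illustrative, and the 800-cycle figure for a better-optimizing compiler is a hypothetical value chosen only to show how the compiler alone can move the rating.

```python
# Sketch of the Dhrystone rating relationship described above.
# All names here are illustrative; they are not part of Dhrystone itself.

def dhrystones_per_second(clock_hz, cycles_per_iteration):
    """Iterations per second = clock cycles per second / cycles per iteration."""
    return clock_hz / cycles_per_iteration

def microseconds_per_iteration(clock_hz, cycles_per_iteration):
    """Time for one pass through the main loop, in microseconds."""
    return cycles_per_iteration * 1e6 / clock_hz

CLOCK_HZ = 10_000_000  # the 10-MHz processor from the example

# The case from the text: 1000 cycles per iteration.
print(dhrystones_per_second(CLOCK_HZ, 1000))
print(microseconds_per_iteration(CLOCK_HZ, 1000))

# Hypothetical: a different compiler setting trims the loop to 800 cycles.
# Same silicon, same clock, same memory, yet the rating rises by 25%.
print(dhrystones_per_second(CLOCK_HZ, 800))
```

The second call makes the point of the preceding paragraph concrete: the rating changes although nothing about the hardware has changed.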

Things Dhrystone can't tell you

The first major limitation of Dhrystone is its size: Because it is small, it's easy to work with. When the authors created Dhrystone, caches for instructions or data were a rarity in embedded-system design. Since then, caches have become commonplace, as CPU speed has increased more rapidly than that of commodity memory devices. Dhrystone's strong locality allows caching to significantly boost performance on the benchmark. Dhrystone's size means that even a small cache can contain most or all of the information that each iteration uses.

The code for a larger piece of firmware can exhibit less locality than Dhrystone; even if the code exhibits as much locality, the code may still require a larger cache to gain appreciable speed. The CPU you're considering for your next design may have a more impressive Dhrystone result than the one you use now, but are you sure the system-level performance gain will be so impressive? (See box "How cache efficiency affects system performance.")

The second major limitation of Dhrystone arises from its execution profile, which is the proportion of overall execution time it spends in each function. A program in which each function makes roughly the same contribution to the total has a "flat" profile. In contrast, a program for which just a few functions account for a significant proportion of overall execution time has a "sharp" profile.

On most embedded CPU architectures, Dhrystone's profile is sharp, and it spends as much as 30 to 40% of its execution time in just two functions: strcpy() (string copy) and strcmp() (string compare). The following code fragment, which comes from within the main loop of Dhrystone, is an example of the way the benchmark uses these functions:

strcpy (Str_2_Loc, "DHRYSTONE PROGRAM, 3'RD STRING");

The string in this case is fixed; no matter how many times you run the benchmark, the string uses the same sequence of characters. Thus, data caching significantly boosts performance on all but the first pass through the code. Furthermore, compilers for 16- or 32-bit machines are free to place the string in memory such that the first character is "address-aligned," meaning that the µP can read multiple characters in the first and subsequent operations. In these respects, Dhrystone performs rather specialized and more intense string handling than that found in many embedded-system workloads.

If your embedded firmware does little specialized string handling, projections of the firmware's performance based on the Dhrystone results could be inaccurate. You can also turn that logic around: If your application does process a lot of strings in a similar way, Dhrystone could turn out to be a useful guide to performance (see box "How would you like your benchmark results cooked, sir?").
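One way to put rough numbers on this argument is Amdahl's law, which the article does not name but which fits the reasoning: if a fraction f of execution time is spent in string handling and a processor or library speeds up only that part by a factor k, the overall speedup is 1/((1-f)+f/k). In the sketch below, 0.35 is the midpoint of the 30 to 40% range quoted above; the 5% firmware share and the fourfold string speedup are hypothetical values chosen for illustration.

```python
# Amdahl's-law sketch: how much an improvement confined to string handling
# helps Dhrystone versus firmware that does little string handling.
# The firmware fraction and the speedup factor are hypothetical.

def overall_speedup(fraction_accelerated, factor):
    """Overall speedup when only part of the run time gets faster."""
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / factor)

STRING_SPEEDUP = 4.0              # hypothetical gain on strcpy()/strcmp()
DHRYSTONE_STRING_FRACTION = 0.35  # midpoint of the article's 30 to 40%
FIRMWARE_STRING_FRACTION = 0.05   # hypothetical firmware profile

print(round(overall_speedup(DHRYSTONE_STRING_FRACTION, STRING_SPEEDUP), 2))
print(round(overall_speedup(FIRMWARE_STRING_FRACTION, STRING_SPEEDUP), 2))
```

Under these assumptions the benchmark improves by roughly a third, while the firmware improves by only a few percent.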

Various other criticisms of Dhrystone include the argument that its synthetic construction causes Dhrystone's characteristics to differ from those of code that implements practically useful computations--for example, that Dhrystone has an unusually low number of instructions per procedure call and is therefore oversensitive to the implementation efficiency of procedure calls and returns.

Similarly, Dhrystone's call sequences are nested only three or four levels deep. Thus, most register-window RISC machines would never spill or fill their register windows and thus would never need to save or restore registers from off-chip memory. Register windowing is a useful architectural technique, but its usefulness on the Dhrystone benchmark is higher than that found in a great deal of embedded-systems code.

All these lines of argument essentially proceed toward the same conclusion: It's possible--even likely--that the small, portable Dhrystone benchmark doesn't use the CPU in the same way as a large, complex piece of embedded software does.

Three processors and their Dhrystone results

As a practical demonstration of this point, take a look at the performance of Dhrystone and of a larger piece of embedded software on three processors. The processors are not necessarily comparable in price or performance; this study compares the benchmarks, not the processors. Two of the processors are RISC designs--the i960 µP (Intel Corp, www.intel.com) and the Am29205 µC (AMD, www.amd.com)--and the third is a derivative of the well-known x86 CISC family, the Am186EM µC (AMD) (see box "About the processors").

Comparing apples with oranges

The RISC processors ran the benchmarks with a clock speed of 16 MHz, which was the speed of both the system bus and the internal pipeline. The Am186EM µC was tested with a 10-MHz bus and a 40-MHz internal speed. Even so, memory bandwidth still sets the limit on performance; recall that most x86 instructions require multiple CPU cycles. If these speeds strike you as either very low or rather high, remember that there's no such thing as a typical embedded system. The purpose here is simply to illustrate some points about the practicalities of benchmarking.

The i960 benchmarks ran on Intel evaluation boards. Intel's two-pass compiler technology was exploited as far as possible, wherever applicable. Because of i960 architecture characteristics, a number of i960 family members benefit from two-pass technology. The Am29205 and Am186 µC benchmarks ran on AMD evaluation boards.

Memory access times for the evaluated systems are in the notation initial/subsequent. For example, 2/1 denotes a system that returns the first word of a new access sequence with a delay of two cycles and each subsequent word in the fetch sequence with a delay of one cycle. This memory system is sometimes referred to as "2-1-1-1." Similarly, a 4/3 DRAM has four-cycle initial access followed by three cycles for subsequent words within the same burst, sometimes written "4-3-3-3."
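The initial/subsequent notation translates directly into a cycle count for a burst. The following sketch uses hypothetical helper names and assumes a four-word burst, matching the "2-1-1-1" form quoted above.

```python
# Sketch of the initial/subsequent memory-timing notation.
# Helper names and the default four-word burst are assumptions.

def burst_cycles(initial, subsequent, words=4):
    """Total cycles to fetch `words` consecutive words in one burst."""
    if words <= 0:
        return 0
    return initial + subsequent * (words - 1)

def burst_pattern(initial, subsequent, words=4):
    """Render the timing in the dashed form, for example '2-1-1-1'."""
    return "-".join([str(initial)] + [str(subsequent)] * (words - 1))

print(burst_pattern(2, 1), burst_cycles(2, 1))  # 2-1-1-1 5
print(burst_pattern(4, 3), burst_cycles(4, 3))  # 4-3-3-3 13
```

For a four-word burst, the 4/3 system therefore needs 13 cycles where the 2/1 system needs five.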

The i960 SA platform uses relatively fast 2/1 DRAM. Results for the Am29205 processor were obtained on one system based on 3/2 DRAM controlled by the on-chip DRAM controller and one system based on 2/1 SRAM in the ROM controller's address region. The on-chip ROM controller permits 2/2 access only when performing memory writes. The Am186EM µC had SRAM attached to its on-chip memory controller. The 70-nsec SRAM is inexpensive because of its 10-MHz operation. Slow SRAM devices are inexpensive but generally smaller in capacity than DRAM. Their size is typically not a problem for many embedded systems.

The benchmark tests use the latest versions of the best available compilers, with optimization levels set to produce the highest code-execution speeds. These settings allow a user to trade off between performance and code size.

The worst benchmark abuse

One of the most frequent abuses of benchmarking is presenting the results for different platforms as if the platforms were actually similar. For example, a CPU vendor might benchmark its own processor on a board with plenty of fast SRAM and then benchmark competing products with slow DRAM. The vendor might then present the results as if all the processors are running with memory systems of equivalent price and performance. The writers of such reports usually carefully avoid any outright lies; instead, they omit the key information that would make the unfairness of the comparison obvious, thus denying any intention to mislead.

This dishonesty is precisely what gives benchmarks a bad name. Using fast components in a benchmarking platform is fair, but the only honest way to present benchmark results is to disclose full details of all parts of the system--enough to allow an independent verification. Then, users can draw their own conclusions and decide whether they can afford to implement their own production systems in the same way.

Dhrystone cache hit rates are atypically high, and Dhrystone spends much execution time in rather idealized string-handling activities. Performance results for a larger piece of embedded code more fully illustrate the limitations of Dhrystone performance. This piece of embedded code is an implementation of the Link Access Protocol used for the D-channel (LAPD) of an Integrated Services Digital Network connection. The protocol deals with the mechanics of setting up and tearing down a call connection, rather than with bulk data transfer.

This LAPD code is based on the AmLink software from AMD. In the form used for performance measurement, the LAPD code comprises three sections: sending an information packet and receiving an unnumbered acknowledge, receiving an information packet and responding with an unnumbered acknowledge, and sending an information packet and receiving an information packet.

Results consist of geometric-mean values of the three sections' packet-switching speeds. Multiplying all N values and then taking the Nth-root of that aggregate product produces the geometric mean. When combining the results of several benchmarks, the geometric mean is preferable over the arithmetic mean because the variance among the values to be averaged has less effect on the geometric mean (see box "Use of the geometric mean").

Comparing Dhrystone performance

In the context of the overall embedded CPU market, the three devices for these tests fall within the middle ground--clearly above the lowest cost 4- and 8-bit components and certainly below high-end 32- or 64-bit devices. The i960 family spans a wider performance range than the 29K; the S-series i960 is more a scaled-down version of its immediate relatives than is the Am29205.

With Version 2.1 of Dhrystone, both RISC processors outperform the Am186EM CISC processor. However, the Am29205, when it uses a slow 3/2 memory, fails to outperform the other two processors (Figure 1a). Also, the i960 SA's advantage is slight. The i960 SA benefits from using Dhrystone because of its on-chip cache, which exploits the benchmark's size and strong locality.

The Am29205's result gains a different advantage. You can use the processor's Cpbyte instruction to compare four pairs of characters at once, holding them in a pair of 32-bit operands. String-comparison operations account for a significant proportion of Dhrystone's overall activity.

Running the LAPD benchmark shows that the i960 SA system's performance is relatively weak (Figure 1b). The LAPD benchmark is considerably larger than Dhrystone, so the small cache offers little benefit. The Am29205 µC still performs relatively well, most likely because of the large on-chip register file. For the LAPD code's intensive data-handling performance, this large register file minimizes the frequency of slower accesses to off-chip memory. The Am186EM system shows the highest result of the three.

Contemplate the outcome

The LAPD benchmark results may come as a surprise, especially to those who expected the relative performance to remain broadly the same as that for Dhrystone. The Dhrystone results suggest that, unless you use a slow memory system with the Am29205, this µC has a significant advantage over the Am186EM system. However, the LAPD result tells exactly the opposite story (Figure 2). Similarly, the Dhrystone results suggest that the i960 system has a slight advantage over the Am186 system, but the LAPD results show a decisive advantage the other way (Figure 3).

It's important to understand the conclusions that you can draw from these results. First, you should not use the LAPD benchmark--or any other single program--instead of Dhrystone as the only measure of processor performance. On the contrary, the results from LAPD could be just as misleading as Dhrystone and for the same reason: The characteristics of LAPD in the way it exercises the CPU can differ widely from those of the software in your embedded system. Second, this article makes no claims on behalf of any vendor or architecture and doesn't want to launch yet another campaign in the RISC-versus-CISC war.

The point is simply that you can't project the performance of high-performance embedded systems simply by considering the execution time of one small, simple benchmark program. Many other aspects of CPU performance--multitasking support characteristics, such as interrupt latency or context switching--have great importance for embedded-systems design. High-end embedded processors as fast as 200 MHz are already available. Many of these processors feature large and sophisticated on-chip cache hardware, long pipelines, and high-speed floating-point support, yet are still priced for use in mass-market embedded and consumer-electronics applications. The time it takes for a repeated execution of a few hundred artificially selected instructions on a machine of this complexity tells you little about the device's ability to control, say, a 1-Gbit router or a high-resolution laser printer.

Although engineers frequently blame the widespread abuse of Dhrystone results on wicked marketing departments or irresponsible journalists, engineers must take their share of the responsibility. Processor vendors should work to create or select more useful benchmarks and ensure that they clearly and honestly report the results. Engineers who design embedded processors into systems also have a crucial role to play. Don't simply take benchmarks on faith; you should expect and demand better support from processor vendors.


Reference
  1. Communications of the ACM, Volume 27, No. 10, October 1984.


The EEMBC solution

Early in 1997, EDN helped to form the EDN Embedded Microprocessor Benchmark Consortium. The goal of the consortium is to establish architecture-independent standards for benchmarking the key performance characteristics of popular embedded applications.

Many embedded-processor manufacturers attended the EEMBC kickoff meeting. It is encouraging that all 20 participating companies share the same goal of developing reliable application-specific benchmarks. Vendors formed subcommittees to address automotive/industrial, telecomm, networking, consumer-electronics, and office-automation applications.

Rather than depend on the result of just one small benchmark, each subcommittee decided to prepare a suite of programs, each providing insight into processor performance on a relevant aspect of the application under consideration. By considering the results for such a suite, system designers can gain a true understanding of the performance available in future embedded systems.

How cache efficiency affects system performance

Suppose you run Dhrystone on a processor and find that the µP executes some number of iterations in P cycles with a cache hit rate nearing 100%. Now, suppose you lift a code sequence of similar length from your application firmware and run this code on the same µP. You would probably expect a similar execution time for this code.

To your dismay, you find that the cache hit rate becomes only 80%. In the target system, each cache miss costs a penalty of 11 processor cycles while the system waits for the cache line to refill from slow memory; 11 cycles for a 50-MHz CPU is only 220 nsec. Execution time increases from P cycles for Dhrystone to (0.8×P)+(0.2×P×11)=3P. In other words, dropping the cache hit rate to 80% cuts overall performance to just 33% of the level you expected if you had based your projection purely on the Dhrystone result.
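The calculation in this box can be reproduced with a short sketch. It follows the box's own simplified model, in which a hit costs one unit of time and a miss costs 11 units in total; the function names are illustrative.

```python
# Sketch of the cache-efficiency arithmetic in this box.
# Model (taken from the box): time = hit_rate * P + miss_rate * P * miss_cost.

def relative_execution_time(hit_rate, miss_cost):
    """Execution time as a multiple of the all-hits time P."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * 1.0 + miss_rate * miss_cost

def miss_penalty_ns(penalty_cycles, clock_hz):
    """Duration of a miss penalty in nanoseconds."""
    return penalty_cycles * 1e9 / clock_hz

slowdown = relative_execution_time(hit_rate=0.8, miss_cost=11)
print(round(slowdown, 2))              # 3.0, so execution takes 3P
print(round(100.0 / slowdown))         # 33, percent of projected performance
print(miss_penalty_ns(11, 50_000_000)) # 220.0 nsec at 50 MHz
```

Trying other hit rates in the same function shows how steep the curve is: even a 95% hit rate gives 1.5P under this model.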

About the processors

The i960 SA µP

The K and the S series of processors are 32-bit members of Intel's i960 family, which the designers intended for use in embedded applications. Both the K and S series emphasize low cost over sophistication. They lack on-chip floating-point hardware, memory-management units, and data caches but do carry a direct-mapped 512-byte instruction cache. All operation codes are 32 bits, so the cache can hold as many as 128 instructions. This cache is smaller than even the Dhrystone benchmark; it's unlikely that the cache can contain all instructions for one iteration. The K series has a 32-bit external data bus, and the S series has a 16-bit bus, trading off memory-access efficiency for a lower cost package. The on-chip instruction cache may mitigate the effect of this trade-off.

Whenever a cache miss occurs, the i960K/S initiates a block-refill operation, which always fills a complete cache line as an address-aligned block of four instructions, accessed in ascending address order starting from the beginning of the block. Again, this trades off performance for less die area. The CPU must remain idle until the critical word returns, which in the worst case is the last word in the block. On the other hand, the cache-controller logic is simple, and the cache needs only 1 valid bit per cache line.
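A toy simulation can show why such a small direct-mapped cache helps a compact loop far more than a larger one. The cache geometry below (512 bytes, four-instruction lines, 32-bit operation codes) comes from this box; everything else is a simplifying assumption: the loop is straight-line code starting at address 0, there are no branches inside it, and the loop lengths are hypothetical.

```python
# Toy model of a direct-mapped instruction cache with the geometry
# described in this box. Not a model of the real i960 pipeline.

def loop_hit_rate(loop_instructions, passes,
                  cache_bytes=512, line_bytes=16, instr_bytes=4):
    """Hit rate when a straight-line loop is fetched `passes` times."""
    num_lines = cache_bytes // line_bytes   # 32 lines of 4 instructions
    tags = [None] * num_lines               # which memory block each line holds
    hits = misses = 0
    for _ in range(passes):
        for i in range(loop_instructions):
            block = (i * instr_bytes) // line_bytes
            index = block % num_lines       # direct-mapped placement
            if tags[index] == block:
                hits += 1
            else:
                misses += 1                 # a miss refills the whole line
                tags[index] = block
    return hits / (hits + misses)

# A 128-instruction loop fits exactly: only the first pass misses.
print(loop_hit_rate(128, passes=10))
# A 256-instruction loop evicts its own lines on every pass.
print(loop_hit_rate(256, passes=10))
```

In this model the larger loop never gets past a 75% hit rate however many passes it runs, because each line is overwritten before the loop comes back to it.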

The i960K/S resolves branches at the decoding stage without using the instruction-execution unit, so another nonbranch instruction can execute during the same clock cycle. The µP can't take a conditional branch until the third cycle after the instruction that sets the condition, because new condition-code information doesn't immediately become available at the decoding stage. However, the effective penalty reduces to two cycles if at least one intermediate instruction exists. The µP can handle an untaken branch only two cycles after the condition is set. Certain instructions require multiple cycles at the instruction-execution unit; during these cycles, the µP can resolve branches without any effective penalty, provided that the condition-code information became available far enough in advance. Unlike many other RISC designs, the i960 K/S series processors implement no delayed branching but rely on the compiler to schedule instructions to minimize the described penalties.

The Am29205 microcontroller

AMD's Am29205 microcontroller is a member of the 29K™ family of 32-bit RISC CPUs. The device includes a nonmultiplexed 24-bit address bus and a 16-bit data bus. All operation codes are 32 bits, so each instruction requires a pair of 16-bit memory transactions; there is no on-chip cache. As in the case of the i960 S series, this µC trades off memory access for a lower cost package. The device includes a large on-chip register file. The instruction set also takes advantage of the internal 32-bit datapath and allows for handling string data in four-character chunks for purposes such as comparison.

The Am29205 includes onboard controllers for common memory devices, so the µC can connect directly to ROMs, SRAMs, and DRAMs without additional logic devices. Thus, the Am29205 itself, rather than any external logic, determines memory-controller performance.

The µC supports a 32-bit address space, which divides into a number of regions, each dedicated to a class of external device. You can subdivide each region into banks with independently configurable access characteristics. On-chip registers control the address range, data width, and access-time parameters for each bank. Whenever the CPU requests a memory transaction, the µC decodes the address to determine which region is involved; the decoder activates the associated controller, which performs the requested transaction using the settings programmed for the address region involved.

The two on-chip controllers of primary interest support ROM and DRAM, respectively. The first allows direct attachment of many common external devices, including not only EPROMs, but also SRAMs and UARTs. The second controller supports direct connection of DRAM devices. By default, each access occurs in four CPU clock cycles, but the controller can exploit fast-page-mode devices to perform burst accesses in just two cycles, as long as the addresses remain consecutive. The DRAM is often referred to as "3/2" rather than "4/2." (This notation refers to the number of memory-access cycles in initial/subsequent operations.) The four cycles comprise one cycle of precharge and three cycles of latency, although certain conditions hide the one-cycle precharge.

The Am186EM microcontroller

The AMD Am186EM is also a microcontroller but belongs to the E86™ family of embedded products using the x86 CISC architecture. This µC lacks on-chip caches, floating-point support, and memory-management hardware. True to the CISC heritage, the device uses variable-length operation codes, and many instruction types require multiple clock cycles for execution.

The 16-bit architecture of the Am186™ µC has been around for a long time. You might reasonably suppose that competing products gain some advantage from the considerable body of recent architectural research upon which they can draw. On the other hand, longevity presents some advantages; the Am186™ benefits from successive optimizations and die-size reductions over the years.

How would you like your benchmark results cooked, sir?

Dhrystone usually compiles using the string functions that are part of a runtime function library; the library itself is typically part of the compiler package. The vendor usually optimizes the library functions to strike a balance between code size and execution speed.

However, you can use specially coded string-handling functions in place of the regular library routines, which lead to a more favorable Dhrystone performance rating. Such practices aren't necessarily dishonest as long as the vendor discloses details.

This view may seem initially surprising, but think about it: If you find that your system spends 30% of its execution time in one function, would you want an optimized version of the function, or would you still insist on using the slower implementation from your compiler vendor's runtime library?

Use of the geometric mean

Suppose initially that you wish to average 5.2, 6.3, and 4.7. The arithmetic mean is (5.2+6.3+4.7)/3=5.4. The geometric mean is (5.2×6.3×4.7)^(1/3)=153.97^(1/3)=5.36.

But now suppose you employ some dubious optimization on one of the tests, so that the third result comes out higher, say 19.7. The arithmetic mean is now (5.2+6.3+19.7)/3=10.4. However, the geometric mean becomes (5.2×6.3×19.7)^(1/3)=645.37^(1/3)=8.64. Using the arithmetic mean, you could claim to have almost doubled performance, from 5.4 to 10.4, but what you really did was drastically improve the result for just one test and leave the rest unchanged.

Thus, if you want the final figure to reflect overall improvement, the geometric mean is a better measure; it's less sensitive to changes in just one component of the results. This fact makes the geometric mean useful for combining the results of several benchmarks. To most users, dramatic improvement in just one test is less interesting than a reasonable improvement across the board.
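The numbers in this box are easy to check. The following sketch uses the three results quoted above and two small helper functions whose names are illustrative; `math.prod` requires Python 3.8 or later.

```python
# Check of the arithmetic-mean and geometric-mean figures in this box.
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Nth root of the product of N values."""
    return math.prod(values) ** (1.0 / len(values))

before = [5.2, 6.3, 4.7]
after = [5.2, 6.3, 19.7]   # third result inflated by a dubious optimization

print(round(arithmetic_mean(before), 2), round(arithmetic_mean(after), 2))
print(round(geometric_mean(before), 2), round(geometric_mean(after), 2))

# Apparent improvement under each kind of mean.
print(round(arithmetic_mean(after) / arithmetic_mean(before), 2))
print(round(geometric_mean(after) / geometric_mean(before), 2))
```

With these inputs the arithmetic mean rises by a factor of about 1.93 and the geometric mean by about 1.61, so the single inflated result still moves the geometric mean, only less.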


Authors' biographies

Daniel Mann is a senior member of the technical staff in AMD's Embedded Processor Division (www.amd.com). You can reach him at daniel.mann@amd.com. 

Paul Cobb is senior performance-analysis engineer at QED Inc (www.qedinc.com); he is QED's representative on the EEMBC board and chairman of the EEMBC networking subcommittee. You can reach him at paulc@qedinc.com.  




Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.