May 21, 1998
When Dhrystone leaves you high and dry
Daniel Mann, AMD, and
Paul Cobb, QED
A worthwhile evaluation of processor performance requires more insight
into the underlying issues than a Dhrystone benchmark test or any other single program can
provide. The results of three processors running Dhrystone and a larger piece of embedded
code reveal just how misleading Dhrystone results can be.
The sheer range of processors vying for a share of the
embedded-system market is overwhelming, and choosing between the µPs can be difficult and
time-consuming. The resulting and understandable demand by engineers to analyze actual
performance numbers led to the idea of µP benchmarks, such as Dhrystone--small, portable
programs that provide a convenient numeric output. This output usually reveals how long a
µP takes to execute a code sequence, which you can also express as the number of times
the code sequence executes per second.
Unfortunately, one program can't provide the performance
insights embedded-system designers need. First, complex interactions between advanced
architectural features, instruction sets, and sophisticated compilers ultimately determine
the system-level performance of today's embedded processors. Second, there's no such thing
as a "typical" embedded system, making it impossible for any one
program--especially a small, portable program--to represent all such interactions.
Thus, it is dangerously misleading to project or compare
system-level performance on the basis of small synthetic benchmarks unless you can show
that they indeed represent the CPU behavior on larger programs of practical interest.
Tests of processors running both the Dhrystone benchmark and a larger piece of embedded
code show just how misleading benchmark results can be. This problem, as well as many
other criticisms, is motivating the development of more dependable benchmark methods; the EDN Embedded Microprocessor Benchmark Consortium (EEMBC) hopes to better inform embedded-system designers
about a processor's performance capability (see box "The
EEMBC solution").
In 1984, the Dhrystone benchmark's authors wrote the
initial version, 1.0, in Ada and have subsequently updated the benchmark and converted it
to C (Reference 1). The most recent version,
2.1, is free via the Internet. The benchmark has no governing body or institution that
reviews and approves results before publication.
Some benchmarks began life as useful programs to perform a
practical computing task; their use as tools for performance measurement came later. In
contrast, the Dhrystone program performs no directly useful action. Instead, it belongs to the class of
"synthetic" benchmarks--code with particular behavioral characteristics rather
than programs that implement algorithms.
The authors of the Dhrystone benchmark characterized the
behavior of various high-level-language programs in statement-type, operand-type, and
operand-locality--global, local, parameter, or constant--characteristics. The authors then
constructed the benchmark program itself to show similar distributions, according to these
measures. For example, statement types contain assignment, control, and
procedure/function-call statements.
The Dhrystone benchmark is readily portable, making it
possible to obtain results for a processor through a relatively small investment of time
and effort. This convenience is seductive and has led to widespread over-reliance on
Dhrystone. In all too many cases, it's the most important, sometimes even the only,
quantitative performance measurement that designers use to select the processor for a new
embedded-system project.
Dhrystone is not without value, but many people in the
industry frequently quote and use the benchmark's results in ways that don't withstand
scrutiny. This misuse does a great disservice to all concerned, including the benchmark's
authors.
Version 2.1 of Dhrystone contains 103 high-level statements
within the main loop, which executes repeatedly during a benchmark run. The user chooses
the number of iterations at runtime. If the number of iterations is large, the effect of the
code before and after the main loop becomes negligible as a proportion of overall runtime.
The benchmark does some self-checking to ensure that it compiles and runs as intended.
At the end of a benchmark run, Dhrystone prints the
absolute time required per iteration; the number of iterations per second through the main
loop; and the performance, measured in iterations per second, relative to a baseline
machine. The baseline machine is Digital Equipment's VAX 11/780, which was in wide use
when the authors created the benchmark.
This list of benchmark parameters indicates that Dhrystone
requires the target system to have a timer and that Dhrystone measures only one
performance parameter. The number of iterations per second is simply the reciprocal of the
time per iteration, and you obtain the relative performance figure by simply scaling the
same result.
Most processor vendors quote either the number of
iterations per second through the main loop, or the VAX-relative performance in iterations
per second, or both. Both of these parameters are easy to understand: The higher the
value, the better. However, you should consider what these numbers really say.
For any CPU, the Dhrystone rating follows from this simple
relationship:
Dhrystone iterations per second = (CPU clock cycles per
second) / (CPU clock cycles per Dhrystone iteration).
You may not be surprised that a slow memory system, for
which each access incurs wait states at the CPU, constrains the Dhrystone rating. However,
you may not know that the choice of compiler and the way you use it can also have a
significant effect on the Dhrystone result, because the compiler also can affect the
number of cycles each iteration requires.
For example, suppose you compile the benchmark and find
that on the system you use to run the benchmark, each iteration takes 1000 machine-clock
cycles. On a 10-MHz processor, each iteration therefore takes 100 µsec (1000×100
nsec=100 µsec), meaning that this platform can perform 10,000 Dhrystone iterations per
second.
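The arithmetic above is simple enough to capture in a few lines of C. The following is a minimal sketch, not part of the Dhrystone source. The function names are illustrative, and the VAX 11/780 baseline of 1757 iterations per second is the commonly quoted figure rather than a value stated in this article.

```c
#include <assert.h>

/* Commonly quoted VAX 11/780 baseline; an assumption, not from this article. */
#define VAX_11_780_ITERATIONS_PER_SEC 1757.0

/* Iterations per second = clock cycles per second / clock cycles per iteration. */
double iterations_per_second(double clock_hz, double cycles_per_iteration)
{
    assert(clock_hz > 0.0 && cycles_per_iteration > 0.0);
    return clock_hz / cycles_per_iteration;
}

/* Time per iteration in microseconds: the reciprocal of the rate, scaled. */
double microseconds_per_iteration(double clock_hz, double cycles_per_iteration)
{
    return 1.0e6 / iterations_per_second(clock_hz, cycles_per_iteration);
}

/* VAX-relative figure: the same rate divided by the baseline machine's rate. */
double vax_relative(double clock_hz, double cycles_per_iteration)
{
    return iterations_per_second(clock_hz, cycles_per_iteration)
           / VAX_11_780_ITERATIONS_PER_SEC;
}
```

With the values from the example, iterations_per_second(10.0e6, 1000.0) gives 10,000 and microseconds_per_iteration(10.0e6, 1000.0) gives 100. If a different compiler setting cut each iteration to, say, 800 cycles, the rating would rise to 12,500 with no change to the hardware.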
Things Dhrystone can't tell you
The first major limitation of Dhrystone is its size.
Because it is small, it's easy to work with, but it also fits easily into a cache. When the authors created Dhrystone, caches
for instructions or data were a rarity in embedded-system design. Since then, caches have
become commonplace, as CPU speed has increased more rapidly than that of commodity memory
devices. Dhrystone's strong locality allows caching to significantly boost performance on
the benchmark. Dhrystone's size means that even a small cache can contain most or all of
the information that each iteration uses.
The code for a larger piece of firmware can exhibit less
locality than Dhrystone; even if the code exhibits as much locality, the code may still
require a larger cache to gain appreciable speed. The CPU you're considering for your next
design may have a more impressive Dhrystone result than the one you use now, but are you
sure the system-level performance gain will be so impressive? (See box "How cache efficiency affects system performance.")
The second major limitation of Dhrystone arises from its
execution profile, which is the proportion of overall execution time it spends in each
function. A program in which each function makes roughly the same contribution to the
total has a "flat" profile. In contrast, a program for which just a few
functions account for a significant proportion of overall execution time has a
"sharp" profile.
On most embedded CPU architectures, Dhrystone's profile is
sharp, and it spends as much as 30 to 40% of its execution time in just two functions:
strcpy() (string copy) and strcmp() (string compare). The following code fragment, which
comes from within the main loop of Dhrystone, is an example of the way the benchmark uses
these functions:
strcpy (Str_2_Loc, "DHRYSTONE PROGRAM, 3'RD STRING");
The string in this case is fixed; no matter how many times
you run the benchmark, the string uses the same sequence of characters. Thus, data caching
significantly boosts performance on all but the first pass through the code. Furthermore,
compilers for 16- or 32-bit machines are free to place the string in memory such that the
first character is "address-aligned," meaning that the µP can read multiple
characters in the first and subsequent operations. In these respects, Dhrystone performs
rather specialized and more intense string handling than that found in many
embedded-system workloads.
If your embedded firmware does little specialized string
handling, projections of the firmware's performance based on the Dhrystone results could
be inaccurate. You can also turn that logic around: If your application does process a lot
of strings in a similar way, Dhrystone could turn out to be a useful guide to performance
(see box "How would you like your
benchmark results cooked, sir?").
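To judge whether a program's profile is flat or sharp, you can compute the share of total execution time that the two most expensive functions consume. The following is a minimal sketch under the assumption that you already have per-function times from a profiler; the function name and the sample figures in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of total execution time spent in the two most expensive
 * functions. times[] holds one measured time per function, in any
 * consistent unit. */
double top_two_share(const double *times, size_t count)
{
    double total = 0.0, first = 0.0, second = 0.0;

    assert(count >= 2);
    for (size_t i = 0; i < count; i++) {
        assert(times[i] >= 0.0);
        total += times[i];
        if (times[i] > first) {          /* new largest value */
            second = first;
            first = times[i];
        } else if (times[i] > second) {  /* new second-largest value */
            second = times[i];
        }
    }
    assert(total > 0.0);
    return (first + second) / total;
}
```

Ten functions that each take the same time yield a share of 0.2, which is a flat profile. A program in which two string routines dominate yields a far higher share, and comparing that figure with the same measurement on your own firmware indicates how far the Dhrystone result is likely to carry over.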
Various other criticisms of Dhrystone include the argument
that its synthetic construction causes Dhrystone's characteristics to differ from those of
code that implements practically useful computations--for example, that Dhrystone has an
unusually low number of instructions per procedure call and is therefore oversensitive to
the implementation efficiency of procedure calls and returns.
Similarly, Dhrystone's call sequences are nested only three
or four levels deep. Most register-window RISC machines would therefore never spill or fill
their register windows and so would never need to save registers to or restore them from
off-chip memory. Register windowing is a useful architectural technique, but its
usefulness on the Dhrystone benchmark is higher than that found in a great deal of
embedded-systems code.
All these lines of argument essentially proceed toward the
same conclusion: It's possible--even likely--that the small, portable Dhrystone benchmark
doesn't use the CPU in the same way as a large, complex piece of embedded software does.
Three processors and their Dhrystone results
As a practical demonstration of this point, take a look at
the performance of Dhrystone and of a larger piece of embedded software on three
processors. The processors are not necessarily comparable in price or performance; this
study compares the benchmarks, not the processors. Two of the processors are RISC
designs--the i960 µP (Intel Corp, www.intel.com) and the Am29205 µC (AMD, www.amd.com)--and the
third is a derivative of the well-known x86 CISC family, the Am186EM µC (AMD) (see box "About
the processors").
Comparing apples with oranges
The RISC processors ran the benchmarks with a clock speed
of 16 MHz, which was the speed of both the system bus and the internal pipeline. The
Am186EM µC was tested with a 10-MHz bus and a 40-MHz internal speed. Even so, memory
bandwidth still sets the limit on performance; recall that most x86 instructions require
multiple CPU cycles. If these speeds strike you as either very low or rather high,
remember that there's no such thing as a typical embedded system. The purpose here is
simply to illustrate some points about the practicalities of benchmarking.
The i960 benchmarks ran on Intel
evaluation boards. Intel's two-pass compiler technology
was exploited as far as possible, wherever applicable. Because of i960 architecture
characteristics, a number of i960 family members benefit from two-pass technology. The
Am29205 and Am186 µC benchmarks ran on AMD evaluation
boards.
Memory access times for the evaluated systems are in the
notation initial/subsequent. For example, 2/1 denotes a system that returns the first word
of a new access sequence with a delay of two cycles and returns each subsequent word in
the fetch sequence with a delay of one cycle. This memory system is sometimes referred to
as "2-1-1-1." Similarly, a 4/3 DRAM has four-cycle initial access followed by
three cycles for subsequent words within the same burst, sometimes written
"4-3-3-3."
The i960 SA platform uses relatively fast 2/1 DRAM. Results
for the Am29205 processor were obtained on one system based on 3/2 DRAM controlled by the
on-chip DRAM controller and one system based on 2/1 SRAM in the ROM controller's address
region. The on-chip ROM controller permits 2/2 access only when performing memory writes.
The Am186EM µC had SRAM attached to its on-chip memory controller. The 70-nsec SRAM is
inexpensive because of its 10-MHz operation. Slow SRAM devices are inexpensive but
generally smaller in capacity than DRAM. Their size is typically not a problem for many
embedded systems.
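The initial/subsequent notation translates directly into a cycle count for a burst. The following is a minimal sketch; the function name is illustrative, and the four-word burst length in the usage note is an assumption.

```c
#include <assert.h>

/* Cycles needed to fetch a burst of `words` words from a memory rated
 * initial/subsequent: the first word costs `initial` cycles and every
 * following word in the same burst costs `subsequent` cycles. */
unsigned burst_cycles(unsigned initial, unsigned subsequent, unsigned words)
{
    assert(words >= 1);
    return initial + (words - 1) * subsequent;
}
```

For a four-word burst, a 2/1 memory needs 5 cycles, a 3/2 memory needs 9, and a 4/3 memory needs 13, so the same CPU core can wait more than twice as long for instructions and data depending on the memory behind it.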
The benchmark tests used the latest versions of the best
available compilers. Optimization levels were set to produce the highest code-execution speeds, although these
settings also allow a user to trade off between performance and code size.
The worst benchmark abuse
One of the most frequent abuses of benchmarking is
presenting the results for different platforms as if the platforms were actually similar.
For example, a CPU vendor might benchmark its own processor on a board with plenty of fast
SRAM and then benchmark competing products with slow DRAM. The vendor might then present
the results as if all the processors are running with memory systems of equivalent price
and performance. The writers of such reports usually carefully avoid any outright lies;
instead, they omit the key information that would make the unfairness of the comparison
obvious, thus denying any intention to mislead.
This dishonesty is precisely what gives benchmarks a bad
name. Using fast components in a benchmarking platform is fair, but the only honest way to
present benchmark results is to disclose full details of all parts of the system--enough
to allow an independent verification. Then, users can draw their own conclusions and
decide whether they can afford to implement their own production systems in the same way.
Dhrystone cache hit rates are atypically high, and
Dhrystone spends much execution time in rather idealized string-handling activities.
Performance results for a larger piece of embedded code more fully illustrate the
limitations of Dhrystone performance. This piece of embedded code is an implementation of
the Link Access Protocol used for the D-channel (LAPD) of an Integrated Services Digital
Network connection. The protocol deals with the mechanics of setting up and tearing down a
call connection, rather than with bulk data transfer.
This LAPD code is based on the AmLink software from AMD. In the form used for performance measurement, the LAPD
code comprises three sections: sending an information packet and receiving an unnumbered
acknowledge, receiving an information packet and responding with an unnumbered
acknowledge, and sending an information packet and receiving an information packet.
Results consist of geometric-mean values of the three
sections' packet-switching speeds. Multiplying all N values and then taking the Nth-root
of that aggregate product produces the geometric mean. When combining the results of
several benchmarks, the geometric mean is preferable to the arithmetic mean because the
variance among the values to be averaged has less effect on the geometric mean (see box
"Use of the geometric mean").
Comparing Dhrystone performance
In the context of the overall embedded CPU market, the
three devices for these tests fall within the middle ground--clearly above the lowest-cost
4- and 8-bit components and certainly below high-end 32- or 64-bit devices. The i960
family spans a wider performance range than the 29K; the S-series i960 is more a
scaled-down version of its immediate relatives than is the Am29205.
With Version 2.1 of Dhrystone, both RISC processors outperform the Am186EM CISC processor.
However, the Am29205, when it uses a slow 3/2 memory, fails to outperform the other two
processors (Figure 1a). Also, the i960 SA's
advantage is slight. The i960 SA does well on Dhrystone because of its on-chip
cache, which exploits the benchmark's size and strong locality.
The Am29205's result gains a different advantage. You can
use the processor's Cpbyte instruction to compare four pairs of characters at once,
holding them in a pair of 32-bit operands. String-comparison operations account for a
significant proportion of Dhrystone's overall activity.
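The general idea of handling four characters per operation can be sketched in portable C. The following is only an illustration of word-at-a-time string comparison; it does not reproduce the semantics of the Cpbyte instruction or the code any particular compiler library uses. The function names are hypothetical, and the sketch assumes that both buffers are at least `capacity` bytes long, that `capacity` is a multiple of four, and that each string is terminated within it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in w is zero (a well-known bit trick). */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Returns 1 if the two NUL-terminated strings are equal, otherwise 0.
 * Compares four characters per step by loading them into 32-bit words. */
int strings_equal_wordwise(const char *a, const char *b, size_t capacity)
{
    assert(capacity % 4 == 0);
    for (size_t i = 0; i + 4 <= capacity; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);   /* memcpy avoids alignment assumptions */
        memcpy(&wb, b + i, 4);
        if (wa != wb) {
            /* The words differ: check byte by byte, because the
             * difference may lie after the terminator. */
            for (size_t j = i; j < i + 4; j++) {
                if (a[j] != b[j])
                    return 0;
                if (a[j] == '\0')
                    return 1;
            }
        }
        if (has_zero_byte(wa))   /* equal words that contain the terminator */
            return 1;
    }
    return 0;                    /* no terminator found within capacity */
}
```

On two 32-byte buffers holding 30-character Dhrystone strings, this loop needs at most eight word comparisons instead of up to 31 character comparisons, which is why hardware or library support for this style of comparison has such a visible effect on the benchmark.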
Running the LAPD benchmark shows that the i960 SA system's
performance is relatively weak (Figure 1b). The
LAPD benchmark is considerably larger than Dhrystone, so the small cache offers little
benefit. The Am29205 µC still performs relatively well, most likely because of the large
on-chip register file. During the LAPD code's intensive data handling, this large
register file minimizes the frequency of slower accesses to off-chip memory. The Am186EM
system shows the highest result of the three.
Contemplate the outcome
The LAPD benchmark results may come as a surprise, especially to those who expected the
relative performance to remain broadly the same as that for Dhrystone. The Dhrystone
results suggest that, unless you use a slow memory system with the Am29205, this µC has a
significant advantage over the Am186EM system. However, the LAPD result tells exactly the opposite story (Figure 2).
Similarly, the Dhrystone results suggest that the i960 system has a slight advantage over
the Am186 system, but the LAPD results show a decisive advantage the other way (Figure 3).
It's important to understand the conclusions that you can
draw from these results. You should not use the LAPD benchmark--or any other single
program--instead of Dhrystone as the only measure of processor performance. On the
contrary, the results from LAPD could be just as misleading as Dhrystone and for the same
reason: The characteristics of LAPD in the way it exercises the CPU can differ widely from
those of the software in your embedded system. Nor does this article make any claims on
behalf of any vendor or architecture or seek to launch yet another campaign in
the RISC-versus-CISC war.
The point is simply that you can't project the performance
of high-performance embedded systems simply by considering the execution time of one
small, simple benchmark program. Many other aspects of CPU performance--multitasking
support characteristics, such as interrupt latency or context switching--have great
importance for embedded-systems design. High-end embedded processors as fast as 200 MHz
are already available. Many of these processors feature large and sophisticated on-chip
cache hardware, long pipelines, and high-speed floating-point support, yet are still
priced for use in mass-market embedded and consumer-electronics applications. The time it
takes for a repeated execution of a few hundred artificially selected instructions on a
machine of this complexity tells you little about the device's ability to control, say, a
1-Gbit router or a high-resolution laser printer.
Although engineers frequently blame the widespread abuse of
Dhrystone results on wicked marketing departments or irresponsible journalists, engineers
must take their share of the responsibility. Processor vendors should work to create or
select more useful benchmarks and ensure that they clearly and honestly report the
results. Engineers who design embedded processors into systems also have a crucial role to
play. Don't simply take benchmarks on faith; you should expect and demand better support
from processor vendors.
Reference
1. Weicker, Reinhold P, "Dhrystone: A synthetic systems programming benchmark," Communications of the ACM, Volume 27, No. 10, October 1984.