January 15, 1998
Hands-On Evaluation
Virtual processors and the reality of software simulation
Markus Levy, Technical Editor
DSP and µP simulators range from simple instruction-level analyzers to
cycle-accurate simulators that allow you to model an entire system. This Hands-On
Evaluation compares a sampling of the available simulators to help you weigh their
features and benefits.
A few years ago, advanced processor-simulation technology was feasible only on
expensive workstations, and, therefore, its use was limited to the processor vendors
themselves. System developers had to use other tools, such as in-circuit emulators and
hardware-development boards, to analyze their designs' performance. Or, if they were
lucky, they could use a software debugger that would give them minimal performance
analysis, although it probably wouldn't provide insight into the processor's pipeline. The
problem with using in-circuit emulators is that they require you to have already committed
your untested design to silicon. The problem with hardware-development boards is that they
have standard features that may differ widely from your design.
Fortunately, high-performance PCs and low-end workstations are bringing simulation
technology to a practical level. System designers are beginning to depend on commercially
available simulators to help them achieve quicker time to market. They relish the ability
to simulate their designs well ahead of working silicon, giving them the confidence that
their designs are functional before committing them to hardware. Furthermore, system
designers developing products with custom processors use simulators to experiment with
various features, including cache size, operating frequency, and on-chip peripherals. Some
programmers are using simulators for almost all of their development work, especially as
improved simulator accuracy narrows the gap between the virtual and the real (see box "Do simulators meet the needs of performance benchmarks?").
The capabilities of processor simulators range from simple instruction-level analyzers
to cycle-accurate simulators that allow you to model an entire system. (They all provide
various degrees of debugging capability.) If you need only high-level functional
verification for your development, then an instruction-level simulator is sufficient. But
if you need precise timing information that will guarantee your system's behavior, a
cycle-accurate simulator is essential. In this article, I take some commercially
available, PC-compatible simulators for a test drive. The simulators I compare are from Analog Devices, CARDtools,
Intel, Microchip
Technology, Software Development Systems, and Tasking. The goal of this comparison is not to
pit the simulators in this representative sampling against one another. You can
find the specs of my test system here.
Accuracy-vs-performance trade-offs
A cycle-accurate simulator models all aspects of the target processor, including the
pipeline, the cache, the memory-management unit, and all phases of memory accesses. More
complex simulators can also model system-level performance and allow you to
"attach" external peripherals through an application-programming interface
(API). As their name implies, cycle-accurate simulators evaluate the processor's state at
every clock cycle. To prove the accuracy of their simulators, some vendors test them
against the same test vectors they use to validate their processors.
The granularity of an instruction-accurate simulator is at the instruction level. Only
in simple processor architectures, such as Microchip's
PICmicro, does an instruction-accurate simulator come close to modeling the processor's
architecture. Vendors build instruction-accurate simulators to help customers analyze
software functionality, perform code tracing, and port and initialize operating systems.
These simulators do not suit performance analysis or architecture benchmarks. However, as
you might expect, simulator complexity and accuracy are inversely related to the rate
of instruction execution. In other words, the more features and capabilities that a
simulator has, the fewer instructions it can execute per second.
Before delving into the features and capabilities of simulators, you should know what a
simulator comprises. From the highest level, a vendor writes its simulator code in VHDL or
a high-level language, such as C or C++. Although simulators written in VHDL are
typically proprietary and not commercially available, many vendors derive their
high-level-language-based simulators from their processors' VHDL gate-level models. Other
vendors derive their simulators from a functional specification taken from the processor's
data sheet. Regardless of the language a vendor uses, a simulator is basically a software
program that contains a mixture of if-then-else statements and a variety of function
calls. In the simplest sense, the simulator "fetches" each instruction, checks
the current status of the CPU, and makes the appropriate function calls to update the
state of the virtual CPU based on the execution of the instruction. Instruction-accurate
simulators check and update only the fundamental CPU and system elements, such as
registers, ALU, and memory. The more complex and accurate a simulator is, the more
elements it checks and the more often it checks them.
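The fetch, check, and update loop described above can be sketched in a few lines. This is a minimal illustration for a made-up accumulator machine, not any vendor's simulator; the `ToyCPU` class, the four opcodes, and the 16-word memory are all assumptions chosen to keep the example short. It models only registers and memory, which is what makes it instruction-accurate rather than cycle-accurate.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    """State of a hypothetical accumulator machine (not a real processor)."""
    pc: int = 0
    acc: int = 0
    halted: bool = False
    executed: int = 0
    memory: list = field(default_factory=lambda: [0] * 16)


# One handler per opcode; each updates only registers and memory.
def op_load(cpu, operand):
    cpu.acc = cpu.memory[operand]


def op_add(cpu, operand):
    cpu.acc = (cpu.acc + cpu.memory[operand]) & 0xFF  # 8-bit wraparound


def op_store(cpu, operand):
    cpu.memory[operand] = cpu.acc


def op_halt(cpu, operand):
    cpu.halted = True


DISPATCH = {"LOAD": op_load, "ADD": op_add, "STORE": op_store, "HALT": op_halt}


def run(cpu, program):
    """Fetch each instruction, check CPU status, and update the virtual CPU."""
    while not cpu.halted and cpu.pc < len(program):
        opcode, operand = program[cpu.pc]  # "fetch"
        cpu.pc += 1
        DISPATCH[opcode](cpu, operand)     # function call updates the state
        cpu.executed += 1
    return cpu


program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)]
cpu = ToyCPU()
cpu.memory[0], cpu.memory[1] = 7, 5
run(cpu, program)
print(cpu.memory[2], cpu.executed)
```

A cycle-accurate simulator would replace the single dispatch call with per-clock updates of pipeline stages, caches, and bus state, which is where the extra work and the lower execution rate come from.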
Features and capabilities
Although instruction-accurate simulators can achieve more than 1.5 MIPS, cycle-accurate
simulators, such as Analog Devices' Visual DSP for the
company's ADSP-2106x DSPs (SHARC), typically perform fewer than 5000 instructions/sec. To
be precise, the Win95-compatible Visual DSP simulator hits 4613 cycles/sec on a 300-MHz Intel Pentium II
system. (Visual DSP also runs on NT and on Sun workstations with Solaris or Sun OS.)
Considering the amount of work that the simulator performs, this number is relatively
high. Visual DSP goes beyond being a cycle-accurate simulator; it analyzes
the processor state on the rising and falling edges of every clock. The SHARC simulator
simulates the 2106x core, including the pipeline and instruction cache, the memory
subsystem and associated buses, interrupts, and the I/O processor and associated
peripherals. The simulator accurately handles aborted pipeline stages, cache misses, and
delay cycles associated with interrupts, bus contention, and looping. Click here if you're interested in seeing the
basic program flow of Visual DSP.
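To put a simulation rate such as 4613 cycles/sec in perspective, it helps to work out how much host time one second of target execution costs. The sketch below does that arithmetic. The 40-MHz target clock is an assumption made for illustration, not a figure from this evaluation, and `host_seconds` is a hypothetical helper name.

```python
def host_seconds(target_cycles: int, simulated_cycles_per_sec: float) -> float:
    """Wall-clock seconds the host needs to simulate target_cycles."""
    return target_cycles / simulated_cycles_per_sec


SIM_RATE = 4613                # cycles/sec, the measured Visual DSP rate
TARGET_CLOCK_HZ = 40_000_000   # assumed target clock, for illustration only

one_target_second = host_seconds(TARGET_CLOCK_HZ, SIM_RATE)
print(f"{one_target_second:.0f} s of host time "
      f"(about {one_target_second / 3600:.1f} hours)")
```

Under that assumption, one second of target time takes roughly 2.4 hours to simulate, which is why cycle-accurate runs are usually limited to short, critical code sequences.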
Just as with real hardware, you can set up external memory regions and define the
number of wait states. The simulator's user-friendly graphical user interface (GUI)
displays the fetch, decode, and execute pipe stages. If you are using the simulator to
follow a program's flow and the program hits a "change of flow," the GUI
displays the exact way that the CPU fills and flushes its pipeline. The GUI also allows
you to display the instruction-cache contents, including all of the processor's internal
cache state information.
Visual DSP allows you to automatically generate external interrupts, which the
simulator examines on every clock edge. You can specify the interrupt's frequency and
whether you want the simulator to randomly vary the timing of the interrupt. As with other
simulators that handle external interrupts, the simulator may skew timing by one or more
clocks. In other words, in a real hardware implementation, when the external interrupt
hits, it takes a finite amount of time before the interrupt goes through the interrupt
controller and into the CPU. With simulators, no hardware delay exists, so the simulator
may process the external interrupt one or more clocks early.
Analog Devices' simulator also supports the DSP's
DMA features and accurately reads or writes data at the correct time. The simulator
recognizes when a DMA causes a cycle steal or stall. The simulator simulates the DSP's
serial ports and allows you to "connect" the data lines to a file to allow the
simulator to process data as if the DSP were connected to another device. Through one of
the simulator's APIs, you can map an I/O location to a data file to simulate reading and
writing to another I/O-mapped device in your system.
To add flexibility to its simulator, Analog Devices
developed the simulation engine as a dynamic-link library, isolating it from the GUI via a
series of public APIs. You can use these APIs to connect the simulator with other software
models and have them exchange and synchronize signals. For example, you can connect a
serial audio codec model to the "DSP's" serial port and have the codec respond
as if it were in the system. This process of merging simulations can be slow, so Analog Devices is modifying the simulation engine to
allow users to write modules that plug directly into the engine.
Visual DSP provides an intuitive GUI but requires you to have some experience with the
SHARC processor to understand what's happening in the pipeline. With a data book in hand,
you can easily program the DSP's on-chip peripherals through the simulator's interface.
Although it may not be an issue for you, the simulator presents timing information only in
terms of cycle counts, and the tool lacks a mechanism to set the processor's operating
frequency. For code that runs from on-chip memory, cycle counts alone are sufficient.
However, operating frequency becomes more important
when you use external memory with one or more wait states.
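A short calculation shows why frequency matters once external memory enters the picture: the number of wait states a given memory needs depends on the clock period, so cycle counts do not scale cleanly with clock rate. The sketch below assumes that a zero-wait-state access completes in one clock and that each wait state adds exactly one clock; the 60-nsec memory and the cycle and access counts are made-up figures.

```python
import math


def wait_states(access_ns: float, clock_mhz: float) -> int:
    """Wait states needed so the access fits in a whole number of clocks.

    Assumes a zero-wait-state access takes one clock period.
    """
    clocks_needed = math.ceil(access_ns * clock_mhz / 1000.0)
    return max(0, clocks_needed - 1)


def run_time_us(core_cycles: int, external_accesses: int,
                access_ns: float, clock_mhz: float) -> float:
    """Execution time in microseconds (cycles divided by MHz)."""
    stall_cycles = external_accesses * wait_states(access_ns, clock_mhz)
    return (core_cycles + stall_cycles) / clock_mhz


# Hypothetical workload: 10,000 core cycles, 2000 accesses to 60-nsec memory.
for mhz in (20, 40):
    print(mhz, "MHz:", wait_states(60, mhz), "wait state(s),",
          run_time_us(10_000, 2_000, 60, mhz), "usec")
```

In this example, doubling the clock from 20 to 40 MHz raises the wait-state count from one to two, so the run time falls from 600 to 350 usec rather than halving.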
Simulating an entire system
CARDtools Systems (computer-aided real-time
design tools) offers NitroVP, a codesign and cosimulation tool. You can use NitroVP to
prototype and model embedded-CPU applications, such as those based on the Hitachi (Brisbane, CA) SH. This
modeling process covers the processor and its on-chip peripherals, system-level
peripherals and other hardware devices, the OS, and application software. The simulator
supports multiprocessor designs. You can run NitroVP on a PC with Win95 or NT or on a Sun
workstation.
NitroVP lets you model an entire system--at any level of abstraction, including the SH
cycle-count simulator. However, because CARDtools'
simulator is event-driven, it is faster than a cycle-count version. The level of
abstraction that you use depends on the time you want to spend developing the models, as
well as the time it takes to run the simulation. CARDtools
provides mixed-mode simulation, allowing you to simulate different sections of the model
in different levels of detail, independently of the other sections. Selecting the
appropriate level of detail in a model is a balancing act between accuracy and
performance. The greater the degree of accuracy, the slower the simulation. For example,
the Hitachi cycle-accurate simulator, including memory, cache, and pipelining effects,
runs at 50,000 cycles/sec on a 200-MHz Sun Ultra II workstation.
In addition to cycle-accurate simulation, NitroVP allows you to control memory wait
states, a critical factor for determining a system's performance. The CARDtools simulator is the only commercially available
simulator I know of that can model the processor's and system peripherals'
power consumption. What's more, the simulator generates the power-consumption numbers on a
cycle-by-cycle basis by determining the active functional units or peripherals within the
processor. NitroVP can also model power consumption under various operating modes,
including active, standby, and sleep.
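The idea of deriving power numbers cycle by cycle from the active functional units can be sketched as follows. This is not NitroVP's model or interface; the unit names, the milliwatt figures, and the per-mode baseline are all invented for illustration.

```python
# Hypothetical per-unit power draw while a unit is active, in milliwatts.
UNIT_POWER_MW = {"core": 120.0, "cache": 40.0, "uart": 5.0}

# Hypothetical baseline draw for each operating mode, in milliwatts.
MODE_BASELINE_MW = {"active": 10.0, "standby": 2.0, "sleep": 0.1}


def cycle_power_mw(mode, active_units):
    """Power for one clock cycle: mode baseline plus every active unit."""
    return MODE_BASELINE_MW[mode] + sum(UNIT_POWER_MW[u] for u in active_units)


def average_power_mw(trace):
    """Average power over a trace of (mode, active_units) entries, one per cycle."""
    return sum(cycle_power_mw(mode, units) for mode, units in trace) / len(trace)


trace = [
    ("active", ("core", "cache")),
    ("active", ("core",)),
    ("active", ("core", "uart")),
    ("sleep", ()),
]
print(round(average_power_mw(trace), 3), "mW average")
```

A real model would need vendor-supplied figures for each unit and mode; the point here is only that a per-cycle record of which units are busy is enough to accumulate a power estimate.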
NitroVP allows you to model hardware devices using probabilistic inputs, such as a
periodic interrupt or reading a test file as input data. You can also use a C program. For
more accuracy, you can make direct calls to a Verilog/VHDL simulator or--at a slightly
higher level than VHDL--you can use CARDtools'
Device Behavioral Language (DBL). Although using DBL requires you to learn to program in
this proprietary language, it is based on general C constructs. Likewise, you can use C
code to model software tasks. Alternatively, CARDtools
has developed a Task Behavioral Language (TBL) for higher-level task representations.
CARDtools' DBL allows you to specify the logic
and timing of a device, as well as to highlight the various device states and delays to
process data. Just as with a real device, you can re-enter, reset, or lock your DBL
device. The language also supports priority usage and pending critical resources, such as
a system bus. DBL supports a logical representation of hardware registers, allowing the
virtual system to share register information with software; your software or software
model can directly read, set, or modify these hardware registers.
One of the most useful capabilities of the SH simulator is its ability to account for hardware
behavior and delays, OS overhead, and memory usage. The GUI allows you to review timing
issues, such as missed deadlines, starvation, and poor scheduling. A set-top-box
simulation demonstrates these capabilities of the simulator (Figure 1). This simulation
ran on a 45-MHz SH-3 processor. In the figure,
the top section of the GUI shows the percentage of CPU usage; the most important result is
that the CPU was idle 30% of the time. From this result, a system designer could then
decide whether to operate the CPU at a lower frequency or leave some headroom for
additional system functions, such as the ability to handle high-definition TV. The lower
part of the GUI shows resource usage as a function of time. The figure
also shows tasks fighting for CPU resources, when a task or an interrupt-service routine
(ISR) is active, and which task or ISR beat out its predecessor. You can also superimpose
critical-event flags. In other examples, you can see devices interacting with the
processor, as well as the software tasks.
The CARDtools simulator also analyzes system-memory
requirements. If you use a 27-MHz SH-2 processor in the same set-top-box example, the
audio and ring buffers come close to their limits (Figure 2).
Additionally, only 2% of the CPU's resources remain for overhead. Click
here to see how switching to the SH-3 processor affects buffer usage.
Among the simulators I reviewed, the CARDtools
simulator provides the most comprehensive simulation capabilities. However, the company
needs to add better online help for the DBL and TBL to make these components of the tool
more user-friendly.
Motorola's new MCore
As with CARDtools' simulator, Software Development Systems' (SDS) SingleStep simulators
integrate with hardware/software codesign and cosimulation environments, namely
EagleI from Viewlogic (Marlborough, MA, www.viewlogic.com)
and V-CPU from Simulation Technologies (New Brighton, MN, www.winternet.com/~simtech). This integration
allows you to verify the hardware and software integration at the system level.
Although SingleStep supports Motorola's PowerPC, 68K, and MCore, only the simulators for MCore and
some 68K processors are cycle-accurate. The MCore simulator models the entire instruction
pipeline, including the prefetch, decode, and execution units. The MCore simulator also
accurately determines the clock cycles for each memory access. This simulator generates
more accurate simulations when multiple devices, such as the CPU and a DRAM controller,
are contending for the memory bus.
SingleStep's GUI displays runtime information, including each instruction as it
executes, all logical memory accesses, address translations, physical-bus cycles, and even
instruction timing. Although not applicable to MCore because MCore has no cache,
SingleStep also displays cache hits and misses. The simulator provides direct control of
the simulated caches, including the ability to selectively enable and disable caches, load
and lock memory ranges into cache, flush all dirty cache lines, and display cache
statistics. Direct cache control allows you to determine the effects of locking various
code or data segments into cache before implementing cache-management routines at the
software level.
SDS also offers a peripheral-adaptation kit with an API that developers can use to
create and model virtual peripherals to hook into the simulator. Using C functions or C++
classes, you can model I/O devices of any level of abstraction, ranging from a high-level
functional description to a detailed register-level interface. SDS chose the C/C++
strategy to maximize flexibility and performance and to avoid exposing software developers
to the complexities of VHDL and Verilog models. The peripheral API also includes
provisions to link simulated peripherals to real hardware through a memory-mapping scheme.
SDS is developing a modeling-extension kit for its simulation environment that will allow
the simulator to model internal and external processor buses, bus arbitration and timing
for multiple processors, and DMA units and other coprocessors. The kit will also provide a
higher level of integration with commercial hardware/software coverification tools.
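To give a feel for what a register-level virtual peripheral looks like, here is a minimal sketch. It is written in Python rather than the C or C++ that the SDS kit uses, and it does not reproduce the SDS API; the `Bus` and `CountdownTimer` classes, the register offsets, and the base address are all hypothetical.

```python
class CountdownTimer:
    """Hypothetical timer with two memory-mapped registers."""
    COUNT = 0x0    # read/write: remaining ticks
    STATUS = 0x4   # read-only: 1 once the count reaches zero; cleared on read

    def __init__(self):
        self.count = 0
        self.expired = False

    def read(self, offset):
        if offset == self.COUNT:
            return self.count
        if offset == self.STATUS:
            value = 1 if self.expired else 0
            self.expired = False  # reading STATUS clears the flag
            return value
        raise ValueError(f"no register at offset {offset:#x}")

    def write(self, offset, value):
        if offset != self.COUNT:
            raise ValueError(f"register at offset {offset:#x} is not writable")
        self.count = value
        self.expired = False

    def tick(self):
        """Advance the device by one simulated clock."""
        if self.count > 0:
            self.count -= 1
            self.expired = self.count == 0


class Bus:
    """Routes simulated memory accesses to whichever device owns the address."""

    def __init__(self):
        self._regions = []

    def attach(self, base, size, device):
        self._regions.append((base, size, device))

    def _decode(self, address):
        for base, size, device in self._regions:
            if base <= address < base + size:
                return device, address - base
        raise ValueError(f"unmapped address {address:#x}")

    def read(self, address):
        device, offset = self._decode(address)
        return device.read(offset)

    def write(self, address, value):
        device, offset = self._decode(address)
        device.write(offset, value)

    def tick(self):
        for _, _, device in self._regions:
            device.tick()


TIMER_BASE = 0x4000_0000  # hypothetical base address
bus = Bus()
bus.attach(TIMER_BASE, 8, CountdownTimer())
bus.write(TIMER_BASE + CountdownTimer.COUNT, 3)
for _ in range(3):
    bus.tick()
first_status = bus.read(TIMER_BASE + CountdownTimer.STATUS)
second_status = bus.read(TIMER_BASE + CountdownTimer.STATUS)
print(first_status, second_status)
```

The same device could be modeled at a higher level of abstraction by dropping the registers and exposing only an "expired" event; the register-level version is the one that lets driver code run unchanged against the model.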
SingleStep has a well-organized GUI with pulldown windows that help make the simulator
easier to use. However, if you want to unleash all the simulator's capabilities, you need
to use SingleStep's command window, which requires Unix-like commands. High performance is
one of SingleStep's strong points. On a 300-MHz Pentium II system, the simulator
demonstrated 1.65 MIPS; however, I was using a simple piece of code, and no virtual
peripherals were attached. Click here to
download some screen shots of my session with SingleStep.
When accuracy is enough
For instruction-accurate simulators, I tried Microchip's
MPLAB-SIM and Tasking's Power Package. MPLAB-SIM is a
C++ model that provides instruction-accurate simulation for the PIC architecture. In
addition to simulating the PIC's core functions, MPLAB-SIM supports most of Microchip's on-chip peripherals. However, be careful
in your performance analysis because the on-chip peripherals and interrupts function only
on instruction-cycle boundaries, in contrast to real hardware in which the peripherals and
interrupts operate asynchronously to the core. The simulator can detect events after the
first cycle of instructions that have two or more cycles.
At any instruction-cycle boundary, you can provide synchronous or asynchronous stimuli to any
of the pins. In MPLAB-SIM's asynchronous dialogue box, you can assign input functions on
12 pins (Figure 3). The simulator provides a file-injection
capability that allows you to simulate random events, such as A/D conversions. MPLAB-SIM
does not support memory-wait-state configurations, but, because only the PIC 17cxx devices
have external memory, the lack of wait-state configuration is not an issue (especially
because the PIC 17cxx devices run at a maximum of 20 MHz, making it fairly easy to assume
zero-wait-state memory). Another limitation of the simulator, common to other
instruction-level simulators, is that peripherals may have limited accuracy. For example,
the PWM, which can generate pulses with 10-bit resolution on real hardware, has only 8-bit
resolution on the simulator because a finer resolution is meaningless if the simulator can
change pin states only on instruction-cycle boundaries. Basically, MPLAB-SIM is a useful
simulator for analyzing the functionality of your program, and you can't beat the price:
It's free. Click here to download Microchip's MPLAB-SIM session.
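The drop from 10-bit to 8-bit PWM resolution follows from simple arithmetic if you assume that the hardware PWM resolves edges to a single oscillator clock and that one PIC instruction cycle spans four oscillator clocks. Both points are assumptions of this sketch rather than statements from the MPLAB-SIM documentation, and `simulated_pwm_bits` is a hypothetical helper.

```python
def simulated_pwm_bits(hardware_bits: int, clocks_per_instruction_cycle: int) -> int:
    """Resolution left when edges can move only on instruction-cycle boundaries."""
    hardware_steps = 1 << hardware_bits
    visible_steps = hardware_steps // clocks_per_instruction_cycle
    if visible_steps < 1:
        raise ValueError("period is shorter than one instruction cycle")
    return visible_steps.bit_length() - 1  # floor(log2(visible_steps))


print(simulated_pwm_bits(10, 4))
```

With 1024 hardware steps and four oscillator clocks per instruction cycle, only 256 distinct edge positions remain, which corresponds to 8 bits.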
Tasking's Power Package supports Siemens'
(Cupertino, CA) new Tricore architecture and, like most of the other companies' tools, is
a complete development environment with an instruction-accurate simulator. Like MPLAB-SIM,
this simulator is useful for analyzing the basic functionality of your program. But,
unlike the PIC architecture, the complexity of the Tricore architecture makes it difficult
for an instruction-accurate simulator to accurately model pipeline effects, code branches,
and interrupts. Although the simulator supports variations in the Tricore's
memory-wait-state configuration, it doesn't allow you to control the external-memory
width; the architecture specification stipulates the datapaths and instruction paths. To
study the effects of memory wait states, I ran the Dhrystone MIPS benchmark on the Tricore
simulator. Even with the cache enabled, the simulator showed a large performance
impact when I changed from one to three wait states: the MIPS rating fell from 137 to 109, a drop of about 20%. I
couldn't determine or change the processor's operating frequency to affect the results of
this benchmark. Siemens is also working on a cycle-accurate simulator, which it claims
will be available this year. The new simulator will implement a flexible cache model that
gives you options such as defining start and end addresses, the number of ways and lines,
and the line size and banks. This simulator will also include branch-prediction logic and
interrupt-latency determination. Click here to
download Siemens' Tricore simulator session.
Simulating a simulator
Intel's Visual Tuning Environment, VTune, differs
from the other simulators because it runs on the processor that it is simulating--well,
sort of. Actually, VTune runs on Windows 95 or NT systems, and it can simulate Intel processors, including the 486, Pentium, Pentium with
multimedia-extension instructions, Pentium II, and Pentium Pro.
VTune operates at three levels: sampling, simulation, and call-graph generation for Java programs.
At the sampling level, VTune interrupts the processor and saves the current execution
address in a buffer. The purpose of this part of VTune is to provide a high-level profile
of the CPU usage of every function, including the operating-system modules. Before VTune
begins its sampling operation, it makes a call into the OS to build a table of every
function and its memory range. When VTune is finished sampling, it matches the sampled
addresses with the table's information and displays the information in a bar graph. You
can control VTune's sampling rate, but the minimum granularity is 1 msec--too coarse for
any embedded application but more than enough in most PC applications. Besides, sampling
too frequently intrudes on the application.
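The matching step in a sampling profiler, in which each saved execution address is attributed to the function whose memory range contains it, can be sketched with a sorted table and a binary search. This illustrates the general technique only, not VTune's implementation; the function names, address ranges, and sample values are invented.

```python
import bisect
from collections import Counter

# Hypothetical function table, sorted by start address: (start, end, name),
# where end is exclusive.
FUNCTIONS = [
    (0x1000, 0x1400, "main"),
    (0x1400, 0x1C00, "fft"),
    (0x1C00, 0x2000, "draw"),
]
STARTS = [start for start, _, _ in FUNCTIONS]


def attribute(address: int) -> str:
    """Return the name of the function whose range contains address."""
    index = bisect.bisect_right(STARTS, address) - 1
    if index >= 0:
        start, end, name = FUNCTIONS[index]
        if address < end:
            return name
    return "<unknown>"


def profile(samples):
    """Count samples per function; more samples implies more CPU time."""
    return Counter(attribute(address) for address in samples)


samples = [0x1010, 0x1500, 0x1504, 0x1BFF, 0x1C00, 0x3000]
histogram = profile(samples)
for name, hits in histogram.most_common():
    print(f"{name:10s} {'#' * hits}")
```

Because the result is purely statistical, a 1-msec sampling interval gives a useful picture only when the run is long enough to collect many samples per function.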
If you double-click on any bar in the profile bar graph, VTune shows a detailed profile
of "hot spots" in the function you selected. Double-clicking on a hot spot takes
you into the source or disassembled code associated with that hot spot. You can now run a
simulation, or dynamic analysis, on that hot spot. The simulator accurately models the
entire CPU, including its pipelines, the instruction and data caches, and the
branch-predictor circuitry.
One of VTune's best features is its simulation-analysis capability. After it runs the
simulation, VTune displays the number of clocks each instruction took to execute (Figure 4). In its analysis, VTune also shows you all penalties,
warnings, and pipeline-pairing issues associated with each instruction. This analysis
allows you to tune your code to keep pipelines full, minimize cache misses, and minimize
other types of operation stalls. Click here to
download Intel's VTune session.
VTune offers excellent online help from almost anywhere in its displays. Online help not
only shows you how to use VTune, but also provides detailed explanations for each penalty,
warning, and pairing issue that you click on. One weakness of the simulator is that it
doesn't model system behavior. For example, you can't configure the memory model, which
prevents you from analyzing the performance differences associated with DRAM types. VTune
sells for $279 (retail), but it's generally free to software developers.
Processor simulators are an evolving technology. But every processor vendor is heading
toward the same goal: simulators that behave like real devices. And in the latest trend,
vendors are offering or developing simulators to represent the entire system. It is
reasonable to assume that simulators could replace in-circuit emulators. It also seems
reasonable that simulators should replace hardware-development and -evaluation boards,
especially when you consider that simulators let you easily change parameters, such as
memory wait states or processor frequency.
However, simulators just can't handle some system events, such as the precise timing of
external signals. Furthermore, a simulator will never be able to guarantee the outcome of
race events, as when two or more signals simultaneously come into the processor. Or
consider how a simulator handles a situation in which the CPU and the DRAM controller
simultaneously request the bus. Also, how does a simulator handle the effects of DRAM
refresh cycles on execution speeds? Signal timing between asynchronous events, such as the
assertion of an interrupt-request signal, may also vary, depending on when the CPU samples
the event. Even a difference of a single clock cycle when the processor recognizes an
interrupt request may impact the timing if, during that one clock cycle, the processor
masked out the interrupt. So, as good as the simulation technology gets, remember, there
ain't nothing like the real thing, baby!
Acknowledgments
Thanks to my simulator driving instructors: Gary Carlton of Intel, Darrell Johansen of Microchip, Mark Pauna of Software
Development Systems, Joseph Rothman of CARDtools,
Thomas Schaer of Siemens, and Greg Yukna of Analog Devices.