
January 15, 1998


Hands-On Evaluation
Virtual processors and the reality of software simulation

Markus Levy, Technical Editor 

DSP and µP simulators range from simple instruction-level analyzers to cycle-accurate simulators that allow you to model an entire system. This Hands-On Evaluation compares a sampling of the available simulators to help you weigh their features and benefits.

A few years ago, advanced processor-simulation technology was feasible only on expensive workstations, and, therefore, its use was limited to the processor vendors themselves. System developers had to use other tools, such as in-circuit emulators and hardware-development boards, to analyze their designs' performance. Or, if they were lucky, they could use a software debugger that would give them minimal performance analysis, although it probably wouldn't provide insight into the processor's pipeline. The problem with using in-circuit emulators is that they require you to have already committed your untested design to silicon. The problem with hardware-development boards is that they have standard features that may differ widely from your design.

Fortunately, high-performance PCs and low-end workstations are bringing simulation technology to a practical level. System designers are beginning to depend on commercially available simulators to help them achieve quicker time to market. They relish the ability to simulate their designs well ahead of working silicon, giving them the confidence that their designs are functional before committing them to hardware. Furthermore, system designers developing products with custom processors use simulators to experiment with various features, including cache size, operating frequency, and on-chip peripherals. Some programmers are using simulators for almost all of their development work, especially as improved simulator accuracy narrows the gap between the virtual and the real (see box "Do simulators meet the needs of performance benchmarks?").

The capabilities of processor simulators range from simple instruction-level analyzers to cycle-accurate simulators that allow you to model an entire system. (They all provide various degrees of debugging capability.) If you need only high-level functional verification for your development, then an instruction-level simulator is sufficient. But if you need precise timing information that will guarantee your system's behavior, a cycle-accurate simulator is essential. In this article, I take some commercially available, PC-compatible simulators for a test drive. The simulators I compare are from Analog Devices, CARDtools, Intel, Microchip Technology, Software Development Systems, and Tasking. These products are a representative sampling; the goal of this comparison is not to stir up rivalry among their vendors.

Accuracy-vs-performance trade-offs

A cycle-accurate simulator models all aspects of the target processor, including the pipeline, the cache, the memory-management unit, and all phases of memory accesses. More complex simulators can also model system-level performance and allow you to "attach" external peripherals through an application-programming interface (API). As their name implies, cycle-accurate simulators evaluate the processor's state at every clock cycle. To prove the accuracy of their simulators, some vendors test them against the same test vectors they use to validate their processors.

The granularity of an instruction-accurate simulator is at the instruction level. Only in simple processor architectures, such as Microchip's PICmicro, does an instruction-accurate simulator come close to modeling the processor's architecture. Vendors build instruction-accurate simulators to help customers analyze software functionality, perform code tracing, and port and initialize operating systems. These simulators do not suit performance analysis or architecture benchmarks. However, as you might expect, simulator complexity and accuracy are inversely related to the rate of instruction execution. In other words, the more features and capabilities that a simulator has, the fewer instructions it can execute per second.

Before delving into the features and capabilities of simulators, you should know what a simulator comprises. From the highest level, a vendor writes its simulator code in VHDL or a high-level language, such as C or C++. Although simulators written in VHDL are typically proprietary and not commercially available, many vendors derive their high-level-language-based simulators from their processors' VHDL gate-level models. Other vendors derive their simulators from a functional specification taken from the processor's data sheet. Regardless of the language a vendor uses, a simulator is basically a software program that contains a mixture of if-then-else statements and a variety of function calls. In the simplest sense, the simulator "fetches" each instruction, checks the current status of the CPU, and makes the appropriate function calls to update the state of the virtual CPU based on the execution of the instruction. Instruction-accurate simulators check and update only the fundamental CPU and system elements, such as registers, ALU, and memory. The more complex and accurate a simulator is, the more elements it checks and the more often it checks them.
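The fetch-check-update loop described above can be sketched in a few lines. This is a minimal illustration in Python, not any vendor's implementation; the three-instruction machine, its opcodes (LOAD, ADD, JNZ), and the function name run are all hypothetical.

```python
# Hypothetical three-instruction accumulator machine; the opcodes and
# names are illustrative only and model no real processor.
def run(program, max_steps=1000):
    """Instruction-accurate loop: fetch, dispatch, update the virtual CPU."""
    state = {"pc": 0, "acc": 0, "executed": 0}

    def op_load(arg):  # acc <- constant
        state["acc"] = arg
        state["pc"] += 1

    def op_add(arg):  # acc <- acc + constant
        state["acc"] += arg
        state["pc"] += 1

    def op_jnz(arg):  # jump to arg if acc is nonzero, else fall through
        state["pc"] = arg if state["acc"] != 0 else state["pc"] + 1

    dispatch = {"LOAD": op_load, "ADD": op_add, "JNZ": op_jnz}

    while state["pc"] < len(program) and state["executed"] < max_steps:
        opcode, arg = program[state["pc"]]  # "fetch" the instruction
        dispatch[opcode](arg)               # decode and execute it
        state["executed"] += 1
    return state


# Count down from 3 to 0, then fall off the end of the program.
countdown = [("LOAD", 3), ("ADD", -1), ("JNZ", 1)]
result = run(countdown)
```

Only registers and the program counter change here, which is the instruction-accurate case; a cycle-accurate model would also have to update pipeline, cache, and bus state on every clock.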

Features and capabilities

Although instruction-accurate simulators can achieve more than 1.5 MIPS, cycle-accurate simulators, such as Analog Devices' Visual DSP for the company's ADSP-2106x DSPs (SHARC), typically perform fewer than 5000 instructions/sec. To be precise, the Win95-compatible Visual DSP simulator hits 4613 cycles/sec when running on a 300-MHz Intel Pentium II system. (Visual DSP also runs on NT and Sun workstations with Solaris or Sun OS.) Considering the amount of work that the simulator performs, this number is relatively high. Visual DSP goes beyond being a cycle-accurate simulator; it actually analyzes the processor state on the rising and falling edges of every clock. The SHARC simulator simulates the 2106x core, including the pipeline and instruction cache, the memory subsystem and associated buses, interrupts, and the I/O processor and associated peripherals. The simulator accurately handles aborted pipeline stages, cache misses, and delay cycles associated with interrupts, bus contention, and looping.

Just as with real hardware, you can set up external memory regions and define the number of wait states. The simulator's user-friendly graphical user interface (GUI) displays the fetch, decode, and execute pipe stages. If you are using the simulator to follow a program's flow and the program hits a "change of flow," the GUI displays the exact way that the CPU fills and flushes its pipeline. The GUI also allows you to display the instruction-cache contents, including all of the processor's internal cache state information.

Visual DSP allows you to automatically generate external interrupts, which the simulator examines on every clock edge. You can specify the interrupt's frequency and whether you want the simulator to randomly vary the timing of the interrupt. As with other simulators that handle external interrupts, the simulator may skew timing by one or more clocks. In other words, in a real hardware implementation, when the external interrupt hits, it takes a finite amount of time before the interrupt goes through the interrupt controller and into the CPU. With simulators, no hardware delay exists, so the simulator may process the external interrupt one or more clocks early.
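To make the interrupt-generation idea concrete, here is a rough Python sketch of a periodic interrupt source with optional random timing variation. It does not use Visual DSP's interface; the function name, its parameters, and the two-clock hardware latency are assumptions chosen for illustration.

```python
import random


def interrupt_cycles(period, count, jitter=0, seed=0):
    """Return the clock cycles at which a simulated external interrupt fires.

    period and jitter are in clock cycles; jitter=0 gives a fixed frequency.
    All names here are hypothetical, not part of any simulator's API.
    """
    rng = random.Random(seed)
    cycles = []
    for n in range(1, count + 1):
        offset = rng.randint(-jitter, jitter) if jitter else 0
        cycles.append(max(0, n * period + offset))
    return cycles


fixed = interrupt_cycles(period=100, count=3)
varied = interrupt_cycles(period=100, count=3, jitter=5, seed=42)

# Assumed latency of the real interrupt controller, in clocks. A simulator
# that omits this delay sees each interrupt that many clocks early.
HW_LATENCY = 2
hardware_view = [c + HW_LATENCY for c in fixed]
```

The difference between fixed and hardware_view is the kind of one-or-more-clock skew the paragraph above describes.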

Analog Devices' simulator also supports the DSP's DMA features and accurately reads or writes data at the correct time. The simulator recognizes when a DMA causes a cycle steal or stall. The simulator simulates the DSP's serial ports and allows you to "connect" the data lines to a file to allow the simulator to process data as if the DSP were connected to another device. Through one of the simulator's APIs, you can map an I/O location to a data file to simulate reading and writing to another I/O-mapped device in your system.

To add flexibility to its simulator, Analog Devices developed the simulation engine as a dynamic-link library, isolating it from the GUI via a series of public APIs. You can use these APIs to connect the simulator with other software models and have them exchange and synchronize signals. For example, you can connect a serial audio codec model to the "DSP's" serial port and have the codec respond as if it were in the system. This process of merging simulations can be slow, so Analog Devices is modifying the simulation engine to allow users to write modules that plug directly into the engine.

Visual DSP provides an intuitive GUI but requires you to have some experience with the SHARC processor to understand what's happening in the pipeline. With a data book in hand, you can easily program the DSP's on-chip peripherals through the simulator's interface. Although it may not be an issue for you, the simulator presents timing information only in terms of cycle counts. The tool also lacks a mechanism to control the processor's operating frequency, which matters little as long as you work in cycle counts. However, operating frequency becomes more important when you use external memory with one or more wait states.

Simulating an entire system

CARDtools Systems (computer-aided real-time design tools) offers the codesign and cosimulation NitroVP tool. You can use NitroVP to prototype and model embedded CPU applications, such as the Hitachi (Brisbane, CA) SH. This modeling process covers the processor and its on-chip peripherals, system-level peripherals and other hardware devices, the OS, and application software. The simulator supports multiprocessor designs. You can run NitroVP on a PC with Win95 or NT or on a Sun workstation.

NitroVP lets you model an entire system--at any level of abstraction, including the SH cycle-count simulator. However, because CARDtools' simulator is event-driven, it is faster than a cycle-count version. The level of abstraction that you use depends on the time you want to spend developing the models, as well as the time it takes to run the simulation. CARDtools provides mixed-mode simulation, allowing you to simulate different sections of the model in different levels of detail, independently of the other sections. Selecting the appropriate level of detail in a model is a balancing act between accuracy and performance. The greater the degree of accuracy, the slower the simulation. For example, the Hitachi cycle-accurate simulator, including memory, cache, and pipelining effects, runs at 50,000 cycles/sec on a 200-MHz Sun Ultra II workstation.
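The speed advantage of an event-driven simulator over one that steps through every cycle can be illustrated with a toy comparison. The sketch below is not NitroVP's engine; the event names and both functions are hypothetical, and "steps" simply counts loop iterations as a stand-in for simulation effort.

```python
import heapq


def cycle_stepped(events, end_cycle):
    """Visit every clock cycle; return (handled events, loop iterations)."""
    pending = sorted(events)
    handled, steps, i = [], 0, 0
    for cycle in range(end_cycle + 1):
        steps += 1
        while i < len(pending) and pending[i][0] == cycle:
            handled.append(pending[i])
            i += 1
    return handled, steps


def event_driven(events, end_cycle):
    """Jump straight from one scheduled event to the next."""
    queue = list(events)
    heapq.heapify(queue)  # earliest event first
    handled, steps = [], 0
    while queue and queue[0][0] <= end_cycle:
        steps += 1
        handled.append(heapq.heappop(queue))
    return handled, steps


# Hypothetical (cycle, name) events in a 10,000-cycle run.
events = [(10, "timer"), (500, "dma_done"), (9000, "uart_rx")]
handled_a, steps_a = cycle_stepped(events, 10000)
handled_b, steps_b = event_driven(events, 10000)
```

Both loops handle the same three events, but the cycle-stepped loop iterates 10,001 times and the event-driven loop three times. The price is that nothing between events is observed, which is the accuracy-vs-performance balance described above.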

In addition to cycle-accurate simulation, NitroVP allows you to control memory wait states, a critical factor for determining a system's performance. The CARDtools simulator is the only commercially available simulator I know of that can model the processor's and system peripherals' power-consumption. What's more, the simulator generates the power-consumption numbers on a cycle-by-cycle basis by determining the active functional units or peripherals within the processor. NitroVP can also model power consumption under various operating modes, including active, standby, and sleep.
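One way to picture cycle-by-cycle power estimation is to sum a per-unit cost for whatever is active in each cycle. The Python sketch below shows that idea only; the unit names and milliwatt figures are invented for illustration and come from no data sheet or from CARDtools.

```python
# Hypothetical per-unit power figures in milliwatts; not from any data sheet.
UNIT_POWER_MW = {"core": 120.0, "cache": 40.0, "uart": 5.0}
IDLE_POWER_MW = 2.0  # assumed baseline drawn even when nothing is active


def power_trace(activity):
    """activity: one set of active unit names per clock cycle."""
    return [IDLE_POWER_MW + sum(UNIT_POWER_MW[u] for u in units)
            for units in activity]


# Four cycles: core+cache busy, core only, everything idle, core+uart.
trace = power_trace([{"core", "cache"}, {"core"}, set(), {"core", "uart"}])
average_mw = sum(trace) / len(trace)
```

Operating modes such as standby or sleep could be modeled the same way by switching the baseline and the set of units allowed to be active.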

NitroVP allows you to model hardware devices using probabilistic inputs, such as a periodic interrupt or reading a test file as input data. You can also use a C program. For more accuracy, you can make direct calls to a Verilog/VHDL simulator or--at a slightly higher level than VHDL--you can use CARDtools' Device Behavioral Language (DBL). Although using DBL requires you to learn to program in this proprietary language, it is based on general C constructs. Likewise, you can use C code to model software tasks. Alternatively, CARDtools has developed a Task Behavioral Language (TBL) for higher level task representations.

CARDtools' DBL allows you to specify the logic and timing of a device, as well as to highlight the various device states and delays to process data. Just as with a real device, you can re-enter, reset, or lock your DBL device. The language also supports priority usage and pending critical resources, such as a system bus. DBL supports a logical representation of hardware registers, allowing the virtual system to share register information with software; your software or software model can directly read, set, or modify these hardware registers.

One of the most useful capabilities of the SH simulator is its ability to account for hardware behavior and delays, OS overhead, and memory usage. The GUI allows you to review timing issues, such as missed deadlines, starvation, and poor scheduling. A set-top-box simulation demonstrates these capabilities of the simulator (Figure 1). This simulation ran on a 45-MHz SH-3 processor. In the figure, the top section of the GUI shows the percentage of CPU usage; the most important result is that the CPU was idle 30% of the time. From this result, a system designer could then decide whether to operate the CPU at a lower frequency or leave some head room for additional system functions, such as the ability to handle high-definition TV. The lower part of the GUI shows resource usage as a function of time. The figure also shows tasks fighting for CPU resources, when a task or an interrupt-service routine (ISR) is active, and which task or ISR beat out its predecessor. You can also superimpose critical-event flags. In other examples, you can see devices interacting with the processor, as well as the software tasks.
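The frequency-vs-headroom decision above comes down to simple arithmetic. The sketch below assumes the workload needs a fixed number of CPU cycles per second regardless of clock rate, which ignores wait-state and bus effects; the function name and the 10% headroom figure are illustrative.

```python
def min_clock_mhz(clock_mhz, idle_fraction, headroom_fraction=0.0):
    """Lowest clock that still completes the same work.

    Assumes the work is a fixed cycle count per second, so memory wait
    states and bus effects that change with frequency are ignored.
    """
    busy_mhz = clock_mhz * (1.0 - idle_fraction)
    return busy_mhz / (1.0 - headroom_fraction)


# 45-MHz SH-3 that is idle 30% of the time, as in the set-top-box example.
no_margin = min_clock_mhz(45.0, 0.30)
# Same workload, but keep 10% of the CPU free for future functions.
with_margin = min_clock_mhz(45.0, 0.30, headroom_fraction=0.10)
```

Under these assumptions the workload needs about 31.5 MHz with no margin and about 35 MHz with 10% held in reserve.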

The CARDtools simulator also analyzes system-memory requirements. If you use a 27-MHz SH-2 processor in the same set-top-box example, the audio and ring buffers come close to their limits (Figure 2). Additionally, only 2% of the CPU's resources remain for overhead.

Among the simulators I reviewed, the CARDtools simulator provides the most comprehensive simulation capabilities. However, the company needs to add better online help for the DBL and TBL to make these components of the tool more user-friendly.

Motorola's new MCore

As with CARDtools' simulator, Software Development Systems' (SDS) SingleStep simulators let you integrate with hardware and software codesign and cosimulation environments, in this case EagleI from Viewlogic (Marlborough, MA, www.viewlogic.com) and V-CPU from Simulation Technologies (New Brighton, MN, www.winternet.com/~simtech). This integration allows you to verify the hardware and software integration at the system level.

Although SingleStep supports Motorola's PowerPC, 68K, and MCore, only the MCore and some 68K simulators are cycle-accurate. The MCore simulator models the entire instruction pipeline, including the prefetch, decode, and execution units. The MCore simulator also accurately determines the clock cycles for each memory access. This simulator generates more accurate simulations when multiple devices, such as the CPU and a DRAM controller, are contending for the memory bus.

SingleStep's GUI displays runtime information, including each instruction as it executes, all logical memory accesses, address translations, physical-bus cycles, and even instruction timing. Although not applicable to MCore because MCore has no cache, SingleStep also displays cache hits and misses. The simulator provides direct control of the simulated caches, including the ability to selectively enable and disable caches, load and lock memory ranges into cache, flush all dirty cache lines, and display cache statistics. Direct cache control allows you to determine the effects of locking various code or data segments into cache before implementing cache-management routines at the software level.

SDS also offers a peripheral-adaptation kit with an API that developers can use to create and model virtual peripherals to hook into the simulator. Using C functions or C++ classes, you can model I/O devices of any level of abstraction, ranging from a high-level functional description to a detailed register-level interface. SDS chose the C/C++ strategy to maximize flexibility and performance and to avoid exposing software developers to the complexities of VHDL and Verilog models. The peripheral API also includes provisions to link simulated peripherals to real hardware through a memory-mapping scheme. SDS is developing a modeling-extension kit for its simulation environment that will allow the simulator to model internal and external processor buses, bus arbitration and timing for multiple processors, and DMA units and other coprocessors. The kit will also provide a higher level of integration with commercial hardware/software coverification tools.
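A virtual peripheral hooked into a simulator usually amounts to a pair of read and write callbacks bound to an address range. The sketch below shows that pattern in Python rather than the C/C++ that SDS uses, and it does not reproduce the SDS API; the Bus and Timer classes, the register layout, and the addresses are all hypothetical.

```python
class Bus:
    """Routes accesses to an attached device or to plain simulated RAM."""

    def __init__(self):
        self.ram = {}
        self.devices = []  # list of (base address, size, device)

    def attach(self, base, size, device):
        self.devices.append((base, size, device))

    def _find(self, addr):
        for base, size, dev in self.devices:
            if base <= addr < base + size:
                return dev, addr - base  # device plus register offset
        return None, addr

    def read(self, addr):
        dev, off = self._find(addr)
        return dev.read(off) if dev else self.ram.get(addr, 0)

    def write(self, addr, value):
        dev, off = self._find(addr)
        if dev:
            dev.write(off, value)
        else:
            self.ram[addr] = value


class Timer:
    """Hypothetical device: register 0 = count, register 1 = control."""

    def __init__(self):
        self.count = 0

    def tick(self):  # called by the simulator once per timer clock
        self.count += 1

    def read(self, offset):
        return self.count if offset == 0 else 0

    def write(self, offset, value):
        if offset == 1 and value == 1:  # writing 1 to control resets
            self.count = 0


bus = Bus()
timer = Timer()
bus.attach(0x4000, 2, timer)
for _ in range(3):
    timer.tick()
count_before = bus.read(0x4000)
bus.write(0x4001, 1)  # reset through the control register
count_after = bus.read(0x4000)
```

The Timer here is a register-level model; a higher-level functional model would keep the same read/write interface and simply do less work behind it.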

SingleStep has a well-organized GUI with pulldown windows that help make the simulator easier to use. However, if you want to unleash all the simulator's capabilities, you need to use SingleStep's command window, which requires Unix-like commands. High performance is one of SingleStep's strong points. On a 300-MHz Pentium II system, the simulator demonstrated 1.65 MIPS; however, I was using a simple piece of code, and no virtual peripherals were attached.

When accuracy is enough

For instruction-accurate simulators, I tried Microchip's MPLAB-SIM and Tasking's Power Package. MPLAB-SIM is a C++ model that provides instruction-accurate simulation for the PIC architecture. In addition to simulating the PIC's core functions, MPLAB-SIM supports most of Microchip's on-chip peripherals. However, be careful in your performance analysis because the on-chip peripherals and interrupts function only on instruction-cycle boundaries, in contrast to real hardware in which the peripherals and interrupts operate asynchronously to the core. The simulator can detect events after the first cycle of instructions that have two or more cycles.

At any instruction-cycle boundary, you can provide synchronous or asynchronous stimuli to any of the pins. In MPLAB-SIM's asynchronous dialogue box, you can assign input functions on 12 pins (Figure 3). The simulator provides a file-injection capability that allows you to simulate random events, such as A/D conversions. MPLAB-SIM does not support memory-wait-state configurations, but, because only the PIC 17cxx devices have external memory, the lack of wait-state configuration is not an issue (especially because the PIC 17cxx devices run at a maximum of 20 MHz, making it fairly easy to assume zero-wait-state memory). Another limitation of the simulator, common to other instruction-level simulators, is that peripherals may have limited accuracy. For example, the PWM, which can generate pulses with 10-bit resolution on real hardware, has only 8-bit resolution on the simulator because a finer resolution is meaningless if the simulator can change pin states only on instruction-cycle boundaries. Basically, MPLAB-SIM is a useful simulator for analyzing the functionality of your program, and you can't beat the price: It's free.
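The drop from 10 bits to 8 bits is consistent with simple arithmetic, under two assumptions the article does not state: that the hardware resolves PWM edges in oscillator clocks, and that one instruction cycle spans four oscillator clocks. The function name below is illustrative.

```python
def effective_pwm_bits(hardware_bits, clocks_per_instruction_cycle):
    """Resolution left when edges can move only on instruction-cycle boundaries.

    Assumes the hardware resolves edges to one oscillator clock and that
    the ratio of clocks per instruction cycle is a power of two.
    """
    steps = (2 ** hardware_bits) // clocks_per_instruction_cycle
    return steps.bit_length() - 1  # integer log2 of the step count


# 10-bit hardware PWM, four oscillator clocks per instruction cycle (assumed).
simulated_bits = effective_pwm_bits(10, 4)
```

With these assumptions, 1024 hardware steps collapse to 256 distinguishable steps, which is 8 bits.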

Tasking's Power Package supports Siemens' (Cupertino, CA) new Tricore architecture and, like most of the other companies' tools, is a complete development environment with an instruction-accurate simulator. Like MPLAB-SIM, this simulator is useful for analyzing the basic functionality of your program. But, unlike the PIC architecture, the complexity of the Tricore architecture makes it difficult for an instruction-accurate simulator to accurately model pipeline effects, code branches, and interrupts. Although the simulator supports variations in the Tricore's memory-wait-state configuration, it doesn't allow you to control the external-memory width; the architecture specification stipulates the datapaths and instruction paths. To study the effects of memory wait states, I ran the Dhrystone MIPS benchmark on the Tricore simulator. Even with the cache enabled, the simulator showed a large performance impact when changing from one to three wait states; the MIPS rating went from 137 to 109. I couldn't determine or change the processor's operating frequency to affect the results of this benchmark. Siemens is also working on a cycle-accurate simulator, which it claims will be available this year. The new simulator will implement a flexible cache model that gives you options such as defining start and end addresses, the number of ways and lines, and the line size and banks. This simulator also includes branch-prediction logic and interrupt-latency determination.
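To put the wait-state result in proportion, the two Dhrystone ratings quoted above can be turned into a relative drop. The helper name is illustrative; the figures are the ones reported in this article.

```python
def relative_drop(before, after):
    """Fractional loss going from one rating to another."""
    return (before - after) / before


# Dhrystone MIPS with one wait state vs three wait states.
drop = relative_drop(137, 109)
percent = round(drop * 100, 1)
```

That works out to a loss of roughly one-fifth of the rated performance for two additional wait states.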

Simulating a simulator

Intel's Visual Tuning Environment, VTune, differs from the other simulators because it runs on the processor that it is simulating--well, sort of. Actually, VTune runs on Windows 95 or NT systems, and it can simulate Intel processors, including 486, Pentium, Pentium with multimedia-extension instructions, Pentium Pro, and Pentium II.

VTune operates at three levels: sampling, simulation, and Java program-call-graph generation. In the sampling level, VTune periodically interrupts the processor and saves the current execution address in a buffer. The purpose of this part of VTune is to provide a high-level profile of the CPU usage of every function, including the operating-system modules. Before VTune begins its sampling operation, it makes a call into the OS to build a table of every function and its memory range. When VTune is finished sampling, it matches the sampled addresses with the table's information and displays the information in a bar graph. You can control VTune's sampling rate, but the minimum granularity is 1 msec--too coarse for any embedded application but more than enough in most PC applications. Besides, sampling too frequently intrudes on the application.
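The matching step in a sampling profiler, sampled addresses against a table of function ranges, can be sketched as a sorted lookup. This is a generic illustration rather than VTune's implementation; the function table, addresses, and names are hypothetical.

```python
import bisect
from collections import Counter

# Hypothetical function table: (start address, end address exclusive, name).
FUNCTIONS = [
    (0x1000, 0x1100, "main"),
    (0x1100, 0x1400, "fir_filter"),
    (0x2000, 0x2080, "isr_timer"),
]


def profile(samples, functions=FUNCTIONS):
    """Count how many sampled addresses fall inside each function."""
    table = sorted(functions)
    starts = [entry[0] for entry in table]
    hits = Counter()
    for addr in samples:
        # Index of the last function whose start is at or below addr.
        i = bisect.bisect_right(starts, addr) - 1
        if i >= 0 and addr < table[i][1]:
            hits[table[i][2]] += 1
        else:
            hits["<unknown>"] += 1  # address lies outside every known range
    return hits


samples = [0x1004, 0x1150, 0x1200, 0x13FF, 0x2010, 0x1800]
hits = profile(samples)
```

The resulting counts are what a bar graph of per-function CPU usage would plot; more samples give a finer picture at the cost of more intrusion.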

If you double-click on any bar in the profile bar graph, VTune shows a detailed profile of "hot spots" in the function you selected. Double-clicking on a hot spot takes you into the source or disassembled code associated with that hot spot. You can now run a simulation, or dynamic analysis, on that hot spot. The simulator accurately models the entire CPU, including its pipelines, the instruction and data caches, and the branch-predictor circuitry.

One of VTune's best features is its simulation-analysis capability. After it runs the simulation, VTune displays the number of clocks each instruction took to execute (Figure 4). In its analysis, VTune also shows you all penalties, warnings, and pipeline-pairing issues associated with each instruction. This analysis allows you to tune your code to keep pipelines full, minimize cache misses, and minimize other types of operation stalls.

VTune has excellent online help almost anywhere on VTune's displays. Online help not only shows you how to use VTune, but also provides detailed explanations for each penalty, warning, and pairing issue that you click on. One weakness of the simulator is that it doesn't model system behavior. For example, you can't configure the memory model, which prevents you from analyzing the performance differences associated with DRAM types. VTune sells for $279 (retail), but it's generally free to software developers.

Processor simulators are an evolving technology. But every processor vendor is heading toward the same goal: simulators that behave like real devices. And in the latest trend, vendors are offering or developing simulators to represent the entire system. It is reasonable to assume that simulators could replace in-circuit emulators. It also seems reasonable that simulators should replace hardware-development and -evaluation boards, especially when you consider that simulators let you easily change parameters, such as memory wait states or processor frequency.

However, simulators just can't handle some system events, such as the precise timing of external signals. Furthermore, a simulator will never be able to guarantee the outcome of race events, as when two or more signals simultaneously come into the processor. Or consider how a simulator handles a situation in which the CPU and the DRAM controller simultaneously request the bus. Also, how does a simulator handle the effects of DRAM refresh cycles on execution speeds? Signal timing between asynchronous events, such as the assertion of an interrupt-request signal, may also vary, depending on when the CPU samples the event. Even a difference of a single clock cycle when the processor recognizes an interrupt request may impact the timing if, during that one clock cycle, the processor masked out the interrupt. So, as good as the simulation technology gets, remember, there ain't nothing like the real thing, baby!


Acknowledgments

Thanks to my simulator driving instructors: Gary Carlton of Intel, Darrell Johansen of Microchip, Mark Pauna of Software Development Systems, Joseph Rothman of CARDtools, Thomas Schaer of Siemens, and Greg Yukna of Analog Devices.


  • Most DSP and µP simulators fit into one of two categories: instruction-accurate and cycle-accurate.

  • Some µP simulators can model an entire system. CARDtools' NitroVP is a primary example of this type of simulator.

  • Many cycle-accurate simulators are good enough to produce reliable benchmark results.

  • Today's simulators are fast. You no longer have to start your simulations before a long weekend.

Do simulators meet the needs of performance benchmarks?

Processor vendors have long used their proprietary cycle-accurate simulators to measure the presilicon and postsilicon performance of their architectures and devices. These simulators allow the vendors to run benchmarks, such as Dhrystone MIPS, as well as an unlimited number of proprietary benchmarks. Because the vendors use or have accurately modeled these simulators after the vendors' Verilog or VHDL models, the companies can confidently display their benchmark results. Furthermore, as more µP vendors produce custom devices, benchmarking on simulators is becoming the only choice.

One problem is that a µP simulator is only as accurate as the software program that runs it; a simulator does not deal with precise hardware-timing issues. Another problem with running benchmarks on simulators is the benchmark itself, especially in the market for embedded systems. None of the benchmarks model real-world applications. A third problem with proprietary simulators is that processor customers cannot compare apples with apples because they don't know the system-level conditions a vendor used to produce results.

To address some of these problems, I organized the EDN Embedded Microprocessor Benchmark Consortium (EEMBC). EEMBC's primary goal is to develop real-world benchmarks with precise rules for reporting results. These benchmarks, currently under development, comprise suites of tests. From a high-level perspective, these benchmark suites encompass applications in the automotive/industrial, consumer, networking, office automation, and telecommunication industries. Within each suite, individual tests measure one or more processor functions, allowing you to determine which functions are appropriate for your application. For each test, vendors must report runtime characteristics that include compiler versions and switches, processor-clock and bus speed, wait states, and cache size. Furthermore, the vendors must clearly document any code changes that they implemented to improve the benchmark performance; this documentation ensures that the exact test is repeatable and unbiased.

Manufacturers of simulators

When you contact any of the following manufacturers directly, please let them know you read about their products on EDN's website.
Analog Devices
Norwood, MA
1-617-329-4700
www.analog.com

CARDtools Systems
San Jose, CA
1-408-894-9500
www.cardtools.com

Intel Literature Center
Mount Prospect, IL
1-800-548-4725
www.intel.com

Lucent Technologies
Allentown, PA
1-800-372-2447
www.lucent.com/micro

Microchip Technology Inc
Chandler, AZ
1-602-786-7668
www.microchip.com

Software Development Systems
Oak Brook, IL
1-630-368-0400
www.sdsi.com

Tasking
Dedham, MA
1-617-320-9400
www.tasking.com


You can reach Technical Editor Markus Levy at 1-916-939-1642, markus.levy@worldnet.att.net.




Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.