EDN Access

 

September 1, 1997


Fine-tune embedded-system caches
using trace analysis

Ed Rocha, VLSI Technology

Trace analysis helps you to optimize code and cache configurations. You can also use lists of memory addresses, or traces, to develop 3-D locality surfaces that provide further insight into a system's performance.

Cache memory, a routine system component in general-purpose computers, is appropriate in embedded-control applications when the cost of SRAM exceeds a fixed percentage of the system hardware cost. Unlike engineers who design general-purpose computers, em-bedded-system developers don't have to design their caches for general cases. You can either tailor your cache's characteristics--size, update policy, associativity, block size, and write-buffer length--for your application's code or fine-tune the code to take advantage of the cache you've designed.

Trace analysis can help to optimize code and cache configurations by giving you a handle on both global-system performance and cache behavior. Traces are lists of memory addresses that the CPU references as it executes blocks of code. If you are using a stand-alone processor in your embedded design, your development tools should be able to provide similar data for the examples that follow. These examples use a trace-driven cache-memory simulator for embedded ARM processors developed by the Department of Information Engineering at the University of Pisa, Italy. (This simulator is available for embedded ARM cores in VLSI Technology ASICs as ChARM, a cache simulator integrated into the JumpStart development environment.)

The characteristics you need to know to tweak your cache or code comprise execution time, including waiting times; bus traffic for memory accesses, such as the number of read-block operations plus the number of write operations for a write-through cache or the number of update-block operations for a copy-back policy; and cache activity in miss ratios (cumulative cold miss, code miss, data miss, read miss, write miss, data-read miss, and data-write miss ratios).

18ms1261The ARM cache simulator is based on a generic cache (Figure 1) and allows the following parameters:

  • Size: 1 kbyte to 1 Mbyte;

  • Mapping: direct-access, set-associative, full-associative;

  • Replacement algorithm: LRU, FIFO, random;

  • Update policy: write-through or copy-back;

  • Block size: 4 to 256 bytes;

  • Number of blocks per set: one to 64; and

  • Write buffer length: 0 to 8.

Based on the data-bus width and a list of system timing characteristics, the simulator analyzes operations, including cache read, cache write, read-block (fetch a block from main memory in the event of a cache miss), and update-block (for caches with a copy-back update policy, to update a memory block if the copy in cache is dirty and needs replacing). Besides the processor's sequential and nonsequential read and write times, the simulator also accounts for additional delays that memory and I/O devices cause.

You base the simulation on a trace for which you can specify the position of the first memory-address reference and the number of memory references you want the simulation to consider. You can vary one or two characteristics and, if appropriate, graph the simulation results with either variable as the abscissa. Thus, you can show the miss ratio or the execution time as a function of block or cache size with bus width as the parameter.

In addition to 2-D graphing of traditional program statistics, this simulator produces results that you can use to produce 3-D locality-surface plots. Using these plots, you can infer important characteristics about spatial locality, which is the probability of referencing a nearby address-space location in the future, and temporal locality, which is the probability of referencing the same location in the future.

In the examples that follow, the general system configuration comprises a 20-MHz ARM6 core, a 32-bit system bus, and 1 Mbyte of RAM. Table 1 summarizes the timing for all of the following simulations.

Video compression

The first example involves compressing a 2273149-pixel image from a 101-kbyte PPM file to a 5-kbyte JPEG file. The application program comprises 8452 lines of code that occupy 770 kbytes. (This program, called cjpeg, is available from the Independent JPEG Group at ftp.uu.net:/graphicals/jpeg/.) The trace follows the execution of 611,274 instructions. Of 1 million memory references, 76.7% refer to executable op codes and 23.3% refer to data. Of those referring to data, 7.4% are the result of a write to memory.

18MS1262Performance-analysis results of miss ratio and execution time vs block size and cache size for two cache configurations make several things obvious (Figure 2). For example, although the more complex cache speeds execution times, execution times as a function of block size generally converge toward a 32-bit cache size. You gain nothing from using a larger cache. (For these curves, block size varies from 8 to 64 bytes, and cache size varies from 4 to 64 kbytes. The two cache configurations are write-through, direct-access with no write buffer and copy-back, two-way set-associative with a two-deep write buffer.)

When you examine the simulation more closely, the results provide even more information. Suppose this is a simulation of a real-time embedded application, in which this code module has to execute in 150 msec. To meet this requirement, a cache is essential. A simulation of the trace with no cache shows that the code takes more than 500 msec to execute. Given that a cache is necessary, 14 cache configurations deliver execution times lower than 150 msec (Table 2). The 16-bit direct-access write-through configuration is simple and minimizes memory requirements.

Further simulations also can help to determine whether a cache configuration allows you to use less expensive memory. For this example, adding delay time and rerunning the simulation reveals that an additional 10 msec still allows 150-msec execution time for the write-through configuration. However, any additional delay beyond 10 msec causes the system to exceed the time budget.

Audio decompression

A second example uses code for decompressing an audio signal from a 6-kbyte adaptive differential pulse-code modulation (ADPCM) format to 24 kbytes of 16-bit PCM format. The sound sample is a male voice saying "hello world." (J Jansen at the Center for Mathematics and Computer Science, the Netherlands, developed this example. Compression and decompression files--rawcaudio and rawdaudio--are available at ftp.cwi.nl:/pub/adpcm.shar.) The 152 lines of code occupy 24 kbytes. A trace of 471,422 instructions involves 782,240 memory references, of which code accounts for 84% and data accounts for 16%. Writes constitute 7.1% of data. For this example, the time budget for code execution is 90 msec.

In this example, only eight cache configurations allow the system to meet the execution-time criterion (Table 3). Only a copy-back update policy is possible for this example's execution-time requirements. Although the traces in this example and the previous JPEG-compression example show similar write ratios, the miss ratio in this example is lower. This lower ratio increases the impact of write operations on global performance, which in turn means that update policy exerts a greater effect on execution time.

Locality-surface plots

18M1263BThe locality-surface tool provides an intuitive understanding of the program locality and how this locality affects cache and system performance. For the two examples, the locality surfaces (Figure 3) are instructive, because they confirm that the analyzed cache sizes are satisfactory. However, the surface plots provide no additional design insights.

You can develop a locality surface from a trace by creating a spreadsheet table in which each cell represents the number of references for a combination of stride and distance. You transfer information from the trace to an Excel spreadsheet; surface plotting is a feature of Excel.

Stride (s) quantifies the nearness of two references. If a trace shows that a reference to location fff6 follows a reference to location fff4, the stride is equal to 2. For a reference to location fff4 followed by a reference to fff0, the stride is equal to ­4. Stride varies from +256 to ­256 (Figure 3).

Distance (d) is the number of intervening references between a pair of references to the same location. For example, if the CPU references location fff4 and then references it again after three other references, the distance between the two references to fff4 is 4. The vertical axis of the locality surface is the percent of memory references.

The locality surface shows sequential accesses to the same address as ridges along the diagonal line where s=d. Loops appear as ridges parallel to the stride axis where ­d<s<0. The goal is to access cache as much as possible, rather than main memory. Conversely, a series of references with a fixed step produces "striding," which appears as ridges in the region where 0<s<+d. Striding is typical of matrix operations in which the CPU addresses elements in row rather than column order. Ridges parallel to the distance axis, along the line where stride equals 0 show the distribution of repeated accesses to the same address. Sequential accesses appear along the diagonal where s=d.

The locality surfaces for these JPEG and audio examples generally show no problems with the analyzed cache sizes. However, the plots help to illustrate that the amplitude of the diagonal ridge defined by s=d indicates a decreasing frequency of consecutive instruction fetches as the segment gets longer. If the ridge along this diagonal had a peak associated with higher values of consecutive instruction fetches and the design uses only a small cache, you should rewrite the loop to make sure that all references were brought into the cache. A peak farther out on the stride axis indicates a long string of successive accesses.

The temporality ridge in the JPEG example also shows accesses to the same location after 256 references on average. This spacing could affect performance, depending on associativity, if you picked a mapping that results in overwriting those locations before the CPU references those locations again.


Table 1--Simulation timing
System
component
Timing characteristic Symbol Value
(nsec)

Processor

M-clock high TCKH 25
M-clock low TCKL 25
Address setup TAC 14
Data-in setup TDIS 0

Bus

Read TREAD 200
Write TWRITE 200
Sequential read TSREAD 160
Sequential write TSWRITE 160

Memory

Additional delay required for a read TDELAY READ 200
Additional delay required for a read TDELAY WRITE 100
Additional delay required for a sequential read TDELAY SREAD 180
Additional delay required for a sequential write TDELAY SWRITE 20

Cache

Write buffer TWB 20
Block read TRD 35
Block write TWD 40
Tag access TTAG 20
Comparison TCMP 10
Start op code downstream after miss TORQ 10
Table 2--Cache configurations for less-than-150-msec execution
(JPEG-compression example)
Mapping Update
policy
Cache size
(kbytes)
Block size
(bytes)
Write-buffer
length
Execution time
(msec)

Direct access

Write-through 16 16 0 147.95
32 64 0 149.59
16 32 2 148.32
32 64 2 145.25
Copy-back 8 16 0 149.73
16 64 0 148.66
8 16 2 149.24
16 64 2 148.57
Two-way Write-through 16 32 0 146.08
8 8 2 147.87
16 64 2 145.91
Copy-back 8 32 0 146.05
4 8 2 148.34
8 32 2 142.21
Table 3--Cache configurations for less-than-90-msec execution
(audio-decompression example)
Mapping Update
policy
Cache
size
(kbytes)
Block
size
(bytes)
Write-buffer
length
Execution
time
(msec)
Maximum
additional
delay
(msec)
Direct access Copy-back 8 8 0 88.97 40
16 16 0 86.4 260
8 8 2 88.75 60
16 16 2 86.34 260
Two-way 4 16 0 88.96 40
8 64 0 88.62 50
4 16 2 88.86 50
8 64 2 88.61 60

18M126AU

Author's biography

Ed Rocha is a marketing engineer for VLSI Technology Inc (Tempe, AZ) where he has worked for two years. He holds a BSEE from California State Polytechnic University--Pomona.


| EDN Access | Feedback | Table of Contents |


Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.