|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
September 1, 1997 Fine-tune
embedded-system caches Ed Rocha, VLSI Technology Trace analysis helps you to optimize code and cache configurations. You can also use lists of memory addresses, or traces, to develop 3-D locality surfaces that provide further insight into a system's performance. Cache memory, a routine system component in general-purpose computers, is appropriate in embedded-control applications when the cost of SRAM exceeds a fixed percentage of the system hardware cost. Unlike engineers who design general-purpose computers, em-bedded-system developers don't have to design their caches for general cases. You can either tailor your cache's characteristics--size, update policy, associativity, block size, and write-buffer length--for your application's code or fine-tune the code to take advantage of the cache you've designed. Trace analysis can help to optimize code and cache configurations by giving you a handle on both global-system performance and cache behavior. Traces are lists of memory addresses that the CPU references as it executes blocks of code. If you are using a stand-alone processor in your embedded design, your development tools should be able to provide similar data for the examples that follow. These examples use a trace-driven cache-memory simulator for embedded ARM processors developed by the Department of Information Engineering at the University of Pisa, Italy. (This simulator is available for embedded ARM cores in VLSI Technology ASICs as ChARM, a cache simulator integrated into the JumpStart development environment.) The characteristics you need to know to tweak your cache or code comprise execution time, including waiting times; bus traffic for memory accesses, such as the number of read-block operations plus the number of write operations for a write-through cache or the number of update-block operations for a copy-back policy; and cache activity in miss ratios (cumulative cold miss, code miss, data miss, read miss, write miss, data-read miss, and data-write miss ratios).
Based on the data-bus width and a list of system timing characteristics, the simulator analyzes operations, including cache read, cache write, read-block (fetch a block from main memory in the event of a cache miss), and update-block (for caches with a copy-back update policy, to update a memory block if the copy in cache is dirty and needs replacing). Besides the processor's sequential and nonsequential read and write times, the simulator also accounts for additional delays that memory and I/O devices cause. You base the simulation on a trace for which you can specify the position of the first memory-address reference and the number of memory references you want the simulation to consider. You can vary one or two characteristics and, if appropriate, graph the simulation results with either variable as the abscissa. Thus, you can show the miss ratio or the execution time as a function of block or cache size with bus width as the parameter. In addition to 2-D graphing of traditional program statistics, this simulator produces results that you can use to produce 3-D locality-surface plots. Using these plots, you can infer important characteristics about spatial locality, which is the probability of referencing a nearby address-space location in the future, and temporal locality, which is the probability of referencing the same location in the future. In the examples that follow, the general system configuration comprises a 20-MHz ARM6 core, a 32-bit system bus, and 1 Mbyte of RAM. Table 1 summarizes the timing for all of the following simulations. Video compression The first example involves compressing a 2273149-pixel image from a 101-kbyte PPM file to a 5-kbyte JPEG file. The application program comprises 8452 lines of code that occupy 770 kbytes. (This program, called cjpeg, is available from the Independent JPEG Group at ftp.uu.net:/graphicals/jpeg/.) The trace follows the execution of 611,274 instructions. Of 1 million memory references, 76.7% refer to executable op codes and 23.3% refer to data. Of those referring to data, 7.4% are the result of a write to memory.
When you examine the simulation more closely, the results provide even more information. Suppose this is a simulation of a real-time embedded application, in which this code module has to execute in 150 msec. To meet this requirement, a cache is essential. A simulation of the trace with no cache shows that the code takes more than 500 msec to execute. Given that a cache is necessary, 14 cache configurations deliver execution times lower than 150 msec (Table 2). The 16-bit direct-access write-through configuration is simple and minimizes memory requirements. Further simulations also can help to determine whether a cache configuration allows you to use less expensive memory. For this example, adding delay time and rerunning the simulation reveals that an additional 10 msec still allows 150-msec execution time for the write-through configuration. However, any additional delay beyond 10 msec causes the system to exceed the time budget. Audio decompression A second example uses code for decompressing an audio signal from a 6-kbyte adaptive differential pulse-code modulation (ADPCM) format to 24 kbytes of 16-bit PCM format. The sound sample is a male voice saying "hello world." (J Jansen at the Center for Mathematics and Computer Science, the Netherlands, developed this example. Compression and decompression files--rawcaudio and rawdaudio--are available at ftp.cwi.nl:/pub/adpcm.shar.) The 152 lines of code occupy 24 kbytes. A trace of 471,422 instructions involves 782,240 memory references, of which code accounts for 84% and data accounts for 16%. Writes constitute 7.1% of data. For this example, the time budget for code execution is 90 msec. In this example, only eight cache configurations allow the system to meet the execution-time criterion (Table 3). Only a copy-back update policy is possible for this example's execution-time requirements. Although the traces in this example and the previous JPEG-compression example show similar write ratios, the miss ratio in this example is lower. This lower ratio increases the impact of write operations on global performance, which in turn means that update policy exerts a greater effect on execution time. Locality-surface plots
You can develop a locality surface from a trace by creating a spreadsheet table in which each cell represents the number of references for a combination of stride and distance. You transfer information from the trace to an Excel spreadsheet; surface plotting is a feature of Excel. Stride (s) quantifies the nearness of two references. If a trace shows that a reference to location fff6 follows a reference to location fff4, the stride is equal to 2. For a reference to location fff4 followed by a reference to fff0, the stride is equal to 4. Stride varies from +256 to 256 (Figure 3). Distance (d) is the number of intervening references between a pair of references to the same location. For example, if the CPU references location fff4 and then references it again after three other references, the distance between the two references to fff4 is 4. The vertical axis of the locality surface is the percent of memory references. The locality surface shows sequential accesses to the same address as ridges along the diagonal line where s=d. Loops appear as ridges parallel to the stride axis where d<s<0. The goal is to access cache as much as possible, rather than main memory. Conversely, a series of references with a fixed step produces "striding," which appears as ridges in the region where 0<s<+d. Striding is typical of matrix operations in which the CPU addresses elements in row rather than column order. Ridges parallel to the distance axis, along the line where stride equals 0 show the distribution of repeated accesses to the same address. Sequential accesses appear along the diagonal where s=d. The locality surfaces for these JPEG and audio examples generally show no problems with the analyzed cache sizes. However, the plots help to illustrate that the amplitude of the diagonal ridge defined by s=d indicates a decreasing frequency of consecutive instruction fetches as the segment gets longer. If the ridge along this diagonal had a peak associated with higher values of consecutive instruction fetches and the design uses only a small cache, you should rewrite the loop to make sure that all references were brought into the cache. A peak farther out on the stride axis indicates a long string of successive accesses. The temporality ridge in the JPEG example also shows accesses to the same location after 256 references on average. This spacing could affect performance, depending on associativity, if you picked a mapping that results in overwriting those locations before the CPU references those locations again. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| EDN Access | Feedback | Table of Contents | |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||