Memory Hierarchy Design - Part 2
2.2 Ten Advanced Optimizations of Cache Performance
The average memory access time formula above gives us three metrics for cache optimizations: hit time, miss rate, and miss penalty. Given the recent trends, we add cache bandwidth and power consumption to this list. We can classify the ten advanced cache optimizations we examine into five categories based on these metrics:
- Reducing the hit time - Small and simple first-level caches and way-prediction. Both techniques also generally decrease power consumption.
- Increasing cache bandwidth - Pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.
- Reducing the miss penalty - Critical word first and merging write buffers. These optimizations have little impact on power.
- Reducing the miss rate - Compiler optimizations. Obviously any improvement at compile time improves power consumption.
- Reducing the miss penalty or miss rate via parallelism - Hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily due to prefetched data that are unused.
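To make the three base metrics concrete, here is a small sketch of the average memory access time computation for a two-level hierarchy. The numeric parameters (1-cycle L1 hit, 2% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, 100-cycle memory access) are illustrative assumptions, not figures from the text:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed parameters for illustration only.
l2_amat = amat(10, 0.20, 100)     # effective L2 access time: 10 + 0.20 * 100 = 30 cycles
l1_amat = amat(1, 0.02, l2_amat)  # overall: 1 + 0.02 * 30 = 1.6 cycles
```

The calculation shows why all three terms matter: halving the L1 miss rate or halving the L2 penalty each shaves the same 0.3 cycles off the average here, while the optimizations below also weigh bandwidth and power, which this simple formula does not capture.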
In general, the hardware complexity increases as we go through these optimizations. In addition, several of the optimizations require sophisticated compiler technology. We will conclude with a summary of the implementation complexity and the performance benefits of the ten techniques presented in Figure 2.11 on page 96. Since some of these are straightforward, we cover them briefly; others require more description.
First Optimization: Small and Simple First-Level Caches to Reduce Hit Time and Power
The pressure of both a fast clock cycle and power limitations encourages limited size for first-level caches. Similarly, use of lower levels of associativity can reduce both hit time and power, although such trade-offs are more complex than those involving size.
The critical timing path in a cache hit is the three-step process of addressing the tag memory using the index portion of the address, comparing the read tag value to the tag portion of the address, and setting the multiplexor to choose the correct data item if the cache is set associative. Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time. Furthermore, lower levels of associativity will usually reduce power because fewer cache lines must be accessed.
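The index and tag fields that drive this timing path come directly from the address bits. A minimal sketch of the decomposition for a direct-mapped cache, using assumed parameters (32 KiB capacity, 64-byte blocks; the names and sizes are illustrative, not from the text):

```python
CACHE_SIZE = 32 * 1024               # bytes (assumed)
BLOCK_SIZE = 64                      # bytes (assumed)
NUM_SETS = CACHE_SIZE // BLOCK_SIZE  # direct-mapped: one block per set -> 512 sets

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1  # log2(64) = 6
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2(512) = 9

def split_address(addr):
    """Split an address into (tag, index, offset).

    The index selects the set whose stored tag is compared against the
    tag field; the offset selects the word within the block.
    """
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Because the index alone identifies the one candidate block, a direct-mapped cache can begin reading the data while the tag comparison proceeds; a set-associative cache must wait for the comparison to steer the output multiplexor, which is exactly the extra step on the critical path described above.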
Although the total amount of on-chip cache has increased dramatically with new generations of microprocessors, the size of the L1 caches has recently increased either slightly or not at all, because a larger L1 cache would lengthen the clock cycle. In many recent processors, designers have opted for more associativity rather than larger caches. An additional consideration in choosing the associativity is the possibility of eliminating address aliases; we discuss this shortly.