Exploring memory architectures: pillars of processing performance

Processor-based systems rely on multiple, heterogeneous memory subsystems to deliver better system performance, power, and cost efficiencies.

By Robert Cravotta, Technical Editor -- EDN, June 21, 2007

AT A GLANCE
Memory subsystems and software can best impact a processor by merely preserving its theoretical maximum performance.
The processor's architecture is a first-order driver of the options available to design the memory subsystem.
A processor's tolerance of latency is a balance of the speed, cost, and power consumption of implementing a hierarchy of fast and expensive or slower and cheaper memories.
Preserving the mechanisms that provide tolerance of memory-access latency is still a mostly manual exercise for developers.
Sidebars:
Ease of use
Multiple options

Related Articles:
For an article by Senior Technical Editor Brian Dipert about memory see Banish bad memories.

For a related article about allocating memory and dealing with fragmentation visit Handling memory fragmentation.

For an article on handset-memory considerations see Microelectronics applications require the right stacked-memory-packaging architecture.

A key performance characteristic of processor architectures is how much application-specific work they can perform per unit of time. The EEMBC (Embedded Microprocessor Benchmark Consortium) benchmark, unlike Dhrystone MIPS (millions-of-instructions-per-second) scores, describes the performance of processors executing tasks in embedded-system applications. Version 1.0 of the EEMBC benchmarks does not capture the system-level influences, such as the memory subsystems, of processing performance because the benchmarks can often run from within the processor’s L1 cache. However, EEMBC’s second-generation, system-level benchmarks for networking and digital entertainment more realistically stress even those processors with large cache memories.

It is increasingly important to consider the system-level impact of the memory subsystems of a processor because the types and sizes of the memories and access methods in the system define the upper limit of a processor core’s performance. According to Gerard Williams III, a fellow at the processor division of ARM, a processor with an ideal memory system never misses in the cache and has ideal access to the bus. Chip designers must first understand the processor’s IPC (instructions-per-cycle) capability and then try to implement a memory architecture that minimizes the performance loss. This performance loss can result from caching or memory-access effects, such as miss rate due to capacity misses, cache size, or conflict misses.

A well-matched memory subsystem can in the best case merely preserve the processor’s maximum IPC rate, whereas a mismatched memory subsystem can drastically reduce the processor’s performance by starving and idling the core’s execution units. Building and implementing memory subsystems that do not adversely affect the processor core’s performance continues to be challenging because the performance gap between processor logic and the memories is widening with each process-technology reduction. In essence, at each process-technology step, the improvement in memory-access latency (the time it takes to receive the first bit of a memory request) is smaller than the commensurate clock-rate improvement of the processor’s core logic.

Likewise, the best performance impact software developers can accomplish with insightful placement of program instructions and data within the memory subsystem is to preserve the processor’s maximum IPC rate. However, mismatching the placement of program instructions and data in the memory subsystem with the application’s usage scenario can significantly degrade the processor’s performance. Freescale’s application note on preventing M1 memory contention provides an example that exhibits a worst-case 54% processor-performance degradation due to memory contention that the developer can avoid with better placement of the data buffers (Reference 1).

In general, compilers and profiling tools can provide limited assistance with global optimizations for placing instructions and data in memory. Green Hills’ optimizing compilers support the reordering of functions in memory to optimize cache hits. Texas Instruments’ CodeSizeTune profile-based-compilation tool helps developers explore configurations by automating the building and profiling of different versions of the software with different compiler settings that affect code size and execution speed (Figure 1). In general, though, for many high-efficiency and real-time-constrained systems, the burden falls on the software developer to understand the memory subsystem to avoid incurring unnecessary BOM (bill-of-materials) costs because of the system’s inefficient use of processing and memory resources.

Tolerance of latency

A primary concern when implementing memory architectures is making the processor tolerant of the access latencies of the memories the system uses. A properly designed memory subsystem can mask much of the system’s memory-access latency and provide a sufficient read/write throughput rate—that is, the memory-access time for subsequent data in the same block of data—to support continuous access. This scenario avoids starving the processor’s execution units of instructions and data. Memory designers must also balance masking the memory’s access latency against the silicon area of the memory, the total power the memory consumes, and the ease of use of the memory by software developers and tools (see sidebar “Ease of use”).

Direct drivers of memory-access latency are the times it takes to decode the address, activate the appropriate word line, sense the bit line, and drive the output from the sense amps. The address-decoding latency is the time it takes to latch the address and decide which word line requires activation; this process takes O(n log n) time as a function of the size of the memory’s row and column addresses. So, as the memory structure gets bigger, so too does the time to decode the addressing. The word-line-activation latency is the time it takes to raise the word line; it is primarily an RC delay related to the length of the line, with longer lines driving longer delays. The bit-line-sensing latency is the time it takes for the sense amplifiers to detect the cell contents. The bit-line architecture, the RC of the sense-drive line, the cell-to-bit capacitance ratio, and the sense-amplifier topology all affect bit-line-sensing latency. The output-driving latency, also an RC delay, is the time it takes to propagate the data from the sense amps to the output.

Memories and the logic to manage them dominate the silicon area of many processor-based devices. As a result, memories can be the largest components of the device’s silicon cost and the largest consumer of both dynamic and static power in the system. The many types of volatile and nonvolatile memory available involve numerous trade-offs, and the system designer must balance and manage the key parameters to deliver good enough memory performance at lower cost and power consumption.

To balance masking memory-access latency, silicon cost, and power consumption, processor-based devices usually rely on a hierarchical memory structure that places smaller amounts of faster yet more expensive memories closer to the processor core and larger amounts of slower and less expensive memories farther from the processor core (Figure 2). After the processor registers, which are the fastest and scarcest memory resources in the system, memory hierarchies may use local memories or TCMs (tightly coupled memories), multiple layers of caches, and volatile and nonvolatile on- and off-chip memories.

Modern optimizing compilers are competent at managing the use of the processor registers, but they are weaker at managing and optimizing the other memories. This situation is due partly to the fact that optimizing the use of the registers works well as a tactical exercise with a local view of the program code. Optimizing the use of the other memory structures, such as the TCMs, in a processor-based system requires a more global view of the system, and this capability is still emerging in most compilers.

Local memories or TCMs connect to the processor core through local- or dedicated-memory buses for access performance similar to that of cache memories. Memory-access determinism is a key difference between TCMs and caches. Cache-line locking manually and temporarily enables a cache at the line level to act as a TCM. Program-instruction and data access through a TCM is deterministic, but, with a cache, the designer must consider the worst-case scenario of cache misses. “A typical rule of thumb for the penalty of a cache miss is an order of magnitude longer access latency than the previous level,” says David Fisch, director of architecture at Innovative Silicon. “An L2 memory access has 10 times longer latency than an L1 cache access, and it also has a 10 times shorter latency than an L3 memory access.” However, using TCMs puts the onus on the software developer to manually manage that memory space, usually with a DMA controller, so that the necessary code and data are in the TCM when the processor needs them.
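The order-of-magnitude rule of thumb lends itself to a quick back-of-the-envelope estimate. The following Python sketch is an illustration only: the function name, the one-cycle first-level latency, and the hit rates are assumptions chosen for the example, not figures from any vendor. It computes the average access cost of a cached hierarchy and contrasts it with the deterministic cost of a TCM and with the worst case that a designer must budget for when using caches.

```python
def average_access_cycles(hit_rates, first_level_cycles=1.0, step_factor=10.0):
    """Average cycles per access for a hierarchy of caches over a backing memory.

    hit_rates lists the hit rate of each cache level, nearest the core first.
    The level after the last cache is assumed to always hit.
    """
    average = 0.0
    reach_probability = 1.0          # fraction of accesses that get this far
    level_cycles = first_level_cycles
    for hit_rate in hit_rates:
        average += reach_probability * level_cycles
        reach_probability *= 1.0 - hit_rate
        level_cycles *= step_factor  # assumption: each level is 10x slower
    # Whatever missed every cache pays the latency of the final level.
    return average + reach_probability * level_cycles


# Hypothetical hit rates: 95% in L1, and 80% of the L1 misses hit in L2.
cached = average_access_cycles([0.95, 0.80])    # about 2.5 cycles on average
# A TCM behaves like a first level that never misses.
tcm = average_access_cycles([1.0])              # 1 cycle, every time
# Worst case for the cached design: an access misses both levels.
worst_case = average_access_cycles([0.0, 0.0])  # 1 + 10 + 100 cycles
```

Under these assumed numbers, the cached design averages only a few cycles per access, but a real-time designer must plan for the 111-cycle worst case, whereas the TCM figure never varies.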

A cache uses a small amount of faster memory to mask the latency of a larger amount of slower memory. The slower memory is denser and, hence, cheaper. Caches rely on the premise of temporal and spatial locality to mask the memory-access latency of the slower memory. “Temporal locality” describes the premise that, if the processor requests some data, then the processor will soon need that same data again. By keeping a copy of the data in its storage, the cache can avoid going to the slower memory. “Spatial locality” describes the premise that, if the processor requests code at a memory location, then it is highly likely that the next processor request will be the code at the next memory location or close to that location. By prefetching some amount of the data near the currently requested data at the same time as the original fetch, the cache can have the next few data locations in its store without incurring the latency of another fetch from the slower memory.
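A small simulation can make the two kinds of locality concrete. The Python sketch below models a direct-mapped cache; the function name, the 1-kbyte capacity, and the three address traces are assumptions made up for this illustration. It counts misses for a sequential walk (spatial locality), a repeated loop (temporal locality), and a stride that defeats both.

```python
def direct_mapped_miss_rate(addresses, cache_bytes=1024, line_bytes=16):
    """Fraction of accesses that miss in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines    # block number held by each cache line
    misses = 0
    for address in addresses:
        block = address // line_bytes
        index = block % num_lines    # each block has exactly one possible home
        if resident[index] != block:
            misses += 1
            resident[index] = block  # the whole line arrives with the miss
    return misses / len(addresses)


# Spatial locality: walking 32-bit words in order misses once per 16-byte line.
sequential = direct_mapped_miss_rate(range(0, 4096, 4))            # 0.25
# Temporal locality: a 512-byte loop run 10 times misses only on the first pass.
looping = direct_mapped_miss_rate(list(range(0, 512, 4)) * 10)     # 0.025
# No locality: a stride equal to the cache size maps every access to one line.
strided = direct_mapped_miss_rate([i * 1024 for i in range(64)])   # 1.0
```

The sequential walk benefits only from the neighbors that each line fill brings in, the loop benefits from data staying resident between passes, and the strided trace shows how quickly performance collapses when an access pattern matches neither premise.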

Larger caches usually mean fewer cache misses at the cost of more silicon area. Increasing the cache-set associativity, which refers to the number of locations where a given memory block can reside in the cache, almost always reduces cache misses. Lengthening the cache line can either reduce or increase misses, depending on the application’s behavior. According to Bill Huffman, chief architect at Tensilica, “Configuring caches is an iterative task that is highly dependent on the application set that will execute on the processor.”
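The effect of associativity is easiest to see with a trace that provokes conflict misses. The following Python sketch assumes LRU (least-recently-used) replacement and a hypothetical workload that alternates between two buffers placed exactly one cache size apart; neither the names nor the numbers come from the tools or vendors mentioned in this article.

```python
from collections import OrderedDict


def set_associative_miss_rate(addresses, cache_bytes, line_bytes, ways):
    """Fraction of accesses that miss in a set-associative cache with LRU."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]  # oldest entry comes first
    misses = 0
    for address in addresses:
        block = address // line_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return misses / len(addresses)


def ping_pong_trace(words=64, word_bytes=4, spacing=1024):
    """Alternate between A[i] and B[i], where B starts `spacing` bytes after A."""
    trace = []
    for i in range(words):
        trace.append(i * word_bytes)
        trace.append(spacing + i * word_bytes)
    return trace


trace = ping_pong_trace()
rates = {ways: set_associative_miss_rate(trace, 1024, 16, ways)
         for ways in (1, 2, 4)}
# rates == {1: 1.0, 2: 0.25, 4: 0.25}
```

In this contrived case, the direct-mapped cache thrashes because both buffers compete for the same lines, two ways remove the conflict entirely, and four ways add nothing further. Real applications sit somewhere between these extremes, which is why the configuration task is iterative.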

Balancing the various cache parameters can be a complex process that involves trade-offs between silicon area and miss rates (Figure 3). In the figure, the explored cache configurations span from a 4-kbyte, direct-mapped, 16-byte-line configuration with a load-miss rate of 13.4% to a 32-kbyte, four-way-set-associative, 64-byte-line configuration with a load-miss rate of 1.9% for a JPEG-encoding application (Reference 2). Even though the larger cache is better, there is a diminishing return of benefits for the 32-kbyte cache. There is a larger performance benefit from increasing the cache-line size than from doubling the size of the cache; the longer cache lines reduce silicon cost. Also, although higher cache-set associativity is better, in this example, going from two-way to four-way-set associativity yields fewer benefits. In short, no clear rule of thumb exists for configuring caches.

Decision drivers

The processor-core architecture is the first-order driver of the memory-architecture options that a designer has. The reason is that the designer builds the core with assumptions about how the memory components interface with and complement the core. Von Neumann and Harvard architectures are two common processor architectures that model and implement different ways to view and access memory. Processors based on a von Neumann architecture model the system memory as a single storage structure that hosts both the program instructions and data; a single bus interface services all program and data accesses. Processors based on a Harvard architecture model the system memory for program instructions and data as physically and logically separate storage structures with separate bus interfaces—one for instructions and the other for data. The Harvard architecture supports simultaneous access for program instructions and data, whereas the von Neumann architecture does not.
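A deliberately idealized count of bus cycles shows why simultaneous access matters. The Python sketch below assumes that every instruction fetch and every data access occupies its bus for exactly one cycle and ignores pipelining, caches, and wait states; the function name and the workload numbers are hypothetical.

```python
def bus_cycles(instruction_fetches, data_accesses, harvard):
    """Bus cycles needed under a one-cycle-per-access idealization."""
    if harvard:
        # Separate instruction and data buses: the two streams overlap,
        # so the busier bus sets the total.
        return max(instruction_fetches, data_accesses)
    # A single shared bus serializes every fetch and every data access.
    return instruction_fetches + data_accesses


# Hypothetical workload: 1000 instructions, 300 of which load or store data.
von_neumann_cycles = bus_cycles(1000, 300, harvard=False)  # 1300
harvard_cycles = bus_cycles(1000, 300, harvard=True)       # 1000
```

Actual devices blur this distinction, for example with separate instruction and data caches in front of a unified memory, so the sketch marks only the two ends of the range.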

To choose an optimized memory design, a designer must also understand the application’s behavior and requirements. Considerations for the memory design are: How will application data enter and exit the system, and will the processor directly load the data or will an external agent, such as a DMA controller, load the data into the processor’s local RAM? You must ask similar questions about outputs: Will the processor directly drive the output ports, or will an external agent, such as a DMA controller, transfer the data from the processor’s local RAM to an I/O interface? Other questions include: What is the application’s start-up scenario, can the system make efficient use of special memory interfaces, and can the on-chip-memory resources accommodate all or even just the performance-sensitive code and data of the application?

The application start-up requirements affect where you can store the initialization code and through what interfaces the system can retrieve it. On-chip OTP (one-time-programmable) ROM is useful for storing boot code because it is small with high silicon density. It supports fast start-up because it needs no wait time after start-up to begin executing. The initialization code could reside in and execute in place from flash memory; it could also reside in off-chip memory and be shadowed into on-chip instruction RAM, which can result in longer system start-up. If the application code and data can reside in on-chip memories, it might be unnecessary to support off-chip-memory interfaces. If the performance-sensitive program code can fit into local memories, the designer may not need to implement caches.


Designers can tailor the processor to the known constraints of the application they are targeting to include only the amount of random-access and nonvolatile memory the application requires. The sizing and parameters of TCMs, caches, or special memories target the application. Processors targeting a wider set of applications typically implement a generalized memory architecture that includes the maximum resource requirements of the set of applications with variants of the device offering fewer resources to meet lower costs. For systems with similar processor-core architectures, the memory subsystem becomes a higher order driver for differentiating the system’s deliverable processing performance, power consumption, and price (see sidebar “Multiple options”).

Memory controllers abstract the implementation of the memory block they service so that it appears as a data pipe to the processor system. They contain the logic necessary to read the memory block and, as appropriate to the type of memory they service, write, refresh, test, and correct errors. For on-chip memories, the memory controller can manifest the company’s proprietary innovations, which differentiate its processor device from a similar device from its competitor. As a result, most processor vendors are unwilling to discuss the specifics of their memory controllers. They hint at techniques for use in memory controllers, including using wide data buses, multiplexed or staggered access of banks, buffering, pipelining, and transaction reordering, as well as specialized and speculative access patterns.
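Staggered access of banks is commonly achieved by interleaving consecutive lines across banks so that a burst touches each bank in turn. Because vendors do not disclose their actual schemes, the following Python sketch is a generic illustration; the function name, the 32-byte line, the four banks, and the 1-Mbyte bank size are all assumptions.

```python
def map_address(address, line_bytes=32, num_banks=4,
                interleaved=True, bank_bytes=1 << 20):
    """Return (bank, offset_within_bank) for a physical address."""
    if interleaved:
        line = address // line_bytes
        bank = line % num_banks  # consecutive lines rotate across banks
        offset = (line // num_banks) * line_bytes + address % line_bytes
    else:
        bank = address // bank_bytes  # each bank holds one contiguous region
        offset = address % bank_bytes
    return bank, offset


# Four consecutive 32-byte lines of a burst:
burst = range(0, 128, 32)
interleaved_banks = [map_address(a)[0] for a in burst]                  # [0, 1, 2, 3]
linear_banks = [map_address(a, interleaved=False)[0] for a in burst]    # [0, 0, 0, 0]
```

With interleaving, the controller can start the access to the next bank while the previous one is still completing; with the linear map, the same burst queues up behind a single bank.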

In addition to the characteristics of the implemented memory, system-level factors that affect the design and efficiency of memory controllers include how physical addresses map to the internal representation of the memory system; the type of addressing patterns, such as burst, random, and concurrent-access patterns; the mix of reads and writes; and how unused memory enters low-power modes. A memory controller’s primary usage model normally dictates its architecture: A graphics or multimedia memory controller might optimize for sequential accesses, whereas a memory controller for an embedded-communication-system application might optimize for random accesses over a large memory span. For those embedded memories with system-level reliability requirements, the memory controller can, at the cost of additional complexity, provide ECC (error-correction-code) protection.
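To give a sense of what ECC protection involves, the following Python sketch implements the classic Hamming(7,4) code, which stores three check bits with every four data bits and corrects any single-bit error. It is the smallest textbook instance of the idea and is not drawn from any vendor's controller; production memories generally protect much wider words. The function names are chosen for this illustration.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value as 7 bits, listed for codeword positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(bits):
    """Return (corrected_nibble, error_position); position 0 means no error."""
    bits = list(bits)
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    error_position = s1 + 2 * s2 + 4 * s4  # the syndrome names the bad position
    if error_position:
        bits[error_position - 1] ^= 1      # flip the faulty bit back
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, error_position


# Store 0b1011, corrupt the bit at position 5, and read the value back.
stored = hamming74_encode(0b1011)
stored[4] ^= 1
value, position = hamming74_decode(stored)  # value == 0b1011, position == 5
```

The extra storage and the encode and decode logic on every access are the additional complexity that the controller takes on in exchange for reliability.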

The traffic pattern at the memory controller differs significantly between single- and multiple-processor-core systems. A memory controller for a single-core system may use a stream, but the memory controller for a shared memory in a multicore system might need the ability to handle multiple streams and random traffic. For multicore designs, the memory architecture must enable fast and efficient message passing and data sharing between processors. Although different approaches exist for accomplishing these goals, no single configuration is efficient for all types of communications. Fast, point-to-point channels and queues are essential to exchange short and critical messages, whereas shared memory is better for sharing larger data structures. When using shared memories, users need programming support for synchronization and memory management.

As more embedded systems incorporate multiple cores, especially heterogeneous cores, as part of the design, development tools will most likely evolve to better assist developers with the spatial and temporal placement of code and data to sustain better latency tolerance and squeeze out better performance in increasingly complex designs. The development tools must assist developers in better understanding the global behavior of the system and matching that behavior with memory subsystems available in the system. Otherwise, memory and chip designers must continue to incorporate ever-more-complex control algorithms in their memory controllers to invisibly compensate for software designers’ and development tools’ lack of visibility into the behavior of the memory system.






References
  1. Schuchmann, David, “Tuning an Application to Prevent M1 Memory Contention,” Application Note AN3076, Freescale Semiconductor Inc, May 2006.

  2. “How to Optimize SOC Performance Through Memory Tuning,” White Paper, Tensilica.

Ease of use

Ease of programming is a feature that is important to software developers. A flat address space that hides the memory hierarchy makes it easier for the developers to program. Brian Boles, digital-signal-controller-division technical fellow at Microchip Technology, says, "Generally, it is easier for a compiler to target an application to a generalized memory structure." It is more difficult for compilers to optimally allocate the code and data to application-specific memory structures without visibility into the global and dynamic characteristics of the application code.

For sophisticated applications that require operating systems, such as Linux, the memory architecture may need to support virtual addressing. However, a consideration for developers using heavy operating systems to meet time-to-market schedule pressures is the potential loss of insight into how to partition the software to take advantage of on-chip resources to save power and cost. Part of the conflict is balancing and determining how much of the on-chip memory the operating system requires to operate mainly out of on-chip memory and how much of the memory this approach leaves for the application. "To date, general-purpose operating systems do not have hooks to specify the complete physical-to-memory-system mapping so as to facilitate the most optimum use of the underlying memory system," says Phil Ames, segment-marketing manager for the Embedded and Communications Group at Intel. "However, it is common in embedded designs to hand-tune the software to make best use of the memory system."

Managing each different class of memory may require specialized software. For example, small-block NAND flash (528 bytes per page) usually requires different flash-management software from large-block NAND flash (2112 bytes per page). One approach to manage this situation is to modularize the software into layers so that the software developer has to rewrite as little as possible when changes are necessary. According to Doug Wong, member of the technical staff at Toshiba's memory-products group, "NAND flash appears to be the first commodity memory to add significant intelligence to the memory device itself in order to make it easier to use." Toshiba's LBA-NAND and eMMC-compliant embedded NAND both contain built-in controllers that perform NAND-management functions, such as block management, wear-leveling, logical-to-physical-block translation, and automatic error correction. This approach significantly reduces the burden on the system architect or software engineer in managing the NAND-flash device for an FFS (flash file system) or for FTL (flash translation layer).
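Two of the NAND-management functions named above, logical-to-physical-block translation and wear-leveling, can be sketched in a few lines. The Python class below is a toy model written for this illustration and does not represent Toshiba's or any other vendor's controller: it works on whole blocks, keeps everything in memory, and ignores pages, bad blocks, and error correction. The class and method names are hypothetical.

```python
class TinyFTL:
    """Toy flash translation layer: block remapping with simple wear-leveling."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))
        self.logical_to_physical = {}
        self.contents = {}  # physical block -> stored payload

    def write(self, logical_block, payload):
        if not self.free_blocks:
            raise RuntimeError("no free physical block available")
        # Wear-leveling: always program the least-erased free block.
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        self.contents[target] = payload
        previous = self.logical_to_physical.get(logical_block)
        self.logical_to_physical[logical_block] = target
        if previous is not None:
            # Flash cannot be overwritten in place: erase the stale copy.
            del self.contents[previous]
            self.erase_counts[previous] += 1
            self.free_blocks.add(previous)

    def read(self, logical_block):
        return self.contents[self.logical_to_physical[logical_block]]


# Rewrite the same logical block 10 times on a 4-block device.
ftl = TinyFTL(4)
for version in range(10):
    ftl.write(0, version)
# The host always sees logical block 0, yet the erases are spread
# across all four physical blocks: ftl.erase_counts == [3, 2, 2, 2]
```

When the device performs this bookkeeping itself, the host software sees only a stable logical block address, which is the burden reduction the paragraph above describes.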


Multiple options

The following example, using the ARM7-based NXP LPC2129 processor core, illustrates some of the possible first-order-decision influences of the processor-core architecture on the memory architecture (Figure A). The ARM7 is a three-stage-pipeline von Neumann-architecture machine, with one port that connects to the AHB (Advanced High-Performance Bus) through an AHB bridge. The bridge is necessary to provide synchronization between the processor and the peripheral frequencies, to accommodate the processor interfaces, and to act as an interface to a multimaster bus. Although the bridge is necessary, it imposes a two-clock latency penalty when the processor accesses anything through the AHB and an additional performance penalty if addresses are out of sequence.

An obvious location to place program and data memory is on the AHB side of the bridge so that the processor can access the memory and peripherals can directly access memory data. However, the AHB bridge still imposes a two-cycle latency penalty. To optimize processing performance, designers can place the program memories on the processor's local-bus side of the AHB bridge. Although this configuration increases the processing performance, other bus masters cannot directly access this memory, forcing the designer to place more memory on the AHB side for the DMA masters. This approach increases cost in older processes, but in deep-submicron processes, the increase in performance can outweigh the cost increase.

Flash memory is slower than 6T (six-transistor) SRAM cells, but the use of flash memory is important in embedded systems due to its nonvolatility, solid-state reliability, low power consumption, and design flexibility. Many subarchitectures within a single memory type allow you to tune the architecture to the application's requirements. These requirements may include access speeds, programming speeds, read-voltage and power-consumption levels, and cost. Other important considerations for flash memory are data-retention time in years and endurance, the number of supported erase cycles.

The random-access speed for embedded flash is approximately 50 nsec versus 85 nsec for merchant flash, which presents a problem when targeting a processor that operates faster than 20 MHz, the rate that corresponds to a 50-nsec access time. However, because accessing the embedded flash is not pin-limited, the embedded-flash-memory subsystem can use wide bit widths with some interface logic to increase performance. In this example, a 128-bit width allows the system to simultaneously access four processor-data words, which provides an effective access frequency of 80 MHz for linear code. Combining the four-word fetch with logic that buffers reads and allows branch prediction enables you to achieve acceptable performance when executing from flash. This method enables a more cost- and power-effective mix of SRAM and flash as local memories for random access of data and mostly linear access of program code than just an SRAM implementation.
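The arithmetic behind the 80-MHz figure is short enough to write out. The Python sketch below assumes 32-bit processor words, as on the ARM7, and strictly linear code with no branches; the function name is chosen for this illustration.

```python
def effective_fetch_mhz(access_nsec, bus_bits, word_bits=32):
    """Words delivered per microsecond when every access returns a full bus width."""
    words_per_access = bus_bits // word_bits
    return words_per_access * 1000.0 / access_nsec


narrow = effective_fetch_mhz(50, 32)   # one word per 50 nsec: 20 MHz
wide = effective_fetch_mhz(50, 128)    # four words per 50 nsec: 80 MHz
```

A single 50-nsec access that returns one word sustains only 20 MHz, whereas the same access returning four words sustains 80 MHz as long as the processor consumes the words in order. A taken branch discards the unused words of the fetch, which is why the buffering logic matters.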

You have several options for implementing the bus architecture to support high-bandwidth peripherals. One way is to use a multilayer bus, which is a matrix that allows many masters to access memory resources in different ways. Another way is to design an AHB-to-AHB bridge, so there are two or more independent buses. Because the local-memory SRAM does not support DMA, regardless of which method you choose, any high-bandwidth peripherals, such as Ethernet or USB, should have a dedicated memory resource that they can access directly. The number of stored data packets and frames, data rates, and processor speed should drive the size of the dedicated memories.


Author Information
You can reach Technical Editor Robert Cravotta at 1-661-296-5096 and rcravotta@edn.com.

Altera: www.altera.com

ARC International: www.arc.com

ARM: www.arm.com

Atmel: www.atmel.com

Cambridge Consultants: www.cambridgeconsultants.com/asic

CriticalBlue: www.criticalblue.com

EEMBC: www.eembc.org

Freescale: www.freescale.com

Green Hills Software: www.ghs.com

Hi-Tech Software: www.htsoft.com

Innovative Silicon: www.innovativesilicon.com

Intel: www.intel.com

Intrinsity: www.intrinsity.com

Kilopass Technology: www.kilopass.com

Microchip: www.microchip.com

MIPS: www.mips.com

NXP Semiconductors: www.nxp.com

Qimonda: www.qimonda.com

Renesas: www.renesas.com

Samsung: www.samsung.com

STMicroelectronics: www.st.com

Stream Processors: www.streamprocessors.com

Tensilica: www.tensilica.com

Texas Instruments: www.ti.com

Toshiba America Electronic Components: www.toshiba.com

Virtium Technology: www.virtium.com

Xilinx: www.xilinx.com
