DRAM technology for SOC designers and—maybe—their customers
An understanding of DRAM technology has become critical to anyone designing consumer-electronics SOCs.
By Drew E Wingard, PhD, Sonics Inc -- EDN, August 6, 2009
Consumers have come to expect—and even demand—the full benefits of the convergence between computing, communications, and digital-media technologies that we've all been predicting for the past 15 years. OEMs therefore produce a bewildering array of advanced electronic systems in a variety of form factors, feature sets, and prices to fulfill consumer desires. SOC (system-on-chip) devices enable much of this convergence by integrating increasing amounts of performance at acceptably low costs.
Although many differences exist between the processing requirements of the computing, communications, and digital-media components in converged SOCs, one commonality is the reliance upon external memories, particularly DRAMs (dynamic random-access memories), to provide high-bandwidth storage for data that the SOC processes. For many consumer SOCs, the performance and cost of the overall system depend upon the efficiency of the communications between the SOC and the attached DRAM. This dependence is particularly strong in SOCs that process high-definition digital-video streams, such as those in HDTVs (high-definition televisions), digital-cable or satellite STBs (set-top boxes), DSCs (digital still cameras) and camcorders, and advanced multimedia and smart mobile telephones.
Classic computer-architecture texts describe a computer as comprising three classes of hardware: processing, memory, and I/O (input/output) devices. In consumer-electronics equipment, you find hierarchies of these same components. Although the phrase “system on chip” implies the integration of all a system's functions into one device, SOCs typically implement most of the user-visible processing portions of the system but rely on external devices to implement at least some of the I/O functions and often require external nonvolatile and volatile memory devices, including DRAMs, to implement the system's memory.
These SOCs typically comprise subsystems that you can classify as processor, embedded memory, or I/O. And those subsystems may themselves similarly comprise IP-core building blocks that are processors, memories, or I/O. SOC developers use these hierarchies to minimize the amount of communication that passes across the hierarchy, thereby minimizing the dependencies that exist between components at different locations in the hierarchy. This approach helps improve the overall design quality and reduces the total design effort. The hierarchies also improve system performance and reduce system power and energy consumption because it is far more efficient for processors to access local memory and I/O resources than remote ones.
Even with effective management of communications through the use of hierarchy, however, many consumer SOCs require efficient and high-bandwidth access to external DRAMs. A key challenge facing SOC developers is therefore to optimize access to external DRAM.
Although many general-purpose computing systems use their CPUs to process media and communication streams, consumer SOCs cannot afford the area and power inefficiencies inherent in most CPU architectures. Instead, SOCs normally rely on a set of accelerators for the processing tasks that implement the main functions of the SOC. Common accelerators include DSPs for audio processing, 2- and 3-D graphics cores, hardware-video codecs, and communications processors. The accelerator architectures vary widely, from fixed-function devices through programmable engines to configurable processors with ISA (instruction-set-architecture) extensions for function acceleration. Mixes of all three approaches are also available. A common characteristic of most accelerator architectures is the use of sufficient local memory resources to both minimize the amount of necessary external-memory bandwidth and increase the latency tolerance of such external access.
In most consumer SOCs, external DRAM must support the lion's share of the external-memory-bandwidth requirements of the initiators, that is, the CPUs, accelerators, and I/O interfaces that request data. When this total loading gets too high for a cost-effective DRAM implementation, SOC architects turn toward embedded shared-memory components on the SOC. These embedded memories, typically using SRAM or embedded-DRAM technologies, implement either hardware- or software-managed caching schemes that further reduce the external-memory loading. Designers of SOCs for battery-powered applications, such as mobile phones, may also use shared embedded memories to support the total memory-bandwidth needs of certain operating modes and to allow the powering-off of external DRAM to lengthen battery life.
However, on-chip memory is expensive. The density and cost advantages of external commodity DRAMs drive their use in most consumer applications. DRAM serves as the key resource in an SOC to hold the data passing between each of the initiators in the SOC. By allocating sufficient data storage between processing stages, DRAM offers the cheapest method for allowing each component to operate at its own pace, thereby decoupling the operation of each of the subsystem components. A shared-DRAM subsystem also minimizes the total memory footprint of a system because the system can allocate different amounts of DRAM to each component for different operating modes of the application.
Commodity though it may be, DRAM still costs money. The best way to minimize this cost is to maximize the efficiency of DRAM accesses, which maximizes the sustained, or usable, bandwidth as a fraction of the peak, or available, DRAM bandwidth. Because each DRAM device offers a maximum amount of memory bandwidth and has a minimum storage capacity, some systems require multiple DRAMs to deliver the required memory bandwidth. In applications such as HDTV, the bandwidth requirements are so high that systems must use more capacity than the application requires just to achieve sufficient DRAM bandwidth. In such cases, improving DRAM efficiency can reduce the number of DRAMs in the final consumer system, making the SOC more cost-effective at the system level.
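The link between efficiency and device count can be made concrete with a little arithmetic. The sketch below is only illustrative: the function name, the 3-Gbyte/sec requirement, and the 2.667-Gbyte/sec peak per device are hypothetical figures, not values from any particular design.

```python
import math


def devices_needed(required_gbytes_per_s, peak_gbytes_per_s_per_device, efficiency):
    """Number of DRAM devices needed to sustain the required bandwidth.

    efficiency is sustained bandwidth divided by peak bandwidth, in (0, 1].
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    sustained_per_device = peak_gbytes_per_s_per_device * efficiency
    return math.ceil(required_gbytes_per_s / sustained_per_device)


# Hypothetical system: 3 Gbytes/sec required, 2.667 Gbytes/sec peak per device.
at_50_percent = devices_needed(3.0, 2.667, 0.50)
at_80_percent = devices_needed(3.0, 2.667, 0.80)
print(at_50_percent, at_80_percent)
```

Under these assumed numbers, raising efficiency from 50% to 80% removes one device from the board.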
DRAM primer
A DRAM device contains a large number of DRAM cells in 2-D arrays, in which each cell holds a single bit of data in the form of charge stored on a capacitor. Because the data is stored on capacitors in the bit cells, the data storage is not permanent. Over time, the capacitor's stored charge decays. To avoid data loss, the DRAM must periodically refresh the cells to restore their charge. These refresh operations take time away from normal array traffic and thus reduce the overall throughput of the DRAM system by a few percentage points.
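A rough estimate of the refresh penalty is the time one refresh command occupies the device divided by the average interval between refresh commands. The sketch below assumes a 110-nsec refresh command issued every 7800 nsec on average; these values are illustrative assumptions, not figures from the article, and larger-capacity devices generally need longer refresh commands.

```python
def refresh_overhead(refresh_command_ns, refresh_interval_ns):
    """Fraction of time the DRAM spends refreshing instead of serving traffic."""
    return refresh_command_ns / refresh_interval_ns


# Assumed timings (not from the article): 110 nsec of refresh every 7800 nsec.
overhead = refresh_overhead(110.0, 7800.0)
print(f"refresh overhead: {overhead:.1%}")
```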
A read or write operation first uses a row address to open, or activate, a page by copying the bit values in a row of cells, or a DRAM page, into sense amplifiers at one end of each column. The chip then reads or writes a set of data (one array word) to or from the sense amplifiers, using a column address. Because accessing the sense amplifiers is much faster than accessing a new row of cells, a column access within an open page has much lower latency than opening a new page. Before the chip can access another page in the same array, it must precharge the array, or close the page; that is, it must load the data values stored in the sense amplifiers back into the cells of the open page. While the chip opens and closes a page, the system can access no data from pages in that array at the DRAM interface, which results in a loss of data throughput.
Most modern DRAMs contain banks, each of which contains multiple such arrays. This organization allows data access to one bank to occur simultaneously with page-closing and -opening operations on other banks. In this way, the DRAM can hide the throughput penalties of page operations by exploiting bank-level parallelism in the access patterns from the system.
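The page and bank behavior described above can be captured in a toy model that tracks the open row in each bank and reports which commands an access would need. The class and command names are hypothetical simplifications; a real controller must also respect the timing constraints between commands.

```python
class Dram:
    """Toy model: remembers which row (page) is open in each bank."""

    def __init__(self, num_banks):
        self.open_rows = [None] * num_banks  # None means the bank is precharged

    def access(self, bank, row):
        """Return the commands needed to read or write `row` in `bank`."""
        open_row = self.open_rows[bank]
        if open_row == row:
            commands = ["COLUMN"]                           # page hit
        elif open_row is None:
            commands = ["ACTIVATE", "COLUMN"]               # bank was closed
        else:
            commands = ["PRECHARGE", "ACTIVATE", "COLUMN"]  # page miss
        self.open_rows[bank] = row
        return commands


dram = Dram(num_banks=4)
first = dram.access(bank=0, row=8)   # opens row 8 in bank 0
hit = dram.access(bank=0, row=8)     # same page: column access only
other = dram.access(bank=1, row=3)   # different bank: bank 0 stays open
miss = dram.access(bank=0, row=5)    # same bank, new row: close, then open
```

Only the last access pays the full precharge-plus-activate penalty, and a controller can hide that penalty if another bank is transferring data in the meantime.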
In the 1990s, vendors introduced SDRAM (synchronous DRAM) with clocked interfaces that enabled higher data rates than previous asynchronous DRAMs. In recent years, this trend has continued with DDR (double-data-rate) SDRAM devices that send data on both edges of the clock, enabling even higher data rates. SDRAM architectures support the concept of burst-mode accesses, in which a column access results in the transfer of several interface words of data. The SOC's memory controller programs the burst-length value into the SDRAM's mode register to choose between burst lengths of 1, 2, 4, or 8 words. Burst-mode accesses are important because they free up the DRAM command interface to transmit precharge and activate commands for other banks while one bank sends or receives data.
Figure 1a shows a timing diagram for a burst-read access to a 133-MHz SDRAM. This access causes a page miss in Bank 0, so the memory controller issues a precharge command to close the open page. After the number of clock cycles necessary to cover tRP—the minimum time between the precharge command that closes a page and the activate command that opens a page in the same bank—the controller activates the new page at Row 8 in Bank 0. After at least tRCD—the minimum delay between the activate command, which opens a row in the bank, and the read- or write-column command, which moves data in the same bank—the desired page opens, so the controller issues a burst-read command to Column A in Bank 0. Two cycles later, the SDRAM returns the requested data, which is two words, a0 and b0, because the SDRAM's burst-length value is 2. Bank 0 is busy closing and opening pages during the entire period Figure 1a depicts in pink. You lose these data cycles unless the controller can schedule commands to other banks that use the data bus during this period.
Achieving the lowest cost per bit of storage requires using relatively large arrays of bit cells, which results in relatively slow array circuitry. The fundamental operating frequencies of SDRAM arrays have therefore not kept pace with technology scaling. Instead, improvements in DRAM bandwidth result from the prefetch architecture that DDR SDRAMs introduced. The prefetch architecture relies on choosing DRAM-interface words that are narrower than the DRAM-array words. If the ratio of the array-word size to the interface-word size is N, then the interface can transfer words at an effective rate N times that of the array without loss of throughput. However, the minimum efficient access size for the DRAM array is the array word, so the minimum efficient interface burst length becomes N.
The degree of prefetch is the major architectural differentiator in SDRAM technologies. Table 1 shows the current prefetch values for SDRAMs. The minimum efficient access size for DDR3 devices is a burst of 8 interface words. Unfortunately, this size often proves troublesome for consumer SOCs.
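The effect of prefetch on array speed and minimum access size can be sketched as follows. The prefetch values used here (1, 2, 4, and 8) are the ones commonly cited for each SDRAM generation and are stated as an assumption rather than reproduced from Table 1; the function names and the DDR, DDR2, and interface-width examples are hypothetical.

```python
# Prefetch depth N = array-word size / interface-word size (assumed values).
PREFETCH = {"SDRAM": 1, "DDR": 2, "DDR2": 4, "DDR3": 8}


def array_rate_mhz(data_rate_mbps_per_pin, family):
    """Rate at which the array must supply array words for a given pin data rate."""
    return data_rate_mbps_per_pin / PREFETCH[family]


def min_burst_bytes(family, interface_bits):
    """Smallest efficient transfer: one array word, sent as N interface words."""
    return PREFETCH[family] * interface_bits // 8


for family, rate in [("SDRAM", 133), ("DDR", 400), ("DDR2", 800), ("DDR3", 1333)]:
    print(family, rate, "Mbps/pin ->", array_rate_mhz(rate, family), "MHz array rate")

print(min_burst_bytes("DDR3", 64), "bytes minimum on a 64-bit DDR3 interface")
```

A tenfold increase in pin data rate from SDRAM to DDR3 corresponds, in this model, to only a modest increase in the array rate.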
Figure 1b shows the timing differences between the 133-MHz SDRAM device and a recently announced 666-MHz DDR3 SDRAM. Although the latency of a page miss at 40.5 nsec has improved by only 10% over the other device's 45 nsec, the peak data throughput has increased by a factor of 10: 1333 versus 133 Mbps/pin. Note that, even though the burst length is now 8, the entire burst of DDR-read data takes less time—6 versus 7.5 nsec—than a single word in the SDRAM case. Note also that the time when Bank 0 is busy is nearly as long, and it will thus take more DDR bursts to cover this busy time than the SDRAM requires.
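The figures quoted above follow from simple cycle arithmetic. In the sketch below, the clock periods (7.5 and 1.5 nsec) follow from the stated frequencies, but the cycle counts for tRP, tRCD, and read latency (2-2-2 and 9-9-9) are assumptions chosen to reproduce the quoted 45- and 40.5-nsec latencies. The page-miss latency also stands in for the bank's busy time, which is an approximation.

```python
import math


def page_miss_latency_ns(clock_ns, t_rp, t_rcd, cas_latency):
    """Precharge, then activate, then wait for read data, all in clock cycles."""
    return clock_ns * (t_rp + t_rcd + cas_latency)


def burst_time_ns(clock_ns, burst_length, transfers_per_clock):
    """Time the data bus is occupied by one burst."""
    return clock_ns * burst_length / transfers_per_clock


# Assumed cycle counts: 2-2-2 for the 133-MHz SDRAM, 9-9-9 for the DDR3 device.
sdram_miss = page_miss_latency_ns(7.5, 2, 2, 2)
ddr3_miss = page_miss_latency_ns(1.5, 9, 9, 9)

sdram_word = burst_time_ns(7.5, 1, 1)    # one SDRAM word
sdram_burst = burst_time_ns(7.5, 2, 1)   # SDRAM burst of 2
ddr3_burst = burst_time_ns(1.5, 8, 2)    # DDR3 burst of 8, two transfers per clock

# Bursts from other banks needed to fill the data bus during a page miss.
sdram_bursts_to_cover = math.ceil(sdram_miss / sdram_burst)
ddr3_bursts_to_cover = math.ceil(ddr3_miss / ddr3_burst)

print(sdram_miss, ddr3_miss, sdram_word, ddr3_burst)
print(sdram_bursts_to_cover, ddr3_bursts_to_cover)
```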
Challenges in SOC DRAMs
Because the operating rates of the DRAM arrays have increased more slowly than those of the interface data, each DRAM-bank operation has a latency that spans many more interface data cycles on DDR3 than on earlier SDRAM devices. To keep the DRAM-data interface fully occupied for high efficiency, the SOC's DRAM controller must overlap the precharging and activation of pages in other banks with burst accesses to an open page in the current bank. This requirement means that the DRAM controller manages a pipeline of DRAM commands, and this pipeline becomes deeper as DRAM technology advances.
Figure 2 shows a sample schedule that a DRAM controller could implement to maintain high throughput during Bank 0's page miss. The schedule relies on cycling through five DRAM banks, assuming a page miss every other burst. This situation allows consecutive 8-word reads to column addresses CA and CI in the same page. To keep the schedule full and therefore minimize the efficiency losses that may result from page closing and opening, the DRAM controller must manage 10 column commands plus five precharge and five activate commands in a pipeline. For the SOC designer, managing such deep memory pipelines requires more complexity in the DRAM controller and the on-chip interconnect, as well as implementation of scheduling logic that feeds the controller.
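One way to see why a schedule of this kind needs about five banks is to ask how many banks a round-robin rotation must contain so that each bank has time to close one page and open the next before its turn comes around again. The sketch below is a simplified model, not the method behind Figure 2: it assumes a 49.5-nsec row-cycle time, which is a plausible value for a DDR3 device of this speed but is not given in the article, and it ignores other timing constraints.

```python
import math


def banks_needed(row_cycle_ns, burst_ns, bursts_per_page):
    """Minimum banks in a round-robin schedule that keeps the data bus full.

    Each bank is revisited every (banks * visit time); that interval must be
    at least the row-cycle time so the bank can close and reopen a page.
    """
    visit_ns = burst_ns * bursts_per_page
    return math.ceil(row_cycle_ns / visit_ns)


def commands_in_flight(banks, bursts_per_page):
    """Commands the controller tracks for one full rotation through the banks."""
    return {
        "column": banks * bursts_per_page,
        "precharge": banks,
        "activate": banks,
    }


# Assumed: 49.5-nsec row cycle; 6-nsec bursts and two bursts per page as in the text.
banks = banks_needed(49.5, 6.0, 2)
commands = commands_in_flight(banks, 2)
print(banks, commands)
```

With these assumptions, the model yields five banks and a pipeline of 10 column, five precharge, and five activate commands.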
Although pipelining the DRAM system maintains high efficiency and throughput, it also increases latencies for processors and other initiators on the SOC because a new request must normally wait for the pipeline of requests ahead of it to drain before receiving service. Thus, high DRAM efficiency and low latency are in conflict, and this conflict is particularly acute in SOCs, whose many processors and other initiators have varying burst characteristics, throughput requirements, and latency sensitivities. This situation forces the designers of many consumer SOCs, who must optimize for highest DRAM efficiency, to implement complex arbitration and scheduling techniques to balance the efficiency-versus-latency trade-offs for different classes of initiators.
For instance, an LCD controller can naturally fetch an entire scan line's worth of pixels that it will display. This traffic pattern looks like a long incrementing burst, which is nearly ideal for maximizing DRAM efficiency because it reads most or all of the bits in one page before moving on to the next one. However, making a general-purpose CPU that is serving a cache miss wait behind such a large sequence of DRAM requests substantially reduces CPU performance and could even prevent the CPU from serving a critical interrupt in a timely manner. Thus, SOC designers would normally break up the scan-line fetch into smaller bursts so that latency-sensitive CPU requests can interleave between these bursts.
When the interleaved CPU request targets a different DRAM bank from the scan line's bank, the DRAM controller can schedule the requests without losing efficiency. If the CPU targets the same bank, however, the CPU's request requires closing the scan line's page so that the DRAM can open the CPU's page. After the DRAM services the CPU request, the chip must close the CPU's page so that it can reopen the scan line's page. If the CPU has another cache miss on the same page, this process will recur and waste substantial DRAM efficiency. This situation, page thrashing, can be so inefficient that the CPU would have achieved higher performance by waiting until the scan-line accesses had finished—that is, not interleaving in the first place.
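Page thrashing is easy to reproduce in a small model that counts how many requests force a new page to open. The sketch below uses hypothetical bank and row numbers and treats every request as a single access; it counts page openings only and does not model timing.

```python
def count_page_misses(requests):
    """Count requests that need a new page opened, given (bank, row) pairs in order."""
    open_rows = {}
    misses = 0
    for bank, row in requests:
        if open_rows.get(bank) != row:
            misses += 1
            open_rows[bank] = row
    return misses


LCD = (0, 10)          # scan-line fetch: bank 0, row 10 (hypothetical)
CPU_SAME = (0, 20)     # CPU cache miss in the same bank, different row
CPU_OTHER = (1, 20)    # CPU cache miss in a different bank

same_bank = count_page_misses([LCD, CPU_SAME] * 4)
other_bank = count_page_misses([LCD, CPU_OTHER] * 4)
not_interleaved = count_page_misses([LCD] * 4 + [CPU_SAME] * 4)

print(same_bank, other_bank, not_interleaved)
```

Interleaving into the same bank turns every request into a page miss, whereas interleaving into another bank, or not interleaving at all, opens only two pages.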
Because the DRAM system must deliver the combined throughput requirements of several initiators and because higher-efficiency DRAM requires longer bursts, it follows that the DRAM system normally transfers data at higher peak bandwidth than most initiators—with the notable exception of the CPU—require. Most initiators therefore include FIFO (first-in/first-out) buffers so that the DRAM can efficiently service their requests into or out of the FIFO buffer, allowing the initiator to move data at lower bandwidth on the other side of the FIFO buffer. For those initiators requiring guaranteed throughput from DRAM, these FIFO buffers provide latency tolerance by covering the communication requirements of the initiator for the time it takes the initiator to drain or fill the FIFO. Deeper buffers make the initiator more latency-tolerant, giving SOC designers more flexibility in scheduling the initiator's traffic to DRAM at the cost of additional buffering area. For reads, such an architecture relies on the initiator's providing burst requests far in advance of needing the data; this approach is simply another form of pipelining.
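The buffering trade-off can be estimated with a first-order rule: the FIFO must hold at least as much data as the initiator moves during the latency it has to tolerate. The sketch below rounds that amount up to whole DRAM bursts; the function name and the 250-Mbyte/sec, 1- and 2-microsecond, and 64-byte figures are hypothetical.

```python
import math


def fifo_depth_bytes(initiator_mbytes_per_s, tolerated_latency_us, burst_bytes):
    """Buffer size that covers the initiator's traffic for the tolerated latency.

    Mbytes/sec multiplied by microseconds gives bytes; the result is rounded
    up to a whole number of DRAM bursts.
    """
    bytes_needed = initiator_mbytes_per_s * tolerated_latency_us
    return math.ceil(bytes_needed / burst_bytes) * burst_bytes


# Hypothetical display controller: 250 Mbytes/sec, 64-byte DRAM bursts.
shallow = fifo_depth_bytes(250, 1, 64)
deep = fifo_depth_bytes(250, 2, 64)
print(shallow, deep)
```

Doubling the tolerated latency doubles the buffer in this model, which is the area cost the paragraph above refers to.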
The prefetch architecture of DDR3 raises a new challenge for consumer SOCs. As the minimum efficient burst lengths of DRAMs increase and the high bandwidth requirements of SOCs force wide DRAM words, the minimum efficient DRAM burst can reach 64 bytes or higher. Initiators access data structures of various sizes and with various access patterns. Of particular interest are CPU-cache lines and MPEG macroblocks, each of which is relatively small and has effectively random access patterns. Initiators' fetches that are smaller than the DRAM burst waste DRAM-data transfers, and the system operates less efficiently. Today, this access-granularity problem is particularly acute in HDTV and STB applications, but it is also becoming common in other video-capable SOCs. SOC designers often address the access-granularity problem by reducing the size of the DRAM burst in one of two ways: rebuilding their designs to use less DRAM bandwidth (for instance, by adding on-chip memory) or splitting the DRAM system into multiple independent channels, in which each channel is narrower and thus has a smaller minimum burst size.
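The cost of the access-granularity problem can be expressed as the fraction of transferred bytes that the initiator actually wanted. The sketch below assumes a hypothetical 32-byte fetch and compares one 64-bit channel (64-byte minimum burst at a burst length of 8) with two independent 32-bit channels (32-byte minimum burst each).

```python
import math


def transfer_efficiency(request_bytes, min_burst_bytes):
    """Useful bytes divided by bytes actually moved across the DRAM interface."""
    bursts = math.ceil(request_bytes / min_burst_bytes)
    return request_bytes / (bursts * min_burst_bytes)


# Hypothetical 32-byte fetch, such as a small cache line.
one_wide_channel = transfer_efficiency(32, 64)      # 64-bit channel, burst of 8
two_narrow_channels = transfer_efficiency(32, 32)   # 32-bit channel, burst of 8
print(one_wide_channel, two_narrow_channels)
```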
A final challenge facing SOC designers is the variety of operating modes in consumer-electronics devices. Cost concerns preclude optimizing for the sum of the worst-case bandwidth requirements of each initiator across all modes, so the designer must instead independently consider each key operating mode. In each mode, the various initiators have different performance requirements, and optimum scheduling and buffering choices often differ across modes. Because a single SOC must support these modes, the designer must eventually choose an architecture and design parameters that cover these needs. Sometimes, designers leverage programmable scheduling and other features to allow runtime optimization of the SOC for the different modes.
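The difference between designing for the sum of worst cases and designing per mode can be illustrated with a small table of bandwidth needs. All initiator names, mode names, and Mbyte/sec figures below are hypothetical.

```python
# Hypothetical bandwidth needs in Mbytes/sec per initiator in each operating mode.
MODES = {
    "video_playback": {"cpu": 200, "video_decoder": 900, "display": 400, "camera": 0},
    "video_record": {"cpu": 300, "video_decoder": 0, "display": 400, "camera": 700},
}


def per_mode_totals(modes):
    """Total DRAM bandwidth each mode needs when considered on its own."""
    return {name: sum(loads.values()) for name, loads in modes.items()}


def sum_of_worst_cases(modes):
    """Bandwidth needed if every initiator hit its worst case at the same time."""
    initiators = {name for loads in modes.values() for name in loads}
    return sum(max(loads.get(name, 0) for loads in modes.values()) for name in initiators)


totals = per_mode_totals(MODES)
design_target = max(totals.values())
pessimistic = sum_of_worst_cases(MODES)
print(totals, design_target, pessimistic)
```

In this example, sizing for the most demanding single mode requires 1500 Mbytes/sec, whereas sizing for the sum of worst cases would require 2300 Mbytes/sec.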
Many challenges face designers of consumer SOCs when they are considering the implications of external-DRAM systems. Designers must select DRAM channel counts, burst lengths, arbitration and scheduling policies, and FIFO-buffer depths to carefully optimize their design across a range of key operating modes of the end applications. Although the underlying DRAM technology continues to deliver both higher capacity and peak bandwidth, these features bring with them additional latency and complexity. The wide variety of initiators and associated traffic requirements of the SOC, along with the extreme cost sensitivities of consumer markets, make these challenges more difficult. These challenges force designers to operate DRAMs at the highest achievable efficiency. Commercial products to address these problems are available from Sonics and other manufacturers, and you should consider them when embarking on new consumer-SOC development.

Drew E Wingard, PhD, co-founded Sonics in September 1996 and has been chief technical officer and secretary since March 1997. He is also a member of the board of directors. Before co-founding Sonics, Wingard led the development of advanced circuit and CAD methodology for MicroUnity Systems Engineering Inc. He also co-founded and worked at Pomegranate Technology from 1992 to 1994. Since December 2001, Wingard has served as secretary and a director of the OCP-IP (Open Core Protocol-International Partnership), a nonprofit trade organization. He received a bachelor's degree in electrical engineering from the University of Texas—Austin and master's and doctorate degrees in electrical engineering from Stanford University (Stanford, CA).

