October 9, 1997
Advanced DRAM puts you in the fast lane
Brian Dipert, Technical Editor
Direct RDRAM, which targets 1999 and later PCs, is now public. Is the race for PC main memory over, or are double-data-rate SDRAM and SLDRAM still in the running? And is there room for other DRAM architectures in the diverse embedded-application base?
Beginning most noticeably with the emergence of the Intel486 CPU, a growing disparity has developed between the performance of DRAM and the bandwidth needs of PCs and embedded systems. On-CPU clock multiplication widens the gap between memory capability and system needs even further than external clock frequencies would indicate. But an array of new DRAM architectures offers you a number of options for relieving the memory-performance bottleneck.
One common system-level technique to overcome DRAM's limitations is to increase the size and complexity of levels 1 and 2 (L1 and L2) SRAM caches. Beyond the diminishing returns and higher cost of this approach, the processor's cache does nothing to improve the performance of other subsystems that access DRAM. In today's PCs, for example, bus traffic also flows across the PCI bus between DRAM and SCSI, IDE, ISA, USB, and, soon, Firewire-based peripherals. These data streams include network traffic and video and high-fidelity audio inputs from digital-video disk (DVD) and hard-disk drives.
Because caching around the DRAM is an insufficient response to increasing memory-subsystem bandwidth demands, how can DRAM manufacturers improve their devices' inherent performance? (See box "The fill frequency.") Combinations of the following techniques, each with its trade-offs, commonly find use in squeezing maximum bandwidth from the DRAM subsystem:
Intel predicts that main-memory subsystems will need to provide as much as 1.6 Gbytes/sec of peak bandwidth by early 1999, when a range of higher performance PCs will appear. Similarly, the first-generation AGP specification, which uses a 32-bit, 66-MHz bus, calls for 266-Mbyte/sec maximum bandwidth between the chip set and graphics controller. You can assume that AGP-equipped PCs will have similarly stringent bandwidth needs between the graphics controller and frame buffer. AGP's 2X mode transfers data on both rising and falling edges of the 66-MHz clock, translating to 533-Mbyte/sec bandwidth. The future 4X mode will achieve 1.1 Gbytes/sec, probably by using higher clock frequencies. In comparison, 66-MHz synchronous DRAM (SDRAM) on a 64-bit system bus delivers only 533-Mbyte/sec maximum performance. Sustained bandwidth is lower because of address, control, and non-page-access-cycle overhead. Workstations and embedded systems--such as CD-ROM and disk drives, digital set-top boxes, and imaging and communications equipment--will soon have similar bandwidth needs if they don't already. Embedded-system designers are also interested in cost-effective alternatives to the fast SRAM that they traditionally use. In evaluating advanced DRAM architectures to use in your next-generation designs, consider both technical attributes of the memories and related economic variables. These factors include the potential for the memory's use in high-volume applications and the number of vendors shipping or planning to include the memory in their product lines. The DRAM market, valued at approximately $25 billion according to Walt Lahti, principal analyst at InStat (Scottsdale, AZ), is one of the key technology-development engines in the roughly $120 billion semiconductor-IC industry. With PCs (and, therefore, Intel) directly influencing approximately two-thirds of the overall DRAM market, you should not ignore PC trends. 
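The peak figures quoted above for AGP and 66-MHz SDRAM are simple products of bus width, clock rate, and transfers per clock. As a rough check, the following Python sketch reproduces them. The function name is hypothetical, and the sketch assumes that the nominal "66-MHz" clock is 66.67 MHz and that 1 Mbyte means 10^6 bytes.

```python
def peak_bandwidth_mbytes(bus_bits, clock_mhz, transfers_per_clock=1):
    """Peak transfer rate in Mbytes/sec (1 Mbyte = 10**6 bytes).

    Ignores address, control, and page-miss overhead, so sustained
    bandwidth will be lower than the value returned here.
    """
    return bus_bits / 8 * clock_mhz * transfers_per_clock


# Assumption: the "66-MHz" clock is nominally 66.67 MHz.
NOMINAL_66_MHZ = 200 / 3

agp_1x = peak_bandwidth_mbytes(32, NOMINAL_66_MHZ)
agp_2x = peak_bandwidth_mbytes(32, NOMINAL_66_MHZ, transfers_per_clock=2)
sdram_66 = peak_bandwidth_mbytes(64, NOMINAL_66_MHZ)

for name, value in [("AGP 1X", agp_1x), ("AGP 2X", agp_2x),
                    ("66-MHz SDRAM, 64-bit bus", sdram_66)]:
    print(f"{name}: about {int(value)} Mbytes/sec peak")
```

Doubling either the bus width or the number of transfers per clock has the same effect on the peak number, which is why a 32-bit AGP 2X link and a 64-bit, 66-MHz SDRAM bus come out equal.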
Also, you should consider the presence and diversity of chip-set and integrated-processor memory-controller support, as well as whether realistic product-sampling and production schedules exist at necessary densities, bus widths, speeds, voltages, and electrical interfaces to meet your requirements. Finally, carefully examine the assumptions that various vendors' benchmark claims make for suitability to your application.
Main-memory options
Main-memory accesses predominantly consist of frequent but short code-read bursts for cache-line fills, although growing data traffic is beginning to alter this generalization. Today's systems have high cache-hit percentages. Also, accesses scattered throughout the available memory space result from software that was developed with modular languages, such as C++, and that runs on multitasking OSs. These factors create a strong possibility that each DRAM access won't be to an already-open row page stored in the DRAM's sense-amp array. These factors also increase the probability that the µP won't subsequently need some of the locations it accesses during a cache-line fill. Therefore, optimizing random-access latency is important for main memory, at least as important as the time necessary for subsequent accesses within an open page.
Many high-end PCs still use extended-data-out (EDO) DRAM because, at 66 MHz, SDRAM can require an extra wait state for initial accesses, although subsequent fetches within a burst take one fewer wait state. The resultant clock profiles for a four-access burst cache-line fill using 60-nsec fast-page mode (FPM) DRAM, EDO DRAM, and 66-MHz SDRAM show limited bandwidth differences (Table 1). The table also shows L2 cache for comparison. Estimates for the resultant system-performance gain average only approximately 2%, with 5% at the high end. 
The SDRAM performance improvement over EDO should be more significant at 100 MHz, and the zero-wait-state burst accesses also will prove valuable for accelerating long multimedia-data transfers. Pipelined and speculative prefetch µPs, such as Intel's Pentium II CPU, allow the memory controller to exploit SDRAM's multibank architecture for maximum efficiency. These combined trends should effectively conclude the era of EDO, at least for future generations of PCs. In spite of SDRAM's limited performance improvement at 66 MHz, some PC manufacturers use it both as a marketing differentiator, because DRAM companies now provide it to them at little to no price premium, and as a way to gain experience for the upcoming 100-MHz SDRAM generation. According to several DRAM vendors, price parity between EDO and SDRAM has been delayed not necessarily by extra silicon cost but by yield. A 66-MHz SDRAM is roughly timing-equivalent to a 50-nsec EDO, and no demand exists for a product that doesn't yield to 66 MHz. Hitting these speed targets with near-100% yield and cost-effective die size usually requires at least a 0.4-µm process lithography. Unclear system-performance benefits are the least of SDRAM's challenges. Manufacturer-to-manufacturer variations in power-up and initialization sequences, I/O impedance, output-buffer strength, column-address-strobe (CAS#) latencies, and clock-to-data and other timings wreak havoc with system engineers attempting to purchase product and with end users attempting to upgrade their systems. The situation is better than it was in the early days. (SDRAM definition began in the late 1980s!) Back then, some SDRAMs used a single-bank internal architecture, whereas others used a two-bank approach. Early SDRAMs also offered level- and edge-triggered row-address-strobe (RAS#) alternatives and had variations in supported burst lengths. 
Incompatibilities are still so bad, however, that Intel has issued a document, commonly called "PC-66," with commands, modes, and specs more stringent than those in the JEDEC documentation. Intel requires DRAM manufacturers to supply devices that meet PC-66 to ensure compatibility with Intel's chip-set and motherboard designs. Intel has also developed a similar document for upcoming 100-MHz SDRAMs, which promise to be even more challenging. Remember that at 100 MHz, the total clock period will be 10 nsec with additional deductions for rise and fall times, chip-to-chip and signal-to-signal skew, and signal-propagation delay. Some of the more critical aspects of the PC-66 and PC-100 SDRAM specifications include setup, hold, and clock-to-output timings (Table 2).
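To see why the 100-MHz generation is harder, it helps to write the timing budget as arithmetic: start from the clock period and subtract each deduction. The sketch below is only an illustration. The function name is hypothetical, and the deduction values are placeholders chosen to show the calculation; they are not PC-66 or PC-100 numbers.

```python
def timing_margin_ns(clock_mhz, deductions_ns):
    """Clock period minus the sum of all deductions, in nanoseconds."""
    period_ns = 1000.0 / clock_mhz
    return period_ns - sum(deductions_ns.values())


# Hypothetical deductions, for illustration only (not specification values).
example_deductions = {
    "rise_and_fall": 1.0,
    "skew": 1.0,
    "propagation": 2.0,
}

margin_66 = timing_margin_ns(200 / 3, example_deductions)   # 15-nsec period
margin_100 = timing_margin_ns(100, example_deductions)      # 10-nsec period

print(f"66 MHz:  {margin_66:.1f} nsec left for setup, hold, and clock-to-output")
print(f"100 MHz: {margin_100:.1f} nsec left for setup, hold, and clock-to-output")
```

With the same fixed deductions, the shorter period leaves a much smaller share of the cycle for the memory's own setup, hold, and clock-to-output times.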
Some memory modules include delay-locked loops (DLLs) or PLLs to factor out clock-propagation delays between the controller and the memory-module socket. These circuits ease read-timing requirements but increase cost and power consumption and complicate power management. Register-buffered modules, an option most attractive in memory-intensive servers and workstations, present a consistent load to the core logic, regardless of the number and density of memory chips. However, the registered data-bus delays, complicated by bidirectional data-flow requirements, further increase the random-access latency. Bill Johnson, marketing director at Smart Modular Technologies, believes that PC OEMs hoping to avoid incompatibility problems are increasingly standardizing on a few memory and module suppliers and requiring customers to purchase memory upgrades directly from OEMs. Modules now contain serial-presence-detection EEPROMs (JEDEC standard 21-C) that provide more information than the previous parallel-presence-detection pins. Johnson predicts that some PC manufacturers, by means of the BIOS, could use this information to alert their support organizations to the presence of invalid memory modules. For end users, having to purchase memory upgrades directly from the manufacturer would reduce the number of available supply options and could result in higher prices. One other issue confronting SDRAM and other more traditional main-memory options is granularity, which is the minimum base-system memory density and the minimum density-upgrade increment. As this article went to press, a number of Korean and Japanese DRAM vendors were intentionally shifting focus from 16-Mbit-DRAM production to the upcoming 64-Mbit generation. 
The main reasons for doing so include gaining more experience with new sub-0.3-µm manufacturing lines, disengaging from the intensely price-competitive 16-Mbit market, and attempting to use supply to move customer demand to the less crowded and more financially attractive 64-Mbit density. However, even with a ×16 DRAM component-bus width, the minimum system granularity using 64-Mbit DRAM is 32 Mbytes for one ×64 system bank or 64 Mbytes for the more common dual-bank configuration (Table 3). Four-bank DRAM subsystems, such as those anticipated for the PC-100 SDRAM generation, will have even higher granularity. Even in an era in which OSs and applications consume tens of megabytes, this level of system granularity may be unacceptable, especially for entry-level PCs. This fact is one of several reasons that some DRAM manufacturers plan to offer an incremental 128-Mbit DRAM generation between the traditional 64- and 256-Mbit densities. Other incentives include the ability to preserve much of the manufacturing equipment and methodologies that vendors use with 64 Mbits, as well as the ability to fit the die into existing packages. Granularity is a critical issue for embedded systems, whose density requirements often lag several generations behind those of PCs. One option for embedded-system designers is migration to other memory technologies, such as SRAM and flash memory. Another possibility is to integrate DRAM on ASICs. This option is also attractive to ASIC manufacturers as they strive to create demand for the gate counts that their leading-edge manufacturing can deliver. Granularity, as it relates to chip- and system-bus width, can also significantly affect power consumption. Consider a 32-Mbyte, 64-bit data-bus system requirement. Using 64-Mbit SDRAMs that are each ×16, you need four chips, all of which consume active power each time the system accesses the DRAM bus. 
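The granularity arithmetic above is worth making explicit: the chip count per bank is the system-bus width divided by the chip width, and the minimum memory increment is that count times the chip density. The following Python sketch works through the numbers in the text. The function name is hypothetical, and the sketch assumes every chip in a bank has the same density and width.

```python
def bank_granularity(chip_mbits, chip_width, bus_width=64):
    """Return (chips per bank, Mbytes per bank) for one system bank.

    Assumes identical chips whose widths add up exactly to the bus width.
    """
    if bus_width % chip_width:
        raise ValueError("bus width must be a multiple of chip width")
    chips = bus_width // chip_width
    mbytes = chips * chip_mbits // 8   # 8 bits per byte
    return chips, mbytes


chips, mbytes = bank_granularity(chip_mbits=64, chip_width=16)
print(f"One x64 bank of 64-Mbit x16 parts: {chips} chips, {mbytes} Mbytes")
print(f"Dual-bank configuration: {2 * mbytes} Mbytes")
```

Narrower chips make the problem worse: the same call with `chip_width=8` needs eight chips and yields a 64-Mbyte minimum for a single bank.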
With an alternative ×16 Direct RDRAM or SLDRAM (formerly, SyncLink) system bus, you'll still need four chips, but only one will be active each DRAM-bus access. Even though this ×16 access will run at much higher clock frequencies to achieve equivalent bandwidth, the lower total capacitance due to the narrower bus will tend to balance out the dynamic output power (P=CV²f). Fewer active chips mean lower average internal power consumption per chip, and the end result may be lower total power consumption. These differences decrease in significance as system memory-density requirements grow.
DDR SDRAM
DDR supporters make impressive bandwidth claims at high data-transfer frequencies and wide data-bus widths (Table 4). Realistically, however, this potential will be limited by address- and control-cycle overhead and by the fact that address cycles will use only the rising edge of the memory clock, giving them half the bandwidth of data cycles. For this reason, most DRAM manufacturers planning to offer DDR will bypass the 66/133-MHz option and go directly to the 100/200-MHz version. The 66/133-MHz DDR would arrive after Intel's planned conversion to a 100-MHz local-bus frequency and would also offer uncertain performance advantages over standard 100-MHz SDRAM. These high frequencies may require differential clocks, dedicated data strobes, and migration from low-voltage TTL (LVTTL) to low-swing stub-series terminated logic (SSTL), at least on data and clock signals. Planned conversions from a two- to a four-internal-bank configuration, in combination with memory-controller designs that exploit this configuration's performance capability, will help hide bank random-access, precharge, and refresh latency. DDR will use an internal data bus twice as wide as the external bus, accessing two locations at the same time to achieve performance targets. This wider internal bus may increase die size and cost relative to a standard SDRAM; however, Direct RDRAM and SLDRAM will be even wider. Some DRAM manufacturers have limited confidence at this early stage that customers designing high-volume upgradable systems will be able to reliably use the 133/266- or 150/300-MHz speeds, even with SSTL-3 or -2 interface levels. This lack of confidence leads to worry that 100/200-MHz DDR will be a "point" product, which is a big concern for memory, chip-set, and system manufacturers making difficult time, money, and resource-allocation decisions. 
Minimal-chip designs with point-to-point connections between the DRAM and controller and with little to no upgrade requirement, such as graphics frame-buffer memory, are more straightforward applications for the higher DDR speeds. S3 Corp (Santa Clara, CA) is one of several graphics vendors working on DDR JEDEC standardization. Because DDR is conceptually so similar to standard SDRAM, it uses much of SDRAM's test, assembly, and board- and module-manufacturing infrastructure. DDR is also an open architecture that promises to be a simple transition for both memory and chip-set manufacturers. However, Intel states that it plans no support of DDR in any upcoming chip sets. Even without Intel's involvement, which may change in the future, DDR may find sufficient interest to ensure at least some market success. Potential DDR adopters include other PC µP suppliers, such as AMD (Sunnyvale, CA); non-Intel chip-set manufacturers, such as Via Technology (Fremont, CA); and workstation and server companies. Workstations are attractive because they are just beginning to migrate from EDO to SDRAM. Workstations and servers also tend to have longer development and product life cycles than do PCs, and they use more proprietary and complex DRAM-subsystem architectures to achieve performance and density requirements. From the burst-EDO and "SDRAM-lite" lessons, however, you might infer that DDR without Intel's blessing could be nothing more than a niche product. Fundamental DDR characteristics, such as clocking schemes, voltage levels, and pinouts, are still under vigorous debate in JEDEC, and multiple conflicting and incompatible "standards," if they appear, will only cause further confusion and slow adoption. A few other DDR concerns are worth mentioning. The conversion from LVTTL to SSTL interface levels might cause end-user frustration during memory upgrades. Also, DDR will contain internal PLLs to remove memory-clock skew. 
This fact could complicate system power-management functions, because PLLs tend to have slow, unpredictable response to power transitions and input-clock suspension and resumption. In addition, DDR may use a unidirectional (read-only) or bidirectional data strobe, or "echo clock," traveling with each 8- or 16-bit data word to synchronize cycles between the memory and controller (Figure 3). These data-strobe signals (DS), along with the differential clock and 2.5V SSTL-2 supply voltage under consideration for 133/266- and 150/300-MHz speeds, take up pins reserved for error-correction-code (ECC) functions on today's DIMMs and create the potential for further module incompatibility. The ×32 or larger external data bus needed for reasonable system granularity at 64-Mbit densities and above will increase die size and cost. With multiple outputs transitioning simultaneously, the data bus may also cause so much noise and burn so much power that it would overwhelm any potential performance improvement. Enhanced Memory Systems' Enhanced DRAM (EDRAM) architecture includes a number of features designed to maximize sustainable memory bandwidth in systems with small or nonexistent caches. Today's parts combine a 25-nsec, random-access, 4-Mbit DRAM array with a 10-nsec, 2- or 8-kbit SRAM cache on one chip. The wide, 2-kbit internal interface between DRAM and SRAM arrays also helps boost performance over standard DRAMs. Interface options for the original EDRAM included static-column and FPM. Enhanced Memory Systems slightly modified standard pinouts and functions to maximize the effectiveness of the integrated cache. Chip select (S#) enables the memory controller to access the SRAM while precharging the DRAM. Write/read (W/R#) control avoids closing the SRAM page when writing to the DRAM array, and internal logic updates both DRAM and SRAM, if necessary, to maintain coherency. 
Finally, a dedicated refresh pin (F#) allows hidden refresh without a CAS#-before-RAS# cycle, which would close the open SRAM page. Recently, EDRAM evolved to a multibank architecture with EDO and burst-EDO options. Along with the performance improvement that the new interfaces provide, multibank EDRAM subdivides the DRAM and SRAM into four internal banks. This evolution continues the trend toward hiding precharge and refresh and retains the pinout enhancements in the original EDRAM. Multiple banks further improve sustainable bandwidth by increasing the probability that randomly desired code or data will already be in the partitioned SRAM cache. Reference designs, documentation, and timing-analysis tools, available on Enhanced Memory Systems' Web site, minimize the effort needed to interface EDRAM to a variety of PC and embedded processors. The company plans to offer its first SDRAM-interface (Enhanced SDRAM, or ESDRAM) 16-Mbit devices for sampling by year-end. The devices will be as wide as ×16 and as fast as 133 MHz and contain two DRAM and two 4-kbit cache banks. ESDRAMs will be fully pinout-compatible with SDRAMs. They will still support the ability to hide DRAM refresh concurrently with cache accesses by means of software-command enhancements beyond the standard SDRAM suite. The company also plans 64-Mbit and DDR variations of ESDRAM for 1998. VLSI Technology's (San Jose, CA) Polaris core-logic chip set for Alpha µPs will support ESDRAM. EDRAM's success in graphics and embedded-system applications and as a fast-SRAM alternative originates in its ability to provide higher performance than standard DRAM while preserving some compatibility with standard DRAM interfaces. As an evolutionary alternative to DDR for SDRAM designs, ESDRAM will be subject to the same power, timing, granularity, and other concerns. 
However, although the lack of extensive supply sources will tend to keep ESDRAM prices higher than more mainstream alternatives, it may also prove to be a hidden blessing. Fewer supply sources mean less potential for functional and specification variability, and the smaller system-memory densities common in embedded-system designs also tend to increase design margin. ESDRAM also has good potential to meet Intel's PC-100 SDRAM timing and functional requirements, although the fact that ESDRAM is entering the 16-Mbit density as many other DRAM vendors make the transition to 64 Mbits may impact its PC success.
Direct RDRAM
RDRAMs interface with the Rambus memory controller in a packet-based fashion. Base and Concurrent RDRAMs transfer address, data, and control information across the Rambus channel via a common set of eight or nine pins that are synchronized with the clock by careful impedance and trace-length matching. After the memory controller initiates a read or write request, Base RDRAMs respond with an acknowledge (ACK) packet if the desired addresses are in the sense-amp cache and the transaction can proceed. Base RDRAMs respond with a no-acknowledge (NACK) packet if the locations must be loaded to the sense amps (which the memory automatically does), indicating that the memory controller should retry the transaction later. During this delay, the memory controller can initiate a request to another RDRAM, if possible, to better exploit available bandwidth. The Rambus-controller design includes a demultiplexer to convert the high-speed, 8- or 9-bit channel back to its 64- or 72-bit lower frequency alternative within the system controller. Reliable RDRAM operation requires careful pc-board layout, especially in multichip designs, to keep the total channel length as short as possible and to eliminate differences in trace length and impedance between signals. Achieving this goal ensures that signal-to-signal skew is as small as possible. The maximum number of chips per Rambus channel is 32. This restriction is one of the key Rambus-architecture concerns for high-end servers and workstations, although you can overcome it by adding channels to the memory controller, which increases pin count, or by providing a channel-to-channel bridge chip, which impacts initial access latency. The close chip-to-chip placement, combined with frequency-driven high dynamic-power consumption, also creates thermal-dissipation challenges that did not exist with SIMMs and DIMMs. 
However, by enabling signal-propagation delays between the memory and controller to span multiple clock periods, the Direct RDRAM specification allows the channel (and therefore spacing between chips) to be substantially longer than that of previous-generation RDRAMs. The second-generation Concurrent RDRAM protocol makes better use of the bus bandwidth than does Base RDRAM, and the protocol forms the basis of today's third-generation Direct RDRAM. The Rambus controller can now initiate as many pipelined transactions within one RDRAM as there are banks, creating the potential for unlimited zero-wait-state bursts. The Concurrent protocol also eliminates the complex multitransaction ACK/NACK handshake, requires fewer internal registers and counters, and more effectively uses the available serial-communications channel.
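For contrast with the Concurrent protocol, the Base RDRAM ACK/NACK handshake described earlier can be modeled as a toy simulation. This is a minimal sketch, not the Rambus protocol itself: the class and function names are hypothetical, each chip holds a single open row, and a row load is assumed to complete after one pass through the request queue. Real packet formats and timing are not modeled.

```python
class BaseRdram:
    """Toy model of one chip with a single row in its sense-amp cache."""

    def __init__(self):
        self.open_row = None
        self.loading_row = None

    def request(self, row):
        """Reply ACK on a sense-amp hit; otherwise start a row load and NACK."""
        if row == self.open_row:
            return "ACK"
        self.loading_row = row   # the memory loads the row automatically
        return "NACK"

    def tick(self):
        """Complete any pending row load (assumed one pass of latency)."""
        if self.loading_row is not None:
            self.open_row = self.loading_row
            self.loading_row = None


def run(requests, chips):
    """Issue (chip_id, row) requests in order, retrying NACKed ones later."""
    log = []
    pending = list(requests)
    while pending:
        retry = []
        for chip_id, row in pending:
            reply = chips[chip_id].request(row)
            log.append((chip_id, row, reply))
            if reply == "NACK":
                retry.append((chip_id, row))   # controller retries later
        for chip in chips.values():
            chip.tick()
        pending = retry
    return log


log = run([(0, 5), (1, 7)], {0: BaseRdram(), 1: BaseRdram()})
for entry in log:
    print(entry)
```

In this run the controller does not stall on the first NACK; it moves on to the second chip while the first loads its row, then retries both, which is the bandwidth-recovery behavior the text describes.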
Direct RDRAMs incorporate an internal 128-bit bus, and, at the 64-Mbit density, they will include four internal DRAM banks. The Direct RDRAM specification allows for a range of sense-amp and DRAM array sizes as well as any number of banks. You can determine the ideal number of banks for a chip density and application by examining the anticipated row- and column-access delays within the chip, Rambus-channel clock-frequency target, and maximum commonly occurring burst length. Too few banks will "starve" the channel during long access bursts, but too many banks will make the memory unnecessarily expensive. A 16-bit system interface may first seem a counterintuitive means of increasing performance, considering the ever-widening CPU data-bus trends. However, this approach delivers several advantages over a traditional 64-bit or wider alternative, starting with the previously described power, noise, EMI, and system-memory granularity. Using fewer signal, supply voltage, and ground pins reduces package and die costs for both the controller and the memory, simplifies pc-board layout, and minimizes real estate. Concurrent RDRAM has 31 such pins, and Direct RDRAM has approximately 76, compared with approximately 140 for SDRAM and approximately 160 for SDRAM-II (see box "Narrow-bus, high-bandwidth memories"). A low incremental pin count per additional channel, combined with predictable clock-frequency improvements as process lithographies continue to shrink, gives Direct RDRAM performance head room. These factors, along with the technology's lower risk by using concepts already proven in Concurrent RDRAM silicon, were key influences on Intel's decision to support Direct RDRAM, according to Ahmad. He says, "Intel's approach is that solutions are realistic when they exist." 
Subodh Topraini, vice president of marketing for Rambus, reports that the company has completed its first Direct RDRAM design, and simulations show that the I/O buffers have performance capability greater than 1 Gbps. Many Direct RDRAM concerns come from DRAM and chip-set manufacturers and deal with economic and political issues, not necessarily technical shortcomings. Rambus is a memory-design company, not a memory manufacturer, and it makes its money by collecting royalties from the DRAM and chip-set vendors that license and produce its products. Memory companies worry about loss of innovation, differentiation, and control in setting future standards, as well as sacrificing the perceived benefit of multiple vendors' perspectives in tackling system challenges. Some memory companies also predict low initial yields to the 400-MHz clock specification, as well as higher memory and core-logic (and, therefore, system) prices due to the royalty payments and the possibility of Intel's re-entry into the DRAM market. Logic companies also see the potential for Intel to translate its Rambus alliance into further success in the PC chip-set arena. These logic companies even extrapolate to the prediction that Intel will integrate a Direct RDRAM controller directly on a future µP. NEC, with its V830R/AV, has already integrated a Concurrent RDRAM interface on a CPU. Regardless, more than 20 memory and core-logic companies, including the top 10 DRAM manufacturers, have taken Rambus licenses. Some non-PC-system companies also worry that Rambus and Intel will not consider their needs when making Direct RDRAM architecture decisions for the PC. For example, high-reliability servers and communications equipment can use a ×72 or ×80 system-memory bus to spread data and ECC bits among multiple chips, enabling a well-behaved system shutdown if an entire memory chip malfunctions (a hard error). 
Because the Rambus channel accesses all information from one memory, a hard error in that memory chip could cause a system crash. Individual RDRAMs do not have mean-time-between-failure rates inherently inferior to those of standard DRAMs, however. Failure rates depend on how often the DRAM is accessed, so efficient processor caching helps. After several years of moving away from ECC for cost reasons, PCs with SDRAM are again using it, primarily in response to large per-chip densities and higher clock frequencies. Both of these factors increase the probability of an occasional soft bit-level error due to alpha particles or noise.
SLDRAM
SLDRAM-architecture definition efforts, which had been slowly progressing for several years, accelerated early this year in response to the Intel/Rambus announcement at the International Solid State Circuits Conference in February. SLDRAM developed from two previous IEEE high-speed bus standards: the 1596 Scalable Coherent Interface (SCI) and the 1596.4 RamLink, an SCI subset that removed multiprocessor and other features that the IEEE committee judged unnecessary for the target applications. SLDRAM further modified the point-to-point RamLink interface by optimizing for multichip DRAM arrays, a maximum 64-byte burst length for high-end CPU cache-line fills, and a 3-to-1 average read/write-access ratio.
Farhad Tabrizi, SLDRAM Consortium chairman and director of strategic marketing at Hyundai, claims that push-pull outputs use less power than open-drain outputs, especially at high frequencies. (You can reach the Consortium at www.sldram.com.) The actual power use depends somewhat on access profile, bus loading, and other assumptions. Push-pull outputs consume dynamic power with each output transition, whereas open-drain outputs draw extra output current through the chip only when pulling the outputs to a logic-low voltage. RDRAM vendors may need to use larger open-drain pulldown transistors to overcome the passive-only pullup termination, but SLDRAM will consume constant power across the series-stub resistors. Differences in termination impedance also determine which DRAM has higher average power burn. Some SLDRAM supporters also feel that push-pull outputs in combination with series-stub resistors, by presenting a lower impedance signal load, may be more tolerant of transmission-line reflections, such as those on the longer traces of heavily loaded systems. As an additional concession to servers, workstations, and other high-chip-count applications, the SLDRAM memory controller can measure the round-trip signal delay and voltage-level differences between it and the various SLDRAMs. The controller can then tune the SLDRAMs' input threshold voltages, output-driver strengths, dc-offset voltages, and turn-on timing characteristics to minimize or eliminate skew due to trace-length and chip-to-chip variations in today's SDRAM. These techniques, assisted by SLDRAM's push-pull output structure, are common in today's high-speed chip and board testers. The memory controller initially calibrates the SLDRAMs on system power-up but may periodically recalibrate during normal system operation to account for high-temperature electrical and timing fluctuations. 
Low-end memory controllers might simply access all memories at the slowest chip's speed, whereas more elaborate controllers can dynamically control different memory regions and allocate available resources among functions according to their performance needs. Regardless of the technique, the chosen level of complexity resides within the controller rather than the memories themselves, which keeps total cost as low as possible. SLDRAM output-buffer simulations indicate performance head room to beyond 1.2 GHz. Clock-distribution schemes between SLDRAM and Direct RDRAM also differ. Both architectures move from a standard one-signal to a two-signal differential clock to cancel out common-mode noise and reduce reliance on threshold-voltage levels. But whereas Direct RDRAM retains the round-robin clock loop, SLDRAM uses a more traditional "tree" distribution scheme. To account for clock skew, SLDRAM also uses the data-strobe concept first seen in DDR, with two bidirectional data clocks. Tabrizi feels that this scheme has fewer clock-circuitry requirements in SLDRAM than in RDRAM, enabling the use of a simpler--or even eliminating the use of--digital PLL in SLDRAM, which might lower cost and improve power management. He also points to the less stringent pc-board-layout requirements of this data-strobe approach in conjunction with the controller-to-memory calibration technique. The SLDRAM scheme may also enable faster response and fewer data-bus "dead" cycles when switching access from one chip to another. Finally, the SLDRAM clock technique also bypasses the potential for Rambus patent-infringement problems. Almost every DRAM manufacturer participates in the SLDRAM Consortium, but to varying degrees: Some contribute only money and a meeting representative, whereas others dedicate small engineering teams to the effort. 
Hyundai and Mitsubishi are creating an SLDRAM conceptual test chip due for completion late this year, consisting primarily of I/O drivers and current-, voltage-, and timing-adjustment circuits; IBM Microelectronics is developing a companion evaluation module and system-board design. In parallel, Mosaid Technologies and Micron are each working on 64-Mbit, full-chip designs, with Mosaid's scheduled for completion mid-1998 and initial silicon coming from a Siemens factory. However, the performance targets for this initial design specify only a 200-MHz clock, resulting in half the bandwidth per 16-bit channel that Intel's 1999 high-end predictions require. SLDRAM supporters are optimistic that faster chips will follow six months later.
Because the Consortium is still resolving SLDRAM architecture details, the amount of memory-design experience that vendors can use, which strongly influences schedule confidence, is unclear. Beyond the fundamental memory-design challenges, SLDRAM Consortium members must also develop functionally and electrically compatible memory controllers across a range of logic processes.
What about data?
Data-focused DRAM's requirements fundamentally differ from code's in a few key areas. Data transfers are often longer than cache-line fills, and the locality of a data reference (the probability that if one data access is in the DRAM's sense-amp cache, the next one will be too) is improved. These characteristics mean that fast sequential access within a long series is important. Additionally, data-access profiles tend to be more balanced in their read and write percentages. Read-modify-write functions are more common for uses such as pixel updates and block-level ECC (which data applications often allow), making fast context switches between reads and writes crucial. However, you cannot ignore random-access latency.
In a graphics application, for example, color, depth, and opaqueness information for a pixel may physically reside in different regions of the frame buffer, and simultaneous rendering and drawing operations may access different areas of the memory. From an economic standpoint, the unit volume that data DRAM represents is significantly smaller than that for code (Table 5). For example, compare the required graphics frame-buffer sizes at various resolutions, color depths, and 2/3-D graphics parameter sets with the amount of main memory shipped in an average PC. Table 5's data is valid if the graphics controller uses direct-mapped pixel values. Look-up-table-based pixel mapping proportionally reduces frame-buffer size but also limits the maximum number of simultaneously displayed colors. Beyond pixel data, graphics controllers sometimes store frequently accessed information--such as fonts, menus, cursors, and texture maps--in the frame buffer, which increases density requirements in the process. Also note that the frame-buffer sizes in Table 5 often do not line up neatly with the granularity options in Table 3. This fact was one of the early motivations for UMA, which conceptually made more efficient use of available memory by combining code and graphics in one large subsystem. Some graphics-card and PC companies provide multiple frame buffers on their high-performance graphics products. These buffers require multiples of the memory densities that Table 5 shows for a parameter set but allow complete and simultaneous rendering of one or more frames while the RAMDAC outputs another frame to the monitor. Although data-DRAM applications may not set the price and volume trends for the overall DRAM market, many have faster than average product development and obsolescence cycles. Many data-DRAM applications also push the state of the art in performance compared with code. (Think of 3-D graphics boards or communications data buffers for an example.) 
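To make the frame-buffer sizing concrete, here is a minimal Python sketch that computes the memory a direct-mapped frame buffer needs and the naive bandwidth implied by the display refresh rate alone. The function names and the example resolution, color depth, and buffer count are illustrative assumptions, not values taken from Table 5.

```python
def frame_buffer_bytes(width, height, bits_per_pixel, buffers=1):
    """Bytes needed for `buffers` direct-mapped frame buffers."""
    bits = width * height * bits_per_pixel * buffers
    return (bits + 7) // 8  # round up to a whole byte


def refresh_bandwidth_bytes_per_s(width, height, bits_per_pixel, refresh_hz):
    """Naive display-refresh bandwidth: one full frame read per refresh.

    This ignores blanking time and all drawing traffic, so it understates
    the real peak demand on the memory.
    """
    return frame_buffer_bytes(width, height, bits_per_pixel) * refresh_hz


# Illustrative case: 1024 x 768 pixels at 16 bits per pixel.
single = frame_buffer_bytes(1024, 768, 16)             # 1,572,864 bytes (1.5 Mbytes)
double = frame_buffer_bytes(1024, 768, 16, buffers=2)  # 3,145,728 bytes (3 Mbytes)
naive_bw = refresh_bandwidth_bytes_per_s(1024, 768, 16, 72)
print(single, double, naive_bw)
```

In this example, a single buffer needs 1.5 Mbytes, a size that falls between common power-of-two memory densities, and double buffering doubles the requirement.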
In converting the density numbers in Table 5 to bandwidth equivalents, remember that the common 72-Hz noninterlaced-display refresh rate includes the time needed to realign the scan beam. Frame-buffer peak bandwidth is as much as 50% higher than the display refresh rate would otherwise indicate. The refresh-rate-based bandwidth number also omits drawing bandwidth, a critical factor when you consider single-port memories.
Lower volumes, higher performance, and faster product development and obsolescence increase an engineer's willingness to consider alternative memory approaches. It's no surprise, then, that many more viable DRAM products exist in data-DRAM applications. Mosel Vitelic, for example, ships EDO DRAMs in 1-, 2-, and 4-Mbit densities and ×4, ×8, and ×16 interface options, with row- and column-access times as fast as 30 and 12 nsec, respectively. IBM Microelectronics, Integrated Silicon Solutions, Silicon Magic, and several other companies also offer fast-EDO DRAMs.
MoSys supplies the Multibank DRAM (MDRAM) architecture, which offers 32 internal DRAM banks, each with a corresponding 128-byte sense-amp array, per megabyte. MDRAM was one of the first architectures with such advanced features as a DDR data bus, low-voltage-swing outputs, and an embedded PLL. A large number of smaller blocks give density flexibility that's valuable in frame-buffer designs; MoSys offers parts in 0.5-, 0.75-, 1-, 1.25-, and 2-Mbyte densities. MDRAM operates as fast as 166 MHz with a dual-edge clock and has row- and column-access times of 32 and 14 nsec, respectively. MDRAM foundry sources include Integrated Device Technology (IDT), Oki, and Siemens. Unlike EDRAM, MDRAM's proprietary interface and protocol require custom memory-controller designs. MoSys recently announced synchronous-graphics RAMs (SGRAMs) based on the MDRAM architecture, as well as MCache L2 SRAM cache-replacement chips.
IDT also plans to use fast DRAM to replace SRAM with its Fusion architecture in some applications. Finally, MoSys is developing an advanced MDRAM-based Rambus-interface memory.
SGRAM is a multisourced variation of SDRAM with added block-write and write-per-bit (masked-write) functions. Because of lower granularity requirements of frame-buffer memory, an 8-Mbit density with a ×32 interface is the most common SGRAM configuration. However, because the number of chips in graphics is less than the number of chips in main memory and the connection between the chips and the memory controller is point-to-point, the interface can run as fast as 133 MHz, and DDR enhancements are due next year. SGRAM tends to be more expensive than SDRAM at a given density because of lower comparative product volumes, the limited number of vendors, higher performance requirements, and the added silicon cost of the wider interface and graphics-optimized functions. However, many analysts and vendors predict that SGRAM will shortly become the highest-volume frame-buffer memory option for PCs, surpassing EDO. Versions of 16-Mbit SGRAM should also appear in the market this year.
Base, Concurrent, and Direct RDRAMs can also act as high-performance data memories, although (like EDO DRAM, SDRAM, and standard MDRAM) they may not contain a full suite of graphics-optimized commands. Graphics vendors have varying opinions on the value of these functions, especially when considering the added cost that the functions incur. (Block write is especially silicon-intensive.) Because data transfers are generally longer than code bursts, the data transfers naturally increase the efficiency of the data bus. Therefore, Direct RDRAM's separate control bus may be unnecessary to achieve reasonable sustained data bandwidth.
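The reasoning about transfer length and bus efficiency can be sketched with a simple model. The Python below assumes a shared bus on which every transfer pays a fixed number of command-and-address cycles before its data cycles; the function names, the four-cycle overhead, and the burst lengths are hypothetical and chosen only to show the trend, not to represent any particular RDRAM generation.

```python
def bus_efficiency(data_cycles, overhead_cycles):
    """Fraction of bus cycles that carry data for one transfer."""
    if data_cycles <= 0 or overhead_cycles < 0:
        raise ValueError("data_cycles must be positive, overhead_cycles non-negative")
    return data_cycles / (data_cycles + overhead_cycles)


def sustained_bandwidth(peak, data_cycles, overhead_cycles):
    """Scale a peak bandwidth figure (any unit) by the bus efficiency."""
    return peak * bus_efficiency(data_cycles, overhead_cycles)


# Assumed overhead of 4 cycles per transfer on a shared (multiplexed) bus.
for burst in (4, 8, 32, 128):
    print(burst, round(bus_efficiency(burst, 4), 3))
```

With a short, code-like burst of four data cycles, half of the bus time in this model goes to overhead; as bursts lengthen, efficiency approaches 1, and moving the overhead onto a separate control bus buys less and less.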
Microsoft (Redmond, WA) plans to support RDRAM in its upcoming Talisman architecture, and additional graphics-controller support comes from Chromatic Research (Sunnyvale, CA) and Cirrus Logic (Fremont, CA). The Nintendo 64 (Kyoto, Japan) video game uses two RDRAMs as a UMA for all code, graphics, audio, and miscellaneous data functions, and Gateway 2000 (North Sioux City, SD) and Micron use RDRAM in their DVD-equipped PCs. Mitsubishi offers cache DRAM (CDRAM), which, as the name implies, integrates a full-featured 16-kbit SRAM cache, cache controller, and 4- or 16-Mbit DRAM array on the same die, with separate external address and control buses for each memory and a 128-bit internal datapath between the buses. The external interface is ×16, and the 16-Mbit version supports a burst mode. A future revision of the 4-Mbit version will also add burst mode. Dual-port video RAM (VRAM) originally serviced the ultrahigh-end graphics market, such as workstations and video arcades. Today's dual-port alternatives include Samsung's window RAM (WRAM) and several other Mitsubishi devices. 3D-RAM, which is the result of Mitsubishi's and Sun Microsystems' (Palo Alto, CA) collaboration and is alternate-sourced by S-MOS Systems (San Jose, CA), combines 10 Mbits of four-bank DRAM, a 2-kbit SRAM cache (with a 256-bit interface between the DRAM and SRAM), a graphics-optimized ALU, and various buffers. On-chip arithmetic functions reduce external read-modify-write traffic to mostly writes, simplify controller design, and boost overall rendering performance. The 3DPro chip set combines 3D-RAM for pixels, CDRAM for texture maps, and a PCI-based graphics controller. The upcoming 16-Mbit, dual-port-graphics RAM (DGRAM), another extrapolation of the CDRAM concept, will provide four DRAM banks, a triple-port SRAM, and 143-MHz performance from each of two 16-bit external buses. When evaluating memory alternatives for data use, keep in mind the "2-N rule." 
Most, but not all, multibank DRAMs require you to switch between internal banks for highest performance. For example, for a two-bank DRAM, sequential accesses to addresses 0, 1, 2, and 3, which use internal interleaving and pipelining, complete more quickly than a sequence of reads or writes to the same or consecutive even or odd locations. Because linear and interleaved burst sequences toggle the lowest order address or addresses, the 2-N rule is typically not a problem for code fetches. Some DRAMs with a true SRAM cache on board can operate under the "1-N rule" with no performance restrictions, regardless of address sequence.
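The bank-switching effect described above can be illustrated with a toy timing model in Python. It assumes that the lowest order address bit selects the bank, that the controller can issue one access per cycle, and that each bank stays busy for two cycles after an access starts; the function name and all timing values are hypothetical and do not come from any data sheet.

```python
def access_cycles(addresses, banks=2, bank_busy=2):
    """Return the cycle at which the last access in `addresses` completes.

    Toy model: one access may issue per cycle, but an access to a bank
    must wait until that bank's previous access has finished.
    """
    bank_free = [0] * banks   # cycle at which each bank is next available
    next_issue = 0            # earliest cycle the controller can issue again
    finish = 0
    for addr in addresses:
        bank = addr % banks   # lowest order address bits select the bank
        start = max(next_issue, bank_free[bank])
        bank_free[bank] = start + bank_busy
        next_issue = start + 1
        finish = max(finish, bank_free[bank])
    return finish


print(access_cycles([0, 1, 2, 3]))  # alternates banks, so accesses overlap
print(access_cycles([0, 2, 4, 6]))  # even addresses only, so one bank serializes
```

In this model, the alternating sequence finishes in five cycles and the all-even sequence in eight, because every access in the second sequence waits for the same bank to become free.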
Acknowledgments
I'd like to recognize the contributions of Terry Lee at Micron Technology, Bob Fusco at Oki Semiconductor, Billy Garrett at Rambus, and Kevin Patrick at Mosaid Technologies. A special thanks also goes to Steven Przybylski of The Verdande Group for writing a box and contributing reference information.