DDR3 SDRAM exposed: Inside a bleeding-edge, blazing-fast memory device
By Randy Torrance, Chipworks - November 3, 2008
For this second installment of The IC Insider we decided to really stretch our wings. We normally expect to devote this column to innovative blocks of analog circuitry from devices in the power-management, MEMS, RF, and consumer-SOC markets—things that can be explained well in a Web environment. Today, however, we are going to go right down to an advanced 68-nm, 1-Gbit memory device, and talk a little about how it achieves the blazing speeds required in today's applications.
For at least two years now, DRAM manufacturers have been touting the advantages of DDR3 SDRAMs over the previous generation of DDR2 memories. These benefits include a lower operating voltage of 1.5V versus 1.8V, a power-consumption reduction of up to 30%, and the cost advantages inherent in using more advanced technologies. The primary benefit mentioned, of course, is the higher data-transfer rates. While DDR2's maximum transfer rate is just 800 Mbps, DDR3 is specified up to 1600 Mbps.
For the last couple of years DDR3 has held out a lot of promise, but the realization of that potential has been slow in coming. The first DDR3 chips on the market used older 90-nm technologies and were not available in the higher speed grades. Hence DDR2 SDRAMs have maintained their market dominance. But this is now changing as DDR3 chips are showing up in more advanced technologies and higher speed grades. A good example is the latest part from the world's No. 1 DRAM supplier, Samsung: the K4B1G0846D-HCF8. Chipworks has just completed an analysis of this chip.
The K4B1G0846D-HCF8 is a 1-Gbit, high-speed CMOS, third-generation DDR3 SDRAM, internally configured as 16 Mbits × 8 I/Os × 8 banks. The device uses an 8n-prefetch architecture to achieve high-speed operation, transferring two data words per clock cycle at the I/O pins—one on each clock edge. A single read or write access consists of a single 8n-bit-wide, four-clock data transfer at the internal DRAM core and eight corresponding n-bit-wide, one-half-clock-cycle data transfers at the I/O pins. This version of the device achieves a data rate of 1066 Mbps on each I/O pin.
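A quick back-of-the-envelope sketch shows why the 8n prefetch matters: it lets a comparatively slow DRAM core keep a 1066-Mbps pin busy. (The 533-MHz clock figure below follows from this part's 1066-Mbps speed grade; the arithmetic is ours, not Samsung's.)

```python
# Back-of-the-envelope check of the 8n-prefetch arithmetic.
io_clock_mhz = 533        # external I/O clock for the 1066-Mbps grade
edges_per_clock = 2       # DDR: one word transferred on each clock edge
prefetch = 8              # 8n prefetch: eight words per core access

# Per-pin data rate: one n-bit word per clock edge.
data_rate_mbps = io_clock_mhz * edges_per_clock          # 1066

# Eight words are consumed in eight half-cycles (four clocks), so the
# core only has to complete accesses at one-eighth of the pin rate.
core_access_rate_mhz = data_rate_mbps / prefetch         # 133.25

print(data_rate_mbps, core_access_rate_mhz)
```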
Samsung manufactures the K4B1G0846D-HCF8 in a four-metal, single-poly, 68-nm CMOS process, and mounts it in an 82-ball FBGA package. This is the most advanced DDR3 process and design we have seen.
What makes this device particularly interesting is the impressive organization of the layout. Not a micron of silicon is wasted in this tightly packed 68-nm design (Figure 1).
We performed SEM (scanning electron microscope) imaging of the spine of the device, a total of 13 mm2. The memory array surrounding the spine is highly repetitive, so only the edge of the array along the spine required imaging. Even so, the task required 19,000 images of each of the four layers (76,000 total images) at the magnification required for device and interconnect extraction. Next, we stitched together and aligned these images, both horizontally and vertically, and used software to compensate for any drift in the microscope. All this is a long way from when reverse engineering was possible by arranging and taping together large photographs on the lab floor.
As we know, the data-transfer speed is the greatest factor differentiating DDR3 from DDR2. The impressive 1066-Mbps transfer rate requires some very advanced circuit design. The data receiver and transmitter are both examples of these critical circuits. Possibly even more demanding are the circuits used to synchronize the incoming and internal clocks, the DLL (delay-locked loop), and the DCC (duty-cycle correction). Chipworks has extracted all of the circuits associated with the data I/O and the clock synchronization. The following sections talk about a few of the interesting blocks in these circuits.
Delay-locked loops first showed up on DRAMs in the early 1990s, when the first SDRAMs appeared. For the first time ever, DRAMs were using a clock, and needed to synchronize their data with the outside world. A full PLL (phase-locked loop) was not needed because no clock multiplication was required, just clock synchronization and phase adjustment. The idea was that DLLs would use mainly simple digital logic to capture the incoming clock and create all the required internal clocks. Hence, DLLs would be much simpler than PLLs. Our analysis of this chip shows that DLLs aren't necessarily simple anymore.
Samsung's DLL is an impressive design. When we completed the extraction and organization, we realized that we'd created 60 independent hierarchical schematics. Samsung must have put a huge amount of work into designing this large DLL system. It certainly does take up quite a bit of real estate. We found the expected fixed-delay lines, adjustable-delay lines, clock selectors, clock splitters, phase detectors, and duty-cycle-correction logic. But on top of this, we uncovered blocks for mimicking I/O buffer delays, conversions from CML to CMOS and back, initialization circuits, and lots of load-matching circuitry and control. Here we will take a closer look at a few of the interesting circuits.
The external clock applied to the pins CK/CKN is input to an adjustable delay line. The delay line consists of three sections: a four-stage fixed delay line, a 16-stage variable (tapped) delay line, and a fine delay mixer. The incoming external clock is first propagated through the fixed delay line (Figure 2), where it can be optionally inverted by means of the control signal INVERTCK.
At clock speeds up to 533 MHz, this fixed-delay line needs to be implemented in some logic family faster than standard CMOS, and Samsung has chosen CML (current-mode logic). As would be expected, the channel lengths in this circuit are nowhere near the minimum allowed by this 68-nm technology. Rather, the differential pairs use a length of 220 nm, and the current-source transistor uses a length of 360 nm, allowing for much better matching.
Figure 3 shows the implementation of a portion of this circuit on the substrate level of the die. The substrate is stained and optically imaged to show the diffusions, which are not packed as tightly as in the other layers.
Optical imaging does not provide high enough magnification to see the other layers; SEM is required. Four of the diffusion resistors are highlighted by white rectangles. As you can see, these are continuous resistors stretching off the left edge of the image.
However, on the poly layer, as shown in Figure 4, the contacts connecting to M1 are clearly visible.
Now it's obvious that these are metal-programmable resistors. This approach allows easy resistor matching. In addition, this layout methodology allowed Samsung to modify these resistors' values late in the design cycle.
From this block, the clock passes to a multistage tapped delay line. One stage of this delay line is shown in Figure 5.
As you can see, this block is also implemented in CML to meet its speed constraints. Note the small capacitors, possibly used for fine delay control. A control-shift register activates and deactivates the stages and corresponding taps of this delay line. The selected number of consecutive stages and the two adjacent taps are active at the same time. Immediately after the DLL resets and initializes, four stages are active. If this delay is not sufficient, the control register activates more stages as required.
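The stage-activation scheme can be sketched behaviorally. In the model below, a shift register enables N consecutive coarse stages, starting from four after reset and adding stages until the line covers the required delay. The per-stage and fixed-line delay figures are our own assumptions for illustration, not values measured from this die.

```python
# Hypothetical behavioral model of the coarse tapped delay line.
FIXED_DELAY_PS = 120      # four-stage fixed delay line (assumed value)
STAGE_DELAY_PS = 50       # delay of one CML stage (assumed value)
MAX_STAGES = 16           # 16-stage variable line, per the description

def coarse_delay(active_stages: int) -> float:
    """Delay through the fixed line plus N active coarse stages."""
    if not 4 <= active_stages <= MAX_STAGES:
        raise ValueError("between 4 and 16 stages may be active")
    return FIXED_DELAY_PS + active_stages * STAGE_DELAY_PS

def stages_needed(target_ps: float) -> int:
    """Emulate the control register: enable stages until the delay
    reaches the target, starting from the four active after reset."""
    n = 4
    while coarse_delay(n) < target_ps and n < MAX_STAGES:
        n += 1
    return n

print(stages_needed(600.0))   # 10 stages: 120 + 10 * 50 = 620 ps
```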
The design aggregates the clock signals from the two adjacent taps at the output of the fine delay mixer. The resulting clock signal then goes to a CML-to-CMOS converter, where it is combined with the duty-cycle-correction signals, amplified, and latched at CMOS levels.
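The fine delay mixer's blending of the two adjacent taps amounts to a phase interpolation, which can be modeled as a weighted sum. The tap spacing and mixer resolution below are assumed round numbers, chosen only to make the arithmetic concrete.

```python
# Behavioral sketch of the fine delay mixer: the output edge is a
# weighted blend of two adjacent taps. Values are illustrative.
TAP_SPACING_PS = 50       # delay between adjacent taps (assumed)
MIXER_STEPS = 16          # fine-mixer resolution (assumed)

def mixed_delay(tap_delay_ps: float, code: int) -> float:
    """Interpolate between tap N (at tap_delay_ps) and tap N+1."""
    if not 0 <= code < MIXER_STEPS:
        raise ValueError("mixer code out of range")
    weight = code / MIXER_STEPS
    return tap_delay_ps + weight * TAP_SPACING_PS

# Sweeping the code walks the clock edge smoothly across one tap spacing:
print([mixed_delay(200.0, c) for c in (0, 8, 15)])
# [200.0, 225.0, 246.875]
```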
The data transmitter has some difficult specifications to meet, not the least of which is the reliable transmission of data on both edges of a clock at 1066 Mbps per pin. One of the interesting features of this transmitter is the programmable impedance circuitry. We found a wide range of impedance and signal-edge programmability on the output driver. Figure 6 shows just one of many programmable-impedance circuits used in the transmitter.
As the figure shows, Samsung has used a binary-weighted, parallel-resistor-selection scheme. The output of this schematic is then combined with multiple other programmable-impedance and edge circuits for both the pull-up and the pull-down drivers. Taken together, these give an excellent range of impedance and edge control that could be used to match package and board impedance and optimize the output waveform. This may also allow Samsung to reuse this design for other speed variants—perhaps all the way up to 1600 Mbps some day.
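As a rough illustration of how a binary-weighted parallel-resistor network behaves, the sketch below computes the effective impedance for a given control code. The leg resistances and code width are invented for illustration; Chipworks did not publish the actual values.

```python
# Sketch of a binary-weighted parallel-resistor impedance network like
# the one in Figure 6 (leg values assumed, not measured from this part).
R_UNIT = 480.0   # ohms, resistance of the least-conductive leg (assumed)
N_LEGS = 4       # 4-bit binary-weighted control code (assumed)

def output_impedance(code: int) -> float:
    """Effective impedance for a binary control code; bit i switches
    in a leg of R_UNIT / 2**i ohms."""
    if not 1 <= code < 2 ** N_LEGS:
        raise ValueError("at least one leg must be enabled")
    # Parallel combination: conductances of the enabled legs add,
    # so the total conductance is simply code / R_UNIT.
    g = sum((2 ** i) / R_UNIT for i in range(N_LEGS) if code & (1 << i))
    return 1.0 / g

print(round(output_impedance(12), 2))   # 40.0 ohms (480 / 12)
```

Because the conductance scales linearly with the code, the driver can step its impedance in fine, uniform increments—handy for matching a range of package and board environments.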