It came as no surprise when Intel and most of the PC industry recently announced the progression from a 66- to a 100-MHz main-memory bus--an increase in throughput from 528 to 800 Mbytes/sec--to fulfill the needs of skyrocketing CPU frequencies. Traditional benchmarks demonstrate the benefits systems derive from using a faster system bus, but what happens inside the computer system? EDN set out to find out how a system behaves internally while running real-world applications. This information will help you realize the head room available for future applications. Attaching Hewlett-Packard's logic analyzers and setting up clever triggers, EDN's hands-on "surgical team" garnered information on the internal workings of a computer system.
During blizzard conditions, the team arrived at the "hospital"--Hewlett-Packard's facility in Colorado Springs, CO. Everyone was tense as the patients were wheeled into the operating room. Patient No. 1 was an Intel SE440BX motherboard; its heart, a 400-MHz Pentium II and an Intel 440BX chip set. Patient No. 2 was a Microstar MS5169 motherboard with a pulsating 300-MHz AMD K6-2 (previously the K6-3D) and Acer Labs Inc's (ALI's) Aladdin V chip set. Each system was loaded with 128 Mbytes of Hitachi's $250 PC-100 synchronous DRAM (SDRAM), giving the patients plenty of breathing capacity. They were about to undergo a week of intensive probing as the surgeons tapped into their main arteries--the main-memory bus, the Accelerated Graphics Port (AGP), and the PCI bus (Figure 1).
Starting with the Intel SE440BX motherboard, we used a variety of processors, including a $722, 400-MHz Pentium II (P2) with a 100-MHz front-side bus, a $305, 300-MHz P2 with a 66-MHz bus, and a $159, 300-MHz Celeron. (The 440BX chip set that this motherboard uses supports any Slot 1 CPU with a 66- or 100-MHz front-side bus.) The 400-MHz P2, Intel's first on 2V, 0.25-µm technology, employs new cache technology that allows it to run faster and with less power. Compared with the 0.35-µm P2, which consumes a maximum of 43W at 300 MHz, the new P2 consumes only 28W at 400 MHz! Although the 400-MHz P2's front-side bus runs at 100 MHz, a switch setting in BIOS also allowed us to run the processor at 300 MHz and maintain a 100-MHz bus speed.
For memory and peripherals, we used 128 Mbytes of PC-100 SDRAM; Real 3D's $129 StarFighter AGP graphics card with 4 Mbytes of synchronous graphics RAM (SGRAM); Adaptec's $423 AHA-3940UW PCI multichannel UltraSCSI host adapter; Quantum's $895, 9.1-Gbyte Viking II hard drive; Hitachi's GD-2000 digital-versatile-disk (DVD)-ROM drive; and PC Power and Cooling's 300 ATX power supply. We also added a reference board from Yamaha that contained the company's $9 YMF724 audio controller, along with Microsoft's USB mouse. The YMF724 has built-in OPL3, FM synthesis, a 64-voice XG wavetable, a digital mixer, and a PCI-bus interface. During our testing, we also used ATI's $279, 4-Mbyte All-in-Wonder Pro to replace the StarFighter card.
We used the same basic configuration for the Microstar MS5169 board, except here we installed AMD's 300-MHz K6-2 and an ATI XPERT@Play AGP graphics card. Microstar based the board on ALI's Aladdin V chip set, which supports both AGP and a 100-MHz main-memory bus. AMD calls this bus configuration "Super7." Although the bus configuration of the AMD K6-2 is still Socket 7-compatible, the processor should display significant performance improvements because the L2 cache speed is directly linked to the speed of the front-side bus. The K6-2 contains a 32-kbyte instruction cache and 32-kbyte data cache. These larger caches (compared with 16 kbytes for the K6's caches), in conjunction with the faster bus, should help extend the market life of Socket 7.
Unless otherwise indicated, we used the StarFighter graphics card with all Intel configurations.
For the operating system, we used Microsoft's Windows 98 (the RC2 version). This OS has some nice improvements over Win95 that make it more user-friendly to install. For one thing, you don't have to restart the system 20 times during installation. In addition, Win98 inherently supports AGP, UltraDMA, and USB. One of AGP's benefits is its ability to use graphics textures directly from main memory; in contrast, the PCI model must copy textures to the graphics controller's local memory before use. Unlike Win98, Win95 lacks direct support of AGP texture memory, so you have to load the VGARTD.VXD driver as a patch to the dynamic memory manager in Windows 95. With Win95, UltraDMA support also requires a special driver.
Before Win98, USB support was a nightmare (Reference 1). First, Win95 lacks support for USB, instead requiring you to use OSR2.1; then, USB works only with correct drivers. So, you can imagine my exhilaration when I installed Win98 and was able to automatically use the USB mouse.
Once we had the computer systems running, the next step was to hook up the measurement equipment, HP's 16700A Series logic-analysis system. For this portion of the procedure, HP's Craig Kirkpatrick, field applications engineer, and Joel Birch, technical marketer, were invaluable resources (see sidebar "The surgeons"). The $9900 16700A supports cross-domain debugging that allowed us to trigger from multiple computer buses. The analyzer, with its Windows 95-like graphical user interface (GUI), can simultaneously display multiple time-correlated views on screen. The analyzer frame holds as many as five analysis cards, and each card supports as many as 64 2-Mbit-sample-deep channels. For the AGP 2X portion of the data capture, we also used the new, high-speed HP 16557D logic-analysis card. This card supports 68 channels, 2-Mbit memory depth per channel, 135-MHz state analysis, and 500-MHz timing analysis. You can combine three of these cards, at $14,500 apiece, to form a 204-channel analyzer.
FuturePlus Systems provided the preprocessors to make the physical and electrical connection between our HP test system and the target PC platform. These preprocessors included the $1000 FS2220 AGP probe and interposer card, the $2000 FS2000 PCI-bus-analysis probes, and the $500 FS2320 SDRAM DIMM bus probe and extender card. The AGP probe provides accurate timing of signals as fast as 500 MHz. To use this probe, Kirkpatrick inserted it into the motherboard's AGP socket, installed the AGP card in the socket on the probe, and attached the HP logic-analyzer pods using two $345 HP E5346A high-density termination adapters. The mechanical design of the FS2220 is awkward: When we plugged it into the AGP socket on an ATX form-factor motherboard, it overhung two of the PCI slots. This overhang limited the number of PCI cards we could install during our testing.
The FS2000 acted as an extender for a PCI card and provided an electrical and mechanical interface to HP's 16700A for passive PCI-bus monitoring. The FS2000 comes with bus-disassembler software that decodes the PCI-bus signals and presents a readable display that lists the transaction type, address, data, and status conditions. Although the FS2000 doesn't give the quality of postprocessing that HP's E2925A does, the FS2000 proved more useful for our analysis. Using the FS2000, Kirkpatrick and Birch coordinated the PCI data capture from the logic-analyzer trigger. The E2925A is a PCI-specific logic analyzer that provides continuous counters that measure bus usage, efficiency, throughput, or retry rate.
The FS2320 DIMM analysis probe is both a 72-bit, 168-pin, error-correcting code, unbuffered SDRAM DIMM bus probe and an extender card. It provides accurate state capture for bus transactions as fast as 500 MHz. We installed the FS2320 into one of the motherboard's 168-pin DIMM slots, then installed one of the Hitachi HB52E168EN-B6 128-Mbyte DIMMs into the socket on top of the DIMM probe. This approach allowed us to measure the data throughput in and out of this DIMM only, which is why we used only one 128-Mbyte SDRAM DIMM in our test configuration.
Our goal in this project was to determine the sustained data throughput or usage on each of the PC buses while we ran through a series of real-world applications. This goal meant that we'd have to capture bus transactions for relatively long periods. For example, running the Tomb Raider II 3-D game's demonstration took more than a minute. With bus-speed frequencies of 100 MHz and the logic analyzer attempting to capture every transaction, the analyzer card's memory would have filled up in fractions of a second. Birch suggested a major improvement: Have the analyzer capture only those cycles that occurred when the bus was not idle. This bought back a large portion of the memory space but still didn't allow us to acquire the amount of data we needed. (For related information, also see sidebar "Design considerations for 100-MHz systems.")
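The memory-depth problem described above is easy to check with rough arithmetic. The following Python sketch is only an estimate under stated assumptions: it reads "2-Mbit-sample-deep" as 2 Mi samples per channel, assumes one stored state per 100-MHz bus clock, and uses a purely hypothetical 10% bus-busy fraction to illustrate the non-idle-only capture.

```python
# Rough estimate of how quickly a logic-analyzer card's memory fills up.
SAMPLE_DEPTH = 2 * 1024 * 1024  # assumption: "2-Mbit-sample-deep" taken as 2 Mi samples
BUS_CLOCK_HZ = 100e6            # 100-MHz bus, one stored state per clock (assumption)


def fill_time_seconds(depth_samples, store_rate_hz):
    """Seconds until the capture memory is full at a constant store rate."""
    return depth_samples / store_rate_hz


def fill_time_when_filtering(depth_samples, bus_clock_hz, busy_fraction):
    """Same estimate when the analyzer stores only non-idle cycles."""
    return fill_time_seconds(depth_samples, bus_clock_hz * busy_fraction)


every_cycle = fill_time_seconds(SAMPLE_DEPTH, BUS_CLOCK_HZ)
# busy_fraction=0.10 is a hypothetical value, not a measurement from the article.
non_idle_only = fill_time_when_filtering(SAMPLE_DEPTH, BUS_CLOCK_HZ, 0.10)

print(f"store every cycle:          {every_cycle * 1e3:.1f} msec")
print(f"store non-idle cycles only: {non_idle_only * 1e3:.1f} msec")
```

Under these assumptions, storing every cycle fills the memory in roughly 21 msec, and even a tenfold reduction leaves far less than the minute or more that a game demo needs.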
We realized that to determine the sustained data throughput, we could set up the logic analyzer to increment one of its 32-bit counters every time the bus signals met a particular trigger condition. Although the counter could handle more than 4 billion trigger conditions, we used the analyzer's ability to periodically dump the counter value into the analyzer's memory. We used a 100-msec period because it provided fine enough granularity without generating an excessive amount of data. A variation on our methodology would have been to shorten the period; the shorter the time period, the closer the data in the counter would be to instantaneous data throughput. However, the analyzer card's memory would fill up sooner, minimizing the amount of time that you could run the application.
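The trade-off between the dump period and the run time can be sketched numerically. This Python example assumes that the counter can increment at most once per 100-MHz bus clock and that each periodic dump occupies one sample of a 2-Mi-sample memory; both are assumptions for illustration, not details taken from the analyzer's documentation.

```python
COUNTER_BITS = 32
BUS_CLOCK_HZ = 100e6            # assumption: at most one count per bus clock
SAMPLE_DEPTH = 2 * 1024 * 1024  # assumption: one stored sample per counter dump


def max_counts_per_interval(bus_clock_hz, interval_s):
    """Largest count the counter could reach between two dumps."""
    return int(round(bus_clock_hz * interval_s))


def counter_overflows(bus_clock_hz, interval_s, counter_bits=COUNTER_BITS):
    """True if the counter could wrap before it is dumped and reset."""
    return max_counts_per_interval(bus_clock_hz, interval_s) >= 2 ** counter_bits


def capture_duration_s(depth_samples, interval_s):
    """How long the application can run before the dump memory is full."""
    return depth_samples * interval_s


for interval_s in (0.1, 0.01, 0.001):
    print(
        f"period {interval_s * 1e3:6.1f} msec: "
        f"max count {max_counts_per_interval(BUS_CLOCK_HZ, interval_s):>10,}, "
        f"overflow possible: {counter_overflows(BUS_CLOCK_HZ, interval_s)}, "
        f"run time {capture_duration_s(SAMPLE_DEPTH, interval_s):,.0f} sec"
    )
```

Shortening the period by a factor of 10 cuts the available run time by the same factor, which is the trade-off the paragraph above describes.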
The logic-analyzer trigger conditions depended on the bus that we were monitoring. For example, on AGP, we wanted to see the amount of data throughput associated with the AGP 2X mode. Graphics controllers on AGP cards typically use this mode to transfer texture maps. According to the AGP specification (http://developer.intel.com/technology/agp), the two address/data bus-strobe signals (AD_STB0 and 1) qualify data on the AD bus and are relevant only for 2X transfers. The 16700A's GUI and FuturePlus' FS2220 software made it easy to interpret the AGP signal names and set up the trigger point. In logic-analyzer terminology, Kirkpatrick set up the trigger to store only AGP strobes without storing any state information (Figure 2). The resultant output was a listing that showed the number of AGP strobe state counts in 100-msec intervals. Each strobe count represents 8 bytes transferred on average. If you want to get more detailed AGP state information, remove the "no-state" command, and you end up with a listing that shows precisely how often and how long it takes to complete AGP 2X bursting transfers.
Triggering on the main-memory bus for DIMM accesses was a bit more complicated than triggering on the AGP. Here, we looked for overall bus usage; this search included all reads and writes. Will Morris, senior technical marketing engineer at Intel, concluded that the DIMM-access trigger would involve any of the SDRAM DIMM's four chip-select inputs (S0 to S3) and the column-address-strobe (CAS) and row-address-strobe (RAS) signals. The system asserts any of the chip selects low to gate input or output from the DIMM, and, when CAS is simultaneously low, this signal combination indicates that the processor (or some other bus agent) is performing a cache line read or write. In the logic-analyzer screen for this setup, the macro sequence shows that the analyzer stores no states until a trigger condition occurs but then only after waiting for a timer reset delay (Figure 3). Then, every 100 msec, the analyzer stores the count value in its memory and resets the count. This setup offers somewhat limited accuracy because it combines both reads and writes. In retrospect, we could have differentiated between DIMM reads and writes by also adding the write-enable signal (W) as a qualifier. However, during the applications we ran, reads dominated the DIMM accesses.
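The DIMM-access qualifier can be expressed as a small predicate. The Python sketch below models each signal as 0 (low) or 1 (high) and follows the article's description: any chip select low together with CAS low. The read/write split uses the write-enable signal with the usual SDRAM convention that W low during a column command means a write; that polarity is an assumption here, because the article does not spell it out, and the sketch ignores RAS. The sample list is invented for illustration.

```python
def is_dimm_access(chip_selects_n, cas_n):
    """True when any chip select (S0 to S3) is low while CAS is low."""
    return cas_n == 0 and any(level == 0 for level in chip_selects_n)


def classify_access(chip_selects_n, cas_n, we_n):
    """Return 'read', 'write', or None for cycles that do not qualify.

    Assumption: W low during a qualified cycle indicates a write.
    """
    if not is_dimm_access(chip_selects_n, cas_n):
        return None
    return "write" if we_n == 0 else "read"


def count_accesses(samples):
    """Tally qualified reads and writes in a list of captured samples."""
    counts = {"read": 0, "write": 0}
    for chip_selects_n, cas_n, we_n in samples:
        kind = classify_access(chip_selects_n, cas_n, we_n)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical samples: ((S0, S1, S2, S3), CAS, W)
samples = [
    ((0, 1, 1, 1), 0, 1),  # S0 low, CAS low, W high: counted as a read
    ((1, 1, 1, 1), 0, 1),  # no chip select low: ignored
    ((1, 0, 1, 1), 0, 0),  # S1 low, CAS low, W low: counted as a write
    ((0, 1, 1, 1), 1, 1),  # CAS high: ignored
]
print(count_accesses(samples))
```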
We configured the trigger for PCI in a similar manner. According to the PCI specification, the bus is idle whenever the Frame and Initiator Ready (IRDY) signals are both high. So, by defining the trigger as NOT (Frame AND IRDY), we could store the count of all nonidle states and multiply this number by four (the width of the PCI bus in bytes) to obtain the number of bytes transferred.
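Converting the stored counts into sustained throughput is a one-line calculation. The Python sketch below uses the figures given in the text (4 bytes per nonidle PCI state, 8 bytes per AGP strobe count, 100-msec intervals) and treats 1 Mbyte as 10^6 bytes, which is an assumption. The counts in the example are hypothetical, not captured data.

```python
INTERVAL_S = 0.1                  # counter dump period used in the article
BYTES_PER_PCI_STATE = 4           # width of the PCI bus in bytes
BYTES_PER_AGP_STROBE_COUNT = 8    # average bytes per AGP 2X strobe count


def pci_is_idle(frame_n, irdy_n):
    """The PCI bus is idle when Frame and IRDY are both high (1)."""
    return frame_n == 1 and irdy_n == 1


def throughput_mbytes_per_sec(count, bytes_per_count, interval_s=INTERVAL_S):
    """Convert a per-interval count to Mbytes/sec (1 Mbyte = 10**6 bytes here)."""
    return count * bytes_per_count / interval_s / 1e6


# Hypothetical per-interval counts, for illustration only.
pci_rate = throughput_mbytes_per_sec(900_000, BYTES_PER_PCI_STATE)
agp_rate = throughput_mbytes_per_sec(1_000_000, BYTES_PER_AGP_STROBE_COUNT)
print(f"PCI: {pci_rate:.1f} Mbytes/sec, AGP: {agp_rate:.1f} Mbytes/sec")
```

Because every nonidle state counts as a full 4-byte transfer, this method gives an upper bound on PCI payload; address phases and wait states are included in the count.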
The types of real-world applications that we ran on our test systems divide into several categories, each stressing the systems in different ways. Three-dimensional games, such as F1 Racing Simulation from Ubisoft (www.ubisoft.com), Incoming from Rage (www.rage.co.uk), and Tomb Raider II from Eidos (www.eidos.com) make up one category with much texture action. Adobe's (www.adobe.com) Photoshop 4.0 and Premiere 4.2 and Intel Indeo, productivity tools, are CPU-intensive. DVD, a category unto itself, produced some interesting results as we ran through a variety of implementations.
The important parameters for 3-D games are real-time response and image quality, both of which can consume lots of CPU horsepower and system bandwidth. To study how many system resources the F1 Racing Simulation, Incoming, and Tomb Raider II 3-D games consume, we examined the main-memory (DIMM) bus and AGP throughput while varying the CPU and main-memory-bus frequencies.
We discovered that comparing the Intel and AMD processors was an apples-to-oranges comparison. The DIMM-bus throughput results from the AMD K6-2 were quite low (Figure 4). The reason for this low figure is that the ATI graphics controller, which we used for the Microstar system, uses its on-chip cache and local memory for textures; therefore, the amount of data throughput associated with textures is small (reflected on both AGP and DIMM). By applying a "texture fudge factor" to the K6-2 results, I determined that the Slot 1 and Socket 7 systems had comparable results. I created the fudge factor by adding the expected AGP throughput to the DIMM throughput of the K6-2. Although this analysis indicates that the results of the data throughput are in the same ballpark, I couldn't make an accurate technical assessment on the Socket 7's bus behavior based on this information.
Aside from the AMD/graphics-card inconsistency, the data-throughput results were as expected. Tomb Raider II, with the visibly apparent best texturing of the three games, had a sustained AGP bandwidth of 30 to 125 Mbytes/sec, regardless of the CPU or DIMM-bus frequency. And, as expected, the DIMM bus almost linearly tracked AGP, at 30 to 130 Mbytes/sec with a few peaks that hit 170 Mbytes/sec. However, the DIMM-bus results from using Celeron were 38 to 160 Mbytes/sec with a few peaks higher than 175 Mbytes/sec. This result indicates that Celeron's lack of a Level 2 cache puts more demand on main memory, but the 66-MHz bus still has plenty of head room.
Rage's Incoming produced results that were similar to those for Tomb Raider II. Although the amount of AGP-texture traffic was only 30 to 60 Mbytes/sec, all Intel-processor configurations produced results within a few percentage points of each other. Celeron is the only exception, with 5 to 10% lower AGP throughput. The results on the DIMM bus demonstrated that Incoming's data demands do not depend on processor or bus speed, except when the L2 cache is missing. (There's nothing that a $10 L2 cache won't fix.) Incoming consumed 75 to 150 Mbytes/sec of bandwidth with Celeron (with peaks hitting 200 Mbytes/sec); the other processors consumed 40 to 75 Mbytes/sec with peaks hitting 175 Mbytes/sec.
Although Celeron's 66-MHz bus has a theoretical bandwidth limit of 528 Mbytes/sec, the 200-Mbyte/sec peaks that it hit while running Incoming were probably closing in on the bus's practical limit. (However, Intel claims that the 440BX can achieve a 60 to 70% sustained bandwidth.) Using the 440BX chip set helps to increase the practical bandwidth limits on the main-memory bus and on any of the Intel-processor configurations we used. For starters, Intel increased the 440BX's input and output buffering; although the company wouldn't quote a number, it claimed that the BX's buffer depth is twice that of the LX. This depth improves the latencies associated with arbitration and DRAM-page misses. Additionally, the BX can keep as many as 32 SDRAM pages open (eight per DIMM), compared with only two with the LX. This ability helps threaded applications' performance and helps when the CPU is sharing main memory with the graphics controller. However, don't expect to soon see many Celeron systems using the 440BX chip sets. Intel is positioning the 440EX chip set for use in Celeron systems; the 440EX has many of the same features as the previous-generation 440LX.
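The relationship between the theoretical limit, Intel's claimed sustained figure, and the measured peaks can be put side by side. This Python sketch assumes a 64-bit (8-byte) main-memory data path, which is consistent with the 528- and 800-Mbyte/sec figures quoted in this article; the 60 to 70% range is Intel's claim as reported above, not a measurement.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit main-memory data path


def peak_bandwidth_mbytes(bus_mhz, width_bytes=BUS_WIDTH_BYTES):
    """Theoretical peak in Mbytes/sec: one full-width transfer per clock."""
    return bus_mhz * width_bytes


def sustained_range_mbytes(bus_mhz, low=0.60, high=0.70):
    """Range implied by a claimed 60 to 70% sustained efficiency."""
    peak = peak_bandwidth_mbytes(bus_mhz)
    return peak * low, peak * high


def fraction_of_peak(measured_mbytes, bus_mhz):
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_mbytes / peak_bandwidth_mbytes(bus_mhz)


for bus_mhz in (66, 100):
    low, high = sustained_range_mbytes(bus_mhz)
    print(f"{bus_mhz}-MHz bus: peak {peak_bandwidth_mbytes(bus_mhz)} Mbytes/sec, "
          f"claimed sustained {low:.0f} to {high:.0f} Mbytes/sec")

# Celeron's 200-Mbyte/sec peaks while running Incoming, on the 66-MHz bus.
print(f"200 Mbytes/sec is {fraction_of_peak(200, 66):.0%} of the 66-MHz peak")
```

Keep in mind that the measured values are 100-msec averages, so shorter bursts within an interval could sit closer to the practical limit than the averaged number suggests.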
F1 Racing Simulation produced more interesting results (Figure 4). The most noticeable result is that Celeron produces 120- to 180-Mbyte/sec DIMM-bus throughput (the highest amount) but 25- to 70-Mbyte/sec AGP throughput (the lowest amount). Compare these results with the 300-MHz P2 with a 66-MHz bus, which generated 70 to 110 Mbytes/sec on DIMM and 40 to 75 Mbytes/sec on AGP. These results indicate that the F1 game depends heavily on the L2 cache. Celeron's lack of L2 cache forces the CPU to compete with the graphics controller for all AGP accesses. If you compare the difference between the P2s with 66- and 100-MHz front-side buses, you can see that the increased bus speed minimally helps increase the AGP throughput. A more significant difference in AGP throughput comes from increasing the CPU's frequency to 400 MHz, thus giving the impression that the F1 game benefits from increased floating-point performance.
With the 400-MHz P2 installed in the motherboard, we surgically removed the StarFighter card and replaced it with ATI's All-in-Wonder card. We wanted to see if the graphics controller and driver made a difference on system resource usage. Sure enough, the difference was huge, even though both cards used the AGP2X mode to transfer texture data. As indicated, the system with the StarFighter card resulted in 30- to 130-Mbyte/sec DIMM-bus throughput when running Tomb Raider II. Running the same game sequence with the ATI card produced only 10- to 20-Mbyte/sec sustained DIMM-bus throughput. Likewise, running Incoming with the StarFighter card yielded 40- to 75-Mbyte/sec DIMM-bus throughput. But the ATI card yielded 15- to 20-Mbyte/sec throughput and only 8 to 12 Mbytes/sec on AGP.
Several hypotheses can explain this behavior. Much of the texture-bandwidth discrepancy results from the difference in the two cards' frame-rate performance. With the StarFighter card, the Incoming demo hit a frame rate of around 38 frames/sec; with the ATI card, the frame rate was approximately 19 frames/sec. (We believe that part of the reason for ATI's low frame rate related to the driver quality, not hardware.) Lower frame rate translates to lower texture demand.
Another hypothesis for this bandwidth throughput is that the StarFighter card, in conjunction with Real 3D's driver, manipulates all texture maps in AGP execute mode. The textures come straight across AGP directly into the graphics controller's pipeline. Therefore, if the pixel rates were high during an application, then a proportional amount of bandwidth would be texture-related. On the other hand, ATI's RagePro controller takes advantage of its 4-kbyte on-chip cache and local memory to store textures. The company also claims its driver uses a proprietary technique that stores textures in main memory in a way that increases the locality of reference and improves the chance of subsequent access hits. Based on the comparison of these two texture-transfer techniques, you can draw several conclusions:
Surprisingly, Photoshop, Premiere, and Indeo, all CPU-intensive applications, used little sustained DIMM-bus bandwidth. (Although we were using an AGP graphics card, we didn't look at AGP traffic, because AGP doesn't play an active role in 2-D applications.) During the testing of Photoshop, we ran a filter on a several-hundred-megabyte tagged image-format file. Although the filter operation took almost three minutes to complete, indicating that this operation was CPU-intensive, the 400-MHz P2 consumed 8.5 Mbytes/sec of DIMM-bus bandwidth, and both 300-MHz (66- and 100-MHz bus) P2s consumed 5.5 Mbytes/sec. Although Celeron used as much as three times more bandwidth than these processors, its DIMM-bus throughput was only 15 to 24 Mbytes/sec.
Similarly, when we used Adobe's Premiere to render a few second-long video clips, the operation took more than three minutes but consumed only 2 to 31 Mbytes/sec of DIMM-bus bandwidth (except for a few peaks). Again, Celeron consumed two to three times more DIMM-bus bandwidth, but this amount was still only 5 to 60 Mbytes/sec. Playing back an action-packed movie-clip (AVI) file with Intel's Indeo also consumed only 12 to 25 Mbytes/sec of bandwidth; Celeron ranged from 40 to 64 Mbytes/sec. Our results from these three applications indicate that the L1 cache plays a huge role in minimizing the sustained DIMM-bus bandwidth. Because we limited our testing to one file type and one filter type, I think that our Photoshop results are inconclusive. However, a faster processor is obviously advantageous for these CPU-intensive applications.
Our last set of tests revolved around the playback of DVDs. The primary functions of DVD playback are the decompression of the digital video and audio stored in compressed form on a DVD-ROM disk. Playback also includes the interpretation of navigation data, embedded with the video and audio. The navigation data allows users to perform random accesses, searches, fast forwards, and other interface functions.
As the DVD player reads the compressed data from the DVD-ROM, it deposits the compressed data in system memory. Theoretically, this action would consume a peak of 11 Mbps on the PCI bus. The DVD system then passes most of this compressed data to the decoder. In a PCI-based hardware-DVD implementation, this data would then travel back across the PCI bus to the decoder, accounting for another 11-Mbps maximum bandwidth consumption. If you use a software DVD implementation, you omit this step. However, this step is when the CPU starts processing the compressed data. Theoretically, this step could consume as much as 27 Mbytes/sec of DIMM-bus bandwidth.
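The PCI figures above are quoted in megabits per second, whereas the measurements in this article are in Mbytes/sec, so a quick conversion helps. This Python sketch assumes 8 bits per byte with no protocol overhead and counts only the compressed stream: one PCI crossing for a software decoder and two for a PCI-based hardware decoder.

```python
PEAK_COMPRESSED_MBPS = 11  # peak compressed DVD data rate quoted in the article


def mbps_to_mbytes_per_sec(mbps):
    """Convert megabits/sec to Mbytes/sec (8 bits per byte, no overhead)."""
    return mbps / 8


def pci_compressed_traffic(mbps, pci_crossings):
    """Compressed-stream PCI traffic for a given number of bus crossings."""
    return mbps_to_mbytes_per_sec(mbps) * pci_crossings


software_path = pci_compressed_traffic(PEAK_COMPRESSED_MBPS, pci_crossings=1)
hardware_path = pci_compressed_traffic(PEAK_COMPRESSED_MBPS, pci_crossings=2)
print(f"software decoder: {software_path:.3f} Mbytes/sec peak on PCI")
print(f"hardware decoder: {hardware_path:.3f} Mbytes/sec peak on PCI")
```

Either way, the compressed stream by itself accounts for only a few Mbytes/sec of PCI traffic at its peak.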
When the data arrives at the hardware decoder, it first passes through a content-scramble-system (CSS) decryption algorithm. From there, the data passes through a parser, which separates the video and audio streams. While the video stream runs through a decoder that performs an MPEG-2 and rendering algorithm, the DVD system performs Dolby digital, linear-pulse-code modulation, or MPEG decoding on the audio stream. Rendering puts an extra burden on the DVD system because most DVD titles originate from motion-picture studios for presentation in dark theaters or on TV; this situation means that the video can appear too dark on a PC. The decoder should recognize the difference and perform color and gamma correction to adjust color and brightness. The audio also presents a challenge because it can be multichannel and multilingual. The navigator must determine and play back the user-selected language, a continuous decoding process.
But the most demanding aspect of DVD playback is that the system must perform all the computations in real time. The DVD player must also supply the decoder with compressed data at minimal latency. On the surface, the compressed-data bit rate, 1 to 11 Mbps, may seem trivial. But the computational demands of DVD playback are enormous. Furthermore, the PC98 Design Guide and the Windows Hardware Quality Lab (WHQL)--see www.microsoft.com--require the CPU to run background tasks without interrupting DVD playback.
In this project, we investigated three categories of DVD playback. These categories included a pure hardware approach using Creative Labs' $281 PC-DVD Encore, which integrates C-Cube's $20 ZiVA-PC DVD-decoder chip. We also used a hardware-video/software-audio-decoding product from ATI, part of the company's $379 All-in-Wonder Pro DVD board. (This is a full-featured product.) ATI also supplied us with its pure-software DVD player. Both ATI DVD implementations use the motion-compensation circuitry contained in the RagePro Chip.
We measured the CPU usage and DIMM- and PCI-bus bandwidth consumption while playing an action-packed scene from the movie Air Force One. Although our testing didn't provide conclusive evidence on the system impact of DVD, it allowed us to make a relative comparison among the technologies.
We first compared ATI's hardware and software implementations using each of the Intel CPU configurations. Regardless of the CPU, the data throughput on the PCI bus had minimal variance, as we expected. What we didn't expect was the amount of PCI throughput; it ranged from 10 to 60 Mbytes/sec with an average of around 36 Mbytes/sec. Assume a maximum data throughput of 11 Mbps as the player reads the compressed data from the DVD drive. Add to that the 384 kbytes/sec of audio data that the DVD player sends over the PCI bus to the audio card (48 kHz × 16-bit samples × two channels = 384 kbytes/sec). PCI-bus traffic resulting from the DVD playback should be less than 1.5 Mbytes/sec. Obviously, DVD playback was not the only activity occurring in the system, but Intel, ATI, and C-Cube could not rationalize the "extra" activity (beyond the usual system-level overhead). The only suggestion I have is to run the test again and use each of the PCI master's Request/Grant signals as qualifiers to isolate each of the data sources. However, we didn't have time to do this testing.
ATI's software DVD player uses the host CPU to perform all the CSS decryption, bit-stream parsing, AC-3 decoding, and video decoding. This player uses AGP memory to store the video stream. After the CPU processes the video, ATI's software programs the RagePro chip to use the AGP 2X mode to transfer the video from AGP memory to graphics memory. At 24 frames/sec, this transfer would consume a maximum of 16.6 Mbytes/sec over the front-side bus. During the operation of ATI's software DVD player, Celeron consumed 0 to 118 Mbytes/sec of DIMM-bus bandwidth (an average of around 53 Mbytes/sec); the remainder of the Intel processors consumed 0 to 90 Mbytes/sec (an average of around 47 Mbytes/sec). This information indicates that an L2 cache provides a slight advantage for this implementation of DVD playback.
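The 16.6-Mbyte/sec figure is not derived in the text, but one plausible set of assumptions reproduces it: a 720×480-pixel decoded frame stored at 2 bytes per pixel and transferred 24 times per second. The frame size and pixel depth in this Python sketch are assumptions chosen for illustration, not values confirmed by ATI or by the article.

```python
def video_bandwidth_bytes_per_sec(width, height, bytes_per_pixel, frames_per_sec):
    """Bytes per second needed to move every decoded frame once."""
    return width * height * bytes_per_pixel * frames_per_sec


# Assumptions: 720x480 frame, 2 bytes per pixel (for example, a packed YUV format).
FRAME_WIDTH = 720
FRAME_HEIGHT = 480
BYTES_PER_PIXEL = 2
FRAMES_PER_SEC = 24  # frame rate quoted in the article

rate = video_bandwidth_bytes_per_sec(
    FRAME_WIDTH, FRAME_HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SEC
)
print(f"{rate:,} bytes/sec = {rate / 1e6:.1f} Mbytes/sec")
```

Other frame sizes or pixel formats would change the result proportionally, so treat this as a consistency check rather than a description of ATI's implementation.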
ATI's hardware DVD player consumed considerably less DIMM traffic than the software implementation, but the gap widened between Celeron and the other Intel processors. Celeron's DIMM-bus bandwidth ranged from 15 to 88 Mbytes/sec and averaged around 36 Mbytes/sec; the remaining Intel processors used 5 to 40 Mbytes/sec, averaging 14.5 Mbytes/sec. These results indicate that DIMM-bus bandwidth is minimal, even though this DVD-player implementation is still using the host CPU to perform CSS decryption, stream parsing, and AC-3 audio decoding (through Microsoft's DirectSound).
The Creative Labs Encore DVD implementation produced the highest quality DVD playback and also had the least impact on system resources; CPU usage was less than 5%. This percentage is exactly what you'd expect from an almost-all-hardware DVD player; Encore uses the host to run the DVD navigator. The PCI-bus throughput, although 20% lower than the other DVD implementations, still averaged more than 28 Mbytes/sec. During the operation of this player, the DVD drive transfers 11 Mbps of compressed data to main memory. Then, the player transfers most of this data back over PCI to the decoder. Encore uses a video loop-back cable that attaches externally to the video card, so the player generates no additional bus traffic from decompressed and decoded video. However, the audio stream, although minimal, must still go back over PCI to the audio codec.
In this phase of our testing, we used Anchor Chips' AN3041Q Co-Mem board to place a measured bandwidth load on the PCI bus by performing reads from main memory. Mike Davis of Anchor Chips wrote a program that used the Co-Mem's instruction-cache-fill mechanism to read 1024-byte blocks from host memory in each PCI transaction.
The Co-Mem configuration achieved a maximum DIMM-to-Co-Mem bandwidth of 105 Mbytes/sec. However, to play a DVD movie without quality degradation, we had to lower the Co-Mem's bandwidth consumption to no more than 25 Mbytes/sec. Even with only a 25-Mbyte/sec load on the PCI bus, the software DVD player dropped many frames. This drop was apparently caused by making the CPU wait. The isochronous nature of the DVD data stream requires the DVD decoder to resynchronize when it falls behind, which wastes additional CPU cycles and further degrades the software DVD quality.
This hands-on project revealed some interesting information about the behavior of high-end PC applications. The bottom line is that if you can afford the best PC, buy a 400-MHz or faster system with a 100-MHz front-side bus. Today, buying this system is like owning a Ferrari that you can't get out of first gear, but the applications are coming soon that will take advantage of this level of performance. Trust me, I've seen them.
Thanks to Hewlett-Packard and Intel for their tremendous support. Thanks to Dan Francisco for handling the politics. Special thanks to Will Morris for his integrity, perseverance, and friendship during this laborious project.
Designing a platform that supports 100-MHz bus frequencies presents many challenges for system-design engineers. The 50% increase in bus frequency translates into shorter cycles, faster signal edge rates, and higher power consumption. To produce a reliable design, you must carefully analyze system ac timing, board layout, signal integrity, electromagnetic interference (EMI), and system thermodynamics. Furthermore, to avoid additional system cost for 100-MHz platforms, you must meet all design specifications using high-volume manufacturing materials and techniques; that is, standard pc-board materials, requiring no additional glue logic or signal termination. The synchronous-DRAM (SDRAM) interface presents a particularly difficult challenge, because of the wide variation in the number of possible configurations--ranging from one single-sided DIMM to four double-sided DIMMs.
Meeting the system ac timings is the first design challenge for 100-MHz bus frequencies. With a 10-nsec clock period, you must carefully distribute the timing budgets to the processor, chip set, SDRAM, clock component, and motherboard. Proper timing allocation provides the most layout flexibility. The primary components in the analysis are clock-to-output time (Tco) for the driver (processor, chip set, or SDRAM), setup (Tsu) or hold time (Thold) at the receiver (processor, chip set, or SDRAM), flight time (motherboard transmission-line delay), clock skew, and clock jitter. Because of fast edge rates, the timing equation should also include a component to compensate for simultaneous switching effects (Tsso). Table A provides a sample timing calculation for a 100-MHz SDRAM interface.
The timing calculation shows a positive margin of 0.70 nsec for a three-DIMM-board design when the SDRAM component is driving data to the memory controller (in this case, Intel's 440BX). This result assumes worst-case voltage, temperature, and component timings. The Tco parameter for the SDRAM assumes a 0-pF load. Intel extrapolated this number from the PC-100 SDRAM specification for Tac (output valid from clock), which is 6 nsec into a 50-pF load. The TSKEW number is a sum of the clock-component (±250 psec), board (±100 psec), and DIMM (±280 psec) skews.
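A setup-timing budget of this kind reduces to a subtraction from the clock period. The Python sketch below takes only the 10-nsec period and the three skew contributions from the text; every other number is a hypothetical placeholder and not a value from Table A, so the resulting margin is illustrative only. The simple-sum form of the equation is also an assumption about how such a budget is usually tallied.

```python
CLOCK_PERIOD_PS = 10_000  # 100-MHz bus: 10-nsec clock period

# Skew contributions quoted in the article (magnitudes of the +/- values, in psec).
SKEW_PS = {"clock component": 250, "board": 100, "DIMM": 280}
total_skew_ps = sum(SKEW_PS.values())


def setup_margin_ps(period_ps, tco_ps, flight_ps, tsu_ps, skew_ps, jitter_ps, tsso_ps):
    """Setup margin: clock period minus every delay in the signal path."""
    return period_ps - (tco_ps + flight_ps + tsu_ps + skew_ps + jitter_ps + tsso_ps)


# Hypothetical placeholder values, NOT the numbers from Table A.
placeholder = dict(tco_ps=5000, flight_ps=1000, tsu_ps=2000, jitter_ps=250, tsso_ps=300)

margin = setup_margin_ps(CLOCK_PERIOD_PS, skew_ps=total_skew_ps, **placeholder)
print(f"total skew: {total_skew_ps} psec")
print(f"setup margin with placeholder values: {margin} psec")
```

A positive result means the path meets setup time with room to spare; a negative result means that some allocation, typically flight time, has to shrink.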
Although the component vendors set most of the timing parameters, the system designer must pay particular attention to parameters that can affect these component timings, especially voltage and temperature. If the system voltage or temperature exceeds the component specifications, the ac timings are no longer guaranteed. Table A's SDRAM timing example assumes the parameters in Table B.
You can control the timing budget of the flight time with the motherboard layout. In addition to the trace length, you must also consider the trace width and thickness, as well as the board's dielectric material. These factors determine the final propagation velocity and characteristic impedance for the signals. The physical routing of the traces is also important for both flight time and signal integrity. Poor signal integrity can also affect the flight time, because signal reflections and ringing increase the signal's settling time. In all cases, Intel recommends analog simulation.
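As a rough first-order check before simulation, you can estimate the unloaded delay of a trace from its length and the board's propagation velocity. The Python sketch below uses the Table B figures (a 4-in. run to the center of the T and 1.6 to 2 nsec/ft); the helper name is hypothetical. The estimate covers only that segment and ignores the branches to the DIMMs, capacitive loading, reflections, and settling time, all of which presumably account for the much larger 2.15-nsec flight time in Table A.

```python
def unloaded_trace_delay_ns(length_in, velocity_ns_per_ft):
    """Delay of an unloaded trace: length in inches times propagation delay in nsec/ft."""
    return (length_in / 12.0) * velocity_ns_per_ft


TRACE_LENGTH_IN = 4.0            # Table B: maximum length to the center of the T
S0_RANGE_NS_PER_FT = (1.6, 2.0)  # Table B: propagation-velocity range

fast, slow = (unloaded_trace_delay_ns(TRACE_LENGTH_IN, s0) for s0 in S0_RANGE_NS_PER_FT)
print(f"Unloaded delay over {TRACE_LENGTH_IN:.0f} in.: {fast:.2f} to {slow:.2f} nsec")
```

The result, roughly 0.5 to 0.7 nsec, is a lower bound on the flight time, not a substitute for the analog simulation that Intel recommends.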
EMI is another area for concern with increased system-bus frequencies. For EMI analysis, the signal edge rates (and not just the frequency) are of primary importance. These fast edge rates result in more radiated power in the higher frequency harmonics. You should follow standard EMI-reduction practices for signal routing for fast-edge-rate signals and high-frequency system clocks. Designers often overlook the ground-return path for these high-edge-rate signals. Providing a good return path helps to minimize the loop inductance and reduce radiated emissions. Good system shielding with isolated shield grounding also helps to control radiated emissions.
Clock-synthesizer vendors use spread-spectrum techniques to help reduce radiated emissions. Although these components help to reduce radiated power levels, the clock modulation that spread spectrum uses can result in increased susceptibility to cycle jitter and can also increase clock skew in other PLL-generated system clocks. Clock vendors currently offer several options for the maximum frequency modulation (specified as a percentage of the center frequency) but do not recommend using a modulation greater than 0.5%. Intel's SE440BX motherboard, for example, uses 0.5% modulation. You should also use simulation or lab measurements to verify system timings.
Higher frequency also translates to higher power consumption. (Power∝V²FC, where V=voltage, F=frequency, and C=capacitive load.) This higher power consumption translates to higher case and junction temperatures. Intel recommends a complete system thermal analysis to verify the case temperature for the µP, memory controller, and SDRAM. You may have to add airflow or a heat sink to meet the case-temperature specifications.
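To see what the proportionality implies, the following Python sketch scales switching power by the square of the voltage ratio, the frequency ratio, and the load ratio. It is a minimal illustration that assumes voltage and capacitive load stay unchanged when the bus clock rises from 66 to 100 MHz; it ignores static power and any change in bus activity, and the function name is illustrative.

```python
def relative_dynamic_power(v_ratio, f_ratio, c_ratio=1.0):
    """Scale factor for switching power, taking power as proportional to V^2 * F * C."""
    return (v_ratio ** 2) * f_ratio * c_ratio


# Bus clock rising from 66 to 100 MHz with voltage and load unchanged.
scale = relative_dynamic_power(v_ratio=1.0, f_ratio=100.0 / 66.0)
print(f"Relative switching power: {scale:.2f}x")
```

Under these assumptions, the bus-related switching power rises by about half. Because voltage enters as a square, the supply tolerance also matters in the thermal analysis.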
As the industry moves to 100-MHz system and memory-bus operating frequencies, design engineers face new challenges in producing low-cost, reliable system designs. It's becoming more difficult to meet ac-timing, signal-integrity, EMI, and thermal requirements. You should use thorough system analysis and simulation to help produce a solid design. Chip-set vendors provide design guides with detailed topology and layout information to aid in motherboard routing. You can find Intel's 440BX design guide on the Web at http://developer.com/design/chipsets. And, although computer OEMs are only now introducing 100-MHz systems, they are already planning the next generation of systems with even higher operating frequencies. Because of the limitations of SDRAM, Direct Rambus (www.rambus.com) memory will be the technology of choice to meet future memory-bandwidth requirements and to minimize implementation costs.
Joel Birch does technical marketing in Boeblingen, Germany, for HP's E2920 family of computer verification tools. He recently moved from Silicon Valley, where he was a field applications engineer for 10 years with HP's digital-design and test-and-measurement products. He specialized in real-time measurements for debugging and performance optimization of Motorola 68K and PowerPC embedded software and hardware systems.
Mike Davis is a senior applications engineer at Anchor Chips, specializing in PCI. He has 25 years of experience designing µPs into applications, including TV character generators, networked laser printers, OCR (optical character recognition), document image management, and video compression. Davis holds a BSEE from Lehigh University (Bethlehem, PA).
Tim Harvey, an application engineering manager at Anchor Chips, directs a team specializing in PCI, USB, and embedded-system applications. He has extensive design experience in PC-, satellite-, TV-, and communication-product development. As principal system engineer at TWAV Inc, he designed and implemented a fully integrated, PC-based, telephone-video-transmission system. Harvey holds a BSEE from the University of Illinois--Urbana/Champaign.
Craig Kirkpatrick (craig_kirkpatrick@hp.com) is a 14-year HP employee. He's currently a field applications engineer in Portland, OR, providing training and consulting on HP test-and-measurement equipment that supports the development and verification of computer systems.
Will Morris, with Intel since 1984, is a senior technical marketing engineer in the OEM Platform Solutions Division (OPSD). One of his roles is helping to define OPSD's flexible desktop motherboards, such as the PD440FX, AL440LX, and SE440BX used in this project.
Chuck Small, a 31-year employee of HP, was part of the team that designed HP's first logic analyzer. Currently, he is product manager for Intel-processor support.
Steve T Zaske has been a technical software marketing engineer with Intel's Open Platform System Division for two years. He supports the Intel Performance Evaluation and Analysis Kit (IPEAK). IPEAK is a family of evaluation tools that test AGP, ACPI, hard-drive, and 3-D-graphics performance. Before working at Intel, Zaske spent four years at Semantech (www.semantech.com) as a software-development engineer.
Table A--Sample timing calculation for 100-MHz interface
| DRAM driving parameters | Three unbuffered DIMMs (nsec) |
| Tco.DRAM.max | 5.2 |
| Tflight.max | 2.15 |
| Tsso.brd.max | 0.4 |
| Tskew | 0.63 |
| Tjitter | 0.45 |
| Tsu.BX | 0.47 |
| Total | 9.3 |
| Period | 10 |
| Margin | 0.7 |
Table B--Board-design parameters
| Parameter | Value |
| Voltage | 3.3V ±5% |
| Temperature | 0 to 105°C |
| Trace width | 6 mils |
| Trace spacing | 6 to 10 mils |
| Maximum trace length | 4 in. to center of T |
| Board dielectric | 4.5 (FR4) |
| Characteristic impedance (Z0) | 65Ω ±15% |
| Propagation velocity (S0) | 1.6 to 2 nsec/ft |
| Routing topology | Balanced T |
For more information:
When you contact any of the following manufacturers directly, please let them know you read about their products on EDN's web site.
Acer Labs Inc, San Jose, CA, 1-408-467-7456, www.acerlabs.com
Adaptec Inc, Milpitas, CA, 1-800-959-7274, www.adaptec.com
AMD, Austin, TX, 1-800-222-9323, www.amd.com
Anchor Chips, San Diego, CA, 1-619-676-6815, www.anchorchips.com
ATI Technologies Inc, Thornhill, ON, Canada, 1-905-882-2600, www.atitech.com
C-Cube Microsystems, Milpitas, CA, 1-408-944-6300
Creative Labs, www.soundblaster.com
FuturePlus Systems Corp, Bedford, NH, 1-603-471-2734, www.futureplus.com
Hewlett-Packard, Colorado Springs, CO, 1-800-452-4844, www.hp.com
Hitachi America Ltd, Brisbane, CA, 1-415-589-8300, www.hitachi.com
Intel Literature Center, Mount Prospect, IL, 1-800-548-4725, www.intel.com
Microsoft Corp, Redmond, WA, 1-206-882-8080, www.microsoft.com
Microstar Computer Corp, Fremont, CA, 1-510-623-8585, www.msi.com.tw
Quantum Corp, Milpitas, CA, 1-408-291-2492, www.quantum.com
PC Power and Cooling, Carlsbad, CA, 1-760-931-5700, www.pcpowercooling.com
Real 3D, Orlando, FL, 1-800-393-7730, www.real3d.com
Yamaha LSI, San Jose, CA, 1-408-437-3133, www.yamaha.com
You can reach Technical Editor Markus Levy at 1-916-939-1642, fax 1-916-939-1650, markus.levy@worldnet.att.net.
Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.