December 4, 1997
EDN Hands-on project
Markus Levy, Technical Editor

Theoretically, the new PC architectures and buses provide a system with lots of bandwidth to transfer data and feed the CPU. To explore the reality behind the theory, a group of PC experts and I set out to test these architectures. Find out what we discovered about Slot 1, AGP, PCI, USB, and more.
The goal of this project is neither to pit one company's CPU, graphics card, or disk drive against another's, nor to provide complete benchmark information about the products. The primary goal is to analyze some of the PC architectural choices and provide an understanding of AGP vs PCI graphics and Socket 7 vs Slot 1 bus structures. I also got a feel for how some PCI and USB peripherals affect system performance. To begin this project, I collected PC components from many companies (see box "For more information..."). I then assembled a huge group of PC experts (see box "Men at Work").

Initial roadblocks

Although I started planning for this project almost a year ago, the real work didn't begin until August. I flew to San Diego, home of Anchor Chips, where President and CEO Ron Sartore and his crew worked with me for a week as we assembled our state-of-the-art computer systems. Initially, we had an Intel system based on Intel's new 440LX chip set and an Acer Labs reference design based on Acer's Aladdin 4+ chip set. For the operating system, we used Microsoft's OSR2.1, code-named "Detroit"; this OS is essentially an enhanced version of Windows 95 and supports AGP and USB.

Software installation

To begin the software installation, we booted DOS from a floppy disk and then created four equally sized partitions on our Western Digital Caviar 5.1-Gbyte hard drive. We used one of the partitions to store all the data on the CD-ROMs that we used throughout the project. This approach proved quicker than using the CD-ROM drive for installing applications, drivers, and operating systems. From the DOS prompt, we performed a basic OSR2 installation on the C: partition; by "basic," I mean we used a standard VGA driver and no DMA support for the hard disk, for example.

Next, we used the DOS FDISK command to switch primary partitions. You can also use PowerQuest's Partition Magic product to create, switch, and copy partitions. After booting from DOS again, we installed OSR2 on the new primary partition. From here, we loaded Windows and used Explorer to copy all the files from the original C: partition to a subdirectory on the storage partition. We set the options in Explorer to allow us to view hidden files, because we had to copy those to that subdirectory. The benefit of all this work was that we could erase our C: partition and reinstall OSR2 simply by copying all the files in the subdirectory to the erased partition. This approach let us avoid reinstalling the operating system whenever we ran a benchmark and wanted to clean up the partition. We had to repeat these steps and install OSR2 after we moved the drive to another PC platform.

The next step was to install the USB supplement, which turns OSR2 into OSR2.1 and enables USB and AGP support. I obtained this supplement from Intel. I also went to Microsoft's Web site and downloaded its DirectX 5 driver, which is necessary to run some 3-D applications and benchmarks. I ran most of the benchmarks in this project with the DMA mode switched on for the hard drive.
You can switch to this mode by going into System Properties in the Control Panel and clicking on Device Manager, then Disk Drives, and then DMA. Note that Windows 95 lacks this capability.

Analyzing USB performance
The first device we checked out was a loop-back test tool from Intel. This tool, a 930Hx bus-powered hub with an embedded function, allows you to perform loop-back testing on bulk endpoints (addressable sources or sinks of data). We used it to ensure that the USB ports on the Intel and Acer boards were functioning properly and to generate USB-bandwidth consumption. The tool has three bulk-endpoint pairs. The endpoint address includes an endpoint number and a direction. One pair has a packet size of 64 bytes, the maximum packet size for a bulk endpoint. The packet size of an endpoint determines how many bytes of data that endpoint can transmit or receive in a single USB data packet. The other two pairs of bulk endpoints have 8-byte packets.

Unfortunately, we discovered that the tool is not an ideal bandwidth consumer, because its firmware sets up the bulk endpoints in in/out pairs. When the host sends data to the device via an out endpoint, the 930Hx copies the data from the receiving FIFO buffer to RAM and then from RAM to the corresponding in endpoint's transmitting FIFO buffer. The loop-back application running on the host sends pattern data to the out endpoint and then immediately reads and verifies the same data from the in endpoint. Transferring data from the 930Hx's RAM to the transmitting FIFO buffer is time-consuming; therefore, the 930Hx could not provide data at the host-requested rate.

When the host requests data from a USB device that has no data available, the device sends a not-acknowledge (NAK) packet to the host. The host then reschedules the request, sometimes immediately. The cycle of host request followed by device NAK continues until the device has data available, at which point the device transmits the data. The Intel loop-back tool demonstrated a lot of device-NAK activity.
The 930Hx operates in double-buffer mode, so it immediately services the first two bulk ins and outs, but the device generates several NAK signals on the third transfer before sending the data. Anchor Chips' software engineer Mark McCoy speculated that this delay occurred because the device loops back the data that it sends out. Looping back requires the device to read the entire in FIFO buffer and write all those bytes to the out FIFO buffer. McCoy also discovered another limitation of the tool: It supports a maximum transfer size of only 512 bytes, which translates to eight data packets for a 64-byte maximum-packet-size endpoint. (Windows allows USB transfers as large as 64 kbytes.) Once the 512-byte transfer completes, the host application and device driver must send a new transfer request to the driver. The USB analyzer indicates that this transfer request has a turnaround time of 3 to 4 msec, or three or four USB frames.

To simulate a typical output device that creates burst bulk outputs, the group used Anchor Chips' AN2131Q EZ-USB device. This device contains a USB transceiver, a serial-interface engine, an enhanced 8051 core, endpoint buffer memories, and 8 kbytes of code and data RAM. The EZ-USB device uses the USB in a novel way for downloading the device's operating firmware. (Check out photos and a detailed description of the EZ-USB's "renumeration" on EDN's Web site.) The EZ-USB performed bulk-out transfers in 512-byte chunks as fast as the host could submit them. This capability resulted in a sustained transfer rate of about 164 kbytes/sec. Each 512-byte transfer could easily complete in the time of one USB frame. The three- or four-frame delay between transfers explains why we measured about 164 kbytes/sec rather than the roughly 1 Mbyte/sec that is theoretically possible without those frame delays.
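The relationship between the turnaround time and the measured rate can be checked with a little arithmetic. The following Python sketch is only an illustration under stated assumptions: it treats one USB frame as 1 msec, counts 1 kbyte as 1000 bytes, and assumes that exactly one 512-byte transfer completes per turnaround period. The function names are hypothetical and do not come from any tool the group used.

```python
USB_FRAME_MS = 1.0  # assumed frame period of full-speed USB, in milliseconds


def sustained_kbytes_per_sec(transfer_bytes, turnaround_frames):
    """Throughput when one transfer completes per turnaround period.

    Assumes the data itself fits inside the turnaround window, as the
    article reports for 512-byte transfers.
    """
    seconds_per_transfer = turnaround_frames * USB_FRAME_MS / 1000.0
    return transfer_bytes / seconds_per_transfer / 1000.0


def implied_turnaround_ms(transfer_bytes, measured_kbytes_per_sec):
    """Turnaround time that a measured sustained rate would imply."""
    return transfer_bytes / (measured_kbytes_per_sec * 1000.0) * 1000.0


# Bounds for the 3- and 4-frame turnaround times seen on the analyzer
for frames in (3, 4):
    rate = sustained_kbytes_per_sec(512, frames)
    print(f"{frames} frames per transfer: about {rate:.0f} kbytes/sec")

# Turnaround implied by the measured 164 kbytes/sec
print(f"implied turnaround: {implied_turnaround_ms(512, 164):.2f} msec")
```

Under these assumptions, the measured 164 kbytes/sec falls between the three-frame and four-frame bounds, which is consistent with a turnaround that is usually close to three frames.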
USB bulk throughput isn't limited to the roughly 164 kbytes/sec that we measured, though. For a real application, you could get better bulk throughput by performing larger bulk transfers or by performing transfers of any size asynchronously. For asynchronous transfers, the host queues multiple transfers, so that one transfer is ready when another one completes. This approach avoids the three- or four-frame latency.
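To see why larger or queued transfers help, consider the following rough Python model. It is a sketch, not a description of how Windows or any USB driver schedules transfers. It assumes that each submitted transfer completes once per turnaround period, that queued transfers overlap perfectly, and that the bus caps the total at the article's theoretical figure of about 1 Mbyte/sec. The default turnaround of 3.1 msec is an assumption chosen because 512 bytes at about 164 kbytes/sec implies roughly that value; the function name and parameters are hypothetical.

```python
def queued_throughput_kbytes(transfer_bytes, queue_depth,
                             turnaround_ms=3.1, bus_limit_kbytes=1000.0):
    """Crude estimate of sustained bulk throughput in kbytes/sec.

    Hypothetical model: every queued transfer completes once per
    turnaround period, `queue_depth` transfers overlap, and the bus
    limits the total rate.
    """
    # Bytes per millisecond equals kbytes/sec when 1 kbyte = 1000 bytes
    per_transfer = transfer_bytes / turnaround_ms
    return min(queue_depth * per_transfer, bus_limit_kbytes)


# Compare small synchronous transfers with queued and larger ones
for size, depth in [(512, 1), (512, 4), (4096, 1), (65536, 1)]:
    rate = queued_throughput_kbytes(size, depth)
    print(f"{size:>6}-byte transfers, queue depth {depth}: "
          f"about {rate:.0f} kbytes/sec")
```

In this model, either a deeper queue or a larger transfer hides the turnaround latency until the bus itself becomes the limit. Real results depend on the host controller, the driver stack, and other bus traffic.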
We also worked with the Altec Lansing USB-connected speakers. Although we got them to play audio, any system activity, such as moving the mouse, caused audio breakup. We later determined that the speakers were not to blame. Apparently, the software sound emulation in Detroit is inadequate; the speakers worked well using Windows 98, code-named "Memphis."

Is AGP really better than PCI?

There's no doubt that AGP is better than PCI for 3-D graphics applications with large textures, but the proof may not be so obvious yet. AGP is a dedicated graphics bus based on Revision 2.1 of the 66-MHz PCI specification. AGP provides the graphics chip with direct access to textures stored in system main memory. AGP yields 528-Mbyte/sec peak bandwidth by transferring data on both the rising and falling edges of the 66-MHz clock. A full-blown AGP implementation incorporates sideband signals, which enable the graphics chip to pipeline and queue memory requests and allow the graphics chip to issue new addresses and requests while transferring data from previous requests. The AGP specification does not require a graphics chip to implement these sideband signals. Alternatively, the graphics chip can implement double-rate and 66-MHz, or "frame-mode," PCI. So far, ATI Technologies' 3D RagePro graphics chip is the only sideband-enabled AGP device. The other available AGP cards implement double-rate or frame-mode PCI. This situation, along with an absence of high-textured 3-D applications and benchmarks, makes it difficult to comprehend the practical value of AGP.

In an effort to probe the differences between AGP- and PCI-based graphics, the group devised several experiments. In the first and simplest experiment, we used Intel's AL440LX and Ziff-Davis' (Medford, MA) 3-D WinBench '97 with large-texture scene (which I call "3DWB-LT") to indicate AGP's capabilities. (You can download this free benchmark from the Ziff-Davis Web site at www.zdbop.com.)
We ran the 3DWB-LT benchmark with graphics cards from ATI, Nvidia, Number Nine, and Intergraph. Each vendor, except Intergraph, supplied us an AGP and a PCI card that used the same graphics chip. Intergraph provided only a PCI card. Intel's AL440LX allowed us to run the Pentium II at 200, 233, 266, and 300 MHz. For most of the graphics cards, we ran the 3DWB-LT at each frequency so that we could determine the benchmark's dependency on the host processor.

We checked the theoretical maximum benchmark performance using a null software driver that drops the graphics rendering of the benchmark into the bit bucket. In other words, this driver allows only the host CPU to perform the geometry calculations. Intel's AL440LX with a 300-MHz processor achieved a whopping 44.7 frames/sec on the 3DWB-LT using the null driver. Nvidia's Riva 128 AGP card, outperforming all the other cards, hit 37.5 frames/sec on this benchmark, indicating that the host CPU's floating-point engine was not the bottleneck. Despite this measurement, we discovered that performance for most of the graphics cards scaled linearly with the CPU's frequency. I speculate that this linear scaling reflects the time that the host CPU spends executing the software driver. You can find results of the test here. In summary, if you plot this data and perform a linear interpolation down to 0 MHz, you'll find that roughly two-thirds of the benchmark's performance relates to the CPU.

We also tested the RagePro AGP card from ATI. At 300 MHz, with 2 and 4 Mbytes of onboard graphics memory, the card yielded 21.8 and 23.4 frames/sec, respectively. These figures differ by only about 7%, indicating that ATI's implementation was less susceptible to memory-size differences and depended more on the AGP interface. Using ATI's PCI card, the benchmark results for 2 and 4 Mbytes were 0.37 and 1.96 frames/sec, respectively.
After comparing the AGP and PCI results, our immediate impression was that AGP must be fantastic: Who would ever want to use a PCI graphics card again? But when we repeated the benchmark for Nvidia's AGP and PCI cards, both with 4 Mbytes of local memory, we got results of 37.5 and 34.5 frames/sec, respectively. Additional local memory may further increase the graphics performance, because it would allow more of the textures to be stored locally. More important, however, a well-designed graphics engine yields good performance, whether it's on AGP or PCI. Nvidia's Riva chip uses a deeply pipelined architecture and a 12-kbyte texture cache that helps offload bus traffic, and the core runs at 100 MHz. It also has good DMA capability, which helps it run so well on either AGP or PCI.

After running this benchmark with all the graphics cards, we discovered that the test doesn't really stress the advantages of AGP; the test uses only 5.1 Mbytes of textures. ATI asserts that when using a 4-Mbyte card, only 1.76 Mbytes of textures reside in system memory, and, with a frame rate of 34 frames/sec, the graphics controller consumes 60 Mbytes/sec of bus bandwidth. This figure equals 45% of PCI bandwidth, 23% of PCI-66 bandwidth, and only 11% of AGP 2x bandwidth. A better AGP benchmark, although currently unavailable, would be one that changed textures every frame. This benchmark would thrash a graphics chip's cache and depend much more on using AGP's direct-execution mode.

Moving graphics to AGP undoubtedly frees some PCI bandwidth. There's also no doubt that some PCI utilization-intensive applications benefit from the extra bandwidth. For example, what if you were playing a PC game with heavy-duty 3-D graphics or working on a spreadsheet while transferring a file via Ethernet, while teleconferencing, or while playing a DVD movie (all made even more practical with split-screen technology)? Where do you get the "extra" bus bandwidth?
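ATI's bandwidth figures quoted above can be reproduced with simple arithmetic. The Python sketch below rests on two assumptions that are mine rather than ATI's: that the graphics controller fetches every texture byte held in system memory once per frame, and that each bus's peak rate is its nominal clock in megahertz times 4 bytes (times two transfers per clock for double-rate AGP). The names are hypothetical.

```python
# Assumed peak rates in Mbytes/sec: nominal clock (MHz) x 4-byte bus width
BUS_PEAK_MBYTES = {
    "PCI (33 MHz)": 33 * 4,
    "PCI-66": 66 * 4,
    "double-rate AGP": 66 * 4 * 2,  # data on both clock edges
}


def texture_traffic_mbytes(system_texture_mbytes, frames_per_sec):
    """Bus traffic if all system-memory textures are fetched every frame."""
    return system_texture_mbytes * frames_per_sec


# ATI's numbers: 1.76 Mbytes of textures in system memory at 34 frames/sec
traffic = texture_traffic_mbytes(1.76, 34)
print(f"texture traffic: about {traffic:.0f} Mbytes/sec")

for bus, peak in BUS_PEAK_MBYTES.items():
    share = 100 * traffic / peak
    print(f"{bus}: {share:.0f}% of {peak} Mbytes/sec peak")
```

Under these assumptions the traffic comes to about 60 Mbytes/sec, and the shares round to the 45%, 23%, and 11% that ATI cites. Peak rates are upper bounds; sustained bandwidth on a real bus is lower.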
As a practical test, I measured the PCI bandwidth consumption of a variety of standard PC products. These products include Sigma Designs' RealMagic Hollywood DVD card, Digital Semiconductor's 21143 Ethernet controller, Western Digital's Caviar UltraDMA hard drive, Iomega's portable Jaz SCSI drive and PCI Ultra SCSI card, and Altec Lansing's ACS500 USB-interface speakers.

We repeated a similar experiment with the Business Applications Performance Corp's (BAPCO, Santa Clara, CA, www.bapco.com) business-application benchmark. Running the benchmark suite with AGP- and PCI-based graphics, we noticed no appreciable performance difference as we varied the PCI-bandwidth utilization from 0 to 50%. At 50% utilization, performance between the systems with AGP and PCI varied 1 to 12%, depending on the application the benchmark was running. However, with 90% utilization, performance varied 9 to 34%, depending on the application. Unfortunately, because of time restrictions, we chose no PCI-utilization points between 50 and 90%. Such points would have allowed us to determine the inflection point at which PCI utilization starts to affect the benchmark results by impeding the PCI-based graphics performance.

Socket 7 vs Slot 1 is one of the biggest debates in the PC industry. Socket 7 is an open specification; Slot 1 is Intel-proprietary. In Socket 7 systems, the L2 cache and main memory share a bus; Slot 1 systems have a dual-bus architecture. In Socket 7 systems, the CPU accesses the L2 at main-memory speeds; in Slot 1 systems, the CPU accesses the L2 at one-half the core frequency. Socket 7 proponents AMD and Cyrix assert that Socket 7 still has several years of life. Yet, both companies are developing new bus structures. Intel, on the other hand, has stopped all Socket 7 marketing efforts. Furthermore, the company has gone beyond even pushing Slot 1 designs and has started promoting the proprietary Slot 2 architecture.
A political and theoretical discussion regarding Socket 7, Slot 1, and other main-memory and L2-bus implementations is beyond the scope of this article. However, our group of PC experts used a test that demonstrated some obvious benefits of Slot 1. Again using HP's E2920 PCI exerciser, Thomas Dippon, an application specialist for PCI-test tools at HP, set up a script to instruct the PCI exerciser to perform burst reads into main memory using the memory-read-multiple PCI command. (You can view this script, as well as instructions on using the E2920 here.)

For this part of the testing, we obtained an Intel AN430TX PC platform, a Socket 7 implementation that works with processors such as Intel's 233-MHz Pentium with multimedia extensions. To compare apples with apples, we decreased the Pentium II's speed from 300 to 233 MHz on Intel's AL440LX. Both PC platforms come with 512 kbytes of L2 cache, and both have a 33-MHz PCI bus and a 66-MHz system bus. Both PC platforms also used the same SDRAMs, disk drives, and PCI graphics cards and monitors.

During this test, we ran the BAPCO benchmarks three to five times on each PC platform. Each time we ran the test, we changed the throughput of PCI-to-SDRAM reads by changing the read burst length. Longer bursts produced higher PCI throughput and, therefore, higher bandwidth consumption on the main-memory bus. For example, when we set up the E2920 to perform continuous one-double-word bursts, PCI throughput was 6.6 Mbytes/sec; burst lengths of eight double words yielded a throughput of 40 Mbytes/sec; and so forth.

The results of the test unquestionably favored Slot 1. On Intel's AL440LX PC platform, results varied only 2 to 7% from those we obtained without E2920-induced reads, even when the E2920 PCI exerciser created 60-Mbyte/sec throughput. This throughput is only 11% of the main memory's throughput of 533 Mbytes/sec.
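One way to see why longer bursts raise PCI throughput is to assume that every burst pays a fixed number of overhead clocks in addition to one clock per double word. This fixed-overhead model is my own simplification, not something the E2920 script or the article states; real overhead varies with arbitration, target latency, and chip-set behavior. The Python sketch below fits the overhead to the one-double-word measurement and then checks the prediction against the eight-double-word figure. All names are hypothetical.

```python
PCI_CLOCK_MHZ = 33.0   # nominal PCI clock of both test platforms
BYTES_PER_DWORD = 4


def overhead_clocks(burst_dwords, measured_mbytes_per_sec):
    """Per-burst overhead (in PCI clocks) implied by a measured throughput."""
    clocks_per_burst = (burst_dwords * BYTES_PER_DWORD * PCI_CLOCK_MHZ
                        / measured_mbytes_per_sec)
    return clocks_per_burst - burst_dwords


def burst_throughput(burst_dwords, overhead):
    """Predicted Mbytes/sec for back-to-back bursts of a given length."""
    return (burst_dwords * BYTES_PER_DWORD * PCI_CLOCK_MHZ
            / (burst_dwords + overhead))


# Fit the assumed fixed overhead to the one-double-word data point
overhead = overhead_clocks(1, 6.6)
print(f"implied overhead: about {overhead:.0f} clocks per burst")

# Compare the prediction with the reported 40 Mbytes/sec at eight dwords
for dwords in (1, 8, 16):
    rate = burst_throughput(dwords, overhead)
    print(f"{dwords:>2} dwords per burst: about {rate:.1f} Mbytes/sec")
```

With the overhead fitted this way, the model predicts a little under 40 Mbytes/sec for eight-double-word bursts, close to the reported figure, and suggests that bursts of roughly 16 double words would be needed to reach about 60 Mbytes/sec. Treat these numbers as a plausibility check rather than a measurement.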
But, at the same throughput, benchmark results on the AN430TX PC platform were 80 to 85% lower than on the unmodified test. Even when we used the E2920 to generate only 6.6 Mbytes/sec of main-memory reads, the BAPCO results were down 22 to 37%.

Our testing did not consider all the factors that could affect the performance of this benchmark. For example, the 430TX and 440LX may have different snoop mechanisms and chip-set buffer sizes. You should also consider the algorithms that a chip set uses for managing main-memory pages and bank switching, as well as processor capabilities, such as Pentium II's out-of-order instruction execution. So, although this test provides no conclusive evidence that a Slot 1 implementation is always significantly better than a Socket 7 implementation, the results do show drastic performance differences.

In general, we concentrated on the new and emerging bus enhancements, such as AGP, Slot 1, and USB. Other system functions are not static, however. For example, although the PCI-bus-interface architecture of DMA combines with a FIFO buffer to satisfy many applications' needs, a new architecture based upon caching SRAM may unleash additional performance gains. This new architecture, as Anchor Chips' AN3041Q device demonstrates, minimizes the number of steps necessary to transfer data into PCI local processors and DSPs. Additionally, cost-sensitive applications can eliminate local memory by caching their memory needs across the PCI bus to main memory. From a system perspective, the larger memories that this architecture supports allow the local processor to exploit PCI's bursting capability, resulting in the movement of lots of data with minimal PCI-bus overhead.

As you might guess, bursting affects other system components. To test this effect, our group loaded down the PCI bus using the AN3041Q. In this setup, some motherboards could support sustained transfer rates of 50 Mbytes/sec, whereas others topped out at 25 Mbytes/sec.
This variance in transfer rates gives designers a challenge to maintain a balanced system.

Other sources

The best all-around reference for PC architecture is the MindShare PC System Architecture Series, published by Addison Wesley (Reading, MA). You can check out these books at www.mindshare.com or www.aw.com/devpress. A good place to obtain practical information about PC architecture and performance analysis is Tom's Hardware Page at www.sysdoc.pair.com. This Web site also offers lots of links to other good sites.

Acknowledgments

In addition to the "Men at Work" team, I'd like to acknowledge George Alfs and Michael Greene of Intel, Eric Lundgren and Rick Osborne of ATI, Mike Blaskovich of Digital, Stuart McClaren of Nvidia, Donald MacDonald and Jeanne Cotter of HP, Maurizio de Julio of Anchor Chips, J Taylor, and Maury Wright of EDN.
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.