Feature

Balancing in three dimensions

Graphics-chip suppliers walking a tightrope to success are encountering numerous obstacles that may cause their downfall. The best-positioned companies have diverse, flexible product lines that help you negotiate your own unique system-balancing acts.

By Brian Dipert, Technical Editor -- EDN, 4/27/2000

AT A GLANCE

Choose your vendors carefully; your application may have different priorities from those of a graphics subsystem that targets a different use.

The graphics manufacturers' slow but sure incorporation of functions that originally ran in software on the CPU may end within the geometry pipeline.

Silicon innovations are meaningless if content developers and application-programming interfaces don't support them.

When deciding on a CPU-to-graphics subsystem interconnection, make sure you calculate worst-case bandwidth requirements to avoid periodic reductions in frame rate.

Exotic memory technologies most easily find homes in performance-starved graphics frame and texture buffers.

Parallel processing, both within a chip and among multiple chips, helps overcome bottlenecks at the front and back ends of the graphics pipeline.

The addendum to this article provides links to other good sources of graphics information and the results I got using iPeak.

If you have access to the print version of this article, you may want to grab your 3-D glasses and check out some of the images in 3-D, starting on pg 54.

Of all the bizarre technologies in electronics, 3-D graphics has got to be one of the strangest. Where else do you find dozens of chip companies (plus, in some cases, additional board manufacturers) chasing after only three significant opportunities: PCs (and related workstations), home- and arcade-game consoles, and visual-simulation systems? Within the biggest of these, PCs and workstations, only two relatively small sets of users (game players and digital-content creators) really need robust 3-D graphics, in spite of what the marketers might say. And you, the system designers, know this fact, which is why you're relentlessly driving the chip and board suppliers to deliver higher quality and faster performance (so the system specs will look good on your products' ads) at low to no cost.

John Latta, president of graphics-consulting company 4th Wave, sums the situation up well: Graphics-chip architects face "a difficult dynamic," he says. They must "design to a perceived need, a shallow market demand, and no cost adder. These are the complexities of a technology in search of a market."

Short-term skepticism aside, 3-D-graphics capability pervades all new desktop and workstation computers and is increasingly muscling its way into notebook platforms as well. It's only a matter of time before someone figures out how to "three-dimensionalize" the graphical user interface or finds a more mainstream killer application for the technology than Id Software's Quake or Epic Games' Unreal. Sega and Sony are reaping the rewards of rich graphics capability, with Microsoft, Nintendo, and others hoping to duplicate their success. And Internet appliances and set-top boxes give glimpses into the kinds of systems that will in the not-too-distant future expand 3-D graphics into a diversity of applications.

To survive until the application base expands, however, the graphics companies will have to be nimble, accurately forecasting what their customers will require 18 months to two years in the future and quickly responding to the inevitable variations from the plan that still occur along the way. They'll have to contend with application developers (struggling with profitability and time-to-market concerns of their own), who, in attempting to target the largest customer base, are loath to support leading-edge or proprietary features. They'll have to dance an uneasy tango with the microprocessor vendors, which are also trying to suck graphics functions into both their own chip sets as hardware and their host CPUs as software. And they'll have to diversify their product lines as much as they can, so that a downturn in one segment of the market doesn't lead to their demise.

What's the use?

The three main 3-D graphics applications—game playing, digital-content development, and visual simulation—may appear similar at first glance, but they have widely varying requirements. Traditionally, first-person-shooter and other action games have dominated the stand-alone and PC arcade genre. In these environments, high frame rate has historically been the most important feature (Reference 1). Players won't tolerate getting blasted by an alien or crashing into a barrier because the screen updates lag behind the human reflex-response time. As a result, games-targeted-graphics vendors concentrate on the all-important pixel-fill rate and take short cuts in nearly every other aspect of hardware and software design. Image-quality enhancements are a low priority. The vendors use a few large polygons to construct 3-D objects; as your character sprints down a dark hallway in search of bad guys, you wouldn't notice the fine detail on things you pass and blast.

The performance-is-everything drumbeat is becoming more muted with time, however. The average size of each computer monitor isn't growing drastically, meaning that higher resolution displays don't automatically gobble up increases in graphics chips' pixel-fill rates. The vendors must find other reasons, such as true color, more accurate texture blending and multitexture application, and antialiasing, to justify both your initial purchase and subsequent upgrades to their fastest, newest, most expensive chips. At 30 to 60 frames/sec (a hotly debated and individual-specific threshold), the human eye can't detect—or therefore react to—frame-to-frame differences representing, for example, character movement. Note, though, that the longer the frame-change delay, the more it impacts the overall system response time to user input. Also, the peak frame frequency is less meaningful than the sustained average (or better yet, worst-case) specification. Game players quickly notice update rates that dip, no matter how briefly, below the players' detection threshold. Such a dip occurs, for example, when large amounts of new texture information must transfer to the graphics subsystem as a character enters a room.
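To make the peak-versus-sustained distinction concrete, here is a minimal Python sketch that derives all three figures from a list of per-frame render times. The function name and the sample trace (nine 10-msec frames plus one 50-msec stall, such as a texture upload might cause) are hypothetical values chosen for illustration, not measurements.

```python
def frame_rate_stats(frame_times_ms):
    """Return (peak, average, worst-case) frames/sec from per-frame times in msec."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    peak = 1000.0 / min(frame_times_ms)          # fastest single frame
    average = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)
    worst = 1000.0 / max(frame_times_ms)         # slowest single frame
    return peak, average, worst


# Hypothetical trace: steady 10-msec frames with one 50-msec stall.
times = [10.0] * 9 + [50.0]
peak, average, worst = frame_rate_stats(times)
print(f"peak {peak:.0f}, average {average:.1f}, worst-case {worst:.0f} frames/sec")
```

In this made-up trace, the peak and average figures both look healthy, yet the single stalled frame falls below the 30-frames/sec end of the threshold range, which is the dip a player would notice.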

Games are also diversifying. Although an action or simulation program might require less than ideal quality, a slower but more immersive environment, such as an adventure, a fantasy, or a role-playing title, values accurate color, fine detail, and sharp object edges. Consider the reasons behind the success of games such as Myst and Riven: Users require a rich graphical world to draw them into the game's premise and plot. As games deliver characters that look more human and surroundings that more resemble our own, our expectations for their realism correspondingly increase (Reference 2). Any application of text to a 3-D surface also quickly unmasks a graphics architecture with subpar quality capabilities.

Contrast the needs of game players with those of digital-content developers, such as CAD engineers and special-effects artists. In these cases, computer users, not the artists who developed the game, create the 3-D environment. Average polygon sizes per rendered 3-D object are much smaller than in games and essentially boundless in number; detail and accuracy of presentation are important. And frame rate is a secondary concern; 15 frames/sec is often overkill. Note that in both games and digital-content development, roughly half of the polygons that make up each object are back-facing for each displayed frame. Because the camera can't see them, the graphics subsystem doesn't render them; it simply discards them as soon as it receives them.

For example, if you were looking at the front of a 3-D representation of my house, all polygons that represent the back and, to some extent, the sides of the house would be invisible to you. By analyzing the direction of the back-facing polygons' normals (vectors emanating outward from the polygons' faces), the graphics chip would know to immediately stop processing them. Most 3-D-accelerator geometry and -rendering specifications assume that roughly half of the incoming polygons are back-facing. They also suppose that the polygons are triangular, needing at most three coordinate triplets to represent their spatial location, and that they combine in strip or fan arrangements. Both of these structures require that the graphics subsystem receive only one additional 24-byte vertex-coordinate triplet to represent each additional polygon after the first (Figure 1).
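The two assumptions above lend themselves to a short Python sketch: a dot-product test on a polygon's normal, and a byte count comparing a triangle strip with independent triangles. The function names are hypothetical, the single view direction for all polygons is a simplification, and the 24-byte vertex size is simply the figure quoted above.

```python
BYTES_PER_VERTEX = 24  # per-vertex size quoted in the text


def is_back_facing(normal, view_dir):
    """True when the polygon's normal points away from the viewer.

    view_dir is the direction the camera looks (from the eye into the scene).
    Using one view direction for every polygon is a simplification.
    """
    return sum(n * v for n, v in zip(normal, view_dir)) >= 0.0


def strip_bytes(num_triangles):
    """Bytes for a strip or fan: three vertices, then one per added triangle."""
    vertices = num_triangles + 2 if num_triangles > 0 else 0
    return vertices * BYTES_PER_VERTEX


def independent_bytes(num_triangles):
    """Bytes when every triangle carries its own three vertices."""
    return 3 * num_triangles * BYTES_PER_VERTEX


looking_down_negative_z = (0.0, 0.0, -1.0)
front = is_back_facing((0.0, 0.0, 1.0), looking_down_negative_z)
back = is_back_facing((0.0, 0.0, -1.0), looking_down_negative_z)
print(front, back, strip_bytes(100), independent_bytes(100))
```

For 100 triangles, the strip needs 102 vertices against 300 for independent triangles, which is why specifications that assume strips or fans look so much better than ones that do not.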

The third major 3-D-graphics application, visual simulation for use in flight simulators, military training equipment, and the like, was the first consumer of 3-D-graphics technology, and today it continues to push the state of the art. Like arcade games, simulation systems require high sustained frame rates, as well as (from an overall system perspective) near-immediate real-time response to users' input actions. Like digital-content creation, the 3-D simulation environment incorporates myriad polygons to ensure fine detail and spans a range of polygon sizes. Unlike games and digital-content creation, though, visual simulation does not assume that 50% of polygons are back-facing. Consider, for example, the percentage of total polygons visible in a downward-looking aerial view from an airplane. Visual-simulation applications, especially with dynamically created worlds, also cannot assume that polygons combine in tidy, fast-rendering strip and fan structures.

Although important differences exist in the major 3-D applications, each leverages developments made in the others. Gaming-targeted chips are incorporating quality features that first appeared in content-creation systems, for example. And the simulation world is adopting technology born in arcades. Years ago, the US Army commissioned the development of a custom version of Atari's Battlezone, one of the first popular wire-frame 3-D graphics games. And more recently, the US Marines adapted Id Software's Doom II engine for military use.

Partitioning the pipeline

If Gordon Moore needed a case study to testify on the accuracy and power of his law, 3-D graphics would be a compelling candidate. Technology that appears first in high-end simulation systems costing tens or hundreds of thousands of dollars migrates to workstation graphics chips costing hundreds or thousands of dollars and eventually to less-than-$100 mainstream-PC graphics, all within a few short years. Triangle-setup and -rasterizing functions are making their way onto core logic, such as Intel's i810, and host CPUs, such as Intel's upcoming Timna. Meanwhile, the stand-alone-graphics-chip suppliers are making their next logical integration step: accelerating geometry transform and lighting (T&L), including culling and clipping, in hardware (Figure 2). Although people often use "transform" and "lighting" in the same breath, you need not accelerate both of these functions in hardware or emulate them in software. They use similar matrix-multiplication operations; however, for both quality and performance, many game developers support no light sources but instead employ textures to represent illumination and shadow patterns. Other game engines use proprietary light algorithms that neither DirectX nor OpenGL can accelerate in hardware. DirectX version 8, due this summer, supports pixel shading, which helps to alleviate light-source-quality concerns. However, in the short term, a mainstream graphics chip that does a good job of accelerating polygon T&L but supports only a few lights—or a few kinds of lights—at decent performance levels in hardware doesn't necessarily reflect a bad architectural decision. Such a chip would probably be inappropriate for high-end content creation though.
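As a rough illustration of why transform and lighting sit so comfortably in the same hardware, the following Python sketch applies a 4-by-4 matrix to a homogeneous vertex and computes a simple diffuse term with a dot product. The function names and the single directional-light model are assumptions for illustration; they do not correspond to any DirectX or OpenGL call.

```python
def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a homogeneous (x, y, z, w) vertex."""
    return tuple(
        sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4)
    )


def diffuse(normal, to_light):
    """Diffuse term for unit vectors: the dot product, with negatives set to 0."""
    dot = sum(n * l for n, l in zip(normal, to_light))
    return max(0.0, dot)


# Hypothetical transform: translate by +10 along x.
translate_x = [
    [1, 0, 0, 10],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
moved = transform(translate_x, (1, 2, 3, 1))
lit = diffuse((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
print(moved, lit)
```

Both steps reduce to multiply-accumulate operations on short vectors, which is also why floating-point SIMD instructions on a host CPU can run the same work in software.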

Nevertheless, by accelerating T&L in hardware, the mainstream chip suppliers are threatening the place that high-end vendors hold in the small but lucrative workstation-graphics market. Those vendors have for years been accelerating T&L in hardware. Display quality, robust application-programming-interface (API) support, and the need to pass stringent application certification pace the newcomers' progress.

On the other hand, the newcomers are now in a battle for the much larger volume PC world with the host-CPU providers. One of the most obvious applications of floating-point single-instruction-multiple-data (SIMD) instruction sets, such as AMD's 3DNow!, IBM and Motorola's AltiVec, Intel's Streaming SIMD Extensions (SSE), and Sun's Visual Instruction Set (VIS), is to boost the performance of software-emulated T&L algorithms. Graphics applications have historically done this task in software under the DirectX API, although OpenGL has for some time incorporated optional hardware acceleration. Last summer's DirectX Version 7 for the first time supported optional hardware-accelerated versions of these functions, but lacking other compelling uses for SIMD and high clock rates, the CPU manufacturers won't give up their turf without a fight. Look, for example, at Intel's new Willamette processor architecture. The second-generation, floating-point SIMD engine now supports double-precision, floating-point and 128-bit-integer operations, and the ALU runs at twice the core operating frequency.

Issuing the first challenges to the host-CPU monopoly on T&L, Nvidia and S3 last fall launched their single-chip hardware-T&L-enhanced GeForce 256 and Savage2000 graphics accelerators, respectively (Reference 3). With Quadro, which closely followed GeForce, Nvidia took a page from Intel's marketing book. Although Quadro is significantly more expensive than GeForce, Nvidia manufactures both from the same die, but Quadro runs at a higher clock speed and has antialiased-point and -line support in its drivers. When you look at benchmark results for Savage2000, remember that, as of press time, the device's drivers did not yet support hardware T&L because its polygon-clipping unit is not fully functioning. ArtX, which ATI recently acquired, also accelerates T&L in hardware, and, in partnership with Acer Labs, brought hardware T&L to mainstream core logic with integrated graphics. The partnership's first product targets low-performance Socket 7-based CPUs, an ideal environment for hardware-T&L acceleration, and its 128-bit main-memory interface alleviates unified-memory-architecture (UMA) bandwidth concerns.

From a purely technical perspective, the graphics vendors' claims about the appropriateness of hardware-accelerated T&L have a lot of merit and hark back to the arguments that DVD and high-definition-TV (HDTV) decoding-silicon providers also make (Reference 4). Fortunately for the graphics suppliers, 3-D graphics is a less "upper-bounded" task than DVD or HDTV decoding, meaning that the upper limits of 3-D graphics' requirements are less fixed. Continued user demands for higher resolution and greater image quality delay the inevitability that the host CPU will subsume the 3-D graphics function into software.

Fixed-function, frequently executed tasks are, for cost, speed, and power reasons, well-suited for dedicated hardware; as a result, you don't have to waste the host CPU on these menial tasks. The CPU is then free for other more appropriate duties, such as accurately modeling object physics effects; implementing more-realistic-character artificial intelligence, animation, kinetics, and kinematics; managing object databases; implementing audio synthesis; and the like. For example, when Epic's Unreal Tournament gets a better review than Id Software's Quake III Arena, it's often because the bad guys in Unreal move and respond in a more unpredictable and therefore lifelike and challenging manner.

Where's the beef?

Reality, however, often differs from theory. The fundamental problem facing trendsetters, such as Nvidia, is that most content developers want to sell as many copies of each game as they can. So they write the game to the installed base of hardware. Think back to what was the hottest selling PC configuration approximately 18 months ago. Even if the content developers enable an optional silicon-accelerated T&L pipeline, they aren't yet in many cases taking advantage of the presence of an advanced graphics chip to enhance their titles. Enhancement possibilities include sending more polygons, more light sources, or an otherwise higher quality representation of the 3-D world to the screen or finding other ways of using the extra host CPU processing power that suddenly becomes available (Figure 3).

Hardware T&L integration is a given for most stand-alone, next-generation graphics chips. Incorporating it, for example, is ATI's new architecture, the Charisma Engine, which ATI announced last month at the Game Developer Conference. In non-PC applications in which the host CPU is low-performance for cost reasons, the hardware-T&L-accelerated approach makes sense. Economics is also one of the key reasons that Microsoft's X-Box game console includes a GeForce-derived graphics core with a 600-MHz Intel CPU, which, by the platform's introduction in late 2001, will be relatively inexpensive. But in PCs, unless the content developers radically accelerate their embrace of leading-edge technology, hardware T&L's long-term viability is debatable, except in high-end, polygon- and light-rich configurations or as a life-extending graphics upgrade.

Look closely at some of the benchmarking on GeForce using today's games, and you'll notice that the chip's benefits become apparent only across a narrow sliver of the system-performance spectrum (see sidebar "On your bench(mark), get set..."). Couple the graphics accelerator with a CPU that runs too slowly, and the remainder of the system will starve graphics-subsystem performance by handing it even unprocessed polygons and other data more slowly than its maximum processing rate. Couple the graphics accelerator with a high-performance CPU, particularly one with floating-point SIMD support, such as a K6-2, an Athlon, or a Pentium III versus a Pentium II or a Celeron, and graphics runs no faster in hardware-accelerated versus software-emulated mode. If you throw, for example, complex or too many light sources at the graphics chip, it may render the scene more slowly than the host CPU would otherwise do by itself. Object collisions can also be more difficult to detect if the host CPU doesn't have fast access to the transformed polygon coordinates. Note that GeForce performs much better on synthetic benchmarks, which more accurately reflect next-generation applications' per-frame polygon counts and other features. Reflecting this fact, Nvidia's marketing pitch touts the long-term investment wisdom of purchasing a GeForce-based graphics subsystem.

Unless the application uses hardware-T&L or esoteric graphics features, such as spherical-environment mapping, GeForce performs only about at the level of higher end variants of the TNT2, Nvidia's previous architecture, particularly at low display resolutions. The primary limitation of GeForce is that it offers only one texture pipeline for each corresponding pixel pipeline, whereas the TNT architecture supported a 2-to-1 texture-to-pixel-pipeline ratio. GeForce integrates four parallel pixel pipelines to TNT's two. However, the trilinear-filtering and multitexturing techniques that games use, such as the layering of shadow on light map on water on blood on color pattern on rough surface, result in unused pixel pipelines. Even in bilinear-filtering and single-texture-per-pixel applications, the chip's frame-buffer bandwidth can artificially constrain the maximum pixel-fill rate.

Nvidia is returning to a 2-to-1 texture-to-pixel pipeline ratio and preserving GeForce's four-pixel-pipeline architecture with its new GeForce2, which is now available for sampling. ATI takes the evolution one step further, offering a 3-to-1 texture-to-pixel-pipeline ratio on its newest devices. With GeForce2, Nvidia will also convert to a 0.18-µm manufacturing process with an anticipated 50 to 100% higher clock-rate capability than that achievable on GeForce's 0.22-µm lithography. The combination of dual texture-per-pixel pipelines and higher clock rate will, Nvidia claims, give GeForce2 three to four times better peak performance, depending on clock rate, than the first-generation GeForce in advanced filtering and multitexture applications. Part of this performance comes from pipeline tuning like that in the TNT-to-TNT2 redesign. Nvidia believes that this tuning will give GeForce2 10 to 15% higher performance than GeForce even at the same clock rate.
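A back-of-the-envelope Python model shows how the texture-to-pixel-pipeline ratio and clock rate interact. It assumes, as a deliberate simplification, that a pixel needing more textures than its pipeline has texture units simply takes extra passes; real chips handle this case in architecture-specific ways. The function name and the 100- and 150-MHz clock values are hypothetical, not vendor specifications.

```python
import math


def effective_fill_mpixels(pixel_pipes, texture_units_per_pipe,
                           clock_mhz, textures_per_pixel):
    """Peak textured fill rate in Mpixels/sec under a simple multipass model."""
    passes = math.ceil(textures_per_pixel / texture_units_per_pipe)
    return pixel_pipes * clock_mhz / passes


# Two textures per pixel on four pixel pipelines (hypothetical clocks).
one_unit = effective_fill_mpixels(4, 1, 100, 2)    # 1-to-1 ratio
two_units = effective_fill_mpixels(4, 2, 100, 2)   # 2-to-1 ratio, same clock
two_units_fast = effective_fill_mpixels(4, 2, 150, 2)  # 2-to-1 ratio, +50% clock
print(one_unit, two_units, two_units_fast)
```

Under these assumptions, doubling the texture units doubles dual-texture throughput, and a 50% clock increase on top of that yields a threefold gain, which is at least consistent with the low end of the three-to-four-times claim.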

Intel and its CPU competitors are pushing the content-development community to create scalable game engines that will dynamically adapt themselves to the characteristics of each platform they run on. The companies advocate a multiresolution-mesh approach that, by default, assumes a high-polygon-count model akin to the one that artists create when developing their characters. Multiresolution-mesh techniques automatically decrease the number of polygons for lower end systems in a visually pleasing manner that maintains a minimum required frame rate (Figure 4). As system-scalable techniques become more common, they'll probably help the graphics vendors' cause. Artists will no longer have a reason to create polygon-deficient worlds. In fact, manually reducing polygon count while preserving a reasonable-quality representation of each 3-D object frequently takes up a disproportionate percentage of the time spent in today's game development.
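The scaling idea can be sketched in a few lines of Python. This version picks from a handful of discrete detail levels, whereas true multiresolution meshes vary polygon count much more smoothly; the function name, the polygon counts, and the throughput figures are all hypothetical.

```python
def pick_detail_level(level_polygon_counts, polygons_per_sec, min_fps):
    """Choose the richest model whose polygon count fits the per-frame budget."""
    budget = polygons_per_sec / min_fps  # polygons the system can draw per frame
    fitting = [count for count in level_polygon_counts if count <= budget]
    # Fall back to the coarsest model if nothing fits the budget.
    return max(fitting) if fitting else min(level_polygon_counts)


levels = [500, 2000, 8000, 32000]  # hypothetical versions of one character
low_end = pick_detail_level(levels, 300_000, 30)
high_end = pick_detail_level(levels, 3_000_000, 30)
print(low_end, high_end)
```

The artist builds only the 32,000-polygon model; the engine derives the rest, so a faster graphics subsystem automatically shows more detail at the same frame rate.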

Workstation-graphics supplier 3Dlabs has come up with an interesting approach to resolving the CPU-versus-graphics tug of war. Its PowerThreads drivers use hardware to accelerate or software to emulate each OpenGL API call, depending on the graphics subsystem's capabilities and how much processing power it versus the host CPU has available. Although all hardware-T&L engines essentially perform the same floating-point vector-arithmetic functions, important differences exist between them. One disparity involves the amount of internal precision the calculations employ; high precision is less important for mainstream 3-D applications but more important in high-detail CAD work.

Some T&L engines are hard-wired to a specific API, whereas others are programmable and therefore flexible enough to use on multiple APIs. Although a hard-wired approach may be acceptable for a slow-evolving API, such as OpenGL, new DirectX revisions appear yearly, making an easily evolving alternative more appealing. On the other hand, the more programmable the T&L engine, the more it overlaps with the similar function implemented on the highly flexible host CPU. T&L engines also differentiate themselves by the variety of light-source types that they accelerate in hardware.

Communication breakdown

Regardless of what portions of the 3-D pipeline the graphics subsystem handles, an appropriate interface channel must exist between it and the host CPU and its respective core logic. Early PCs used ISA to accomplish this task; PCI brought higher bandwidth potential and other access enhancements. As network, hard-drive, and other traffic constrained PCI bandwidth, the need for a dedicated CPU-to-graphics bus became more critical (references 5 and 6). Nowadays, as a result, most PCs and even Macs use variants of the Accelerated Graphics Port (AGP), whereas some high-end workstations employ proprietary alternatives, such as Sun's multiple-controller-capable Ultra Port Architecture (UPA). Successive revisions of AGP multiply the 32-bit, 66-MHz data channel's peak data bandwidth, from 266 Mbytes/sec with AGP 1× to 1066 Mbytes/sec with AGP 4×, using single-, double- and quad-data-rate techniques.
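The peak figures follow directly from bus width, clock rate, and transfers per clock, as this short Python calculation shows. It assumes the nominal "66-MHz" clock is 66.67 MHz (200/3), which is what makes the quoted 266 and 1066 Mbytes/sec come out; the function name is hypothetical.

```python
AGP_BUS_WIDTH_BITS = 32
AGP_CLOCK_MHZ = 200 / 3  # assumed exact value of the nominal 66-MHz clock


def agp_peak_mbytes_per_sec(transfers_per_clock):
    """Peak AGP bandwidth: bytes per transfer x clock x transfers per clock."""
    return AGP_BUS_WIDTH_BITS / 8 * AGP_CLOCK_MHZ * transfers_per_clock


for mode in (1, 2, 4):
    print(f"AGP {mode}x: {int(agp_peak_mbytes_per_sec(mode))} Mbytes/sec")
```

These are theoretical ceilings; address and turnaround overhead keeps sustained throughput lower.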

AGP variants also add other important features, such as sideband addressing, which, like the Direct Rambus DRAM (DRDRAM) interface, gives address signals their own dedicated pins and allows the average data-channel bandwidth to more closely approximate its theoretical potential. Pipelining lets the CPU queue multiple requests at a time. Fast Writes mode bypasses main memory, enabling direct transfer of information between the host CPU and the graphics architecture. In doing so, Fast Writes sidesteps one common system bottleneck that advanced main-memory architectures, such as DRDRAM, also attempt to solve (Reference 7). Each data transfer down AGP normally requires two CPU front-side bus transfers and three memory accesses (Figure 5). This spring's Intel Developer Forum marked the unveiling of the "Beyond AGP 4× Initiative" (www.beyondagp4x.org). "Virtual-AGP" connections within core-logic chips that include graphics functions can also run at greater-than-AGP-4× speeds.

Marketing hype aside, is all this bandwidth really necessary? Ask five graphics vendors this question, and you'll get five responses. The answer depends first on how many peak polygons per frame the CPU supplies to the graphics subsystem. Other significant swing factors are textures and how the graphics subsystem manages them. Intel envisioned, when it first came up with AGP, that the graphics chip's local memory, if any, would only find use as the frame buffer. All textures would load in main-system memory, and the graphics accelerator with a small on-chip cache would fetch them as needed over AGP, a technique called AGP Texturing mode. Intel's i740 and i740-derived i810 core-logic chip set work in this way. Low-end i810 configurations even use main-system memory for the frame buffer in a return to the UMA concept of days past (Reference 8). Intel's upcoming i815 (code-named Solano) chip set will also embed a graphics controller, but instead of using extra pins to address an optional frame buffer, the chip set will support an AGP expansion bus should the system manufacturer or end user want to disable the graphics built into the core logic.
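To see how polygon count and texture traffic combine, consider this hedged Python estimate of the bandwidth a scene demands from the CPU-to-graphics link. Every input is a made-up example value (50,000 polygons per frame at 24 bytes each, 2 Mbytes of textures in a typical frame, 8 Mbytes in a worst-case frame, 60 frames/sec), and the function name is hypothetical.

```python
def link_demand_mbytes_per_sec(polygons_per_frame, bytes_per_polygon,
                               texture_bytes_per_frame, frames_per_sec):
    """Bandwidth needed to sustain the frame rate, in Mbytes/sec (10**6 bytes)."""
    bytes_per_frame = (polygons_per_frame * bytes_per_polygon
                       + texture_bytes_per_frame)
    return bytes_per_frame * frames_per_sec / 1e6


typical = link_demand_mbytes_per_sec(50_000, 24, 2_000_000, 60)
worst_case = link_demand_mbytes_per_sec(50_000, 24, 8_000_000, 60)
print(typical, worst_case)
```

With these illustrative numbers, the typical frame fits within even AGP 1× peak bandwidth, but the texture-heavy frame would need more than 533 Mbytes/sec to hold 60 frames/sec, so the frame rate dips unless the link or the local texture cache absorbs the burst.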

In contrast to the AGP Texturing technique, some high-end workstation graphics chips not only have a huge local-memory-derived texture cache but also dedicate pins to its interface—unwilling to share the cache's bandwidth with that of the local frame buffer. A vocal supporter of local texture caching is 3dfx; its chips don't support AGP Texturing mode. Part of this decision derives from 3dfx's significant presence in the retail upgrade market and subsequent support of PCI, for which direct texture fetching is impossible. Part of the reason is 3dfx's restricted, 16-bit-color and low-resolution-texture support—until the introduction of the VSA-100 chip architecture—and the corresponding decreased demand on local-memory density. And DRAM's unusually low prices in recent years support the cause of local texture caching advocates, too. Most mainstream-graphics vendors support an approach between that of Intel and of 3dfx, retaining the ability both to locally cache textures and to fetch them from main memory, depending on what an application requests.

As the amount of texture data, which tracks the number of polygons per scene, increases and as the maximum resolution and, therefore, size of each texture set also grows, AGP-performance headroom diminishes. Perhaps the biggest consumer of AGP bandwidth, should Intel's vision come to pass, is uncompressed HDTV video content sent to the graphics subsystem as a texture. Note too that video textures, by virtue of their constantly changing status, are inappropriate for caching. An interesting approach to using local memory comes from 3Dlabs with its onboard memory-management unit (MMU). The MMU works independently of the API to download over AGP and cache only the level-of-detail multum-in-parvo (MIP) map, or resolution-dependent version of the texture, that the MMU needs at the time.
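The saving from fetching a single MIP level instead of a whole texture is easy to quantify. The Python sketch below assumes square, power-of-two textures and 4 bytes per texel; those assumptions, the function names, and the 256-texel example are for illustration only and do not describe 3Dlabs' actual MMU.

```python
def mip_level_bytes(base_size, level, bytes_per_texel=4):
    """Bytes in one MIP level of a square texture; level 0 is full resolution."""
    side = max(1, base_size >> level)  # each level halves the side length
    return side * side * bytes_per_texel


def mip_chain_bytes(base_size, bytes_per_texel=4):
    """Bytes in the complete MIP chain, from base_size down to 1 x 1."""
    levels = base_size.bit_length()  # 256 -> 9 levels (256, 128, ..., 1)
    return sum(mip_level_bytes(base_size, lv, bytes_per_texel)
               for lv in range(levels))


whole_chain = mip_chain_bytes(256)
level_two_only = mip_level_bytes(256, 2)  # the 64 x 64 version
print(whole_chain, level_two_only)
```

For a distant object that needs only the 64-by-64 version, the download shrinks from roughly 350 kbytes to 16 kbytes.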

How can you reduce the amount of AGP traffic, aside from locally caching textures within the graphics subsystem? One approach compresses the texture data in a lossy manner when creating the content for the graphics subsystem to subsequently decompress and display. A number of vendor-proprietary schemes exist for doing this compression. For example, 3dfx just converted its FXT1 approach along with its Glide API to open-source status. However, two years ago, Microsoft licensed an S3-developed texture-compression algorithm, S3TC, for DirectX 6. Thus, S3's competitors prefer to refer to this algorithm as DXTC. S3TC claims visually lossless 4-to-1 compression in most cases, and the industry is slowly converting to this standard (Figure 6). Bump mapping, another data-reduction technique, creates the illusion of a 3-D surface using a 2-D texture (Figure 7). Numerous bump-mapping approaches, such as embossed, dot-product, and environment-mapped techniques, exist, but, as with texture compression, the industry is slowly but surely converting—at least in the Direct world—to a single Microsoft-blessed standard. When graphics chips began adding hardware support for the triangle-setup function a few years ago, the data the CPU sends down AGP significantly decreased. Conversely, should the graphics chip perform hardware T&L and therefore cull any resulting back-facing polygons, the amount of polygon data sent down AGP will be higher than if the host CPU does the T&L and, therefore, the culling. Local-vertex caching somewhat alleviates this added AGP burden, whereas six-texture cubic environmental mapping has the opposite effect.
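Two of the effects described above reduce to simple arithmetic, sketched here in Python. The functions and the example scene are hypothetical; the 50% back-facing fraction and the 4-to-1 compression ratio are the figures quoted in the text, and real compression ratios vary with content.

```python
def polygons_sent_over_link(total_polygons, back_facing_fraction, cull_on_cpu):
    """Polygons that cross the CPU-to-graphics link for one frame."""
    if cull_on_cpu:
        # Software T&L: the CPU discards back-facing polygons before sending.
        return round(total_polygons * (1 - back_facing_fraction))
    # Hardware T&L: everything crosses the link; the graphics chip culls.
    return total_polygons


def compressed_texture_bytes(uncompressed_bytes, ratio=4):
    """Texture size after fixed-ratio compression (4-to-1 as claimed for S3TC)."""
    return uncompressed_bytes // ratio


software_tl = polygons_sent_over_link(100_000, 0.5, cull_on_cpu=True)
hardware_tl = polygons_sent_over_link(100_000, 0.5, cull_on_cpu=False)
texture = compressed_texture_bytes(4 * 1024 * 1024)
print(software_tl, hardware_tl, texture)
```

The doubling of polygon traffic under hardware T&L is the burden that local-vertex caching tries to offset.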

Graphics techniques that first appear in high-end systems migrate down to the mainstream more quickly than you might expect. Sun Microsystems has taken compressed AGP traffic to the next step, developing a visually lossless geometry-compression technology, which the Java 3-D API supports. Microsoft's Talisman initiative encompassed a number of data-reducing concepts. The graphics subsystem didn't rerender polygons that remained the same from one frame to the next, and it applied affine transformations to those whose orientations had changed only slightly. This concept is one that ATI's newest architecture, which supports hardware keyframing (using software algorithms to describe movement), revisits (Reference 2). Other developments in graphics-research laboratories that promise to transform your future system-design balancing act include Bezier patches and nonuniform rational B-splines (NURBS), which provide higher levels of abstraction than polygons for representing 3-D surfaces. When the use of these types of parametric models becomes mainstream, it will not only speed the creation of detailed 3-D objects and shrink the amount of data needed to describe them but also exacerbate the graphics chip-versus-host CPU tug of war over who gets to process them.

Now that you know about the polygon and texture data that flow into a graphics-subsystem pipeline, look at what comes out the other end of a graphics subsystem. The pixel-fill rate specifies the amount of information transferred to the frame buffer per second. This specification highly depends on the total available controller-to-memory bandwidth. The interface's clock speed, the amount of data that transfers across each data pin during each clock period, and the data-bus width determine the bandwidth. In most systems, the frame buffers (2-D pixel memory plus depth buffer) share total memory bandwidth with the texture cache. Also note that back-buffer-rasterizing tasks must share time with front-buffer RAMDAC-related display traffic.
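The three factors that set controller-to-memory bandwidth multiply together, as this small Python calculation illustrates. The function name and the 150-MHz, 128-bit example memory are hypothetical, chosen only to show the arithmetic.

```python
def memory_bandwidth_mbytes_per_sec(clock_mhz, transfers_per_clock,
                                    bus_width_bits):
    """Peak bandwidth = clock x transfers per pin per clock x bus width in bytes."""
    return clock_mhz * transfers_per_clock * bus_width_bits / 8


single_rate = memory_bandwidth_mbytes_per_sec(150, 1, 128)
double_rate = memory_bandwidth_mbytes_per_sec(150, 2, 128)
print(single_rate, double_rate)
```

Because frame buffer, depth buffer, texture cache, and display refresh all draw on this one figure, the usable fill rate is always some fraction of it.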

What makes the pixel-fill rate so important? Increased frame rate is one factor. Higher resolution also sharply increases the amount of information transferred during each frame; a 1024×768-pixel XGA display has roughly 2.5 times as many pixels as a 640×480-pixel VGA alternative. Make each of those pixels and therefore the textures defining them 32-bit true color, and you further increase the required bandwidth over a 16-bit display. And increase the per-pixel precision of the depth buffer, from 16 to 24 or even 32 bits, and required bandwidth further expands. Fortunately, a floating-point (versus integer) depth buffer or the alternative W-buffer technique, which provides a more linear representation of pixel distance than the nonlinear Z-buffer approach, defers the need to migrate to a higher precision depth buffer.
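A simple Python estimate shows how resolution, color depth, and depth-buffer precision compound. It counts only one color write and one depth write per pixel per frame, ignoring overdraw, read-modify-write cycles, texture fetches, and display refresh, so it is a lower bound rather than a realistic figure. The function name and the 60-frames/sec rate are assumptions.

```python
def frame_buffer_bytes_per_sec(width, height, color_bits, depth_bits,
                               frames_per_sec):
    """Minimum color-plus-depth write traffic, in bytes/sec."""
    bytes_per_frame = width * height * (color_bits + depth_bits) // 8
    return bytes_per_frame * frames_per_sec


pixel_ratio = (1024 * 768) / (640 * 480)
vga_16bit = frame_buffer_bytes_per_sec(640, 480, 16, 16, 60)
xga_32bit = frame_buffer_bytes_per_sec(1024, 768, 32, 24, 60)
print(pixel_ratio, vga_16bit, xga_32bit)
```

Moving from 16-bit VGA to 32-bit XGA with a 24-bit depth buffer multiplies this baseline traffic by about 4.5, before any scene complexity enters the picture.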

Finer depth precision implies a more complex scene with numerous objects overlapping each other. Thus, for each object, you read back the appropriate current pixels' values, modify them with incoming pixel details (based on relative Z-values and alpha, or translucency, information), and write them back, all before you can display the scene. In the absence of multiple parallel-texture pipelines within the graphics chip, multitexturing consumes memory bandwidth both to fetch the MIP map information and to read back, modify, and write the appropriate pixel contents. Such behavior reveals one advantage of 32-bit color, which experiences less accuracy degradation through multiple pixel-modification passes than the lower precision, 16-bit alternative. Advanced texture-blending techniques, such as trilinear and anisotropic filtering, also can degrade performance. And LCD shutter glasses double the required frame-update rate of a monitor.
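
The precision argument can be illustrated with a small Python experiment. It is only a sketch under stated assumptions: one color channel, 5 bits per channel standing in for 16-bit color and 8 bits per channel for 32-bit color, and each rendering pass modeled as a multiplication by a hypothetical modulation factor, with the result rounded back to frame-buffer precision after every pass.

```python
def quantize(value, bits):
    """Round a 0-to-1 channel value to the nearest level the frame buffer can store."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels

def multipass(value, factors, bits):
    """Apply each modulation pass, storing the result at limited precision each time."""
    stored = quantize(value, bits)
    for factor in factors:
        stored = quantize(stored * factor, bits)
    return stored

def mean_error(bits, factors, samples=101):
    """Average absolute error versus full-precision math over evenly spaced inputs."""
    total = 0.0
    for i in range(samples):
        start = i / (samples - 1)
        exact = start
        for factor in factors:
            exact *= factor
        total += abs(multipass(start, factors, bits) - exact)
    return total / samples

factors = [0.75, 0.9, 0.6]           # hypothetical per-pass modulation values
err_16bit = mean_error(5, factors)   # 5 bits per channel, as in 16-bit color
err_32bit = mean_error(8, factors)   # 8 bits per channel, as in 32-bit color

print(f"Mean error at 5 bits/channel: {err_16bit:.5f}")
print(f"Mean error at 8 bits/channel: {err_32bit:.5f}")
```

Each write-back rounds the channel to the nearest storable level, and the coarser the levels, the more error each pass contributes to the final pixel.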

Is it any wonder, then, that graphics, more than any other application, pushes commodity-DRAM speed? Specialized graphics-memory variants, which both offload some processing from the graphics logic and lower the required logic-to-memory bandwidth, have in some cases achieved reasonable success. Synchronous-graphics RAM, for example, supplements SDRAM's features with write-per-bit and block-write capabilities. And even more esoteric architectures, such as Mitsubishi's 3D-RAM, embed arithmetic-logic units and other processing elements to reduce or eliminate the common read-modify-write and other functions (Reference 9). Just as vendors assume best-case conditions when specifying polygon geometry and rendering speeds, they play similar specmanship games with fill rate. One or a few large polygons per frame are the norm, with little to no depth complexity or shading and no advanced texture-manipulation, true-color, or other quality features turned on.

In another example of workstation graphics technology migrating down to mainstream computing, 3dfx has recently been touting the T-Buffer capability of its VSA-100 scalable graphics architecture, an approach analogous to the accumulation buffers common today in high-end graphics. T-Buffer enables one or multiple parallel-operating VSA-100 chips to render multiple versions of a frame, a capability you can use with several quality-improvement techniques. Perhaps the most compelling—and the one that requires no explicit application or API support—is full-scene antialiasing (FSAA). A single VSA-100 will support two-sample FSAA, and multiple-chip versions can implement the even higher-quality four-sample FSAA variant. Antialiasing—smoothing jagged edges at color transitions—becomes more important as the number of polygons increases.

The FSAA approach that 3dfx employs involves rendering each incoming polygon to slightly different pixel locations in each of the multiple frame buffers, then combining them before displaying them on the screen. Some vendors use supersampling, an alternative FSAA approach that renders a higher resolution version of the frame, multiplied in x, y, or both dimensions, and then combines the pixels before displaying them. Though a more hardware-generic approach, supersampling produces inferior-quality results and more significantly impacts fill rate, according to 3dfx. Supersampling also doesn't support some of the other T-Buffer capabilities that 3dfx touts. A four-VSA-100 configuration can simultaneously support FSAA and other T-Buffer effects, including motion blurring, selective depth of field, and soft-shadow and reflection edges, on each frame (Figure 8). However, DirectX currently does not support these other capabilities. OpenGL's extensions mechanism makes it easier to incorporate support for proprietary features, though 3dfx must still convince the application developers to use the features. Unlike FSAA, these other features also require multiple front-end rendering passes for each affected polygon, making it less clear whether developers will support T-Buffer features other than FSAA until 3dfx also brings hardware T&L capability to the table.
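
The two FSAA approaches can be compared in a toy Python model. Everything here is an assumption made for illustration: the scene is a single hard diagonal edge, the sub-pixel offsets are hypothetical rather than the ones any real chip uses, and the model measures only image output, not fill-rate cost.

```python
def scene(x, y):
    """Hypothetical scene: points on or to the right of the edge x = y are white."""
    return 1.0 if x >= y else 0.0

def render(width, height, scale=1, offset=(0.0, 0.0)):
    """Point-sample the scene once per pixel; coordinates are in final-resolution pixels."""
    ox, oy = offset
    return [[scene((x + 0.5) / scale + ox, (y + 0.5) / scale + oy)
             for x in range(width * scale)]
            for y in range(height * scale)]

def multi_buffer_fsaa(width, height, offsets):
    """Render the frame once per sub-pixel offset, then average the buffers."""
    buffers = [render(width, height, offset=o) for o in offsets]
    return [[sum(b[y][x] for b in buffers) / len(buffers) for x in range(width)]
            for y in range(height)]

def supersample_fsaa(width, height, scale):
    """Render at scale x scale the resolution, then box-filter down."""
    hi = render(width, height, scale=scale)
    return [[sum(hi[y * scale + dy][x * scale + dx]
                 for dy in range(scale) for dx in range(scale)) / scale ** 2
             for x in range(width)]
            for y in range(height)]

# Hypothetical ordered-grid offsets, in fractions of a pixel.
offsets = [(-0.25, -0.25), (0.25, -0.25), (-0.25, 0.25), (0.25, 0.25)]

aliased = render(2, 2)
multi = multi_buffer_fsaa(2, 2, offsets)
supersampled = supersample_fsaa(2, 2, 2)

print("No FSAA:      ", aliased)
print("Multibuffer:  ", multi)
print("Supersampled: ", supersampled)
```

With these particular offsets, the two methods sample the same positions and produce identical pixels, so the toy cannot confirm or refute 3dfx's quality claim. Any real difference would have to come from where the samples land and from how each method loads the memory subsystem, neither of which this sketch models.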

Calculated cloning

Subdividing the rendering and rasterizing functions across multiple parallel-operating graphics chips is a viable short-term strategy for increasing a vendor's product-line flexibility, although this subdivision might contradict the long-term trend toward single-chip integration. Workstation-graphics vendors, such as 3Dlabs and Intense3D, take the next step: offering separate geometry chips. The multichip balancing act is trickier than it might seem at first glance, however.

One technique, which ATI Technologies' alternate-frame rendering (AFR) exemplifies, is to subdivide the graphics task on a frame-by-frame basis. While one chip is sending its front-buffer information to the screen (and perhaps rendering the next frame in its back buffers), the other chip is independently processing the subsequent frame or series of frames, and so on. This approach reduces the amount of per-pin local-memory bandwidth necessary to achieve high frame rates. However, it doesn't resolve any frame-rate bottlenecks that front-end polygon transfer or processing limitations cause. The technique is also somewhat inefficient in its use of local memory, because each chip ends up with its own texture cache, probably sharing a great deal of redundant information with its neighbor.

The opposite extreme, which 3dfx's Voodoo2 scan-line-interleaving (SLI) architecture exemplifies, subdivides each frame on a line-by-line basis and partitions the per-frame processing among the multiple graphics chips on the board or in the system. SLI gives similar fill-rate relief to each graphics chip, and, theoretically, it also cleanly allocates the per-frame rendering task. However, practical limitations constrain SLI's effectiveness. Because most polygons, especially in low-triangle-count gaming environments, span multiple pixels and therefore multiple scan lines, a great deal of redundant processing occurs. Other graphics architectures, therefore, such as 3dfx's VSA-100-based Voodoo5 boards as well as many workstation-class products, are choosing an interim middle-ground approach. Each graphics accelerator handles a group of contiguous scan lines, which the vendor selects based on an average polygon-size prediction.
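
The frame-by-frame, line-by-line, and banded partitioning schemes reduce to three small assignment functions, sketched here in Python. The function names, the four-chip configuration, the 32-line band height, and the sample polygon are all hypothetical; the point is only to count how many chips must process the same polygon under each scheme.

```python
def afr_chip(frame_index, num_chips):
    """Alternate-frame rendering: whole frames rotate among the chips."""
    return frame_index % num_chips

def sli_chip(scan_line, num_chips):
    """Scan-line interleaving: consecutive lines rotate among the chips."""
    return scan_line % num_chips

def band_chip(scan_line, num_chips, band_height):
    """Banded approach: each chip owns groups of contiguous scan lines."""
    return (scan_line // band_height) % num_chips

def chips_touched(first_line, last_line, assign):
    """Count the chips that must set up a polygon spanning the given scan lines."""
    return len({assign(y) for y in range(first_line, last_line + 1)})

num_chips = 4       # hypothetical four-chip board
band_height = 32    # hypothetical band size, in scan lines

# Hypothetical polygon covering scan lines 100 through 107.
sli_count = chips_touched(100, 107, lambda y: sli_chip(y, num_chips))
band_count = chips_touched(100, 107, lambda y: band_chip(y, num_chips, band_height))

print("AFR frame owners:", [afr_chip(f, 2) for f in range(4)])
print(f"Interleaved: {sli_count} chips process the polygon")
print(f"Banded:      {band_count} chip processes the polygon")
```

A polygon that straddles a band boundary still lands on two chips, which is why the band height has to track the expected polygon size.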

Region-based, or deferred-rendering, accelerators take a different approach to solving the fill-rate problem (Reference 10). These architectures employ extensive internal caching to process all the polygons that define each pixel or small cluster of pixels before writing them to the frame buffer for display. Traditionally, API incompatibilities (specifically, the inability to read back depth-buffer data after writing it) have limited the use of the region-based technique in PCs, although such limitations are less of a problem in other platforms. However, region-based-accelerator advocate GigaPixel claims to have licked the API problem, and Microsoft seems to agree: GigaPixel's technology nearly became the graphics foundation of Microsoft's X-Box. Other region-based-graphics advantages include automatic FSAA and the need for each pixel to make only one pass through the rendering pipeline.
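
A simplified Python model shows where the fill-rate savings come from. It is a sketch, not a description of any vendor's hardware: polygons are reduced to hypothetical axis-aligned rectangles of constant depth, only color writes to external memory are counted, and a dictionary stands in for the on-chip tile buffer.

```python
def immediate_mode_writes(rects, width, height):
    """Conventional Z-buffering: every fragment that passes the depth test is written out."""
    depth = [[float("inf")] * width for _ in range(height)]
    writes = 0
    for x0, y0, x1, y1, z in rects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                if z < depth[y][x]:
                    depth[y][x] = z
                    writes += 1
    return writes

def region_based_writes(rects, width, height, tile):
    """Resolve visibility per tile on chip, then write each covered pixel once."""
    writes = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            nearest = {}  # stands in for the on-chip depth storage of one tile
            for x0, y0, x1, y1, z in rects:
                for y in range(max(y0, ty), min(y1, ty + tile)):
                    for x in range(max(x0, tx), min(x1, tx + tile)):
                        if z < nearest.get((x, y), float("inf")):
                            nearest[(x, y)] = z
            writes += len(nearest)  # one external write per covered pixel
    return writes

# Hypothetical scene, submitted back to front: (x0, y0, x1, y1, depth).
rects = [(0, 0, 8, 8, 0.9), (2, 2, 6, 6, 0.5), (3, 3, 5, 5, 0.1)]

immediate = immediate_mode_writes(rects, 8, 8)
region = region_based_writes(rects, 8, 8, tile=4)

print(f"Immediate mode: {immediate} color writes")
print(f"Region based:   {region} color writes")
```

Back-to-front submission is the worst case for the conventional approach, because every overlapping fragment passes the depth test. The region-based count stays at one write per covered pixel regardless of submission order.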

Yet another approach to solving the frame-buffer-bandwidth issue is to embed the memory array on the same die as the graphics logic with a wide, fast bus interconnecting them. To date, this approach has seen limited success; MediaQ, NeoMagic, and Trident, for example, have implemented it for mobile graphics accelerators, in which 3-D performance is typically less critical than for a desktop PC and embedded memory's low power consumption is equally valuable. Historically, embedded DRAM has been more expensive than the multiple-chip alternative, and the process modifications necessary to incorporate it have encumbered logic performance. But companies such as Bitboys Oy and PixelFusion continue to extol the benefits of the approach, along with AGP Texturing, to limit local-memory-density requirements. Others will most likely join these early adopters as more foundries ramp up embedded-DRAM capability. PixelFusion's architecture is intriguing for another reason as well: Its generic media-processor array supports functions other than graphics, and the driver can dynamically tune the percentage of available processing power devoted to front- versus back-end graphics tasks.

On your (bench)mark, get set....

For better or worse, graphics companies live and die on benchmark scores. A few extra points on Viewperf, 3D WinBench, or 3DMark, or a fraction of a frame per second more on 3D GameGauge, is enough justification for a computer user to prefer a certain vendor's technology, and therefore for a system designer to select that vendor's chips or boards. Is this situation just begging for a little cheating? You bet.

Sordid stories abound of graphics companies that have made questionable driver "hacks" after detecting that a benchmarking run is in progress to artificially boost their scores. Some drivers blatantly drop an occasional frame, a deception that a benchmarking test might not catch unless the tester videotapes the display contents and then replays them in slow motion. Others ignore the display preferences the tester specifies, instead downgrading to 16-bit color and lower quality but faster texture-filtering schemes. Still others return a "done" indication before the command queue actually empties.

Some drivers discard small incoming polygons to maximize fill rate. And some override the enabled vertical-sync-controlled, secondary-to-primary buffer-flip setting, a trick that arguably more accurately reflects the chip's true performance but causes disagreeable image "tearing" when the flip occurs in the middle of a frame redraw. Triple-buffering (using dual secondary buffers) gobbles up more local memory but fairly represents graphics-accelerator speed while preserving image quality. Finally, a vendor might supply for analysis an internally overclocked chip that runs beyond the speeds achievable with high-volume production parts.

Several types of benchmarks exist, all providing useful information but none giving you a definitive comparison of graphics architectures. Application benchmarks include popular games, such as Quake and Unreal, or content-creation packages, such as 3D Studio Max or AutoCAD. The 3D GameGauge benchmark combines the results of tests using several games, spanning multiple application-programming interfaces and genres and incorporating a variety of graphics techniques. Indy3D and SPECviewperf are application-targeted benchmarks for content-creation users.

Although an application benchmark may tell you how your graphics card is performing now, it doesn't necessarily show you how much feature and performance headroom you'll have in the future. Filling that need is one benefit of also using a synthetic benchmark, such as 3D WinBench or SPECglperf. Synthetic benchmarks, by testing not only for performance but also for the presence of quality features, more easily reveal both architecture limitations and driver cheating. Straddling the application- and synthetic-benchmark line, 3DMark from MadOnion is based on a game engine, but, because MadOnion created the engine, the company has more control over what and how much to test.

I'll be analyzing many graphics cards over the next few months. Instead of repeating someone else's work on one of the above benchmarking programs, though, I'll be doing my analysis using Intel's iPeak Graphics Toolkit. This program, which Intel points out is not a benchmarking suite, nonetheless has several appealing capabilities for graphics-technology providers, analysts, and content developers. It gives the user tremendous control to alter any or all of the variables that define graphics quality and speed. It also bypasses any application bottlenecks that might artificially limit the perceived graphics subsystem capabilities.

Periodically check out the addendum to this article on the EDN Web site to see the results I got using iPeak. The addendum also provides links to other good sources of graphics information.

Boards I'll be testing include:

  • 3Dlabs Permedia3 Create!,
  • ATI All-in-Wonder 128 Pro,
  • ATI Rage Fury MAXX,
  • Creative Labs 3D Blaster TNT2 Ultra,
  • Diamond Multimedia Fire GL1,
  • Diamond Multimedia Viper II,
  • Elsa Erazor X2,
  • Elsa Gloria II,
  • Nvidia GeForce 256 DDR reference board,
  • S3 Savage4 reference board,
  • 3dfx Voodoo3 3500,
  • 3Dlabs Oxygen VX1 (AGP),
  • 3Dlabs Oxygen GVX1 (AGP),
  • 3Dlabs Oxygen GVX1 (PCI),
  • 3Dlabs Oxygen GVX210 (AGP).

For more information...
For information on subjects discussed in this article, use EDN's InfoAccess service. When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.
Acer Laboratories
1-408-544-3100
www.acerlabs.com
Circle No. 333
Ark Logic
1-408-988-8900
www.arklogic.com
Circle No. 334
ATI Technologies (including ArtX subsidiary)
1-905-882-2600
www.ati.com and www.artxinc.com
Circle No. 335
Bitboys Oy
1-972-744-0222
www.bitboys.fi
Circle No. 336
Broadcom (Stellar Semiconductor subsidiary)
1-949-450-8700
www.stellarsemi.com
Circle No. 337
Evans & Sutherland
1-801-588-1000
www.es.com
Circle No. 338
GigaPixel
1-408-654-8005
www.gigapixel.com
Circle No. 339
Imagination Technologies Group
+44 (0)1923 260511
www.imgtec.com
Circle No. 340
Intel
1-916-356-8080
www.intel.com
Circle No. 341
Intense3D
1-877-286-1185
www.intense3d.com
Circle No. 342
MadOnion
1-416-972-6275
www.madonion.com
Circle No. 343
Matrox Graphics
1-514-685-7230
www.matrox.com
Circle No. 344
Micron Technology (Rendition Graphics subsidiary)
1-408-855-4000
www.rendition.com
Circle No. 345
Microsoft
1-425-882-8080
www.microsoft.com/directx and www.xbox.com
Circle No. 346
NeoMagic
1-408-988-7020
www.neomagic.com
Circle No. 347
NEC Electronics
1-408-588-6000
www.nec.com
Circle No. 348
Nvidia
1-408-615-2500
www.nvidia.com
Circle No. 349
PixelFusion
+44 1454 878740
www.pixelfusion.com
Circle No. 350
Primary Image
+44 (0) 181 339 9669
www.primary-image.com
Circle No. 351
RealVision
+81 (45)473-7331
www.realvision.co.jp
Circle No. 352
S3 (including Diamond Multimedia subsidiary)
1-408-588-8000
www.s3.com, www.diamondmm.com, and www.firegl.com
Circle No. 353
Silicon Motion
1-408-467-9388
www.siliconmotion.com
Circle No. 354
Silicon Integrated Systems (SiS)
+886-2-29161619
www.sis.com.tw
Circle No. 355
SP3D
+49 8151 270 200
www.sp3dtech.com
Circle No. 356
STMicroelectronics
1-781-861-2650
www.st.com
Circle No. 357
Sun Microsystems
1-650-960-1300
www.sun.com
Circle No. 358
3dfx Interactive
1-408-935-4400
www.3dfx.com
Circle No. 359
3Dlabs
1-408-530-4700
www.3dlabs.com
Circle No. 360
Trident Microsystems
1-408-496-1085
www.tridentmicro.com
Circle No. 361

Author Information

 Contact Technical Editor Brian Dipert at 1-916-454-5242, fax 1-530-937-8147, bdipert@pacbell.net.

REFERENCES

  1. Dipert, Brian, "The high-end PC looks for a home," EDN, Nov 24, 1999, pg 145.
  2. Dipert, Brian, "C'mon, baby, do the animotion," EDN, Dec 23, 1999, pg 58.
  3. Dipert, Brian, "Hardware help, happily beheld," EDN, Sept 16, 1999, pg 22.
  4. Dipert, Brian, "Not fade away," EDN, Oct 14, 1999, pg 44.
  5. Levy, Markus, "Unveiling the hidden secrets of PC-bus architectures," EDN, Dec 4, 1997, pg 112.
  6. Wright, Maury, "Technology initiatives stimulate 3-D graphics," EDN, March 27, 1997, pg 47.
  7. Dipert, Brian, "The slammin', jammin' DRAM scramble," EDN, Jan 20, 2000, pg 68.
  8. Dipert, Brian, "Integration benefits counteract graphics shortcomings," EDN, May 27, 1999, pg 20.
  9. Dipert, Brian, "Graphics DRAMs address density, performance trends," EDN, Aug 19, 1999, pg 18.
  10. Dipert, Brian, "The best (or worst?) of both worlds," EDN, Nov 11, 1999, pg 139.

ACKNOWLEDGMENT

Simulation-system vendor Quantum3D gave me an interesting end-user perspective on the importance of various graphics-technology features in meeting the company's application needs. At the Platform '99 and Platform 2000 conferences, presentations by Neil Trevett, vice president of marketing at 3Dlabs, were the inspiration for this article. And both Bert McComas from Inquest and Peter Glaskowsky from MicroDesign Resources were invaluable sources of information and perspective. Thanks also to the vendors that supplied hardware and software for the benchmarking project, particularly Kingston Technology and NEC for coming up with now-rare PC800 Rambus in-line memory modules.



©1997-2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.