
Take the marketing hype away from the trendy name "multimedia instructions," and you end up with a class of CPU operations that use a single-instruction-multiple-data (SIMD) architecture. The SIMD instructions benefit multimedia applications, such as MPEG, image processing, audio synthesis, and 3-D graphics. But, more fundamentally, these so-called multimedia instructions benefit any application with vectorizable code: code that contains algorithms that perform localized, recurring operations on small data elements. These algorithms include FIR filters and FFTs, which DSP-based applications commonly use.
The concept of including multimedia instructions on a CPU is not new. For example, multimedia instructions were an integral part of Intel's 860 and the 750 video processor. The exploding multimedia market has driven several µP vendors to integrate a variety of multimedia instructions into their processors. Most popular among these processors is Intel's Pentium-MMX, code-named the "P55C." Although Intel has not yet begun to ship the P55C, hardware and software designers for PC-based applications are wondering how to take advantage of the architecture's 57 new instructions. (Intel also plans next year to deliver MMX support on the Pentium Pro.) Other x86 µP vendors, including Advanced Micro Devices and Cyrix, plan to develop processors with MMX-compatible instructions.
Other general-purpose processors that boast multimedia instructions are Hewlett-Packard's (Colorado Springs, CO) PA-7100LC and Sun's UltraSPARC. Sun refers to UltraSPARC's special instructions, which are not part of the SPARC V9 architecture, as the "visual instruction set" (VIS). Chromatic's MPACT and Philips' TriMedia processors, dedicated multimedia processors, also use an SIMD architecture.
PCs aren't the only beneficiaries of these multimedia instructions. For example, one company is using UltraSPARC and its multimedia instructions for real-time, 3-D medical imaging. The TriMedia processor, another example, can serve as a coprocessor in a PC, although Philips designed the device to operate in a stand-alone mode, enabling you to use it in applications such as set-top boxes.
How'd they do it?
To use the SIMD architecture, the processor vendors defined four data types: packed byte, packed word, packed double word, and quad word. As an example, a packed byte is 8 bytes packed into one 64-bit quantity (Figure 1). This data type differs from a standard 64-bit word in that the bytes are distinct elements. For example, you can use a packed byte for storing the red, green, blue, and alpha components of 2 pixels.
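
As a rough illustration of the packed-byte layout, the following Python sketch packs the red, green, blue, and alpha components of two pixels into one 64-bit quantity and unpacks them again. The function names are hypothetical, and placing element 0 in the least significant byte is an assumption made for the example rather than something the article specifies.

```python
def pack_bytes(elements):
    """Pack eight 8-bit values into one 64-bit integer.

    Assumption for this sketch: element 0 lands in the least significant byte.
    """
    if len(elements) != 8:
        raise ValueError("a packed byte holds exactly 8 elements")
    word = 0
    for index, value in enumerate(elements):
        if not 0 <= value <= 0xFF:
            raise ValueError("each element must fit in 8 bits")
        word |= value << (8 * index)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its eight distinct byte elements."""
    return [(word >> (8 * index)) & 0xFF for index in range(8)]


# Two RGBA pixels share one 64-bit quantity.
pixel_a = (255, 128, 0, 255)
pixel_b = (0, 64, 192, 128)
packed = pack_bytes(pixel_a + pixel_b)
```

Calling `unpack_bytes(packed)` returns the eight components in their original order, which is the sense in which the bytes remain distinct elements rather than digits of one large number.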
Multimedia instruction types include basic arithmetic operations (add, subtract, multiply, divide), logical operations (AND, OR, AND NOT, etc), compare operations, conversion instructions to pack and unpack data elements, shift operations, and data-movement instructions. To accommodate the new instructions and data types in the x86 architecture, Intel defined eight 64-bit MMX registers, MM0 to MM7. Intel designers obtained these registers for free by aliasing them with the floating-point registers.
Register aliasing has its pros and cons. From the pro perspective, aliasing eliminates the need to add silicon for new registers. Aliasing also eliminates the need to modify the operating system or system BIOS, which must keep track of these registers. On the con side, aliasing prevents you from performing routines that combine floating-point and MMX instructions; switching from MMX instructions to floating-point instructions can take as many as 50 clock cycles. After each MMX instruction, the CPU sets the entire floating-point tag word to valid. Before the CPU can execute a floating-point instruction, you must use the empty-MMX-state instruction to set the entire floating-point tag word to the empty condition.
| Using MMX to Build a Lowpass Filter |
|---|
| The following example uses MMX technology to change the sampling rate of an audio signal before mixing two signals sampled at different rates. Multirate systems are common in a variety of signal-processing applications, so you can generalize this example for applications beyond audio. The process of increasing sampling rate is called interpolation; the process of decreasing sampling rate is called decimation. The example mixes two 16-bit audio signals, one sampled at 22 kHz, as is common in .wav files, and the other, at 44 kHz, as is common in CD-quality audio. To increase the sampling rate of the 22-kHz .wav file, the interpolation routine first inserts zero samples between each pair of existing samples and then applies a FIR lowpass filter to smooth the function. The application of the FIR lowpass filter in the time domain yields the appropriate interpolated amplitude for the zero samples inserted in the time domain.
You can use the Unpack instruction to unpack the 22-kHz data and insert a zero sample between each value.
With the zero samples inserted, the sampling rate is effectively 44 kHz. The insertion of zero samples, though, introduces high-frequency artifacts that you must filter out. The FIR lowpass filter removes the high-frequency components, smoothing the audio data stream. You calculate the filter output as $y(n)=\sum_{k=0}^{N-1} c(k)\,x(n-k)$, where N is the filter length, c(k) is the kth filter coefficient, and x(n) is the zero-stuffed input stream. The output-data stream comprising y(n) samples is the interpolated 44-kHz signal. Choose the filter length and coefficients, N and c(k), to implement the desired lowpass function. For this example, the filter length is 65, because it gives the desired response with suitable cutoff and ringing.
Using the MMX pmadd instruction, the inner loop performs four 16-bit multiply-accumulate operations per iteration. The upper and lower 32-bit halves of the accumulator register, in this case, mm7, store the results. The program computes the final accumulated result, y(n), by adding the upper and lower halves of mm7. You repeat this process for all n elements. The result is a 44-kHz interpolated version of the original 22-kHz .wav file. MMX instructions access data 64 bits at a time in the FIR example. To avoid misaligned memory references and their associated performance penalties, align the data and coefficients on 8-byte boundaries. You should also follow general optimization techniques, such as loop unrolling and instruction scheduling for paired execution, to get the most performance out of the algorithm. The clock counts for this interpolation routine using standard integer, floating-point, and MMX instructions are 5503, 2518, and 933, respectively. In each case, Intel hand-tuned the algorithms in assembly language for optimal performance on a Pentium. The comparison with floating point is relevant, because you sometimes implement audio-processing functions in floating-point rather than in standard-integer format to take advantage of the fast floating-point multiply in the Pentium (approximately three times faster than standard integer multiply). MMX technology improves performance more than five times over standard integer code and more than 2.5 times over tuned floating-point code. Though somewhat simplified, this example illustrates the potential of MMX technology to speed inner loops by processing data in parallel. |
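
The interpolation steps that the sidebar describes (zero insertion followed by a FIR lowpass filter) can be modeled in a few lines of Python. This is a scalar reference sketch, not Intel's hand-tuned MMX routine. The function names are hypothetical, and the 3-tap kernel in the usage note is a toy stand-in for the 65-tap filter, chosen only because its output is easy to check by hand.

```python
def zero_stuff(samples):
    """Insert one zero sample after each input sample, doubling the rate."""
    stuffed = []
    for sample in samples:
        stuffed.extend((sample, 0))
    return stuffed


def fir_filter(x, c):
    """Direct-form FIR: y(n) = sum over k of c(k) * x(n - k).

    Samples before the start of x are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(c)):
            if n - k >= 0:
                acc += c[k] * x[n - k]
        y.append(acc)
    return y


def interpolate_2x(samples, c):
    """Raise the sampling rate by two: zero-stuff, then lowpass filter."""
    return fir_filter(zero_stuff(samples), c)
```

With the toy kernel `[0.5, 1.0, 0.5]`, `interpolate_2x([2, 4], [0.5, 1.0, 0.5])` yields `[1.0, 2.0, 3.0, 4.0]`: the filter replaces each inserted zero with the average of its neighbors, delayed by one sample.
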
Some of the MMX instructions perform saturation arithmetic. For example, a packed addition adds 2 packed bytes and clips, or "saturates," the result to the maximum value if there is an overflow (Figure 2). Likewise, a packed subtraction that causes an underflow clips the result to the minimum value. Saturation arithmetic is useful in audio and graphics applications. For audio, saturation helps reduce noise from the system. Suppose you're mixing 8-bit sound signals (with values of +127 to -128) by adding the signals. Without saturation, an addition may cause the result to wrap around and turn a positive value into a negative value, thus producing unexpected sounds. Although saturation mildly distorts the sound, you hear only a soft distortion. Another benefit of saturation is in pixel calculations, in which, for example, a wraparound addition would cause a black pixel to turn white during a Gouraud-shading loop.
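
The difference between wraparound and saturating addition is easy to see in a small model. The Python sketch below mixes signed 8-bit samples both ways; the function names are hypothetical, and the code models one element at a time rather than a whole packed register.

```python
def add_wrap_i8(a, b):
    """Signed 8-bit addition that wraps around on overflow."""
    total = (a + b) & 0xFF
    return total - 256 if total >= 128 else total


def add_sat_i8(a, b):
    """Signed 8-bit addition that clips to the range -128..+127."""
    return max(-128, min(127, a + b))


def mix(signal_a, signal_b, add):
    """Mix two equal-length sample streams with the given add function."""
    return [add(a, b) for a, b in zip(signal_a, signal_b)]
```

Mixing two loud samples of 100 each gives -56 with `add_wrap_i8`, a sign flip that you hear as a click, but 127 with `add_sat_i8`, which only flattens the peak.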
The multiply-accumulate (MAC) operation is the heart of most digital filters, critical in practically every multimedia application. Assuming that data is in the on-chip cache, the non-MMX Pentium performs a MAC in three instruction cycles. At 166 MHz, this equates to 55.3 million MACs per sec. With MMX, you can use the packed-multiply-and-add (PMADD) and packed-add (PADDD) instructions to perform four MACs in parallel (see box, "Using MMX to build a lowpass filter"). MMX compare instructions allow you to simultaneously compare as many as eight data elements, and MMX shift instructions operate on as many as four data elements at once.
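
To make the four-MACs-in-parallel idea concrete, here is a small Python model of the instruction the article calls PMADD (PMADDWD in Intel's mnemonics) followed by PADDD. It is a behavioral sketch with hypothetical helper names, and it assumes the input length is a multiple of four; it says nothing about cycle counts on real hardware.

```python
def wrap32(value):
    """Reduce an integer to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def pmaddwd(a, b):
    """Multiply four 16-bit pairs and sum adjacent products into two 32-bit lanes."""
    assert len(a) == len(b) == 4
    return [wrap32(a[0] * b[0] + a[1] * b[1]),
            wrap32(a[2] * b[2] + a[3] * b[3])]


def paddd(x, y):
    """Add two pairs of 32-bit lanes with wraparound."""
    return [wrap32(x[0] + y[0]), wrap32(x[1] + y[1])]


def dot_product(samples, coeffs):
    """Accumulate four MACs per loop iteration, then fold the two lanes."""
    acc = [0, 0]
    for i in range(0, len(samples), 4):
        acc = paddd(acc, pmaddwd(samples[i:i + 4], coeffs[i:i + 4]))
    return wrap32(acc[0] + acc[1])


# Scalar baseline from the text: one MAC every three cycles at 166 MHz.
scalar_macs_per_sec = 166e6 / 3
```

The final fold of the two lanes corresponds to the step the sidebar describes as adding the upper and lower halves of the accumulator register.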
Intel designers kept the MMX instructions as generic as possible. Although UltraSPARC's VIS and Intel's MMX instructions are similar, VIS is more robust. Using a constant multiplier during a multiplication loop provides one example of VIS robustness. This capability is useful for functions such as scaling pixel data with a constant alpha value and performing filter functions with fixed multipliers. With MMX, if you want to multiply four data values in an MMX register by the same coefficient, you must duplicate the coefficient four times in memory (Figure 3). Although VIS lets you perform the same task, VIS also offers a multiply instruction (vis_fmul8x16ax) that allows you to multiply each value by a single 16-bit component.
Sun's designers also included several application-specific multimedia instructions. An align operation allows the processor to access data types in the middle of a 64-bit word. When an image starts or ends on pixels that are not on 64-bit boundaries, the edge instruction compares the address of the edge with that of the current pixel block and calculates the appropriate byte mask. VIS has an array operation for traversing volumetric data sets used in 3-D visualization. For example, a magnetic-resonance-imaging scan creates a 3-D array of 2-D images. Each pixel in a 3-D image has x, y, and z components. The array instruction converts the 3-D, fixed-point addresses into a blocked-byte address, enabling the viewpoint to move along any line or plane with good spatial locality.
VIS includes a pixel-distance instruction for motion estimation in MPEG encoding, real-time video compression, and accelerating optical- or speech-pattern recognition. The pixel-distance instruction combines 24 operations into a single cycle by taking the absolute value of the difference between eight pairs of 8-bit data components and then accumulating the results. The TriMedia processor, with 37 SIMD instructions, also includes a motion-estimation instruction, which processes four data elements. Intel designers feel that a motion-estimation instruction is too special-purpose to "permanently" build into the CPU. The P55C achieves a similar result using about a dozen instructions.
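
The arithmetic behind a pixel-distance operation is a sum of absolute differences. The Python sketch below models it for eight 8-bit components (presumably the 24 operations are eight subtractions, eight absolute values, and eight accumulations) and then uses it in a toy one-dimensional motion search. Both function names and the search routine are illustrative assumptions, not Sun's API.

```python
def pixel_distance(block_a, block_b, accumulator=0):
    """Add the sum of absolute differences of eight byte pairs to an accumulator."""
    if len(block_a) != 8 or len(block_b) != 8:
        raise ValueError("each block must hold eight 8-bit components")
    for a, b in zip(block_a, block_b):
        accumulator += abs(a - b)
    return accumulator


def best_match(target, search_row):
    """Return (offset, distance) of the 8-pixel window closest to target."""
    best = None
    for offset in range(len(search_row) - 7):
        distance = pixel_distance(target, search_row[offset:offset + 8])
        if best is None or distance < best[1]:
            best = (offset, distance)
    return best
```

A motion estimator would run a search like `best_match` over a 2-D window of the previous frame; the smaller the distance, the better the candidate block predicts the current one.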
Sun's UltraSPARC can also perform block load/store instructions to transfer 64 bytes of data directly to and from memory and registers, circumventing the cache. These block-level instructions are useful for updating areas of an image and for handling packets of data coming in over a network.
| Development tools support multimedia extensions |
|---|
| Currently, the only way to incorporate Intel's MMX instructions into your software programs is by inserting the instructions at the assembly-language level. However, Microsoft plans to introduce C-compiler support for MMX by the third quarter. In addition, NuMega Technologies has enhanced its SoftICE product, a $499 Windows debugger, for debugging programs that contain MMX instructions. American Arium provides Pentium and Pentium Pro in-circuit emulators (ICEs) that support MMX instructions. The company's ICE supports a code window in the debugger that disassembles these instructions. The ICE's trace window also supports MMX. The basic Pentium ICE sells for $6000; the version with real-time trace costs $32,480. Likewise, the basic Pentium Pro ICE sells for $10,000; the real-time trace version costs $45,480. Intel offers the VTune performance-optimization tool, which generates a timing analysis of the runtime execution of your code. You can use VTune to isolate the computation-intensive sections of code and determine where MMX may be beneficial. A demonstration is available on Intel's Web site at intel.com/ial/vtune. Sun supports UltraSPARC with a VIS software developer's kit that sells for $995. The kit includes the VIS user's guide, VIS "C" simulator, a debugging tool, a "nearly" cycle-accurate simulator for code tuning, a Solaris 2.5 linker upgrade, and sample C and VIS code. The VIS "C" simulator allows you to test VIS code on any SPARC platform. The code includes routines for image inversion, FIR filters, and convolvers. Sun also offers a virtual device interface (VDI) that implements more than 400 algorithms using VIS. The VDI uses macro functions to call these algorithms from within a C program.
|
The reincarnation of NSP
Native signal processing (NSP) is the real-time execution of digital-signal-processing algorithms on a host processor. In 1994, Intel delivered its plans to promote NSP and implied that Pentium would displace DSPs in PCs. As you can imagine, Intel's zeal caused quite a ruckus in the PC industry, so the company quit its aggressive tactics. The advent of MMX technology has breathed new life into the capability of host-based processing. But, even without MMX instructions, the natural progression of each processor generation means that you get more free instructions per second to do NSP.
The most challenging aspect of NSP is defining the balance between what to do on the host and what to offload to dedicated hardware. To varying degrees, even the non-MMX Pentium can perform almost every type of multimedia application in software, depending on the performance trade-offs you are willing to make. Applications, such as 3-D graphics and 3-D sound, will continue to expand their feature sets, either using most of the CPU or limiting the deliverable features. And, in many cases, PC users want to run more than one application simultaneously, further complicating the host/hardware balancing act.
InQuest, a market research and consulting company, specializes in examining the multimedia balancing act. The company published a 400-page technology reference that analyzes the marketing and technology issues challenging PC multimedia developers (Reference 1). The reference discusses the concepts and feasibility of host-based processing. In addition, it is the only reference that explains almost every type of multimedia application.
Scalable signal processing
Some hardware vendors are preparing for the unknown benefits of MMX by developing scalable products. For example, the AD1815 and AD1816 sound controllers ($15) from Analog Devices use continuous-time oversampling (CTO) to synchronize a variety of analog and digital audio signals from the host CPU or hardware-accelerated algorithm engines. CTO uses digital PLLs to boost the sample frequency of the samples before mixing. The analog-mixer portion of the AD1815 complies with Intel's analog codec 96 specification, which you can find at Intel's Web site.
Chromatic's MPACT media processor is another example of a scalable device. The processor contains a programmable DSP engine that is tuned to support multimedia applications, including 3- and 2-D graphics, MPEG-1 and -2 video, audio, telephony, fax/modem, and videophone. Proponents of the media processor focus on its high level of functional integration. Opponents say the device runs out of gas when you try to simultaneously perform multiple functions, either degrading MPACT performance or forcing responsibility back to the host. An example of reduced performance is that MPACT drops frames when processing an MPEG video. If the host CPU must handle the multimedia application, opponents argue, then let the host do it in the first place and use dedicated hardware accelerators to perform the functions the host performs poorly. Furthering the argument, a dedicated hardware accelerator typically delivers higher performance than does a media processor but with less flexibility.
Chromatic uses a resource manager kernel to schedule tasks for MPACT. The kernel splits code that runs on the x86 and on MPACT. This approach allows the system to take advantage of MMX if it is available.
Microsoft has built scalable signal processing into the DirectX application-programming interfaces (APIs) for Windows 95. The DirectX APIs include Direct3D and DirectSound. The DirectX programming model provides a multilayered, standard hardware interface that allows software vendors to develop their applications without locking into a specific hardware implementation (Figure 4). The hardware-abstraction layer (HAL) lies at the lowest level of the DirectX API. One of the HAL's jobs is to query the performance characteristics of the underlying hardware to supply information about the subsystem's capabilities. If the HAL reports that the system lacks a particular hardware accelerator, such as a 3-D graphics controller, DirectX transparently redirects the application's request for a hardware function to the hardware-emulation layer (HEL). The HEL provides software emulation of features lacking in hardware. Microsoft will incorporate support for MMX instructions into the HEL in the second half of the year.
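
The HAL/HEL fallback can be pictured as a simple capability check followed by a dispatch. The Python sketch below is purely conceptual: the class names, method names, and feature strings are invented for illustration and do not correspond to any real DirectX interface.

```python
class Hal:
    """Hypothetical stand-in for a hardware-abstraction layer."""

    def __init__(self, capabilities):
        self._capabilities = set(capabilities)

    def supports(self, feature):
        # Report whether the underlying hardware offers this feature.
        return feature in self._capabilities

    def execute(self, feature, request):
        return f"hardware:{feature}({request})"


class Hel:
    """Hypothetical stand-in for a hardware-emulation layer."""

    def execute(self, feature, request):
        # Software emulation; this is where MMX-tuned routines would live.
        return f"software:{feature}({request})"


def dispatch(hal, hel, feature, request):
    """Use the hardware when the HAL reports the feature, else emulate it."""
    if hal.supports(feature):
        return hal.execute(feature, request)
    return hel.execute(feature, request)
```

For a system whose hardware handles only blits, `dispatch(Hal({"blit"}), Hel(), "render_3d", "triangle")` falls through to the software path, and the application's call looks the same either way.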
The host processor can perform most multimedia functions, including graphics, in software. The PC provides three approaches to processing graphics. The most common approach is having the host CPU perform the geometry, lighting, and delta calculations and then pass an integer number to a hardware accelerator for rendering. Approximately 22 companies, including Alliance Semiconductor, Brooktree, Cirrus Logic, and S3, currently support this approach. In all but the lowest end applications, the CPU performs the geometry processing in floating point to preserve the data's precision. Vendors are exploring ways to use MMX to speed the geometry pipeline, but this approach is questionable, because MMX caters only to integer data. A variation on this approach will occur when the hardware-accelerator vendors develop chips that use the Accelerated Graphics Port (AGP). Typically, the CPU passes graphics data to the graphics controller through the PCI bus. With a theoretical maximum data-transfer rate of 132 Mbytes/sec, the PCI bus is faster than the ISA bus. But AGP, a variation on PCI, operates with a 133-MHz clock, is dedicated to graphics, and transfers graphics data at 533 Mbytes/sec. AGP's speed may allow the 3-D graphics pipeline to be partitioned in a way that allows data to be passed back and forth between the accelerator and host CPU to take advantage of the MMX instructions for rendering functions.
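
The two peak transfer rates follow from multiplying the transfer rate by the bus width. The short Python sketch below reproduces that arithmetic; the 32-bit bus width and the 33-MHz PCI clock are assumptions supplied for the example (the article states only the resulting rates and the 133-MHz AGP figure), and 133.33 MHz is used in place of the rounded 133 MHz so that the result matches the quoted 533 Mbytes/sec.

```python
def peak_mbytes_per_sec(transfers_per_sec, bus_width_bits):
    """Theoretical peak rate: one bus-wide transfer per clock, no overhead."""
    return transfers_per_sec * bus_width_bits / 8 / 1e6


# Assumed parameters: 32-bit buses, 33-MHz PCI, 133.33-MHz AGP transfer rate.
pci_rate = peak_mbytes_per_sec(33e6, 32)
agp_rate = peak_mbytes_per_sec(133.33e6, 32)
```

These are ceilings, not sustained figures; arbitration and bus sharing reduce what a graphics controller actually sees on PCI, which is part of the case for a port dedicated to graphics.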
The second and highest performance approach to processing graphics is to build in a parallel graphics-processing engine, in which the host CPU facilitates only the data transfer. 3Dlabs' high-end, $37 Delta chip falls into this category. Delta, a geometry processor, performs transformations and lighting and setup calculations using floating-point data. It then converts the floating-point data to the correct input for the rendering stage of the graphics pipeline. MMX adds no value to this approach.
The third approach for processing graphics is to have the host CPU do everything, a "sweet spot" for MMX. Argonaut's BRender and Criterion Software's RenderWare are software-rendering engines that take the place of a hardware accelerator. Microsoft's Direct3D also contains a software-rendering engine. Using MMX, these software renderers should be four times faster than many of the low-cost 3-D accelerators, which process 150,000 triangles/sec. To make matters worse, these slow accelerators add overhead for triangle processing, because they require you to convert the data to a format that the chip wants to see. When the Direct3D HAL reports back on the slow performance of these accelerators, Direct3D turns off the hardware and resorts to software emulation. Software rendering does have limitations, however, especially when compared with the higher performance accelerators. For example, to have its rendering engine process as many as 400,000 triangles/sec, Microsoft had to simplify the engine's lighting model and moderate other features, such as texture mapping, Z-buffering, alpha blending, Gouraud shading, pixel rendering, fogging, and clipping.
MPEG 1 displays 30-frame/sec video at 352×240-pixel resolution and plays 44-kHz, 16-bit-stereo, CD-ROM quality audio compressed to a 150-kbps stream. MPEG 1 decoding requires 200 million to 250 million operations/sec to attain the desired frame rate and resolution. This level of performance should consume only about 30 to 40% of a 166-MHz Pentium-MMX, assuming that the system has dedicated hardware for color-space conversion and scaling. Most of the decoding algorithm's calculations are addition, subtraction, and multiplication, an ideal situation for MMX instructions. Because color-space conversion and scaling are data-, not processing-intensive operations, these operations would require about 50% more of a P55C's performance to handle.
Companies such as CompCore and Mediamatics provide software MPEG decoders. Microsoft's ActiveMovie, a real-time extension of Windows 95 that performs various multimedia functions, also does software MPEG 1 at 24 full-screen frames/sec and 11-kHz stereo. The companies are retrofitting their software to support MMX. CompCore will take advantage of MMX to perform Dolby AC-3 and, possibly, MPEG 2 decoding in software. MPEG 2 will gain popularity when digital video disk players become more prevalent.
Other companies are planning to use MMX to raise the performance baseline and lower the cost of their MPEG hardware. For example, unlike most MPEG decoders, S3's Scenic/MX2 MPEG 1 audio/video decoder requires the host CPU to demultiplex the MPEG 1 system stream into separate, compressed video and audio data streams. The host CPU must also preprocess the compressed audio data. The system stream is a well-defined data stream that you can manipulate using MMX compares and shifts.
PC multimedia audio has come a long way since the original Creative Labs Sound Blaster. Wave-table music synthesis and 3-D audio technologies enhance your listening pleasure but can consume unlimited amounts of computational power. You must be careful when developing an audio application, because the human ear is extremely sensitive to audio discontinuities. Under Windows 95, DirectSound helps deliver the low-latency audio effects for sound playback and mixing. DirectSound allows you to simultaneously mix eight audio sources. It currently supports wave-table and FM synthesis. This year, DirectSound will also support 3-D positional audio. Like the other DirectX APIs, DirectSound includes software emulation for audio that will take advantage of MMX.
Brooktree offers WaveStream, a DirectSound-compatible, software wave-table synthesizer. Because WaveStream performs adaptive dynamic filtering and interpolation techniques, the product will benefit from MMX. Wave-table synthesis creates an instrument library by digitally sampling multiple note ranges on the actual instruments. The wave-table synthesizer plays a note by searching the library for the note range closest to a musical scale. The process then performs DSP algorithms to provide the pitch shifting, filtering, interpolation, and signal conditioning to re-create the original instrument's timbre.
Kurzweil offers software audio synthesis on host (SASH), a software product that can be bundled with an audio codec. SASH is a scalable product, depending on the host processor's bandwidth. As a rule, programmers don't mind spending about 5% of the host CPU for audio processing. SASH consumes 18% of a 100-MHz Pentium to handle eight voices at 22.05 kHz. At 100% CPU usage, SASH provides 32 voices with reverberation, echo, and delay at 44 kHz. The verdict is still out on SASH's performance with MMX.
To obtain normal stereo, your system must split a sound into two pieces and multiply each piece by different values. The multiplier values determine the sound's position. Good 3-D audio splits the sound into more pieces and enhances the quality by adding in environmental effects, such as room simulation and Doppler shifts. To produce the desired sound at the left and right speakers, the host CPU must perform DSP filters to create the localization effect. These filters, such as the FIR, include many MAC operations. Both Crystal River and QSound provide 3-D audio drivers that perform these functions. These drivers consume about 5% of a 90-MHz Pentium's instructions to support four channels without environment simulation. The companies estimate that this same functionality may consume only 0.5% of a 166-MHz P55C's instructions.
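
The "split and multiply" step for ordinary stereo can be sketched as follows. The linear pan law and the function names are assumptions chosen to keep the example easy to verify; real drivers may use other gain curves, and 3-D audio replaces the two multipliers with FIR filters.

```python
def pan_gains(position):
    """Map a position from -1.0 (full left) to +1.0 (full right) to two gains.

    Assumes a simple linear pan law for illustration.
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie between -1.0 and +1.0")
    right = (position + 1.0) / 2.0
    return 1.0 - right, right


def pan_mono_to_stereo(samples, position):
    """Split each mono sample into a (left, right) pair scaled by the gains."""
    left_gain, right_gain = pan_gains(position)
    return [(sample * left_gain, sample * right_gain) for sample in samples]
```

A position of 0.5 sends one quarter of each sample to the left speaker and three quarters to the right, so `pan_mono_to_stereo([100, -50], 0.5)` gives `[(25.0, 75.0), (-12.5, -37.5)]`.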
There are two types of multimedia applications: bounded and unbounded. A bounded application is one that simply can't get any better. Examples include MPEG 1 decoding and a 14.4-kbps modem. Unbounded applications have no limits. For example, you can always make 3-D graphics more realistic. Or, you can better 3-D audio by adding voices and special effects. Although multimedia instructions don't allow you to reach the limit of an unbounded application, they help raise the performance bar to the next level.
| Looking ahead |
|---|
|
Although it's still too early to comprehend the effects of the MMX instructions, vendors of PC products are busy figuring out how to take advantage of MMX. Certainly, new applications will arise, and existing applications will improve. MMX also will most likely maintain the Ping-Pong effect: Vendors demand more computational power, so that the performance lead continues to pass back and forth between processors and hardware accelerators. For example, now that MMX practically enables MPEG 1 decoding, Microsoft introduces its ActiveMovie product that allows interactive MPEG 1, again consuming the CPU and requiring hardware acceleration. As more companies take advantage of host-based processing, DSP-related benchmarks, such as FFTs and FIRs, will become more important. Berkeley Design Technology is developing a report that presents benchmarks on Pentium, PowerPC, and other CPUs with DSP-like enhancements. The difficulty with these benchmarks is that they compare only µPs, not the systems in which they reside.
|
A special thanks to Bert McComas of InQuest, Spencer Greene of Alliance Semiconductor, Tom Clarkson of Brooktree, Paul Cobb of LSI Logic, and Satish Gupta of Cirrus Logic for their extra efforts.
| Manufacturers of multimedia-instruction products | | |
|---|---|---|
| Alliance Semiconductor Corp San Jose, CA (408) 383-4900, ext 102 | AMD Austin, TX (512) 462-4360 www.amd.com | American Arium Tustin, CA (714) 731-1661 www.arium.com |
| Analog Devices Norwood, MA (617) 329-4700 www.analog.com | Argonaut Technologies London, UK (44) 181-3582993 www.argonaut.com | Berkeley Design Technology Inc Fremont, CA (510) 791-9100 www.bdti.com |
| Brooktree Corp San Diego, CA (619) 452-7580 www.brooktree.com | Chromatic Research Mountain View, CA (415) 254-1600 www.chromatic.com | Cirrus Logic Fremont, CA (510) 623-8300 www.cirrus.com |
| CompCore Multimedia Inc Santa Clara, CA (408) 567-0552 www.compcore.com | Criterion Software Ltd Guildford, Surrey, UK (44) 1483-406200 www.csl.com | Crystal River Engineering Palo Alto, CA (415) 323-8155 www.cre.com |
| Cyrix Corp Richardson, TX (214) 994-8388 www.cyrix.com | InQuest Gilbert, AZ (602) 813-7785 www.inqst.com/inquest/ | Intel Literature Center Mount Prospect, IL (800) 548-4725 www.intel.com |
| Kurzweil Technology Group Waltham, MA (617) 890-2929 www.youngchang.com | Mediamatics Inc Santa Clara, CA (408) 496-6360 | Microsoft Corp Redmond, WA (206) 882-8080 www.microsoft.com |
| NuMega Technologies Inc Nashua, NH (603) 889-2386 www.numega.com | Philips Semiconductors Sunnyvale, CA (800) 234-7381 www.philips.com | QSound Labs Inc Calgary, AB, Canada (403) 291-2492 www.qsound.com |
| S3 Inc Santa Clara, CA (408) 980-5400 www.s3.com | Sun Microelectronics Mountain View, CA (408) 774-8119 www.sun.com/sparc | 3Dlabs Inc San Jose, CA (408) 436-3455 3dlabs.com |
| Vivo Software Inc Waltham, MA (617) 899-8900 www.vivo.com | Yamaha LSI San Jose, CA (408) 437-3133 www.yamaha.com | |