Decoding and virtualization bring surround sound to the masses
People commonly use the terms "stereo," "two channel," and "two speaker" to mean the same thing, but they're committing an etymological error. "Stereo" derives from the Greek word stereos, meaning "solid." In other words, a stereo recording-and-listening environment is one that reproduces the 3-D atmosphere of the original performance. Bob Dylan's Live 1966 ideally should sound just like it did when first performed in Manchester, England, and Miles Davis' Kind of Blue should transport you to 1959 and Columbia's 30th Street Studio, New York, even if you're listening to them in your living room in Sacramento, CA, in 2001. And, when the Tyrannosaurus Rex in the movie Jurassic Park growls on-screen, you should hear it over your shoulder and 20 feet above you, and the hair on the nape of your neck should rise.
As early as the 1930s, Bell Labs researchers JC Steinberg and WB Snow determined that you need at least three transducers to realistically reproduce an audio source (Figure 1a and b). Their research didn't account for the additional speakers necessary to replicate the acoustics of the listening environment. Monophonic radios, phonographs, and tape players, eventually supplanted by their two-channel variants and by audio CDs, existed because of the economics and technology limitations of the time, not because they delivered realism. Some of you may be familiar with the ill-fated quadraphonic (four-channel) systems that briefly appeared in the early 1970s. The theory behind quadraphonic audio was fairly solid, but the implementations weren't: Manufacturers' proprietary systems and corresponding media were incompatible with each other, and compelling content was lacking. Deluded audiophiles like to point to the failure of quadraphonic sound as justification for two-channel-only audio (Reference 1).
The fact is surround sound in movie theaters has thrilled consumers for decades. Leopold Stokowski and the Philadelphia Orchestra performed the classical music in Disney's (www.disney.com) 1940 film Fantasia in Fantasound, and it is but one early example of surround sound. Now-ubiquitous Dolby Surround first appeared in the mid-1970s, Dolby Digital debuted with the movie Batman Returns in June 1992, and Jurassic Park followed in 1993 in DTS (Digital Theater Systems) surround. Dolby Surround-encoded television broadcasts and videotapes are now commonplace and, along with Dolby Digital- and DTS-aware DVD and audio CD players, have brought surround music, movies, and other programs into living rooms and automobiles. DVD-Audio and SACD (Super Audio Compact Disc), assuming they succeed in the market (which is not a foregone conclusion), will accelerate this awareness, as will immersive gaming and other computer-based audio environments (references 2 and 3). Teleconferencing, another potential surround-sound application, enables listeners to differentiate individual speakers, talking at the same time at the other end of the line, within a group.
Another common argument of the audiophile Luddites against audio with more than two channels is the baffling statement that they need only two speakers because they have only two ears. In reality, the human auditory system, working with other sensory faculties, such as vision, can accurately locate a sound source in 3-D space using both absolute and differential time-of-arrival, intensity, and frequency cues along with training in how the head, shoulders, and ears modify incoming sounds. Even if the sound source is directly in front of you, the echoes and reverberations of the recording environment will alter it unless the recording occurs in an anechoic chamber. Because the end listening environment doesn't have the same acoustics, you need audio processing to re-create a semblance of the original.
Now that listeners have enjoyed surround sound with portions of their television programming, movies, and music collections, they'd like to extend immersiveness to all of their multimedia experiences, regardless of the sources' characteristics—that is, monophonic or two-channel—or the listeners' settings, such as in an automobile, an airplane, an office, a conference room, a living room, or on the jogging track. But, because of financial and aesthetic resistance to buying and installing center, rear, and subwoofer speakers—the so-called SAF (spousal-acceptance factor)—many of them would like to gain an approximation of the true surround-sound experience using their current two-speaker or headphone configurations. Fortunately, just as with lossy audio compression, an in-depth understanding of both the strengths and the shortcomings of the auditory system, along with some DSP horsepower and memory, can credibly accomplish these seemingly divergent objectives (references 4 and 5).
If you're starting with a single-channel—that is, monophonic—audio source, it might seem impossible at first glance to create a two-channel variant—never mind a full-blown surround-sound representation—of it (Reference 6). Recall, though, that people generally regard low-frequency sounds as nondirectional; that is, the human auditory system cannot determine where in the listening area they originate. Therefore, you can apply a lowpass filter to the source and direct frequencies of less than 100 Hz or so only to those channels whose transducers will likely be able to reproduce them, such as a distinct 0.1 subwoofer channel.
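As a rough sketch of this bass-management split, the following pure-Python one-pole lowpass divides a mono signal into a subwoofer feed and a highpass remainder. The function name and filter order are illustrative assumptions; production bass management uses much steeper filters:

```python
import math

def bass_split(mono, fs=48000, cutoff=100.0):
    """Split a mono signal into a low-frequency subwoofer feed (one-pole
    lowpass at `cutoff` Hz) and the highpass remainder for the main
    channels. A minimal sketch, not a production crossover."""
    # one-pole lowpass smoothing coefficient
    a = math.exp(-2.0 * math.pi * cutoff / fs)
    low, lows, highs = 0.0, [], []
    for x in mono:
        low = (1.0 - a) * x + a * low   # smoothed low-frequency content
        lows.append(low)
        highs.append(x - low)           # everything above the cutoff
    return lows, highs
```

Because the highpass output is simply the input minus the lowpass output, the two feeds always sum back to the original signal.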
Next, delay the highpass-filter output by an adjustable amount of time if you want to allow for user customization and then add it to the nondelayed signal to create one output channel and subtract it from the nondelayed signal to create a second channel (Figure 2). The results are true and complementary comb filters, and, when you present the two channels through two speakers, the frequencies split evenly between them. Running both channels through the same speaker cancels out the delay and re-creates the original monophonic signal. Advanced technologies from some of the companies that the sidebar "For more information..." lists employ more elaborate filtering techniques to transform monophonic audio. Patent searches and conference papers reveal some clues on these advanced techniques, but you must sign a nondisclosure agreement to get all the details (see sidebar "Haven't heard enough?"). In contrast, some simple and cheap pseudo-two-channel algorithms simply subdivide the audio into multifrequency bins, allocating the bins among the various channels.
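The delay-add-subtract arrangement of Figure 2 can be sketched in a few lines (a hypothetical helper; in practice the delay length would be user-adjustable):

```python
def mono_to_pseudo_stereo(mono, delay=20):
    """Derive two complementary comb-filtered channels from a mono signal:
    left = x[n] + x[n-delay], right = x[n] - x[n-delay].
    Summing the channels yields 2*x[n], mirroring how one speaker
    playing both channels cancels the delay and restores the mono original."""
    out_l, out_r = [], []
    for n, x in enumerate(mono):
        d = mono[n - delay] if n >= delay else 0.0
        out_l.append(x + d)   # peaks where the comb of the other channel dips
        out_r.append(x - d)
    return out_l, out_r
```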
Now, you've got a two-channel audio clip, either as an original recording, which, let's assume, contains no matrix-encoded center and mono-surround information, or one resulting from the earlier mono-to-two-channel conversion. Next, you might want to first alter the "sweet spot," or region within which a person listening to the audio can hear both channels. For a computer user sitting before a display, the sweet spot can be narrow and shallow, which often results in more accurately perceived sound positioning and which is particularly appealing with 3-D games. A home-theater setting, in which listeners may be in 10-seat rows or in which the listeners are milling about instead of remaining in one location, requires a wide, deep sweet spot. A similar situation exists in automobiles, in which neither the driver nor any of the passengers is in an ideal listening location. Bigger sweet spots result in a more immersive surround-sound experience at the possible expense of reduced sound-source-positioning precision.
The sweet-spot characteristics also depend on the anticipated placement of the two front speakers and on the geometrical relationship between their spacing and the listener's location. If the speakers are several yards apart, as with most audio/video receiver setups, the sweet spot is naturally wider; however, the center-perceived location of audio material that they share is indistinct. Adjacent transducer placement, such as with speakers on either side of a television tube or computer display, creates a narrow sweet spot but a well-defined center location for dialogue and other common material. To enhance the center-channel characteristics, you determine what material the front left and right channels share and then emphasize it using HRTF (head-related-transfer-function) transformations. Ideally, center-channel information transmits through a dedicated speaker, because a "phantom center," which you create by coupling the left and right front speakers, exhibits timbre that differs from the real thing.
In addition to emphasizing the shared information, you might also want to broaden the audio image created by data in one channel and not in the other. One quick and dirty means of accomplishing this goal involves inverting the phase of one of the two channels; the "stereo-wide" button in low-end consumer-electronics gear frequently activates this normally undesirable technique. This technique obliterates center imaging of shared-channel content. A more elaborate technique that less destructively scatters the sound-source-directional cues involves first calculating the channel (A–B) and (B–A) information and then employing frequency-dependent time, spectrum, and overall intensity alteration to create the perception that the sound is originating beyond the physical boundary of each speaker.
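A hedged sketch of the (A–B)/(B–A) idea, reduced here to simple mid/side scaling and omitting the frequency-dependent time and spectrum alterations the elaborate techniques apply:

```python
def widen(left, right, width=1.5):
    """Mid/side widening: scale the channel-difference (side) signal to
    push uncorrelated material outward while leaving the shared (mid)
    content--and thus the center image--untouched. width=1.0 is a no-op."""
    out_l, out_r = [], []
    for a, b in zip(left, right):
        mid = 0.5 * (a + b)            # content shared by both channels
        side = 0.5 * (a - b) * width   # channel-exclusive content, scaled
        out_l.append(mid + side)
        out_r.append(mid - side)
    return out_l, out_r
```

Unlike the phase-inversion trick, this approach preserves the sum of the two channels, so shared-channel center content survives intact.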
Next, how do you create listening-room acoustic effects for the rear surround speakers? First, you must differentiate among early reflections, echo, and late reverberations (Figure 3). These three phenomena result from sound reflecting off objects before entering listeners' ears, but the auditory system perceives them in different ways. Early reflections—those lagging behind the original sound by as much as 30 msec—enable the ear and brain both to locate the sound source and to perceive the room dimensions. Their amplitude depends on the reflectivity of objects the sound waves bounce off before entering your ear, whereas their delay is a function of room width, depth, and height and of the presence of reflective objects within a room.
We perceive direct reflections—those that bounce off only one or only a few objects before entering the ear—as echoes beyond 30 msec; hence, these reflections tend to degrade the sound. Conversely, sounds that have been reflected many times with attenuation at each reflection point come at a listener from all directions. Many of these low-amplitude, diffuse late reflections, or reverberations, simultaneously arrive at the listener. A certain amount of reverberation is generally desirable. Think, for example, how much richer your voice sounds when you sing in the shower. Conversely, in rooms with little reverberation, the sound you hear is often unpleasant.
You create the perception of reflections and reverberations by employing RAM-based delay lines, along with signal processing, to modify the audio as a real-life reflection would. Different delays, intensities, and spectral transformations can create the illusion of a cavernous concert hall or an intimate jazz club. Artificial ambiance seems appealing in theory, but reality is often underwhelming, especially if a listener exaggerates the effect. An acoustical model that sounds good with a symphony orchestra, for example, might sound horrible with a solo pianist or vocalist. Short cuts in memory and processing power to reduce system cost and power consumption leave the resulting reverberation sounding artificial. Artificial ambiance also clashes with other ambiance already in the audio.
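The RAM-based delay line described above can be sketched as a single feedback comb, one of the many such elements (plus allpass stages) that a real reverberator combines; the parameters here are illustrative:

```python
def feedback_delay(dry, delay, feedback=0.5, wet=0.3):
    """One feedback delay line: each input sample recirculates through a
    `delay`-sample buffer (the "RAM"), attenuated by `feedback` on every
    pass, producing a train of decaying echoes mixed in at level `wet`."""
    buf = [0.0] * delay   # circular delay buffer
    idx, out = 0, []
    for x in dry:
        echoed = buf[idx]                  # sample from `delay` samples ago
        buf[idx] = x + feedback * echoed   # recirculate an attenuated copy
        idx = (idx + 1) % delay
        out.append(x + wet * echoed)
    return out
```

Feeding an impulse through the line shows the geometric echo decay: the first echo arrives `delay` samples late at level `wet`, the next at `wet * feedback`, and so on.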
An alternative approach is more processing-intensive but more authentic in its results. It involves analyzing and extracting this existing ambiance and sending the reflections and reverberations to the rear channels. Audio engineers capture this ambiance during the original recording by using binaural, hypercardioid, or omnidirectional microphones, or they can add the ambiance to the audio during mixing—a high-tech version of singing in the shower. Directing reflections and reverberations to the rear speakers can be an effective arrangement in an automobile. The traditional auto audio configuration replicates the right and left audio channels in both the front and the rear speakers. This so-called dual-stereo configuration significantly diminishes the listeners' appreciation: The speakers not only are in poor locations but also bombard your ears with destructive crosstalk from both the front and the rear.
After creating additional audio channels, you might also want to enhance the perceived low frequencies of the audio to compensate for anticipated bass-deficient speakers or the high frequencies to counterbalance the effects of lossy compression. Harmonics play a part in both operations. Mix and play back 100- and 150-Hz tones in an audio-editing program, and you'll also hear what sounds like a 50-Hz tone. At the high end of the frequency spectrum, Kenwood's (www.kenwood.com) Supreme technology interpolates high-order fundamental tones from lower-order harmonics that have survived lossy encoding. Thomson Multimedia's (www.thomson-multimedia.com) MP3pro compression scheme takes similar advantage of harmonics to shift high frequencies beneath harm's way during encoding, subsequently restoring them during decoding.
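You can verify the 100-plus-150-Hz example numerically: their sum repeats every 20 msec, the period of the 50-Hz fundamental the ear infers (a hypothetical helper for generating the mix):

```python
import math

def tone_mix(freqs, fs=48000, dur=0.1):
    """Sum equal-amplitude sine tones. Mixing 100 Hz and 150 Hz yields a
    waveform whose overall period is 20 msec (1/50 sec), which the
    auditory system interprets as a 50-Hz "missing fundamental"."""
    n = int(fs * dur)
    return [sum(math.sin(2 * math.pi * f * t / fs) for f in freqs)
            for t in range(n)]
```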
At this point, you've created front left and right channels, a center channel, one or more rear channels, and maybe even a subwoofer channel. Today's most common ideal reproduction setting is a six-speaker setup like the one that the ITU (International Telecommunication Union) defines (Figure 1c). Aesthetic considerations can drive subwoofer placement because the channel's sound isn't directional. However, placing the subwoofer against a wall, and especially in a corner, reinforces its output and maximizes its perceived loudness. For music reproduction, all other speakers should have full frequency response, and the surround-sound speakers should be directly radiating. Conversely, for movies, the center channel often carries only dialogue, and the surround-sound speakers find most use in reproducing special effects. Such home theaters often employ bipole or dipole surround-sound speakers for immersiveness. Unfortunately, they also often trade off speaker frequency response and other characteristics to reduce cost.
What if having more than two speakers isn't an option? In this case, you need to create the illusion of more speakers than actually exist. Recall that, in the ITU configuration, each ear of each listener perceives sounds that originate from all six speakers. Interaural time differences play a key role in locating sound sources of frequencies of 1 kHz and lower (Figure 4). Conversely, interaural intensity differences are the primary means by which the auditory system locates the source of sounds higher than 1 kHz. When a sound source is close to a listener, the spherical outward radiation of sound emanating from an off-center source and the resulting level difference between the ears of the listener are additional factors in determining location. The unique shape of each person's head and shoulders is an appreciable barrier to and spectral modifier of sound waves, and the spacing between ears, shape of each ear, and shape of the auditory canal leading to the eardrum are also key identifiers of a sound's direction.
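As a sketch of the interaural-time-difference cue, Woodworth's classic spherical-head approximation estimates ITD from source azimuth. The head radius and speed of sound below are typical values, not per-listener measurements:

```python
import math

def itd_seconds(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth's spherical-head approximation of interaural time
    difference for a distant source: ITD = (a/c) * (theta + sin(theta)),
    valid for azimuths from 0 (straight ahead) to 90 degrees (fully lateral)."""
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))
```

For a fully lateral source the formula gives roughly 0.66 msec, the familiar maximum ITD for an average head.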
HRTF curves summarize these listener-specific phenomena. By transforming a sound's timing and frequency spectrum using the HRTF data, you can theoretically "place" a sound anywhere in space around a listener's head using only two speakers. Some algorithm developers claim that if an audio engineer has employed basic amplitude-based effects rather than more elaborate HRTF-based techniques to give the illusion of a sound moving from front to back, an HRTF-aware virtual-surround-sound system can deliver a more realistic result than a front- and back-speaker combination. Because our ears are on the sides of our heads, we're particularly sensitive to shortcomings in "phantom" midlateral speakers, which HRTF transforms account for, particularly if those transforms also adjust for interaural crosstalk effects.
Conversely, when a front-placed speaker pair creates virtual surround, it's difficult to spin a psychoacoustically altered sound completely around to the back of your head and equally challenging to communicate sound source height cues. If you're playing a video game, your head is stable and centered, and you also have visual "suggestions" in the form of on-screen objects to aid in locating sound. However, in the absence of visual assistance, you may experience front-versus-back location confusion. In real life, you'd turn your head to help find the sound. But, because the sound's location is virtual and the HRTF-transformation effect depends on your head's orientation to the real speakers, this reflex reaction makes the problem worse.
Ideally, HRTFs should be customizable for each listener, because we all have different head shapes, ear spacings and contours, and concha openness and depth. Multilistener environments obviate the effectiveness of such customization, however. Some HRTF algorithms also adapt their responses to plane sound sources, to nearby sources where spherical effects are important, and to listeners who are not in the sweet spot. (For example, when a train speeds by, it does not serve as a point source.)
When audio engineers mix music and movie soundtracks, they assume that users will play the end results on traditionally placed speakers. As noted, even in a simple two-speaker configuration, each ear senses not only the output of its corresponding front speaker, but also a spectrally modified and time-delayed version of the opposite speaker. However, this acoustic crosstalk doesn't occur with headphones, in which each channel is isolated to only a single ear. Uncompensated audio, intended for speakers but played back through headphones, produces an annoying in-the-head effect, compared with a compensated alternative, which mixes each channel with a time-delayed and spectral-modified version of the opposite.
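A minimal crossfeed sketch of that compensation: each ear receives a delayed, attenuated copy of the opposite channel. The gain and delay values are illustrative, and real designs also lowpass-filter the crossfed path to mimic head shadowing:

```python
def crossfeed(left, right, delay=12, gain=0.4):
    """Mix each headphone channel with a delayed, attenuated copy of the
    opposite channel, simulating the acoustic crosstalk of loudspeaker
    listening and easing the in-the-head effect of uncompensated audio."""
    out_l, out_r = [], []
    for n in range(len(left)):
        dl = left[n - delay] if n >= delay else 0.0
        dr = right[n - delay] if n >= delay else 0.0
        out_l.append(left[n] + gain * dr)   # left ear also hears right channel
        out_r.append(right[n] + gain * dl)  # and vice versa
    return out_l, out_r
```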
HRTF transformations for virtualization with more than two speakers over headphones take a different form from their two-speaker equivalents. Because the speakers are on the ears and follow the head regardless of its orientation, sweet-spot issues disappear, more realistically achieving full 360° sound placement and convincing vertical positioning. However, the direct coupling of the transducer to the ear effectively removes outer-ear HRTF effects, which can make it difficult to achieve consistent results from user to user. Headphone-virtualization algorithms are also sometimes complicated because they can't use the acoustics in the listening room to help recreate the original environment. On the other hand, if the listening-room acoustics are detrimental (echo-conducive, for example), you should probably begin with a "clean slate."
To achieve more realistic reproduction of sound-source height (think of a rocket taking off) and to improve the realism of front-to-back and back-side-to-side sound source movement, audio pioneer Tomlinson Holman advocates a 10.2-channel system (Figure 1d). His approach builds on the 5.1-channel ITU standard with the addition of dual side—that is, midlateral—channels, a rear center channel, dual subwoofers, and two height channels. The height channels target reproduction over speakers ±45° horizontally and 45° vertically away from the listener. As an interim step to 10.2 channels, several 6.1-channel approaches are gaining popularity. Dolby Digital Surround EX, which Dolby Labs developed with Holman, matrix-encodes a rear center channel, as does DTS-ES Matrix. DTS-ES Discrete employs a distinct rear center channel; Digital Theater Systems designed its bit-stream format to be backward-compatible with respect to additional audio channels; higher-precision sample sizes; and higher sampling frequencies, such as the upcoming 24-bit, 96-kHz DTS. And THX Ultra2 technology extracts seven full-range channels and a subwoofer channel from 5.1-channel source material.
References
1. Guttenberg, Steve, and Brent Butterworth, "Stereo vs 5.1: Is more more...or less?" Stereophile, August 2001, pg 49.
2. Dipert, Brian, "'Bassless' buzz impairs advanced audio's image," EDN, May 24, 2001, pg 20.
3. Dipert, Brian, "Security scheme doesn't hold water(marking)," EDN, Dec 21, 2000, pg 35.
4. Dipert, Brian, "Digital audio gets an audition: part two, lossy compression," EDN, Jan 18, 2001, pg 87.
5. Dipert, Brian, "Digital audio breaks the sound barrier," EDN, July 20, 2000, pg 71.
6. Rose, Jay, "Stereotypes," DV Magazine, September 2001, pg 98.
7. Layton, Leonard, "Surround sound takes to the air," IEEE Spectrum, August 2001, pg 56.
8. Kraemer, Alan, "Two speakers are better than 5.1," IEEE Spectrum, May 2001.
9. Rumsey, Francis, Spatial Audio, ISBN 0-240-51623-0, Focal Press, Woburn, MA, 2001.
10. Coulter, Doug, Digital Audio Processing, ISBN 0-879-30-566-5, CMP Media, Lawrence, KS, 2000.
Thanks to Andrew Reilly from Lake Technology, Scott Willing from QSound Labs, Randy Roscoe from Spatializer Audio Laboratories, and Alan Kraemer from SRS Labs for their assistance and feedback.