Now hear this
By Brian Dipert, Technical Editor -- EDN, 2/3/2000
Advanced lossy audio-compression algorithms can squeeze the average 16-bit, dual-channel (32-bit total) digital-music sample down to just over 2 bits (a 96-kbps audio stream), while retaining high enough sound quality that an overwhelming majority of audiences can't distinguish the compressed version from the original source material. Even if listeners can detect a difference, odds are good that they won't be able to consistently identify the original version as better and the compressed version as worse.
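The "just over 2 bits" figure follows from simple division. The sketch below assumes the 44.1-kHz compact-disc sampling rate mentioned later in this article; the function name is illustrative only.

```python
def bits_per_stereo_sample(stream_bps: float, sample_rate_hz: float = 44_100) -> float:
    """Average number of bits available for each left/right sample pair."""
    return stream_bps / sample_rate_hz

uncompressed_bits = 16 * 2  # 16 bits per channel, two channels
compressed_bits = bits_per_stereo_sample(96_000)

print(f"{compressed_bits:.2f} bits per stereo sample")
print(f"compression ratio about {uncompressed_bits / compressed_bits:.1f}:1")
```

At 96 kbps, the encoder therefore has roughly one-fifteenth of the original bit budget to work with.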
When archival storage requires lossless compression, multimedia-optimized routines can still condense the average sound file to nearly half its original size and sometimes even smaller. (The use of the term "lossless" is relative. Any digital conversion of analog sound, by virtue of its inherent quantization, is a lossy process. However, lossless compression subsequently preserves all bits of the digital material; lossy compression further reduces its content.)
What sonic characteristics help the hardware and software work their bit-rate-reduction magic? (For more on the hardware and software, see "The Internet-audio (r)evolution," pg 101.) And what are the secrets behind the compression algorithms' tricks? Consider the following attributes of sound creation, delivery, and reception.
The source
Audio, like video, often exhibits a lot of sample-to-sample redundancy, especially over short periods of time (Figure 1). This correlation extends to similarities between left and right channels of conventional stereo recordings and even to some of the additional channels in surround-sound systems. With the possible exception of live music recordings and material originally recorded on now-degraded analog tape, little to no phase differential exists between the left and right channels, either. Live recordings also commonly contain a greater amount of high-frequency ambient noise than studio recordings.
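A small experiment can illustrate both kinds of redundancy. The following is a minimal sketch, not any particular codec's method: it uses a first-order difference as a stand-in for sample-to-sample prediction and an integer mid/side transform as a stand-in for inter-channel decorrelation. The test tone and its channel levels are hypothetical.

```python
import math

def delta(samples):
    """First-order prediction residual: each sample minus its predecessor."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def mid_side(left, right):
    """Integer mid/side transform of a stereo pair (reversible, see below)."""
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def inverse_mid_side(mid, side):
    """Recover left/right exactly; side's parity restores the bit mid dropped."""
    left, right = [], []
    for m, s in zip(mid, side):
        total = (m << 1) | (s & 1)
        left.append((total + s) >> 1)
        right.append((total - s) >> 1)
    return left, right

def mean_abs(samples):
    return sum(abs(s) for s in samples) / len(samples)

# Hypothetical signal: a 440-Hz tone at 44.1 kHz, slightly louder on the left.
rate = 44_100
left = [round(12_000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate // 10)]
right = [round(11_000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate // 10)]

mid, side = mid_side(left, right)
print("left channel, mean magnitude :", round(mean_abs(left)))
print("left channel, delta residual :", round(mean_abs(delta(left))))
print("side channel, mean magnitude :", round(mean_abs(side)))
```

The residual and side values are much smaller than the raw samples, so they need fewer bits to represent. Real material is far less predictable than a pure tone, so the gain on music will be smaller.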
Most music concentrates most of its energy in the lower portion of the high-fidelity frequency range, and this situation is always true for content consisting entirely of vocals and low-frequency effects. In surround-sound movie systems, recording engineers usually intend the center channel to convey dialogue. The dialogue's frequency extends only up to a few kilohertz; the low-frequency-effect channel's energy ends after a few hundred hertz. Rear surround channels primarily provide 3-D ambiance, and sound quality is often less critical in the rear than in the front stereo channels.
With music, however, you can't assume that the information intended for rear channels is less valuable than the information targeted at the front speakers. Music systems also derive the subwoofer content from a lowpass-filtered, monophonic combination of the other channels, rather than assuming a separate information stream in the source material.
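Deriving a subwoofer feed in this way can be sketched in a few lines. This is an illustration only: the one-pole filter, the 120-Hz cutoff, and the function names are assumptions chosen for brevity, and real systems typically use steeper crossover filters.

```python
import math

def subwoofer_feed(channels, sample_rate_hz=44_100, cutoff_hz=120.0):
    """Average the channels to mono, then apply a one-pole lowpass filter."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor of the one-pole filter
    out, state = [], 0.0
    for frame in zip(*channels):
        mono = sum(frame) / len(frame)
        state += alpha * (mono - state)
        out.append(state)
    return out

def tone(freq_hz, n=4410, rate=44_100):
    """Unit-amplitude sine wave used as a hypothetical test input."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]

bass = subwoofer_feed([tone(60), tone(60)])
treble = subwoofer_feed([tone(6000), tone(6000)])
print("60-Hz peak after filter:", round(max(bass[2205:]), 2))
print("6-kHz peak after filter:", round(max(treble[2205:]), 2))
```

The bass tone passes nearly intact while the treble tone is strongly attenuated, which is the behavior a derived low-frequency channel needs.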
The delivery mechanism
Most people own inexpensive headphones and speakers. These devices significantly attenuate the signal at the low and high ends of the high-fidelity audio spectrum, and the speakers' response is irregular even across their usable range. Successive analog amplification, switching, and other stages degrade the digital source material before it reaches the speakers.
The transmission medium
The transmission factor encompasses two aspects of audio: the environment between the speakers and your ears and the environment between a digital-media source and a destination. Ambient noise, which is present in automobiles, with open-air headphones, and in other nonideal listening environments, both decreases the audio's effective dynamic range and masks portions of the source material's frequency spectrum.
Whereas 16-bit, dual-channel, 44.1-kHz sampled audio requires more than 1.4 Mbps of streaming bandwidth and nearly 11 Mbytes/minute of storage space, most of today's Internet users have 56-kbps analog modems, often with a usable bandwidth that barely exceeds 40 kbps. Single-channel integrated-services-digital-network bandwidth is only 64 kbps, and guaranteed-base-rate asymmetrical-digital-subscriber-line bandwidth pokes along at 384 kbps. Cable-modem bandwidth depends on the number of users sharing the same head end and on the bandwidth of the link between the head end and the Internet (such as a 1.5-Mbps T1 line).
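The gap between source and channel is easy to quantify. The sketch below reproduces the figures above and computes the minimum compression ratio each link would need for real-time streaming. It assumes 1 Mbyte equals 10^6 bytes and ignores protocol overhead; the link labels are shorthand for the rates quoted in the text.

```python
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

pcm_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS  # 1,411,200 bits/s
mbytes_per_minute = pcm_bps / 8 * 60 / 1e6             # about 10.6 Mbytes

# Link rates quoted in the text, in bits per second.
links = {
    "56k modem (usable)": 40_000,
    "single-channel ISDN": 64_000,
    "base-rate ADSL": 384_000,
}

print(f"PCM stream: {pcm_bps / 1e6:.2f} Mbps, {mbytes_per_minute:.1f} Mbytes/minute")
for name, bps in links.items():
    print(f"{name}: needs at least {pcm_bps / bps:.1f}:1 compression")
```

A modem user therefore needs better than 35:1 compression to stream CD-format audio, which is well beyond what lossless techniques can deliver.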
The receiver
Many people believe that high-fidelity audio spans frequencies of 20 Hz sub-bass to 20 kHz ultratreble, and they make purchasing decisions based on how well a system claims to represent this range. But only a small percentage of these people can actually hear information across that entire range, especially at the high end (Figure 2). Even most young people's hearing rolls off at around 17.5 kHz. By the time they enter their second half-century, their ability to perceive audio information diminishes beyond 14 kHz. Aggressively quantizing high-frequency content, converting it from stereo to mono to create redundant information in all channels or, in some cases, completely eliminating it, is an effective compression technique, because high-frequency data creates the greatest sample-to-sample variability.
Sound-pressure testing indicates that our auditory system is most sensitive at 2 to 5 kHz, with 4 kHz being the "sweet spot." This characteristic shouldn't be too surprising; evolution has prepared us well for the frequencies that dominate human-to-human conversation and the sounds that life-threatening animals create. The organ of Corti is the structure in our inner ear responsible for translating incoming audio waves to nerve impulses. Its basilar-membrane width, thickness, and stiffness and hair-cell clustering suggest that the detection of incoming audio information groups into a number of critical frequency bands; most of these bands are below 5 kHz (Table 1). Notice that the bands' widths increase as the corresponding center frequencies rise.
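The critical-band behavior can be approximated analytically. The sketch below uses the Zwicker-Terhardt approximations for critical-band rate (in Bark) and critical bandwidth; these formulas are not taken from this article or its Table 1, and the chosen test frequencies are arbitrary.

```python
import math

def bark(freq_hz: float) -> float:
    """Critical-band rate in Bark (Zwicker-Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

def critical_bandwidth_hz(freq_hz: float) -> float:
    """Approximate critical bandwidth around a given center frequency."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (freq_hz / 1000.0) ** 2) ** 0.69

for f in (100, 500, 1000, 2000, 5000, 10000, 15000):
    print(f"{f:>6} Hz  band {bark(f):5.1f} Bark  width {critical_bandwidth_hz(f):7.0f} Hz")
```

Under this approximation, roughly three-quarters of the band numbers fall below 5 kHz, and the bandwidth grows from about 100 Hz at low frequencies to several kilohertz near the top of the audible range.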
Many of you have probably already heard something about, and all of you have undoubtedly experienced, the psychoacoustic phenomenon called masking. Simply defined, masking occurs when a tone of a given energy and frequency blocks you from hearing lower energy tones of nearby frequencies. This phenomenon operates within a frequency band and, because of harmonic effects, also across other frequency bands. Masking also has a temporal perspective; a high-energy tone can mask both same- and nearby-frequency tones occurring up to a few milliseconds both before and after it (Figure 3). If you can't hear a valid audio tone, why bother storing it, and if you can't hear quantization noise, why bother using more bits than are necessary to overcome it? Audio engineers define such information as irrelevant.
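The bit-saving consequence of masking can be sketched with the familiar rule of thumb that each bit of quantizer resolution lowers quantization noise by about 6 dB. The example below is a simplification under that assumption; the band names and levels are hypothetical, and a real psychoacoustic model derives the masking thresholds from the signal itself.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of signal-to-noise ratio per bit

def bits_needed(signal_db: float, mask_threshold_db: float) -> int:
    """Bits required to push quantization noise below the masking threshold."""
    if signal_db <= mask_threshold_db:
        return 0  # the component itself is masked, so it is irrelevant
    return math.ceil((signal_db - mask_threshold_db) / DB_PER_BIT)

# Hypothetical per-band levels in dB: (signal, masking threshold).
bands = {"band A": (80.0, 40.0), "band B": (62.0, 55.0), "band C": (30.0, 45.0)}
for name, (signal, threshold) in bands.items():
    print(name, bits_needed(signal, threshold), "bits")
```

Band C receives no bits at all because its content lies below the threshold, and band B needs far fewer than the 16 bits of the source format.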
Our auditory systems are better able to sense the origin of a sound as its frequency increases. This characteristic explains why the high-frequency tweeter in a speaker is often at the top of the driver stack, as close as possible to ear level. It also explains why people often put the subwoofer in the corner of a room or sometimes use it as a coffee table and why its large bass cone frequently points toward the floor. If multiple audio channels contain redundant low-frequency information, you can discard all but one of them.
Algorithm constraints
Exploring the ins and outs of a number of compression schemes, you can find tremendous diversity among encoding and decoding techniques (Table 2). This variety reflects the trade-offs necessary to balance encoding and decoding speed with compressed audio size and quality. Additional things to keep in mind when evaluating codecs include the required amount of logic and memory and their associated costs. Sometimes the designers implement the algorithm in ASIC hardware (for speed, cost, and power reasons), sometimes they rely on software running on a CPU or DSP (for flexibility and upgradability), and sometimes they choose a programmable-logic-based, middle-ground approach.
Beyond analyzing the encoder and decoder's absolute complexity, you also need to look at their relative complexity. Is the application a one-to-one interactive configuration, such as teleconferencing, in which both encoding and decoding occur at each node? Or is it a one-to-many unidirectional distribution, such as digital radio? Will compression and decompression occur infrequently and in a manner that is not time-critical, implying that codec speed is less important than compressed media size? Or does the application use streaming media, in which encoder and decoder performance is paramount? Streaming applications place greater than normal demands on the decoder, which must now also buffer the incoming information and gracefully degrade its output quality if it doesn't receive portions of the data in a timely fashion.
Does the application require a constant per-sample bit rate and guaranteed file size for an audio sample of a given duration, or can it tolerate a variable bit rate in exchange for potentially higher audio quality? Even in constant-bit-rate environments, advanced codecs add complexity by dynamically allocating all available bits to the various frequency bands on a frame-by-frame basis in an attempt to maximize quality.
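Frame-by-frame allocation under a fixed budget is often described as a greedy loop: repeatedly give one more bit to the band whose quantization noise is currently most audible. The following is a minimal sketch of that idea, not the procedure of any specific codec. It assumes each bit lowers a band's noise by about 6 dB, and the signal-to-mask ratios and budgets in the usage lines are hypothetical.

```python
import heapq
import math

DB_PER_BIT = 20 * math.log10(2)  # each added bit lowers quantization noise ~6 dB

def allocate_bits(signal_to_mask_db, bit_budget):
    """Greedy allocation: give each bit to the band with the most audible noise."""
    bits = [0] * len(signal_to_mask_db)
    # Max-heap (via negated keys) of the current noise-to-mask ratio per band.
    # With zero bits, the noise is taken to be as large as the signal.
    heap = [(-smr, band) for band, smr in enumerate(signal_to_mask_db)]
    heapq.heapify(heap)
    if not heap:
        return bits
    for _ in range(bit_budget):
        neg_nmr, band = heapq.heappop(heap)
        if -neg_nmr <= 0:
            break  # all remaining noise is already masked
        bits[band] += 1
        heapq.heappush(heap, (neg_nmr + DB_PER_BIT, band))
    return bits

frame = [40.0, 7.0, -15.0, 20.0]  # hypothetical signal-to-mask ratios in dB
print(allocate_bits(frame, bit_budget=10))
print(allocate_bits(frame, bit_budget=100))
```

With a tight budget, the loop spends every bit where it reduces audible noise the most. With a generous budget, it stops early once all noise is masked, which is the situation a variable-bit-rate encoder can exploit.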
We've only just begun
The codecs themselves are just as complex and interesting as the audio theory behind them. In a future article, I'll discuss how lossless and lossy compression take advantage of audio's attributes and how some well-known codecs trade off design constraints to suit their application objectives. Until then, check out the URLs listed in Table 2, references 1 through 12, and the sidebar "For more information."
Industry standardization delivers a number of benefits, not the least of which are the high quality and cost savings you obtain as a result of the large number of vendors that invariably pursue any standards-based business opportunity. However, just as closed-box designs are good candidates for proprietary wavelet-based video codecs, these systems don't necessarily need to use an industry-standard audio codec. By investigating some of the less well-known algorithms mentioned in this article, you might find one that better matches your requirements.
Even with an industry-standard codec, a lot of opportunity exists for vendor-specific optimizations. MPEG, for example, defines the compressed-audio-file format and therefore the decoder. But MPEG says nothing about the encoder and how it might handle various compression trade-offs. Double-blind listening tests have identified wide quality variances between different MPEG-1 Layer 3 (MP3) encoders at the same bit-rate setting, and decoder bugs sometimes degrade the results you hear. Defining a standard by documenting its compressed-file format also enables manufacturers to make incremental encoder improvements over time (as Sony has done with adaptive-transform-acoustic coding, for example) while remaining backward-compatible with first-generation decoders, thereby avoiding product obsolescence.
I'm not enamored with testing that uses sine waves, white noise, short tone spikes, or other unnatural sounds. Benchmarking methods such as these imply that the person doing the study has preconceptions of encoder and decoder weaknesses and has created artificial signals designed to expose these weaknesses. Also, although you could argue that music is nothing more than a combination of multiple sine waves, music's inherent variability does a good job of hiding sins that an infinitely repeating fixed pattern would reveal. Just as the still or slow-transition display of MPEG-2 frames might lead you to an excessively negative opinion of MPEG-2 video quality, testing audio codecs with inputs other than the voice, music, or other audio material that the codec will see in real life is of questionable value.
Author info
Contact Technical Editor Brian Dipert at 1-916-454-5242, fax 1-530-937-8147, bdipert@pacbell.net.
REFERENCES
1. Dipert, Brian, "Compression puts images on a diet," EDN, June 18, 1998, pg 71.
2. Eyre, Jennifer and Bier, Jeff, "DSPs court the consumer," IEEE Spectrum, March 1999, pg 47.
3. Gibson, Jerry D, et al, Digital Compression for Multimedia, Morgan Kaufmann Publishers, 1998, ISBN 1-55860-369-7.
4. Nelson, Mark and Gailly, Jean-Loup, The Data Compression Book: Second Edition, M&T Books, 1996, ISBN 1-55851-434-1.
5. Pohlmann, Ken C, Principles of Digital Audio: Third Edition, McGraw-Hill, 1995, ISBN 0-07-050468-7.
6. Ranada, David, "MP3 sound quality?," Stereo Review's Sound and Vision, April 1999, pg 40.
7. Ranada, David, "Download Showdown," Stereo Review's Sound and Vision, September 1999, pg 98.
8. Rao, KR and Hwang, JJ, Techniques and Standards for Image, Video and Audio Coding, Prentice Hall Professional Technical Reference, 1996, ISBN 0-13-309907-5.
9. Solari, Stephen J, Digital Video and Audio Compression, The McGraw-Hill Companies, 1997, ISBN 0-07-059538-0.
10. Steiglitz, Ken, A Digital Signal Processing Primer, with Applications to Digital Audio and Computer Music, Addison-Wesley Publishing Co, 1996, ISBN 0-8053-1684-1.
11. Strassberg, Dan, "Hello...is anybody out there?" EDN, Dec 23, 1999, pg 97.
12. Watkinson, John, The Art of Digital Audio: Second Edition, Focal Press, 1994, ISBN 0-240-51320-3.
ACKNOWLEDGMENT
SoftSound Ltd, which manufactures both lossless and lossy codecs, provided comprehensive information about this article's topic. Thanks to SoftSound's Tony Robinson for all his research assistance.

















