Feature
Digital audio gets an audition: Part two: lossy compression
Lossless compression can shrink file sizes down a fair amount, but for serious weight loss, you need to permanently discard some of the data. Find out how the lossy codecs work and whether you can hear the differences between them.
By Brian Dipert, Technical Editor -- EDN, 1/18/2001
Compressed file sizes aren't meaningful comparison points for lossy-compression algorithms, because the objective is always to encode to a specific bit rate (Table 1). The time it takes to encode to that bit rate for a given µP type and speed differs from one algorithm to another, though, as does the quality at that bit rate.
Quality is a user-, environment-, and application-dependent metric. Last year, my neighbor swore that he was unable to hear any differences when he listened to an original audio CD, a 128-kbps MP3 stream, a 64-kbps WMA stream, and a 16-kbps RealAudio stream through his PC's low-cost speakers. However, when I brought him to my house and played the same files on my PC's higher-end speaker set, he immediately understood the need for those files "that take so long to download." I generally resist the urge, therefore, to make quality comments on various codecs, though if one stands out as sounding particularly good or bad to my less-than-golden ears, I'll mention it.
Even my "slow" 533-MHz CPU can rapidly encode and decode a 30-sec test-tone clip. Therefore, for the performance-analysis portions of this lossy-compression project, I employ the same 19 songs used to evaluate lossless-compression algorithms (Table 2 from part one of this series). In addition to measuring performance, I also hope to reveal the presence of various lossy-compression techniques and their artifacts. The list of things I am looking for includes:
More on test tones
The white- and pink-noise clips I use in the lossless-compression study are also useful in my lossy-compression work. Equal-intensity noise channels, converted to a frequency-domain display via a spectrum analyzer, enable me to identify any lowpass, bandpass, or highpass filtering that a codec performs. Channels of differing intensities provide additional details, specifically whether the encoder is converting the source material from stereo to mono within certain frequency ranges.
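The differing-intensity idea can be illustrated with a small sketch. It is not the procedure used for the article's plots: it measures a single full-band level gap instead of a per-frequency spectrum, and the function names, the uniform noise source, and the simple averaging downmix are all assumptions made for illustration.

```python
import math
import random

def rms_db(samples):
    """RMS level in dB relative to a full-scale amplitude of 1.0."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

def make_noise_pair(n, right_gain_db=-20.0, seed=1):
    """Two independent white-noise channels; the right one is attenuated."""
    rng = random.Random(seed)
    gain = 10.0 ** (right_gain_db / 20.0)
    left = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    right = [gain * rng.uniform(-0.5, 0.5) for _ in range(n)]
    return left, right

def to_mono(left, right):
    """Hypothetical worst case: both output channels carry the same average."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    return mid, list(mid)

left, right = make_noise_pair(20000)
original_gap = rms_db(left) - rms_db(right)        # close to 20 dB
mono_left, mono_right = to_mono(left, right)
collapsed_gap = rms_db(mono_left) - rms_db(mono_right)  # exactly 0 dB

print(f"original gap:  {original_gap:.2f} dB")
print(f"after downmix: {collapsed_gap:.2f} dB")
```

A gap that shrinks toward 0 dB after a codec round trip hints at channel combining; repeating the measurement per frequency band would show where in the spectrum it happens.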
The human auditory system groups its detection of incoming audio information into a number of critical frequency bands, with most of the bands residing at less than 5 kHz (Reference 1 and Reference 2). Note that in Table 2, the bands' widths increase as the corresponding center frequencies rise. A structure in the inner ear called the organ of Corti translates incoming audio waves into nerve impulses. Its basilar-membrane width, thickness, stiffness, and hair-cell clustering define the critical-band frequency ranges and endpoints. What better way to continue my test-clip development, then, than by combining tones at the midpoints of each critical band?
Syntrillium Software's Cool Edit Pro, which costs roughly the same as Sonic Foundry's Sound Forge, includes a 64-track mixer that I use extensively. Cool Edit Pro enables me to create and combine precisely defined audio tones, as well as generate white, pink, and brown noise. Its time-based (oscilloscope) and frequency-based (spectrum-analyzer) output displays are more informative and have more robust features than those in Sound Forge.
All of my critical-band-derived sound clips have one channel 180° out of phase from the other, to give the encoder one more challenge to surmount and to enable me to look for phase collapse in the subsequent decoding. As with the pink- and white-noise clips, I created two versions of each file: one with both channels at equivalent amplitude, and the other with the left channel 20 dB "louder" than the right. To generate each file, I first created a number of single-tone sources sampled at 44.1 kHz with 32 bits per sample per channel, then mixed them together at 32-bit resolution in Cool Edit and attenuated the result to the desired maximum amplitude. I then needed to convert them to 16-bit equivalents.
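The tone-clip recipe above can be sketched in a few lines. The band edges below are the commonly cited Zwicker critical-band edges, with the top band extended to 22,050 Hz; that they match Table 2 is an assumption, although the three highest midpoints do agree with the 10,750-, 13,750-, and 18,775-Hz tones named later in this article. The helper names are hypothetical, and the 16-bit conversion uses plain rounding without dither because the article's actual conversion settings are not reproduced here.

```python
import math

# Assumed critical-band edges in Hz (Zwicker-style, top edge set to Nyquist).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500, 22050]
SAMPLE_RATE_HZ = 44100

def band_midpoints(edges):
    """Midpoint frequency of each adjacent pair of band edges."""
    return [(lo + hi) / 2.0 for lo, hi in zip(edges, edges[1:])]

def stereo_sample(t_sec, freqs, peak=0.5, right_gain_db=0.0):
    """One stereo sample of the tone mix; the right channel is inverted
    (180 degrees out of phase) and optionally attenuated."""
    mix = sum(math.sin(2.0 * math.pi * f * t_sec) for f in freqs) / len(freqs)
    left = peak * mix
    right = -left * 10.0 ** (right_gain_db / 20.0)
    return left, right

def to_int16(x):
    """Convert a float in [-1, 1] to a clamped 16-bit integer (no dither)."""
    return max(-32768, min(32767, int(round(x * 32767))))

midpoints = band_midpoints(BAND_EDGES_HZ)
print(len(midpoints), midpoints[-3:])

# First few samples of the unequal-amplitude version (left 20 dB louder).
for n in range(3):
    l, r = stereo_sample(n / SAMPLE_RATE_HZ, midpoints, right_gain_db=-20.0)
    print(to_int16(l), to_int16(r))
```

Dividing the mix by the number of tones keeps the sum inside full scale before the peak attenuation is applied, which mirrors the "mix, then attenuate" order described above.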
After discussions with both Syntrillium Software and audio consultant Arny Krueger, I chose the following sample-type-conversion settings:
Next, to test for frequency masking, I regenerated my critical-band midpoint mix, but this time mixed in 50 additional coincident tones, half of them at the one-quarter point across each critical band and the other half at the three-quarter point. The one-quarter- and three-quarter-point test tones were 20 dB quieter than their midpoint neighbors.
To test for temporal masking, I first determined that the pre-tone masking duration extended no further than 50 msec ahead of the masking tone, and the post-tone duration extended no more than 200 msec beyond the masking tone (Reference 3). Therefore, I again created my 30-sec midpoint tone combination. But this time, I preceded it with 50 msec of the same tonal mix, 20 dB quieter, and followed it with 200 msec of the same 20-dB-quieter mix.
Finally, to find pre- and post-echo noise around sharp audio transients, I turned to three tracks on the EBU SQAM disc: track 27 (castanets), track 32 (triangle), and track 35 (glockenspiel).
Although I created the noise and test tones in Cool Edit Pro, I switched to Sound Forge for the lossy-compression process because of its more comprehensive format support. Sound Forge version 4.5h can encode MP3, RealAudio G2, and Windows Media Audio 7 files. It can also decode MP3 files back to WAV, but licensing restrictions preclude it from supporting RealAudio and WMA decoding, forcing me to rely on RealNetworks' RealJukebox and Microsoft's command-line decoder, respectively. By encoding from the same WAV file to each of the three formats within an otherwise-identical software environment, I hope to measure the speed of the encoding algorithm as accurately as possible, with other system overheads canceled out.
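Returning to the two masking test clips described above, the following is a minimal sketch of their construction. The function names and the hard-stepped gain envelope are assumptions for illustration; the 9,500- to 12,000-Hz band used in the example is inferred from the 10,125-, 10,750-, and 11,375-Hz tones that the article mentions for one band.

```python
def quarter_points(band_lo_hz, band_hi_hz):
    """Quarter-, mid-, and three-quarter-point frequencies of one band."""
    width = band_hi_hz - band_lo_hz
    return (band_lo_hz + 0.25 * width,
            band_lo_hz + 0.50 * width,
            band_lo_hz + 0.75 * width)

def envelope_gain(t_sec, pre=0.050, body=30.0, post=0.200, quiet_db=-20.0):
    """Linear gain of the temporal-masking clip at time t_sec:
    a quiet lead-in, the full-level body, then a quiet tail."""
    quiet = 10.0 ** (quiet_db / 20.0)
    if t_sec < 0.0 or t_sec >= pre + body + post:
        return 0.0                      # outside the clip
    if t_sec < pre or t_sec >= pre + body:
        return quiet                    # 20 dB below the main mix
    return 1.0                          # main 30-sec tone mix

print(quarter_points(9500, 12000))
for t in (0.010, 1.0, 30.10, 30.30):
    print(t, envelope_gain(t))
```

Multiplying each sample of the midpoint mix by `envelope_gain` would yield the temporal-masking clip; adding the quarter- and three-quarter-point tones at one-tenth amplitude would yield the frequency-masking clip.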
The need for speed
I ran 19 song clips and 13 test tones through MP3 encoding 10 times, with each iteration combining one of five target bit rates with either a quality- or performance-optimized encoder configuration. I ran them through WMA encoding four times and through RealAudio encoding twice. I also ran each resultant MP3 file through the decoder built into Sound Forge. That's 512 total encoder runs, 320 MP3 decoder runs, a whole lot of mouse clicks, and a whole lot of time spent staring at a computer monitor.
Fortunately, Sound Forge supports batch-mode capability and gives you the option to create a log file that captures time to encode and decode. For Windows Media Audio 7 decoding, I used a DOS-command-line utility that Microsoft supplied me. I was unable to figure out how to capture to a file the time-to-decode message displayed on-screen, so I manually logged each displayed value as the batch file ran. RealAudio G2 decoding uses RealJukebox's convert-to-WAV capability, and I referenced the "created" and "modified" time/date stamps, which are viewable through Windows Explorer, to determine decode time.
In analyzing the encode- and decode-performance-testing results, several trends are evident (Table 3). (You can view the results for all 19 music genres in PDF format.) Look at the disparity in encoding times between MP3's "fastest encode" and "highest quality" settings, even at the same bit rate; 64 kbps deviates from the general trend, but a good reason for this anomaly exists. The MP3 encoder, when set to 64 kbps, down-samples the original 44.1-kHz material to 22.05 kHz and severely lowpass filters out the upper portion of the frequency spectrum. These alterations ensure that the encoder has less source data to work with and at least partially explain why the "highest quality" and "fastest encode" results are more alike at this bit rate.
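The run totals and the effect of the 64-kbps down-sampling can be checked with a little arithmetic. The sketch below only restates figures from the text; the assumption that "kbps" means 1,000 bits per second, and the helper names, are mine.

```python
SONG_CLIPS = 19
TEST_TONES = 13
CLIPS = SONG_CLIPS + TEST_TONES            # 32 source files

MP3_PASSES = 5 * 2        # five bit rates x two encoder settings
WMA_PASSES = 4
REALAUDIO_PASSES = 2

encoder_runs = CLIPS * (MP3_PASSES + WMA_PASSES + REALAUDIO_PASSES)
mp3_decoder_runs = CLIPS * MP3_PASSES

def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0

def stream_bytes(bit_rate_kbps, seconds):
    """Payload size of a constant-bit-rate stream (1 kbps = 1000 bits/s)."""
    return bit_rate_kbps * 1000 * seconds // 8

def pcm_bytes(sample_rate_hz, channels, bytes_per_sample, seconds):
    """Size of uncompressed PCM audio, ignoring the WAV header."""
    return sample_rate_hz * channels * bytes_per_sample * seconds

print(encoder_runs)             # 512
print(mp3_decoder_runs)         # 320
print(nyquist_hz(22050))        # 11025.0
print(stream_bytes(64, 30))     # 240000
print(pcm_bytes(44100, 2, 2, 30) / stream_bytes(64, 30))
```

After down-sampling to 22.05 kHz, nothing above 11.025 kHz can survive regardless of any further filtering, so the encoder's 64-kbps job starts with at most half of the original bandwidth.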
Also, notice that MP3 "highest quality" encoding to 192 kbps is actually faster than encoding to 160 kbps. Although the encoder is generating more compressed data at the higher bit rate, this trade-off gives an overall performance benefit: The encoder need not work so hard at 192 kbps to squeeze the data down while maintaining quality. This result also suggests that, thanks to a fast hard drive and DRAM, the additional system overhead that my PC needs to store the larger compressed bit stream is an insignificant factor in the results. In other words, I was actually measuring the encode speed.
Table 2 in part one of this article lists the songs I used for each music genre, their duration, and their uncompressed WAV sizes. Match this information with that of Table 3 in this article, and you'll find that, as with lossless compression, songs of similar duration but different genres sometimes have significantly different encoding delays. This trend indicates that some types of music are "harder" to compress to a given bit rate and quality than others, and it validates the hunch that prompted me to do all this work in the first place! The results make sense: Compare a techno track to spoken word, for example, and you'll find that the techno track has a broader meaningful frequency spectrum, increased high-frequency content, greater channel-to-channel variation in both amplitude and phase, and more abrupt transients.
Evaluate WMA against MP3, particularly in the context of the quality results that follow, and WMA will probably impress you. As a general rule (with a few exceptions), the WMA encoder's performance approximates that of the MP3 encoder set to "fastest encode," while its quality at least matches (and, at lower bit rates, exceeds) that of MP3 files created using the "highest quality" setting. RealAudio's encoder speed is approximately the same as that of WMA and MP3 set to "fastest encode," but the quality news isn't so good.
On both test tones and music tracks, RealAudio files consistently sounded the worst and contained the largest number of lossy-compression artifacts.
And what about decoders? In all three cases, their speed scaled with the bit rate of the file they were decoding. (More bits to decode means a slower decoding speed, all other factors being equal.) At 64 kbps, MP3 decoding runs much faster than the other two decoders, but remember that the encoder had previously halved the sample rate, halving the size of the resulting decoded WAV file and giving the MP3 decoder a significant built-in speed advantage. At greater than 64 kbps, MP3 and WMA decoder speeds were comparable. Poor RealAudio, though, was consistently slower than its peers, roughly twice as slow on average. Be careful when drawing definitive conclusions here. I used three decoding-software packages, so some of these differences may be the result of factors other than the decoding algorithms themselves.
Now let's see how well the test tones unveil the secrets behind the lossy codecs' magic. First, look at a spectrum-analyzer (frequency-sweep) plot of original sound clip 2 (Figure 1a), along with its 64-kbps MP3 (Figure 1b), RealAudio (Figure 1c), and WMA (Figure 1d) counterparts. This diagram, and all subsequent MP3 diagrams, show the output of the encoder set to its "fastest encode" setting. As you examine the data that follows, as well as the additional information in this article's Web site addendum, compare the trade-offs that codec developers made at each compressed bit rate, such as encode and decode speed, noise floor versus frequency, overall frequency range, and type and amount of various artifacts.
As expected, the original file shows content extending to 22.05 kHz; the summation of the left channel's frequency components is 20 dB "louder" than the right. (The left channel appears in aqua, and the right channel appears in violet.)
Also, notice the negative slope of both channels' amplitude-versus-frequency plots. This negative slope occurs because pink noise, which contains equal audio energy in each octave, places proportionally more of its content at low frequencies than at high frequencies. A white-noise graph, in contrast, would show a flat amplitude slope versus frequency.
Now, compare the original plot to the MP3 graph. Two things are immediately evident. First, the upper end of the MP3-encoded frequency range terminates at just greater than 10 kHz, meaning that the encoder has lowpass filtered and discarded all information above this point. One reason the encoder does this filtering and discarding is that it makes the highly dubious assumption (at this chosen cutoff frequency) that many of us would be unable to hear high-frequency content above this point even if it existed. Another reason is that compression algorithms work best if they can reduce sample-to-sample variation. For audio, this variation is most significant at high frequencies.
Second, notice that the amplitude deviation between the two channels is much less pronounced in MP3 than in the original, particularly at high frequencies. This trend indicates that the encoder is doing a frequency-dependent, partial stereo-to-mono conversion to reduce channel-to-channel differences and consequently simplify its job.
Compared with MP3, RealAudio looks pretty good. The frequency response extends quite a bit higher, past 16 kHz, and the channel-to-channel amplitude difference is better preserved across the entire frequency range.
Finally, take a look at WMA. Of the three lossy codecs, WMA delivers the widest frequency response, and the left channel looks pretty good. But what about that right channel? Much of the frequency detail has been altered and discarded.
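The negative slope of the pink-noise traces noted above follows from the equal-energy-per-octave property. A minimal numerical sketch, assuming an idealized power density proportional to 1/f (the function names and the 1-kHz reference are illustrative choices, not part of the article's measurements):

```python
import math

def pink_density_db(f_hz, ref_hz=1000.0):
    """Idealized pink-noise power density (proportional to 1/f),
    in dB relative to the density at ref_hz."""
    return 10.0 * math.log10(ref_hz / f_hz)

def octave_energy(f_lo_hz, steps=10000):
    """Midpoint-rule integral of 1/f over the octave [f_lo, 2*f_lo]."""
    width = f_lo_hz / steps
    return sum(width / (f_lo_hz + (i + 0.5) * width) for i in range(steps))

# Every octave holds the same energy (ln 2 in these units) ...
print(octave_energy(100.0), octave_energy(10000.0), math.log(2.0))

# ... so the density must fall by about 3 dB each time frequency doubles.
print(pink_density_db(2000.0) - pink_density_db(1000.0))
```

White noise, by contrast, has a constant density, so each successive octave holds twice the energy of the one below it and the spectrum-analyzer trace is flat.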
Noise files provide useful data on how the compression algorithm works, but their results don't necessarily correlate with how real-life compressed audio sounds. So don't reject WMA quite yet. Also, because the files contain random noise, they tend to obscure the subtle alterations that the codecs make.
So, next analyze sound clip 6. First look at the original file (Figure 2a). As expected, it contains 25 distinct tones; both the left and right channels are at uniform amplitudes across frequency, and the right channel is 20 dB below the left. No tone information exists between the 25 critical-band midpoints; the noise floor is –120 dB.
The MP3 file looks ugly (Figure 2b). Notice again the lowpass filtering: The last three tones in the original file (10,750; 13,750; and 18,775 Hz) are now missing. Also notice the suppressed amplitude difference between the left and right channels even at low frequencies and how this difference further diminishes as frequency increases. Finally, and perhaps most obviously, look at all the added noise clustered around each of the original tones. Pragmatically, it looks worse than it is; at –80 dB it's not very audible, particularly outside the human auditory system's 2- to 5-kHz "sweet spot."
As the prior pink-noise results predicted, RealAudio has better-preserved channel separation and frequency response (Figure 2c). (The 13,750-Hz tone survived compression; the 18,775-Hz tone did not.) But at what trade-off? Here, the noise floor extends at times above –60 dB, just a few decibels below the "real" right-channel information.
WMA compression (Figure 2d), in contrast, delivers clean stereo separation and wide frequency response. (Even the 18,775-Hz tone made it through.) Its noise floor, at no greater than –80 dB, is comfortably below the levels of even the right-channel tones. WMA seems to like critical-band midpoints much more than pink noise.
The mask
Next, look at test tone 8.
First, a spectrum-analyzer display of the original file clearly shows the quarter-, mid-, and three-quarter-band tones, with the quarter- and three-quarter-band tones 20 dB down (in both channels) from the midpoint tones (Figure 3a). Now look at the MP3 version, and you'll see little evidence that the algorithm has done any frequency masking (Figure 3b). The encoder algorithm did not eliminate any of the quarter- and three-quarter tones, at least the ones that survived the lowpass filter. Note that the 10,125-Hz quarter tone made it through the lowpass filter, but the corresponding 10,750-Hz midpoint tone and 11,375-Hz three-quarter-band tone did not.
The RealAudio graph is a mess (Figure 3c). From the frequency plot, you can't distinguish a distorted quarter- or three-quarter-tone from unwanted noise. In test tone 8, I intentionally set the amplitude of the original file's left-channel quarter- and three-quarter-tones identical to the amplitude of the right channel's midtones. I suspect this amplitude and tone combination didn't simplify the encoder's job, although it appears that, as with test tone 6, the midtones in both channels survived the encoding process pretty well. The additional and altered content in between the midtones causes the problems.
What about WMA (Figure 3d)? Remember that with the pink-noise file, the left channel survived pretty much unscathed, but the right channel came out looking very different from its original state. A similar phenomenon happened here. The quarter- and three-quarter tones of the left (louder) channel remain intact. But right-channel quarter- and three-quarter-tones, particularly below critical band 18, are nonexistent; the disappearing act is most obvious with critical band 0 data. Keep in mind that this artifact, as is the case with many artifacts I find in this study, isn't necessarily "bad"; frequency-masking theory dictates that even if the quarter- and three-quarter-tone data remains, you might be unable to hear it.
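The MP3 observation above (the 10,125-Hz tone survives, the 10,750-Hz tone does not) brackets the encoder's lowpass cutoff. The sketch below shows that reasoning only; it assumes an idealized, steep lowpass filter, and the survival table is transcribed from the text, not measured.

```python
def cutoff_bounds(tone_survival):
    """Given {frequency_hz: survived?}, return (highest surviving tone,
    lowest removed tone). An ideal lowpass cutoff lies between the two."""
    survived = [f for f, ok in tone_survival.items() if ok]
    removed = [f for f, ok in tone_survival.items() if not ok]
    return max(survived), min(removed)

# 64-kbps MP3 results as described in the text for the top three bands.
mp3_64k = {
    10125: True,    # quarter-point tone, passed
    10750: False,   # midpoint tone, removed
    11375: False,   # three-quarter-point tone, removed
    13750: False,
    18775: False,
}

low, high = cutoff_bounds(mp3_64k)
print(f"cutoff lies between {low} Hz and {high} Hz")
```

That 625-Hz window agrees with the pink-noise plot, in which the MP3 response ends at just greater than 10 kHz.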
I mentioned earlier that the original quarter- and three-quarter-tone data seemed to survive MP3 encoding. But the MP3 compression algorithm wasn't immune from sound-altering behavior with test clips 7 and 8. Look at the additional, slowly decaying amplitude in both channels in the first few hundred milliseconds of the MP3-compressed version of test-tone clip 7 (Figure 4a), representing increased volume that is absent from the original WAV file (Figure 4b). In each plot, the left channel is on top and the right channel is below it. For an even stranger oscilloscope plot, look at the results from test-tone clip 8 (Figure 4c and Figure 4d), in which the increased amplitude in the left channel corresponds with decreased amplitude in the right channel. Similar MP3 behavior occurred with some of the other test tones, although not to this extreme. Neither RealAudio nor WMA exhibited similar behavior.
My initial attempts to uncover temporal masking were unsuccessful, but they did reveal other strange encoder and decoder behavior. Take a look at the first 200 msec of test tone 9 (Figure 5a). If temporal masking had occurred as part of lossy compression, you would see an interval with reduced amplitude or a completely silent interval in the lossy-compression clips just prior to the onset of the "normal" audio material (at the 50-msec point in the original WAV file). None of the MP3 (Figure 5b), RealAudio (Figure 5c), or WMA (Figure 5d) versions of the test tone exhibits such masking evidence. Also note that all three lossy codecs appear to have preserved at least some of the channel-to-channel phase differences present in the original; one channel is a mirror image of the other.
You should, however, notice a couple of odd occurrences in Figure 5. First, see how the MP3 algorithm significantly attenuates the original signal, whereas the RealAudio and WMA clips are as "loud" as the original version.
Also, notice that MP3 inserts a 55-msec initial silent gap, created by filter-bank delay, in its compressed version of the test tone, and WMA inserts a 45-msec gap. These gaps are not present in the original, and RealAudio does not insert them at the start of the clip. RealAudio's gap addition occurs at the tail end of the sound clip. Comparing Figure 6a with Figure 6c, RealAudio inserts 1.385 sec of silence at the end of the test tone. The 50-msec gap added at the MP3 version's back end (Figure 6b) is smaller than RealAudio's but still present, and WMA (Figure 6d) sticks an even smaller 30-msec gap at the end of the test tone.
Why didn't I find temporal masking? Keep in mind that with all of these test tones, I chose specific frequencies, as well as specific masked- and masking-tone amplitudes. Changes in any of these source variables can trigger temporal masking or any other lossy-compression technique, as can compressing to a different bit rate or compressing with a different combination of encoder settings. For example, versions of the Fraunhofer MP3 "engine" in some software packages enable you to select whether to allow the encoder to use channel-combining joint-stereo techniques; Fraunhofer MP3 encoder versions in other products don't give you this customization option.
Finally, let's look for echo artifacts. First, a quick review of what causes echo in the first place might be helpful. One of the first steps that nearly all lossy-compression audio algorithms (as well as lossy codecs for still images, such as JPEG, and video, such as MPEG) take involves converting a group of contiguous samples (called a frame) from their time-domain representation to the frequency domain. This process is analogous to the algorithm my computer uses to create the spectrum-analyzer plots in this article. Once in the frequency domain, the encoder decides which portions of the frame's data are inaudible and, therefore, appropriate to diminish in importance or even discard.
This culling process can inject quantization noise and other undesirable data into the frame. The corresponding frequency-to-time retransformation within the decoder spreads this noise throughout all of the frame's samples. Ordinarily, the noise isn't a big deal; the "real" audio data covers it up. Similarly, temporal masking can hide noise injected after a sharp audio transient (a tap on a cymbal or a handclap, for example). But prior to a transient, the "real" audio information is subdued or, in the worst case, silent. Pre-echo not only smears transients, it also injects annoying hiss into the previously quiet gaps ahead of transients, hiss that temporal masking only partially hides.
My wife, a Cuban music aficionado, listened to the uncompressed and lossy-compressed versions of the castanets in EBU SQAM track 27. Even with my PC's low-quality speakers and without my prompting, she immediately pointed out the pre-echo noise in the 64-kbps RealAudio file (Figure 7c). This echo doesn't exist in the original WAV (Figure 7a), lossy-compressed MP3 (Figure 7b), and WMA (Figure 7d) files. The added noise prior to the onset of the transient in the RealAudio file should be obvious to your eyes. And it'll be obvious to most ears, too.
In fairness to RealAudio, different codecs use different frame sizes for their time-to-frequency transforms, so other types of transients, or those occurring at other points in time, might cause the other codecs problems, too. More advanced codecs minimize pre-echo effects by supporting multiple frame sizes. They use less efficient, smaller frames when the encoder detects a transient and longer frames for more conventional material.
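The pre-echo mechanism and the benefit of shorter frames can be reproduced in a toy sketch. It is not any real codec's pipeline: it substitutes a plain DFT and a uniform quantizer for the filter banks and psychoacoustic models that actual encoders use, and the frame lengths, step size, and function names are arbitrary illustrative choices.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spec):
    """Naive inverse DFT, keeping only the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def quantize(spec, step):
    """Round each coefficient to a multiple of step (the lossy part)."""
    return [complex(round(c.real / step) * step,
                    round(c.imag / step) * step) for c in spec]

def roundtrip(signal, frame_len, step):
    """Encode and decode the signal one frame at a time."""
    out = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        out.extend(idft(quantize(dft(frame), step)))
    return out

def pre_echo_peak(signal, onset, frame_len, step):
    """Largest decoded amplitude before the transient begins."""
    return max(abs(v) for v in roundtrip(signal, frame_len, step)[:onset])

ONSET = 48
signal = [0.0] * 64                       # silence ...
signal[ONSET:ONSET + 3] = [1.0, -0.6, 0.3]  # ... then a sharp transient

long_frames = pre_echo_peak(signal, ONSET, frame_len=64, step=0.25)
short_frames = pre_echo_peak(signal, ONSET, frame_len=16, step=0.25)
fine_step = pre_echo_peak(signal, ONSET, frame_len=64, step=0.001)

print(f"64-sample frame, coarse step: {long_frames:.5f}")
print(f"16-sample frames, coarse step: {short_frames:.5f}")
print(f"64-sample frame, fine step:   {fine_step:.7f}")
```

With one long frame, quantization error lands in the silent samples ahead of the transient. With short frames, the silent frames contain nothing to quantize, so the error stays inside the frame that holds the transient, which is the motivation for switching frame sizes around transients.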
Author Information
REFERENCE
ACKNOWLEDGMENT
Audio consultant Arny Krueger, who maintains the PCABX (www.pcabx.com) and PC AV Tech (www.pcavtech.com) Web sites, provided me with extensive and much-appreciated assistance throughout my entire project. I'm also grateful for the guidance and suggestions of Jim Johnston of AT&T; Eric Benjamin, Andrew Fischer, Ken Gundry, and David Robinson of Dolby Labs; Sean Alexander and Amir Majidimehr of Microsoft; Rebecca Grow and others at Sonic Foundry; Nariman Sodeifi at Syntrillium Software; and numerous employees at Fraunhofer. This article ran on page 87 of the January 18, 2001 issue of EDN.














If double-digit compression is your goal, lossy codecs are the only way to go.
Contact Technical Editor Brian Dipert at 1-916-454-5242, fax 1-530-937-8147, e-mail 
