Design Feature: January 4, 1996
The introduction of the digital-audio compact disc several years ago established a new standard in audio recording and reproduction quality. The wide and rapid acceptance of digital audio is indicative of public demand for high-quality sound. The industry is presently undergoing a similar transition from analog to digital in video. Two vital technologies bringing about this change are digital image resizing and video compression. These technologies are working together to make the transition to digital video possible.
Teleconferencing is one of the hottest new applications for digital video. The ability to interact visually with people around the world has been gaining rapid acceptance in the corporate world, and image resizing and video compression play key roles in this technology. Digital video has also found a home in many other diverse applications. Interactive CD-ROMs, which store hundreds of video clips, are considered standard equipment in today's desktop computer systems. Video-in-a-window applications are becoming available to the public at affordable prices. And many of today's computer games now use full-motion video to give games an extra sense of realism.
All of these applications share common obstacles. For practical reasons, the images used in these applications are generally resized to a fixed, and often smaller, image size. Users, however, want the ability to resize video images back to a comfortable viewing size. For viewing purposes, this resizing requires high-quality image scaling for both encoding and decoding at each end of the process. It is vital that the image resizing process preserve image integrity and not introduce visible distortion.
Whether it be telephone-line bandwidth, hard-disk space, CD-ROM data-transfer rates, or video-bus bandwidth, all digital video applications must deal with the fact that quality digital video requires an enormous amount of bandwidth. A short 10-sec video clip chews up hundreds of Mbytes on your hard-disk drive and requires tens of Mbytes/sec of bandwidth. Even today's seemingly vast storage and transmission capacities require some form of digital video-information compression. Quality video resizing with proper filtering is also a key factor in video compression.
Resizing changes sampling rate
Resizing changes the size of an image (the number of pixels used to represent the image) to an arbitrary target size, effectively changing the sampling rate of the image. Therefore, resizing and resampling digital video are the same process. The difficulty in resizing and, in particular, shrinking video is that you want to preserve as much of the source information as possible. At the same time, you want to properly limit the bandwidth of the signal to comply with the Nyquist sampling criterion. Accomplishing this correctly requires some digital filtering. (See box, "DSP basics.")
Unfortunately, proper filtering has historically been impractical due to memory-storage requirements. Horizontal filtering is relatively straightforward, because video is transmitted on a line-by-line basis. The filter delay elements, therefore, are simply registers. Vertical filtering, on the other hand, typically requires an expensive line store for each filter tap. Due to this memory limitation, resizing has generally been accomplished with little or no filtering, resulting in poor image quality and aliasing distortions. This poor resizing has a compounded negative effect on video compression. Two incorrect shrinking methods commonly used are pixel dropping and bilinear interpolation.
Pixel dropping selects nearest neighbor
Pixel dropping selects those source pixels that most closely line up with the output pixel grid. First, calculate the distance between target pixels with respect to source pixel spacing, determining the source pixels nearest to the desired target pixels. This value, Tinc, is found by dividing the source gaps by the target gaps, with gaps referring to the spaces between pixels:
Tinc=((Source Pixels)-1)/((Target Pixels)-1).
Figure 1 demonstrates this algorithm. Here, Tinc=(5-1)/(4-1)=4/3. T1 lines up with S1, so T1 becomes the value of S1. T2, on the other hand, is 1/3 of the distance from S2 and 2/3 of the distance from S3. Because T2 is closer to S2, T2 becomes the value of S2. By following this algorithm, you will notice some of the source pixels do not contribute to the output. In fact, the value of S3 has not been used at all, and, as a result, information is lost. This means thin lines in the source image may be thrown out. Pixel dropping is extremely phase sensitive, meaning that the algorithm may or may not drop a line, depending on the starting pixel. In video, this may result in line flicker, a highly noticeable and undesirable effect.
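The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a hardware description; the function name pixel_drop and the round-half-up tie rule are assumptions made for the example.

```python
def pixel_drop(source, target_count):
    """Nearest-neighbor shrink of one row of pixels (pixel dropping)."""
    if target_count < 2 or len(source) < 2:
        raise ValueError("need at least two source and two target pixels")
    # Tinc = (source pixels - 1) / (target pixels - 1), as defined in the text.
    t_inc = (len(source) - 1) / (target_count - 1)
    picked = []
    for t in range(target_count):
        position = t * t_inc           # target position in source-pixel units (0-based)
        nearest = int(position + 0.5)  # assumed tie rule: round half up
        picked.append(nearest)
    return [source[i] for i in picked], picked


row = ["S1", "S2", "S3", "S4", "S5"]
values, indices = pixel_drop(row, 4)
print(values)  # S3 is never selected for the 5-to-4 case
```

Shifting the starting pixel changes which source pixels survive, which is the phase sensitivity the text describes.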
DSP basics
One of the key concepts in DSP is the sampling of a continuous time signal. This sampling converts an analog waveform to a digital signal by measuring the analog signal's amplitude at discrete time intervals. The frequency at which the sampling occurs is called the sampling frequency and is generally denoted as Fs. The sampling of a signal has a direct effect on the signal's frequency spectrum. The Fourier transform provides a means of switching from the time domain to the frequency domain. It represents a signal as a weighted sum of sinusoids of all frequencies.
A fundamental DSP theorem governs whether the harmonics of a digital signal overlap and whether you can reconstruct the analog waveform from its digital samples. The Nyquist sampling theorem states that if a signal x(t) is bandlimited to B, then you can reconstruct x(t) from its samples xs[n], if you sample x(t) at Fs>2B. The frequency B is called the Nyquist frequency, and Fs>2B is called the Nyquist criterion.
Digital filtering
Use digital filters to manipulate the frequency spectra of digital signals, for example, to resample the signal. Typically, digital filtering has been accomplished using finite-impulse-response (FIR) filters. An FIR filter takes a coefficient-weighted average of a finite number of source pixels, called taps, to calculate the filtered target pixel. In general, the more taps, the sharper and flatter the filter response.
The pixel-dropping method represents resampling with no filtering at all. This method causes visual distortions. As basic DSP theory tells you, without proper filtering, you get aliasing. (Figure 2 illustrates the aliasing effect. Figure 2a shows the source spectrum.) As you decrease the sampling rate, Fs moves down in frequency to the left, to Fs'. This causes the spectra of the baseband signal and the first harmonic to overlap, as shown in Figure 2b. The energy from the first harmonic corrupts the baseband spectrum, resulting in aliasing.
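A small numeric sketch can make the overlap concrete. It assumes frequencies are measured in cycles per source pixel and sampling rates in samples per source pixel; the helper name alias_frequency is hypothetical. Dropping every other pixel halves the rate (Fs'=0.5), and a detail at 0.4 cycles/pixel then becomes indistinguishable from one at 0.1 cycles/pixel.

```python
import math


def alias_frequency(f, fs):
    """Apparent frequency of a tone at f when sampled at rate fs with no prefilter."""
    folded = f % fs
    return min(folded, fs - folded)


# At the source rate (Fs = 1 sample per pixel) the tone is represented correctly.
print(alias_frequency(0.4, 1.0))
# After dropping every other pixel (Fs' = 0.5) it folds down to about 0.1.
print(alias_frequency(0.4, 0.5))

# Check with actual samples: keep every second pixel of each tone.
dropped = [math.cos(2 * math.pi * 0.4 * n) for n in range(0, 16, 2)]
low_tone = [math.cos(2 * math.pi * 0.1 * n) for n in range(0, 16, 2)]
same = all(abs(a - b) < 1e-9 for a, b in zip(dropped, low_tone))
print(same)
```

Once the two sample sets coincide, no later processing can tell the original detail from the false low-frequency pattern, which is why the filtering has to happen before the rate is reduced.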
Despite the scaled image's poor quality, pixel dropping is a commonly used method of shrinking video. Its main advantage is its ease of implementation: It doesn't require storage elements. Pixel dropping is well-suited for low-end video-resizing applications; however, due to the lack of spatial filtering and the presence of aliasing, pixel dropping is unacceptable for video compression applications.
Bilinear interpolation is a weighted average
Bilinear interpolation is a popular technique that compromises image quality in exchange for reduced circuitry. Linear interpolation between the two closest source pixels or lines yields the output pixel. Basically, it is a weighted average between two pixels, S1 and S2. If the target pixel, T, lies between S1 and S2 at a distance α from S1 (and therefore 1-α from S2), then by linear interpolation:
T=α×S2+(1-α)×S1.
Bilinear interpolation does provide some filtering suited to upsampling and magnifying; but as a decimation filter, bilinear has a poor response.
Consider Figure 3a. Once again Tinc=4/3 and T1 lines up with S1, so T1=S1. (Specifically, T1=1×S1+0×S2.) T2, if you recall, is 1/3 of the distance from S2 and 2/3 of the distance from S3; so, T2=2/3×S2+1/3×S3. In this case, you can see that, unlike pixel dropping, all source pixels contribute to the target pixel stream. Because bilinear interpolation looks at more source pixels, it performs better than pixel dropping. However, the bilinear method's limited filtering is insufficient, particularly for shrinks of 50% or more. Figure 3b shows a 50% shrink where some of the source pixels contribute to the output and some pixels are dropped. Specifically, you never use the values of source pixels S2 and S5 to calculate the resized image.
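The two cases above can be checked with a short Python sketch that lists, for each target pixel, which source pixels it uses and with what weight. The function names are hypothetical, pixel indices are 1-based to match S1, S2, and so on, and exact fractions are used so that the weights come out as 1/3 and 2/3 rather than rounded decimals.

```python
from fractions import Fraction


def bilinear_weights(source_count, target_count):
    """For each target pixel, return {source index: weight} (1-based indices)."""
    weights = []
    for t in range(target_count):
        # Exact target position in source-pixel units (0-based): t * Tinc.
        position = Fraction(t * (source_count - 1), target_count - 1)
        left = min(int(position), source_count - 2)  # left neighbor, 0-based
        alpha = position - left                      # distance from the left neighbor
        taps = {left + 1: 1 - alpha, left + 2: alpha}
        weights.append({s: w for s, w in taps.items() if w != 0})
    return weights


def unused_sources(source_count, target_count):
    """Source pixels (1-based) that contribute to no target pixel."""
    used = set()
    for taps in bilinear_weights(source_count, target_count):
        used.update(taps)
    return sorted(set(range(1, source_count + 1)) - used)


print(bilinear_weights(5, 4)[1])  # weights for T2 in the 5-to-4 case
print(unused_sources(5, 4))       # every source pixel is used
print(unused_sources(6, 3))       # the 6-to-3 shrink skips two source pixels
```

Because each output looks at only two neighbors, any shrink whose target spacing exceeds two source pixels must skip some inputs, regardless of how the weights are chosen.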
Bilinear interpolation requires only one line store for vertical resizing and is, therefore, a popular scaling method. Images reduced with bilinear interpolation are still choppy and phase sensitive. In video, these distortions appear as moving patterns and shimmering. The bilinear method also suffers from aliasing and, therefore, performs poorly with compression. It has the same problems as pixel dropping, including phase sensitivity, but to a lesser degree.
Textbook-correct resizing fits the bill
Textbook-correct resizing refers to proper scaling based on traditional multirate DSP theory. High-quality resizing should be smooth and continuous; it should work for any resize factor and should scale image-frequency content proportionately to the image size.
Arbitrarily resizing an image requires a two-step process of interpolation and decimation (Figure 4). L/M, where L and M are integers, determines the resize factor. You determine the factor L/M by reducing the ratio of target gaps to source gaps.
L/M=((Target Pixels)-1)/((Source Pixels)-1),
F'=(L/M)×F,
where F is the old sampling rate and F' the new one.
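As a quick check of this ratio, the sketch below reduces target gaps over source gaps to lowest terms. The function name resize_factor is an assumption for illustration; the reduction itself is just a greatest-common-divisor step.

```python
from math import gcd


def resize_factor(source_pixels, target_pixels):
    """Return (L, M) with L/M = (target - 1)/(source - 1) in lowest terms."""
    l = target_pixels - 1
    m = source_pixels - 1
    g = gcd(l, m)
    return l // g, m // g


print(resize_factor(6, 3))     # a 6-to-3 shrink
print(resize_factor(100, 99))  # a very slight shrink needs a large L
```

Note that L is also the multiple of the source rate at which the intermediate data runs, so resize factors close to 1 tend to produce the largest intermediate rates.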
The first step, interpolation, increases the sampling rate F by the factor L to give an intermediate sampling rate of
I = L×F,
where I is the intermediate sampling rate.
The intermediate sampling rate is obtained by interpolating L-1 pixels in between the source pixels. Figure 5a shows the source frequency spectrum. Upsampling by a factor L leaves the spectrum unchanged but raises the sampling rate to I. Figure 5b shows the unwanted imaging frequencies that need filtering using an interpolation filter, hI. Figure 5c shows the resulting post-interpolation frequency spectrum.
The next step, decimation, decreases the intermediate rate I by the factor M to give the desired target sampling rate F':
F'=I/M=(L/M)×F.
Figure 6a shows the intermediate frequency spectrum and the required decimation filter, hD. You must filter the intermediate pixels to bandlimit the data by a factor 1/M as in Figure 6b. This is necessary to prevent frequency-band overlap and to avoid aliasing. Once filtered, you can safely reduce the sampling rate from I to F' by removing M-1 samples between the desired output pixels (Figure 6c).
Because both the interpolation (hI) and decimation (hD) filters operate on the same intermediate data rate, I, you can combine the two filters to a composite interpolation-decimation filter.
Figure 7 shows textbook-correct resizing for the previous 50% shrink case where the bilinear method drops pixels. The correct method properly handles this case. You shrink from 6 to 3 pixels; so, L=3-1=2, and M=6-1=5. To interpolate by a factor of two, you must interpolate one intermediate pixel between each pair of source pixels.
After interpolation, you need to decimate the intermediate stream by a factor of 5. Generally, for greater decimation, you require more decimation filter taps. Thus, even for greater shrink factors, the correct method still uses all source pixels in calculating the target pixels. In this case, you use an 11-tap filter that passes only the lower one-fifth of the frequency spectrum, removing the upper four-fifths of the frequency content from the signal.
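The interpolate-filter-decimate sequence for this 6-to-3 case can be sketched in plain Python as follows. It is a minimal illustration under stated assumptions, not the filter design used in any product: the taps come from a Hamming-windowed sinc (an assumed design), only the output samples that survive decimation are computed, and each output is normalized by the sum of the taps that fall on real pixels so that flat areas and image edges keep their level.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc; cutoff is a fraction (0..1) of the intermediate band."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def resize_row(source, l, m, num_taps=11):
    """Resize one row by L/M: zero-stuff by L, low-pass filter, keep every M-th sample."""
    # Step 1: interpolation. Insert L-1 zero samples between source pixels.
    inter = []
    for i, pixel in enumerate(source):
        inter.append(float(pixel))
        if i < len(source) - 1:
            inter.extend([0.0] * (l - 1))
    # Composite filter: the narrower of the interpolation and decimation bands.
    taps = lowpass_taps(num_taps, min(1.0 / l, 1.0 / m))
    half = num_taps // 2
    # Step 2: decimation. Evaluate the filter only at every M-th intermediate sample.
    out = []
    for center in range(0, len(inter), m):
        acc = 0.0
        weight = 0.0
        for j, h in enumerate(taps):
            idx = center + j - half
            # Only positions holding real pixels carry data; stuffed zeros add nothing.
            if 0 <= idx < len(inter) and idx % l == 0:
                acc += h * inter[idx]
                weight += h
        out.append(acc / weight)  # assumed per-output normalization
    return out


row = [10, 20, 30, 40, 50, 60]
print(resize_row(row, 2, 5))  # 6 pixels in, 3 pixels out
```

Feeding in a row that is zero everywhere except S2 or S5 gives a nonzero output, so the pixels that the bilinear method skipped now contribute to the result.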
Textbook-correct resizing is difficult to implement, for various reasons. Each resize factor theoretically requires a different set of interpolation and decimation filters with different sizes and coefficients. The method has to support all coefficients, either stored in memory or calculated for each resize factor. Good-quality filters require many taps, corresponding multipliers, and delay elements. In the vertical direction, classical filter architectures require expensive line stores to implement each filter tap, greatly limiting the number of vertical filter taps. Poor vertical filtering is compounded by the fact that in video, the input signal is not bandlimited vertically, so there may be frequency content at the vertical Nyquist rate. Thus, vertical aliasing is a more prominent problem than horizontal aliasing.
Perhaps the biggest difficulty in implementing proper resizing for arbitrary resize factors is that the intermediate sampling rate, I, depends on the resize factor and requires increased internal clock signals to maintain real-time rates. For instance, resizing from 100 to 99 pixels requires an internal clock rate 98 times the source sampling frequency (I=98×Fs). You can't design any system using the textbook-correct resizing algorithm to handle arbitrary data rates in real time.
Video compression with filtering and resizing
Video resizing is often used as a first step in video compression. If you make the image size smaller, you'll have less data to transmit and store. You can resize the video stream to a small image size, for example, Common Intermediate Format (CIF), 352×288 pixels, or Quarter Common Intermediate Format (QCIF), 176×144 pixels, send it to its final destination, and then zoom it back to a larger size for viewing. To demonstrate the bandwidth savings that are possible, consider resizing CCIR601 (720×240 pixels) video to CIF-sized video. This scaling results in a more than 40% bandwidth reduction.
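The savings quoted above follow directly from the pixel counts. The short sketch below works them out, assuming that bandwidth scales with the number of pixels per picture and that the frame rate and bits per pixel stay the same; the function name is hypothetical.

```python
def pixel_reduction(source_size, target_size):
    """Fraction of pixels removed when going from source_size to target_size."""
    source_pixels = source_size[0] * source_size[1]
    target_pixels = target_size[0] * target_size[1]
    return 1 - target_pixels / source_pixels


CCIR601 = (720, 240)  # size quoted in the text
CIF = (352, 288)
QCIF = (176, 144)

print(f"CCIR601 to CIF:  {pixel_reduction(CCIR601, CIF):.1%}")
print(f"CCIR601 to QCIF: {pixel_reduction(CCIR601, QCIF):.1%}")
```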
Resizing video is an example of lossy compression and therefore requires high-quality resizing. You want to preserve as much image content as possible without introducing any visible artifacts. As we have seen, this requires textbook-correct resizing, because bilinear and pixel dropping methods may produce aliasing distortions.
For most applications, resizing the video-image size to CIF or even QCIF is not sufficient. Even at CIF size, a 10-second video clip (running at 30 frames/sec) still requires tens of megabytes of storage and several Mbytes/sec of transmission bandwidth. Further compression, for example with MPEG-1, MPEG-2 or H.261, is required.
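A rough calculation shows where these figures come from. The sketch assumes 1.5 bytes per pixel (8-bit samples with 4:2:0 chroma subsampling), which is an assumption and not something the article specifies; other pixel formats change the numbers but not the order of magnitude. The 1.5-Mbit/sec target is used only as an example of a compressed data rate.

```python
def raw_video_rate(width, height, frames_per_sec, bytes_per_pixel):
    """Uncompressed data rate in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_sec


BYTES_PER_PIXEL = 1.5  # assumption: 8-bit 4:2:0 video

rate = raw_video_rate(352, 288, 30, BYTES_PER_PIXEL)  # CIF at 30 frames/sec
clip = rate * 10                                      # 10-second clip
print(f"{rate / 1e6:.2f} Mbytes/sec, {clip / 1e6:.1f} Mbytes for 10 sec")

# Compression ratio needed to fit an example 1.5-Mbit/sec channel.
ratio = rate * 8 / 1.5e6
print(f"about {ratio:.0f}:1")
```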
MPEG-1 is a video-compression scheme proposed by the Moving Picture Experts Group, a joint international committee. MPEG-1 was designed specifically for coding video for digital storage media at a data rate of approximately 1 to 1.5 Mbps. MPEG-2 provides an enhancement to MPEG-1, supporting higher resolution images and higher data rates. MPEG has quickly grown to become the industry standard for compressing video streams. H.261 is similar to MPEG but is more flexible in terms of the data rates it supports. The International Telecommunication Union (ITU) developed this standard for video-teleconferencing applications, and, accordingly, it has a simpler encoding process.
At the heart of these video compression schemes is the discrete cosine transform (DCT). The DCT is in many ways similar to the Fourier transform. It converts image data from the spatial domain to discrete spatial frequency coefficients. Because the eye is less sensitive to high frequencies than to low ones, you can quantize the higher-frequency coefficients more coarsely. In fact, you can quantize most of the high-frequency coefficients to zero, producing long strings of zeroes, which run-length encoding can effectively compress.
In order to get higher compression ratios, you must quantize the DCT coefficients more. The amount of quantization is called the DCT quality factor. Of course there is a trade-off: as you quantize more and degrade the quality factor, the images become more blocky and distorted.
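The chain of transform, quantization, and run-length encoding can be illustrated on a single row of eight pixels. This is a simplified one-dimensional sketch: real encoders use an 8×8 two-dimensional DCT with standardized quantization tables, and the rule used here (a step size that grows linearly with coefficient index) is a hypothetical stand-in chosen only to show the effect.

```python
import math


def dct_1d(block):
    """Orthonormal DCT-II of a short list of pixel values."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(block))
        out.append(scale * total)
    return out


def quantize(coeffs, step):
    """Coarser steps for higher frequencies (hypothetical rule: step * (1 + k))."""
    return [round(c / (step * (1 + k))) for k, c in enumerate(coeffs)]


def run_length(values):
    """Collapse repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


smooth = [100, 102, 104, 106, 108, 110, 112, 114]  # gentle ramp
busy = [90, 110, 90, 110, 90, 110, 90, 110]        # pixel-rate detail

q_smooth = quantize(dct_1d(smooth), 4)
q_busy = quantize(dct_1d(busy), 4)
print(run_length(q_smooth))  # ends in one long run of zeros
print(run_length(q_busy))    # the highest-frequency coefficient survives
```

Raising the step value trades more zeros, and therefore shorter codes, for a coarser reconstruction, which is the quality trade-off described above.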
As stated earlier, aliasing, a side effect of bilinear and pixel dropping resizing, is the overlap of harmonic frequencies with the baseband frequencies. This results in spurious high-frequency distortions, and the high-frequency DCT coefficients will not likely quantize to zero. Thus, aliasing hampers MPEG compression.
Textbook-correct resizing, on the other hand, properly filters out high spatial frequencies from the image. This filtering works well with the DCT to help ensure long strings of zero high-frequency coefficients and favorable run-length encoding compression. For fixed-bandwidth applications, textbook-correct resizing allows an improvement in the DCT quality factor. Figure 8 shows a video-compression system using video resizing and an MPEG encoder and decoder.
Applying the technology
What can you do with video resizing and compression? If you recall, resizing and compression play key roles in video teleconferencing.
Video-teleconferencing systems allow the sharing of data, particularly audio and video information, over existing telephone or ISDN lines. The telephone-line transmission bandwidth severely constrains these systems. To address this limitation, you must substantially compress the data using, for instance, H.261 with high quantization. In order to achieve the high compression required, you must sacrifice image quality, which results in blockiness and jerky motion.
By properly resizing the video before compression, you can reduce the input data, filter the high-frequency information, and maintain a reasonable image quality. This allows for more efficient compression with fewer compression distortions.
Additionally, by changing the resize factor, you can dynamically reallocate bandwidth. This lets you implement more advanced features and share other information. For instance, you can scan in documents and transmit them during a teleconference call. By shrinking the video even smaller, you can free bandwidth for transmitting the scanned document. If you zoom the received video proportionately, you get video that is blurry during document transmission.
Multimedia has rapidly become a buzzword of the '90s, and, with it, digital video has gained enormous exposure. In digital video, two technologies stand out. Resizing reduces the amount of video data you need to transmit or store, lets users choose the size at which they view video images, and may also enhance compression. Compression, on the other hand, brings the bandwidth requirements for digital video to more manageable proportions. Together, the two technologies make digital video a reality.
Calvin Ngo is a research engineer at Genesis Microchip Inc (Markham, Ont, Canada) where he has worked for the past 1 1/2 years. In his position, he participates in the research and development of resizing algorithms. He has helped to develop the next generation of video-resizing engines. He has a BAS from the University of Waterloo and enjoys skiing, basketball, and baseball in his spare time.