Feature

Memory-organization challenges with real-time video encoding on embedded signal processors

As current standards evolve and new compression standards emerge, memory size and bandwidth will remain the limiting factors to accomplishing real-time video encoding.

By Gabby Yi, Ke Ning, and Marc Hoffman, Analog Devices -- EDN, 11/27/2003

Modern multimedia applications have been appearing and developing in the marketplace at an increasingly rapid pace. Used in applications such as handheld personal digital assistants, cellular phones, and home-entertainment set-top boxes, various compression standards compress incoming video streams for displaying and archiving. These video-compression standards include MPEG-1, MPEG-2, MPEG-4, H.263, and WMV (Windows Media Video), as well as many others. Each standard has its strengths, shortcomings, and target applications, and developers are finding ways to port these standards onto fully programmable embedded processors.

Multimedia applications are quickly moving from desktop computers to portable handhelds and networked devices. For many of these devices, a fully software-based solution is highly desirable. Given that compression standards are always evolving and new standards are always emerging, developers are looking toward embedded signal processors to quickly implement these standards. Embedded signal processors offer a fast and flexible way to get these standards into products and on the market. But using embedded signal processors also has its pitfalls and obstacles. One especially notable collection of challenges involves limitations in the characteristics of available memory.

Compression standards aren't generally known for including mathematically complex algorithms, so the major problems facing developers attempting to port video-compression standards onto an embedded signal processor involve the restrictive data flow, limited bandwidth, and excessive latency of memory. The many available video-compression standards share similar sets of stages and include a domain transformation (usually using a DCT, or discrete-cosine-transform, algorithm), quantization, zigzag-pattern reordering, temporal correlation and, finally, variable-length encoding.
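Of the stages listed above, zigzag reordering is the easiest to show in a few lines. The following C sketch builds the conventional 8×8 zigzag scan order and applies it to a block of quantized coefficients. The function names and the 16-bit coefficient type are illustrative assumptions and do not come from any particular standard's reference code.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_DIM 8
#define BLOCK_LEN (BLOCK_DIM * BLOCK_DIM)

/* Fill order[k] with the raster index (row * 8 + col) of the k-th
 * coefficient visited by the conventional 8x8 zigzag scan. */
void build_zigzag_order(uint8_t order[BLOCK_LEN])
{
    int k = 0;
    /* Walk the 15 anti-diagonals; d is row + col on each diagonal. */
    for (int d = 0; d <= 2 * (BLOCK_DIM - 1); d++) {
        int lo = d < BLOCK_DIM ? 0 : d - (BLOCK_DIM - 1); /* smallest row */
        int hi = d < BLOCK_DIM ? d : BLOCK_DIM - 1;       /* largest row  */
        for (int i = lo; i <= hi; i++) {
            /* Even diagonals run from bottom-left to top-right,
             * odd diagonals from top-right to bottom-left. */
            int row = (d % 2 == 0) ? (hi + lo - i) : i;
            int col = d - row;
            order[k++] = (uint8_t)(row * BLOCK_DIM + col);
        }
    }
    assert(k == BLOCK_LEN);
}

/* Copy a raster-ordered block into zigzag order. */
void zigzag_reorder(const int16_t block[BLOCK_LEN],
                    const uint8_t order[BLOCK_LEN],
                    int16_t out[BLOCK_LEN])
{
    for (int k = 0; k < BLOCK_LEN; k++)
        out[k] = block[order[k]];
}
```

In practice, an encoder would typically precompute the order once and keep it as a small table in internal memory, which is the kind of "zigzag offsets" table that competes with frame data for space.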

As processors gain speed and power, and instruction sets ideal for video-processing applications complement them, real-time encoding of D1-resolution video sequences becomes easier. Greater compression that still retains a high peak SNR (signal-to-noise ratio) allows simpler bit-stream storage and transmission through a network. With a fast processor and much data to process, real-time encoding is at the mercy of the processor's memory architecture. With limited fast internal memory and limited bandwidth to external memory, a bottleneck often appears between the processor and the data.

Memory requirements and motion estimation

Video processing brings entire frames that are stored in external memory into internal memory for processing. Ideally, processing an entire frame in internal memory would eliminate the issue of memory latencies and bandwidth. Data wouldn't have to travel up hierarchies, and the processor could access it within one cycle. Unfortunately, the limited sizes of internal memories make this objective difficult to achieve. Typical D1 sequences consist of a full-resolution luminance channel of 720×480 pixels per frame, along with two quarter-resolution chrominance channels of 360×240 pixels per frame. If each color component uses 1 byte, the total size for one frame is 518,400 bytes. Although this amount is less than 1 Mbyte, other buffers and data tables required to process a frame (such as DCT coefficients, zigzag offsets, and an output buffer) also consume memory. This scenario assumes that the compression scheme does not employ motion estimation for temporal prediction. Predictive coding using a motion-estimation algorithm requires simultaneous access to data from at least two frames, calling for 0.5 Mbytes more of storage.
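The frame-size arithmetic above is easy to check in code. This small C sketch assumes the layout the text describes: one full-resolution luminance plane plus two chrominance planes, each halved in both dimensions, at 1 byte per sample. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one frame with a full-resolution luminance plane and
 * two chrominance planes subsampled by 2 horizontally and vertically,
 * at 1 byte per sample. */
size_t frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    size_t luma = width * height;               /* 720 x 480 = 345,600 */
    size_t chroma = (width / 2) * (height / 2); /* 360 x 240 =  86,400 */
    return luma + 2 * chroma;
}
```

For a D1 frame, `frame_bytes(720, 480)` gives 518,400 bytes, so holding a current frame and one reference frame for motion estimation takes 1,036,800 bytes before any tables or output buffers are counted.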

Motion estimation correlates the current frame with some previous frame and produces a motion vector. It is an effective way to exploit the temporal redundancy in all video sequences and is therefore a key technique that almost all modern video-compression standards employ. The basic premise behind motion estimation is that, by coding motion vectors, you achieve greater compression than if you just compressed each frame in a stand-alone fashion. To achieve even higher compression, some video-compression standards allow the motion search granularity to be a half-pixel or even a quarter-pixel.

Many motion-estimation algorithms have emerged during the last two decades. The FBMA (full-search block-matching algorithm) is one of the most basic options. For every block of the current frame, FBMA searches all of the possible matching locations in the reference frame and finds the location with the minimum matching error. FBMA is a resource-costly algorithm that requires lots of data movement, because it covers all the pixels in the reference frame. A full search consumes more than half of the total computation time of a video encoder. Smarter algorithms search only a subset of the candidate locations, a technique that turns out to have almost the same compression performance as FBMA. Even with decimated search points, the data-movement bandwidth is still high and remains a performance bottleneck for most real-time video-encoding products. Therefore, a good motion-estimation algorithm is the most critical piece of a real-time video-encoding system. Your challenge is to decrease the memory bandwidth as much as possible without losing too much compression performance and quality at the output.
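To make the cost of a full search concrete, here is a minimal C sketch of block matching. It rests on two assumptions that the text does not specify: the matching criterion is the SAD (sum of absolute differences), and the search is limited to a window of ±range pixels around the block rather than the entire reference frame. The type and function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int dx, dy;    /* displacement of the best match in the reference frame */
    unsigned sad;  /* matching error at that displacement */
} MotionVector;

/* Sum of absolute differences between two n x n blocks that both live
 * in images with the same row stride. */
unsigned block_sad(const uint8_t *cur, const uint8_t *ref, int stride, int n)
{
    unsigned sad = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Test every displacement within +/-range of the block at (bx, by) and
 * keep the one with the smallest SAD. Candidates that would fall outside
 * the reference frame are skipped. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height,
                         int bx, int by, int n, int range)
{
    MotionVector best = { 0, 0, UINT_MAX };
    assert(bx >= 0 && by >= 0 && bx + n <= width && by + n <= height);
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + n > width || ry + n > height)
                continue;
            unsigned sad = block_sad(cur + by * width + bx,
                                     ref + ry * width + rx, width, n);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Each block visits up to (2 × range + 1)² candidates and reads n² reference pixels for every one of them, which is why the reference data must sit in fast memory for this loop to run in real time.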

Because the FBMA is too time-consuming, many companies and academic researchers have developed faster algorithms for performing motion estimation while still maintaining high quality in SNR. One set of fast motion-estimation algorithms is called hierarchical fast block matching. Generally, these algorithms decimate the image into a number of smaller resolutions. Block matching begins on the smallest, most decimated level to find a starting motion vector and then continues on a subportion of the next level, dependent on the result from the previous level. This iteration continues for all of the following levels until the algorithm reaches the bottom level.
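The hierarchical idea can be sketched in C with just two levels. The choices here are assumptions made for illustration: decimation by 2×2 averaging, SAD as the matching criterion, and a ±1-pixel refinement at full resolution. Real implementations vary in all three, and all names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Halve a w x h image in both dimensions by averaging each 2x2 group. */
void decimate2(const uint8_t *src, int w, int h, uint8_t *dst)
{
    assert(w % 2 == 0 && h % 2 == 0);
    for (int y = 0; y < h / 2; y++)
        for (int x = 0; x < w / 2; x++) {
            const uint8_t *p = src + 2 * y * w + 2 * x;
            dst[y * (w / 2) + x] =
                (uint8_t)((p[0] + p[1] + p[w] + p[w + 1] + 2) / 4);
        }
}

/* Search +/-range around the starting vector (*dx, *dy), update it to the
 * best candidate, and return that candidate's SAD. */
unsigned refine(const uint8_t *cur, const uint8_t *ref, int w, int h,
                int bx, int by, int n, int range, int *dx, int *dy)
{
    unsigned best = UINT_MAX;
    int cx = *dx, cy = *dy; /* search center stays fixed during the scan */
    for (int vy = cy - range; vy <= cy + range; vy++)
        for (int vx = cx - range; vx <= cx + range; vx++) {
            int rx = bx + vx, ry = by + vy;
            if (rx < 0 || ry < 0 || rx + n > w || ry + n > h)
                continue;
            unsigned sad = 0;
            for (int y = 0; y < n; y++)
                for (int x = 0; x < n; x++)
                    sad += (unsigned)abs(cur[(by + y) * w + bx + x] -
                                         ref[(ry + y) * w + rx + x]);
            if (sad < best) {
                best = sad;
                *dx = vx;
                *dy = vy;
            }
        }
    return best;
}

/* Two-level search: find a coarse vector on the half-resolution images,
 * scale it up, then refine by +/-1 pixel at full resolution. */
unsigned hierarchical_search(const uint8_t *cur, const uint8_t *ref,
                             const uint8_t *cur_half, const uint8_t *ref_half,
                             int w, int h, int bx, int by, int n, int range,
                             int *dx, int *dy)
{
    assert(n % 2 == 0 && bx % 2 == 0 && by % 2 == 0);
    *dx = 0;
    *dy = 0;
    refine(cur_half, ref_half, w / 2, h / 2,
           bx / 2, by / 2, n / 2, range / 2, dx, dy);
    *dx *= 2;
    *dy *= 2;
    return refine(cur, ref, w, h, bx, by, n, 1, dx, dy);
}
```

The coarse pass covers the same area with roughly one-quarter of the candidates and one-quarter of the pixels per candidate. The fine pass cannot start until the coarse vector is known, however, so the reference data it needs cannot be fetched ahead of time.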

This family of motion-estimation algorithms generally gives good results but at a price. Because the system needs to decimate and store the image, it needs more memory and increased memory bandwidth to move the data between internal and external memory. In addition, performing the search on each subsequent level depends on the intermediate motion vectors found on the previous level. This characteristic makes it difficult to pipeline the data transfer. Other video algorithms exhibit irregular memory addressing, and noncausal postprocessing filters, such as those for deringing and deblocking, put additional performance pressure on the memory subsystem.

In all video-encoding systems, you can store only a small portion of the image for processing in the limited-size internal memory. To minimize performance-sapping data transfers, it is advantageous to keep data in the on-chip memory for as long as you need it. The irregularity and noncausality of filters and other post-processing functions, though, make it difficult to control the data-buffer movement into and out of the on-chip memory. Typically, to execute a filtering operation, you must temporarily move some data into external memory to make room for other data to be processed.

Embedded-processor memories

Typical embedded processors comprise multitiered memory hierarchies, in which the fastest internal L1 memory is much smaller than the slower, less expensive, and more plentiful external memory. Although the size of the internal memory is on the order of kilobytes, external memory can provide megabytes' worth of storage but with much higher access latencies. If external memory had no latency, you could process video frames directly from external memory. Unfortunately, that is not the case.

Modern embedded-processor architectures that can perform dual memory accesses in parallel with an arithmetic computation help to achieve real-time video processing. This parallelism allows for greater code density, especially for tighter inner loops. For example, a data load, a computation such as a multiply/add, and a data store could condense from three cycles to one, assuming single-cycle accesses to internal memory. Parallelism is made possible by using different memory banks for code and data, and the data memory itself also comprises multiple banks. Separating the memory into different banks and using different buses to connect to each memory bank allows for the simultaneous accesses of an instruction fetch and dual data fetches within the same cycle.
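A multiply/accumulate loop is the typical beneficiary of this arrangement. The C sketch below only shows the shape of such a loop. Whether it actually executes as two loads plus one multiply/add per cycle depends on the processor, the compiler, and the placement of the two arrays in different data banks, none of which the code controls. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Dot product of two 16-bit vectors. Each iteration needs two data loads
 * (a[i] and b[i]) and one multiply/add. The sketch assumes the caller has
 * arranged for a and b to reside in different internal memory banks so
 * that both loads can be issued in the same cycle. Accumulator overflow
 * is not handled here. */
int32_t dot_product_16(const int16_t *a, const int16_t *b, int n)
{
    assert(n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

If both arrays land in the same bank, the two loads in each iteration may have to be serialized, and the loop can lose much of the benefit described above.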

Video encoders/embedded processors

Two memory-management techniques find common use in video-processing algorithms. One approach configures the internal memory as a cache bank and processes each frame as a whole. The alternative technique configures the internal memory as SRAM and processes each macroblock (one 16×16 luminance block and two 8×8 chrominance blocks) in a loop, thereby pipelining the action of bringing in the macroblock from external memory via a DMA (direct-memory-access) transaction (Figure 1).
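For the macroblock-based approach, the encoder has to know where each macroblock's data sits in the external frame buffer. The following C sketch computes those offsets, assuming planar storage (one luminance plane and two chrominance planes at half resolution in each dimension) and frame dimensions that are multiples of 16. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t luma;   /* byte offset of the 16x16 block in the luminance plane */
    size_t chroma; /* byte offset of the 8x8 block in each chrominance plane */
} MacroblockOffsets;

/* Number of 16x16 macroblocks that tile a frame. */
size_t macroblocks_per_frame(size_t width, size_t height)
{
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

/* Offsets of macroblock mb_index, counted in raster order. */
MacroblockOffsets macroblock_offsets(size_t width, size_t mb_index)
{
    assert(width % 16 == 0);
    size_t mbs_per_row = width / 16;
    size_t mbx = mb_index % mbs_per_row; /* macroblock column */
    size_t mby = mb_index / mbs_per_row; /* macroblock row */
    MacroblockOffsets o;
    o.luma = (mby * 16) * width + mbx * 16;
    o.chroma = (mby * 8) * (width / 2) + mbx * 8;
    return o;
}
```

A D1 frame holds 45 × 30 = 1350 macroblocks of 384 bytes each (256 luminance bytes plus two sets of 64 chrominance bytes), which again totals 518,400 bytes. Only a few of these macroblocks need to be in internal memory at any one time.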

Cache configurations allow for easier management of memory requirements, because the system stores all data in external memory and brings it into the cache when needed. A system configured for data cache might not, however, handle the nonregular nature of video sequences. Too many cache misses would occur, resulting in time wasted fetching the required data. With a full SRAM system, the developer would have greater control, as well as ultimate responsibility, in determining which data is in internal memory at a particular time. For instance, some data might be necessary for every frame, and other pieces of data might be needed for every processed macroblock. In this case, it would make sense for the less-often-used data to remain in external memory while continuously used data remained in internal memory.

For video-compression standards, it is hard to predict the execution time of a cache-configured system. Although an optimal caching strategy might lead to better bandwidth usage for video processing, not all embedded processors have configurable cache-replacement policies. In addition, the video encoder typically isn't the only algorithm concurrently running on the embedded processor. Other processes, such as audio and communication applications, both requiring fewer resources, might pollute the cache and result in a degradation of the video-encoder performance.

The performance of cache-based systems also depends on the locality of data accesses. For video encoders, most table look-ups exhibit poor data locality, especially if the tables are large. For example, Huffman tables for variable-length coding can reach several kilobytes, and the accessed table data depends on the original data from the video sequence. Cache-based systems can improve memory-bandwidth usage but give developers less control over the data flow and hence less control over optimization. With locality present in other stages of a video encoder, though, a cache-based system still improves memory-bandwidth usage in a real-time encoder.

As previously discussed, a system with an SRAM configuration would need to periodically bring in data from external memory to internal memory. Therefore, you should employ a DMA engine. Otherwise, using the processor to bring in the data wastes precious clock cycles needed for processing data already in memory. Specifically, you should use a flexible DMA engine that can stride across data to bring in 2-D subimages. Because video frames are 2-D, and video processing algorithms generally tend to access 2-D subportions of the frame, a DMA engine with 2-D access capabilities would significantly facilitate real-time encoding. Another advantage of a 2-D DMA capability is the reduced setup time. Without it, you must set up a separate transfer for every individual row of the subimage.
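What a 2-D transfer does can be modeled in software. The C sketch below is not a driver for any real DMA controller. The descriptor layout and names are hypothetical, and the copy loop stands in for what the hardware would do in the background after a single setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one 2-D transfer. */
typedef struct {
    const uint8_t *src; /* top-left byte of the subimage in the source frame */
    uint8_t *dst;       /* destination buffer, rows packed back to back */
    size_t row_bytes;   /* bytes in each subimage row */
    size_t rows;        /* number of subimage rows */
    size_t src_pitch;   /* bytes between the starts of consecutive source rows */
} Dma2dDescriptor;

/* Software model of the transfer: one descriptor moves the whole subimage.
 * A 1-D engine would need a separate setup for each of the rows. */
void dma2d_model_run(const Dma2dDescriptor *d)
{
    assert(d->row_bytes <= d->src_pitch);
    for (size_t r = 0; r < d->rows; r++)
        memcpy(d->dst + r * d->row_bytes,
               d->src + r * d->src_pitch,
               d->row_bytes);
}
```

To fetch the 16×16 luminance block of a macroblock from a 720-pixel-wide frame, you would set `row_bytes` to 16, `rows` to 16, and `src_pitch` to 720, and point `src` at the block's top-left pixel.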

Another consideration that you must take into account is the latency associated with DMA transfers. Although DMA transfers can occur in the background while the processor is operating on other data, synchronization must occur so that the data will be ready for processing when the processor is ready to process it. You can usually solve this synchronization problem by using a circular-buffering scheme, a technique found in a variety of applications, such as network control and operating systems. The use of circular buffering in video processing allows pipelining so that you can streamline the data transfer and processing.
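The simplest circular buffer for this purpose has two slots: while the processor works on one, the DMA engine fills the other. The C sketch below shows the control flow only. `dma_start` and `dma_wait` are hypothetical stand-ins that copy synchronously, `process_mb` is a placeholder for the real encoding work, and the macroblocks are assumed to be packed contiguously in external memory to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB_BYTES 384 /* 16x16 luminance + two 8x8 chrominance blocks */
#define NUM_BUFS 2   /* slots in the circular buffer */

/* Hypothetical stand-ins for a DMA driver. On real hardware, dma_start
 * would return immediately and dma_wait would block until completion. */
void dma_start(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

void dma_wait(void)
{
}

/* Placeholder for the per-macroblock encoding work: sums the bytes. */
unsigned process_mb(const uint8_t *mb)
{
    unsigned sum = 0;
    for (int i = 0; i < MB_BYTES; i++)
        sum += mb[i];
    return sum;
}

/* Process num_mbs macroblocks, fetching macroblock i + 1 into the next
 * slot while macroblock i is being processed. */
unsigned encode_frame(const uint8_t *ext, int num_mbs)
{
    static uint8_t ring[NUM_BUFS][MB_BYTES]; /* would live in internal SRAM */
    unsigned total = 0;
    assert(num_mbs >= 0);
    if (num_mbs == 0)
        return 0;
    dma_start(ring[0], ext, MB_BYTES); /* prime the pipeline */
    for (int i = 0; i < num_mbs; i++) {
        dma_wait(); /* macroblock i is now in its slot */
        if (i + 1 < num_mbs)
            dma_start(ring[(i + 1) % NUM_BUFS],
                      ext + (size_t)(i + 1) * MB_BYTES, MB_BYTES);
        total += process_mb(ring[i % NUM_BUFS]);
    }
    return total;
}
```

The pipeline hides the transfer latency only when processing a macroblock takes at least as long as fetching the next one. If it does not, `dma_wait` stalls the processor, and adding more slots to the ring does not help.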

Along with the use of DMA to transfer data from one portion of memory to another comes the possibility of bus collisions during the transfer. In an optimized system, a DMA transfer should not access memory that the processor might simultaneously access (Figure 2). Some processors include an intelligent DMA engine that prevents memory collisions but might still incur a penalty in wasted idle time. Other architectures have multiport, multibank memory that allows simultaneous access from the CPU and the DMA engine. Although this approach eliminates possible collisions from bus and memory conflicts, the developer is still responsible for appropriately scheduling the data accesses and memory transfers.


Author Information

Marc Hoffman is a DSP software engineering manager at Analog Devices. His in-depth background spans optimizing compilers and tools, DSP architectures, algorithm development, signal processing, and imaging/video processing. Since 1999, he has been focused on software applications in the domain of real-time-based video-compression and -decompression systems on the BlackFin Multimedia Processor. He holds a master's of software engineering from the University of Massachusetts.

Ke Ning is a digital media software engineer at Analog Devices with responsibilities including embedded-system-architecture design, system profiling, software design, and implementation and integration of multimedia and communications systems applications. Ning holds an MSEE from the University of Minnesota.

Gabby Yi is a digital media software engineer at Analog Devices whose responsibilities include the design, debugging, and profiling of embedded systems for multimedia applications and the implementation of various video-compression schemes on embedded DSP platforms. Yi holds a BSEE from Northeastern University.



