Video CoDecs in software: some reflections on programmable-hardware approaches
Today ARC will announce a family of silicon intellectual property (IP) cores, the VRaptor family, for implementing multi-standard video CoDecs. Given that ARC is primarily a RISC processor IP vendor, the offering makes an interesting comparison to the more traditionally SoC-centric products that have come to market in the last couple of months. Some important points emerge about programmable, vs. merely configurable, solutions to challenging real-time computing problems.
Perhaps the first observation is that there are sufficient niches in the video CoDec market to entice an amazing array of architectural approaches. With applications ranging from encoding video captured on low-end cell phones, to encoding live sports events for HD broadcast, to decoding for a CIF playback device, there is a huge range of cost, power, performance, resolution, and quality requirements. Accordingly, we are seeing CoDecs in every form from software on server farms to small IP cores.
The ARC offering fits toward the low end in performance. It is scalable to provide a range of performance points from CIF decode to Base-profile D1 encoding. Accordingly, die area and power are important points here. Naturally, since the solution is based on an ARC processor core, ARC also emphasizes the flexibility to move between standards (H.263, H.264, MPEG-4 and JPEG), resolutions, and bit rates with only software changes.
There are some interesting ideas in the architecture. As you might expect, you don’t do even CIF decode on an unassisted processor core of this size. In order to meet real-time deadlines and power constraints, ARC employs a 700-family CPU core, a SIMD execution unit that is a video-tuned extension of their standard SIMD offering, and a hardware block for entropy decoding. Going from CIF decode to a full D1 Base-profile CoDec requires the addition of a second SIMD engine and two more hardware blocks: one for motion encoding and one for entropy encoding.
While the compute resources are interesting, the communications architecture of the core is even more so. ARC has clustered the specialized execution units (but not the CPU itself) around a single-port local RAM with a capacity somewhere in the neighborhood of 16-32 Kbytes, interfaced through a five-port arbiter. This permits video data to reside in a local buffer common to all the number-crunchers, and provides adequate short-term storage, given that the CoDec does not attempt to generate B-frames.
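To get a rough feel for what a buffer of that size can hold, the arithmetic can be sketched as follows. The 8-bit 4:2:0 sampling is an assumption (it is the usual case for the standards named here, but ARC's material does not state it), and the function names are purely illustrative.

```python
# Rough working-set arithmetic for the shared local RAM.
# Assumption (not from ARC): 8-bit 4:2:0 video, so one macroblock is a
# 16x16 luma block plus two 8x8 chroma blocks.

MB_BYTES = 16 * 16 + 2 * (8 * 8)  # bytes per 4:2:0 macroblock


def macroblocks_that_fit(ram_bytes: int) -> int:
    """Whole macroblocks that fit in a buffer of ram_bytes."""
    return ram_bytes // MB_BYTES


def macroblocks_per_frame(width: int, height: int) -> int:
    """Macroblocks in a frame whose dimensions are multiples of 16."""
    return (width // 16) * (height // 16)


if __name__ == "__main__":
    for kbytes in (16, 32):
        print(kbytes, "Kbytes holds",
              macroblocks_that_fit(kbytes * 1024), "macroblocks")
    print("CIF frame:", macroblocks_per_frame(352, 288), "macroblocks")
```

Under these assumptions the local RAM holds only a small fraction of the macroblocks in a single CIF frame, so it can serve as a working set for the execution units but never as a frame store.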
This memory in turn must be fed efficiently, of course. Interestingly, ARC did not leave this task to the CPU, but developed a specialized DMA engine that sits between the local memory and the core’s system-bus interface. Senior director of product marketing Gagan Gupta says that the DMA engine has considerable intelligence, including an understanding of macroblocks and the sorts of scatter/gather algorithms necessary to efficiently map video onto DRAM. Thus some of the intelligence that would normally reside in a smart DRAM controller has migrated to the internal DMA block, requiring the DRAM controller designer and the system architects to understand clearly just what the DMA engine (itself somewhat programmable) is doing.
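As an illustration of what "understanding macroblocks" can mean for a DMA engine, here is a minimal sketch of a gather operation that pulls one 16x16 luma macroblock out of a raster-ordered frame. It is not ARC's design: the raster layout, the descriptor format, and the names `macroblock_descriptors` and `gather` are all assumptions made for the example.

```python
from typing import List, Tuple

MB_SIZE = 16  # a luma macroblock is 16x16 pixels


def macroblock_descriptors(base: int, stride: int, mb_x: int, mb_y: int,
                           mb_size: int = MB_SIZE) -> List[Tuple[int, int]]:
    """Return one (address, length) burst per macroblock row.

    Assumes a raster-ordered, 8-bit luma plane starting at `base`, with
    `stride` bytes between the starts of consecutive picture lines.
    """
    first = base + mb_y * mb_size * stride + mb_x * mb_size
    return [(first + row * stride, mb_size) for row in range(mb_size)]


def gather(memory: bytes, descriptors: List[Tuple[int, int]]) -> bytes:
    """Software stand-in for the DMA gather: copy each burst in order."""
    out = bytearray()
    for addr, length in descriptors:
        out += memory[addr:addr + length]
    return bytes(out)


if __name__ == "__main__":
    # CIF luma plane (352x288) filled with a recognizable pattern.
    frame = bytes(i % 251 for i in range(352 * 288))
    block = gather(frame, macroblock_descriptors(0, 352, mb_x=1, mb_y=0))
    print(len(block), "bytes gathered for macroblock (1, 0)")
```

Even this simple case shows why the DRAM controller designer needs to know what the DMA engine is doing: one macroblock becomes sixteen short bursts a full picture line apart, and how well that pattern performs depends on how picture lines map onto DRAM pages and banks.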
There are simpler DMA engines in the motion and entropy encoder blocks as well. There seems to be a trend in architectural thinking this year, of which the ARC design is very much an example, to place more intelligence in, and reliance on, DMA controllers for data movement.
Task management falls to the CPU. But the ARC architects apparently felt that normal shared-memory message-passing was not going to be sufficiently fast and predictable for the encode job. So they have developed a set of proprietary high-speed channels that form point-to-point links between the CPU and the computing engines. The channels appear to the CPU as sets of reserved registers, and are used for passing short messages such as pointers and task-control information directly from processor to processor.
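The idea of a register-mapped channel carrying short messages can be modeled in a few lines. This is only a behavioral sketch: the FIFO depth, the 32-bit word size, and the opcode/offset packing are invented for illustration, since the actual channel format is proprietary and not described here.

```python
from collections import deque
from typing import Optional, Tuple


class Channel:
    """Hypothetical point-to-point channel: a small FIFO that the CPU
    sees as a register it can write, and an engine sees as one it can read."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self._fifo = deque()

    def write(self, word: int) -> bool:
        """CPU side: returns False if the channel is full."""
        if len(self._fifo) >= self.depth:
            return False
        self._fifo.append(word & 0xFFFFFFFF)
        return True

    def read(self) -> Optional[int]:
        """Engine side: returns None if no message is waiting."""
        return self._fifo.popleft() if self._fifo else None


def pack_task(opcode: int, buffer_offset: int) -> int:
    """Pack an 8-bit opcode and a 24-bit local-RAM offset into one word
    (an assumed layout, chosen only for this example)."""
    return ((opcode & 0xFF) << 24) | (buffer_offset & 0xFFFFFF)


def unpack_task(word: int) -> Tuple[int, int]:
    """Inverse of pack_task: return (opcode, buffer_offset)."""
    return word >> 24, word & 0xFFFFFF


if __name__ == "__main__":
    to_simd = Channel(depth=2)
    to_simd.write(pack_task(opcode=3, buffer_offset=0x1000))
    print(unpack_task(to_simd.read()))
```

The point of the model is that a message is a single register write carrying a pointer and a little control information, with no shared-memory queue or lock on the path between processor and engine.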
A few conclusions suggest themselves. For one, the old truth that programmability costs power has not really changed. Many tasks are a lot more efficient on specialized engines. But moving the inner loops to specialized hardware while keeping sequencing on a fully-programmable CPU does provide flexibility. It also makes it quite difficult to characterize the core. Since the CoDec algorithms are rendered as a collection of software tools that can be individually enabled or disabled, the trade-off between bit rate, image quality and power consumption depends not only on the hardware configuration you choose to implement, but also on the software routines you have enabled. This makes accurate system-level modeling a critical task.