Stream Processors aims at parallel signal processing
By Robert Cravotta, Technical Editor -- 2/12/2007
The processor architecture relies on two MIPS 4KEc processor cores in conjunction with a DPU (data-parallel unit) that consists of a scalable number of processing lanes, currently eight or 16. The system processor, a 4KEc core, runs the application operating system and software and manages the system I/O.
The other MIPS core and the DPU make up the DSP-coprocessor subsystem. The MIPS core communicates with the DSP dispatcher, which manages the runtime synchronization of instructions and DMA data loads for the kernel functions that execute in the DPU. The multilane DPU executes the same VLIW (very-long-instruction-word) instructions across all of the lanes. Each lane includes five 32-bit ALUs with MAC (multiply/accumulate) units, four LRF (lane-register file) Ld/St (load/store) units, and a COM unit for interlane communication. Each ALU in a lane is independent and operates on local data.
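To make the lockstep idea concrete, the following C fragment is a minimal sketch that models one multiply/accumulate instruction issued to every lane at once. The names lane_t and dpu_issue_mac are hypothetical and are not part of any SPI tool chain; the loop over lanes merely stands in for hardware that performs all lanes' operations in the same cycle.

```c
#include <stdint.h>

/* Hypothetical model of one lane: a single accumulator standing in for
   lane-local register state. This is not SPI's actual register layout. */
typedef struct {
    int32_t acc;
} lane_t;

/* Issue one multiply/accumulate "instruction" to every lane.
   Every lane performs the same operation; only the operands differ.
   a[lane] and b[lane] are the operands local to that lane. */
static void dpu_issue_mac(lane_t lanes[], int n_lanes,
                          const int32_t a[], const int32_t b[])
{
    for (int lane = 0; lane < n_lanes; ++lane) {
        lanes[lane].acc += a[lane] * b[lane];
    }
}
```

Calling dpu_issue_mac twice with the same operands doubles each lane's accumulator, and no lane ever reads another lane's data, which mirrors the description of ALUs operating on local data.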
This processing architecture best suits computationally intensive applications that operate on streaming, parallel data. One of SPI’s processors can encode high-definition 1080p video (H.264 HD) in real time and still perform custom video enhancements, image tuning, and content analysis. Because the target applications stream data by nature, the system has no conventional cache. Instead, the compiler allocates the data into each device lane through an operand-register-file hierarchy. The same kernel function executes across all of the lanes, with each lane operating on a unique set of data. A high-speed interlane switch supports data exchange across all of the lanes.
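As a rough illustration of one kernel running on every lane with its own data, the C sketch below splits a stream into equal slices, runs the same summing function on each slice, and then combines the per-lane results. The function names, the 16-entry partial array, and the tree-shaped combine step are assumptions made for the example; the combine loop only stands in for traffic over the interlane switch and does not describe SPI's actual mechanism.

```c
#include <stddef.h>
#include <stdint.h>

/* Kernel body: every lane runs this same function on its own slice. */
static int32_t lane_kernel_sum(const int32_t *slice, size_t len)
{
    int32_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum += slice[i];
    }
    return sum;
}

/* Run the kernel on n_lanes equal slices of the stream, then combine the
   per-lane partial sums. Assumes 1 <= n_lanes <= 16 and that n_lanes
   divides total evenly (assumptions of this sketch only). */
static int32_t stream_sum(const int32_t *stream, size_t total, size_t n_lanes)
{
    int32_t partial[16] = {0};
    size_t per_lane = total / n_lanes;

    /* Same kernel, unique data per lane. */
    for (size_t lane = 0; lane < n_lanes; ++lane) {
        partial[lane] = lane_kernel_sum(stream + lane * per_lane, per_lane);
    }

    /* Tree reduction: in each round, a lane adds the value held by the
       lane `stride` positions away, modeling an interlane exchange. */
    for (size_t stride = 1; stride < n_lanes; stride *= 2) {
        for (size_t lane = 0; lane + stride < n_lanes; lane += 2 * stride) {
            partial[lane] += partial[lane + stride];
        }
    }
    return partial[0];
}
```

Because each slice is consumed once and in order, nothing in the kernel would benefit from a cache, which is consistent with the article's point about streaming workloads.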
The SPI compiler supports and exploits a C-programming model without special parallel constructs. After a designer uses intrinsics to explicitly mark the beginning and end of the computationally intensive kernel functions and the associated input and output data streams, the compiler performs static-flow analysis to unroll loops and to allocate on-chip memory so that it makes the best use of the local memories in each processing lane. Because the compiler extracts the parallelism from the C source, the source code remains compatible with chips that have different numbers of lanes.
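The article does not name SPI's intrinsics, so the following C fragment is only a sketch of what such marked-up source might look like. KERNEL_BEGIN, KERNEL_END, STREAM_IN, and STREAM_OUT are hypothetical placeholder macros that expand to no-ops here; a real tool chain would presumably give them meaning. The point of the example is that the loop is ordinary C and no lane count appears anywhere in it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical markers, defined as no-ops so this file compiles with any
   C compiler. They are NOT SPI's real intrinsics. */
#define KERNEL_BEGIN(name)    ((void)0)
#define KERNEL_END(name)      ((void)0)
#define STREAM_IN(ptr, len)   ((void)0)
#define STREAM_OUT(ptr, len)  ((void)0)

/* Apply a gain and an offset to each sample: out[i] = gain * in[i] + offset.
   Iterations are independent, so a compiler is free to spread them
   across however many lanes the target chip provides. */
static void scale_offset(const int16_t *in, int16_t *out, size_t n,
                         int16_t gain, int16_t offset)
{
    STREAM_IN(in, n);
    STREAM_OUT(out, n);

    KERNEL_BEGIN(scale_offset);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (int16_t)(gain * in[i] + offset);
    }
    KERNEL_END(scale_offset);
}
```

Compiled as plain C, the function runs sequentially and produces the same results a lane-parallel build would be expected to produce, which is the compatibility property the paragraph above describes.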
© 2009, Reed Business Information, a division of Reed Elsevier Inc. All Rights Reserved.


