Texas Instruments TMS320C6000
EDN Staff - March 30, 2000
TI's TMS320C6000 is a general-purpose DSP based on a very-long-instruction-word (VLIW) architecture. This architecture includes the fixed-point C62x, the floating-point C67x, and the new C64x families. The C64x is object-code-compatible with the C62x but with significant architectural enhancements and an initial operating frequency of 750 MHz. TI created the C67x by adding floating-point capability to six of the C62x's eight functional units, so the C67x instruction set is a superset of the C62x instruction set. The C6000 lacks a dedicated multiply-accumulate (MAC) unit. Instead, it performs MAC operations by using separate multiply and add instructions. Although this operation requires two instruction cycles, the pipelined effect yields apparent single-cycle execution. (Unless otherwise indicated, all C62x details that follow also apply to the C67x.)
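As a rough illustration of why a two-instruction MAC can still look like one cycle per sample, the following Python sketch models the multiply and the add as two overlapping stages. It is a simplified model, not C6000 code: the function name `pipelined_dot` and the one-cycle multiply latency are assumptions chosen for illustration, and real instruction latencies differ.

```python
def pipelined_dot(a, b):
    """Dot product modeled as separate, overlapping multiply and add stages.

    Each loop iteration stands for one cycle. The add stage consumes the
    product issued on the previous cycle while the multiply stage starts
    the next one. Returns (accumulated sum, cycles used).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    acc = 0
    pending = None  # product waiting for the add stage
    cycles = 0
    i = 0
    n = len(a)
    while i < n or pending is not None:
        # Add stage: accumulate the product issued on the previous cycle.
        if pending is not None:
            acc += pending
            pending = None
        # Multiply stage: issue the next product if samples remain.
        if i < n:
            pending = a[i] * b[i]
            i += 1
        cycles += 1
    return acc, cycles
```

In this model, N samples take N + 1 cycles, so the cost of the separate add is paid once for the whole loop rather than once per sample.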
This architecture comprises dual datapaths and dual matching sets of four functional units. The eight functional units on the C64x and C62x comprise two multiply (M) units and six 32-bit arithmetic units with a 40-bit ALU and a 40-bit barrel shifter. The C64x M units perform two 16×16-bit multiplies every clock cycle, compared with one multiply on the C62x. In addition, each M unit on the C64x can perform four 8×8-bit multiplies every clock cycle. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms. The C64x also has beefed-up capability in other functional units. For example, the logical (L) units can perform byte shifts and quad 8-bit subtractions with absolute value. This absolute-difference instruction benefits motion-estimation algorithms. TI also added bidirectional variable shifts to the M and S units. On the C64x, the D units can perform 32-bit logical instructions, as the S and L units already can. The L and D units can load 5-bit constants, which complements the S units' ability to load 16-bit constants.
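To make the quad 8-bit subtraction with absolute value concrete, here is a minimal Python sketch that emulates the operation on four unsigned 8-bit lanes packed into one 32-bit word. The function names `subabs4` and `sad4` are chosen for illustration, and the lane ordering (lane 0 in the least significant byte) is an assumption. The sketch shows the arithmetic only and says nothing about timing.

```python
def subabs4(x, y):
    """Per-lane |x - y| on four unsigned 8-bit lanes packed in 32 bits."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        # Each lane is handled independently; nothing carries between lanes.
        result |= abs(a - b) << shift
    return result


def sad4(x, y):
    """Sum of the four absolute differences, as used in block matching."""
    packed = subabs4(x, y)
    return sum((packed >> (8 * lane)) & 0xFF for lane in range(4))
```

A motion-estimation kernel would apply `sad4` across every row of a candidate block and keep the candidate with the smallest total.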
In the C64x, each functional-unit set has its own bank of 32 32-bit registers; the C62x has 16 32-bit registers per bank. A program can use the general-purpose registers for data, data-address pointers, or condition codes. In all C6000 devices, you can use registers A4 through A7 and B4 through B7 for circular addressing. A program can use any register as a loop counter, which can free the standard-condition registers for other uses. On the C64x, any member of a functional-unit set can access the other functional-unit set's register bank, and the set does so through one data bus. On the C62x, all units except the two data units have a data cross-path to the other set of units.
The C64x data-cross-path accesses allow multiple units per side to simultaneously read the same cross-path source. Thus, one, multiple, or all the functional units on a side in a VLIW-execute packet may use the cross-path operand for that side. In the C62x, only one functional unit per datapath per execute packet could access an operand from the opposite register file.
The C62x register files support data types that range from packed 16-bit data through 40-bit fixed-point and 64-bit floating-point data. You can store values larger than 32 bits in register pairs. The C64x register file supports all the C62x data types, packed 8-bit types, and 64-bit fixed-point data types. Packed data types store four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair. Each C64x multiplier can return a result as large as 64 bits, so an extra write port is available from the multipliers to the register file.
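The packed layouts and register pairs described above can be pictured with a few bit-manipulation helpers. This Python sketch is illustrative only: the helper names are hypothetical, and the lane ordering (most significant lane first in the argument list) is an assumption.

```python
MASK32 = 0xFFFFFFFF


def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def pack4(b3, b2, b1, b0):
    """Pack four 8-bit values into one 32-bit word."""
    return ((b3 & 0xFF) << 24) | ((b2 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b0 & 0xFF)


def to_register_pair(value):
    """Split a value of up to 64 bits into (high, low) 32-bit halves."""
    value &= (1 << 64) - 1
    return (value >> 32) & MASK32, value & MASK32


def from_register_pair(high, low):
    """Rebuild the wide value held in a (high, low) register pair."""
    return ((high & MASK32) << 32) | (low & MASK32)
```

For example, a 40-bit value occupies the full low register and the bottom 8 bits of the high register, and four 16-bit values fill a pair as two `pack2` words.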
The C6000 families support no separate X- and Y-memory spaces. Instead, they provide a single data memory with two paths, 64 bits wide on the C64x and 32 bits wide on the C62x, for loading data from memory to the register banks. Two other 32-bit paths (64 bits for C64x) store register values to memory. A 32-bit address bus supports these datapaths. The C64x can also access words and double words at any byte boundary using nonaligned loads and stores; the C62x requires alignment on 32- or 64-bit boundaries. A 32-bit address bus addresses the program memory, but the single datapath is 256 bits wide. This width allows the C62x to fetch, but not necessarily execute, eight 32-bit instructions per cycle. TI calls this group of eight instructions a fetch packet. The C62x architecture does not allow execute packets, the groups of instructions that issue in parallel, to cross fetch-packet boundaries, resulting in compiler-generated no-operation (NOP) instructions to pad fetch packets. The C64x architecture resolves this "code-bloat" issue with instruction packing in the instruction-dispatch unit. This approach removes execute-packet-boundary restrictions and eliminates all filler NOP instructions.
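The cost of the boundary restriction is easy to estimate. The following Python sketch counts the filler NOPs needed when a sequence of execute packets, given only by their sizes, is laid out in eight-slot fetch packets and no execute packet may straddle a boundary. The function name and the greedy layout are assumptions for illustration; a real compiler or assembler may arrange code differently.

```python
FETCH_PACKET_SLOTS = 8  # eight 32-bit instructions per 256-bit fetch


def padding_nops(execute_packet_sizes, slots=FETCH_PACKET_SLOTS):
    """Count filler NOPs when execute packets may not cross fetch packets.

    Padding after the final execute packet is not counted.
    """
    used = 0  # slots already filled in the current fetch packet
    nops = 0
    for size in execute_packet_sizes:
        if not 1 <= size <= slots:
            raise ValueError("execute packet size must be 1..%d" % slots)
        free = slots - used
        if size > free:
            # The packet does not fit: pad the rest and start a new fetch packet.
            nops += free
            used = 0
        used = (used + size) % slots
    return nops
```

With execute packets of 5, 4, 3, and 6 instructions, this layout needs four filler NOPs. With the restriction removed, as the article describes for the C64x, the same code needs none.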
The CPU can execute one to eight instructions per cycle, but data dependencies, instruction latencies, and resource conflicts limit optimal performance. Multiple execute packets allow fully parallel, fully serial, or parallel/serial combinations; therefore, eight serial instructions require the same code size as eight parallel instructions. The compiler and assembly optimizer play big roles in establishing the sequence of instructions for the C6000 to execute. The programming tools link instructions in a fetch packet by the least significant bit of an instruction. If the bit is set, the C6000 executes the instruction in parallel with the subsequent instruction.
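The linking rule can be sketched in a few lines of Python. The function below groups the instruction words of a fetch packet into execute packets by testing the least significant bit of each word, which TI documentation calls the p-bit. The function name is hypothetical, the instruction words in the usage note are arbitrary values rather than real opcodes, and the handling of a trailing set bit is an assumption made to keep the sketch total.

```python
def split_execute_packets(fetch_packet):
    """Group instruction words into execute packets using bit 0.

    A set bit 0 means "execute in parallel with the next instruction";
    a clear bit 0 ends the current execute packet.
    """
    packets = []
    current = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:
            packets.append(current)
            current = []
    if current:
        # Assumption: a trailing set bit simply closes the last group.
        packets.append(current)
    return packets
```

For the words `[0x11, 0x21, 0x30, 0x40, 0x51, 0x61, 0x71, 0x80]`, the function returns three execute packets of three, one, and four instructions. Eight words with the bit clear would give eight serial packets in the same 256 bits, which is why serial and parallel code occupy the same space.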
The assembly optimizer performs dependency checking and parallelizes instructions. Therefore, the code executes as programmed on independent functional units, which eliminates the need for core features such as out-of-order execution or dependency-checking hardware.
Two devices from these families, the C6211 and C6711, are the industry's first DSPs with L1 and L2 on-chip cache memory. The C6211 incorporates a two-level cache structure with 4-kbyte Level 1 program and data caches. The internal Level 2 cache memory is a unified 64-kbyte data and instruction RAM. The C6211 also includes a 16-channel DMA controller that tracks 16 independent transfers and allows you to link each channel to a subsequent transfer.
The C6202, C6203, and C6204 have a 32-bit expansion bus that replaces the 16-bit host-port interface and complements the external memory interface (EMIF). The second bus for I/O devices reduces the loading on the EMIF and increases data throughput. The EMIF and the expansion bus are independent of each other, allowing the CPU to perform concurrent accesses to both ports.
Addressing modes—The C6000 performs linear and circular addressing. However, unlike most other DSPs that have dedicated address-generation units, the C6000 calculates addresses using one or more of its functional units.
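Circular addressing amounts to updating a pointer so that it wraps inside a fixed block. The Python sketch below assumes a power-of-two block size and a block aligned to that size; the function name `circular_next` is hypothetical, and the sketch does not model how the C6000 configures block sizes or selects registers.

```python
def circular_next(addr, step, block_size):
    """Advance addr by step, wrapping inside an aligned power-of-two block."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")
    mask = block_size - 1
    base = addr & ~mask              # start of the aligned block
    offset = (addr + step) & mask    # position inside the block after wrap
    return base | offset
```

Stepping forward by 4 from address 0x1006 in an 8-byte block gives 0x1002, and stepping back by 2 from 0x1000 gives 0x1006, so a delay line or FIR buffer never needs an explicit bounds test.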
Special instructions—All C6000 processors conditionally execute all instructions, a method of reducing branching and, therefore, keeping the pipeline flowing. On the C64x, the MPYU4 instruction performs four 8×8-bit unsigned multiplies. The ADD4 instruction performs four 8-bit additions. All functional units can perform dual 16-bit addition/subtraction, compare, shift, minimum/maximum, and absolute-value operations. The M units, and four of the six remaining functional units, support quad 8-bit addition/subtraction, compare, average, minimum/maximum, and bit-expansion operations. TI also added instructions that operate directly on packed 8- and 16-bit data. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms, such as binary morphology, image-metric calculations and encryption algorithms.
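The packed 8-bit operations named above can be emulated in Python to show what each lane computes. In this sketch, `add4` and `mpyu4` are illustrative functions, not TI intrinsics. It assumes that each 8-bit sum wraps modulo 256 with no carry into the neighboring lane and that the four 16-bit products are returned packed into one 64-bit value with lane 0 in the least significant position.

```python
def add4(x, y):
    """Four independent 8-bit additions on packed 32-bit operands."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift  # wrap, no carry between lanes
    return result


def mpyu4(x, y):
    """Four unsigned 8x8-bit multiplies; products packed as 4 x 16 bits."""
    result = 0
    for lane in range(4):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= (a * b) << (16 * lane)  # each product fits in 16 bits
    return result
```

Because the largest product, 255 × 255, fits in 16 bits, the four results never overlap in the 64-bit output.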
The C64x's branch-and-decrement (BDEC) and branch-on-positive (BPOS) instructions combine a branch instruction with the decrement and test positive of a destination register, respectively. Another instruction helps reduce the number of instructions needed to set up the return address for a function call.
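A loop built on a combined branch-and-decrement can be modeled as follows. This Python sketch assumes the branch is taken and the counter decremented when the counter is greater than or equal to zero, which is one reading of the description above. The names `bdec` and `run_loop` and the counter preload of `iterations - 2` are illustrative assumptions that follow from that reading.

```python
def bdec(counter):
    """Model of branch-and-decrement: returns (branch_taken, new_counter)."""
    if counter >= 0:
        return True, counter - 1
    return False, counter


def run_loop(iterations, body):
    """Run body(index) `iterations` times using a single bdec per pass."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    counter = iterations - 2  # preload so the body runs `iterations` times
    index = 0
    while True:
        body(index)
        index += 1
        taken, counter = bdec(counter)
        if not taken:
            break
    return index
```

The point of the combined instruction is visible in the loop: one operation replaces the separate decrement, compare, and branch that would otherwise close each pass.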
Dual 16-bit arithmetic on six of the eight functional units combines with a bit-reverse (BITR) instruction to improve FFT cycle counts by a factor of two. The Galois field-multiply instruction (GMPY4) provides a performance boost over the C62x for Reed-Solomon decoding using the Chien search. Special average instructions improve the performance of motion compensation by a factor of seven on a per-clock-cycle basis versus the C62x. The quad-absolute-difference instruction bolsters motion-estimation performance by a factor of 7.6 on a per-clock-cycle basis for an 8×8-bit minimum-absolute-difference (MAD) computation. The C64x provides data packing and unpacking operations to allow sustained high performance for the quad 8-bit and dual 16-bit hardware extensions. Unpack instructions prepare 8-bit data for parallel 16-bit operations. Pack instructions return parallel results to output precision, including saturation support.
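The unpack, compute, and pack sequence can be illustrated with a small Python sketch. Four unsigned 8-bit pixels are unpacked into wider lanes, an offset is added at the wider precision, and the results are packed back to 8 bits with saturation. The function names are hypothetical, and the lane ordering and the clamping range of 0 to 255 are assumptions for illustration.

```python
def unpack_u8(word):
    """Unpack four unsigned 8-bit lanes (lane 0 = least significant byte)."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def pack_sat_u8(lanes):
    """Pack four wider values back to 8 bits each, saturating to 0..255."""
    word = 0
    for lane, value in enumerate(lanes):
        clamped = max(0, min(255, value))
        word |= clamped << (8 * lane)
    return word


def add_offset(word, offset):
    """Brighten or darken four packed pixels without wraparound."""
    return pack_sat_u8([pixel + offset for pixel in unpack_u8(word)])
```

Adding 100 to the pixels in 0xF0801000 yields 0xFFE47464: the brightest pixel clamps at 0xFF instead of wrapping around to a dark value, which is the purpose of saturation on the pack step.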
Support—The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expressdsp/index.htm). The Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C6000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. The assembly optimizer simplifies assembly-language programming and automatically schedules and parallelizes instructions from serial, inline assembly code. The assembler reads straight-line code without regard to registers or functional units and does the resource assignment. Deterministic operation allows the debugger to lock-step through the code. The debugger performs code profiling to determine the amount of time the processor spends in various portions of the code. Free tools are available for a 30-day trial on the Web at www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm.
Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details. TI offers hardware-emulation boards and starter kits.