Customized processors: Have it your way

You can now tune processors to suit your needs. These options range from the addition of new instructions to the inclusion of a hardware accelerator.

By Markus Levy, Technical Editor -- EDN, January 7, 1999

Hold the pickles and the lettuce and throw in a multiply-accumulate (MAC) instruction, a barrel shifter, and a communications coprocessor. Just like the famous fast-food restaurant, processor vendors are making it easier for you to have it your way. In the embedded industry, customization at the processor level plays a critical role in allowing you to differentiate your product and hit the necessary performance/price points to be successful. It wasn’t that long ago that customized processors were only for the rich and famous and for those willing to wait during the lengthy development cycles. Today, companies such as ARC, Philips, and others make it almost as easy as a mouse click to allow you to have it your way.
AT A GLANCE
*A processor’s ability to include custom instructions may significantly help you increase the performance and decrease the code size of your application.
*Tools, such as the ARC wizard, make it simple for you to add optional instructions to a standard core.
*Adding new instructions may also require significant upgrades to your software-development tools.

Depending on the capabilities that you are after, vendors offer a variety of options to help you tune their processors. Less traditional methods of processor customization include the addition of new, and sometimes proprietary, instructions and execution units. More traditional methods include adding a coprocessor or peripherals to increase performance and deliver hardware-specific functionality for your design. Each option has trade-offs based on a list of criteria that includes performance, code density, power consumption, design complexity, development-tool support, development time, and cost.
Determining the performance you really need
Ideally, a compiler can analyze a program and determine the optimal instruction-set architecture for that program. For example, the code for a JPEG decompression algorithm would run best on an instruction-set architecture that supports vector-style operations, but bit-manipulation instructions would benefit an encryption algorithm. This methodology works well if you’re building a processor from scratch, but, most likely, you will start off with a commercially available core. With the addition of peripherals and memory, this core may be all that you need. You can analyze the behavior of your application code by running it through a code profiler, such as the Run-Time Analysis (RTA) tool from Diab Data (www.diabdata.com). Such profilers point out the "hot spots" in your code, indicating where you should concentrate on optimizing your design. You may determine that the application requires a coprocessor running in parallel to the core to meet the performance goals. In other cases, you may be able to reduce an entire subroutine to a few custom instructions.
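As a rough illustration of turning profiler output into a design decision, the sketch below ranks routines by their share of execution cycles and then applies Amdahl's law to estimate what speeding up the hottest routine buys overall. Amdahl's law is a standard rule of thumb, not something this article cites, and the routine names and cycle counts are invented for illustration; they do not come from the RTA tool.

```python
def rank_hot_spots(cycles_by_function):
    """Return (name, share_of_total) pairs, largest share first."""
    total = sum(cycles_by_function.values())
    shares = [(name, cycles / total) for name, cycles in cycles_by_function.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical profile: cycle counts per routine (made-up numbers).
profile = {"fir_filter": 600, "packet_io": 300, "init": 100}
hottest, share = rank_hot_spots(profile)[0]
print(hottest, share, overall_speedup(share, 4.0))
```

The point of the estimate is that even a large local gain is capped by the share of time the hot spot occupies, which helps you decide between a few custom instructions and a full coprocessor.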

The instruction-set architecture of a processor defines its operational characteristics and is typically fixed due to the vendor’s investment in tools and application code. This fact implies that vendors must allow for scalability when they design their processor architectures. Motorola’s MCore architecture is one example, in which scalability comes from the 11.2% of the operating-code space that the instruction-set architecture leaves unimplemented. Hitachi’s SH-DSP is another example of scalability; the SH-DSP’s designers expanded the standard core’s 16-bit operating codes to 32 bits to accommodate the DSP. The BOPS ManArray, far from being a traditional processor, uses an extendable instruction-set architecture that includes flow-control, load/store, arithmetic/logical, and pluggable instructions (Figure 1). The pluggable-instruction-set capability makes it possible to develop application-specific cores with large groups of new instructions for optimized capabilities in targeted products. With new specialized instructions, you may also have to develop new execution units, which plug into the core design.

Although these semiconductor vendors and a few others have built-in instruction-set extensibility, none of their products is easier to modify than the ARC core. ARC licenses its 32-bit RISC architecture as a set of technology-independent VHDL source files, which you can synthesize using the target ASIC vendor’s library. ARC designed its VHDL source code with extensibility and synthesis in mind. Furthermore, ARC provides its PC-based ARC configuration wizard, which allows you to easily select the instructions, cache structure, and other architecture features. Metaware (www.metaware.com) supplies software-development tools for the ARC device, including an extensible assembler that supports any ARC extensions that you implement.

One of ARC’s licensees, BlazeNet (www.blaze-net.com), has developed a switch that forwards network traffic from one port to another. The design uses a combination of firmware and hardware contained in a patented ASIC chip set called BlazeFire. The traffic forwarding is based on the Layer2/MAC or Layer3/Network address and also on information contained in the higher layers of each packet. When the BlazeFire chip set receives packets, it segments them into formatted cells and transfers them over a high-speed internal bus to the second chip, or Queue Manager (QM). The QM shuffles data to and from memory and manages all low-level queuing operations. The QM also orders packet transmissions using complex scheduling algorithms and transfers data from memory for reassembly and transmission onto the wire.

The system passes QM-managed traffic to and from BlazeFire’s Relay Engine, which contains hardware and firmware to perform high-speed packet classification. The Relay Engine classifies a packet by examining its contents using complex pattern recognition and analysis. Next, the Relay Engine binds the packet to a traffic flow to forward it to the appropriate output queue. Although the hardware directly performs the most frequently executed operations, the processor must still operate on every packet, so performance is critical.

To help hit the performance requirements, BlazeNet engineers extended the ARC instruction set and condition codes. In particular, they used ARC’s wizard to select multiply, barrel-shifter, and normalize instructions. Normalize is useful for speeding the floating-point calculations in the algorithms that control packet-transmission scheduling and bandwidth management. You can add these instructions with a mouse click (Figure 2). ARC’s engineers designed the core with these optional instructions in mind, so when you include them, they can take advantage of resources within the ARC’s ALU or other functional units. The flexibility of the ARC architecture also allows you to add your own instructions that are unavailable through the wizard, but in that case you (or the ARC design team) may have to re-engineer portions of the core, such as the decoding unit, and perhaps even add execution units.
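To see why a normalize instruction matters, consider what software has to do without one. The sketch below uses a common definition of normalize: count how far a 32-bit two's-complement value can be shifted left before the sign bit would change. The article does not spell out ARC's exact result convention, so treat this definition and the function name as assumptions.

```python
def norm32(value):
    """Count the redundant sign bits of a 32-bit two's-complement value,
    i.e. how many left shifts it can take without overflowing."""
    value &= 0xFFFFFFFF
    # Fold negative values onto positive ones so only leading zeros matter.
    if value & 0x80000000:
        value = ~value & 0xFFFFFFFF
    count = 0
    mask = 0x40000000  # start just below the sign bit
    while mask and not (value & mask):
        count += 1
        mask >>= 1
    return count


# A software float routine would shift the mantissa by this amount.
mantissa = 0x00001234
shift = norm32(mantissa)
print(shift, hex((mantissa << shift) & 0xFFFFFFFF))
```

The loop can run up to 31 iterations per call, which is the kind of subroutine that a single hardware instruction replaces.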

Figure 2 also shows the gate-saving advantages of omitting the optional instructions in the default implementation. The lower left corner of the wizard’s graphical user interface shows the estimated number of gates that you need to implement an ARC core. A minimal core with a 2-kbyte cache requires approximately 12,680 gates; throw in a 32X32-bit barrel shifter and a 16X16-bit MAC instruction, and the gate count increases by more than 60%.

ARC’s flexible architecture also allowed BlazeNet engineers to improve their application’s performance by increasing the number of scheduled loads. The deep pipelining of the design imposes a large latency penalty for loads from memory. To hide that latency, the standard ARC core uses "scoreboarded" loads: As long as subsequent instructions do not use the target register of a pending load, the pipeline may continue with other operations and allow the data to independently return to the registers. (Scoreboarding is a method of keeping track of register resources to facilitate delayed loads.) The standard core can post as many as four loads before stalling; the BlazeNet design requires the ARC core to post as many as eight loads. You can access this feature through ARC’s configuration wizard.
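The effect of the scoreboard depth can be seen in a toy model. The sketch below is not a model of the ARC pipeline; it assumes a simple in-order machine that issues one operation per cycle, a fixed load latency, and a stall whenever the scoreboard is full or an operation reads a register whose load is still in flight. The latency of 10 cycles is an arbitrary illustrative value.

```python
def run(program, max_pending, load_latency):
    """Return the cycles needed to issue `program` on a toy in-order pipeline.

    program: list of ("load", reg) or ("use", reg) tuples.
    """
    pending = {}  # register -> cycle at which its load data arrives
    cycle = 0
    for op, reg in program:
        # Retire loads whose data has already returned.
        pending = {r: t for r, t in pending.items() if t > cycle}
        if op == "load":
            if len(pending) >= max_pending:
                # Scoreboard full: stall until the oldest load returns.
                cycle = min(pending.values())
                pending = {r: t for r, t in pending.items() if t > cycle}
            pending[reg] = cycle + load_latency
        elif reg in pending:
            # Consumer needs a register that is still in flight: stall.
            cycle = pending.pop(reg)
        cycle += 1
    return cycle


# Eight independent loads followed by eight consumers.
program = [("load", r) for r in range(8)] + [("use", r) for r in range(8)]
print(run(program, 4, 10), run(program, 8, 10))
```

With these assumptions, the deeper scoreboard finishes the same program in fewer cycles because the fifth load no longer waits for the first to return.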
Real custom instructions
Although Philips, with its REAL (Reconfigurable Embedded DSP Architecture Low power/Low cost) DSP, doesn’t offer the same user friendliness for adding instructions as ARC does, the company implements a powerful mechanism that allows you to optimize the instruction set for your application. The REAL DSP, with 16-bit-instruction operating codes, contains dual multipliers and two 40-bit (or four 16-bit) ALUs. The DSP’s decoder needs approximately 52 bits to select four operations and to use the dual multipliers, each with two input operands and one result register. The question is: How do you get this level of control with only a 16-bit instruction?

Philips has devised a way for you to define an application-specific instruction (ASI) that controls a high level of parallelism using a 16-bit instruction. An on-chip look-up table (RAM or ROM) contains the ASIs, each of which is a 96-bit, very-long-instruction-word-like instruction. A 16-bit instruction contains an index into this table to activate the ASI operations (Figure 3). If the silicon implementation of the DSP’s look-up table is in RAM, you can download sets of ASIs to the chip while the application is running and customize the DSP core on the fly.
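The indexing scheme can be sketched in a few lines. The 96-bit width and the 16-bit index-carrying instruction come from the description above, and the 256-entry limit is the figure Philips quotes for programmable ASIs; the bit layout of the 16-bit instruction is not given, so the prefix byte and the position of the index below are purely hypothetical.

```python
ASI_BITS = 96
TABLE_SIZE = 256
ASI_PREFIX = 0xA5  # hypothetical marker byte; the real encoding is not given


class AsiTable:
    """Toy look-up table that maps a short index to a wide control word."""

    def __init__(self):
        self.words = []

    def define(self, wide_word):
        """Store a 96-bit word and return its index, reusing duplicates."""
        if not 0 <= wide_word < (1 << ASI_BITS):
            raise ValueError("ASI word must fit in 96 bits")
        if wide_word in self.words:
            return self.words.index(wide_word)
        if len(self.words) >= TABLE_SIZE:
            raise OverflowError("look-up table is full")
        self.words.append(wide_word)
        return len(self.words) - 1

    def encode(self, index):
        """Build the 16-bit instruction that selects table entry `index`."""
        return (ASI_PREFIX << 8) | index

    def expand(self, instr16):
        """Decode a 16-bit instruction back into its 96-bit control word."""
        if instr16 >> 8 != ASI_PREFIX:
            raise ValueError("not an ASI instruction")
        return self.words[instr16 & 0xFF]


table = AsiTable()
index = table.define(0x0123456789ABCDEF01234567)
print(hex(table.encode(index)), hex(table.expand(table.encode(index))))
```

Reloading `words` at run time corresponds to the RAM-based case, in which a new set of ASIs changes what the same 16-bit instructions do.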

The first step toward implementing an ASI is to use profiling tools to determine the performance bottlenecks in your algorithm. Next, you would remap those bottlenecks onto the REAL DSP’s dual datapath to see what resources the DSP would use. You may have to reorder data to take advantage of the dual multipliers. This task may consist of switching the data from a single-sample to a block-based process. Next, you would use the parallel ASI syntax to map the new algorithm on the ASIs. You write a block FIR-filter algorithm that uses an ASI as follows:
asi a0+=p0, a1+=p1, p0=x0*y0, p1=x1*y0, x0=*px0++, y0=*py0++;
Notice that, except for the asi key word, the syntax is similar to that of a regular instruction. Additionally, you have many more fields to specify. This FIR-filter example uses two 40-bit ALU operations, two multiply operations, and two memory accesses. The assembler, linker, and instruction-set simulator account for the ASIs. You have to specify only the key word, asi, followed by all the operations that the core would execute in parallel. The assembler/linker then checks for duplicate ASIs, translates the instructions to an ASI look-up table, and, if needed, downloads them to the DSP. You can program as many as 256 ASIs. Philips engineers claim that the ASIs allow the REAL DSP to execute a G.726 algorithm in 488 cycles, down from 1600 cycles.
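The switch from single-sample to block-based processing is what lets one coefficient fetch feed both multipliers. The sketch below shows the idea in plain Python rather than REAL DSP assembly: each pass through the inner loop fetches one coefficient and updates two accumulators, one per output sample. How the hardware supplies the second data operand (x1 in the ASI) is not shown in the article, so here it is simply taken as the neighboring input sample, which is an assumption.

```python
def fir_reference(x, h):
    """One output at a time: y[n] = sum over k of h[k] * x[n - k]."""
    taps = len(h)
    return [sum(h[k] * x[n - k] for k in range(taps))
            for n in range(taps - 1, len(x))]


def fir_block2(x, h):
    """Two outputs per pass; each coefficient feeds two multiply-accumulates."""
    taps = len(h)
    out = []
    n = taps - 1
    while n + 1 < len(x):
        a0 = a1 = 0
        for k in range(taps):
            y0 = h[k]                # one coefficient fetch ...
            a0 += x[n - k] * y0      # ... feeds the first accumulator
            a1 += x[n + 1 - k] * y0  # ... and the second accumulator
        out += [a0, a1]
        n += 2
    if n < len(x):
        # Odd number of outputs: finish the last one the ordinary way.
        out.append(sum(h[k] * x[n - k] for k in range(taps)))
    return out


samples = list(range(1, 10))
coeffs = [1, 2, 3]
print(fir_block2(samples, coeffs) == fir_reference(samples, coeffs))
```

The block version produces the same results as the reference filter while halving the number of coefficient fetches, which is the property the dual datapath exploits.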

The REAL DSP’s VHDL-synthesis model allows designers to add application-specific execution units (AXUs) at specified points in the datapath or in the address-computation units. An AXU can use standard DSP resources, such as a limited set of DSP registers. Philips designers built hooks in the DSP-instruction decoder to allow DSP instructions to control an AXU. To integrate an AXU, you must modify the synthesis and timing scripts. Philips supplies a stand-alone verification suite that tests the AXU specification and the way it interacts with the core. Users define AXUs or select them from library modules. For example, Philips has AXUs that include a 40-bit barrel shifter, a normalization unit, and a division-support unit. A few reserved bits within the ASI bit patterns control AXU hardware. The assembler takes care of mapping the AXU commands.
New spins on coprocessor interfaces
The AXUs for Philips’ REAL DSP provide one example of how you can add horsepower to a processor. You can also increase performance by using a coprocessor. Although the concept of a coprocessor and its associated coprocessor interface is not new, several companies have developed new spins to ease your development efforts. For example, Motorola has implemented a hardware-accelerator interface on its MCore core, which allows you to replace a software subroutine call with a hardware call. The software developer would write the program in the normal way, including the jump to the subroutine. But, at link time, you would tell the linker/loader to use the hardware accelerator. This task employs hardware-accelerator primitives, such as HEXEC, HRET, HCALL, HLOAD, and HSTORE.
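The link-time choice can be pictured as a symbol table in which an accelerator entry overrides the software routine of the same name, while the calling code stays untouched. This is only an analogy in Python: it does not model MCore's hardware-accelerator primitives, and the checksum routine and accelerator class are hypothetical.

```python
def checksum_software(data):
    """Ordinary software subroutine (hypothetical example workload)."""
    return sum(data) & 0xFFFF


class ChecksumAccelerator:
    """Hypothetical stand-in for a hardware block with the same contract."""

    def __init__(self):
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        return sum(data) & 0xFFFF


def link(software_symbols, accelerator_symbols):
    """Build the final symbol table; accelerator entries override software."""
    table = dict(software_symbols)
    table.update(accelerator_symbols)
    return table


def application(symbols, data):
    # The caller is written once and never changes.
    return symbols["checksum"](data)


accel = ChecksumAccelerator()
software_only = link({"checksum": checksum_software}, {})
accelerated = link({"checksum": checksum_software}, {"checksum": accel})
print(application(software_only, b"EDN"), application(accelerated, b"EDN"))
```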

There is a 32-bit instruction path and a 32-bit datapath between MCore and the hardware accelerators. Hardware accelerators have direct access to MCore’s registers R4 to R7; there are even MCore instructions, loadquad and storequad, to expedite your access to these registers. The HLOAD/HSTORE instructions, which move data into and out of the hardware accelerator, optionally allow you to simultaneously copy this data into the MCore’s general-purpose registers. The hardware accelerators can snoop the MCore registers to determine when an update occurs. This feature allows MCore to call the hardware accelerator and tell it to wait for a data value in an MCore register. Then, during a write-back, the core would automatically copy the data into the hardware accelerator.

Lexra’s LX-4080 custom-engine interface demonstrates another interesting hardware-accelerator-like approach. The LX-4080 is a MIPS R3000-class processor that fits in 50% of an Altera (www.altera.com) Flex 10 200E FPGA and runs at 33 MHz. (Note that Lexra bases the LX-4080 on the MIPS I instruction-set architecture. At this time, the company is not a MIPS licensee, and the two sides still disagree on patent infringement.) Lexra’s custom-engine interface allows the company’s customers to add as many as 16 instructions to the MIPS I instruction base. These custom instructions, such as MAC and parity check, occupy unreserved, unused operating codes. The custom-engine-interface block provides the necessary logic to support integration directly into the LX-4080 pipeline and to expand the functions of the core ALU (Figure 4). This approach is easier than modifying the core itself because it avoids dealing with pipeline interlocking and critical timing parameters.

The custom-engine-interface block decodes the instruction operating code, generates the pipeline controls, and performs the custom-engine datapath functions. When the core detects one of the 16 special operating codes, it passes the OPCODE_SF to the custom-engine module for decoding. If the function in the custom-engine datapath has a single-cycle latency, the integration requires no connections to the pipeline controls. However, a multiple-cycle instruction requires pipeline and output controls. For example, the logic must set a counter to a value based on the instruction’s latency. The counter value is important in ensuring data integrity. The custom engine uses the counter value to determine whether the custom engine is currently busy and then asserts the pipeline-stall signal. A custom-engine-hold (CEHOLD) signal is an input to the custom-engine block that allows the CPU to stall the pipe during a cache miss, for example. Upon completion of the custom-engine instruction, the data returns to the CPU via the custom-engine-result (CE_RES) bus.
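The role of the latency counter can be sketched with a small model. It assumes that issuing an instruction loads the counter with the instruction's latency minus one, that the counter decrements once per cycle, and that a nonzero counter means "busy" and asserts the stall. The class, the opcode names, and the latencies are illustrative, not Lexra's signal-level design, and the CEHOLD input is left out.

```python
class CustomEngine:
    """Toy custom engine that tracks multicycle instructions with a counter."""

    def __init__(self, latencies):
        self.latencies = latencies  # opcode name -> latency in cycles
        self.counter = 0

    @property
    def busy(self):
        # A nonzero counter means an earlier instruction is still executing.
        return self.counter > 0

    def issue(self, opcode):
        if self.busy:
            raise RuntimeError("issued while the engine is busy")
        # A single-cycle instruction leaves the counter at zero.
        self.counter = self.latencies[opcode] - 1

    def tick(self):
        if self.counter:
            self.counter -= 1


def cycles_to_run(engine, opcodes):
    """Count cycles, stalling the pipeline whenever the engine is busy."""
    cycle = 0
    for opcode in opcodes:
        while engine.busy:  # pipeline-stall signal asserted
            engine.tick()
            cycle += 1
        engine.issue(opcode)
        cycle += 1
    while engine.busy:      # let the last instruction drain
        engine.tick()
        cycle += 1
    return cycle


latencies = {"mac": 3, "parity": 1}  # hypothetical latencies
print(cycles_to_run(CustomEngine(latencies), ["mac", "parity", "mac"]))
```

Single-cycle instructions never make the engine busy in this model, which mirrors the statement that they need no connection to the pipeline controls.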


Markus Levy, Technical Editor
You can reach Technical Editor Markus Levy at 1-916-939-1642, fax 1-916-939-1650, or markus.levy@worldnet.att.net.


For more information:

For more information on products such as those discussed in this article, use EDN's InfoAccess service. When you contact any of the following manufacturers directly, please let them know you read about their products in EDN.

ARC Cores Ltd
Edgware, Middlesex, UK
+44 0 181 951 6123
www.arccores.com

BOPS Inc
Santa Clara, CA
1-408-327-8765
www.bops.com

Hitachi America Ltd
Brisbane, CA
1-800-285-1601
www.hitachi.com/semiconductor

Lexra Inc
Waltham, MA
1-781-899-5799
www.lexra.com

Motorola
Austin, TX
1-800-765-7795, ext 604
www.mot.com

Philips Semiconductors
Sunnyvale, CA
1-408-991-3518
www.semiconductors.philips.com

© 2012 UBM Electronics. All rights reserved.