

Design Feature: October 12, 1995

Streamlined custom processors: when stock performance won't cut it

Markus Levy,
Technical Editor

Designing your own µP, µC, or DSP core-based system on a chip gives you access to core interfaces that wouldn't be available in a standard, off-the-shelf device. To make the most of your custom designs, familiarize yourself with these interfaces and with the various hardware emulators/debuggers.

What you really want is your own custom processor. You want to combine a µP, µC, or DSP core processor with a set of peripherals, memory devices, and your own proprietary logic. Then you want to apply that combination to an ASIC technology to create a custom, integrated "system on a chip." Your goal is a processor that is everything you want it to be.

Unfortunately, you'll quickly learn that achieving your goal is not only difficult, but expensive. You may have to qualify for a vendor's core-based development program for which you typically have to fork over $125,000 in NRE costs and commit to building 100,000 units. The cost gets even higher if the ASIC vendor does your design work. The up side, though, is that you can reduce the cost of each system you build.
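To put the entry fee in per-unit terms, the short Python sketch below spreads the NRE charge over the committed volume. The function names are illustrative. Only the $125,000 and 100,000-unit figures come from the text, and the per-unit saving in the usage note is a made-up placeholder, not a vendor number.

```python
def nre_per_unit(nre_dollars: float, units: int) -> float:
    """Spread a one-time NRE charge evenly over a production run."""
    if units <= 0:
        raise ValueError("units must be positive")
    return nre_dollars / units


def break_even_units(nre_dollars: float, saving_per_unit: float) -> float:
    """Units needed before a per-unit saving repays the NRE charge."""
    if saving_per_unit <= 0:
        raise ValueError("saving_per_unit must be positive")
    return nre_dollars / saving_per_unit


# Figures quoted above: $125,000 NRE and a 100,000-unit commitment.
amortized = nre_per_unit(125_000, 100_000)
print(f"NRE adds ${amortized:.2f} to each unit")
```

With these numbers, the NRE charge alone adds $1.25 to every unit. If a custom part saved a hypothetical $2.50 per system, break_even_units(125_000, 2.50) shows that the charge would be repaid after 50,000 systems.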

A custom processor can reduce system costs in several ways. For one thing, the smaller process geometries now available allow higher integration in silicon and reduce the number of board-level components. Also, a custom processor may cost less than an off-the-shelf version, because you can optimize the custom processor's feature set for the target application. And, even if you don't reduce cost directly, you may reduce it indirectly through savings in board real estate. For many applications, board size may be the key factor in determining design feasibility.

Furthermore, higher integration most likely helps reduce power consumption, which directly affects battery life and, potentially, lets you use a smaller power supply. Higher integration also adds the benefit of increased reliability. For example, you can reduce to one component a board-level design that otherwise would require 20 components with 16 to 64 pins requiring, conservatively, 350 connections. This reduction would yield a several-orders-of-magnitude increase in product reliability.

Fortunately, it is not only possible, but feasible to design your own processor. As the number of IC manufacturers listed in the box "For more information..." indicates, there are plenty of companies to turn to for help. Developing a custom processor is no small feat, though. To begin, you have to sort through a wide variety of cores and peripherals. After that comes the difficult task of connecting all the pieces.

You can consider the step of connecting your design's pieces from the electrical-interface angle or from the design-methodology angle. Analyze the approaches vendors have developed for making their cores easier to use and for maximizing the performance of your system on a chip. Examine the interfaces that connect the cores to the peripherals, coprocessors, memory, and computational units that become part of the CPU itself.

In addition, you must consider emulator and hardware-debugging support for cores. Core-based devices have neither standard pinouts nor standard functionality: The more integrated a design is, the fewer the pins that have to come out of it. This fact makes it impossible to develop an emulator that uses a probe or pod to replace the target processor. So, you must consider the emulators and debuggers for custom-processor designs.

As you begin to assemble your custom processor, note that most of the pieces, or modules, are reusable, and you can interchange them among designs (see box, "Designing good cores and peripherals"). The ability to reuse pieces reduces your design time. Also, integrating system functions onto one piece of silicon helps to maintain the confidentiality of your design: Competitors must reverse-engineer your device to determine your "system" components.


Determining flexibility

The ingredients of a custom processor—core, peripherals, megafunctions, and so forth—come in three formats: hard macros, soft macros, and "parameterizable" modules. You cannot alter hard macros. They are physically laid out with fixed boundaries and predefined test schemes. Soft macros are available in either a behavioral or gate-level hardware-description-language (HDL) format or as netlists. You generally cannot alter soft macros, and they are not physically laid out. Parameterizable modules are soft modules that you can alter by inserting predefined parameters. With parameterizable modules, the definition of the parameters generally includes the definition of the test scheme. On-chip memory, such as cache, RAM, and ROM, is typically available as parameterizable modules, as are most peripherals. Below are examples of the parameters of a UART and a counter/timer.

UART

Counter/timer


What makes a core "good"?

As EDN Technical Editor Jim Lipman says, "There's no such thing as a bad core, only a misunderstood one." A good core is one that was originally designed to be a core. A core's ability to break down into modules usually indicates whether it was designed to be a core; the components that constitute the core are a subset of what you use to make the entire chip. A good core contains nothing except what everyone using the core needs, and it has industrywide acceptance.

For example, Motorola's ColdFire derives from the original 68000, which is prevalent in the market—generally a good indication that a wide range of tools supports the core. These tools range from hardware-development tools to compilers to electronic-design-automation tools.


LSI Logic's CoreWare

The above definition qualifies LSI Logic's CoreWare system-on-a-chip design as a "good" core. The design is based on the company's MiniRISC family of µP cores, which are MIPS R4000 derivatives. The first three MiniRISC members, the CW4001, CW4010, and the CW4100, provide a 25- to 250-MIPS performance range. The CW4001 comprises a three-stage pipeline with a unified instruction and data bus; the CW4010 is a superscalar processor that has a 64-bit memory interface. The device simultaneously distributes two instructions among five independent execution units. The CW4100 is similar to the CW4010, except that the CW4100 has a 128-bit memory interface. LSI Logic also licenses the DSP Group's Oak core as encrypted Verilog, VHDL, or Motive static-timing-model formats.

Looking Ahead
Custom processor development is spreading like wildfire, and ASIC vendors are busy building their macro libraries. Vendors are also focusing on chip-testing methods, such as built-in self-test. Other tidbits of things to come include:
  • IBM is waiting for an opportunity to produce a NexGen 586 core—if and when the industry is ready. In 1996, the company will make its MWave DSP available as a core.
  • Oki will add its nX 66K core to the QuickCore program. The company is looking at DSP and other peripherals for applications such as telecomm.
  • Motorola will add PowerPC and DSP to its FlexCore offerings. The company will also publish a spec for either the Kbus interface or controller to allow customers to design proprietary devices that will attach to the Kbus.

The MiniRISC cores come as hard macros and comprise only a CPU-execution unit. You can configure all other features, such as on-chip memory size, cache type, bus interfaces, DRAM controllers, memory management, peripherals, and coprocessors. LSI has developed several types of on-chip interfaces for linking these features. MiniRISC processors directly support two high-performance interfaces: the CPU Bus (CBus) and an arithmetic, or computational bolt-on (CBO), interface. You must always use the CBus, which provides the main communication channel between the core and other main functional blocks, such as the cache, the memory-management unit (MMU), coprocessors (only in the CW4001), a bus-interface unit and cache controller (BBCC), and on-chip RAM and ROM.

Although coprocessors in the CW4001 have unique control signals, the coprocessors share the data bus with the caches. This data-bus sharing creates loading, but you can minimize the loading effects by applying good design techniques, such as letting the device-layout software keep the traces short. Also, a field in the coprocessor instructions limits the MIPS architecture to support of only four coprocessors, thereby limiting the number of potential loads. On the other hand, coprocessors in the CW4010 have dedicated instruction-code and data buses.
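The four-coprocessor limit follows from the width of the instruction field. The Python sketch below assumes the classic MIPS encoding, in which the COPz major opcodes occupy bits 31 to 26 and have the form 0100zz, so two bits select the coprocessor. The article does not give the CW4001's encoding, and the function name is illustrative.

```python
def coprocessor_number(instruction: int):
    """Return the coprocessor number (0-3) addressed by a 32-bit MIPS
    instruction word, or None if the word is not a COPz operation.

    Assumes the classic MIPS encoding: major opcode in bits 31..26,
    with COPz opcodes of the form 0100zz.
    """
    opcode = (instruction >> 26) & 0x3F   # major opcode field
    if (opcode >> 2) == 0b0100:           # upper four bits mark COPz
        return opcode & 0b11              # two bits -> at most 4 units
    return None


print(coprocessor_number(0x40086000))  # a COP0 move: coprocessor 0
print(coprocessor_number(0x00000021))  # an ordinary ALU word: None
```

Because only two bits remain after the fixed 0100 prefix, no encoding exists for a fifth coprocessor, which is what bounds the number of loads on the shared data bus.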

The CBO interface, which is unique to LSI's cores, allows "partial customization" of the CPU core. For example, you can opt to attach a special arithmetic unit, such as a multiply/divide, bit-manipulation, or multiply-accumulate unit, and insert specialized arithmetic instructions into the core (Fig 1).

You may initially be averse to partial customization, because it sounds complex. However, this interface is surprisingly simple, and, whether or not you use it, it does not require you to modify the CPU.



Besides a few control signals, the CBO interface comprises 32-bit, instruction-operand source buses (rs, rt) and destination buses (rd) as well as two subfields of the instruction registers (IR). The CPU presents the IR fields and rs and rt operands at the CBO interface at the beginning of execution of each instruction. The rs and rt operands obtain their values from the CPU's registers, giving the CBO virtual direct access to the CPU's register file. Using the IR fields, the CBO decodes and executes the instruction and writes the results back to the rd bus at the end of the cycle. The CPU handles register-file updates during write-back. Unlike a coprocessor, the CBO incurs no penalty when transferring data to and from the CPU. Furthermore, data hazards and incoherence problems between the CPU and the CBO do not occur, because CBO operations synchronize with those of the CPU. The CBO can also assert a stall-request signal when it needs to stall the CPU's pipeline.
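The handshake described above can be pictured with a small behavioral model. The Python sketch below is only an illustration: the class, the function codes, and the register-file layout are hypothetical, and the model ignores the control and stall-request signals. It shows the division of labor, with the CPU supplying rs and rt from its register file, the bolt-on unit decoding a function field and driving rd, and the CPU performing the write-back.

```python
MASK32 = 0xFFFFFFFF


class MultiplyAccumulateCBO:
    """Toy bolt-on unit: decodes a function code and drives rd."""

    FN_CLEAR, FN_MAC, FN_READ = 0, 1, 2   # hypothetical encodings

    def __init__(self):
        self.acc = 0

    def execute(self, fn: int, rs: int, rt: int) -> int:
        """One instruction: operands arrive on rs/rt, result leaves on rd."""
        if fn == self.FN_CLEAR:
            self.acc = 0
        elif fn == self.FN_MAC:
            self.acc = (self.acc + rs * rt) & MASK32
        elif fn != self.FN_READ:
            raise ValueError(f"unknown CBO function {fn}")
        return self.acc   # value the CPU writes back to register rd


def run(cbo, regs, program):
    """CPU side: read rs/rt from the register file, write rd back."""
    for fn, rd, rs, rt in program:
        regs[rd] = cbo.execute(fn, regs[rs], regs[rt])
    return regs


regs = [0, 3, 4, 5, 0, 0, 0, 0]
program = [(0, 7, 0, 0), (1, 7, 1, 2), (1, 7, 3, 3)]  # clear, 3*4, 5*5
print(run(MultiplyAccumulateCBO(), regs, program)[7])
```

In this model the unit never touches the register file directly. It sees only operand values and returns one result per instruction, which mirrors why the text says no data hazards arise between the CPU and the CBO.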

If you decide not to use some of the optional functional blocks that attach to the core's interface, you must properly tie off the unused input signals to the core. This approach provides feedback to the core that a block is absent.

For example, if your design lacks an MMU and your software tries to execute an MMU instruction, the CPU would raise a trap. (The MIPS architecture accesses the MMU as a coprocessor.) The interrupt routine that handles this trap could be written to emulate the unexecutable instruction. This technique is useful for allowing user code to migrate across silicon products.
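A minimal sketch of this trap-and-emulate idea follows, written in Python as a behavioral model rather than as real handler code. Every name here (the Core class, the "tlb_write" operation, the soft_tlb table) is hypothetical and stands in for whatever MMU instruction and bookkeeping a real design would use.

```python
class CoprocessorUnusable(Exception):
    """Raised when an instruction targets a block that is tied off."""


class Core:
    def __init__(self, has_mmu: bool):
        self.has_mmu = has_mmu
        self.tlb = {}        # state a hardware MMU would own
        self.soft_tlb = {}   # state kept by the software emulation

    def execute(self, op: str, vpage: int, ppage: int) -> None:
        if op != "tlb_write":
            raise ValueError(f"unsupported operation {op!r}")
        if not self.has_mmu:
            raise CoprocessorUnusable(op)   # tied-off block: trap
        self.tlb[vpage] = ppage


def trap_handler(core: Core, op: str, vpage: int, ppage: int) -> None:
    """Emulate the instruction that the missing block could not execute."""
    if op == "tlb_write":
        core.soft_tlb[vpage] = ppage


def step(core: Core, op: str, vpage: int, ppage: int) -> str:
    """Run one instruction, falling back to emulation on a trap."""
    try:
        core.execute(op, vpage, ppage)
        return "hardware"
    except CoprocessorUnusable:
        trap_handler(core, op, vpage, ppage)
        return "emulated"


print(step(Core(has_mmu=True), "tlb_write", 1, 2))
print(step(Core(has_mmu=False), "tlb_write", 1, 2))
```

The same instruction stream runs on both cores. Only the path taken differs, which is the property that lets user code migrate across silicon products.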

You can use the BBCC to generate the 32-bit BBus, a secondary-bus interface, for indirectly attaching on- or off-chip peripherals to the CW4001 core. Such peripherals include DRAM, DMA, and PCI controllers; timers; and serial/parallel ports. The BBCC handles block fetching, burst writes, and bus arbitration. It also supports a hardware-test mode, so that data can be read or written from caches without going through the CPU. This approach facilitates cache testing. The BBCC also has a write buffer that lets you determine the best depth—up to eight entries—for your application.

To assist you in developing a CW4001-based system on a chip, LSI offers the MR4001, a "loaded"-software model as well as actual silicon. Besides the core, this chip contains an MMU, a cache, a multiply/divide unit, the BBCC, two general-purpose timers, and a PCI-like interface. You can start with this software model and remove or add the appropriate functions. Again, if you do not use blocks such as the MMU, you must tie off the corresponding signals.

LSI offers the ScanIce hardware-debugging mechanism, which allows you to use a JTAG (Joint Test Action Group) front end. ScanIce allows you to shift a pattern of an entire state into the device's circuitry; your design can then operate from that known state. You can switch the scan chain into a shift mode at any point and clock all the information out of the chip. ScanIce also comprises breakpoint hardware that can load on-chip registers with breakpoints; you can also use CPU instructions to load these registers.
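The shift-in and shift-out behavior can be modeled as a simple shift register. The Python sketch below is a generic scan-chain illustration, not a description of ScanIce internals. The class name, the four-bit chain length, and the bit ordering are assumptions made for the example.

```python
class ScanChain:
    """Generic serial scan chain: bits[0] sits nearest scan-in,
    bits[-1] sits nearest scan-out."""

    def __init__(self, length: int):
        self.bits = [0] * length

    def shift(self, bit_in: int) -> int:
        """One scan clock: bit_in enters at the head, the tail bit leaves."""
        bit_out = self.bits[-1]
        self.bits = [bit_in] + self.bits[:-1]
        return bit_out

    def exchange(self, new_state):
        """Shift a complete state in and return the state that came out."""
        if len(new_state) != len(self.bits):
            raise ValueError("state length must match chain length")
        # The bit destined for the tail must enter first.
        out = [self.shift(b) for b in reversed(new_state)]
        return out[::-1]   # restore head-to-tail order


chain = ScanChain(4)
chain.exchange([1, 0, 1, 1])          # force a known state
print(chain.exchange([0, 0, 0, 0]))   # clock the state back out
```

Each full exchange both loads a new state and captures the old one, so a debugger can read the chip's state and restore it in the same pass.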


Motorola FlexCore Program

Another vendor, Motorola, began offering core-based designs several years ago, with the 68000 as the foundation. The company also stripped down the 68020, providing no MMU or floating-point-unit (FPU) interface. The resulting cores, the CPU32 and CPU32+, have 16- and 32-bit data buses, respectively. Motorola first used these cores in the 683xx products, most of which Motorola's customers specified. This use was the unofficial beginning of the FlexCore program, which includes a library of megafunctions and a design methodology. Although Motorola performed the original designs, FlexCore's goals were that customers would do the design work and that Motorola would do the fabrication.

The 68000 core, called the SCM68000, resembles the original chip implementation. One difference, however, is that the SCM68000's address bus is 32 bits wide, so you can use it either as a 24-bit address bus, as on the 68000 µP, or as a 32-bit address bus. The SCM68000 also includes functions such as processor-status, pipeline-refill, and interrupt-pending signals, all of which the µPs lack. These signals support emulation and facilitate interfacing between the core and on-chip logic.

Orion Instruments offers the $9000 to $15,000 Orion 8800 emulator/analyzer, which supports SCM68000 designs, although it does not target a specific processor. You can use processor-specific probing and software to tailor the instrument to any processor. The Orion 8800 uses the target processor as the execution engine. The emulator clips onto the processor and accesses only its address, data, status, and control signals; on-chip peripherals do not affect the emulator. Logic in the probe allows the 8800 to monitor and control the processor by overdriving, or "backdriving," the appropriate signals and momentarily forcing gate inputs to desired states. One drawback of clip-on emulation emerges when the processor executes from its internal cache; during this process, you cannot trace the execution flow.

Motorola designed the new ColdFire architecture, also an offspring of the 68000 family, as a core. ColdFire's modular architecture supports a reduced instruction set, an enhanced pipeline, and a hierarchy of internal buses. It will probably become Motorola's FlexCore leader. Besides the execution unit, ColdFire supports as many as three internal buses. The core bus (Kbus) attaches directly to the core and is hierarchically similar to the MiniRISC CBus. The Kbus allows the core to perform a 32-bit fetch from internal memory in one clock by pipelining the address and data. A controller interface on the Kbus indirectly attaches the core to user-selectable cache, ROM, and RAM modules. The master bus (Mbus) offers centralized arbitration. A special module connects the Mbus to the Kbus. The slave bus (Sbus), for standard peripherals and for an interface to outside the chip, attaches to the Mbus through a system-bus controller. These buses offer silicon efficiency in that if you don't need a bus, you can eliminate it and its associated overhead.

ColdFire has an integrated debugging-module interface for full-featured emulation. The interface supports three modes: real-time trace, real-time debugging, and non-real-time debugging. Real-time trace uses four pins that reflect the processor's status, indicating events such as instruction completion. Four additional pins monitor change-of-flow target addresses. Real-time debugging supports three types of hardware breakpoints: PC relative, operand address, and operand data. Non-real-time debugging resembles background debugging mode on 683xx products. In this mode, a three-pin serial interface can read register contents, generate an infinite-priority interrupt, and force the CPU to halt.

Emulator developers committed to ColdFire include Applied Microsystems, Embedded Support Tools, Huntsville Microsystems, and Orion Instruments.


IBM's system building blocks

Another company with a core-development program is IBM Microelectronics. The company's ASIC program focuses on developing products using the PowerPC cores, including the 602 and 603, and, eventually, the 403 and the new 401 PowerPC core.

IBM's core approach follows that of LSI and Motorola in that IBM's architecture uses a two-bus hierarchy. The two buses include the on-chip peripheral bus (OPB) and the bus-device-controller (BDC) interface (Fig 2).

IBM provides the specifications of these buses if you sign a nondisclosure agreement. These specifications let you design the interface for your proprietary devices that attach to either bus. IBM hopes that these bus specs will become standard enough within the company that it can use the interfaces with any of the company's cores. This approach lets you reuse logic in different CPUs.


The BDC interface is basically a pin-unlimited processor bus. You can design in options, such as separate instruction and data-cache paths and the ability to make instruction and data accesses in parallel. Devices that attach to the BDC interface include SRAM, DRAM, DMA, and OPB controllers. The predominant use of the OPB is for peripheral devices, such as serial ports and PCI bus interfaces. Although typically faster than the processor's external bus, the OPB is roughly two times slower than the BDC interface. The reason for this delay is that OPB accesses go through an extra layer that delays the bus by at least one clock. This delay is common to any core that supports a bus hierarchy. You can attach any number of devices to the OPB, because the synthesis tool selects the appropriate driver for the load.
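A back-of-the-envelope model shows how one extra clock produces the factor of two. The Python sketch below assumes a one-clock BDC access and a one-clock penalty for the extra layer. The article states only "at least one clock," so both defaults and the function name are illustrative.

```python
def access_clocks(bus: str, bdc_clocks: int = 1, bridge_penalty: int = 1) -> int:
    """Clocks per access in a two-level bus hierarchy.

    bdc_clocks and bridge_penalty are assumed values, not IBM figures.
    """
    if bus == "BDC":
        return bdc_clocks
    if bus == "OPB":
        # OPB traffic crosses the controller on the BDC interface first.
        return bdc_clocks + bridge_penalty
    raise ValueError(f"unknown bus {bus!r}")


ratio = access_clocks("OPB") / access_clocks("BDC")
print(f"OPB access takes {ratio:.0f}x as many clocks as a BDC access")
```

If the BDC access itself took more than one clock, the same one-clock penalty would cost proportionally less, which fits the "roughly" in the text.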

The RISC Watch debugging tool for IBM's core-based products uses the chip's JTAG-like interface to perform circuit debugging, but it does more than boundary scan. RISC Watch provides all the usual debugging functions: stop, breakpoint, and non-real-time trace. An external controller attaches to the five- to eight-pin interface and commands the target processor to scan in and out whatever registers apply.


ARM throws small core

Advanced RISC Machines (ARM) designs and licenses µP cores for its partners, including Digital Equipment Corp (Maynard, MA), GEC Plessey (San Jose, CA), Sharp Corp (Mahwah, NJ), and VLSI Technology (Phoenix, AZ). The ARM processors implement a load/store architecture and a three-stage fetch, decode, and execute pipeline. ARM designed the processors from the ground up as cores that occupy only 9000 to 15,000 gates. These devices' coprocessors, cache, and core share address and data buses. The ARM architecture accommodates as many as 16 coprocessors on this internal bus, so you must consider loading effects. Designers typically use only one coprocessor, however. This unit may include an FPU, a graphics accelerator, or an MPEG (Moving Pictures Experts Group) bit-manipulation unit.

Like other architectures' coprocessors, ARM coprocessors extend the instruction set. ARM reduces the number of signals by requiring no unique signals for each coprocessor; instead, each coprocessor must examine the instruction the core sends and determine ownership based on the instruction's coprocessor-identifier field. ARM offers functional blocks, such as the cache and MMU, as options. The ARM architecture treats the MMU as a coprocessor and identifies it as coprocessor 15. If you omit the MMU and the core executes an instruction to this unit, the core does not get a return handshake and generates a fault. This situation implies that you need not tie off signals or perform any other design modifications in the presence or absence of the MMU.
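The ownership check can be sketched in a few lines of Python. This model relies on the standard ARM coprocessor-instruction encoding, in which bits 11 to 8 carry the coprocessor number. That detail comes from the ARM architecture rather than from this article, and the class and function names are illustrative.

```python
def cp_num(instruction: int) -> int:
    """Coprocessor-identifier field: bits 11..8 of the instruction word."""
    return (instruction >> 8) & 0xF


class Coprocessor:
    def __init__(self, number: int):
        self.number = number

    def claims(self, instruction: int) -> bool:
        """Each unit inspects the broadcast instruction for its own ID."""
        return cp_num(instruction) == self.number


def dispatch(instruction: int, coprocessors) -> str:
    """Return who handles the instruction, or 'fault' if nobody answers."""
    for cp in coprocessors:
        if cp.claims(instruction):
            return f"handled by coprocessor {cp.number}"
    return "fault"   # no return handshake: the core takes a fault


MRC_P15 = 0xEE110F10   # a register read from coprocessor 15 (the MMU)
print(dispatch(MRC_P15, [Coprocessor(15)]))
print(dispatch(MRC_P15, []))
```

A four-bit identifier also explains the limit of 16 coprocessors, and an empty list models a chip built without the MMU: the same instruction faults without any tied-off signals.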

A drawback to implementing the MMU as a coprocessor is that you impact performance by having to load the MMU's registers through the coprocessor instructions instead of using a more common memory-mapped approach.

As you might expect, on-chip memory accesses are faster than off-chip accesses. In one clock, the ARM core sends a memory request, then an address, and then a data strobe, producing a pipeline effect and generating single-cycle data accesses. Based on a 25-MHz clock rate, this pipeline effect yields a 40-nsec on-chip access time. Going off-chip, however, generally adds 10 nsec and may require the insertion of a wait state. Avoiding off-chip accesses also yields a power-savings advantage, because the processor doesn't have to drive the external data pins.
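The timing figures are easy to check. The Python sketch below converts the clock rate to a cycle time and counts the wait states an access would need. It assumes that an access must complete within a whole number of clocks and that the off-chip access is simply the on-chip time plus the quoted 10 nsec. The function names are illustrative.

```python
import math


def cycle_time_ns(clock_mhz: float) -> float:
    """Clock period in nanoseconds."""
    return 1000.0 / clock_mhz


def wait_states(access_ns: float, clock_mhz: float) -> int:
    """Extra whole clocks needed when an access exceeds one period."""
    period = cycle_time_ns(clock_mhz)
    return max(0, math.ceil(access_ns / period) - 1)


on_chip_ns = cycle_time_ns(25)      # one clock at 25 MHz
off_chip_ns = on_chip_ns + 10       # the quoted off-chip adder
print(on_chip_ns, wait_states(on_chip_ns, 25))
print(off_chip_ns, wait_states(off_chip_ns, 25))
```

At 25 MHz the period is 40 nsec, so the on-chip access fits in one clock, while a 50-nsec off-chip access spills into a second clock and costs one wait state under these assumptions.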

To help you develop your ARM processor, VLSI offers the ARM650 development chip, which contains a DMA controller and buffer, a 32-bit timer, a video interface, an interrupt controller, a coprocessor, and a serial I/O port. Use this chip as a starting point and strip out unwanted peripherals or add to the design.

Emulator support uses debug and ICEbreaker extensions, which give access to the core through a JTAG interface. Hooks in the on-chip circuitry allow setting and reading of registers and setting breakpoints. VLSI also offers a ROM in-circuit emulator (ICE) that hooks into the ROM socket and captures the instruction flow, except when the processor executes from cache.


SGS-Thomson does DSP

SGS-Thomson's contribution to 16-bit, fixed-point DSPs is the D950-Core. The key to the core's performance is parallelism, which lets the D950-Core perform multicycle functions simultaneously with other processors. The company supports this core with an ASIC-development program and a variety of peripherals. Although the core is inflexible, you can use its coprocessor interface to add processing capability, such as a complex multiplication cell for a modem or a Viterbi decoder. When the D950-Core detects a dedicated coprocessor instruction, the core asserts the valid-coprocessor-instruction (VCI) signal to the coprocessor. The coprocessor then goes to the instruction address and executes the instruction.

The core contains an emulation-and-test unit for JTAG compatibility. SGS-Thomson provides a JTAG-emulation board with a graphical, high-level source debugger that connects to your PC through a serial link with a JTAG test-access-port (TAP) controller. To further support development efforts, SGS-Thomson provides the ST18951 test/emulator chip. The device contains interrupt and DMA controllers, a TAP, a bus-switch unit, and a coprocessor. You can use this chip to debug your code in parallel with chip development.


DSP cores from the DSP Group

DSP Group developed its own 16-bit fixed-point DSP core architectures, the Oak and Pine DSPCores, and licenses them to companies, including GEC Plessey, LSI Logic, and VLSI. Once again, using a core adds benefits over using standard products. Oak and Pine cores support eight off-core, on-chip, user-defined registers that appear in the data-register fields of all relevant instructions. In terms of the programming model, these registers are part of the register set. You can build on-chip computation units, such as an FFT block or an FIR filter, around these registers. Loading these registers via an appropriate instruction directly loads the computation unit. At the end of the computation, you can load internal core registers with the resultant data in one cycle. You can also use these registers as interfaces to dedicated hardware, such as timers, serial/parallel ports, or a host port.
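As a rough picture of how a computation unit can sit behind a user-defined register, the Python sketch below places a small FIR filter behind one register. Writing the register pushes a sample into the filter, and reading it returns the current output. The classes, method names, and tap values are all hypothetical, and the model makes no claim about actual Oak or Pine instruction timing.

```python
class FirBlock:
    """Computation unit reached through one user-defined register."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.history = [0] * len(self.taps)

    def write(self, sample: int) -> None:
        """A register write loads the unit with the newest sample."""
        self.history = [sample] + self.history[:-1]

    def read(self) -> int:
        """A register read returns the filter's current output."""
        return sum(c * x for c, x in zip(self.taps, self.history))


class Core:
    def __init__(self, user_regs):
        self.user = user_regs    # off-core, on-chip user registers
        self.r = [0] * 4         # internal core registers

    def mov_to_user(self, n: int, value: int) -> None:
        self.user[n].write(value)

    def mov_from_user(self, n: int, dest: int) -> None:
        self.r[dest] = self.user[n].read()


core = Core({0: FirBlock([1, 2, 3])})
for sample in (1, 1, 1):
    core.mov_to_user(0, sample)
core.mov_from_user(0, 0)
print(core.r[0])
```

From the program's point of view, the filter is just another register in the move instructions, which is the programming-model benefit the paragraph describes.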

The DSP Group provides emulation/evaluation boards that include general-purpose Oak and Pine emulation processors and a wire-wrap area. These processors bring internal buses to pins on the chip, making all internal circuitry and timing easily accessible. A special on-chip emulation module hooks into the inner core. The core processors obtain information through these hooks and store it in registers that then transfer the information to the outside world. You can use an on-chip serial port on final-silicon versions to obtain this information.


Oki provides a "quick" core

Oki optimizes the pick-and-choose method of core-based processor development with its QuickCore program. It allows you to combine the 8-bit nX 65K core with a variety of precharacterized nX series peripherals and memories. QuickCore's advantage over standard off-the-shelf nX 65K devices is that it allows you to remove any unnecessary peripherals and to include the right mix for your application. The disadvantage of this approach is that it doesn't allow you to modify the peripherals. The core does provide a user-logic interface in the form of a memory or I/O bus, however. You can add as many as 50,000 gates of additional logic onto this interface.

MetaLink Corp has developed an emulator to support Oki's QuickCore program (Fig 3).

The emulator contains a "fully loaded" version of the nX 65K that includes all fixed peripherals. A cable from the emulator connects the chip's user interface to a wire-wrap board containing your custom logic, which is stored in FPGAs or discrete circuitry. The iceMaster 65K with an emulator base and a probe card costs $3000.



Hitachi's MicroCore ICs

Hitachi's SH-1 and H8/300H cores suit designs with the company's HG72C series cell-based ICs. The 16-bit H8/300H processor cranks out 1.8 MIPS of processing power. Higher up on the scale is Hitachi's SH-1, a 32-bit RISC controller that delivers up to 16 MIPS at 20 MHz. You can add as many as 250,000 user-logic gates to these cores. Hitachi's program is similar to Oki's in the sense that, according to the company, you can use off-the-shelf peripherals to design a custom chip in one day.

For about $20,000, you can purchase Hitachi's E7000 universal emulator, which works with all Hitachi processors. The E7000 comprises two units: the emulator station and the processor-specific emulator pod. The pod connects to the target system and provides a general-purpose evaluation chip. You can start with a standard implementation and ignore the unneeded on-chip peripherals. A third piece of the emulator is a board that has additional Hitachi-specific peripherals in standard-cell devices; you can connect them using jumper wires. A socket on the board accommodates custom logic.

Quickturn Design offers emulators that comprise a large array of Xilinx FPGAs. Software that comes with the emulators maps any design, including cores, in a netlist format into the individual FPGAs. The emulators partition the circuitry in a way to guarantee that timing violations—that is, delays between FPGAs—do not occur. Another important piece of the emulator is a logic analyzer that provides waveform display, system triggering, and functional testing—the ability to drive vectors using logic probes. Additional pieces of software include libraries to convert from the ASIC library you are targeting into the FPGAs.

According to the company, this emulator lets you functionally verify your design six orders of magnitude faster than software verification does. The Quickturn system integrates many FPGAs, however, making it difficult to control the "system-out-of-the-chip" clocking. As a result, you can run the "chip" only as fast as 1 MHz. Price may be another deterrent to using this system. A minimal system supporting designs with as many as 50,000 gates sells for $50,000. Systems for larger designs cost $250,000 per 250,000 gates.

Now that you know about many of the available cores and their interfaces, you have to learn how to put them together. In EDN's October 26 issue Technical Editor Jim Lipman covers design methodology: how to connect all the pieces.



You can reach Technical Editor Markus Levy at (916) 939-1642, fax (916) 939-1650.

Acknowledgment

I would like to thank Paul Cobb of LSI Logic, Craig Trautman of Motorola, and Dave Stauffer of IBM for their extensive technical support.


For more information...
When you contact any of the following manufacturers directly, please let them know you read about their products at the EDN Magazine WWW site.
ASIC/FPGA vendors offering cores
Altera Corp
San Jose, CA
(408) 894-7000
GEC Plessey Semiconductors
San Jose, CA
(408) 451-4700
Hitachi America Ltd
Brisbane, CA
(415) 589-8300
IBM Microelectronics Inc
Essex Junction, VT
(802) 769-6408
LSI Logic Corp
Milpitas, CA
(408) 433-8000
Motorola Inc
Austin, TX
(512) 891-2000
Oki Semiconductor Inc
Sunnyvale, CA
(408) 720-1900
SGS-Thomson Microelectronics
Phoenix, AZ
(602) 867-6200
Sharp Microelectronics Corp
Mahwah, NJ
(201) 529-8200
Symbios Logic
Fort Collins, CO
(970) 223-5100
Texas Instruments Inc Literature Response Center
Denver, CO
(800) 477-8924, ext 4500
Toshiba America Electronic Components Inc
Sunnyvale, CA
(408) 737-9844
VLSI Technology Inc
San Jose, CA
(408) 434-3000
Core-only vendors
Alta Group
Foster City, CA
(415) 574-5800
DSP Group
Santa Clara, CA
(408) 986-4315
Mentor Graphics
Wilsonville, OR
(503) 685-7000
Sun Microsystems Computer Corp
Mountain View, CA
(408) 779-8119
3Soft Corp
San Jose, CA
(408) 467-0410
Vautomation
Nashua, NH
(603) 882-2282
Western Design Center
Mesa, AZ
(602) 962-4545
Emulator/debugger vendors
Applied Microsystems
Redmond, WA
(206) 882-5326
Embedded Support Tools
Canton, MA
(617) 828-5588
Huntsville Microsystems
Huntsville, AL
(205) 881-6005
MetaLink Corp
Chandler, AZ
(602) 926-0797
Orion Instruments
Sunnyvale, CA
(408) 747-0440
Quickturn Design
Mountain View, CA
(415) 967-3300




Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.