EDN Access

 

March 14, 1997


Switching RISC architectures the easy way

John Canosa, Questra Consulting, SES Technology R&D Group

Performance demands and discontinued product families can force you to switch RISC processors for your next-generation product. Choosing the right replacement can ease the pain of the resulting hardware and software redesign.

The never-ending demand for more features and better performance is forcing designers to re-evaluate the microprocessors that their products incorporate. Outside forces can also affect processor selection for a next-generation product; AMD's (Sunnyvale, CA) decision to end development of the 29K family could leave many users without a migration path to the next performance level. To ease the inevitable shift to a new architecture, you need to examine and understand the RISC alternatives.

In most designs today, software is a large component of the overall system-design effort and cost; the main issue in changing processors is software-related. If, for instance, you are currently using a real-time operating system (RTOS) in your product and are considering changing to a new architecture, you must first make sure that your RTOS vendor supports the processor that you have in mind. RTOS vendors typically support many, but not all, architectures. An architecture switch may also involve changing tools and using a different compiler vendor or debugger. The impact of processor choice on the RTOS and tools could, therefore, narrow your alternatives. Processor vendors can supply you with a list of which RTOSs and tools support which processors.

Changing processor architectures can also lower software performance. Most designers understand how things such as cache size and clock speed affect processor performance; however, the underlying architecture has a more significant effect on performance. Understanding the RISC instruction pipeline, what causes stalls, how the processor fills delay slots, and whether the processor uses speculative branching is useful when you are trying to write high-performance code.

Such understanding is useful, even if the development team writes all its code in high-level languages. C and Ada are wonderful (although C zealots may not think that Ada is wonderful and vice versa) because they abstract away the complexities of the underlying architecture. The C and Ada languages are especially useful for RISC processors because compilers can automatically include optimizations that keep the instruction pipeline filled.

Unfortunately, a high-level language gives some programmers the incorrect idea that they don't need to fully understand the underlying processor architecture. Most designs still have some assembly-language components, although they may be isolated in a board-support package for an RTOS or interrupt-service routine (ISR). And let's face it, if C code does not meet your performance requirements, your next likely step is to write some small parts in assembly language. If it comes down to that, you would have to understand the architecture anyway, so you might as well understand it in the first place. Table 1 summarizes the architectures of several popular RISC-processor families.

Several aspects of a processor's architecture merit careful study, including the processor's instruction set and how that set interacts with the registers. Because embedded applications rely heavily on interrupts, you also need to thoroughly understand the processor's exception-handling mechanisms. In fact, interrupt and exception handling is one of the areas in which RISC processors differ most. Understanding how your RTOS interacts with the processor in such areas as parameter passing, context switching, and setting up stack frames is also very important.

Keep the pipeline filled

To understand the differences among RISC architectures, you must realize that the goal of RISC-software design is to keep the pipeline filled; that is, to avoid pipeline stalls. Any instruction that requires multiple clock cycles to execute can potentially cause a pipeline stall. As an example, memory loads and stores, branches, floating-point operations, and DSP operations, such as multiply/accumulate, can all take several clock cycles to complete.

Figure 1 shows what happens during a branch instruction in a well-behaved pipeline. In this case, the processor undergoes a two-cycle latency between the determination of whether to take a branch and the loading of the branch target instruction. Pipelines with different numbers of stages have different latencies. (The pipeline in this example has five stages.) These latency cycles are wasted time for the processor, something that you should avoid for optimal performance.

Processor architectures reduce latency in two ways. The MIPS, SH3, and 29K architectures use a delayed-branch scheme. With a delayed branch, the processor always executes the instruction that follows the branch decision (Figure 1, Instruction 2). Good optimizing compilers for these architectures try to fill this delay slot with an instruction that does not affect the outcome of the branch decision.

Other architectures reduce latency by guessing whether the branch will be taken and filling the pipeline accordingly. Both the I960 and the PowerPC 603 use static branch prediction, which means that the compiler encodes the guess into the branch instruction itself. The processor then automatically loads the predicted path into the pipeline. If the guess was correct, there is no pipeline stall or latency. However, if the guess was wrong, the pipeline incurs the branch instruction's full latency.
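To get a rough feel for the cost of each scheme, the following C sketch counts wasted cycles under a deliberately simplified model. The function names and the model itself are assumptions for illustration, not figures from any data book: a delayed branch wastes only the delay slots that the compiler could not fill, and a statically predicted branch wastes the full latency only when the guess is wrong.

```c
#include <assert.h>

/* Toy model (an assumption, not vendor data): every branch has
 * `latency` dead cycles unless something useful occupies them. */

/* Delayed branch: the compiler fills `filled` of the delay slots
 * with useful instructions; the rest are effectively no-ops. */
unsigned long delayed_branch_waste(unsigned long branches,
                                   unsigned latency, unsigned filled)
{
    assert(filled <= latency);
    return branches * (unsigned long)(latency - filled);
}

/* Static prediction: a correct guess costs nothing; a wrong guess
 * costs the full branch latency. */
unsigned long predicted_branch_waste(unsigned long branches,
                                     unsigned long wrong_guesses,
                                     unsigned latency)
{
    assert(wrong_guesses <= branches);
    return wrong_guesses * (unsigned long)latency;
}
```

In this model, 1000 branches with a two-cycle latency waste 1000 cycles if the compiler fills one of the two slots, and 400 cycles if a static predictor guesses wrong on 200 of them. Real pipelines have more effects than this model captures, so treat the numbers only as a way to compare the two ideas.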

The ARM 710 architecture is unique because you can make all of its instructions conditional. There is no branch prediction or delayed branching. Instead, the ARM 710 uses a three-stage pipeline to keep the branch latency low.
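As a hedged illustration, short conditional operations such as the ones below are the kind of code that a compiler for an architecture with conditional instructions can, in principle, translate without any branch. Whether a given compiler actually does so depends on the tool and its optimization settings; the function names are hypothetical.

```c
#include <assert.h>

/* Absolute difference: the two alternatives are candidates for
 * conditionally executed instructions rather than a branch. */
unsigned abs_diff(unsigned a, unsigned b)
{
    return (a > b) ? (a - b) : (b - a);
}

/* Saturating increment: the result stays at `limit` instead of
 * branching around the addition. */
unsigned sat_inc(unsigned value, unsigned limit)
{
    assert(limit > 0);
    return (value < limit) ? value + 1 : limit;
}
```

Examining the assembly listing for functions like these is a quick way to see whether your compiler takes advantage of conditional execution.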

Check the effect of interrupts

You must consider interrupts when you change RISC architectures, because embedded systems live and breathe by interrupts. The real world seldom provides external events when the processor is looking for them. The irony is that you use interrupts to provide rapid response to external events, but the interrupts generate the largest disruptions to a processor's pipeline, which cause the largest latency in handling those interrupts.

Interrupts and exceptions disrupt the pipeline in the same way a branch instruction does, but the similarities end there. You cannot predict interrupts, and they place operational constraints on which instructions can occupy a delay slot. An interrupt can cause the execution of a delay-slot instruction after a branch. If the delay-slot instruction happens to be one that an interrupt from an interlock signal was supposed to prevent, the situation can wreak havoc on your design.

There are as many ways to handle interrupts as there are processors, but you should be particularly aware of certain behaviors. All processors save the machine state and some version of the program counter to allow normal execution to continue after the processor has handled the interrupt. Some processors, such as the PowerPC 603 and ARM, complete the instruction in the execution units when an asynchronous interrupt occurs, before servicing the interrupt. The MIPS, 29K, SH3, and I960 all abort the current instruction execution.

Another important interrupt behavior is what happens to any in-process load or store operations. Typically, the processor completes the load or store before handling the interrupt, but what happens if the operation is a multiple-load or -store operation? Is it canceled, suspended, or run to completion? The data books do not always address these questions, so you should contact the vendor for the answers.

One of the main software issues in dealing with interrupts is preserving the processor's previous context. Pushing registers onto the stack can be time-consuming and have a major effect on the interrupt handler's latency. Therefore, RISC processors typically save few registers automatically. To avoid such time-consuming saving of registers to memory, for example, the ARM and SH3 processors swap parts of the general-purpose register banks with some secondary registers during interrupts.

Because RISC processors save so few registers, the software developer must be careful when dealing with interrupts, especially nested interrupts. A common mistake when using nested interrupts is to forget to push the previously saved machine-state registers and program counter onto the stack before re-enabling interrupts during an ISR. If a nested interrupt occurs, the processor may overwrite data not pushed onto the stack.
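The nested-interrupt pitfall can be sketched with a small software model. Everything here is hypothetical: the `cpu` structure, its single pair of saved registers, and the function names stand in for whatever your processor actually provides. The point is only the ordering: the ISR pushes the saved state before it allows another interrupt to overwrite it.

```c
#include <assert.h>

/* Hypothetical model of a RISC core that keeps a single copy of the
 * interrupted program counter and status in "saved" registers. */
struct cpu {
    unsigned pc, status;              /* live state */
    unsigned saved_pc, saved_status;  /* overwritten on every interrupt */
    unsigned stack[16];
    int sp;                           /* index of next free stack entry */
};

/* Hardware action on interrupt entry: overwrite the saved registers. */
void take_interrupt(struct cpu *c, unsigned vector)
{
    c->saved_pc = c->pc;
    c->saved_status = c->status;
    c->pc = vector;
}

/* Software prologue: push the saved state BEFORE re-enabling interrupts. */
void isr_prologue(struct cpu *c)
{
    assert(c->sp + 2 <= 16);
    c->stack[c->sp++] = c->saved_pc;
    c->stack[c->sp++] = c->saved_status;
    /* ...interrupts may safely be re-enabled from here on... */
}

/* Software epilogue: pop the saved state, then return from interrupt. */
void isr_epilogue(struct cpu *c)
{
    assert(c->sp >= 2);
    c->saved_status = c->stack[--c->sp];
    c->saved_pc = c->stack[--c->sp];
    c->pc = c->saved_pc;
    c->status = c->saved_status;
}
```

If the outer ISR skipped `isr_prologue`, a nested call to `take_interrupt` would replace `saved_pc` with an address inside the outer ISR, and the original return address would be lost.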

Check register usage for conflicts

You cannot write efficient, high-performance C code and move it to a new processor architecture without understanding the processor's register set and any parameter-passing and function-return conventions your compiler uses. For example, how would you map the 29K family's 192 global and local registers to the PowerPC's 32 general-purpose registers? The 29K uses the Berkeley RISC architecture, which has large register files and register windows, whereas other architectures use a smaller, single register file.

When exploring RISC-processor registers, you find that many "general-purpose" registers actually have specific uses, either by software convention or by hardware design. For instance, in the MIPS architecture, general register R0 is hardwired to zero, and general register R31 is a link register that contains the return address for jump and link instructions. To avoid problems, the designer should consider some registers in each processor off-limits (Table 2).

When writing assembly-language functions, you also need to understand which registers you can change and which registers the calling function expects to remain intact. If your compiler vendor does not document any of its parameter-passing conventions, write some small functions that pass and return different types of parameters. By examining the resulting assembly listings (yes, in these days of source-level debuggers, you can still get assembly listings), you can determine which registers the compiler uses for parameter passing and return values. There are also some standards, such as the Embedded Applications Binary Interface (EABI) for the PowerPC and the Host Interface (HIF) specification for the 29K family, that specify registers for parameters and return values. Table 3 lists EABI-register conventions.
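A set of probe functions for this experiment might look like the following sketch. The names are hypothetical, and the choice of types is only a starting point; add whatever types your application actually passes. With GCC-style tools, the `-S` option stops after generating the assembly listing; other vendors' compilers have their own listing options.

```c
#include <assert.h>

/* Hypothetical probe functions: compile to an assembly listing and
 * note which registers carry each argument and return value. */

struct pair { int lo; int hi; };

/* Distinct weights make each argument easy to spot in the listing. */
int probe_ints(int a, int b, int c, int d)
{
    return a + 2 * b + 3 * c + 4 * d;
}

double probe_double(double x, double y)
{
    return x - y;
}

long probe_mixed(char c, short s, long l)
{
    return (long)c + (long)s + l;
}

/* Structures are often passed and returned differently from scalars. */
struct pair probe_struct(struct pair p)
{
    struct pair r;
    r.lo = p.hi;
    r.hi = p.lo;
    return r;
}

/* Call sites: the listing for this function shows how the caller
 * loads each argument and where it reads each result. */
void probe_calls(void)
{
    struct pair p = { 1, 2 };
    assert(probe_ints(1, 2, 3, 4) == 30);
    assert(probe_double(5.0, 2.0) == 3.0);
    assert(probe_mixed('A', 2, 3L) == 'A' + 5);
    assert(probe_struct(p).lo == 2);
}
```

Compile the probes with optimization low enough that the compiler does not inline them away; otherwise the call sequence you want to study may disappear from the listing.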

In addition to the software issues that arise when you change RISC architectures, significant hardware issues can emerge. Many high-performance embedded applications rely on ASICs to reduce system size and cost. Switching to a MIPS processor with a different bus interface can have severe ramifications if your ASIC was designed for the 29K bus. (The thought of losing hundreds of thousands of dollars in NRE charges for that new image coprocessor is enough to give your accounting department nightmares for a year.)

Bus interfaces are not specific to an architecture family, however. Some members of a family have the standard bus interface, and others may have a built-in memory controller. Yet, even devices with memory controllers typically have a mechanism for an external device to signal when it is ready to latch data in or when its output data is valid. (Using this mechanism typically results in better performance than does inserting a worst-case number of wait states into a memory-controller register.) In some cases, therefore, adding a few dollars' worth of PLD to create a bus translator could save your ASIC investment and prevent the need for redesign, at least until you find the next bug.
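The performance argument for a ready signal over fixed wait states comes down to simple arithmetic, sketched below. The model is an assumption made for illustration: each access costs one base cycle plus its wait states, and the function names are hypothetical.

```c
#include <assert.h>

/* Cycles for a series of accesses when the memory controller always
 * inserts the worst-case number of wait states. */
unsigned long fixed_wait_cycles(const unsigned *needed, unsigned count,
                                unsigned worst_case)
{
    unsigned i;
    for (i = 0; i < count; i++)
        assert(needed[i] <= worst_case);  /* worst case must cover all */
    return (unsigned long)count * (1UL + worst_case);
}

/* Cycles for the same accesses when each device signals ready as
 * soon as it can: one base cycle plus the waits actually needed. */
unsigned long ready_signal_cycles(const unsigned *needed, unsigned count)
{
    unsigned long total = 0;
    unsigned i;
    for (i = 0; i < count; i++)
        total += 1UL + needed[i];
    return total;
}
```

For four accesses that need 0, 1, 3, and 0 wait states, the fixed scheme spends 16 cycles in this model and the ready-signal scheme spends 8.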

Translation hardware preserves designs

The first step in any bus-translator design is to gather all the available data on the two processors' signal lines. Knowing how processors handle various signals can help you select a new processor that minimizes design effort. Typically, you also need to place some constraints on the design to make the task more manageable.

Consider, for example, the data-bus and instruction widths of the processors in Table 1. The SH3 has a 16-bit instruction width with a 32-bit-wide data bus, which allows the SH3 to load two instructions at a time. Similarly, the 603 has a 32-bit-wide instruction with a 64-bit-wide data bus that you can set to use only 32 bits. Planning your translator design to use only the mode compatible with the original design helps keep the task reasonable (see box, "Switching from the 29K to the PowerPC").

In addition to such physical differences, you should also carefully consider the differing behavior of signals. How a processor handles arbitration, simple and burst read/write transactions, resets, and interrupts can affect the translator design. Processors treat arbitration, for example, in two ways. The I960, the MIPS 3xxx, and the SH3 series can act as bus arbiters; that is, they assume that they have control of the bus and give up the bus only when an external device requests it. The SH3 can also act in slave mode, which implies that it must request the bus from an external arbiter. The PowerPC 603 and the AMD 29K processors work in a way similar to slave mode.

For instance, when the 29K wants the bus, it asserts the bus-request (~BREQ) signal and waits for the bus-grant (~BGRT) signal to go low. Once ~BGRT goes low, the processor completes the bus transaction while holding ~BREQ low. The PowerPC 603 follows a similar method but uses other signals to determine that the bus is free. Also, the 603 does not hold the bus request low during the entire transaction, thus allowing another arbitration phase to begin while the current transaction is in progress. When the I960 wants to start a bus transaction, it checks its HOLD input pin. If HOLD is not asserted, the processor begins its bus cycle immediately. If HOLD is asserted, an external device has requested the processor to give up the bus. The hold-acknowledge (HOLDA) output signal acknowledges that the processor has released control of the bus.
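The request/grant handshake described for the 29K can be modeled as a three-state machine. This is a sketch under stated assumptions: the state names, the structure, and the step function are hypothetical, signals are modeled as booleans meaning "asserted" so that active-low polarity is hidden, and real timing details are ignored.

```c
#include <assert.h>

/* Hypothetical model of a 29K-style request/grant handshake. */
enum bus_state { BUS_IDLE, BUS_REQUESTING, BUS_OWNER };

struct bus_master {
    enum bus_state state;
    int breq;               /* 1 = bus request asserted */
};

/* Advance one clock. `want` = master has a transfer pending,
 * `grant` = arbiter asserts bus grant, `done` = transfer finished. */
void bus_master_step(struct bus_master *m, int want, int grant, int done)
{
    assert(m != 0);
    switch (m->state) {
    case BUS_IDLE:
        if (want) {
            m->breq = 1;
            m->state = BUS_REQUESTING;
        }
        break;
    case BUS_REQUESTING:
        if (grant)
            m->state = BUS_OWNER;   /* request stays asserted */
        break;
    case BUS_OWNER:
        if (done) {
            m->breq = 0;            /* release only after the transfer */
            m->state = BUS_IDLE;
        }
        break;
    }
}
```

A model of the 603 would differ mainly in the last state, because that processor does not hold its request for the whole transaction.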

The bus transaction follows arbitration. Simple transactions perform a single read or write to an address location. This read or write may be a single-byte, half-word, or full-word access. A generic simple bus transaction has an address phase and a data phase. The address phase consists of the processor's putting the address onto the address bus and then asserting an address-valid signal. The data phase consists of either the processor's or the peripheral's driving the data bus and then asserting a data-ready signal or deasserting a wait signal. Table 4 lists the common bus-related signals for various processor families.

All of the buses in this article are synchronous. If you use FPGAs or complex PLDs (CPLDs) in the bus-interface design, take extra care when using the R3xxx processors. The R3xxx family's bus interface uses both edges of the system clock, an action that many FPGAs and CPLDs do not support. IDT (Santa Clara, CA) does offer a work-around to the dual-edge design, but the work-around requires higher-speed devices than does a comparable single-edge design.

Burst transactions complicate bus translation

Burst transactions can be difficult when you design a bus translator. The data-acknowledge signal for a burst can differ from that of a single-beat cycle, the maximum number of beats in a burst may vary, the address may or may not be automatically incremented, and so on. Table 4 compares some burst-related signals for each processor.

In many cases, you can implement the bus-conversion logic with some simple combinatorial logic and a few signal latches. In those cases, the biggest issue is timing. Using a good timing-analysis tool can save you many hours of engineering.

Some designs may need to be more complicated, however. The example in the box uses several state machines. The key for the main state machine is to use the transfer-attribute signals to identify which type of access the 603 is performing. You can handle the single-beat access using some combinatorial logic, but the burst and two-beat accesses must use a second state machine that increments the addresses accordingly.

Don't forget resets and interrupts

Resets and interrupts are important signals that might need alteration, yet you might overlook them in a bus-translator design. Some processors, such as those in the 29K family, require the reset line to stay low for only a few clock cycles. Others, such as the 603, require reset to stay low for several hundred clock cycles. You should also pay attention to the state of all pins during a reset.

You must also investigate interrupts and interrupt mapping when you design a translator for a new processor. Families and family members can have differing numbers of interrupt inputs. Also, some processors have both synchronous and asynchronous interrupts. Having both types means that, in some cases, you need to ensure that you meet the setup times of the interrupt inputs.

These hardware and software issues only skim the surface of what selecting a new processor involves. The myriad family members that exist for each architecture make the selection process more difficult. The combinations and permutations are seemingly endless.

Yet, changing to a new processor does not have to be an engineering nightmare. Although there are no blanket solutions, you can start your investigations with some of the issues raised here. Armed with a proper implementation plan and a good understanding of the relevant issues, you can change RISC architectures and still retain most of what you already have designed.


Switching from the 29K to the PowerPC
Outside forces, such as AMD's decision to end development of the 29K family, may require you to change processor architectures for your product line. To preserve most of your existing design, you need to explore the differences among the types of bus cycles that architectures use. As an example, consider the steps involved in connecting a device designed for the 29K bus to the PowerPC 603 bus.

The first step is to gather all the available data on the bus signals. For instance, the data-bus and instruction widths of the two processors vary; the 29030 has a 32-bit bus, and the 603 has a 32-bit-wide instruction with a 64-bit-wide data bus that you can also set to use only 32 bits. To simplify the design, use the PowerPC 603 in the 32-bit mode. Table A compares the signals on an Am29030 processor to the PowerPC 603.

The next step is to examine signal behavior. Figure A shows simple bus accesses for the Am29030 and PowerPC 603. The Am29030 uses the request (~REQ) signal to indicate that the address is valid and that the addressed device can latch the address in on the next rising edge of the clock. The 603 has a similar signal, transfer start (~TS). When the attached peripheral device asserts the Am29030's ready (~RDY) signal, the data is valid and can be latched in on the next rising clock edge. The 603 has a transfer-acknowledge (~TA) signal that serves the same purpose. The Am29030 address is valid throughout the entire access, but the PowerPC 603 address and transfer descriptors are valid only until the address-acknowledge (~AACK) signal becomes asserted.

A designer's initial reaction would be to latch the address signals on the rising edge of the system clock (SYSCLK) while ~TS is low. In reality, this idea would not work because a new valid address cycle could begin before the data-bus tenure for that access has expired. The best option is to hold off asserting the ~AACK signal until the memory or peripheral-access time has elapsed. In fact, the simplest solution would be to hold off asserting the ~AACK signal until the last of the data for this access has been transferred onto the bus. This approach would hold the address on the bus until the end of the data tenure. The drawback of this scheme is that the PowerPC 603 would not pipeline addresses. However, this scheme's decrease in performance is probably negligible compared with the total performance increase you gain by using the PowerPC in the first place.

Burst-access cycles complicate the design effort; the biggest differences between the Am29030 and PowerPC 603 system interfaces occur during burst accesses. Figure B shows the burst cycles of the Am29030 and PowerPC 603. To handle burst differences, you must use the simple-transaction technique of holding the address and transfer attributes on the bus for the entire cycle. But the Am29030 increments addresses during the burst, and the PowerPC 603 does not. Therefore, the external bus-conversion logic must count the number of cycles (there should be 8 words total) and increment the address by four after each word.
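The address counter that the conversion logic needs can be sketched in C before you commit it to a PLD. The structure and function names are hypothetical. The model counts eight words and adds four to the address after each one, as described above; it does not model any wraparound within the burst, so check the data book for the order in which your processor expects the words.

```c
#include <assert.h>

#define BURST_BEATS 8u   /* words per burst, as stated in the text */
#define WORD_BYTES  4u

/* Hypothetical model of the address counter in the bus translator. */
struct burst_counter {
    unsigned long addr;   /* address driven to the 29K-style device */
    unsigned beats_left;
};

/* Latch the starting address when the burst begins. */
void burst_start(struct burst_counter *b, unsigned long start_addr)
{
    assert((start_addr % WORD_BYTES) == 0);
    b->addr = start_addr;
    b->beats_left = BURST_BEATS;
}

/* Call once per transferred word. Returns 1 while more words remain
 * and 0 after the last word of the burst. */
int burst_advance(struct burst_counter *b)
{
    assert(b->beats_left > 0);
    b->beats_left--;
    if (b->beats_left == 0)
        return 0;                 /* last word transferred */
    b->addr += WORD_BYTES;
    return 1;
}
```

Starting at 0x1000, the counter drives 0x1000, 0x1004, and so on up to 0x101C, and then reports that the burst is complete.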

To handle this added complexity, the sample design uses several state machines. Figure C shows the overall state machine. Using the 603's transfer-attribute signals, the main state machine determines the access type the 603 is performing. The design uses combinatorial logic for single-beat accesses. You need a second state machine that increments addresses for burst and two-beat accesses.
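The decision that the main state machine makes can be written down as a small classification function. This is a minimal sketch, not the logic of the sample design: it assumes that the burst indication and the transfer size have already been latched and decoded, and it assumes that, with the 603 in 32-bit mode, any transfer wider than four bytes needs two beats. The names are hypothetical; take the real signal encodings from the data book.

```c
#include <assert.h>

enum access_type { ACCESS_SINGLE_BEAT, ACCESS_TWO_BEAT, ACCESS_BURST };

/* Classify an access from already-decoded transfer attributes.
 * `burst` is nonzero when the latched burst indication was asserted;
 * `size_bytes` is the decoded transfer size (assumed 1 to 8). */
enum access_type classify_access(int burst, unsigned size_bytes)
{
    assert(size_bytes >= 1 && size_bytes <= 8);
    if (burst)
        return ACCESS_BURST;
    if (size_bytes > 4)           /* wider than the 32-bit bus */
        return ACCESS_TWO_BEAT;
    return ACCESS_SINGLE_BEAT;
}

/* Number of 32-bit data beats the translator must sequence. */
unsigned beats_for(enum access_type t)
{
    switch (t) {
    case ACCESS_BURST:    return 8;
    case ACCESS_TWO_BEAT: return 2;
    default:              return 1;
    }
}
```

Only the single-beat case can bypass the address-incrementing state machine; the other two cases hand their beat count to it.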

 

Table 1--A brief comparison of RISC-family architectures
| Architecture | Pipeline stages | Scalar/superscalar | Branch prediction | Delayed branch | Number of registers | Number of usable, general-purpose registers | Instruction width (bits) |
| PowerPC 603 | Five | Superscalar | Static | No | 126 | 32 | 32 |
| MIPS 3xxx | Five | Scalar | | Yes | 35 | 30 | 32 |
| SH3 | Five | Scalar | | Yes | 33 | 14 | 16 |
| I960 | Three | Superscalar | Static | No | 35 | 28 | 32 |
| ARM 710 | Three | Scalar | | No | 37 | 14 | 32 |
| 29K | Four | Scalar | | Yes | 192 | 25+ | 32 |

 

Table 2--Registers for users to avoid in RISC processors
| MIPS | PowerPC | SH3 | I960 | 29K | ARM |
| R0: Hardwired 0 | GPR0: 0 by convention | R0: Index register | g15: Frame pointer | gr0: Indirect pointer | R14: Link register |
| R31: Link register | | R15: Stack pointer | r0: Previous frame | gr1: Stack pointer | R15: Program counter |
| | | | r1: Stack pointer | | |
| | | | r2: Link register | | |

 

Table 3--Embedded Applications Binary Interface (EABI) register conventions for the PowerPC
| Registers | Storage characteristic | Usage |
| GPR0 | Nonvolatile | Usually set to 0 (not part of EABI specification) |
| GPR1 | Dedicated nonvolatile | Stack pointer |
| GPR2 | Dedicated nonvolatile | Read-only, small-data-area anchor |
| GPR3 to GPR4 | Volatile | Parameter passing/return values |
| GPR5 to GPR10 | Volatile | Parameter passing |
| GPR11 to GPR12 | Volatile | |
| GPR13 | Dedicated | Small-data-area anchor |
| GPR14 to GPR31 | Nonvolatile | |
| FPR0 | Volatile | |
| FPR2 to FPR8 | Volatile | Parameter passing/return values |
| FPR9 to FPR13 | Volatile | Parameter passing |
| FPR14 to FPR31 | Nonvolatile | |

 

Table A--A detailed bus comparison for the Am29030 and the PowerPC 603
| Am29030 signal | Type | PowerPC 603 signal | Type | Notes |
| A(31:0) | Three-state output | A(0:31) | Three-state I/O | AMD 29030 is little-endian, MPC603 is big-endian |
| ~BREQ | Output | ~BR | Output | Bus request |
| ~BGRT | Input | ~BG | Input | Bus grant |
| R/~W | Three-state output | TT1 | Three-state I/O | TT(0:4) gives specific transfer attributes; TT1 maps to a read/write signal, but it is valid only during the address phase and needs to be latched |
| SUP/~US | Three-state output | | | Supervisor/user indicator; no equivalent on 603 |
| ~LOCK | Three-state output | | | Address-locking mechanism; no equivalent on 603 |
| MPGM(1:0) | Three-state output | | | User-programmable translation-lookaside-buffer entries; no equivalent on 603 |
| ID(31:0) | Three-state I/O | DH(31:0) | Three-state I/O | AMD 29030 is little-endian, MPC603 is big-endian |
| ~REQ | Three-state output | ~TS or ~XATS | Three-state I/O | ~REQ is valid for the entire transfer, and ~TS is valid for one clock cycle; ~TS is used for memory access; ~XATS is used for I/O access; I/O and memory in the 29030 have the same cycles, except for the IO/~MEM signal changing state |
| ~RDY | Input | ~TA | Input | Transfer acknowledge, used to insert wait states |
| I/~D | Three-state output | TC0 | Output | TC(0:1) further describe transaction type; TC0 maps to instruction/data |
| IO/~MEM | Three-state output | | | No equivalent on 603; however, the MSB of the address bus could be used for the same effect |
| ~BWE(3:0) | Three-state output | | | No equivalent on 603; needs to use TT1, TSIZ(0:2), and A(30:31) to generate equivalent signals |
| ~ERR | Input | ~TEA | Input | Not exactly equivalent; the 29030 samples the ~ERR signal when ~RDY is asserted, and assertion of ~TEA immediately initiates a machine-check exception |
| ~BURST | Three-state output | ~TBST | Three-state I/O | ~TBST is valid only during the address phase, and it needs to be latched; also, the 603 does not increment the address bus during bursts |
| ~PGMODE | Three-state output | | | Indicates access to the same page-mode block; no equivalent on 603 |
| ~ERLYA | Input | | | Requests early transmission of burst-mode addresses; no equivalent on 603 |
| ~RDN | Input | | | Indicates that accessed device is 8 or 16 bits wide; no equivalent on 603 |
| OPT(2:0) | Three-state output | TSIZ(0:2) | Three-state I/O | Indicates size of current transfer; encoding is different, and the TSIZ bits are valid only during the address cycle |
| ~WARN | Edge-sensitive input | ~SRESET | Input | Similar functionality, except that ~SRESET branches to the same exception as ~HRESET, whereas ~WARN has its own vector |
| ~INTR0 | Input | ~INT | Input | Interrupt |
| ~INTR(1:3) | Input | | | Extra interrupt pins; no equivalent on 603 |
| ~TRAP0 | Input | ~SMI | Input | System-management interrupt |
| ~TRAP1 | Input | | | Extra trap pin; no equivalent on 603 |
| STAT(2:0) | Output | DP(0:7) | Three-state I/O | When the HID0 EICE bit is set, the parity pins perform an execution-tracking function similar to that of the STAT pins |
| CNTL(1:0) | Input | | | Controls processor mode (halt, step, etc.); no equivalent on 603 |
| ~RESET | Input | ~HRESET | Input | Hard reset; 29030 vectors to address 0; 603 vectors to 0xFFF00100 |
| ~TEST | Input | | | Puts 29030 in high-impedance test mode; no equivalent on 603 |
| MSERR | Output | | | Master/slave error; no equivalent on 603 |
| INCLK | Input | SYSCLK | Input | 29030 is 33 MHz maximum; 603 is 66 MHz maximum |
| MEMCLK | I/O | | | Memory-subsystem clock, can be input or output at INCLK or INCLK/2; no equivalent on 603 |
| TCK, TMS, TDI, TDO, ~TRST | | TCK, TMS, TDI, TDO, ~TRST | | JTAG pins |
| PWRCLK | Power | | | If tied to +5V, MEMCLK is an output; if grounded, MEMCLK is an input; no equivalent on 603 |
| VCC | Power | VDD | Power | 29030 requires 5V ±5%; 603 requires 3.3V ±10% |
| GND | Ground | GND | Ground | Power dissipation at 33 MHz: 29030=0.8W, 603=1.1W |
| ~HIT, ~DI, ~WBC | Reserved | | | Reserved pins on 29030; must be tied high through individual pullups |

Table 4--Bus-signal comparison for RISC architectures
| Signal | R3xxx | PowerPC | SH3 | I960 | 29K | ARM |
| Address valid | ALE | ~TS | ~BS | ~ADS | ~REQ | nMREQ |
| Data valid | Ack or RdCEn | ~TA | WAIT [1] | ~READY or WAIT [1] | ~RDY | nWAIT |
| Address | AD(31:4)/BE(3:0) | A(0:31) | A(26:0) | A(31:2)/BE(3:0) | A(31:0) | A(31:0) |
| Data | AD(31:0) [2] | D(0:31) | D(31:0) | D(31:0) | D(31:0) | D(31:0) |
| Bus error | ~BusError | ~TEA | | | ~ERR | ABORT |
| Burst data valid | RdCEn | ~TA | | ~READY or WAIT [1] | ~RDY | nWAIT [1] |
| Burst address increment | Yes | No | Yes | Yes | Yes | Yes |
| Burst ID | ~BURST | ~TBST | [3] | ~BLAST [4] | ~BURST | nMREQ |
[1] Data is considered valid after WAIT is deasserted.
[2] Address and data buses are multiplexed.
[3] SH3 supports bursts only in on-chip SDRAM and DRAM controllers.
[4] I960 BLAST signal indicates that a burst is complete; it can be thought of as a (not) burst indication.

Author's biography

John Canosa is a principal member of the technical staff at Questra Consulting (Rochester, NY), where he designs and develops hardware and software for high-performance embedded systems. Before joining Questra, Canosa was the manager of electronics engineering at the University of Rochester's (Rochester, NY) Laboratory for Laser Energetics, which houses the Omega Inertial Confinement Fusion research laser system. He has a BSEE from Clarkson University (Potsdam, NY), an MSEE from the Rochester Institute of Technology (Rochester, NY), and more than 15 years' experience designing analog, digital, and embedded systems.




Copyright © 1995 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.