

Design Feature: April 13, 1995

DMA module enhances µC CPU performance

Kevin Anderson,
Motorola Inc

Matching a µC's CPU performance to a particular application can be a tricky task. However, to enhance µC performance, a DMA module peripheral can offload many data-transfer tasks from the CPU and help you defray overall system cost.

When selecting a single-chip µC for an embedded application, consider CPU performance. If the application requires a large number of 16-bit multiplies in a 1-msec timing loop, a CPU that can handle half that number is obviously inadequate. On the other hand, choosing a high-performance CPU to perform simple routines such as scanning a keypad and sending encoded output may result in an end product that is too costly.

Another key issue in selecting a µC involves the capabilities of the peripheral subset on the µC. A well-designed peripheral subset can stretch the performance of a low-cost CPU, which ultimately can drive down system costs.

One example of a peripheral that increases CPU performance is an on-chip DMA module. Traditionally, designers have used DMA to move large blocks of data efficiently within a system. Current µC architectures, such as Motorola's HC08 CPU, have extensions that allow for additional on-chip processing modules. In such cases, you can use a DMA module to perform peripheral service routines. Using this capability can significantly decrease peripheral service latency and increase effective CPU efficiency. The bulk of this article describes the DMA implementation for a single-chip µC in an embedded application; it also includes programming examples that demonstrate how DMA can significantly improve system performance.

DMA serves as a means of transferring data between memory and buffers without CPU intervention. In a single-chip µC, large data-block moves are not common, because these devices usually operate independently in a system and are not connected through a bus structure to other CPUs and memory systems. In addition, latency and overhead are associated with entering and exiting an interrupt service routine (see box, "Consider both latency and overhead"). However, once initialized, DMA can service numerous transmit requests without further code execution, and it only interrupts the CPU after the entire message is sent. Transferring each byte takes only two cycles, contrasting with the dozens of cycles needed for CPU service.

Architectural features aid DMA performance

Three architectural features of HC08-family µCs contribute to the chips' ability to perform DMA functions. First, the peripherals are memory-mapped, which means the CPU's address space can access control and status registers. Thus, any instruction that can access memory can also access a peripheral so the access doesn't require special peripheral service routines. Similarly, the DMA module can also access the peripheral registers as just another space in the memory map.

Second, the HC08 CPU allows on-chip co-processor modules to take control of the system address and data buses at a bus-cycle boundary. Motorola designed the system so that a co-processor can either retain bus control until it has finished its transfer or be allocated a certain maximum percentage of the bus bandwidth. The latter method is known as "cycle stealing" and allows for overlap between the CPU and the co-processor.

Finally, Motorola designed the peripheral modules with hooks into the DMA module. You can initialize each module having DMA service, so that an interrupt-service request is directed to the DMA module rather than to a CPU interrupt. You can also set up a DMA channel to handle the interrupt request from the peripheral. Once initialized, the DMA module can service the peripheral without CPU code execution and with only two-cycle latency vs the minimum of nine cycles for CPU service.


Old burglars never die

The DMA operates by stealing cycles from the CPU (see box, "Architectural features aid DMA performance"). For each group of bytes transferred, DMA takes over the data and address buses for two cycles. Once the transfer completes, the CPU immediately regains control of the bus and continues operation. In low-data-rate situations, such as 9600-baud serial transfers, the number of cycles taken from the CPU is a very small percentage of the total cycles. If the application requires quick transfer of large amounts of data, such as moving a table of saved values from EEPROM to RAM, DMA transfer can take over the entire bus bandwidth. Large data transfers effectively halt the CPU; however, the overall efficiency remains high because the DMA transfer occurs in fewer cycles than a CPU transfer.
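The claim that low-data-rate transfers cost only a small share of the bus can be checked with a short estimate. The sketch below is not from the article: it assumes 10 bit times per character (start, eight data, stop) and an 8-MHz bus clock, and the function name is illustrative. Only the two-cycles-per-byte figure comes from the text.

```python
# Rough estimate of the bus load caused by DMA cycle stealing on a serial link.
# Assumptions (not from the article): 10 bit times per character and an
# 8-MHz bus clock. Substitute the figures for your own part.

DMA_CYCLES_PER_BYTE = 2  # from the article: two bus cycles per byte moved


def dma_bus_load(baud, bus_hz, bits_per_char=10):
    """Return the fraction of bus cycles the DMA steals at a given baud rate."""
    chars_per_second = baud / bits_per_char
    return chars_per_second * DMA_CYCLES_PER_BYTE / bus_hz


load = dma_bus_load(baud=9600, bus_hz=8_000_000)
print(f"DMA steals {load:.3%} of the bus cycles")
```

Under these assumptions the link moves 960 characters per second, so the DMA takes 1920 of the 8 million bus cycles available each second.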

Fig 1 shows the DMA module in the HC08. Each DMA channel has its own set of programmable registers; the CPU can read or write to all of the registers. You program the base address of the data source and destination into the source and destination registers. These registers are 16 bits wide and can point to any place in the µC's memory space. The block-length register sets the total number of bytes to transfer.

The DMA module can transfer as many as 256 bytes in a block. The DMA module clears the byte-count register before a block transfer begins. After initialization, the byte-count register increments for each byte transferred. The DMA's ALU uses this value in calculating the source and destination address during a block transfer. The CPU may read this register at any time to determine how many bytes out of the total have been sent.


Control registers assign service

Each channel also has a channel-control register, which lets the µC assign service to any specific peripheral. The channel-control register also selects whether the DMA module will send one or two bytes in a single transfer. Two-byte transfers occur consecutively and take four cycles to complete.

In addition, the channel-control registers control source and destination address generation. The channel source and destination addresses may be set to increment, decrement, or remain static as each byte transfers. Flexible address-generation control is key to the DMA module's ability to service peripheral modules. For example, if you're using DMA to move a data table from ROM to RAM on power-up, you set both source and destination registers to increment. In a different scenario, you might use DMA to service an A/D converter connected to the serial peripheral interface (SPI). Here, the source points to the SPI data register and remains static while the destination address increments and stores successive bytes in a receive buffer. Alternatively, the destination address could remain static, in which case the CPU always has access to the latest A/D-converter value.
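The address-generation options can be pictured with a toy software model of one channel. This is a sketch only: the class, its field names, and the addresses used (including the SPI data-register address) are hypothetical and do not correspond to actual HC08 registers. It assumes, as the text describes, that each address is formed from the base address and the running byte count.

```python
# Toy model of one DMA channel's address generation.
# Names and addresses are illustrative, not HC08 register definitions.

INCREMENT, DECREMENT, STATIC = 1, -1, 0


class DmaChannel:
    def __init__(self, source, dest, block_length, src_mode, dst_mode):
        self.source = source              # 16-bit base address of the source
        self.dest = dest                  # 16-bit base address of the destination
        self.block_length = block_length  # total bytes in the block
        self.src_mode = src_mode
        self.dst_mode = dst_mode
        self.byte_count = 0               # cleared before a block transfer begins

    def transfer_byte(self, memory):
        """Move one byte; return True when the whole block has been sent."""
        src = (self.source + self.src_mode * self.byte_count) & 0xFFFF
        dst = (self.dest + self.dst_mode * self.byte_count) & 0xFFFF
        memory[dst] = memory[src]
        self.byte_count += 1
        return self.byte_count == self.block_length


memory = bytearray(0x10000)

# Case 1: copy a table from ROM to RAM; both addresses increment.
memory[0xE000:0xE004] = b"\x11\x22\x33\x44"
table_copy = DmaChannel(0xE000, 0x0080, 4, INCREMENT, INCREMENT)
while not table_copy.transfer_byte(memory):
    pass

# Case 2: service an A/D converter on the SPI; the source stays static
# while the destination walks through a receive buffer.
SPI_DATA = 0x0012  # hypothetical address of the SPI data register
adc = DmaChannel(SPI_DATA, 0x00A0, 3, STATIC, INCREMENT)
for sample in (0x10, 0x20, 0x30):
    memory[SPI_DATA] = sample  # stands in for a newly received SPI byte
    adc.transfer_byte(memory)
```

Setting the destination mode to STATIC in the second case would instead overwrite a single location, so that location always holds the latest sample.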

Table 1 -- DMA-vs-CPU block-move performance

Block size    CPU clocks    DMA clocks    Relative performance
8             90            46            2.0
16            178           62            2.9
32            354           94            3.8
64            706           158           4.5
128           1410          286           4.9
256           2818          542           5.2

The DMA module has a set of system status and control registers that control other module options. Each channel can be individually enabled, set to interrupt the CPU at the end of a transfer block and to loop back to the beginning of a transfer once a block has completed. This last capability allows you to create a process that can run forever without CPU intervention. The bus bandwidth available for DMA operations is also programmable. You can set the bandwidth to allow 25, 50, 67, or 100% of the bus cycles available for DMA operation.

A couple of examples help to illustrate the DMA module's performance-enhancement capability. A common software routine found in embedded-control initialization code is the block move. This routine can clear a section of memory or copy a table from program ROM to RAM. Another use is copying system variables from EEPROM to RAM, where the program can modify them without using an EEPROM erase or programming cycle. At power-down, another routine saves the modified values back to EEPROM so that the new values are available at the next power-up.


CPU transfers block data

Example 1—CPU-block-move pseudo-code
Example 1 shows an in-line software block move similar to that found in initialization pseudo-code. This routine transfers data between any two blocks within a 16-bit address space. To express the total number of CPU cycles consumed in transferring N bytes, use

$T_{CPU} = 13 + 11(N-1)$.

Example 2 -- DMA-block-move pseudo-code
Example 2 shows the DMA code for the case above. You initialize the DMA channel with the number of bytes to be transferred as well as the source and destination pointers. The channel is also set to use 100% of the bus bandwidth. You initiate the transfer by writing to a DMA control register, which allows software to start a DMA transfer. To determine the total number of CPU cycles used to transfer N bytes in this manner, use

$T_{DMA} = T_{init} + 2N = 30 + 2N$.

Table 1 lists the number of cycles required for various transfer sizes and relative performance improvement when opting for DMA over CPU transfer.

Using the data from the two equations above, it is evident that, for transfers of more than 3 bytes, the DMA-transfer method is quicker. However, another factor to consider is the amount of code space each routine uses. The CPU-transfer code takes 10 bytes, whereas the DMA initialization code takes 22. Therefore, you may need to choose between execution speed and program size if you expect the program memory to fill completely.
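The two cycle-count expressions are easy to evaluate side by side. The sketch below simply encodes the article's formulas, 13 + 11(N-1) for the CPU loop and 30 + 2N for the DMA transfer; the function names are illustrative. It prints the cycle counts for the block sizes in Table 1 and finds the smallest block for which DMA is quicker.

```python
# Cycle-count models taken from the article's two block-move equations.

def cpu_block_move_cycles(n):
    """CPU in-line block move: 13 cycles for the first byte, 11 for each further byte."""
    return 13 + 11 * (n - 1)


def dma_block_move_cycles(n, init_cycles=30):
    """DMA block move: fixed initialization cost plus two bus cycles per byte."""
    return init_cycles + 2 * n


# Smallest block size at which the DMA transfer takes fewer cycles.
break_even = next(
    n for n in range(1, 257)
    if dma_block_move_cycles(n) < cpu_block_move_cycles(n)
)
print(f"DMA is quicker from {break_even} bytes upward")

print("Block size  CPU clocks  DMA clocks  Relative performance")
for size in (8, 16, 32, 64, 128, 256):
    cpu = cpu_block_move_cycles(size)
    dma = dma_block_move_cycles(size)
    print(f"{size:<12}{cpu:<12}{dma:<12}{cpu / dma:.1f}")
```

At 3 bytes the CPU loop still wins by one cycle (35 vs 36); at 4 bytes the DMA transfer takes 38 cycles against 46.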


Use DMA for large serial transfers

Example 3 -- SCI transmit using CPU-interrupt pseudo-code
The second example demonstrates use of the DMA module with an onboard peripheral, the serial communications interface (SCI). Fig 2 shows a simplified block diagram of the SCI transmitter logic. Once the SCI is initialized and enabled, transmission begins when you write a character to the SCI data register. The SCI immediately transfers this character into the transmit-shift register and, under control of the transmit control logic, shifts the bits out to the transmit pin at the programmed baud rate. When the character moves from the SCI data register to the transmit-shift register, the SCI sets the transmit-data-register-empty flag and sends an interrupt request to the CPU.

Example 3 shows the pseudo-code required to initialize the SCI and transmit a message using CPU interrupts. This example assumes that another routine has placed a message in the transmit-buffer space, which begins at the pointer TXADDR, and has initialized the TXEND pointer to the next byte following the end of the message. The code begins with the steps needed to initialize the SCI, including the transfer format and the baud rate, and to enable the SCI and CPU interrupts.

The transmit interrupt-service routine (ISR) checks the TXADDR pointer and compares it with the TXEND pointer to determine if the end of the buffer has been reached. If not, the CPU moves the byte specified by the TXADDR pointer to the SCI transmitter data register (SCD), increments the pointer by one, and clears the flag before exiting the routine. If the two pointers are equal, the complete message has been transferred and the routine ends by disabling the transmitter.
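The ISR logic just described can be mirrored in a small software model. This is a sketch of the control flow only, not the article's HC08 listing: the class, function, and dictionary keys are hypothetical stand-ins for the SCI, the ISR, and the TXADDR and TXEND pointers.

```python
# Software model of the transmit ISR's control flow.
# All names are illustrative stand-ins, not HC08 code.

class SciTransmitter:
    """Minimal stand-in for the SCI: records bytes written to the data register."""

    def __init__(self):
        self.sent = bytearray()
        self.enabled = True


def transmit_isr(sci, buffer, pointers):
    """One pass through the ISR; returns False once the transmitter is disabled."""
    if pointers["txaddr"] == pointers["txend"]:
        sci.enabled = False                       # end of buffer: disable transmitter
        return False
    sci.sent.append(buffer[pointers["txaddr"]])   # move the byte to the data register
    pointers["txaddr"] += 1                       # advance the transmit pointer
    return True


message = b"HELLO"
sci = SciTransmitter()
pointers = {"txaddr": 0, "txend": len(message)}

isr_entries = 0
while True:
    isr_entries += 1
    if not transmit_isr(sci, message, pointers):
        break

print(f"{len(message)} bytes sent in {isr_entries} ISR entries")
```

In this model a message of N bytes causes N + 1 ISR entries: one for each character and a final one that finds the pointers equal and disables the transmitter.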

Now, analyze this routine to find the number of CPU cycles required to transmit a message of arbitrary length. Initializing the SCI takes 14 cycles. For every character transmitted, the CPU executes 29 cycles, plus a minimum of nine cycles to enter the ISR. While the SCI shifts the last character out, it generates one final interrupt request, and the ISR takes the branch that disables the transmitter (TCEND). This final pass takes 26 cycles, plus nine overhead cycles. For a message of length N, the total is

$T_{cyclesIRQ} = 14 + N(29+9) + (26+9) = 38N + 49$.

The actual number will be somewhat larger because the CPU can take interrupts only on instruction boundaries. Therefore, it is likely that one or more additional cycles of latency will occur for each interrupt-service entry.

Consider both latency and overhead

The HC08 DMA-peripheral-service latency can be as low as two bus cycles (Fig A). This condition occurs when you set the DMA module for 100% bandwidth or when the CPU has received the minimum number of bus cycles since the last DMA transfer. Normal CPU-interrupt-service latency is nine bus cycles. Thus, in the best case, the DMA module reduces service latency by a factor of 4.5 relative to an interrupt-service routine (ISR). Latency can be an important parameter if fast peripheral service is imperative.

However, latency is only one factor to consider; the other is overhead, which is defined as the number of CPU bus cycles required to enter and exit the ISR. The HC08 CPU takes nine bus cycles to enter an ISR and seven cycles to exit, for a total of 16 cycles.

Overhead is in addition to the number of cycles the CPU uses to actually execute the service routine. Service of the peripheral by the DMA module does not incur this overhead, because the CPU retains control of the bus and can do useful work during the latency period before the DMA channel actually receives the bus.


CPU-vs-DMA serial-data transmission

Example 4—SCI transmit using DMA pseudo-code
You can compare these results with the number of cycles needed to make the same transfer using the DMA module to service the transmit interrupt. Example 4 shows the pseudo-code required. The CPU must still initialize the SCI, except that the DMA module, rather than the CPU, handles the transmit-data-register-empty (TDRE) interrupts. The DMA setup involves initializing the DMA source and destination addresses (TXADDR and SCD, respectively), setting the block length (number of bytes to be transferred), setting the transfer type (increment the source address and hold the destination address static), and assigning the channel to service the SCI TX interrupt. The DMA module will also enable a CPU interrupt once the last DMA transfer has taken place. The process takes 27 cycles.

Once the DMA module enables the SCI transmitter, each byte takes two bus cycles. After the last DMA transfer has completed, the DMA module issues an interrupt to the CPU. The ISR checks to see if this channel made the interrupt request. If so, the CPU disables the transmitter and clears the flag. The number of clock cycles equals 24, plus nine to enter the ISR. To derive the total number of cycles to transfer N bytes, use

$T_{cyclesDMA} = 45 + N(2) + (24+9) = 78 + 2N$.

Table 2 shows the number of cycles needed by each method for several transfer sizes. The DMA method takes fewer cycles even for a 1-byte transfer. The DMA code is larger (this time by 14 bytes), but, in most instances, the throughput improvement far outweighs the cost of the additional code bytes.
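The SCI comparison follows directly from the two totals given in the text, 38N + 49 cycles for interrupt-driven transmission and 78 + 2N cycles for DMA-driven transmission. The sketch below encodes those expressions with the article's constants; the function names are illustrative. It prints both counts for the message lengths listed in Table 2.

```python
# Cycle-count models taken from the article's two SCI-transmit equations.

def sci_irq_cycles(n):
    """Interrupt-driven transmit: SCI init, N character ISRs, one final ISR."""
    return 14 + n * (29 + 9) + (26 + 9)


def sci_dma_cycles(n):
    """DMA-driven transmit: setup, two bus cycles per byte, one end-of-block ISR."""
    return 45 + n * 2 + (24 + 9)


print("Block size  CPU clocks  DMA clocks  Relative performance")
for size in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20):
    cpu = sci_irq_cycles(size)
    dma = sci_dma_cycles(size)
    print(f"{size:<12}{cpu:<12}{dma:<12}{cpu / dma:.1f}")
```

Because each additional byte costs 38 cycles under interrupts and only two under DMA, the advantage widens steadily with message length.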

The examples listed above show just a few possible uses of DMA within the µC architecture. In conjunction with other peripheral modules, you can devise many other applications. For example, in concert with a timer, you can implement a fast, flexible, variable-duty-cycle PWM requiring only one timer channel and no CPU intervention.

Table 2 -- DMA-vs-CPU SCI-transmit performance

Block size    CPU clocks    DMA clocks    Relative performance
1             87            80            1.1
2             125           82            1.5
3             163           84            1.9
4             201           86            2.3
5             239           88            2.7
6             277           90            3.1
7             315           92            3.4
8             353           94            3.8
9             391           96            4.1
10            429           98            4.4
15            619           108           5.7
20            809           118           6.9



Kevin Anderson is a field applications engineer for Motorola's Semiconductor Products Sector. Anderson develops applications and performs engineering for automotive electronics. He has helped to develop several versions of 68HC05, 68HC08, 68HC11, and 68300 µCs. Anderson earned a BSEE from South Dakota State University (Brookings, SD) and is a member of the IEEE and Society of Automotive Engineers. In his spare time, he enjoys music, boating, and the outdoors.




Copyright © 1995 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.