June 18, 1998
Debugging embedded systems: 10 common hardware problems and how to solve
them
Stuart R Ball
Some of the most common hardware problems can derail your debugging
effort if you don't watch out. Here are some tricks to keep your debugging on track.
Some of the most devilish debugging problems are the result of seemingly
little things--floating pins, wrong values for pullup resistors, and so forth. These are
the kinds of things that are easiest to overlook, both at design time and during
debugging, so they're often some of the last things you consider when trying to get to the
bottom of a problem. Don't overlook them. Make a list. Check for each problem
systematically, and you'll save yourself a lot of grief. Here's a list you can start with.
Floating pins
Many years ago, an engineer colleague came looking for suggestions about
a problem he was having. He had a µP circuit with a UV-erasable EPROM, and the circuit
would work only when he opened the cover of the box it was installed in, or if he put a
flashlight in the box with the cover closed. It turned out that the Vpp pin
(which receives the programming voltage during programming) was floating. Apparently, the
chip needed just a little light (through the erasure window, which wasn't covered) to bias
everything enough to make it work.
In the days when everything was TTL, a floating input would show up on a
scope as about 1 to 2V. Now that nearly everything is CMOS, floating inputs usually look
like ground. Often, when an input is floating, running your fingers over the board
will change circuit operation. Also, if an IC, such as an 8-bit register, fails only with
certain data patterns, look for a missing ground. Many CMOS parts will work without a
ground connection as long as one of the inputs is low, but as soon as they all go high,
everything stops.
It's often safe to leave unused µP pins floating, but it's better to
pull them to an inactive state. A µC with internal pullups, of course, usually needs no
other termination.
Risetime problems
Pullups can cause problems in other ways, too. I once worked on a circuit (Figure 1) that failed on power-up reset
intermittently, and then only on some production boards. Apparently, to save power, the
designer had used large-value (>100-kohm) pullups. Tristate
buffers prevented inadvertent writes to the battery-backed RAM chips during the unstable
interval while power was coming up. The problem was that the 74AC08 inputs on some of the
boards saw the reset go inactive before the tristate buffer did. The result was that the
processor would come out of reset before the RAM circuit was ready, so it couldn't access
the RAM. Instant crash.
In another case, a designer had used too large a pullup on a
68000-family part, making a signal's risetime longer than the µP's specification allowed.
The circuit worked in production for several months, and then the purchasing department
bought a different brand of processor that was less forgiving of the input-transition
time.
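To see how quickly a large pullup eats into a risetime specification, you can estimate the edge with a simple RC model. The sketch below is only a first-order approximation: it assumes a single-pole RC node starting at 0V, and the 15-pF node capacitance and the function names are illustrative assumptions, not values from either design described above.

```python
import math

def rc_rise_time(r_ohms, c_farads):
    """10%-to-90% rise time of a single-pole RC node: RC * ln(9), about 2.2 * RC."""
    return r_ohms * c_farads * math.log(9.0)

def time_to_threshold(r_ohms, c_farads, v_threshold, v_supply):
    """Time for a node pulled up from 0 V to reach v_threshold."""
    return -r_ohms * c_farads * math.log(1.0 - v_threshold / v_supply)

# Assumed node capacitance (pin plus trace); measure or estimate your own.
C_NODE = 15e-12

for r_pullup in (4.7e3, 100e3):
    t_rise = rc_rise_time(r_pullup, C_NODE)
    print(f"{r_pullup / 1e3:g} kohm pullup: rise time about {t_rise * 1e9:.0f} ns")
```

Compare the result with the maximum input-transition time in the processor's data sheet, and remember that the input threshold differs from part to part and from one manufacturer to another.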
Peripheral timing problems
The timing cycle for a µP accessing a peripheral IC can always be
problematic. In the example of Figure 2, which
shows the typical timing for an Intel-style processor communicating with some generic
peripheral IC, any of six timing parameters (T1 through T6) can cause problems.
In the write cycle, the processor asserts the address, then asserts the
WR signal, then presents data to the peripheral. Time T1 on the diagram is the
address-setup time before WR goes low. If the design doesn't satisfy this parameter, the
write data could go to the wrong register (or memory location) inside the peripheral. A
similar situation exists for time T2, the data hold time after WR's rising edge. If the
design doesn't satisfy T2, the peripheral might store the wrong data. The last parameter,
T3, is the minimum length of the WR pulse itself. Some peripherals also have a
maximum-length parameter.
In the read cycle of Figure 2's
example, time T4 is the address-setup time before RD's falling edge. Satisfying this
parameter is usually less critical than satisfying the write cycle's equivalent parameter
unless the peripheral latches the address on RD's falling edge. Time T5, the time the data
must be stable before RD's rising edge, is effectively the peripheral IC's access time. If
you don't satisfy T5, the processor might read the wrong data. Time T6 is the data hold
time after read cycle completion. This parameter is most likely to be a problem on a
processor with a multiplexed address/data bus, where a peripheral that doesn't release the
bus quickly enough can cause bus contention on the next cycle.
These parameters are typical of processor and peripheral data sheets,
but there are others, of course. Some peripherals have a parameter for the minimum time
between successive accesses, or they require synchronization of input signals to a clock.
Sometimes they want write data to be stable before WR's leading edge, which requires
additional logic with Intel-type processors. Other processors, such as the Motorola 68000
family, have different cycle and signal structures, but the same types of timing
requirements apply.
Many designers just connect peripherals together, assuming that if the
clock rates or the access times are right, everything else will work, too. This approach
can be dangerous, especially if production will run for many months or years, giving
plenty of opportunity for installation of parts from different manufacturing lots. It's
best to find a timing problem when you start a design, because fixing one can add a
significant amount of logic to a board. Verify at the outset that your design meets all
timing parameters.
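One way to make that verification systematic is to tabulate each parameter and compute the margin. The following is a minimal sketch: the class, the function, and every number in it are hypothetical placeholders, and you would fill in worst-case values from your own processor and peripheral data sheets.

```python
from dataclasses import dataclass

@dataclass
class TimingParam:
    name: str
    required_ns: float  # minimum the peripheral data sheet demands
    provided_ns: float  # worst case the processor bus cycle guarantees

    @property
    def margin_ns(self):
        return self.provided_ns - self.required_ns

def failing_params(params):
    """Return the parameters whose margin is negative."""
    return [p for p in params if p.margin_ns < 0]

# Hypothetical write-cycle numbers, for illustration only.
write_cycle = [
    TimingParam("T1 address setup before WR falls", required_ns=20, provided_ns=35),
    TimingParam("T2 data hold after WR rises", required_ns=10, provided_ns=5),
    TimingParam("T3 WR pulse width", required_ns=100, provided_ns=120),
]

for p in write_cycle:
    status = "ok" if p.margin_ns >= 0 else "VIOLATION"
    print(f"{p.name}: margin {p.margin_ns:+.0f} ns ({status})")
```

With these made-up numbers, only the T2 hold time fails. A check like this is cheap to repeat when purchasing substitutes a part from another manufacturer.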
Risetime problems, timing problems, and floating-pin problems are often
temperature sensitive, because parts' thresholds and speeds shift slightly with
temperature. If you have an intermittent problem that you suspect is caused by one of
these conditions, you can often make it show up by using circuit-cooling spray to cool a
board or a hair dryer to heat it. Be careful not to get IC packages so cold that they
crack or so hot that they melt.
EMI problems
Embedded systems often must control stepper motors, dc motors, or
relays, all of which can cause electromagnetic interference (EMI) problems. Any inductive
device will cause EMI when it switches on or off. Whether the EMI causes problems is
another matter.
One easy-to-prevent problem appears in Figure 3a,
in which a µC drives a relay through a port pin and a MOSFET transistor. When the port
pin goes low, the MOSFET will turn off, opening the relay. Note, though, that there's no
protection diode between the transistor drain and the supply. When the relay opens, the
energy stored in the relay has to go somewhere; the result is a massive voltage spike on
the transistor's drain. Depending on the characteristics of the relay coil and the
transistor, this flyback voltage can approach 100V--enough to destroy the transistor.
A solution to this problem, shown in Figure 3b, is the addition of a snubber diode across the relay coil. The transistor
drain is now clamped to a 1-diode voltage drop above the positive supply. For faster
opening of the relay, you can use a transient-suppresser diode instead, allowing the drain
voltage to rise to a voltage somewhere between the supply voltage and "total
destruction." If you use a transient suppresser, remember that the drain voltage will
rise to the sum of the supply voltage and the transient-suppresser clamp voltage.
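The trade-off between clamp voltage and release time can be estimated with a simple model. This sketch assumes an ideal inductor in series with the coil resistance and a constant clamp voltage, and it ignores the relay's mechanical behavior; the coil values are assumed for illustration. Under those assumptions, the coil current decays to zero in (L/R) ln(1 + I0 R / Vclamp).

```python
import math

def coil_decay_time(l_henries, r_coil_ohms, i0_amps, v_clamp):
    """Time for coil current to fall from i0 to zero against a constant clamp voltage."""
    return (l_henries / r_coil_ohms) * math.log(1.0 + i0_amps * r_coil_ohms / v_clamp)

def peak_drain_voltage(v_supply, v_clamp):
    """Drain voltage while the clamp conducts: supply plus clamp voltage."""
    return v_supply + v_clamp

# Assumed relay: 12 V supply, 400 ohm coil (30 mA), 1 H inductance.
V_SUPPLY, R_COIL, L_COIL = 12.0, 400.0, 1.0
I0 = V_SUPPLY / R_COIL

for label, v_clamp in (("diode, 0.7 V", 0.7), ("suppresser, 24 V", 24.0)):
    t = coil_decay_time(L_COIL, R_COIL, I0, v_clamp)
    v_peak = peak_drain_voltage(V_SUPPLY, v_clamp)
    print(f"{label}: current gone in {t * 1e3:.2f} ms, drain peaks near {v_peak:.1f} V")
```

The higher clamp voltage removes the coil current several times faster, at the cost of a drain voltage the transistor must be rated to withstand.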
The catch to this solution, which designers often overlook, is that this
fix isn't really free. Adding a diode protects the transistor, but the coil energy still
has to go somewhere, and it does. It takes the form of a current spike into the positive
supply. If the supply lacks proper bypassing, the result can be a voltage spike on the
supply itself. So, when driving relays (or dc motors or solenoids), take a little extra
care to be sure that the supply has adequate bypassing and that the path between the relay
and the supply has a low impedance.
Figure 4a, which shows a µP-based board driving a motor, illustrates another
current-related problem. When the motor turns on, current
increases, and the increased current passes through the ground wires back to the power
supply and to chassis. The current causes a voltage drop in the wiring, because the wiring
impedance (inductance plus resistance) is never zero. If the wiring inductance is high
enough, the voltage drop can upset the processor board's ground enough to affect the
board's operation or to corrupt communication with other boards in the system. Stepper
motors or dc motors that are PWM (pulse-width-modulation) controlled present a particular
problem, because a high-frequency surge usually occurs when the current turns on.
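A rough calculation shows why the inductive part of the drop matters so much when the current switches quickly. The sketch below uses V = IR + L(di/dt); the wiring resistance, wiring inductance, and current step are assumed values chosen only to illustrate the arithmetic.

```python
def ground_shift(r_ohms, l_henries, i_amps, di_dt_amps_per_s):
    """Voltage across a ground return: resistive drop plus inductive drop."""
    return i_amps * r_ohms + l_henries * di_dt_amps_per_s

# Assumed ground return: 50 milliohms and 500 nH of wiring.
R_WIRE, L_WIRE = 0.05, 500e-9

# Assumed motor current step: 2 A switching in 100 ns.
I_STEP, T_EDGE = 2.0, 100e-9

resistive = ground_shift(R_WIRE, 0.0, I_STEP, 0.0)
total = ground_shift(R_WIRE, L_WIRE, I_STEP, I_STEP / T_EDGE)
print(f"resistive drop {resistive:.2f} V, drop during the edge {total:.2f} V")
```

With these numbers the steady-state drop is a tenth of a volt, but the drop during the switching edge is on the order of 10V, which is far more than logic ground can tolerate.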
The circuit of Figure 4b
minimizes this problem by adding a third ground wire, which isn't connected to logic
ground, for returning motor current to the power supply. The motor still causes a current
surge, but it doesn't affect the logic ground. However, although this addition solves the
EMI problem, it can cause other problems. An H-bridge usually drives the motor, and if the
motor return and logic ground get too far apart, the voltage differential and resulting
current can damage the H-bridge.
Ground loops
In the classic case of a ground loop, two circuits connect to different
grounds and to each other, and the grounds have slightly different ac or dc potentials.
Because the impedance between the two grounds is very low, significant current can flow in
the grounds themselves.
Figure 5, which shows an embedded-µP system communicating with a host PC,
illustrates a ground-loop problem. Both systems get power from a 115V ac line. If the two
systems connect to different branches of an ac circuit (for example, if they're in
different rooms or different buildings), then significant current can flow in the ground.
The current flows through the ground wires in the interface connections.
A ground-loop problem can be particularly bad if two connected systems
operate on different ac voltages. A typical situation involves a µP system that's part of
a large machine requiring 208V 3-phase power. To make matters worse, other heavy
equipment, such as air conditioning, might share the 3-phase power. I've seen RS-422
drivers literally destroyed when an embedded system's ground got yanked around by air
conditioning compressors turning on and off.
If the interface between a host PC and an embedded system is RS-232C or
serial RS-422, you can sometimes solve a ground-loop problem by running the interface
through an optical isolator pair. If the interface is parallel, a LAN, or some other
high-speed interface, it might be necessary to ensure that the two systems are on the same
branch of an ac line. If the voltages are different, you might have to ensure that both
systems have clean, independent returns to your building's ground, with no heavy-duty
equipment sharing the ground return.
Ground-loop problems can also occur within an embedded system that has
many boards and modules, each with a separate power supply. Sometimes you can fix these
problems with a ferrite bead on the right cable, but that tends not to be a very permanent
or repeatable fix.
Low-level signals
Ground loops can be a problem even without directly affecting your processor; they can affect the
devices the processor connects to. Figure 6a,
which shows a processor board using a thermistor to read temperature, provides an example.
The thermistor has a fairly low output level--say, 1 mV per degree. The logic in the
thermistor's vicinity on a board draws current, which causes voltage drops across the
power supply wiring and all the connectors. The normal voltage drop typical of such
systems isn't enough to upset the logic, but it can be enough to cause an offset in the
thermistor reading. Worse, the value may change as the dc current changes with the state
of the logic. The solution to this problem is to give the thermistor a separate return (Figure 6b), so that it's not affected by the
offset voltage. Of course, the same principle applies to strain gauges, pressure
transducers, or any other low-level analog input device.
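The size of the error is easy to estimate. This sketch takes the 1-mV-per-degree sensitivity from the example above; the logic current and the shared return resistance are assumed values, and the function name is only illustrative.

```python
def reading_error_degrees(return_current_a, shared_resistance_ohms,
                          sensitivity_v_per_deg=1e-3):
    """Temperature error caused by the voltage drop in a shared ground return."""
    offset_v = return_current_a * shared_resistance_ohms
    return offset_v / sensitivity_v_per_deg

# Assumed: 200 mA of logic current through 25 milliohms of shared wiring and connectors.
error = reading_error_degrees(0.2, 0.025)
print(f"offset error of about {error:.1f} degrees")
```

A 5-mV drop that the logic never notices becomes a 5-degree error in the reading, and the error moves as the logic current changes.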
Shorted outputs
Another source of EMI problems is shorted outputs. It has been my
experience that having two CMOS or TTL outputs shorted together can make an entire circuit
susceptible to noise. And, of course, the shorted outputs themselves dump a lot of noise
into the grounds.
Self-generated ESD
ESD (electrostatic discharge) will often upset a µP-based circuit,
because an ESD pulse contains very high frequencies that couple very readily into logic.
Even if your equipment is resistant to ESD from external sources, electromechanical
equipment can generate ESD internally. If your system uses rotating motors and has bizarre
failures, look for ESD, especially if a motor couples to a pulley with a belt. A motor
driving a pulley with a belt made of insulating material can be a good generator of ESD,
as can any two insulators rubbing against each other--for example, a plastic brake that
prevents coasting on a plastic drum.
The usual solution to such ESD problems is to use belts and pulleys that
are slightly conductive. If this isn't possible, you may have to use a conductive brush to
carry the charge to ground, or you may need to look at alternative drive mechanisms.
The problem goes away
It happens far too often: A subtle bug makes you tear your hair out, but
when you hook up a logic analyzer or a scope to look at it, it goes away. When this
happens, look for timing errors or race conditions. Usually, the test equipment is adding
a few picofarads of capacitance, enough to slow down the risetime of some signal.
Race conditions
Race conditions frequently occur in embedded systems, as Figure 7
illustrates. In the figure, a µC drives a 74AC139 to generate pulses for some external
system. Use of the 139 allows the controller to generate four separate outputs using only
two port lines. The outputs could have various purposes--for example, generating
interrupts to other boards or clocking data into registers.
As the timing diagram of Figure 7
shows, problems arise when the µC steps through the select lines. As each input line
changes state, a momentary glitch appears at one or more outputs. (The diagram shows
glitches on the Y1 and Y3 outputs, but the glitch locations can vary in devices from one
manufacturer to another according to the devices' internal structures.) If the outputs
drive registers or latches that are fast enough to respond to the glitches, invalid data
can result. The solution to this particular problem is to use a third port pin, connected
to the AC139's enable input, to gate the outputs off when the select inputs are changing.
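A small simulation illustrates one mechanism behind such glitches and why gating helps. This is a simplified model, not a description of the real part's internals: it assumes one half of a '139-style 2-to-4 decoder with active-low outputs, and it assumes select bit A always settles before bit B. Real devices can also glitch for internal reasons this model doesn't capture.

```python
def decode_2to4(select, enabled=True):
    """Active-low outputs (Y0..Y3) of one 2-to-4 decoder section."""
    if not enabled:
        return (1, 1, 1, 1)
    return tuple(0 if i == select else 1 for i in range(4))

def transition(old, new, use_enable):
    """Output patterns seen while stepping from select code old to new.

    Assumes bit 0 (A) settles before bit 1 (B); real skew can go either way.
    """
    mid = (old & 0b10) | (new & 0b01)  # A updated, B still old
    if use_enable:
        # Firmware sequence: disable outputs, change selects, re-enable.
        return [decode_2to4(old, False), decode_2to4(mid, False),
                decode_2to4(new, False), decode_2to4(new, True)]
    return [decode_2to4(mid), decode_2to4(new)]

def glitches(old, new, use_enable):
    """Outputs other than old and new that pulse low during the step."""
    low = set()
    for pattern in transition(old, new, use_enable):
        low |= {i for i, y in enumerate(pattern) if y == 0}
    return sorted(low - {old, new})

print("ungated 1 -> 2 glitches on:", glitches(1, 2, use_enable=False))
print("gated   1 -> 2 glitches on:", glitches(1, 2, use_enable=True))
```

In this model, stepping from select code 1 to code 2 briefly passes through code 0, so Y0 pulses low unless the enable input holds the outputs off during the change.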
References
- Ball, Stuart R, "Debugging embedded systems: using
a trace buffer to see what went wrong," EDN, April 9, 1998, pg 161.
- Ball, Stuart R, "Debugging embedded systems: using
hardware tricks to trace program flow," EDN, April 23, 1998, pg 163.
- Ball, Stuart R, "Debugging embedded systems: using
a serial condition monitor to overcome limited diagnostic access," EDN, June 4,
1998, pg 167.
This article, one of an occasional series on basic debugging techniques,
is an adaptation from the book, Debugging
Embedded Microprocessor Systems by Stuart R Ball. Material reproduced courtesy of
Newnes, an imprint of Butterworth-Heinemann, 225 Wildwood Ave, Woburn, MA 01801-2041. For
more information, check www.bh.com. To order, call
1-800-366-2665.