

There comes a time in any project when your new design is finally assembled, awaiting your special expertise to "make it work." Sometimes, it seems like the design end of this business is the easy part; troubleshooting prototype hardware can make even the toughest engineer a Maalox addict.
You can't fix any embedded system without the right world view--suspicion tempered with trust in the laws of physics, curiosity dulled only by the determination to stay focused on a single problem, and a zealot's regard for the scientific method.
Perhaps these are the characteristics of all who successfully pursue the truth. In a world that surrounds us with complexity, we deal daily with equipment and systems we only half understand. So, it seems wise to pursue understanding through an iterative loop of focus, hypothesis, and experiment.
Too many engineers fall in love with their creations, only to be continually blindsided by the design's faults. They are quick to assume, overtly or subconsciously, that the problem is due to the software, the lousy chips, or the power company, when simple experience teaches us that any new design is rife with bugs.
Assume it's broken. Never figure anything is working right until proven by repeated experiment; even then, continue to view with suspicion the "fact" that it seems to work. Bugs are not bad; they're merely a test of your troubleshooting ability.
Armed with a healthy skeptical attitude, the basic philosophy of debugging any system is to complete the following steps:
for (i = 0; i < # findable bugs; i++) {
    while (bug(i)) {
        Observe the behavior to find the apparent bug;
        Observe collateral behavior to gain as much information as possible about the bug;
        Round up the usual suspects;
        Generate a hypothesis;
        Generate an experiment to test the hypothesis;
        Fix the bug;
    }
}
Now, you're ready to start troubleshooting, right? Wrong! Stop a minute and make sure you have good access to the system. No matter how minor the problem seems to be, troubleshooting is like a bog where we all get trapped for far too long. Take a minute to ease your access to the system.
Do you have extender cards if they're needed to scope any point on the boards? How about special long cables to reach the boards once they are extended?
If there's no convenient point on which to reliably clip the scope's ground lead, solder a resistor lead onto the board, so that you're not fumbling with leads that keep popping off.
Some systems have signals that regulate major operating modes. Solder a resistor lead on these points as well, because you'll need to scope them at some point. This small investment in time up front will pay off in spades later.
Let's cover each step of the troubleshooting sequence in detail:
Step 1: Observe the behavior to find the apparent bug. In other words, determine the bug's symptoms. Remember that many problems are subtle and exhibit themselves via a confusing set of symptoms. The fact that the first digit of the LCD fails to display may not be a useful symptom -- but the fact that none of the digits work may mean a lot.
Step 2: Observe collateral behavior to gain as much information as possible about the bug. For example, does the LCD's problem correlate to a relay's clicking in? Avoid studying a bug in isolation, but beware of trying to fix too many bugs at once. Address problems such as unreliable ROM accesses and a too-dim front-panel display one at a time. No one is smart enough to deal with multiple bugs all at once, unless they are all manifestations of something more fundamental.
Step 3: Round up the usual suspects. Many computer problems stem from the same few sources. Clocks must be stable and meet specific timing and electrical specs, or all bets are off. Reset, too, often has unusual timing parameters. When things are just "weird," take a minute to scope all critical inputs to the µP, such as clock, hold, ready, and reset.
Always remember to check Vcc. Time and time again at Softaid, we see systems that don't run right because the 5V supply is putting out only 4.5 or 5.6V, or 5V with lots of ripple. The systems come in after their designers have spent weeks sweating over some obscure problem that, in fact, never existed, but was simply the specter of the more profound power-supply issue.
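The Vcc check reduces to simple arithmetic. The C sketch below is only an illustration: the function name supply_in_spec, the 5% tolerance, and the 100 mV ripple limit used in the note that follows are assumptions for the example, not figures from any particular data sheet. Substitute the limits your parts actually specify.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a measured supply is within
 * tol_pct percent of nominal and its peak-to-peak ripple does not exceed
 * max_ripple_v. All limits are supplied by the caller. */
bool supply_in_spec(double measured_v, double ripple_pp_v,
                    double nominal_v, double tol_pct, double max_ripple_v)
{
    assert(nominal_v > 0.0 && tol_pct >= 0.0 && max_ripple_v >= 0.0);

    /* Largest allowed deviation from nominal, in volts. */
    double allowed_dev = nominal_v * tol_pct / 100.0;

    /* Absolute deviation of the measurement from nominal. */
    double dev = measured_v - nominal_v;
    if (dev < 0.0) {
        dev = -dev;
    }

    return dev <= allowed_dev && ripple_pp_v <= max_ripple_v;
}
```

With those assumed limits, supply_in_spec(4.5, 0.0, 5.0, 5.0, 0.1) and supply_in_spec(5.6, 0.0, 5.0, 5.0, 0.1) both return false, as does a clean 5V level carrying 0.5V of ripple.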
Step 4: Generate a hypothesis. Don't be like "shotgunners," those poor fools who address problems by simply changing things--ICs, designs, PLD equations--without having a rationale for the changes. Shotgunning is for amateurs. It has no place in a professional engineering lab.
Before changing things, formulate a hypothesis about the cause of the bug. You probably don't have the information to do this without gathering more data. Use a scope, an emulator, or a logic analyzer to see exactly what is going on; compare that with what you think should happen. Generate a theory about the cause of the bug from the difference between the two.
Sometimes, you have no clue as to what the problem is. Scoping the logical places might not generate much information. Or, a grand failure, such as an inability to boot, occurs that is so systemic that it's hard to tell where to start looking. When desperation sets in, it's worthwhile to scope around the board practically at random. You might find a floating line, an unconnected ground pin, or something unexpected. Scope around, but always be on the prowl for a working hypothesis.
Step 5: Generate an experiment to test the hypothesis. Most of the time, you can resolve this step when gathering data to come up with the theory in the first place. For example, if an emulator reads all ones from a programmed ROM, a reasonable hypothesis is that CS or OE isn't toggling. Scoping the pins proves this one way or the other, but requires you to formulate another hypothesis and experiment to figure out why the selects are not where you expect to see them.
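Spotting the all-ones symptom in a dump is easy to automate. Here is a minimal C sketch, assuming the emulator's read-back has already been copied into a byte buffer. The function name and the buffer are hypothetical, and the check says nothing about why the selects are missing; it only confirms the symptom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: true when every byte read back from the ROM is 0xFF,
 * the "all ones" pattern described in the text. */
bool rom_reads_all_ones(const uint8_t *data, size_t len)
{
    assert(data != NULL && len > 0);

    for (size_t i = 0; i < len; i++) {
        if (data[i] != 0xFF) {
            return false;   /* at least one byte carries real data */
        }
    }
    return true;
}
```

A true result on a ROM known to be programmed supports the hypothesis that CS or OE isn't toggling; the scope still has to confirm it.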
Sometimes, though, you should apply the hypothesis-experiment model more formally. When we first started to use Intel's XL version of the 186 (supposedly compatible with the older series), none of our systems worked. Scoping around showed the processor to be stuck in a weird tristate, although all of its inputs seemed reasonable. One hypothesis was that the 186XL was not properly coming out of reset, a hard thing to capture because reset is basically a nonscopable, one-time event. We finally built a system to reset the processor repeatedly, giving us something to scope. The experiment proved the hypothesis, and a fix was easy to design.
An alternative would have been to glue in a new reset circuit at the beginning to see if the problem would go away. Problems that mysteriously go away tend to mysteriously come back; unless you can prove that the change really fixed the problem, there may still be a time bomb lurking in the system.
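The repeated-reset experiment can be approximated in many ways, and the account above does not say how the original rig was built. As one hypothetical sketch in C, a helper routine could toggle the target's reset line at a steady rate so that the scope has a repetitive event to trigger on. The callbacks set_reset and delay_us, and the simulated versions that stand in for real hardware here, are all assumptions made for illustration.

```c
#include <assert.h>

/* Callbacks supplied by the caller; on real hardware they would drive a
 * port pin and wait. Both are assumptions, not part of any real API. */
typedef void (*set_reset_fn)(int asserted);
typedef void (*delay_us_fn)(unsigned long microseconds);

/* Assert reset for assert_us, release it for run_us, and repeat `cycles`
 * times so that the reset sequence becomes a repetitive, scopable event. */
void pulse_reset_repeatedly(set_reset_fn set_reset, delay_us_fn delay_us,
                            unsigned long assert_us, unsigned long run_us,
                            unsigned cycles)
{
    assert(set_reset != 0 && delay_us != 0);

    for (unsigned i = 0; i < cycles; i++) {
        set_reset(1);          /* hold the processor in reset */
        delay_us(assert_us);
        set_reset(0);          /* release it and let it try to start */
        delay_us(run_us);
    }
}

/* Simulated stand-ins that only record what was requested. */
static unsigned sim_release_count = 0;
static unsigned long sim_elapsed_us = 0;
static int sim_reset_state = 0;

static void sim_set_reset(int asserted)
{
    if (sim_reset_state && !asserted) {
        sim_release_count++;   /* count each release from reset */
    }
    sim_reset_state = asserted;
}

static void sim_delay_us(unsigned long microseconds)
{
    sim_elapsed_us += microseconds;
}
```

Calling pulse_reset_repeatedly(sim_set_reset, sim_delay_us, 100, 900, 5) runs five 1 ms cycles against the simulated pin. The timing values are arbitrary placeholders; real ones would come from the processor's reset specification.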
Occasionally, a bug is too complicated to yield to such casual troubleshooting. If you have to adjust the timing of a PLD, visualize or draw the new timing before wildly making changes. Will it work? It's much faster to think out the change than to implement it and perhaps troubleshoot it over again.
Rapid troubleshooting is as important as accurate troubleshooting. Decide what your experiment will be and then stop and think it through again. What will this test prove? I like experiments with binary results: The signal is there, or it isn't; the signal meets specified timing, or it doesn't. Either result gives me a direction in which to proceed. Binary results have another benefit: They sometimes let you skip the experiment altogether. Always think through the actions you'll take after the experiment is complete, because sometimes you find yourself taking the same path, regardless of the result, making the experiment superfluous.
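The last point, that an experiment is superfluous when both outcomes lead to the same next step, can be written down as a tiny decision rule. This C sketch is illustrative only; the enum values and the function name are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical next steps a troubleshooter might plan in advance. */
enum next_action {
    ACTION_NONE,
    CHECK_CHIP_SELECT_LOGIC,
    CHECK_ROM_CONTENTS,
    REWORK_RESET_CIRCUIT
};

/* A binary experiment earns its setup time only if its two possible
 * results would send you down different paths. */
bool experiment_worth_running(enum next_action if_present,
                              enum next_action if_absent)
{
    /* Both outcomes must have a planned follow-up before deciding. */
    assert(if_present != ACTION_NONE && if_absent != ACTION_NONE);
    return if_present != if_absent;
}
```

If both arguments are REWORK_RESET_CIRCUIT, the function returns false: skip the measurement and go straight to the rework.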
If the experiment is a nuisance to set up, is there a simpler approach? Hooking up 50 logic-analyzer probes is rather painful if you can get the same information in an easier way. I'd hate to be in a lab without a logic analyzer because it's useful for so many things, but I try to treat it as the tool of last resort, because you can often construct an easier experiment in a fraction of the time it takes to connect the analyzer.
Don't be so enamored of your new hypothesis that you miss data that might disprove it. The purpose of a hypothesis is simply to crystallize your thinking: If it's right, you'll know the step to take next. If it's wrong, collect more data to formulate yet another theory.
Step 6: Fix the bug. There's more than one way to fix a problem. Hanging a capacitor on a PLD output to skew it a few nanoseconds is one way; another is to adjust the design to avoid the race condition entirely.
Sometimes, a quick and dirty fix might be worthwhile to avoid getting hung up on one little point. Just be sure to revisit the kludge later and re-engineer it properly. Electronics have an unfortunate tendency to work in the engineering lab and not go wrong until the 5000th unit is built. If a fix feels bad, or if you have to furtively look over your shoulder and glue it in when no one is looking, then it's bad.
Finally, never fix the bug and assume it's OK because the symptom has disappeared. Apply a little common sense and scope the signals to make sure you haven't fixed the problem by creating a new one.
At 3:00 am, when the problems seem intractable and you're ready to give up engineering, remember that the system is only a computer. Never panic: You are smarter than it is.
Visit Jack Ganssle's Web page at http://www.softaid.net/emulators.html.