It's the firmware, of course
Have you ever had to troubleshoot a design you didn't know much about? Anyone who is working around complex systems today will likely be faced with such a task, especially if that person has a reputation as the troubleshooter. I've sometimes made the mistake of cultivating such a reputation.
A product landed in my lap one day, and it was a tiny bit flaky. Every so often on power-up – maybe 5% of the time – it would just sit there, stare back at me, and refuse to do anything. This was pricey consumer gear, and our customers would not be amused.
Fortunately, I was already familiar with the general design of the box, an AV media server. I laid out the guts on my lab bench and started playing. I discovered that, when the box appeared to be dead in the water, it was actually mostly functional. Only the front panel was hung up. This box had a small PC motherboard in it, and the motherboard connected to the front panel via an internal USB connection. The USB wasn't called on to do much, though – just handle some lights and buttons.
I can't remember the exact chain of events, but I somehow quickly decided that the problem lay with this USB connection. The trouble was that I knew next to nothing about USB beyond the basics: two data wires, two power wires, bit rates, and something about pullup resistors. After hitting the Web for 20 minutes, I felt suitably prepared to attack the problem – with the aid of my scope, of course.
I clipped probes onto the two data lines and started hitting that power button. I could see bursts of activity on the USB link – always the same – unless the front panel decided to hang up. When that happened, I would see initialization data coming in vain from the PC. The microcontroller on the front panel would not respond.
Continuing to play with this setup, I found that the failures only happened within a certain band of off-times. If the power was turned off for a short period and then back on, the system would come back up smiling. Ditto if it had been off for a long period. Very interesting.
After verifying there was nothing obviously wrong with the hardware, I let my mind wander a bit. What about the microcontroller firmware? I thought back to my early computer days, and how processors and memory would remember data for a while, even after their supply voltage had dropped to zero. Or how memories would generally power up with a default data pattern – not 100% reliably of course, but not randomly either.
I thought, "What if there's a wee bug in the firmware? Could a certain memory startup pattern cause the failures we're seeing?" The more I played, the more I was convinced that the misbehaviour was due to some sort of microcontroller initialization problem.
I took my theory to the engineer who had written the front-panel firmware. He looked unimpressed.
"Let's go look at your code. I want to see what it's doing," I said. He looked unconvinced.
I finally dragged him back to his desk and we started scanning the code. It was in assembler, which I tend to enjoy, but it wasn't an instruction set I was overly familiar with. He started mentally stepping through the reset code, and within 30 seconds, he looked at me and pointed to the problem. Some memory that needed initializing was being skipped, and some addresses that didn't need it were being initialized instead.
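The real code was assembler and isn't reproduced here. Purely as an illustration of this class of bug, here is a small C simulation. Everything in it is hypothetical: the RAM size, the address of the USB state byte, and the function names are made up for the sketch and are not taken from the actual firmware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_SIZE       16u  /* hypothetical: tiny RAM for the simulation */
#define USB_STATE_ADDR 12u  /* hypothetical: where the firmware keeps its USB state */
#define USB_STATE_IDLE 0u   /* hypothetical: the only state that answers the host */

static uint8_t ram[RAM_SIZE];  /* stands in for the microcontroller's RAM */

/* Model a power cycle: every RAM cell comes up holding `residue`. */
static void power_up_with_residue(uint8_t residue)
{
    memset(ram, residue, RAM_SIZE);
}

/* Buggy reset: clears only the first half of RAM, so the byte at
 * USB_STATE_ADDR keeps whatever value it powered up with. */
static void reset_buggy(void)
{
    memset(ram, 0, RAM_SIZE / 2);
}

/* Fixed reset: clears every cell before any code reads it. */
static void reset_fixed(void)
{
    memset(ram, 0, RAM_SIZE);
}

/* Returns 1 if the simulated front panel would answer the host after
 * the given reset routine runs, 0 if it would hang. */
static int panel_responds(uint8_t residue, void (*reset)(void))
{
    assert(USB_STATE_ADDR < RAM_SIZE);
    power_up_with_residue(residue);
    reset();
    return ram[USB_STATE_ADDR] == USB_STATE_IDLE;
}
```

With the buggy reset, panel_responds(0x00, reset_buggy) returns 1 but panel_responds(0xA5, reset_buggy) returns 0: whether the panel comes up depends entirely on what the RAM happened to hold at power-up. With reset_fixed, both calls return 1. That is the kind of behaviour that could plausibly produce failures only within a certain band of off-times.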
The memory of my victory over the miscreant firmware has become fuzzy with age. I mostly just remember my triumph of reasoning. But I was very humble, of course. It was just a guess, after all. OK, maybe I wasn't so humble.