Solving a metastable mystery
At first, the prototypes worked fine, but about a quarter of them would intermittently start executing code from random locations. We quickly traced the problem to my board, but even the worst prototype failed only once an hour. Luckily, firmware development was scheduled to last more than six months before the product went into production, which kept me off the critical path while I tried to track down this intermittent bug. I spent almost all of that time trying everything I could think of, with no success.
After spending six months fighting with this bug, I took most of a day off to attend an Intel seminar about the company's new single-chip DSP. It was completely unrelated to my project, but it helped me get out of the rut I’d dug looking for the bug.
When I got back to work, I had a wild idea for a tricky way to use a four-channel analog-storage oscilloscope. I added all the channels together and set each channel's vertical scale to a different factor (10V/div, 5V/div, 2V/div, and 1V/div). The summed trace could then take any of 16 possible levels, depending on the 5V TTL logic levels applied to the four scope probes. I triggered the scope on the start of any Z80 bus cycle and waited for the bug to occur.
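The 16-level claim is easy to check numerically. The following is a minimal Python sketch, not anything used on the bench at the time. It assumes ideal 0V and 5V logic levels (real TTL highs sit lower), and the names `SCALES_V_PER_DIV`, `summed_deflection`, and `level_table` are made up for illustration.

```python
from itertools import product

# Vertical scale of each channel in volts per division, as listed in the text.
SCALES_V_PER_DIV = (10.0, 5.0, 2.0, 1.0)

# Assumption: an ideal logic high of exactly 5V and a logic low of 0V.
LOGIC_HIGH_V = 5.0


def summed_deflection(bits):
    """Return the summed trace deflection, in divisions, for one 4-bit state."""
    return sum(
        LOGIC_HIGH_V / scale
        for bit, scale in zip(bits, SCALES_V_PER_DIV)
        if bit
    )


def level_table():
    """Map each of the 16 possible logic states to its summed deflection."""
    return {bits: summed_deflection(bits) for bits in product((0, 1), repeat=4)}


if __name__ == "__main__":
    table = level_table()
    # List the states from lowest to highest position on the screen.
    for bits, divisions in sorted(table.items(), key=lambda item: item[1]):
        print(bits, f"{divisions:.1f} div")
    print("distinct levels:", len(set(table.values())))
```

Under those assumptions, the four channels contribute 0.5, 1, 2.5, and 5 divisions respectively, and no two combinations add up to the same height, so each level of the summed trace identifies exactly one state of the four signals.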
When the bug showed up, the resulting scope display was a gridwork of traces. All of the traces seemed to represent valid states for the signals I was monitoring. Unexpectedly, I found two transitions within a single clock period. Closer examination showed that a flip-flop output changed with the clock edge but immediately returned to its original state.
I had never heard of a metastable flip-flop before. It was a new concept to most of the other engineers in the department as well. I quickly discovered that the D input to a flip-flop in the DRAM address decoder was changing during the clock edge. The resulting narrow pulse upset the DRAM and corrupted the next memory access. I made a simple change to my circuit and the intermittent went away.
Or so I thought. With the intermittent gone, we went into production. Unfortunately, the intermittent returned with the first batch of production units, but only when the production mask ROMs were installed. We had used UV-erasable EPROMs for development, and they worked fine in the production units. It turned out that the “pseudo-static” mask ROMs contained some of the same circuitry found in DRAMs, and that circuitry got confused by narrow pulses on the chip select.
When I fixed the metastable problem in the DRAM address decoder, I didn’t look for the same vulnerability in the ROM address decoder. Once I made the same change to the ROM address decoder, we had a stable and reliable product.
Sigurd Peterson started as a circuit design engineer for Tektronix in 1976. In the 1990s, he worked for Intel on the Pentium Pro and also worked for Tektronix/Xerox before becoming an independent system design consultant in 2002. He received a bachelor of science degree in engineering physics at Oregon State University in 1976 and a master of science degree in circuit design from the Oregon Graduate Institute in 2000.