March 13, 1998

THE ANALOG ANGLE: Troubleshooting the consultant's way

Ron Mancini

Action plans are critical in troubleshooting programs.

Recently, a company hired me to fix its disk-drive problem. The company's halls were piled from floor to ceiling with disk drives. Clearly, these guys were in deep trouble. I knew I could count on cooperation because half the problem--getting management's attention--was solved.

Action plans are critical in troubleshooting programs, and I had already formulated one based on initial phone conversations with people from the company and on my own experience. My action plan comprised a responsible person, expected and actual results, and expected and actual completion dates.

The company president believed that the problem was in the data separator, but my plan included the complete drive. He took exception to my plan because he wanted a quick solution. I told him about a guy who insisted on fishing in his bathtub because it was convenient and comfortable. If the action plan focused on only the data separator, we might have been doing the equivalent of fishing in the tub, so we had to focus on the whole disk drive.

Still, Mr President wasn't convinced, so in my attempt to prevent another manager from joining the team and to appease Mr President, I pretended to confer with another consultant. (In reality, I stuck to my guns--I simply had to throw Mr President a bone to keep things moving.) I told him my fictitious consultant suggested that in addition to sticking with my action plan, we speed troubleshooting by making the problem repeatable.

The problem was intermittent, producing a read error in 10^6 tries rather than 10^8 tries. The read error occurred randomly, so we hooked up high-speed recording equipment to freeze 50 instructions before a read error occurred. Finally, the observations made sense, and we got a programmer to write code forcing the machine to loop through the error.
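For readers who want to try the same trick in software, the "freeze the 50 instructions before the error" idea amounts to a pre-trigger buffer: keep only the most recent events and stop recording the moment the error appears. The sketch below is an illustration only, not the recording equipment used on the drive; the class name `PreTriggerRecorder`, its `record` method, and the integer "events" are all hypothetical.

```python
from collections import deque


class PreTriggerRecorder:
    """Hypothetical pre-trigger capture: keeps the last `depth` events seen
    before an error and freezes them when the error arrives."""

    def __init__(self, depth=50):
        # A bounded deque silently discards the oldest entry once it is full.
        self._history = deque(maxlen=depth)
        self.frozen = None   # snapshot of the events leading up to the error
        self.trigger = None  # the event that was flagged as the error

    def record(self, event, is_error=False):
        if self.frozen is not None:
            return  # already triggered; ignore everything afterwards
        if is_error:
            self.frozen = list(self._history)
            self.trigger = event
        else:
            self._history.append(event)


# Usage: feed 200 numbered events and flag number 120 as the read error.
recorder = PreTriggerRecorder(depth=50)
for n in range(200):
    recorder.record(n, is_error=(n == 120))

print(len(recorder.frozen), recorder.frozen[0], recorder.frozen[-1])
```

With a rare, random failure, a snapshot like this is usually what turns an unrepeatable problem into something you can deliberately loop through.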
With a repeatable error, we scoped the problem, which was a current spike generated by the interface-bus dump when the bus dump occurred in the middle of a read cycle. The bus dump caused a single-ended voltage transient on ground, which the differential read amplifier couldn't reject. Better grounding would have prevented the problem, but the company needed a quick and cheap fix. A NAND gate solved the problem by preventing a bus dump during the read cycle.

Next, testing started. Upon completing the initial tests, Mr President wanted to start shipping. I convinced him to retrofit the drives on hand but to hold shipments until completing testing. Further testing revealed that the system slowed because of the delayed bus transfers, but the slowdown was acceptable for most customers, so shipping to these customers started immediately. The final fix gained back system speed by improving grounds rather than by adding a NAND gate.

To sum up, the following is a troubleshooting procedure that works:
Using this guide, I received a minor hero's sendoff. Now, you can be a hero, too. In my next column, I deal with troubleshooting at your own company, rather than acting as a consultant for another one. This type of troubleshooting is much harder!
Copyright © 1997 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Publishing Company, a unit of Reed Elsevier Inc.