How do you know if you have a hardware or a software problem?
Sounds like a simple enough question, but it can be a real problem for a systems engineer on a Friday afternoon at 4:00 pm when he is getting ready to set up for a weekend of automatic testing and, as a bonus, is on a critical path with a target on his back.
So which phone number do you dial? Whose Friday evening is going to be interrupted, the software engineer or the hardware guy?
Hardware engineers always design hardware to work properly; their proof is in how the parametric testing performs and how many margins they build into their design.
Software (firmware) never has bugs unless you can prove them beyond a shadow of a doubt, and the proof better be repeatable.
It’s the system engineer who creates all the problems trying to put the hardware and software together!
I recently ran into an interesting problem with a radar customer on the East Coast. We had just been onsite discussing some new RF products over a lunchtime presentation, and one of the attendees seemed quite distracted during the presentation. To my surprise, he had been listening intently while being constantly interrupted with messages streaming in on his smartphone. At the end of the meeting, he politely asked if we had the time to bring the equipment down to his dungeon.
He was the system engineer on a hopping radar product that was getting ready for some qualification testing. His test technician had been giving him constant updates on the test status of a product going into environmental testing over the weekend: Parametric tests - PASS; Functional test - FAIL.
When we got down to the dungeon, we saw a nicely equipped lab and a product with its covers off and a mixed-signal oscilloscope looking like it was keeping a product alive on life-support. Not far from the truth, as the technician looked as though he was ready to use a crowbar to persuade the DUT to behave.
So here’s the long and short of how I learned what that fifth input on the right-hand side of the oscilloscope, labeled “Ext Trig,” is for. You use it when a problem is telling you that it is happening. When problems self-identify, triggering on the event of interest lets the scope spend its acquisition memory only on the moments that matter. Sometimes it takes two instruments to gang up on the problem.
During parametric testing of this hopping radar, the customer's test process held the device in a non-hopping test mode and stepped through the 12 possible channels of operation, testing the parametric performance on each. Fair enough; the hardware works.
Functional test involved putting the radar into a self-test mode using a PN9 sequence on a specific channel plan, in this case hopping over eight of the 12 channels.
Using a real-time spectrum analyzer, it was easy to see all nine of the eight channels during functional test. Yes, all nine. On a truly random sequence you'd expect each channel in the plan to show a similar statistical density (PN9 - 512 steps across eight channels ~ 12.5% of the time each). And we did see similar statistics across the eight RF channels. But about 0.2% of the time, or roughly one step in 512, channel 0 appeared, even though it was not one of the eight.
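The occupancy figures above can be checked with a short simulation. This is only a sketch under stated assumptions: the article does not say how the radar maps the PN9 generator onto channels, so the hop plan `PLAN`, the use of the low three bits of the shift-register state, and the single out-of-plan hop per pass are all hypothetical choices made for illustration.

```python
from collections import Counter

PLAN = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical eight-channel hop plan
ROGUE = 0                        # channel that should never appear


def pn9_states(seed=0x1FF):
    """Yield the 511 non-zero states of a 9-bit LFSR (taps at stages 9 and 5)."""
    state = seed
    for _ in range(511):
        yield state
        feedback = ((state >> 8) ^ (state >> 4)) & 1
        state = ((state << 1) | feedback) & 0x1FF


def one_pass(buggy=True):
    """Channels visited in one pass; a buggy pass starts with the rogue channel."""
    hops = [ROGUE] if buggy else []
    # Assumption: the low three bits of the state select the plan entry.
    hops += [PLAN[state & 0b111] for state in pn9_states()]
    return hops


def occupancy(hops):
    """Fraction of hops spent on each channel."""
    counts = Counter(hops)
    return {ch: n / len(hops) for ch, n in sorted(counts.items())}


if __name__ == "__main__":
    for ch, frac in occupancy(one_pass()).items():
        print(f"channel {ch}: {frac:6.2%}")
```

Under these assumptions a pass is 512 hops long: each plan channel gets 63 or 64 of them (about 12.3% to 12.5%), and the rogue channel gets exactly one (1/512, about 0.2%), which matches the density the analyzer showed.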
Using a frequency domain trigger (frequency edge, frequency mask, or statistical density) to trigger the mixed-signal scope, we were able to trap the channel 0 value being sent to the shift register each time the PN9 sequence restarted. It was the repeatable bug he needed to secure the phone call. Apparently the new FW load had a minor bug.
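The same idea, letting the problem announce itself and keeping only the data around that moment, can be sketched in software. This is an analogy rather than a model of either instrument: `capture_on_trigger`, its parameters, and the sample hop list are hypothetical, with a small rolling buffer standing in for the scope's pre-trigger memory.

```python
from collections import deque


def capture_on_trigger(samples, is_event, pre=4, post=4):
    """Return one record per trigger: up to `pre` samples before the event,
    the event itself, and up to `post` samples after it."""
    history = deque(maxlen=pre)  # rolling pre-trigger memory
    captures = []
    current = None               # record being filled, if any
    remaining = 0                # post-trigger samples still wanted
    for sample in samples:
        if current is not None:
            # A capture is open: keep recording (no re-trigger while open).
            current.append(sample)
            remaining -= 1
        elif is_event(sample):
            current = [*history, sample]
            remaining = post
        if current is not None and remaining == 0:
            captures.append(current)
            current = None
        history.append(sample)
    if current is not None:      # stream ended mid-capture
        captures.append(current)
    return captures


if __name__ == "__main__":
    plan = {1, 2, 3, 4, 5, 6, 7, 8}
    hops = [3, 5, 1, 8, 2, 0, 4, 7, 6, 1, 3]
    print(capture_on_trigger(hops, lambda ch: ch not in plan, pre=2, post=2))
```

With the sample list above, the only record kept is the out-of-plan hop plus two samples on either side; everything else is discarded, which is the memory saving that an external trigger buys.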
How have you used test strategies to solve the fight between hardware and software engineers?
James K commented:
A firmware bug is a software bug.
Unfortunately, you would have to call the software guy first, since he could actually do something (make a change) to try to rectify the problem, or possibly point to the hardware. Unfortunately, you would end up having to call the hardware guy also... if only to back things up and blame it again on software.
In the end they both go home happy and chalk the problem up to the technician or test software not operating correctly.
On Monday everyone gets yelled at by the boss.
Glen Chenier commented:
Long ago we had a problem with a 1-second watchdog timer restarting a card. With a digital oscilloscope and a lot of perseverance and patience, we were able to show that the restarts happened before the 1-second mark, which exonerated the software.
The engineer (and I use the term loosely) who designed the watchdog used an oscillator and a counter chain. He calculated the "divide by" to create a cycle 1 second long. He overlooked the simple fact that the timeout actually occurred halfway through the count.
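The arithmetic behind that mistake can be sketched as follows. The comment gives no oscillator frequency or counter length, so the 32,768 Hz clock and 15-stage binary counter are assumptions chosen only because they divide down to a 1-second cycle; the sketch also assumes the watchdog fires when the final stage first goes high after a reset.

```python
def output_period_s(f_osc_hz, stages):
    """Full cycle time of the last stage of a binary counter chain."""
    return (2 ** stages) / f_osc_hz


def first_timeout_s(f_osc_hz, stages):
    """Time from reset until the last stage first goes high."""
    return (2 ** (stages - 1)) / f_osc_hz


def simulate_first_high(stages):
    """Count clock ticks from reset until the top bit of the counter is set."""
    msb = 1 << (stages - 1)
    count = 0
    ticks = 0
    while not (count & msb):
        count = (count + 1) % (1 << stages)
        ticks += 1
    return ticks


if __name__ == "__main__":
    f_osc, stages = 32_768, 15  # assumed values, not from the comment
    print("cycle length :", output_period_s(f_osc, stages), "s")
    print("first timeout:", first_timeout_s(f_osc, stages), "s")
    print("ticks to MSB :", simulate_first_high(stages))
```

With these assumed values the cycle is 1 second long, but the top bit first goes high after 16,384 ticks, or 0.5 seconds, so software that kicks the watchdog anywhere between 0.5 and 1 second after a reset would still be restarted.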
When in doubt, it could be either. Or both.
mee commented:
I just use the golden rule ... If in doubt, blame software
Perry commented:
Well what I read is that the error is apparently in the hardware, and the software was able to be changed to work around the hardware problem, although there was no "smoking gun" to prove it absolutely.
ron davison commented:
Usually the software guy gets called first because 99% of the time it is a software problem.
But because of this he gets to blame it first on the hardware guy, without the hardware guy being able to have any input, as he has not been called yet. This gets passed up the food chain, where management again gets positive feedback that the problems usually are in the hardware. Only on Sunday, when the project engineer is ready to have a breakdown, does the hardware guy get called. He explains why all the BS shoveled by the software team toward the hardware team cannot be the issue, and he/she helps reduce the scope of the haystack search. Management hears that hardware is involved in solving the problem and, via this feedback, again assumes the false perception that the issue is always hardware related. The software guys never get called on this because they are not there when the lynching is to commence; they are at home sleeping because they were up all night fixing or trying to fix the problem. PS: it's not really the software guys' fault, because management pushed an unrealistic schedule down everyone's throat again and everyone is just a broken drone doing their job now.
Ed Mengel commented:
Troubleshooting is a creative process. You need to mentally invent the machine that displays the attributes that you observe. The hard part is to put aside the knowledge of what it should do or that it does 99.99% of the time.
The key to solving that kind of a problem is to have the proper test capabilities that can exercise and verify the hardware. It sounds like you had that, but many times system requirements do not include it.
Wawaus had it right: you need both hardware and software people working together and sometimes, that happens to be the same person.
In his book "The Soul of a New Machine", Tracy Kidder included the development team's use of a logic analyzer to find a hardware problem that only happened once a day.
ASD-engineer commented:
When evaluating hardware, software, or firmware difficulties, there is always the dowsing channel to ask specific questions for needed answers.
wawaus commented:
I am a hardware engineer. When confronted with a problem which could be hardware or software my approach has been to sit down with the software engineer and devise a test regime which will definitively identify the source of the problem. Arguments solve nothing, co-operation is the name of the game!
Chuck commented:
Two words: Therac-25.
Charles K. Summers commented:
We sell software stacks to companies that use them as the core portion of equipment. This basically means, since we don't make the final equipment, we have to be able to debug assuming no special tools or equipment.
So, our primary tool is logging. We have lots of options -- out a UART, into a circular queue within RAM, out an SCP, etc. but it is all basically a set of codes or messages.
We take the data and determine just where it starts messing up and then add logs to keep narrowing the range until we know where it is.
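As a rough illustration of logging codes into a circular queue and then finding where the data stops, here is a minimal sketch. The `TraceLog` class, the stage names, and the frame IDs are all hypothetical; the comment does not describe the actual log format.

```python
from collections import deque


class TraceLog:
    """Fixed-size circular log of (stage, frame_id) codes held in RAM."""

    def __init__(self, capacity=256):
        # deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=capacity)

    def log(self, stage, frame_id):
        self._entries.append((stage, frame_id))

    def last_stage_seen(self, frame_id, stages):
        """Walk the stages in pipeline order and return the last one that
        logged this frame before the trail goes cold (None if never seen)."""
        seen = {stage for stage, fid in self._entries if fid == frame_id}
        last = None
        for stage in stages:
            if stage not in seen:
                break
            last = stage
        return last


if __name__ == "__main__":
    stages = ["line", "driver", "os", "stack"]
    trace = TraceLog()
    for stage in stages:
        trace.log(stage, 41)   # frame 41 makes it all the way up
    trace.log("line", 42)      # frame 42 is seen on the line only
    print(trace.last_stage_seen(41, stages))
    print(trace.last_stage_seen(42, stages))
```

In this toy run frame 41 is last seen at the top of the stack while frame 42 is last seen on the line, so the next round of logs would go between the line and the driver.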
In the case of hardware or software problems, we have to keep two conflicting cases in mind -- how should it behave if working correctly and what would the effect be if NOT working correctly.
Recently, we spent a lot of time with a client finding a problem. A line analyzer indicated that the data did arrive at the equipment (finding this out alone took a lot of time, since the problem only happened after 2 to 12 hours of constant data transfer). Log messages determined the point where the data no longer existed. We kept taking the logs down lower and lower (including OS sections) and found that the low-level driver for the hardware chip was not receiving the data to send to the software layers.
At this point, there were two possibilities -- an error in the LLD software or a problem with the hardware chip. We found the problem by ASSUMING that the chip had a specific problem (losing an interrupt for data frames arriving closely together) and coding a workaround that would allow for this. With the workaround, it worked -- without it we had the problem.
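The comment does not say what the workaround looked like. One common way to tolerate a lost interrupt is to have the handler drain everything that is waiting rather than take one frame per interrupt; the toy model below assumes that approach, and the `Chip` class, the `run` function, and the burst pattern are all hypothetical.

```python
from collections import deque


class Chip:
    """Toy receiver: frames queue in a FIFO, but the interrupt is a single
    latch, so a second arrival before servicing raises no new interrupt."""

    def __init__(self):
        self.fifo = deque()
        self.irq_pending = False

    def receive(self, frame):
        self.fifo.append(frame)
        self.irq_pending = True


def run(arrivals, drain):
    """Each burst in `arrivals` lands completely before the handler runs once.
    Returns the frames the driver handed up to the software layers."""
    chip = Chip()
    delivered = []
    for burst in arrivals:
        for frame in burst:
            chip.receive(frame)
        if chip.irq_pending:
            chip.irq_pending = False
            if drain:
                # Assumed workaround: empty the FIFO on every interrupt.
                while chip.fifo:
                    delivered.append(chip.fifo.popleft())
            else:
                # Original behavior: one frame per interrupt.
                delivered.append(chip.fifo.popleft())
    return delivered


if __name__ == "__main__":
    arrivals = [[1], [2, 3], [4]]  # frames 2 and 3 arrive closely together
    print("one per interrupt:", run(arrivals, drain=False))
    print("drain on interrupt:", run(arrivals, drain=True))
```

In this model the closely spaced pair produces a single interrupt, so the one-frame-per-interrupt handler falls behind and leaves frame 4 stranded in the chip, while the draining handler delivers all four. It mirrors the with/without comparison in the comment without claiming to be the fix that was actually shipped.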
The bottom line is that you have to keep narrowing the scope of the problem area and then apply negative logic (what incorrect thing can cause this incorrect result). After all, if it was all correct, there wouldn't have been a problem.
Allan commented:
He said it was a firmware bug in the last paragraph.
bitbanger55 commented:
So, was it hardware or software?















