The $1 million recall
An engineer’s reputation almost goes up in smoke after a power module comes back from the field heavily burned.
Samuel Kerem, engineer -- EDN, November 11, 2011
This story began in 2003 when a 1000W PM (power module), which just a day before had been part of a huge optical telecom network, came back from the field heavily burned. The network, which transmitted many Gbit/sec, was interrupted; the issue immediately hit the radar of the highest management. I delved into the world of telecom power modules, which “must” work forever. To achieve forever, the power distribution is made redundant: though one unit can do the job, two PMs power the same rack in case one fails and needs replacement, which is performed by so-called “hot-plugging” while the rack is under power.
Hot-plugging a PM into the 40-60V telecom bus is a tricky business. The current flow to the module is controlled by an onboard MOSFET; to deal with 1 kW, it must be fully on or fully off. Upon module insertion, the transition from off to on must be fast but not too fast; otherwise, the inrush current that charges the onboard capacitors may brown out the telecom bus. The same MOSFET doubles as a circuit breaker. If an onboard short is suspected, the reaction must be fast, but “nuisance” spikes, common in power environments, must be ignored.
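As a rough illustration of the trade-off described above, the current that charges the onboard capacitors follows I = C·dV/dt, so a slower turn-on ramp means a smaller inrush. The sketch below assumes a linear voltage ramp; the function name, the capacitance, and the ramp times are hypothetical values chosen for illustration, not figures from the actual module.

```python
def inrush_current(capacitance_f, bus_voltage_v, ramp_time_s):
    """Charging current (A) for a linear voltage ramp: I = C * dV/dt."""
    if ramp_time_s <= 0:
        raise ValueError("ramp_time_s must be positive")
    return capacitance_f * bus_voltage_v / ramp_time_s


# Hypothetical values: 1000 uF of onboard capacitance on a 48 V bus.
C_BULK_F = 1000e-6
V_BUS_V = 48.0

# Compare a very fast, a moderate, and a slow turn-on ramp.
for ramp_ms in (0.1, 1.0, 10.0):
    amps = inrush_current(C_BULK_F, V_BUS_V, ramp_ms / 1000.0)
    print(f"ramp {ramp_ms:5.1f} ms -> inrush of roughly {amps:6.1f} A")
```

Under these assumed numbers, stretching the ramp from 0.1 ms to 10 ms cuts the inrush by a factor of 100, which is why the timing components matter so much.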
Congratulations to Samuel Kerem, author of this entry and winner of EDN’s Tales From The Cube: Tell Us Your Tale contest, sponsored by Tektronix. Kerem will receive a Tektronix scope valued at approximately $5000.
The fix was to redesign the hot-swap timing that was set by a few resistors and capacitors. As mundane as the job of calculating their values may sound, it was a vital task. The demonstration of the PM shutting down and restarting in a controllable manner during various induced shorts vindicated the efforts.

The returned PM.

At this moment, the highest management entered the scene. More than 1000 PMs had been deployed. The fix would cost $10 for parts, and more than $1 million to recall, modify, and redeploy all the PMs. The verdict was to proceed with the fix. I appreciated the trust. The company was approaching the break-even point and each dollar mattered.
Three years later, an innocent e-mail hit me in the stomach: A PM had come back from the field with a possible telemetry failure. The telemetry had stopped weeks earlier, but the hub was operating flawlessly, so the service visit was delayed. The replaced unit looked innocent when it arrived, but smoked immediately in a test rack. Though the module revision wasn’t immediately known, my gut knew this unit was modified. Backed by the $1 million wager, my reputation was “prohibited” from smoking.
I realized that many people would soon learn the same, and unless a miracle happened, unpleasant calculations would follow. As I could not recall my last encounter with any miracle, I ran into the lab to see the PM. During the dash, I wondered whether it would be appropriate to compensate my company for the wasted $1 million. I did hope for leniency, but even 90% forgiveness didn’t feel lovely. The thought of monetary loss sharpened my senses. Upon arrival, I focused my attention on the laboratory power supply connected to the test rack with the visibly smoked module. Now, guided by brain rather than gut, I checked the settings of this laboratory supply. Eureka! The current limit on the supply was set to 18A; reaching this level would turn the supply into a current source.
In the field for weeks, thanks to redundancy, one PM kept the rack operational, while the shorted PM, with access to unlimited power, reacted happily to the 30A inrush, keeping the MOSFET alive by kicking on and immediately off every few seconds. The overprotective laboratory settings killed the MOSFET in 20ms. When I returned to my office, I had proof the fix had prevented disaster. I still wonder whether a penny capacitor was the culprit.
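One plausible reading of the lab failure, sketched below, is that the breaker trips only when the current it sees exceeds some threshold, and a current-limited bench supply never lets the current rise that far. This is an interpretation, not the module's documented behavior: the function name and the 25A trip threshold are hypothetical, and only the 18A supply limit and the 30A inrush come from the story.

```python
def breaker_trips(fault_demand_a, supply_limit_a, trip_threshold_a):
    """Return True if the current actually delivered exceeds the trip threshold.

    supply_limit_a=None models a source with no practical current limit.
    """
    if supply_limit_a is None:
        delivered_a = fault_demand_a
    else:
        delivered_a = min(fault_demand_a, supply_limit_a)
    return delivered_a > trip_threshold_a


TRIP_THRESHOLD_A = 25.0  # hypothetical value; not stated in the article

# Telecom bus: effectively unlimited, so the 30 A surge reaches the threshold.
print("telecom bus trips:", breaker_trips(30.0, None, TRIP_THRESHOLD_A))

# Bench supply limited to 18 A: the threshold is never reached.
print("bench supply trips:", breaker_trips(30.0, 18.0, TRIP_THRESHOLD_A))
```

Under this assumption, the breaker in the lab would never have tripped, leaving the MOSFET conducting into the fault instead of switching off.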
Samuel Kerem is an experienced designer of medical, scientific, and telecommunication equipment.
Talkback
-
I had a similar experience with an assembly of paralleled power transistors that suffered a cascading failure. The cause may surprise you. This misadventure was described in Design News magazine. Please look in the September 22, 2008 Design News blogs for "The Case of the Wily Wires," or try designnews.com/author.asp?section_id=1386&doc_id=227974
Myron Boyajian
Myron Boyajian, PE - 2012-27-1 11:18:39 PST -
Back in the good old days of relays, high current relays had an "arc shoot" which was a magnet used to blow out the arc. I don't know if this would be applicable to this problem, but it is worth looking at.
Bruce Baker - 2012-27-1 10:23:15 PST -
This is a lovely article; I especially liked the fact that he had to "do some basic mundane calculations" heheh.
Dennis McNeill - 2012-20-1 07:23:03 PST -
Ex-Crayon's comment reminds me of an installation my Motors professor told me about when I was in EE school about 40 years ago that he had worked on.
They were transmitting high-current power for an aluminum smelting plant, using aluminum bus bars, something like 2" x 12", interleaved plus, minus, plus, minus, etc. They had special break-away brackets supporting the bus bars, so if they had a short, rather than the line destroying itself, it would just blow out the brackets, and they could reassemble it after they got the short cleared.
Lynn Grant - 2012-19-1 18:59:05 PST -
I had a failure similar to your second one, where my short circuit protection was set to trigger at large short circuit currents (>50A) from a car battery. A 200A, 250ms spike was easy for the battery to produce and reliably triggered my protection circuit. During product development I sent my few precious first samples to the test lab, which proceeded to test the short circuit protection feature using a bench supply rated at 10A. All my samples were destroyed. But they wrote me a nice report about all the smoke.
Joe Whitaker - 2012-19-1 15:03:10 PST























