If ever EDA needed a ($700M) proof point of its value...
As I reported yesterday, Intel announced that a “design error” in a SATA I/O support chip for the Sandy Bridge processor would cause them to respin the design… at a cost of $700M! From the information that Intel provided, it was apparent to me that the problem was most likely a voltage domain error, i.e. a low voltage device got accidentally hooked up to a higher voltage supply than it was spec’ed for.
A report on the internet today, if it is credible, confirmed my speculation:
quoting Intel’s Steve Smith (VP and Director of Intel Client PC Operations and Enabling) : The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports.
Bingo! Exactly as I suspected. Intel’s comments yesterday:
- Problem was “statistical”
- Performance degrades over time.
- The “error” can be fixed by an upper-layer metal mask patch.
In a former life, I was a product marketing manager for two tools that were designed specifically to find problems like this. Mismatched voltage domains are, unfortunately, one of the most common causes of respins in the books. And, sadly, so easily prevented!
There are simple static ERC (electrical rule checking) tools that can test every transistor instance to find any devices with low voltage models that are hooked up to supply rails that exceed their rating. List price of these tools?? About 0.01% of what this will cost Intel. (Add a couple more decimal places with Intel’s discount).
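To make that concrete, here is a minimal Python sketch of the kind of check I mean. It is not how any commercial ERC tool is implemented: the `Device` record, the model names, the voltage ratings and the net names are all made up for illustration, and it assumes every supply and bias net has already been annotated with its voltage.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str   # instance name in the netlist
    model: str  # transistor model, which determines its voltage rating
    pins: dict  # terminal -> net name, e.g. {"g": "vbias", "s": "vss"}


# Hypothetical maximum voltage allowed across any two terminals, in volts.
MAX_VOLTAGE = {"nmos_thin": 1.0, "pmos_thin": 1.0, "nmos_thick": 3.3}


def check_overvoltage(devices, net_voltage, max_voltage=MAX_VOLTAGE):
    """Return (device name, worst-case voltage, rating) for each violation."""
    violations = []
    for dev in devices:
        limit = max_voltage[dev.model]
        # Only nets with an annotated voltage can be checked.
        volts = [net_voltage[net] for net in dev.pins.values() if net in net_voltage]
        if len(volts) < 2:
            continue
        spread = max(volts) - min(volts)
        if spread > limit:
            violations.append((dev.name, spread, limit))
    return violations


# Made-up example: M1 is a thin-oxide device biased from a 1.8 V net.
net_voltage = {"vss": 0.0, "vdd_core": 1.0, "vdd_io": 3.3, "vbias_hi": 1.8}
devices = [
    Device("M1", "nmos_thin", {"g": "vbias_hi", "d": "pll_out", "s": "vss", "b": "vss"}),
    Device("M2", "nmos_thick", {"g": "vbias_hi", "d": "vdd_io", "s": "vss", "b": "vss"}),
    Device("M3", "nmos_thin", {"g": "vdd_core", "d": "pll_out", "s": "vss", "b": "vss"}),
]

for name, spread, limit in check_overvoltage(devices, net_voltage):
    print(f"{name}: sees up to {spread} V, rated for {limit} V")
```

Nets with no annotated voltage are simply skipped in this sketch, so the check is only as good as the voltage annotations fed into it.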
The problem manifested itself as degradation over time, and it was “statistical”, i.e. not deterministic. This sounds like NBTI (negative bias temperature instability), in which PMOS devices degrade randomly. It could also be HCI (hot carrier injection), in which NMOS devices experience charge trapping that alters their threshold voltage over time.
To test this effect EDA vendors have added features to circuit simulators that can reproduce “aging”. Work on these techniques began more than twenty years ago.
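As a rough illustration of what an aging model looks like, the sketch below assumes the threshold voltage shift follows a simple power law in stress time, a form often used as a first-order description of NBTI. The function names and the coefficients `a` and `n` are placeholders I picked for the example, not values for any real process; production aging simulators rely on foundry-calibrated models.

```python
def vth_shift(t_seconds, a=0.005, n=0.2):
    """Threshold voltage shift in volts after t_seconds of constant stress.

    Assumes delta_Vth = a * t**n; a and n are illustrative placeholders.
    """
    return a * t_seconds ** n


def time_to_limit(limit_volts, a=0.005, n=0.2):
    """Stress time in seconds at which the shift reaches limit_volts."""
    return (limit_volts / a) ** (1.0 / n)


# With these placeholder coefficients the shift keeps growing, but ever more slowly.
for t in (1e3, 1e5, 1e7):
    print(f"t = {t:.0e} s -> shift = {vth_shift(t) * 1000:.1f} mV")
```

The point of such a model is that a device can pass every test on day one and still drift out of spec months later, which matches the “degrades over time” behavior described above.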
This design error may not have been human error at all. It could have been an error produced by an auto-router that hooked up the bad transistor. In any case, it’s not a “design error”, it’s a verification methodology fault.
I bet that gets fixed real quick too!
Genevieve commented:
A million thanks for posting this information.
Nitin Deo commented:
Just read this article and Mentor’s blog about the new ERC tool. Is this too much of a coincidence or am I going too wild with my imagination?
I wonder if Intel uses internal ERC or commercial ERC tool?
Matt Hogan commented:
I would agree with PD Dude that traditional ERC methods don’t work well for the type of problem described in this article. To catch this type of error you need to be able to identify classes of circuits, (thin oxide PMOS, for example), appropriately identify the voltages for the pins on that device, and compare them against the specific rules for that device and voltage domain. We have been working with customers on multi-power domain ERC requirements such as these using circuit and layout information to develop more sophisticated reliability checks. I’ve posted more information about ERC checking on my blog at blogs.mentor.com/matthew_hogan/
charly commented:
I agree to sit in a plane, knowing the human pilots are reading the newspaper while the autopilot brings us close to the target airport. I don’t agree with a fully automated landing.
What about a design review checklist at the end of the project? For any block interface issue, this helps a lot!
Esko Mikkola commented:
Ridgetop Group’s Sentinel Silicon test structures are currently used to generate accurate aging simulation models for 32 - 65 nm CMOS.
Jeremy commented:
When you rely too much on technology, you lose the need to fully understand what you do. The machine will tell me if I’m wrong… It’s true in a multimillion-dollar company, and it’s also true in everyday life: with GPS and cell phones, people do not plan their trips anymore. So they end up lost more often than before…
EDA Engineer commented:
Appropriate use of EDA could have found & corrected this bug, sure. But Intel, along with most of the other big electronic companies, spends more time & effort *not buying* EDA tools. I’d almost bet that Intel’s stationery expenses dwarf its EDA spend.
chipwiz commented:
Garbage in garbage out! There will be no verification flow that will catch all in any circumstance. Contrary to your ideology of a need for better verification, I would suggest designers do a better job at weeding the bugs first and not overly rely on some tool down the stream that will catch all for them. Individual designers must take the responsibility that they do not introduce bugs. Too many designers are relying on tools and are lost in determining if their design will work without them. As described this sounds like the wrong choice of transistors (thick oxide as opposed to thin oxide). If any of the designers had some experience with ECL design they would never have made such an error.
Bob Colwell commented:
“In any case, it’s not a ‘design error’, it’s a verification methodology fault.” I don’t agree. Something has to be designed first before it can be validated or verified. There was a human responsible for the circuit in question; whether they designed this circuit by hand or used automation, they are still responsible. That was where the first mistake was made. Validation/verification was then performed, which obviously failed to catch this error; that was the second mistake. That’s about all one can say without insider knowledge, which none of us commenting on this issue have. Beyond all that, mistakes happen, always have, always will. The important thing is to learn from them and not make the same one twice.
Don L. commented:
All you can hear in the halls here is “…NBTI… this”, “…STI… that”, “QT issues… there”,
“…what will metal fill do to us?”, “placing dummy devices to eliminate shadows…”
EDA does not make those conversations happen; engineers do. Unfortunately, Intel did not have the right engineers doing that job.
Larry M commented:
The respin doesn’t cost the entire $700M. A good chunk of that is the recall. Of course that doesn’t change the outcome.
PD Dude commented:
My assumption is it’s a badly connected bulk node on a PMOS in a PLL, right?
I’d be curious to hear how people believe this should be caught, with some specifics.
I have talked with numerous engineers and while there are ways to catch it, most of them require some data prep, and if you have bad data input, you will not catch it.
For a simple single voltage domain digital circuit, this stuff works great. But most ERC decks I’ve seen don’t work well on analog and either require a lot of help, or produce tons of false positives. We’ve been telling verification companies this for 10 years. But I guess DFM and fast DRC times sound more exciting. Maybe now someone will spend some time on a better ERC solution for real mixed signal SoCs.
It would make for a great DAC demo. Get Synopsys, Mentor, Cadence and Magma up there and have them tell me specifically how their ERC tool would have caught this problem and why Intel didn’t catch it.
EDA Exec commented:
Make that a $1B proof point … they also lost $300M in revenue.
DM commented:
I agree this is a verification flow problem, since a simple static ERC would have flagged the mismatch, mixed signal or not. I’d say the EDA group owns this mistake, not the designers - assuming the designers ran the tool flow.
Kev (simguru) commented:
Modern chips are no longer digital or analog; they are mostly mixed signal. There is no good mixed-signal design methodology - i.e. something that allows you to do digital verification with the analog stuff modeled properly.
So I’m not surprised stuff fails, I’m more surprised that it ever works.
This is a fixable problem, but it would require the analog and digital guys in the EDA companies actually integrating their tools, but the analog and digital guys in EDA don’t communicate any better than in the design community.
Linda Capcara commented:
Thanks for the detailed explanation Mike. It makes me wonder if the Intel Storage Group GM is getting the heat.
Engineer commented:
Interesting!!