Design for reliability: NEC offers some new thoughts on reliable SoCs

February 14, 2007

With all the talk of design for manufacturing (DfM), another important aspect of design tends to get overlooked: adding features to the design to improve the long-term reliability of SoCs that pass manufacturing test and get deployed to the field. The problem here is not reducing initial defects but keeping the gradual deterioration of the chip from causing field failures.

This sounds like an issue with an obvious answer: if the design rules are sufficiently stringent, the chip should not deteriorate over time. But as two papers from NEC presented at ISSCC yesterday discussed, the situation isn’t black and white at fine geometries. Mechanisms such as gradual degradation of the transistor gate dielectric, deterioration of vias, and electromigration can all cause circuit performance to degrade over time, and there simply isn’t enough data on a new process to model these effects correctly and establish workable guidelines. Design rules that say, in effect, “use the previous process spacings” are not helpful. So the problem remains: how does one identify chips that are at risk, and either pull them out of production or repair them before a failure occurs?

NEC’s first paper on the topic addressed measuring actual timing slack in-circuit. The main intent of the paper was to describe a complex clock generator circuit, but one feature of the design stood out: the clock generator can selectively insert jitter into the clock lines. This provides a mechanism for directly measuring the timing slack on individual nets or in blocks. You simply speed up a particular clock until the block fails, and the period at which it fails tells you the largest delay in that block. Of course, figuring out which clocks to skew, and for how many cycles, is non-trivial.
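To make the idea concrete, here is a minimal sketch of the "speed up the clock until the block fails" search. It is not NEC's circuit or procedure. The function names, the step size, and the simulated block with a 730 ps critical path are all assumptions made up for illustration; on real silicon the pass/fail check would be a functional test run at the skewed clock.

```python
from typing import Callable, Optional


def find_min_passing_period(
    passes_at: Callable[[float], bool],
    nominal_ps: float,
    step_ps: float = 10.0,
    floor_ps: float = 0.0,
) -> Optional[float]:
    """Shorten the clock period one step at a time.

    Returns the shortest period at which the block still passes,
    or None if the block already fails at the nominal period.
    """
    if not passes_at(nominal_ps):
        return None
    period = nominal_ps
    # Keep stepping down while the next, shorter period still passes.
    while period - step_ps > floor_ps and passes_at(period - step_ps):
        period -= step_ps
    return period


def make_simulated_block(critical_path_ps: float) -> Callable[[float], bool]:
    """Hypothetical stand-in for hardware: passes if the period covers the slowest path."""
    return lambda period_ps: period_ps >= critical_path_ps


# Assumed numbers: 1000 ps nominal clock, 730 ps slowest path.
block = make_simulated_block(critical_path_ps=730.0)
shortest = find_min_passing_period(block, nominal_ps=1000.0, step_ps=10.0)
slack = 1000.0 - shortest
print(f"shortest passing period: {shortest} ps, measured slack: {slack} ps")
```

The measured slack is only accurate to within one step, and the sketch sweeps a single clock. It does not address the harder question of which clocks to skew and for how many cycles.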

The second NEC paper is far more ambitious. Suppose you do screen out parts that have very little timing slack at process corners. There are still mechanisms that will eventually cause circuits to fail. If the product is a notebook computer, no matter—people expect them to be as unreliable as cheap toys. But if the product is a pacemaker, or an attitude control system in a vehicle, the result of failure could be harm and litigation.

The automotive industry provides an excellent example of this concern, according to NEC Chief Research Manager Yasunori Mochizuki. He says that ordinarily, the auto industry wants essentially perfect parts. If any part fails in the field, they return it to the vendor and demand a failure analysis. “But in advanced processes, car makers have a hard choice to make. If they want the density and performance, they have to accept that individual circuits will fail sometimes.”

But that doesn’t mean automakers will accept failed systems. So SoC vendors are beginning to look at fail-proof designs. Approaches exist for achieving this, including dynamic error checking—such as error-correcting codes in memories—and multiple redundancy with voting—used in the FPGA-based wheel drivers in the Mars Rovers, for instance. But these approaches either only deal with memory, or they consume huge overhead.
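For readers who have not met redundancy with voting, the following is a small illustrative sketch of a bitwise majority voter over three copies of the same logic. It is a generic textbook construction, not the Mars Rover design or anything from the NEC papers, and the function name and sample values are assumptions. It also shows where the overhead comes from: every protected function has to be built three times, plus the voter.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value at least two inputs agree on."""
    return (a & b) | (a & c) | (b & c)


# Three copies of the same block should produce the same word.
# Here the third copy has two corrupted bits (assumed example values).
copy_a = 0b1010
copy_b = 0b1010
copy_c = 0b0110

voted = majority_vote(copy_a, copy_b, copy_c)
print(f"voted output: {voted:04b}")
```

A voter like this masks any fault confined to one copy, but it gives no advance warning that a copy is degrading.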

The second NEC paper offers a novel alternative. Assuming that the failure mechanism manifests itself as a gradual increase in delay on a net, the NEC approach examines the input waveforms entering each logic block or I/O and compares their timing on known sequences of states. If the timing relationship between inputs starts to shift, the circuitry assumes that something in the slowing path is approaching failure and switches in a redundant block of logic to replace the slowing one.
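The detection step can be pictured as comparing the current skew between two inputs against a reference recorded when the part was healthy. The sketch below is a software analogy under stated assumptions, not NEC's circuit: the class name, the baseline, the drift threshold, and the readings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SkewMonitor:
    baseline_ps: float   # skew between two inputs when the part was new (assumed)
    threshold_ps: float  # drift allowed before the source block is flagged (assumed)

    def is_degrading(self, measured_ps: float) -> bool:
        """Flag the block once the skew has drifted past the threshold."""
        return abs(measured_ps - self.baseline_ps) > self.threshold_ps


monitor = SkewMonitor(baseline_ps=40.0, threshold_ps=25.0)

# Hypothetical skew readings taken over the life of the part, in picoseconds.
readings = [41.0, 48.0, 59.0, 67.0, 80.0]
first_flag = next(
    (i for i, r in enumerate(readings) if monitor.is_degrading(r)), None
)
print(f"first reading flagged as degrading: index {first_flag}")
```

The threshold would have to sit well inside the available timing slack, so that the flag is raised while the block still works.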

In this way the chip identifies and replaces degrading blocks of logic before they use up their allowable timing slack, so no failure occurs, except in the case where multiple concurrent failures use up all the spare blocks. Even in this case the circuitry could probably warn the system that something was very wrong before an actual failure occurs. The approach is claimed to have less area overhead than full triple redundancy, and to be even more powerful at preventing errors in some failure scenarios.
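The replacement policy described above, including the case where the spares run out, might be modeled as follows. This is a behavioral sketch only. The class, its fields, and the choice of two spares are assumptions for illustration and do not come from the paper.

```python
class RedundantBlock:
    """Hypothetical model of one logic function backed by a pool of spare copies."""

    def __init__(self, spares: int) -> None:
        self.active = 0            # index of the copy currently in use
        self.spares_left = spares  # unused spare copies remaining
        self.warning = False       # raised when degradation is seen with no spare left

    def report_degradation(self) -> bool:
        """Swap in a spare if one remains; otherwise raise the system warning.

        Returns True if a spare was switched in, False if none was available.
        """
        if self.spares_left > 0:
            self.spares_left -= 1
            self.active += 1
            return True
        self.warning = True
        return False


block = RedundantBlock(spares=2)
# Three successive degradation reports: two are absorbed by spares, the third is not.
results = [block.report_degradation() for _ in range(3)]
print(results, "warning raised:", block.warning)
```

In this model the warning fires when a degrading copy cannot be replaced, which is still before that copy has used up its timing slack.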

Posted by Ron Wilson on February 14, 2007