Feature
Whose fault is it anyway? An introduction to digital fault simulation
Rigorous fault simulation ensures confidence in designs. Unfortunately, relatively few designers include fault simulation in their design methodology.
Clive "Max" Maxfield, Intergraph Computer Systems -- EDN, 6/6/1996
The minimal use of fault simulation in current design methodologies could stem from a lack of awareness (by both designers and managers) as to what fault simulation does and what benefits it brings you. This lack of awareness is often combined with misconceptions as to where this form of simulation actually fits into the design process.
Fault simulation is conceptually simple, yet capriciously cunning. After you design a circuit, you create a set of test vectors that describe stimuli for the fault simulator to apply to the circuit. These test vectors also contain the expected responses from the circuit. The fault simulator analyzes the circuit to determine any faults that might occur, such as shorted wires or breaks in the interconnect. The simulator then proceeds to evaluate how many of these faults your test vectors would detect (Figure 1). The criterion for detection is that a fault must manifest itself at one or more of the circuit's primary outputs. This manifestation occurs in the form of a response that differs from that of the fault-free circuit.
Perhaps the most popular misconception about fault simulation is that its only role is to verify the quality of the test vectors that automatic test equipment (ATE) uses to check boards. Promulgating this misconception are reference manuals that show process flows with fault simulation at the downstream end of the design cycle along with automatic test pattern generation (ATPG). I strongly disagree with this limited world view, because fault simulation can be extremely efficacious in the early stages of the design (Figure 2).
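The detection criterion described above (a fault counts as detected when some test vector makes a primary output differ from the fault-free response) can be sketched in a few lines of Python. This is only a minimal illustration, not how a production fault simulator is built: the circuit `y = (a AND b) OR c`, the wire names, and the two test vectors are all hypothetical, and only stuck-at faults are considered.

```python
# Hypothetical example circuit (not from the article): y = (a AND b) OR c.
WIRES = ["a", "b", "c", "n1", "y"]

def simulate(vector, fault=None):
    """Return primary output y for one input vector.

    fault is None (fault-free circuit) or a (wire, stuck_value) pair.
    """
    def force(wire, value):
        # A stuck-at fault overrides whatever value the wire would carry.
        if fault is not None and fault[0] == wire:
            return fault[1]
        return value

    a = force("a", vector["a"])
    b = force("b", vector["b"])
    c = force("c", vector["c"])
    n1 = force("n1", a & b)
    return force("y", n1 | c)

def fault_simulate(vectors):
    """Serially simulate every stuck-at fault; return (detected, undetected)."""
    faults = [(wire, value) for wire in WIRES for value in (0, 1)]
    detected, undetected = [], []
    for fault in faults:
        # Detected if any vector's faulty output differs from the good output.
        hit = any(simulate(vec, fault) != simulate(vec) for vec in vectors)
        (detected if hit else undetected).append(fault)
    return detected, undetected

first_pass = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 1},
]
detected, undetected = fault_simulate(first_pass)
coverage = len(detected) / (len(detected) + len(undetected))
print(f"coverage: {coverage:.0%}, undetected: {undetected}")
```

With these two vectors the good circuit always outputs 1, so none of the stuck-at-1 faults can be seen and the coverage is only 50%. Adding vectors that drive the output to 0 is what closes the gap.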
Here's the way fault simulation works: After describing the circuit and test vectors, you can use a digital logic simulator to verify the functionality of the design. Maybe the design passes, or maybe it fails. In the latter case, you root around to determine whether the problem lies in the circuit or in the test vectors, and you fix one or the other (or both). At this stage, many designers proceed directly to timing verification, but they are missing the point. Just because your test vectors passed your logic simulation doesn't mean that you have a good design. Your first-pass vectors probably did not test your entire circuit. Therefore, at this point, you should use fault simulation to discover which parts of the circuit the test vectors have not exercised. You can then loop back and augment the test vectors until you are content that they do, in fact, put the design through its paces. You should only consider moving to timing verification when you are convinced that you have a comprehensive test sequence.
Note that the scenario in Figure 2 assumes that a form of digital simulator, called a "dynamic timing verifier," performs the timing verification. All three simulators (logic, fault, and timing) use the same simulation models and test vectors.
Of course, some designers (particularly at the board level) neglect any form of verification by simulation and use static-timing analysis to perform their timing verification. The arguments for and against static vs dynamic timing analysis are many and varied. Suffice it to say that the proponents of static-timing analysis would point to its being exhaustive, relatively fast, and not requiring a waveform. I'm happy to accept that static-timing analysis applies to many designs, but I consider using it only after I have verified the functionality of the design via some form of simulation (in which case, I already have a waveform for dynamic timing analysis by default). The problem with static-timing analysis is that it myopically checks the timing without paying much regard to the functionality (except in simple cases, such as understanding when an inversion has occurred). This lack of understanding of the design's functionality means that static-timing analysis never informs you that you've used an AND gate where you really needed an OR gate. Static-timing analysis only reports whether whatever you did use passes or fails the timing constraints you specified as your goal.
In addition to increasing confidence in your test vectors, running the fault simulator early in the design helps you locate any portions of the design that are difficult to test or are simply untestable. This ability to detect problem areas allows you to modify your design early in the cycle to ensure its testability, thereby gaining the undying gratitude of the poor souls who are eventually going to build and test your masterpiece.
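To make "simply untestable" concrete, the sketch below tries every possible input vector against every stuck-at fault and reports the faults that no vector can expose. The circuit is a hypothetical one with deliberately redundant logic (`y = a OR (a AND b)`, which reduces to `y = a`); it is not taken from the article, and exhaustive enumeration is only practical for toy examples like this one.

```python
from itertools import product

# Hypothetical circuit with redundant logic (not from the article):
#   n1 = a AND b;  y = a OR n1   (which simplifies to y = a)
WIRES = ["a", "b", "n1", "y"]

def simulate(a, b, fault=None):
    """Return output y; fault is None or a (wire, stuck_value) pair."""
    def force(wire, value):
        if fault is not None and fault[0] == wire:
            return fault[1]
        return value

    a = force("a", a)
    b = force("b", b)
    n1 = force("n1", a & b)
    return force("y", a | n1)

def untestable_faults():
    """Return stuck-at faults that no input vector can make visible at y."""
    faults = [(wire, value) for wire in WIRES for value in (0, 1)]
    all_vectors = list(product((0, 1), repeat=2))
    return [
        fault
        for fault in faults
        if all(simulate(a, b, fault) == simulate(a, b) for a, b in all_vectors)
    ]

print(untestable_faults())
```

Here the faults on `b` and the stuck-at-0 on `n1` can never be observed, which flags the AND gate as logic that either is redundant or needs an extra observation point.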
(If more designers were actually forced to walk one of their creations through the layout, manufacturing, and testing processes, I'd bet my wife's life savings that the designers would radically change the way in which they went about creating any future designs.)
Stuck-at, open, and drive faults
The simulator might apply several different kinds of faults to the circuit. The three most common fault classes are stuck-at, drive, and open (Figure 3). Stuck-at faults occur on wires and represent a short circuit to a ground plane or to a power supply; thus, assuming positive logic, these faults are abbreviated to S0 and S1, respectively. Drive faults apply only to a component's output terminals. D0 and D1 indicate that the component has failed in such a way that it's constantly driving either a logic 0 or 1 value at its normal strength. The DZ fault represents the cases in which the component has failed or has become disconnected from the wire such that it's driving a high-impedance Z state. By comparison, open faults apply only to a component's input terminals. All three open-fault types represent cases in which the input terminal has become disconnected from the wire, where O0 and O1 indicate that this input subsequently "floats" to a logic 0 or logic 1 (perhaps caused by an internal pulldown or pullup resistor). OZ indicates that the input assumes a high-impedance value.
In the case of the circuit segment in Figure 3, an S0 fault on wire w1 manifests itself to the outside world in a manner identical to that of a D0 fault on g1.y or an O0 fault on g2.a. (Similarly, an S1 fault on wire w1, a D1 fault on g1.y, and an O1 fault on g2.a all appear identical to the outside world.) At this stage, note that, with the exception of a DZ fault, drive faults are really only of interest when multiple components are driving the same wire. Similarly, with the exception of OZ faults, open faults are only of interest when the wire is connected to two or more component inputs.
Of course, not all fault simulators are created equal, and many consider only a subset of the faults described. Some fault simulators can only handle stuck-at faults, in which case the purveyors of said simulators go to great lengths (using the highest quality multicolored charts) to persuade you that you don't really need the other faults. In my humble opinion, this situation is akin to a computer salesman proclaiming that a mouse is an unwarranted luxury item. The ability to distinguish open and drive faults from stuck-ats provides a much deeper understanding of any potential problems that may eventually leap up and bite you when you least expect it.
If the argument that you can get by using only stuck-at faults sways you, watch out for fault simulators that handle these faults only on the interconnect but not in the models of the components themselves. The main argument behind this approach is based on the fact that many early simulators only inherently understood models of primitive gates and registers. Thus, if you wanted to use these simulators to create a model of a component such as a multiplexer, you did so by wiring a number of simulation primitives together. This modeling technique meant that, although your component model was functionally equivalent to the device in question, it only approximated the physical reality.
Therefore, many pundits proclaimed that any fault simulation of the component model's internal structure was meaningless. In fact, this debate is something of a gray area. I've personally detected and resolved many unexpected issues by performing fault simulations that included the contents of component models, even when the internal structures of those models only bore a passing resemblance to the real devices.
The proponents of the "don't-simulate-component-internals" party point to the difficulty of fully fault simulating a circuit with large microprocessor-sized components, unless you restrict yourself to the interconnect. It's certainly true that if you've got devices the size of a Pentium Pro on your circuit board, fault simulating every aspect of each model would bring your computer to its knees, and you would be old and gray by the time you saw any results. (If you are already old and gray, you have probably tried this technique.) However, there is nothing to prevent you from fault simulating the device's primary internal registers and major data paths.
Of course, the device models you're using may come from a third party and may, therefore, be encrypted. In this case, the "don't-simulate-component-internals" brigade would gleefully point out that you don't know what's inside the model anyway. However, although the model's creator may have signed a nondisclosure agreement with the device's manufacturer, it's no great secret that a component such as a microprocessor includes such registers as an accumulator and a program counter. Thus, the modeler wouldn't be giving away any trade secrets by making the specific internal constructs visible to the outside world in the form of the fault simulator. Of course, you would also require a brief application note telling you what the designer called those constructs and what they represent (for example, "ACC" stands for "accumulator").
Fault simulation can be a time-consuming hobby, so anything that speeds it up is gratefully received by the user. One particularly useful trick is fault equivalencing followed by collapsing. The simulator first looks for any faults that manifest themselves identically at the primary outputs (equivalencing). The simulator then simulates only one fault from each such group (collapsing). The chain of inverters in Figure 4 shows this process.
A fault simulator that supports fault collapsing recognizes that an S1 fault on wire w1 generates the same result as an S0 on w2, which, in turn, generates the same result as an S1 on w3. Thus, the simulator selects one of these faults as the prime fault, equivalences the other faults to it, and simulates only the prime. (The simulator treats an S0 on w1, an S1 on w2, and an S0 on w3 in a similar fashion.) The simulator also detects any drive faults that are equivalent to open faults and, in turn, any open faults that are equivalent to stuck-at faults. Thus, even if you're simulating all fault types, the end result may be somewhat less horrific than you might have feared. For example, consider the collapsing that the fault simulator could perform on a two-input AND gate (Figure 5).
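Equivalencing can be checked by brute force on circuits this small: simulate every fault against every input pattern and group the faults whose output responses are identical. The Python sketch below does this for an inverter chain and for a lone two-input AND gate. It is an illustration only; real simulators generally find equivalences structurally rather than by exhaustive simulation. The fault tuples `(kind, site, value)`, the function names, and the assumed topologies (w3 observed directly; w1 and w2 feeding only g1; only g1 driving w3) are assumptions made for the example.

```python
from itertools import product

def forcer(fault):
    """Return a function that overrides a site's value when the fault sits there.

    A fault is a hypothetical (kind, site, value) tuple, e.g. ("S", "w1", 0).
    """
    def force(site, value):
        if fault is not None and fault[1] == site:
            return fault[2]
        return value
    return force

def inverter_chain(x, fault=None):
    # Assumed topology: w1 -> inverter -> w2 -> inverter -> w3 (observed).
    force = forcer(fault)
    w1 = force("w1", x)
    w2 = force("w2", 1 - w1)
    return force("w3", 1 - w2)

def and_gate(a, b, fault=None):
    # Assumes w1 and w2 feed only g1, and only g1 drives w3.
    force = forcer(fault)
    pin_a = force("g1.a", force("w1", a))  # open faults act on input pins
    pin_b = force("g1.b", force("w2", b))
    pin_y = force("g1.y", pin_a & pin_b)   # drive faults act on the output pin
    return force("w3", pin_y)

def collapse(circuit, faults, n_inputs):
    """Group faults that respond identically to every input pattern."""
    classes = {}
    for fault in faults:
        signature = tuple(
            circuit(*vec, fault=fault)
            for vec in product((0, 1), repeat=n_inputs)
        )
        classes.setdefault(signature, []).append(fault)
    return list(classes.values())

def prime(fault_class):
    """Pick a representative, preferring a stuck-at fault when one exists."""
    return next((f for f in fault_class if f[0] == "S"), fault_class[0])

stuck_ats = [("S", w, v) for w in ("w1", "w2", "w3") for v in (0, 1)]
opens = [("O", p, v) for p in ("g1.a", "g1.b") for v in (0, 1)]
drives = [("D", "g1.y", v) for v in (0, 1)]

chain_classes = collapse(inverter_chain, stuck_ats, 1)
and_classes = collapse(and_gate, stuck_ats + opens + drives, 2)
print(len(stuck_ats), "chain faults collapse to", len(chain_classes))
print(len(stuck_ats + opens + drives), "AND-gate faults collapse to", len(and_classes))
for fault_class in and_classes:
    print(prime(fault_class), "represents", fault_class)
```

The six stuck-at faults on the chain fall into two groups, and the 12 stuck-at, open, and drive faults on the AND gate fall into four, so only one prime per group needs to be simulated.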
In the case of an AND gate, if any of the inputs are logic 0, the output is driven to a logic 0. The fault simulator can, therefore, equivalence the O0 on g1.a to the S0 on w1. The simulator can also equivalence the O0 on g1.b to the S0 on w2, and both the S0 on w1 and the S0 on w2 can be equivalenced to an S0 on w3 (along with the D0 on g1.y). Therefore, the simulator can equivalence all six S0, O0, and D0 faults to a single S0 on w3, which may itself end up equivalenced to some other fault. Similarly, the O1 on g1.a can be equivalenced to the S1 on w1, the O1 on g1.b can be equivalenced to the S1 on w2, and the D1 on g1.y can be equivalenced to the S1 on w3. Thus, the fault simulator has to actually simulate only four of the 12 potential faults. However, note that the equivalences in this example rest on the assumption that wires w1 and w2 each feed only a single load (the inputs of g1) and that wire w3 is driven only by a single gate (g1). As noted earlier, open faults become distinguishable from stuck-ats once a wire fans out to two or more inputs, and drive faults become distinguishable once two or more components drive the same wire. Note that stuck-at faults are considered to have the highest precedence. Given a choice, the simulator always attempts to make a stuck-at fault the prime and equivalence any drive and open faults to these stuck-ats.
In addition to stuck-at, open, and drive faults, there are a variety of other faults that the simulator could consider. For example, some fault simulators understand that certain primitive gate faults can occur in some technologies, such as an AND gate failing in such a way that it behaves as if it were an OR gate.
Another type of fault that may be useful for you to simulate is a short fault. Unlike stuck-ats, which imply an undesired connection to a power or ground plane, short faults represent unwanted connections between a pair of signal wires. The main problem with this type of fault is specifying which pairs of wires the simulator should evaluate.
An exhaustive approach would involve considering potential shorts between every possible pair of signal wires, but the universe would end before your simulation does. Another technique is to explicitly specify each pair of wires for which you wish to simulate a short, but this is extremely tedious to say the least. In the days when dual in-line packages were the rage, certain simulators provided the ability to automatically generate a list of potential short faults based on adjacent pins on the component package. Unfortunately, I am not aware of any simulators that have extended this technique to handle grid-array type packages (which isn't to say that such simulators don't exist).
Some simulators understand the concept of "functional faults," which they can apply to high-level hardware-description-language (HDL) representations of components, circuits, or systems. To illustrate this concept, consider the short Verilog model segment that appears at the end of this article. This model segment states that a positive edge occurring on the signal called "clock" causes the simulator to load register regA with the contents of regB or regC, depending on the state of the signal, sigA. One example of a functional fault involves the simulator holding sigA at logic 0 (similar to an S0 on an interconnect), seeing what happens, and repeating the exercise while holding sigA at logic 1. At the very least, simulating functional faults in this way can reveal portions of the HDL that your test waveform has not yet exercised.
Further considerations
You can also use fault simulators to verify the quality of the test vectors that you ultimately use to check the unit under test (UUT), which may be an individual device or a full circuit board. Some testers can only differentiate between logic 0 and logic 1 values at the UUT's primary outputs; others can also detect high-impedance "Z" values. Thus, you should be able to instruct the fault simulator as to which output values it can recognize, based on the tester's specification.
A fault that the simulator applies may cause unknown "X" values to appear at the outputs. In the real world, these "X" values actually appear at the UUT's outputs as logic 0s or 1s. Because the simulator doesn't know what will happen in the real world, the simulator considers an "X" caused by a fault to represent a "potential detection." Assuming for the sake of this discussion that your tester can't detect "Z" values and that you've, therefore, instructed your fault simulator to only distinguish between 0s, 1s, and Xs (the simulator coerces Zs into Xs), then it's possible to create a simple matrix to illustrate the difference between detections and potential detections (Figure 6).
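One way to write such a matrix down is as a small classification function. The Python sketch below is an interpretation of the rules in the text, not a reproduction of Figure 6: the function name, the string encoding of logic values, and the `tester_sees_z` switch are hypothetical, and treating an unknown known-good value as "no detection" is an assumption of this sketch.

```python
def classify(good, faulty, tester_sees_z=False):
    """Compare a fault-free and a faulty output value ('0', '1', 'X', or 'Z').

    Returns 'detected', 'potential', or 'none'.
    """
    def coerce(value):
        # A tester that cannot observe Z sees it as an unknown value.
        return "X" if value == "Z" and not tester_sees_z else value

    good, faulty = coerce(good), coerce(faulty)
    if good == "X":
        # Assumption of this sketch: with no known-good value there is
        # nothing to compare against, so nothing is counted.
        return "none"
    if faulty == "X":
        return "potential"
    return "detected" if good != faulty else "none"

# Print the matrix for a tester limited to 0, 1, and X.
for good in "01":
    for faulty in "01X":
        print(f"good={good} faulty={faulty}: {classify(good, faulty)}")
```

Passing `tester_sees_z=True` models a tester that can recognize high-impedance outputs, in which case a known-good 0 against a faulty Z counts as a definite detection rather than a potential one.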
If the known-good value were 0 and the faulty value were 1, or vice versa, then the fault simulator would definitely detect the fault. But if the known-good value were 0 or 1 and the faulty value were "X," then the simulator would consider the fault to be only potentially detected. The point is that a fault simulator with any level of sophistication should allow you to specify the number of potential detections that must occur before that fault is dropped from the active list; otherwise, the simulator could continue to simulate that fault ad infinitum (or until you reprogram it with a mallet in frustration).
Some fault simulators are based on a serial algorithm, which means that they independently check each fault, one after another. This process can be acceptable in a hardware simulator but is pretty much untenable in a software implementation. Software simulators usually favor a parallel approach to evaluating faults; these algorithms are particularly powerful at the start of a simulation when many faults are active, but the simulation can become slow toward the end when only a few faults remain to be detected. One of the more powerful fault-simulation algorithms is the Parallel Value List (PVL), which GenRad developed.
Using a fault simulator conveys a number of benefits, including guiding you toward creating more testable circuits and more rigorous test waveforms. If you do start to use a fault simulator, any additional time you spend up-front creating your initialization sequence reaps rewards downstream. One trick that I've found to be particularly useful is to try to initialize every register element using at least two techniques before allowing the simulator to start applying its faults. For example, if a register has both set and reset inputs, first pull one and then the other. Alternatively, if a register only has a reset, first pull the reset and then load a value into the register using the clock.
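The effect of initializing a register twice can be seen with a toy three-valued (0, 1, X) model. The Python sketch below is only an illustration under stated assumptions: a single register with an active-low reset and a clocked load, an S1 fault on the reset line, and hypothetical names throughout.

```python
X = "X"  # unknown value

def run(sequence, reset_stuck_at_1=False):
    """Apply an initialization sequence to one register; return its final value.

    sequence is a list of ("reset", level) or ("load", data) steps.
    reset_stuck_at_1 models an S1 fault on the active-low reset line.
    """
    q = X  # power-up contents are unknown
    for op, value in sequence:
        if op == "reset":
            reset_n = 1 if reset_stuck_at_1 else value
            if reset_n == 0:  # active-low reset clears the register
                q = 0
        elif op == "load":    # clocked load of a known value
            q = value
    return q

reset_only = [("reset", 0), ("reset", 1)]
reset_then_load = reset_only + [("load", 0)]

print("reset only,      good :", run(reset_only))
print("reset only,      S1   :", run(reset_only, reset_stuck_at_1=True))
print("reset then load, good :", run(reset_then_load))
print("reset then load, S1   :", run(reset_then_load, reset_stuck_at_1=True))
```

With the reset alone, the faulty machine is left holding X. Adding the clocked load gives the faulty machine a known value as well, so the unknown does not carry forward into the rest of the simulation.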
The reason for initializing everything twice is that if you only use one method such as the reset to initialize your register, then when the simulator applies an S1 fault to reset (assuming that this signal is active low), the register ends up containing unknown "X" states, which can flood throughout the simulation. Therefore, the simulator can only make potential detections, which increases the simulator's memory usage and slows it to a crawl (Reference 1). Also note that using XOR gates in your circuits makes them more testable, which can ease the task of writing your test waveform and speed up your fault simulation (Reference 2).
Last, but not least, we might ask ourselves how fault simulators could evolve in the future. I can think of at least two features that I'd love to see. I'm going to keep the first one to myself (in the hopes that it one day will make me rich beyond my wildest dreams), but I'd be more than willing to share the other one with you. Suppose that I'm designing something such as a traffic-light controller, and I know that I never want it to fail, such that each of the roads at the intersection is presented with a green light. What I'd like to do is instruct the fault simulator to apply my test waveform to the circuit, and to inform me of any individual faults or combinations of a specifiable number of faults that would cause the circuit to fail in such a way as to present a particular combination of values to its outputs (all the traffic lights being green, in this example).
Obviously, if I let the simulator rampage throughout every gate in the design, this process could take forever. However, in the case of a mission-critical design, such as this, my circuit should be compartmentalized. For example, the controller should be partitioned from the driver, and the driver circuitry should contain fail safes to prevent my worst dreams from turning into nightmares.
On this basis, I should be able to restrict the simulator to only exhaustively examine the driver portion of the circuit for fault combinations leading to my "all-green" condition.
As usual, don't hesitate to let me know if you have any ideas or comments on what you've read or if you have some wild and wacky ideas of your own.
Author's biography
Clive "Max" Maxfield is a member of the technical staff at Intergraph Computer Systems, Huntsville, AL, where he gets to play with the company's high-performance graphics workstations (phone (800) 763-0242). Maxfield is also the author of Bebop to the Boolean Boogie: An Unconventional Guide to Electronics (ISBN 1-878707-22-1). To order, phone (800) 247-6553. You can reach Maxfield via e-mail at crmaxfie@ingr.com.
Listing: Verilog model segment used in the discussion of functional faults
    // On each rising edge of clock, load regA from regB or regC, as selected by sigA.
    always @(posedge clock)
      if (sigA == 0) regA <= regB;
      else           regA <= regC;
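The functional-fault experiment described for this Verilog segment (hold sigA at logic 0, see what happens, then repeat with sigA held at logic 1) can be mimicked in Python. This is a behavioral sketch only, not a Verilog simulator: the waveform format, the function names, and the sample data values are hypothetical.

```python
def run(waveform, sigA_held=None):
    """Return the regA value after each clock edge.

    waveform is a list of (sigA, regB, regC) tuples, one per rising edge.
    sigA_held, if given, is the functional fault: sigA forced to 0 or 1.
    """
    trace = []
    for sigA, regB, regC in waveform:
        if sigA_held is not None:
            sigA = sigA_held
        regA = regB if sigA == 0 else regC  # mirrors the if/else in the model
        trace.append(regA)
    return trace

def functional_fault_report(waveform):
    """For sigA held at 0 and at 1, report whether the waveform notices."""
    good = run(waveform)
    return {held: run(waveform, held) != good for held in (0, 1)}

weak = [(0, 5, 9), (0, 7, 2)]   # sigA never goes to 1
strong = weak + [(1, 3, 8)]     # now the else branch is exercised too

print("weak  :", functional_fault_report(weak))
print("strong:", functional_fault_report(strong))
```

The weak waveform cannot tell the good model from one with sigA held at 0, which is exactly the hint that the else branch has never been exercised.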













