Design Feature: June 23, 1994
Synchronization bugs cause intermittent failures in board designs. These bugs can be frustratingly difficult to reproduce in the lab. Fortunately, careful designers can avoid this frustration by fulfilling two requirements. First, understand the principles of synchronization and metastability. Second, recognize the subtle situations in which these principles apply.
To see what can go wrong, consider the representative synchronous state machine in Fig 1. Because of a design bug, the state machine occasionally makes a transition from State 1 to State 0 instead of advancing to State 2. The state number is the binary value of the machine's three state flip-flops. The INIT, DATA_VALID, and COUNT_EN outputs are decoded from the state bits. The REQ signal is an asynchronous input. The design assumes that REQ holds its value for longer than the system clock period, guaranteeing that the machine sees all transitions.
The culprit causing the design bug is the asynchronous input, REQ. Being asynchronous, the REQ input may change at any time relative to the clock. Suppose that the REQ signal goes true at a time that violates the setup time of the state flip-flops. Because of skew or slight variations in timing for the flip-flops, some of the flip-flops might respond to the REQ input, and others might not. Suppose that the least-significant state flip-flop responds to the REQ input more quickly than the other flip-flops. Then, instead of making the transition from State 1 to State 2, the machine could go from State 1 to State 0. This condition can occur even when the flip-flops are on the same die.
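The failure mode can be made concrete with a small enumeration. The sketch below assumes the plain binary state encoding that the text describes and models a setup violation, as a simplification, by letting each changing state bit independently either update or keep its old value. The function name `reachable_states` is illustrative and does not come from the original design.

```python
from itertools import product

def reachable_states(current: int, intended: int, width: int = 3) -> set:
    """Return every state reachable if each changing bit may or may not update."""
    # Bits that differ between the current and the intended state.
    changing = [b for b in range(width) if ((current ^ intended) >> b) & 1]
    results = set()
    for choice in product([False, True], repeat=len(changing)):
        state = current
        for bit, updates in zip(changing, choice):
            if updates:
                state ^= 1 << bit  # this flip-flop saw the new input in time
        results.add(state)
    return results

# State 1 (001) -> State 2 (010) changes two bits, so four outcomes are possible.
print(sorted(reachable_states(0b001, 0b010)))
```

If only the least-significant bit updates, the machine lands in State 0, which is the erroneous transition described above.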
To prevent improper transitions, you can clock the asynchronous signal into one flip-flop, called a "synchronizer" (Fig 2). In the above case, the synchronizer would latch the REQ input on a clock edge to produce the signal LREQ. LREQ replaces REQ as the input to the state machine. If the synchronizer responds to the REQ value change before the clock edge, LREQ takes on the new value. If the clock edge precedes the change in REQ, LREQ doesn't change until the following clock edge. LREQ transitions are synchronous to the clock, drastically reducing the state machine's failure rate.
A synchronizer prevents most failures caused by an asynchronous input. Unfortunately, a phenomenon called "metastability" complicates synchronization. If an active clock edge and a data transition occur very close together, a flip-flop or a latch may not immediately make a transition from its current state into the new state. The flip-flop may remain in an in-between state, called the "metastable state," for an indeterminate time. Eventually, it settles to a 0 or a 1. While it is deciding, its output may glitch, oscillate, sit at an intermediate voltage, or merely show an increased clock-to-output delay.
The settling time is probabilistic. The longer the time after the clock edge, the more likely that the flip-flop will resolve to a valid state. Unfortunately, there is no guaranteed upper bound on the settling time. You can't build a bistable device such as a flip-flop that cannot go metastable. Its two stable equilibrium states are potential-energy minimums. Between the two minimums is a potential-energy maximum. Because the slope of the energy curve is 0 at the maximum, the maximum is also an equilibrium state, although an unstable one.
The MTBF that results from metastability depends on several factors. One basic metastability equation (Ref 1) is as follows:

$$\mathrm{MTBF} = \frac{e^{t'/t}}{f_c \, f_d \, T_0} \qquad \text{(Eq 1)}$$
where fc is the clock frequency and fd is the frequency at which the data input transitions. (For a flip-flop in an arbitration circuit, fc and fd would be the frequency of transitions of the two arbiter input signals.) T0 and t are device-specific constants. The time allowed for the output to settle is t', which starts at the clock-edge transition. The formula is meaningless, of course, for t' less than the normal clock-to-output delay.
The terms preceding the exponential in the equation indicate how often a flip-flop can become metastable. High clock and input transition frequencies fc and fd present more opportunities for metastability to occur. T0 is a scale factor. You can conceptualize T0 as the width of a time window around the clock edge during which, if a data transition occurs, the flip-flop becomes metastable. The term fc*T0 is the fraction of the clock period occupied by this time window. Because fd is the number of data transitions per unit time, fd*fc*T0 is the number of data transitions per unit time that fall within the metastable time window.
The exponential term in the equation describes the probability that a metastable condition will last for time t'. As you increase the time t' that you wait before looking at a flip-flop's output, you exponentially decrease your likelihood of seeing unresolved metastability. The time constant for the exponential term is t.
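The two parts of the equation can be explored numerically. The sketch below assumes the form MTBF = exp(t'/t) / (fc × fd × T0), which is what the term-by-term description implies. The function name, argument names, and all numeric values are illustrative and are not taken from any data sheet.

```python
import math

def mtbf_seconds(fc_hz, fd_hz, t0_s, tau_s, settle_s):
    """MTBF in seconds, assuming MTBF = exp(t'/t) / (fc * fd * T0).

    tau_s is the device time constant (t in the text);
    settle_s is the allowed settling time (t' in the text).
    """
    return math.exp(settle_s / tau_s) / (fc_hz * fd_hz * t0_s)

# Illustrative values only: 10-MHz clock, 1-kHz data rate, T0 = t = 1 nsec.
base = mtbf_seconds(fc_hz=10e6, fd_hz=1e3, t0_s=1e-9, tau_s=1e-9, settle_s=20e-9)
more = mtbf_seconds(fc_hz=10e6, fd_hz=1e3, t0_s=1e-9, tau_s=1e-9, settle_s=21e-9)

# One extra time constant of settling multiplies the MTBF by about e.
print(more / base)
```

Doubling fc or fd only halves the result, whereas each additional time constant of settling time multiplies it by e.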
To find the probability of a synchronizer failure due to metastability, set t' equal to the maximum time that a synchronizer flip-flop can be metastable without affecting a succeeding flip-flop. Therefore, t' is usually the time interval between the active clock edge at the first flip-flop and the next active clock edge at the succeeding flip-flop, minus the setup time of the second flip-flop and minus the path delay between the two flip-flops (Fig 3).
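The settling-time budget described above is simple arithmetic. The sketch below assumes that both flip-flops are clocked on the same edge of the same clock, so that the interval between active edges is one clock period. The function name and the example delays are hypothetical.

```python
def settling_time_s(clock_period_s, setup_s, path_delay_s):
    """Time t' available for a metastable output to settle.

    t' = (interval between active clock edges)
         - (setup time of the second flip-flop)
         - (path delay between the two flip-flops)
    """
    available = clock_period_s - setup_s - path_delay_s
    if available <= 0:
        raise ValueError("no settling time left; the path fails timing outright")
    return available

# Illustrative values: 50-nsec period, 5-nsec setup time, 3-nsec path delay.
t_prime = settling_time_s(50e-9, 5e-9, 3e-9)
print(f"{t_prime * 1e9:.1f} nsec")
```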
Manufacturers use various forms of the metastability equation. For example, Ref 2 uses three constants, k1, k2, and Do, giving MTBF in the form
A little algebra puts this equation into the same form as Eq 1
so that T0 is
and t is k2.
Because metastability formulas aren't standardized, you have to read application notes carefully to understand the manufacturer's definition of each parameter. For example, Cypress (Ref 3) defines:
where
and
Thus,
The clock-to-output time of the flip-flop is part of t', as shown in Fig 3. The clock-to-output delay is not part of tr because the delay is subtracted in the clock-to-feedback term. The difference between t' and tr corresponds to a change in scale factor W relative to the T0 parameter in Eq 1. The path of the potentially metastable output is assumed to be entirely inside one PLD, from one flip-flop through the feedback to another flip-flop, both clocked by the same clock. This configuration is generally the best design, but if your design violates this assumption, you have to adjust this formula.
Because of these variations in parameter definitions as well as differences in the techniques used to detect metastability, it is difficult to compare T0 across manufacturers. However, these variations should not affect t, the parameter to which MTBF is most sensitive.
The following example calculates MTBF for the REQ synchronizer discussed previously. What is the probability that LREQ will go metastable and that this state won't resolve in time to meet the setup time on the state bits? Using a Cypress 22V10-20 as a synchronizer and assuming that the system clock frequency is 20 MHz and that REQ asserts every 3.1 msec, you can calculate the MTBF. Because there are low-to-high and high-to-low transitions every 3.1 msec of the REQ signal, fd is 0.645 MHz. In addition to these values, you must use the PLD's maximum operating frequency, fmax, which you take directly from the Cypress data sheet. The maximum operating frequency is 41.6 MHz. Using the Cypress formula and the W and tsw parameters from the Cypress data sheet yields
where tr is given as 1/(20 MHz) - 1/(41.6 MHz) = 26 nsec. Plugging the values into the equation yields an MTBF of 1.7×10^59 sec, or 5×10^51 years, a very large number.
Suppose, however, that the system clock speeds up to 40 MHz. The MTBF becomes approximately 1 minute, an MTBF figure that is obviously unacceptable.
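The collapse at 40 MHz follows from the resolve time alone. The sketch below recomputes tr = 1/fc - 1/fmax for both clock rates and converts the quoted MTBF from seconds to years. It does not recompute the MTBF itself, because the W and tsw values are not reproduced in the text. The function and constant names are illustrative.

```python
def resolve_time_s(fc_hz, fmax_hz):
    """Resolve time tr = 1/fc - 1/fmax, as used in the example above."""
    return 1.0 / fc_hz - 1.0 / fmax_hz

SECONDS_PER_YEAR = 365.25 * 24 * 3600

tr_20 = resolve_time_s(20e6, 41.6e6)  # 20-MHz system clock
tr_40 = resolve_time_s(40e6, 41.6e6)  # 40-MHz system clock

print(f"tr at 20 MHz: {tr_20 * 1e9:.2f} nsec")
print(f"tr at 40 MHz: {tr_40 * 1e9:.2f} nsec")
print(f"1.7e59 sec = {1.7e59 / SECONDS_PER_YEAR:.1e} years")
```

Doubling the clock rate cuts the resolve time from about 26 nsec to less than 1 nsec, and that reduction appears in the exponent.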
For some systems, you cannot conveniently describe fd in hertz. For example, an asynchronous input on an image-processing board may change state twice/image. Expressing fd in units of 1/image gives MTBF as the mean number of images processed between failures.
The MTBF calculated here is for a single synchronizer. Multiple asynchronous inputs to the system yield a lower MTBF than that of a single synchronizer.
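One common way to combine several synchronizers, assuming that their failures are independent, is to add the failure rates (the reciprocals of the individual MTBFs). The sketch below uses that assumption; the function name and the sample values are illustrative.

```python
def combined_mtbf(mtbfs):
    """System MTBF when independent failure rates (1/MTBF) simply add."""
    total_rate = sum(1.0 / m for m in mtbfs)
    return 1.0 / total_rate

# Two identical synchronizers halve the MTBF.
print(combined_mtbf([100.0, 100.0]))

# One weak synchronizer dominates the system figure.
print(combined_mtbf([1e12, 1e3]))
```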
The calculation is simple. However, finding the parameter values is difficult. First, although some manufacturers provide values for T0 and t in application notes, many do not. Second, the reported parameter data may not be very accurate. A scan of the literature shows that numbers for the same type of device vary considerably from one report to another. Reported parameter values are not guaranteed maximums but are usually averages of a few parts tested. Like propagation delays, metastability parameters vary with process variations, voltage, and temperature. Small variations in the time constant t, especially, cause enormous variations in calculated MTBF because t is in the exponential term. The material in Ref 4 discusses the problem of parameter variation, giving an example in which a typical MTBF of 317 years shrinks to 12 minutes when you use estimated worst-case values.
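The sensitivity to the time constant is easy to quantify. The sketch below assumes that MTBF is proportional to exp(t'/t) and compares two time constants with everything else held equal. The 26-nsec settling time and the two time-constant values are illustrative, not measured data.

```python
import math

def mtbf_ratio(settle_s, tau_a_s, tau_b_s):
    """MTBF with time constant tau_a divided by MTBF with tau_b.

    Assumes MTBF is proportional to exp(t'/t); the clock rate, data rate,
    and T0 are taken to be identical and therefore cancel.
    """
    return math.exp(settle_s / tau_a_s - settle_s / tau_b_s)

# A time constant of 0.5 nsec instead of 0.4 nsec (25% larger), t' = 26 nsec.
ratio = mtbf_ratio(26e-9, 0.4e-9, 0.5e-9)
print(f"MTBF shrinks by a factor of about {ratio:.1e}")
```

Under these assumptions, a 25% error in the time constant changes the exponent from 65 to 52, which changes the MTBF by more than five orders of magnitude.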
Calculations are useful for getting a rough idea of the magnitude of the metastability problem. Following basic principles helps you to minimize the problem. The most important principle is to allow as long a time as possible for metastable conditions to settle. Clocking the synchronizer flip-flop with the opposite clock edge may speed your design by half a cycle, but it also costs you heavily in MTBF. This method reduces t', thus having the same effect on the exponential term as doubling the clock frequency. In the state-machine example, clocking the synchronizer with the opposite clock would cause MTBF to plummet from 5×10^51 years to less than 2 minutes. On the other hand, decreasing the clock frequency yields exponential improvements in MTBF.
Simple guidelines can gain a few nanoseconds, which may translate to many multiples of t. First, if you are implementing a synchronizer in a PLD, put the synchronizer flip-flop and destination flip-flops in the same part to minimize the delay from the synchronizer's output to its destination. Second, you can reduce the effects of metastability by using a multiple-stage synchronizer, which adds stages of pipeline delay. A multiple-stage synchronizer is a chain of flip-flops that synchronizes one asynchronous signal. The output of each additional stage in the synchronizer is less likely to be metastable than is its input. The longer the chain of flip-flops, the less likely it is that metastability will occur at the last stage's output.
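The pipeline delay that a multiple-stage synchronizer adds can be seen in a purely logical model. The sketch below treats the chain as a shift register and does not model metastability itself; the class name `Synchronizer` and its interface are hypothetical.

```python
class Synchronizer:
    """Behavioral model of an n-stage synchronizer (a chain of flip-flops)."""

    def __init__(self, stages=2):
        if stages < 1:
            raise ValueError("a synchronizer needs at least one stage")
        self.regs = [0] * stages

    def clock(self, async_in):
        """Advance one active clock edge and return the last stage's output."""
        # Every stage captures the previous stage's old value on the same edge.
        self.regs = [async_in] + self.regs[:-1]
        return self.regs[-1]

sync = Synchronizer(stages=2)
outputs = [sync.clock(bit) for bit in [1, 1, 1, 0, 0]]
print(outputs)
```

Each added stage delays the synchronized signal by one more clock edge, which is the price paid for the extra settling time.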
It is possible to rearrange a design to increase the length of the synchronization pipeline without adding latency. In the state-machine example, metastability on LREQ has to settle one setup time before the clock to avoid errors in the state bits. You could renumber the states, as shown in Fig 4, using a Gray code for the transitions that LREQ affects. Using this technique, each LREQ edge affects only one state flip-flop, preventing illegal state transitions. Even if LREQ fails to settle one setup time before the clock, an error does not result unless the changing state bit goes metastable and remains metastable long enough to cause further timing violations.
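The benefit of Gray-coded transitions can be checked by counting the bits that change. The sketch below uses the standard reflected Gray code as a stand-in; the actual state numbering in Fig 4 may differ, and the helper names are illustrative.

```python
def to_gray(n: int) -> int:
    """Convert a binary number to reflected Gray code."""
    return n ^ (n >> 1)

def bits_changed(a: int, b: int) -> int:
    """Count the flip-flops that must change between two state encodings."""
    return bin(a ^ b).count("1")

for state in range(3):
    nxt = state + 1
    print(
        f"{state}->{nxt}: binary changes {bits_changed(state, nxt)} bit(s), "
        f"Gray changes {bits_changed(to_gray(state), to_gray(nxt))} bit(s)"
    )
```

With binary numbering, the transition from State 1 to State 2 changes two flip-flops, so a marginal input can update one without the other. With Gray numbering, only one flip-flop changes, so the machine either takes the transition or stays put.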
A third way to improve MTBF is to choose devices with better metastability parameters. Metastability characteristics depend on circuit factors, such as internal gain-bandwidth product. Faster logic families often, but not always, have faster metastable resolutions. For example, the material in Ref 1 measures a t of 0.4 nsec for a sample of 74F74 flip-flops but measures a t of 1.7 nsec for the ECL 10131 flip-flops. For the same values of fc, fd, and t', it calculates an MTBF of 1×10^13 sec for the Fast family of D flip-flops but only 30 sec for the 10K ECL family of D flip-flops.
Devices claim to be metastable immune
Some devices, specifically designed to avoid metastability, are not guaranteed metastable-free but have small values of T0 and t and relatively well-behaved outputs. Philips Components, for example, claims that Signetics designed the "metastable-immune" 74F50XXX family to avoid runt pulses, oscillations, and intermediate voltage states on the outputs (Ref 5).
Using a dual-port RAM or FIFO buffer may seem a way to dodge the synchronization issue. Using these devices, you depend on the IC designer to implement the arbitration for reads and writes correctly. However, you still must think about asynchronous changes in status flags.
Before trying to handle an asynchronous signal properly, make sure that it actually is asynchronous. The metastability equations assume that the input data transition is equally likely to occur at any time during the clock period. In some synchronizing situations, such as an asynchronous interface with a handshake, this assumption may not be valid. Assume that a synchronous state machine generates a request and that the circuitry at the other end of the interface runs the request through some combinatorial logic and then generates an acknowledge. The timing of the acknowledge is, therefore, correlated with the state machine's clock.
If you treat a clock-correlated signal as an asynchronous signal, the system will probably work fine most of the time. However, each state machine has part delays, and, under some conditions, the system may fail. The delays may be such that the system violates setup times on every transition. The MTBF formulas don't work if the input is correlated with your clock.
Similarly, excessive path delays in synchronous logic can result in the same condition. The delays of some parts could be such that the data input of a flip-flop always makes a transition during the time window that causes metastability. Synchronous logic's advantage is its deterministic timing, but sloppy timing can cause synchronous logic to be reliably bad rather than reliably good.
If you can't avoid synchronization, follow these basic rules to avoid trouble. First, be aware of which signals are asynchronous. Second, receive each asynchronous signal by clocking it into only one flip-flop. Finally, guard against metastability by allowing the needed settling time. Design your synchronization scheme, rather than synchronizing ad hoc, and document the scheme so that you keep your design in mind as you make changes.
Calculating MTBF for a 2-stage synchronizer

One possible assumption is to let fd2 be the rate at which the first flip-flop has not settled by one setup time before the clock of the second flip-flop, that is, within t' = 1/fc - Tsu2. Then, the following equation (Ref 4) shows the MTBF for the synchronizer, assuming both flip-flops have the same metastability parameters:

$$\mathrm{MTBF}_2 = \frac{e^{t'/t}}{f_c \, f_{d2} \, T_0}$$

By assumption, 1/fd2 = MTBF of the first synchronizer (MTBF1).

Therefore,

$$\mathrm{MTBF}_2 = \mathrm{MTBF}_1 \times \frac{e^{t'/t}}{f_c \, T_0} = \frac{e^{t'/t}}{f_c \, f_d \, T_0} \times \frac{e^{t'/t}}{f_c \, T_0}$$

Because the fd term appears only once, this is not the square of MTBF1, as is sometimes claimed. Setting fd2 = 1/MTBF1 assumes that one uniformly distributed asynchronous data transition occurs each time the first stage goes metastable. One could argue that this assumption doesn't necessarily hold. The apparent fd2 depends on the first flip-flop's metastable behavior. For example, oscillations and intermediate voltage levels from the first flip-flop would be more likely to cause setup violations on the second one, producing a larger apparent fd2 than would runt pulses and delayed transitions. Nevertheless, errors in fd2 are insignificant compared with uncertainties in the exponential term.
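The two-stage estimate can be sketched in a few lines. The code below assumes the single-stage form MTBF = exp(t'/t) / (fc × fd × T0), the same parameters and settling time for both stages, and the sidebar's assumption that the second stage's data rate is 1/MTBF1. The function names and numeric values are illustrative.

```python
import math

def single_stage_mtbf(fc_hz, fd_hz, t0_s, tau_s, settle_s):
    """Assumed single-stage form: exp(t'/t) / (fc * fd * T0)."""
    return math.exp(settle_s / tau_s) / (fc_hz * fd_hz * t0_s)

def two_stage_mtbf(fc_hz, fd_hz, t0_s, tau_s, settle_s):
    """Feed the first stage's failure rate into the second stage as fd2."""
    mtbf1 = single_stage_mtbf(fc_hz, fd_hz, t0_s, tau_s, settle_s)
    fd2 = 1.0 / mtbf1  # sidebar assumption: 1/fd2 = MTBF1
    return single_stage_mtbf(fc_hz, fd2, t0_s, tau_s, settle_s)

# Illustrative values only, not taken from any data sheet.
params = dict(fc_hz=10e6, fd_hz=1e3, t0_s=1e-9, tau_s=1e-9, settle_s=20e-9)
mtbf1 = single_stage_mtbf(**params)
mtbf2 = two_stage_mtbf(**params)

# The ratio to MTBF1 squared equals fd, so the result is not the square.
print(mtbf2 / mtbf1 ** 2)
```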

Debora Grosse has worked at Unisys in Plymouth, MI, for nine years. She has BSEE and MSEE degrees from the University of Michigan. In her spare time, she enjoys taking walks with her family.