Feature
Skew generation and analysis in timing-critical circuits
Custom-made circuits enable designers to achieve timing precision on the order of a few picoseconds. By following recommended techniques, designers can achieve highly accurate skew insertion and balancing, and Spice simulations confirm the design method's suitability.
By Naveen Tiwari and Ruchir Saraswat, STMicroelectronics -- EDN, 11/13/2003
As the electronics industry enters the less-than-100-nm-transistor era, delay sensitivity has also begun to venture into the picosecond range. You can also infer this high degree of timing accuracy from Moore's Law, which predicts that the density of on-chip transistors will double every 1.5 years. This increase in density has a direct effect on capacitance and other parameters, thus decreasing the delays of on-chip elements.
New methods are available for achieving a specific delay, such as DACEA (differential autocancellation of error architecture); other methods handle skew balancing, skew insertion, and delay characterization, all of which are useful in designing timing-critical circuits (Reference 1). These methods are compatible with standard cell libraries, and proper layout can ensure that silicon results are close to the desired accuracy. The methods are technology-independent, thus allowing easy portability of the design into subsequent process generations. They are especially useful for classes of circuits that can compromise on silicon area to achieve picosecond-level accuracy.
Matching rise- and fall-time delays
The problem of matching rise- and fall-time delays is prominent in custom-made circuits, and almost all cells fail to match rise and fall delays for every PVT (process, voltage, and temperature) combination. Even if a cell is tuned to have matched rise and fall times for a certain PVT combination and for some input drive and input load, it will fail to match the delay as soon as you even slightly vary one of the parameters. The solution to this problem lies in using an architecture that will, by itself, take care of mismatch and be independent of parameter variation. This approach will certainly cost additional silicon area, but the result it achieves will be close to what you desire.
To better understand the suggested method, consider, for example, a simple buffer, normally made of two inverters. If you want the buffer to have a small input load but a high output driving strength, then the input inverter must be small, and the output inverter must be large. Standard library-cell buffers have the same construction, and the rise and fall delay of such buffers cannot be matched. The solution doesn't lie in changing the library cell, because this option is infeasible.
Instead, you use an architecture with components of equal driving strengths and almost equal input capacitance in the timing-critical path, and you use the DACEA method as often as possible while designing the circuit (Figure 1). Suppose that the buffer comprises two inverters of equal size (that is, equal input capacitance and equal output drive). Also, assume that the input comes from a driver of the same strength as these two inverters and that the output load is equal to the input capacitance of the inverter. Let the inverter's propagation delay for a rising input be TIR, and let its propagation delay for a falling input be TIF. With these assumptions, the new buffer will have a propagation delay equal to TIR+TIF for a rising input and TIF+TIR for a falling input. Spice simulation results confirm that if, on silicon, both inverters are similar, the buffer's propagation delays will be equal in response to rising and falling inputs. These delays will match for any PVT corner.
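The cancellation described above can be checked with a small numeric sketch. It assumes, as a simplification, that each inverter contributes a fixed delay that depends only on the polarity of its input edge; the `Inverter` class, the `chain_delays` helper, and all delay values are hypothetical and are not taken from the article's simulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inverter:
    t_ir: float  # propagation delay for a rising input, in ps
    t_if: float  # propagation delay for a falling input, in ps


def chain_delays(stages):
    """Return (delay for a rising input, delay for a falling input) in ps."""
    rising_total = 0.0
    falling_total = 0.0
    for index, inv in enumerate(stages):
        # Each inverter flips the edge, so stages at odd indices see the
        # opposite edge polarity from the one applied at the chain input.
        if index % 2 == 0:
            rising_total += inv.t_ir
            falling_total += inv.t_if
        else:
            rising_total += inv.t_if
            falling_total += inv.t_ir
    return rising_total, falling_total


# Hypothetical delays, chosen only for illustration.
equal_pair = [Inverter(30.0, 22.0), Inverter(30.0, 22.0)]
small_then_large = [Inverter(30.0, 22.0), Inverter(18.0, 14.0)]

print(chain_delays(equal_pair))        # (52.0, 52.0)
print(chain_delays(small_then_large))  # (44.0, 40.0)
```

With two identical inverters the rising and falling totals are the same sum taken in a different order, so they match whatever the individual values are. With unequal stages, the per-stage mismatches no longer cancel.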
This theory can extend to any circuit or module, such as the implementation of a 4-to-1 multiplexer using DACEA (Figure 2). In this implementation, the propagation delays of all four datapaths—namely, D0_OUT, D1_OUT, D2_OUT, and D3_OUT—are exactly equal. Each path contains two tristate inverters with almost equal load at their outputs, thus achieving differential autocancellation of error.
The suggested architecture can also implement bigger modules if you modify your architecture and blocks to have the same drive and use DACEA. But to extensively use this method, all of the cells must have almost the same input impedance and an almost equal output drive. This method produces a small error, but increased drive strength minimizes it. You should optimally choose the drive strength so that error is acceptably low and you minimize the area occupied.
The suggested method might also encounter a problem when the output is inverted with respect to the input, because the path then contains an odd number of elements. Thus, the error cannot completely cancel out to match the delay. The best option for this case would be to balance the entire path, up to the last inverting element, and use the inverter as the last element. In this option, error will exist, but it cannot be greater than the propagation delay of the inverter used. You can further minimize the mismatch if you know the load exactly. In fact, the mismatch will be small if the inverter has sufficient driving strength with respect to the output load.
Skew insertion
Another requirement for custom-made timing-critical circuits is to intentionally provide some skew by using a delay chain. This challenge can be big if the requirement is to generate skew on the order of a few picoseconds, and in equal steps. The fastest component that could produce delay without phase change is a buffer chain (Figure 3). You can match the rise and fall delay of this chain using methods discussed in the previous section. Analyzing the minimum value of a buffer-chain configuration in the time domain results in process- and temperature-dependent values (Table 1). To further increase precision, you can easily tap every inverter's output in the chain (Figure 4). The resultant accuracy will improve by a factor of two (Table 2).
These values will be much more accurate than previous results, but they will still exhibit significant error because of the difference in rise and fall time of the inverter, resulting in two major problems. Every consecutive tap has a phase difference of 180° (that is, every consecutive tap is inverted); hence, you need to do some design work to bring them to the same phase without adding any extra delay. Also, the rise and fall time of each inverter must be equal; otherwise, additional skew would appear. To overcome these shortcomings, you need a circuit that can invert the taps with 180° of phase difference while passing other taps through unchanged.
If you require even better accuracy (greater than one inverter delay), with rise and fall times matched, you can use a delay chain called the complementary delay line (Figure 5). This circuit uses two inverter chains; the input to one of them is the complementary clock, and the input to the other is the normal clock. Every inverter's output in the chain is connected to the output of a similar inverter in the other chain via two low-strength inverters. These inverters ensure that the rise and fall time of the chain is balanced, and the use of two chains provides in-phase outputs from every inverter output, but the circuit employs alternately taken taps. Although the circuit provides more accurate delays, the gain in delay resolution over the previous circuit is still insignificant, and it requires a large number of components with correspondingly high power consumption.
If you require high accuracy, you should use the phase-blending technique, which will help you achieve an accuracy of a few picoseconds (Figure 6). This accuracy is much higher than the smallest delay of a standard library cell, such as an inverter. This circuit receives two phase-adjacent input signals, ΦA and ΦB, which are separated in phase by one inverter delay. The phase blender directly passes these two signals with a simple delay to produce output signals ΦA and ΦB. It also uses a pair of phase-blending inverters to interpolate between these two input signals with a simple delay to produce a third output signal, ΦAB, having a phase between ΦA's and ΦB's. This interpolation effectively doubles the available phase resolution.
The phase-blending-inverter relative-size ratio, w=Wa/(Wa+Wb), is the ratio of device widths in inverters A and B. For input signals separated by one inverter delay (td=1/RC), the model specifies that, to ensure that the phase of ΦAB lies exactly between ΦA's and ΦB's, the phase-blending inverters must be sized in a w=60/(60+40)=0.60 ratio, such that the leading phase is coupled to an inverter that is bigger than the one that receives the lagging phase.
The phase-blender idea can extend to multiple cascaded stages for further phase-resolution improvement, with each stage improving resolution by a factor of two. Although it is theoretically possible to increase the resolution indefinitely by adding more phase-blender stages, a practical limit exists. With these highly accurate delay steps available, you should select a delay chain that meets the requirement with minimum area and minimum power consumption (Table 2). In other words, the delay chain you choose will depend on the target application.
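Because each cascaded phase-blender stage halves the step, the ideal resolution follows directly from the number of stages. The sketch below assumes perfect midpoint interpolation at every stage and ignores the practical limit mentioned above; the function names and the 50-psec inverter delay used in the example are illustrative assumptions.

```python
def blended_step_ps(inverter_delay_ps, stages):
    """Ideal delay step after a number of cascaded phase-blender stages."""
    return inverter_delay_ps / (2 ** stages)


def stages_needed(inverter_delay_ps, target_step_ps):
    """Smallest number of stages whose ideal step is at or below the target."""
    stages = 0
    step = inverter_delay_ps
    while step > target_step_ps:
        step /= 2  # each stage doubles the phase resolution
        stages += 1
    return stages


print(blended_step_ps(50, 1))  # 25.0
print(blended_step_ps(50, 3))  # 6.25
print(stages_needed(50, 5))    # 4
```

In this idealized picture, four stages would be needed to bring a 50-psec inverter step below 5 psec, which gives a feel for how quickly area and power grow as the target step shrinks.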
If you need a delay step greater than or equal to the buffer delay, and both the reference and delayed signals have similar transitions, you should select the buffer chain. The essential point is that the transitions of both signals must be similar (that is, both rising or both falling). If you need a delay step greater than or equal to the buffer delay but must generate skew for all possible transition combinations, you must use the DACEA approach. Although the area required will increase, you'll meet both the delay and the transition requirements.
If your requirement is to achieve a delay step equivalent to that of an inverter, you can select the complementary delay chain. The advantage here is that the rise and fall delays are matched. The disadvantage is that the circuit will consume even more area, because the number of buffers doubles. Dynamic power consumption will also increase, because the outputs of the two complementary inverters are virtually short-circuited through low-drive inverters. Finally, if your requirement is to achieve a skew step size smaller than an inverter delay, the best approach would be a phase blender.
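The selection guidance above can be condensed into a small decision function. This is only a sketch of the rules as stated in the text: the function name, its parameters, and the treatment of steps that fall between one inverter delay and one buffer delay are assumptions made for illustration.

```python
def choose_delay_chain(step_ps, inverter_delay_ps, buffer_delay_ps, all_transitions):
    """Suggest a delay-chain style for a required delay step.

    all_transitions is True when skew must hold for every combination of
    rising and falling edges, not only for like-to-like transitions.
    """
    if step_ps >= buffer_delay_ps:
        return "DACEA" if all_transitions else "buffer chain"
    if step_ps >= inverter_delay_ps:
        # Assumption: steps between one inverter delay and one buffer delay
        # are served by the inverter-step complementary chain.
        return "complementary delay chain"
    return "phase blender"


# Hypothetical cell delays: 50-psec inverter, 100-psec buffer.
print(choose_delay_chain(200, 50, 100, all_transitions=False))  # buffer chain
print(choose_delay_chain(200, 50, 100, all_transitions=True))   # DACEA
print(choose_delay_chain(50, 50, 100, all_transitions=True))    # complementary delay chain
print(choose_delay_chain(10, 50, 100, all_transitions=True))    # phase blender
```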
The divide-and-conquer approach
Highly accurate circuits have limited range; otherwise, they'd require excessive silicon area. By decreasing accuracy, your design can achieve higher range with fewer components. To achieve both of these benefits, you need to use a divide-and-conquer approach; that is, use both the high-range scale and the high-accuracy scale. The high-range scale uses a large step to provide high range with smaller area; each step of the high-accuracy scale equals a small portion of the step of the high-range scale.
This scheme has been successfully implemented in a block intended to measure the pulse width of a signal (references 2 and 3). The design employs the simple principle of counting the total number of edges of a high-frequency signal during the pulse period (Figure 7). This technique requires little circuitry, and the accuracy of the implementation is one time period of the high-frequency signal. This accuracy will improve as the frequency increases. But it is important to note that there is a practical limit on the maximum frequency that you can generate on-chip.
Suppose the maximum frequency generated is 1 GHz. Then, the maximum error would be 1 nsec. This error is too high, considering that you require accuracy of a few picoseconds. The solution to this problem lies in using a high-accuracy circuit.
You implement the high-accuracy circuit using an inverter-delay chain with a precision of one inverter-delay step (Figure 8). Suppose, for example, that the inverter delay is 50 psec; the accuracy will then be 50 psec. If, however, you want to measure a 50-nsec pulse, the circuit required will be large. (The delay chain itself requires 1000 inverters; imagine the size of the complete circuit!)
Instead, divide the problem into two parts. Both circuits operate concurrently, with the high-range circuit measuring the time by counting the number of edges and the high-accuracy circuit measuring the remaining error with an accuracy of 50 psec. You save a significant amount of area with this approach. (The delay chain would now require only 20 inverters, for example.) The result is a circuit with accuracy of 50 psec and range of 50 nsec; an even higher specification design would require little additional circuitry. You can employ a similar approach with any circuit requiring accurate high-range steps.
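The area saving can be reproduced with simple arithmetic. The sketch below uses the article's example values (a 1-GHz clock, a 50-psec inverter step, and a 50-nsec range); the function names and the idealized `measure_pulse` model, which assumes the pulse starts exactly on a clock edge, are illustrative assumptions rather than a description of the actual block.

```python
def inverters_needed(range_ps, step_ps):
    """Inverters needed for a chain to span range_ps in steps of step_ps."""
    return -(-range_ps // step_ps)  # integer ceiling division


def measure_pulse(pulse_ps, clock_period_ps, step_ps):
    """Idealized two-scale measurement.

    Returns (coarse count, fine count, estimated width in ps).
    """
    coarse = pulse_ps // clock_period_ps           # high-range scale: whole clock periods
    residue = pulse_ps - coarse * clock_period_ps  # part the counter cannot resolve
    fine = residue // step_ps                      # high-accuracy scale: inverter steps
    return coarse, fine, coarse * clock_period_ps + fine * step_ps


CLOCK_PERIOD_PS = 1_000  # 1-GHz clock
STEP_PS = 50             # one inverter delay
RANGE_PS = 50_000        # 50-nsec pulse

print(inverters_needed(RANGE_PS, STEP_PS))              # 1000
print(inverters_needed(CLOCK_PERIOD_PS, STEP_PS))       # 20
print(measure_pulse(12_345, CLOCK_PERIOD_PS, STEP_PS))  # (12, 6, 12300)
```

The fine chain only has to cover one period of the coarse clock, which is why 20 inverters suffice where a single flat chain would need 1000.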
Characterizing delay
Once the design is complete, you're never sure whether the silicon results will match the delay values from simulation. To ensure the correct calculation of skew for insertion or matching, you should have one mode in which all the similar delays can form an oscillating chain. If the delay you're characterizing is large enough to oscillate, you should make it oscillate by routing the output as feedback to the input in the self-characterization mode. Use this method for applications in which delay chains of similar elements exist for skew insertion or skew balancing. The output frequency will help in calculating the skew value (Figure 9).
It is important to implement the multiplexer so that its delay is small compared with the total delay of the chain. This delay can be equal to approximately one or two delay elements. In this way, the final calculation will have negligible error. To prove this point, suppose that you have a delay chain of length N with each delay element of delay D, and that the multiplexer has delay D', which can vary from D to 2D: Frequency=1/[2*(N*D+D')]. If you assume that D'=D but D' is actually equal to 2D (a maximum-error case): D_CAL=1/[F*(2N+1)], whereas D_ACTUAL=1/[F*(2N+2)]; ΔD=D_ACTUAL-D_CAL; and ΔD=-1/[F*(2N+1)*(2N+2)].
Now, analyze the situation by using practical values: Let D=50 psec. If N=100, then F=1/(50 psec*202)=99.01 MHz, and the magnitude of ΔD is 0.248 psec.
Thus, the error in the calculation of one delay is less than one-quarter of a picosecond, or 0.497% (negligible). For N=50, the results would be: F=196.08 MHz; ΔD=0.495 psec. Hence, the error in the calculation of one delay is 0.990% (still negligible). Although the expression for ΔD contains a 1/N^2 term for a given F, F itself falls as 1/N for a fixed element delay, so the error per element is approximately inversely proportional to N: in these examples, doubling N from 50 to 100 halves the error.
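The worked numbers can be reproduced with a few lines. The sketch below uses the calculated-delay and actual-delay expressions exactly as the text states them, and takes the oscillation frequency as 1/[D*(2N+2)], as in the worked values; the function name and the choice of units are assumptions.

```python
def characterization_error(n, d_ps):
    """Return (frequency in MHz, delta D in ps) for a chain of n elements."""
    f_per_ps = 1.0 / (d_ps * (2 * n + 2))       # oscillation frequency in 1/ps
    d_cal = 1.0 / (f_per_ps * (2 * n + 1))      # delay calculated assuming D' = D
    d_actual = 1.0 / (f_per_ps * (2 * n + 2))   # delay when D' is actually 2D
    delta_d = d_actual - d_cal
    return f_per_ps * 1e6, delta_d              # 1/ps equals 1e6 MHz


for n in (100, 50):
    f_mhz, delta_d = characterization_error(n, 50.0)
    percent = abs(delta_d) / 50.0 * 100.0
    print(f"N={n}: F={f_mhz:.2f} MHz, delta D={delta_d:.4f} ps ({percent:.3f}%)")
```

Running it for N=100 and N=50 shows the error roughly doubling when the chain length is halved, which is a quick way to judge how long the chain must be for a given error budget.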
If the oscillating frequency is too high to pass through the output pad (more than 500 MHz), you should use serial divide-by-2 circuits until the frequency at the output pad of the chip is well below the pad limit. Because the divide-by-2 circuit is positive-edge triggered, the output frequency will be exactly 1/2^n times the input frequency (where n is the number of divide-by-2 stages).
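Choosing the number of divide-by-2 stages is a short loop. The sketch below treats the 500-MHz figure from the text as a hard limit and keeps dividing until the output is strictly below it; the function name and the 2-GHz example frequency are hypothetical.

```python
def divider_stages(f_mhz, pad_limit_mhz=500.0):
    """Return (number of divide-by-2 stages, frequency at the pad in MHz)."""
    stages = 0
    while f_mhz >= pad_limit_mhz:
        f_mhz /= 2.0  # one positive-edge-triggered divide-by-2 stage
        stages += 1
    return stages, f_mhz


stages, f_pad = divider_stages(2000.0)
print(stages, f_pad)          # 3 250.0
print(f_pad * 2 ** stages)    # 2000.0, the oscillator frequency recovered off-chip
```

Because the division ratio is exactly 2^n, multiplying the frequency measured at the pad by 2^n recovers the oscillation frequency used in the delay calculation.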
Layout issues
The whole exercise of skew balancing and skew insertion at the design level is useless if you don't design the layout to take care of errors. EDA vendors have proposed various implementation flows, such as timing-driven design, to handle semicustom designs. But for full-custom designs in which you need accuracy on the order of picoseconds, these flows are of little help, because the designers must be mindful of the effects layout is introducing and effectively manage the skew insertion. The tools may, for example, place fewer vias than are optimal, leading to a large voltage drop. Also, most of the tools have design flows that fail to optimize routing for highly random placement.
One key recommendation is to effectively manage the placement of the cells. In many cases, cell placement is like simple block-fitting, but you must take care in areas in which timing is critical. This recommendation is particularly important in the case of delay chains. An example is a multiplexer used to tap a delay chain. If you randomly place the cells, the delays of different paths will be dissimilar. Conversely, an effective technique to balance the delays introduced by the multiplexers is to place the cells so that the multiplexers introduce identical delays.
An example consists of a delay chain tapped at various points (Figure 10). A 16-to-1 multiplexer consists of five 4-to-1 multiplexers. Due to the symmetrical placement of the multiplexers, various path delays match. A similar argument holds for the delay components in the chain. A random placement of the delay elements would cause the introduction of dissimilar delays at various nodes. With asymmetrical placement, the delay introduced from inverter 3 to inverter 4 differs from the delay introduced from inverter 4 to inverter 5 (Figure 11). Symmetrical placement, on the other hand, ensures that the delay step is minimized.
Another area of concern is the current consumption of circuits operating at extremely high frequency. You must therefore ensure that the number of vias available at the tapping points is sufficient for the anticipated current flow. The placement of the chain will also determine the current drawn from a single net. Assuming that the power rails are running horizontally, a horizontal placement of the delay elements would put the whole current load on a single net (Figure 11). If you place the inverters vertically, the current will be drawn through two nets, thereby decreasing the per-net current loading. This change would drastically reduce the voltage drop on the nets.
A resistive model clearly shows that if you place the cells horizontally, the current loading on the net is 2I, where I is the current drawn by a single delay element. If, on the other hand, you place the delay elements vertically, the current loading on a single net is I. Due to this increased current loading, it is also possible that if a series of current-consuming components sits on the same rail, the supply voltage will continue to drop as the components move away from the power source. It's possible that the component farthest from the power source will receive so little power that it will affect the function of the entire circuit. If you are trying to achieve high accuracy, the overall effect of this problem can be huge. You should conduct layout placement with power and skew in mind, using symmetry and optimum power-distribution methods.
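The cumulative drop along a shared rail can be illustrated with a simple resistive ladder. This is a rough sketch, not the article's resistive model: it assumes every cell draws the same constant current, that each rail segment between neighboring cells has the same resistance, and that the rail is fed from one end. The function name and all component values are hypothetical.

```python
def rail_voltages(vdd, segment_res_ohm, cell_current_a, n_cells):
    """Supply voltage seen by each cell along a rail fed from one end."""
    voltages = []
    v = vdd
    for k in range(n_cells):
        # The segment feeding cell k carries the current of cell k and of
        # every cell farther from the source.
        downstream_current = (n_cells - k) * cell_current_a
        v -= segment_res_ohm * downstream_current
        voltages.append(v)
    return voltages


# Hypothetical values: 1.2-V supply, 0.5 ohm per segment, 1 mA per cell.
one_rail = rail_voltages(1.2, 0.5, 1e-3, 4)   # four cells share one rail
two_rails = rail_voltages(1.2, 0.5, 1e-3, 2)  # the same cells split over two rails

print([round(v, 4) for v in one_rail])   # [1.198, 1.1965, 1.1955, 1.195]
print([round(v, 4) for v in two_rails])  # [1.198, 1.1985]
```

Under these assumptions, the farthest cell on the shared rail sees a 5-mV drop, compared with 1.5 mV when the same cells are split across two rails, which is the kind of difference that matters when the delay step itself is only a few picoseconds.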
Author Information
Naveen Tiwari and Ruchir Saraswat are design engineers at STMicroelectronics, where they are responsible for creating analog and digital blocks for on-chip test, debugging, and characterization of various macros. Tiwari holds a bachelor's degree in engineering (electronics and communication) and a master's degree in technology (signal processing). His interests include designing new macros to meet the challenges of the VLSI industry, playing cricket and chess, and spending time with his family. Saraswat holds a bachelor's degree in electrical engineering and a master's degree in technology (control systems). His interests include analog and digital design employing novel approaches to solving problems faced in industry, swimming, reading, and going for long drives.