Verifying complex clock and reset regimes in modern chips: the challenge and scalable solutions
Automation at every step and built-in tool intelligence offer the only practical path to scalable verification of modern chips.
Pranav Ashar and Vishnu Vimjam, Real Intent Inc -- EDN, January 24, 2011
In an environment where chip-design risks have ballooned to worrying levels, a verification methodology based on just linting and simulation does not cut it. Identifying specific sources of verification complexity and deploying automatic, customized technologies to tackle them in a surgical manner has an enormous benefit. The key words here are automatic and customized. At first glance, they don’t necessarily go hand-in-hand. Automatic has to do with maximizing productivity, that is, minimizing the number of steps an end user must go through in design setup, analysis, and debug. Customized has to do with making sure that the setup, analysis, and debug steps are as specific and comprehensive as they can be, to minimize the risk of a bug slipping through the cracks. Therein lies the challenge, not just for clock-domain verification but for the entire range of failure modes that can bring down a typical modern chip.
A partial listing of typical failure modes, in addition to clock-domain issues, includes bugs caused by complex control logic, timing-constraint errors, X-simulation inconsistencies, incorrect DFT structures, and power-management control. While clock-domain verification is not the only critical phase in a design tape-out, it certainly is a case in point. In modern chips, the complexity of clock-domain verification has grown tremendously for a number of reasons, as highlighted in the sections below.
A brief introduction to the CDC problem
CDC (clock-domain-crossing) bugs arise from a confluence of bad implementation, timing, and logic. As shown in Figure 1, if the signal crossing from one asynchronous domain to another arrives too close to the receiving clock edge, the receiving flip-flop can go metastable and the captured value is nondeterministic. These errors are very hard to detect and diagnose via simulation or in the lab. Unfortunately, they result in frequent field failures that are expensive to fix.
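The capture behavior described above can be made concrete with a toy timing model. The sketch below is a simplification rather than a circuit-accurate model: the function name `sample_at_edge` and the `t_setup` and `t_hold` parameters are illustrative assumptions, and a real flip-flop that goes metastable may also need extra time to settle.

```python
def sample_at_edge(old, new, t_change, t_edge, t_setup, t_hold):
    """Value captured by a flip-flop clocked at t_edge.

    The data input switches from `old` to `new` at time t_change.
    Returns None when the change falls inside the setup/hold window,
    meaning the captured value cannot be predicted.
    """
    window_start = t_edge - t_setup
    window_end = t_edge + t_hold
    if window_start < t_change < window_end:
        return None  # nondeterministic: flop may go metastable
    # Change settled before the window: new value is captured.
    # Change arrived after the window: old value is captured.
    return new if t_change <= window_start else old


# Illustrative numbers only (arbitrary time units).
print(sample_at_edge(0, 1, t_change=1.0, t_edge=5.0, t_setup=0.2, t_hold=0.1))
print(sample_at_edge(0, 1, t_change=4.9, t_edge=5.0, t_setup=0.2, t_hold=0.1))
```

Because the two clocks are asynchronous, `t_change` can land anywhere relative to `t_edge`, so the `None` case cannot be designed away by timing closure alone. It has to be contained by a synchronizer or control logic.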

Not long ago, the number of asynchronous domains was typically fewer than five, small enough that linting templates could be written for each crossing. Such template checks would run fast enough, and the false-failure noise in the report was tolerable because each reported failure could be analyzed manually. This is no longer the case today, for several reasons.
The number of asynchronous domains is now closer to 100 for high-end SOC designs that target performance or power efficiency. For one, the chip die is large enough that it is impossible to distribute the same fast clock to all parts. In addition, power management dictates that there be multiple VDD and clock regimes on the chip that can be turned on and off independently. Finally, it is just a reality today that an SOC looks more like a collection of subcomponents, each with its own clock, as shown in Figure 2, than like a single circuit in the conventional sense.

Designers are truly frustrated with template-based methods. They would like the intelligence to be built into the verification tools. For example, the reports from template-based analysis can be so voluminous that even a small percentage of noise is beyond the ability of the engineering team to analyze manually. The painful aspect is that a designer who is required to use a certain tool is also obligated to look through all failures reported by it. Worse, it is more than likely that a real failure will be missed in the process and the verification effort will come to naught.
Widely disparate and dynamically varying clock frequencies
Clock frequencies in communicating domains (asynchronous or synchronous) can differ by an order of magnitude today. In addition, some clock frequencies are allowed to vary dynamically based on throughput requirements or for power optimization. Analyzing the design for data integrity and data loss in domain crossings under all possible scenarios becomes nontrivial and cannot be done by linting alone.
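To see why the frequency ratio matters for data loss, consider how long a source-domain signal must be held so that at least one destination clock edge is certain to sample it. The sketch below is a simplified, idealized model (no jitter, no protocol logic); the function names and the `t_setup` and `t_hold` parameters are assumptions made for illustration, not part of any tool.

```python
import math


def min_hold_cycles(src_period, dst_period, t_setup=0.0, t_hold=0.0):
    """Source cycles a signal must be held to be safely sampled once.

    The signal has to stay stable for a full destination period plus
    the setup/hold window, whatever the phase between the two clocks.
    """
    needed = dst_period + t_setup + t_hold
    return max(1, math.ceil(needed / src_period))


def worst_case_hold_cycles(src_periods, dst_periods, t_setup=0.0, t_hold=0.0):
    """Worst case over every allowed pair of clock periods.

    Models clocks whose frequencies vary dynamically: each list holds
    the periods that domain may run at.
    """
    return max(
        min_hold_cycles(s, d, t_setup, t_hold)
        for s in src_periods
        for d in dst_periods
    )


# Fast source (1 or 2 time units) talking to a slow destination (4 or 8).
print(worst_case_hold_cycles([1, 2], [4, 8]))
```

A check that passes at one operating point can fail at another, which is why the analysis has to cover every combination of frequencies the design allows rather than a single nominal ratio.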
Proliferation of gated clocks
Control structures for power management use gated clocks extensively. Mode-specific gated clocks (for example, scan clocks) are also very common in modern chips. This situation introduces a multidimensional verification problem of the following nature:
- First, one must make sure that the clock setup is correct before verification is started. While a detailed analysis is nontrivial, it does have the advantage of pointing out errors in the clock distribution circuitry or inconsistencies in the environment specification. Either kind of error would invalidate the subsequent analysis, so a correct setup is a prerequisite for meaningful results.
- Second, in practice, gated clocks are introduced for optimization rather than as an original part of the functionality. One must ensure that the design operates correctly in the presence of gated clocks, that is, that the clock gating did not modify the functionality. The large amount of gated clocking, the nontrivial control circuitry used in gated clocks, and the likelihood that most of it is automatically inserted by a synthesis tool make the job complex.
- Third, because clock gating can be implemented in a variety of ways, glitches on gated clocks can also occur in a corresponding number of different ways. Glitches on clocks are insidious and some of the hardest bugs to diagnose. It is imperative to know about this possibility as early as possible. Given this variety of gated-clock types and their glitching potential, a template-based approach is again a recipe for loss of productivity and very slow analysis.
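The glitch risk in the last point can be illustrated with sampled waveforms. The sketch below compares two gating styles: a plain AND of clock and enable, and a latch-based gate in which the enable is only allowed to change while the clock is low. The latch-based style is a common industry practice brought in here for contrast and is not described in this article. All function names are illustrative.

```python
def gate_and(clk, en):
    """Plain AND gating: the enable reaches the clock pin directly."""
    return [c & e for c, e in zip(clk, en)]


def gate_latched(clk, en):
    """Latch-based gating: enable is captured only while clk is low."""
    out, held = [], 0
    for c, e in zip(clk, en):
        if c == 0:
            held = e  # latch is transparent during the low phase
        out.append(c & held)
    return out


def high_pulse_widths(wave):
    """Widths, in samples, of every high pulse in a 0/1 waveform."""
    widths, run = [], 0
    for v in wave:
        if v:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Clock: 4 samples low, 4 samples high, three periods.
clk = ([0] * 4 + [1] * 4) * 3
# Enable toggles while the clock is high (samples 6 and 14).
en = [1 if 6 <= t < 14 else 0 for t in range(len(clk))]

print(high_pulse_widths(gate_and(clk, en)))      # [2, 2]: two runt pulses
print(high_pulse_widths(gate_latched(clk, en)))  # [4]: one full-width pulse
```

A full clock-high phase is four samples here, so the two-sample pulses from the AND gate are glitches. This toy covers only one gating style and one failure pattern; each additional style has its own ways of glitching.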
Reset distribution
The implementation of power-up reset is more complex today because it is designed to optimize power and physical layout. As in the case of clocks, it is important to comprehensively verify the reset setup prior to subsequent analysis of the design.
Timing optimization
Timing optimization normally occurs in the synthesis phase and may be transparent to the designer. Optimizations such as retiming can violate basic design principles and introduce glitches at the gate level where there were none at the RT level, as illustrated in Figure 3. This situation highlights the need for glitch analysis to be an integral part of design verification. The possibility of glitches being introduced during synthesis also means that the verification tools for clock and reset analysis must be operable at the RT level as well as at the gate level. Template-based approaches make operation at the gate level difficult because it becomes necessary to develop multiple versions of each template for the various gate-level libraries. Also, the language used for RTL can be different from the language used at the gate level in the same design group, making it necessary for the verification tool to operate with Verilog, VHDL, or a combination of the two.
Clock distribution in deep submicron
As the limits of clock distribution are tested, previous second-order issues, such as clock jitter in data and control transfers, have increased in importance. This means that even crossings across synchronous domains that were previously deemed safe must now be designed carefully and verified comprehensively.
Examples of silicon re-spins caused by clock and reset related failures
There have been many examples of silicon re-spins as a result of not comprehensively verifying the issues highlighted above. The ones described below are quite revealing:
- An asynchronous reset control that crossed clock domains but was not synchronously de-asserted, causing a glitch in control lines to an FSM. As shown in Figure 4, the reset line is not synchronized. As a result, it cannot be guaranteed that the three flip-flops on the receive side will exit the reset state at the same time.

- An improper FIFO-based protocol controlling an asynchronous data transfer, resulting in a read-before-write operation leading to a functional failure. This is a corner-case situation that occurs when the FIFO is empty. The read logic must wait long enough for the written data to stabilize, taking into account the additional time required for the metastability effects at the asynchronous crossing to subside. A typical FIFO is shown in Figure 5.

- Reconvergence of synchronized non-gray-encoded control signals to an FSM, resulting in cycle jitter that in turn causes a transition to an incorrect state.
- Glitch in a logic cone on an asynchronous crossing path that was latched into the destination domain, resulting in corrupt data being captured. As shown in Figure 6, this is a simple error, but happens more often than it should.

- Gating logic inserted by back-end tools for power management, resulting in glitches on a clock. In Figure 7, if the clk1 and clk2 clocks are asynchronous, the gated clock can be glitchy.
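As an illustration of the fix for the first failure in this list, the sketch below is a behavioral model of a reset synchronizer that asserts asynchronously and de-asserts synchronously. It is a minimal sketch under stated assumptions: the class and method names are hypothetical, the reset is taken to be active-low, and the two-stage structure is the commonly used form rather than anything taken from Figure 4.

```python
class ResetSynchronizer:
    """Toy model: active-low reset, async assert, sync de-assert."""

    def __init__(self):
        self.ff1 = 0  # first stage, data input tied high
        self.ff2 = 0  # second stage, drives the synchronized reset

    def apply_async_reset(self):
        # Assertion clears both stages at once, independent of the clock.
        self.ff1 = 0
        self.ff2 = 0

    def clock_edge(self):
        # Called on a destination clock edge while the raw reset is released.
        # The constant 1 shifts through the two stages.
        self.ff2, self.ff1 = self.ff1, 1

    @property
    def rst_n(self):
        return self.ff2


sync = ResetSynchronizer()
sync.apply_async_reset()
trace = [sync.rst_n]
for _ in range(3):
    sync.clock_edge()
    trace.append(sync.rst_n)
print(trace)  # [0, 0, 1, 1]
```

Because the release of `rst_n` is aligned to a destination clock edge, all flip-flops that it drives leave reset in the same cycle, provided the synchronized reset meets timing at each of them.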

The verification problems highlighted in the above enumeration are not amenable to meaningful coverage via simulation. Nor are they solved effectively by template-based linting methods, as indicated earlier. As a result, the clock-domain verification problem has become a true show-stopper, and an effective solution is a must-have.
One approach to overcoming these challenges is based on a first-principles understanding of the failure modes involved and the deployment of a symbiotic combination of structural and formal methods to verify them comprehensively without loss of precision. As an example, structural and formal methods combine to get the clock and reset set up automatically. Similarly, these technologies combine to check for metastability-related errors, glitching potential, data integrity, and loss-of-data and signal-correlation issues. Some failures are flagged purely by structural analysis, while others require formal analysis for a more precise result.
The basic principles of CDC verification are described here. For all crossings (control, data, reset, gated clock, and so on), we must ensure the following:
- Downstream logic is protected from metastability using synchronizers or control logic.
- The correct value is captured on the receive side.
- Signal correlation is maintained.
- Glitches are not propagated.
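A structural check for the first principle can be sketched on a flop-level connectivity graph. The example below is a toy, not the algorithm of any commercial tool: the data layout (`clock_of`, `fanout`) and the function name are assumptions, and the only synchronizer style it recognizes is a two-flop chain with a single fanout.

```python
def unsynchronized_crossings(clock_of, fanout):
    """List crossings whose destination is not a two-flop synchronizer.

    clock_of: {flop name: clock domain name}
    fanout:   {flop name: [names of the flops it drives]}
    """
    issues = []
    for src, dsts in fanout.items():
        for dst in dsts:
            if clock_of[src] == clock_of[dst]:
                continue  # same domain, not a crossing
            next_stage = fanout.get(dst, [])
            # Accept only: the first flop drives exactly one flop,
            # and that flop is in the same (receiving) domain.
            synced = (
                len(next_stage) == 1
                and clock_of[next_stage[0]] == clock_of[dst]
            )
            if not synced:
                issues.append((src, dst))
    return issues


clock_of = {"a": "clk1", "s1": "clk2", "s2": "clk2",
            "b": "clk1", "c": "clk2", "d": "clk2", "e": "clk2"}
fanout = {"a": ["s1"], "s1": ["s2"], "b": ["c"], "c": ["d", "e"]}
print(unsynchronized_crossings(clock_of, fanout))  # [('b', 'c')]
```

The crossing into `c` is flagged because `c` fans out to two flops before the value has had a cycle to settle. A purely structural rule like this covers only the first principle; value correctness, signal correlation, and glitch freedom depend on the logic function and need further analysis.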
Beyond these basic principles, the details of how various types of crossing structures are recognized and how various checks are performed must be built into the tool’s intelligence, as in the case of Meridian-CDC. For example, the tool should be capable of automatically inferring the load-control and propagation-control mechanisms shown in Figures 8 and 9 for ensuring that data is correctly transferred. If a user has to explicitly describe to the tool every single type of circuit topology to look for, the verification battle is lost right there.
Real Intent’s first-principles approach infers designer intent in the crossings or clock/reset distribution network and, further, automatically infers checks required for that implementation style. Consequently, the company’s structural analysis runs orders of magnitude faster and removes the obligation on the designer to come up with the right templates and formal properties.
Formal methods analyze failures effectively under all possible design operations and also avoid having the user laboriously check each scenario separately. An example is the free-running-clock feature in Meridian that checks for data loss under all possible frequency ratios between the transmitting and the receiving domains. Meridian-CDC automatically infers the following checks that are formally analyzed with a built-in understanding of metastability effects:
- Busses are gray encoded.
- Pulse widths of control signals coming from the transmit domain are long enough.
- Transmitted data is stable for a sufficient number of cycles.
- Glitches cannot originate within the combinational logic directly feeding into a synchronizer.
While the first three formal checks help to reduce false positives, the glitch check, in fact, helps to reduce false negatives by allowing combinational logic to feed directly into a synchronizer if it is determined functionally that a glitch can never occur or does not corrupt the functionality even if it does occur.
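The reason for the gray-encoding check can be shown with a small enumeration. When a multi-bit value crosses domains, each changing bit may independently be captured as its old or its new value. The sketch below lists every value the receiver could see; the function names are illustrative, and the independent-per-bit model is a simplifying assumption.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def possible_captures(old, new):
    """Values a receiver may capture while the bus moves from old to new.

    Every bit that differs is allowed to resolve to either value.
    """
    diff = old ^ new
    changing = [b for b in range(diff.bit_length()) if (diff >> b) & 1]
    results = set()
    for mask in range(1 << len(changing)):
        value = old
        for i, bit in enumerate(changing):
            if (mask >> i) & 1:
                value ^= 1 << bit  # this bit already shows its new value
        results.add(value)
    return results


# Binary count 3 -> 4 flips three bits: any 3-bit value may be captured.
print(sorted(possible_captures(3, 4)))
# Gray-coded 3 -> 4 flips one bit: only the old or the new code appears.
print(sorted(possible_captures(to_gray(3), to_gray(4))))
```

With gray encoding, the receiver sees either the previous value or the next one, never an unrelated intermediate code. This is the property that makes pointer crossings in an asynchronous FIFO safe.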
The structural analysis in Meridian-CDC is linked into the linting phase of verification, as well as into simulation. Meridian-CDC automatically instruments simulation to model metastability and check for failures. In this manner, existing test benches can check for clock-domain issues in simulation with little runtime overhead. This feature significantly improves the likelihood that corner-case failures will be caught in simulation.
The final feature built into Meridian-CDC to enable runs at the full-chip or SOC level is automatic hierarchical analysis. With chip sizes exceeding 100M gates, even the most efficient structural and formal analysis algorithms find it difficult to scale to the full-chip level. Meridian performs CDC analysis in a bottom-up manner. As blocks get verified, their details are abstracted out, and only the shell model, with clock-association attributes on the I/O signals, is propagated up the hierarchy. This setup in turn enables users to focus on the CDC issues present at any desired level of hierarchy.
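The idea of a shell model can be sketched as follows. Each verified block is reduced to a map from its I/O ports to the clock domain associated with that port, and the parent level then only checks the connections between shells. This is an illustrative sketch under assumed data structures; the names are hypothetical and it does not represent Meridian's internal format.

```python
def top_level_crossings(shells, connections):
    """Find connections between ports tagged with different clocks.

    shells:      {block name: {port name: clock domain}}
    connections: [((src block, src port), (dst block, dst port)), ...]
    """
    crossings = []
    for (src_blk, src_port), (dst_blk, dst_port) in connections:
        src_clk = shells[src_blk][src_port]
        dst_clk = shells[dst_blk][dst_port]
        if src_clk != dst_clk:
            crossings.append(((src_blk, src_port), (dst_blk, dst_port)))
    return crossings


# Two already-verified blocks, abstracted to port-level clock tags.
shells = {
    "cpu": {"req": "clk_cpu", "ack": "clk_cpu"},
    "dma": {"req_in": "clk_bus", "ack_out": "clk_cpu"},
}
connections = [
    (("cpu", "req"), ("dma", "req_in")),
    (("dma", "ack_out"), ("cpu", "ack")),
]
print(top_level_crossings(shells, connections))
```

Only the `req` connection is reported, because the `ack` path stays within one domain. The work at each level grows with the number of ports and connections at that level rather than with the gate count of the blocks underneath.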
