Domino logic boosts performance
Intrinsity's FastCore techniques offer an alternative approach to implementing high-performance, low-power cores.
By Robert Cravotta, Technical Editor -- EDN, March 4, 2010
Intrinsity's Fast14 NDL (1-of-N domino logic) can increase the performance of a low-cost, standard RTL (register-transfer-level) core by 40 to 50% while maintaining cycle accuracy. The total design cost is approximately 90% less than that of a full-custom approach, and the design cycle takes less than a year.
Intrinsity's team of chip designers, using its Fast14 NDL, developed the 1-GHz, less-than-700-mW Cortex-A8 RTL FastCore, with cycle accuracy and Boolean equivalency to ARM's original Cortex-A8 golden specification. About one-fifth of the A8's functions benefit from the domino logic. Intrinsity's FastCore techniques offer an alternative approach to implementing high-performance, low-power cores.
The fastest processors, such as those from Intel and AMD, use fast dynamic- or domino-logic gates in a full-custom-design process. Resolving the timing and noise issues associated with dynamic logic in such designs requires large design teams, multiyear design cycles, and hundreds of millions of dollars.
Many SOC (system-on-chip) vendors license processor cores from ARM or MIPS. These cores use synthesized static-logic gates in an automated design process that provides moderate performance and quick design cycles with small teams of engineers. It is possible to increase the performance of these cores by implementing changes to the ISA (instruction-set architecture). However, this approach also requires substantial resources. ISA license fees are substantially higher than those for synthesized RTL cores. Optimizing the ISA is a multiyear process that typically requires large design teams and deep pockets. Changing the ISA destroys compatibility with the original core and requires the development of new test suites and software, which can add a year to the design cycle.
Synthesized static logic has more relaxed timing constraints than does domino logic. It allows as many as 15 to 20 gates per clock cycle, but each gate executes less logic than an NDL gate. An NDL gate can execute two to five times more logic per clock cycle than a synthesized static-logic gate by taking advantage of multiple out-of-phase clocks so that the system completes more work per system clock cycle.
However, domino-logic gates consume more power and silicon area than static-logic gates, so you must use domino logic judiciously. Intrinsity's NDL uses less power than traditional domino logic by representing the values zero through three with four wires, so that only one of the transistors driving the wires is powered at a time. The system represents zero as 0001, one as 0010, two as 0100, and three as 1000.
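The four-wire encoding described above can be sketched in a few lines of Python. This is only an illustration of the bit patterns listed in the text, not a model of Intrinsity's circuits; the function names `encode_1_of_4` and `decode_1_of_4` are hypothetical.

```python
def encode_1_of_4(value: int) -> str:
    """Return the four-wire pattern for a value in 0..3, highest wire first."""
    if not 0 <= value <= 3:
        raise ValueError("a 1-of-4 signal carries only the values 0 through 3")
    # Shift a single 1 into the position that corresponds to the value.
    return format(1 << value, "04b")


def decode_1_of_4(wires: str) -> int:
    """Recover the value from a four-wire pattern, rejecting invalid ones."""
    is_binary = len(wires) == 4 and not (set(wires) - {"0", "1"})
    if not is_binary or wires.count("1") != 1:
        raise ValueError(f"not a valid 1-of-4 pattern: {wires!r}")
    # The leftmost character is the highest wire (value 3).
    return 3 - wires.index("1")


if __name__ == "__main__":
    for value in range(4):
        pattern = encode_1_of_4(value)
        print(value, pattern, decode_1_of_4(pattern))
```

Every valid pattern has exactly one wire high, which is the property the text credits for the power saving. A plain two-bit binary code would carry the same four values on two wires, but with zero, one, or two wires high depending on the value.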
The Fast14 technology originally targeted multigigahertz applications, which allowed only one NDL gate per clock phase because that was the only way to guarantee the timing requirements. Originally, developers manually applied NDL gates to large sections of a design (full blocks or units) because of the complexity of interfacing the domino logic to the synthesized static logic in the chip. The new tools automatically size the NDL transistors to achieve the required timing and drive strengths.
Intrinsity developed a simplified interface between static and dynamic logic that enables more efficient transitions between dynamic and synthesized static logic and supports an automated placement flow. The simpler static/dynamic interface allows the use of NDL in smaller subcomponents, ameliorating power and area issues and giving designers the freedom to use small amounts of it in more areas of the chip. The ability to use smaller NDL subcomponents enables the FastCore design team to selectively use NDL only in the critical paths of the core, rather than in large blocks.
Intrinsity engineers used their circuit-design expertise to build a semiautomated timing-analysis flow employing the PrimeTime timing-analysis tool. The flow looks at all the paths in the design to identify those that share the most features: either common start points and endpoints or a common set of logic that the paths traverse. By speeding up the logic areas that affect the largest numbers of critical paths, the design team could speed up a large number of critical paths with a minimal amount of NDL, minimizing any die-size or power-consumption penalty.
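The idea of targeting the logic that the most critical paths share can be sketched as a simple greedy selection. This is a hypothetical illustration only: it is not Intrinsity's flow and does not use any PrimeTime interface, and the path names, element names, and `budget` parameter are invented for the example.

```python
from collections import Counter

# Hypothetical critical paths, each listed as the logic elements it traverses.
EXAMPLE_PATHS = {
    "p1": ["fetch_reg", "adder", "bypass_mux", "wb_reg"],
    "p2": ["decode_reg", "adder", "bypass_mux", "wb_reg"],
    "p3": ["decode_reg", "shifter", "bypass_mux", "flags_reg"],
    "p4": ["issue_reg", "compare", "flags_reg"],
}


def rank_shared_logic(paths):
    """Count, for each element, how many critical paths pass through it."""
    counts = Counter()
    for elements in paths.values():
        counts.update(set(elements))
    return counts.most_common()


def pick_conversion_targets(paths, budget):
    """Greedily choose up to `budget` elements, each time taking the one
    that lies on the most critical paths not yet covered."""
    uncovered = {name: set(elements) for name, elements in paths.items()}
    chosen = []
    while uncovered and len(chosen) < budget:
        counts = Counter()
        for elements in uncovered.values():
            counts.update(elements)
        if not counts:
            break  # the remaining paths list no elements at all
        # Sorting first makes ties resolve alphabetically.
        best, _ = max(sorted(counts.items()), key=lambda item: item[1])
        chosen.append(best)
        uncovered = {n: e for n, e in uncovered.items() if best not in e}
    return chosen, sorted(uncovered)


if __name__ == "__main__":
    print(rank_shared_logic(EXAMPLE_PATHS)[0])
    print(pick_conversion_targets(EXAMPLE_PATHS, budget=1))
    print(pick_conversion_targets(EXAMPLE_PATHS, budget=2))
```

In this made-up data, one element lies on three of the four paths, so converting that single element would touch most of the critical paths. A real flow would also have to weigh how much slack each conversion recovers, which this sketch ignores.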
Intrinsity engineers determined that the lower clock-frequency targets for the ARM Cortex families of cores allow the use of four to 12 gates per clock phase while still meeting the timing requirements. Increasing the number of NDL gates in each clock phase contributes to achieving the 1-GHz clock rate of the Cortex-A8 FastCore in a 45-nm LP (low-power) process. That rate is about 40% more speed than you can achieve in a 45-nm LP process using only synthesized static logic.
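As a rough check on what those figures imply, the short calculation below backs out the clock rate of a static-only design. It assumes that "about 40% more speed" means the FastCore clock is 1.4 times the static-only clock; the article does not state the static-only figure, so the result is an inference and not a published number. The function names are hypothetical.

```python
def static_only_clock_mhz(fastcore_mhz: float, speedup_fraction: float) -> float:
    """Clock rate implied for the static-only design, assuming
    fastcore_mhz = static_mhz * (1 + speedup_fraction)."""
    return fastcore_mhz / (1.0 + speedup_fraction)


def cycle_time_ns(clock_mhz: float) -> float:
    """Clock period in nanoseconds for a clock rate given in megahertz."""
    return 1000.0 / clock_mhz


if __name__ == "__main__":
    fastcore = 1000.0  # the 1-GHz FastCore clock, in MHz
    static = static_only_clock_mhz(fastcore, 0.40)
    print(f"static-only clock: about {static:.0f} MHz")
    print(f"cycle time: {cycle_time_ns(fastcore):.2f} ns versus "
          f"{cycle_time_ns(static):.2f} ns")
```

Under that assumption, the static-only design would run at roughly 714 MHz, with a cycle time of about 1.4 ns against 1 ns for the FastCore.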





















