Azuro offers up to 30 percent clock frequency bonus with new clock optimization tool
Ever since we stopped getting an automatic 50 percent boost in clock frequency with each new process node, there has been increasing attention on design tools that can get more speed out of a given netlist and process, without driving power through the roof or yield through the floor. One of the first victories in this quest, now widely accepted as a necessary part of a sub-130 nm flow, was physical synthesis. Physical tools could perform a number of improvements on the design, including logic optimizations, reordering of the logic, and changes based on early estimates of post-placement timing, during the synthesis process.
Another line of research has led to a much more radical, and as yet not widely accepted, design style: self-timed logic design. In this approach, logic paths have means of indicating when a signal has arrived at the register input. So rather than clocking all the registers off a common clock with theoretically zero skew, and then turning the clock down until signals have had time to propagate through the slowest logic path in the design in one clock cycle, you just let everything run at its natural speed. Each stage of logic, in effect, waits until all its inputs are valid, does its work, and passes the results on to the next stage as they are available.
In principle, self-timed logic should be a huge benefit. Running everything at its natural speed could save power, reduce noise, eliminate clock circuitry altogether except at the periphery of the block, eliminate sensitivity to process variations, and give faster overall performance. In practice, though, self-timed logic lacks commercial synthesis tools, breaks most verification and analysis tools, and even when hand-crafted can erase its own benefits with overhead circuits. And it’s darned intimidating.
Now Azuro, a little EDA vendor with a strong reputation for its PowerCentric clock-tree synthesis and clock power optimization tool, has found a middle path between the rigors of conventional RTL design and the idealized but often impractical world of self-timed logic. And it all starts with one of those "why didn’t I think of that" ideas that hand-crafters have used for years to tweak recalcitrant synchronous circuits into spec.
It’s brilliantly simple: instead of clocking all the registers at the same time, why not adjust the clock skew on each register so that the clock arrives one setup time, plus guardbanding, after the valid logic signal? You get many of the benefits of self-timed logic, and you relax many of the constraints that make the design of balanced clock trees nearly impossible in small geometries. But the circuit remains fully synchronous and subject to normal tool flows: conventional verification still applies, since you aren’t changing the logic, and so does static timing analysis if you use Propagated Mode. That is the idea behind Azuro’s new tool, Rubix.
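For readers who want the idea in symbols, here are the textbook setup and hold conditions for one register-to-register path with intentional skew. This is a generic sketch, not Azuro's formulation, and the symbols are assumptions introduced here: $c_i$ and $c_j$ are the clock arrival times at the launching and capturing registers, $D_{ij}$ and $d_{ij}$ are the longest and shortest path delays between them (clock-to-output delay folded in), $T$ is the clock period, and $t_{su}$ and $t_h$ are the setup and hold times.

```latex
\begin{align}
  c_i + D_{ij} + t_{su} &\le c_j + T   && \text{(setup)} \\
  c_i + d_{ij}          &\ge c_j + t_h && \text{(hold)}
\end{align}
```

With a balanced tree, $c_i = c_j$ and the setup condition reduces to $T \ge D_{ij} + t_{su}$ on every path. Making $c_j$ later than $c_i$ gives a long path more than one period to settle, and the hold condition limits how far that can be pushed.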
As always, building the tool is a lot more complex than having the bright idea. Rubix actually works by identifying what Azuro CEO Paul Cunningham calls chains: paths that start at an input or at a junction where feedback enters the path, and continue on until they reach an output or a feedback junction. The tool then traverses these chains, watching the path delays between each pair of registers in the chain. Then, to pass on the simplified-for-an-editor explanation Paul offered, the tool starts with the largest path delay—the critical path—in the chain. It looks at the delay on the path preceding the critical one, and if there is slack, it skews the clock on the register at the beginning of the critical path to arrive early, using up some of the available slack. Then it looks at the path after the critical one. If there’s slack there, Rubix delays the clock to the register at the end of the critical path. The result of these two intentional clock skews is that there is now more than one clock period available for signals to traverse the critical path. So you can turn the clock up a bit and still meet timing. Once you have created slack in what was the critical path, you find the new critical path and repeat, until you have used up all the slack you can in the chain.
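The borrow-and-lend loop described above can be sketched in a few lines of Python. This is only an illustration of the simplified explanation, not Rubix's actual algorithm: it assumes a single linear chain whose two end registers keep a fixed clock, ignores setup and hold margins and the cost of building the clock network, and splits slack evenly between neighbouring paths. The names `effective_delays` and `borrow_slack` are hypothetical.

```python
def effective_delays(delays, offsets):
    """Time each path needs once launch/capture clock offsets are counted.

    Path k runs from register k to register k + 1. A positive offset means
    that register's clock arrives later than nominal.
    """
    return [d + offsets[k] - offsets[k + 1] for k, d in enumerate(delays)]


def borrow_slack(delays, tol=1e-6, max_iters=100_000):
    """Skew interior registers until no path can borrow from a neighbour."""
    n = len(delays)
    offsets = [0.0] * (n + 1)  # assumption: end registers stay at offset 0
    for _ in range(max_iters):
        eff = effective_delays(delays, offsets)
        # Visit the worst (critical) path first, then the next worst, etc.
        for k in sorted(range(n), key=lambda i: eff[i], reverse=True):
            moved = False
            # Preceding path has slack: clock the launch register earlier.
            if k > 0 and eff[k] - eff[k - 1] > tol:
                delta = (eff[k] - eff[k - 1]) / 2
                offsets[k] -= delta
                eff[k] -= delta
                eff[k - 1] += delta
                moved = True
            # Following path has slack: clock the capture register later.
            if k < n - 1 and eff[k] - eff[k + 1] > tol:
                offsets[k + 1] += (eff[k] - eff[k + 1]) / 2
                moved = True
            if moved:
                break  # re-evaluate the critical path from scratch
        else:
            break  # no path could borrow any more slack
    return offsets


if __name__ == "__main__":
    delays = [8.0, 10.0, 6.0]  # made-up path delays in ns
    offsets = borrow_slack(delays)
    print("clock offsets:", [round(o, 3) for o in offsets])
    print("effective delays:", [round(e, 3) for e in effective_delays(delays, offsets)])
```

In this toy model the effective delays drift toward a common value, because every borrowing step lowers the worst path by exactly what it adds to a neighbour.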
The bottom line, according to Cunningham, is that after this optimization the maximum clock frequency is set by the average path delay in a chain, not the maximum path delay. (This actually becomes obvious, if you draw some pictures of logic chains and borrow and lend skew across the paths in the chain for a while.) So with no impact on the logic, the output of the physical synthesis tool, or the downstream analysis or place-and-route tools, you can improve fmax by up to 30 percent. The Rubix tool runs more or less automatically, so there is little required of the user beyond understanding what is going on.
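The average-versus-maximum claim can be checked with a short derivation under idealized assumptions (these are not Azuro's stated model). Take a chain of $n$ paths with delays $D_1, \dots, D_n$ (setup time folded in) between registers $0, \dots, n$, let $c_k$ be the clock arrival time at register $k$, fix the endpoints at $c_0 = c_n = 0$, and ignore hold constraints and clock-tree cost.

```latex
\begin{align}
  c_{k-1} + D_k &\le c_k + T, \qquad k = 1, \dots, n \\
  \sum_{k=1}^{n} D_k + c_0 - c_n &\le nT
    \quad\Longrightarrow\quad
    T \ge \frac{1}{n} \sum_{k=1}^{n} D_k
\end{align}
```

The interior skews cancel when the constraints are summed, so only the average delay remains. With zero skew everywhere the bound is instead $T \ge \max_k D_k$. For made-up delays of 8, 10, and 6 ns, that is a 10 ns period (100 MHz) without skew versus 8 ns (125 MHz) with it, a 25 percent gain in this idealized case.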
The process is far less problematic than trying to generate balanced clock trees, Cunningham maintains, especially in light of the added complexities of clock tree optimization below 90 nm. Rubix actually could reduce the clock network’s sensitivity to random variations by spreading slack around among the paths, for instance. Given the success of PowerCentric in generating optimized clock trees, one is tempted to take Cunningham’s word on the matter.
There are a few complexities, some of which Rubix deals with and some of which are left to the user. One problem, which turned out to be thorny enough to require a lot of development time on the tool, is that if you aren’t very careful, "it can be quite hard to keep the clock network from growing rapidly" as you do the optimization process, Cunningham said. Recognizing how to produce all the clock skews you need, in the places you need them, without creating excess power dissipation or congestion in the clock nets is non-trivial. But Rubix addresses that problem.
One issue left partially to the user is design-for-test. "Not having balanced clock trees can mess up scan-chain stitching," Cunningham explained. "So we included a scan-chain reordering capability in Rubix." Apparently this process requires some intervention, and obviously it would affect downstream test insertion, as well as any late-flow changes made by DfT designers.
For the future, Cunningham speculates that Rubix could not only be used to achieve a greater fmax on a given design, but it could be used to create enough slack at a given frequency to allow turning down the supply voltage, thereby making significant savings in dynamic and static power. But Azuro hasn’t really looked into that application yet.
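The reason a lower supply voltage pays off so well shows up in the usual first-order expression for CMOS switching power. This is a generic textbook relation, not a figure from Azuro; the symbols $\alpha$ (switching activity), $C$ (switched capacitance), $V_{DD}$ (supply voltage), and $f$ (clock frequency) are introduced here for illustration.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{DD}^{2} \, f
```

At a fixed frequency, dynamic power falls with the square of the supply voltage, so slack that is traded for a modest voltage reduction is converted into a disproportionate power saving. Static power depends on voltage through leakage mechanisms that this expression does not capture.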
Rubix has been in the hands of some early users for some time now, and is scheduled for general availability in April. At that point it could add an interesting new dimension to the quest for just a little bit more speed.