October 8, 1998
The million-gate march:
Tackling today's tough ASIC challenges
Increasing chip density is putting a strain on design engineers. Adhering
to a good design-reuse methodology and working within a close-knit design team allow
engineers to complete million-plus-gate designs.
Roy Shanks and Jon Levi, Cadence Spectrum Design Center
Because of greater silicon density and chip-fabrication advances, designs are becoming
more complex. Design methodologies are also improving to keep pace with the amount of
available silicon for logic implementations. For designers to complete complex chip
designs within a reasonable time, reuse has become necessary.
A complete design flow starts at the architectural level and continues through the
physical database (Figure 1). This design process requires
a rich and complex environment for managing large amounts of design data, library models
for many libraries, technology files, and other information. For complex submicron
designs, such as systems-on-chip (SOC), a hierarchical block-based design approach has
many advantages.
Project organization
One way to view the submicron-design environment is to consider a design flow covering
the entire ASIC design. The design community is aware of the changing role of gates and
interconnect in chip designs. Interconnect is now 70% or more of the propagation delay of
paths on many chips and has a large impact on power dissipation.
The number of tasks for designing a million-gate ASIC is beyond the abilities of one
individual. Design organizations must form teams to handle the myriad tasks that require
special expertise and experience. These teams have five major functions: architecture and
logic design, logic verification, design implementation, timing design and analysis, and
circuit design. One approach is to form a project team with members possessing the skills
to cover each of these functions. Layout and physical verification require special
expertise, and the team needs transistor, process, and design expertise for timing
analysis.
Customer specifications drive architectural evaluation and design definition (Figure 2). You perform architectural modeling to ensure
system functionality and performance. You then review full-design functionality and
develop a top-level functional-block diagram. Next, you expand the design hierarchically
one level at a time, with each level providing greater detail.
The microfunctions you define in this way may be previously designed, customer-provided
functions or industry-standard functions. In this definition phase, you must develop or
acquire specifications for microblocks. You write or acquire a behavioral-level
description of each microblock and create a chip-level description to link functionality.
To simulate the chip's function and begin implementation, write a behavioral model of
the system. You then define and review design-implementation trade-offs. Next, you do a
chip-level verification, including all protocols and interface conditions. Most errors
occur during exception conditions, which happen outside normal chip operation. Coverage of
these conditions is an engineering judgment unless a set of verified legacy tests exists.
For functional verification, you develop an RTL description of each block, along with the
testbenches to verify blocks. You synthesize blocks from RTL code to a gate-level
representation. Use constraint files, wire-load models, and synthesis scripts to
accelerate the iterative process. Blocks undergo verification using testbenches at the RTL
and gate levels. Designers normally use Verilog throughout these design steps, although
some customers may want to use VHDL.
For acquired blocks, that is, external intellectual property (IP) or cores, the scenario
is more complex (Figure 3). In these cases, you must
obtain proper models, some of which may come in encrypted formats. If the vendor has
designed these cores without Virtual Socket Interface Alliance (VSIA) guidelines, you
need to design glue logic to provide an interface between the block and the bus.
Sometimes, core providers do not design cores in your target technology, thus requiring
you to migrate them. This fact does not impact functional models; however, you must modify
or create timing models for the migrated block.
As you complete block design, you perform chip-level block integration and
floorplanning. A design of this complexity represents an SOC, and the verification
environment is complex. At the chip level, you must first independently verify the
functionality of each block and then verify this functionality with bus-level activity or
protocols. In some cases, using bus-level models can accelerate the early stages of
verification. You have to generate large numbers of test-vector sets for debugging to
perform chip-level verification. This task can be time-consuming and can use many
resources.
Verification follows two paths: Behavioral simulation ensures system functionality;
gate-level simulation both ensures that the implementation matches behavior and provides
timing information. Matching gates and behavior is typically error-prone, because
gate-level representations include all state information, but behavioral-level
representations may not.
Test coverage is only as good as the vectors you use. If you add test logic, such as
JTAG boundary scan or internal scan, functionality is unavailable in the behavioral model,
so you can't run generated vectors from automatic-test-pattern generation (ATPG) against
the behavioral model. New formal-verification software provides a method of verifying a
design's RTL representation against the design's synthesized gate-level representation.
Functional verification continues until it covers all possible operating conditions.
You can also accomplish this design-confidence level by randomly generating operations and
fixing bugs until the bug rate reaches a predetermined number, such as one bug per week or
none in the last five days.
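The exit criteria above can be checked mechanically against a bug log. The following is a minimal sketch, assuming bugs are logged by date; the function names and the sample log are hypothetical and serve only to illustrate the two example thresholds in the text.

```python
from datetime import date, timedelta

def bugs_in_window(bug_dates, today, days):
    """Count bugs logged in the `days` days up to and including `today`."""
    start = today - timedelta(days=days)
    return sum(1 for d in bug_dates if start < d <= today)

def may_stop(bug_dates, today, max_per_week=None, quiet_days=None):
    """Apply one of the two example exit criteria: a weekly bug-rate
    ceiling, or a number of consecutive days without a new bug."""
    if max_per_week is not None:
        return bugs_in_window(bug_dates, today, 7) <= max_per_week
    if quiet_days is not None:
        return bugs_in_window(bug_dates, today, quiet_days) == 0
    raise ValueError("choose max_per_week or quiet_days")

# Hypothetical bug log, for illustration only.
bug_log = [date(1998, 9, 20), date(1998, 10, 1), date(1998, 10, 6)]
today = date(1998, 10, 8)
print(may_stop(bug_log, today, max_per_week=1))
print(may_stop(bug_log, today, quiet_days=5))
```

The two criteria can disagree on the same log, which is why the team should agree on one of them before verification starts.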
At the gate-level timing stage, you perform initial timing evaluation using estimated
interconnect-delay values, based on typical interconnect length and fan-out considerations
(Figure 4). To facilitate timing convergence, you need
more accurate estimations as early as possible. You can get these estimations from
floorplanning tools or from fast-turnaround place-and-route runs at either the design's
block level or the chip level for interconnect delay between blocks.
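As a rough illustration of a fan-out-based estimate, the sketch below looks up an estimated net capacitance from a wire-load table and multiplies the total load by the driver resistance. The table values, the extrapolation slope, and the function names are all assumptions made for the example; real wire-load models come from the library vendor or from floorplanning data.

```python
# Hypothetical wire-load table: fan-out -> estimated net capacitance (pF).
WIRE_LOAD_PF = {1: 0.02, 2: 0.04, 3: 0.06, 4: 0.09}
SLOPE_PF_PER_FANOUT = 0.03  # assumed extrapolation slope beyond the table

def estimated_net_cap_pf(fanout):
    """Return the table entry, or extrapolate linearly past the last entry."""
    if fanout in WIRE_LOAD_PF:
        return WIRE_LOAD_PF[fanout]
    max_fanout = max(WIRE_LOAD_PF)
    return WIRE_LOAD_PF[max_fanout] + SLOPE_PF_PER_FANOUT * (fanout - max_fanout)

def estimated_delay_ns(drive_res_kohm, fanout, pin_cap_pf):
    """First-order delay: driver resistance times (wire + pin) load.
    kohm * pF gives nanoseconds."""
    load_pf = estimated_net_cap_pf(fanout) + fanout * pin_cap_pf
    return drive_res_kohm * load_pf

print(estimated_delay_ns(2.0, 2, 0.01))
```

Because the table knows nothing about placement, two nets with the same fan-out get the same estimate regardless of their real length, which is why the text calls for floorplan-based numbers as early as possible.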
You can run a second verification path, capturing functionality, in parallel
with the behavioral and RTL verification. You capture the design using a language such as
C. In this way, the team follows two parallel paths from specification through RTL
representation. Compare the two solutions in the chip-level bus environment for a final
verification. Although testing takes place in a chip-level environment, you verify blocks
one at a time using chip-level testbenches that you generate. It is impractical to
generate these vectors at both the chip and the block levels. Verification tasks include
developing or updating block models, developing the chip-level verification environment,
and developing C testbenches to verify the design. You model chip-level blocks in C code.
The C-code approach adds one layer of verification to the process because it is unlikely
that both the RTL-code and the C-code writer can make the same mistake in interpreting a
specification.
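The idea of two independently written models checking each other can be sketched in a few lines. Here Python stands in for both the C model and the RTL; the 8-bit saturating adder, the function names, and the random-vector comparison are hypothetical examples, not the authors' flow.

```python
import random

def reference_model(a, b):
    """Stand-in for the C model: 8-bit saturating add, written from the spec."""
    return min(a + b, 255)

def rtl_like_model(a, b):
    """Stand-in for the RTL: 9-bit add, then saturate when the carry bit is set."""
    total = (a + b) & 0x1FF
    carry = total >> 8
    return 0xFF if carry else total & 0xFF

def compare_models(n_vectors, seed=0):
    """Drive both models with the same random vectors and collect mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_vectors):
        a, b = rng.randrange(256), rng.randrange(256)
        if reference_model(a, b) != rtl_like_model(a, b):
            mismatches.append((a, b))
    return mismatches

print(len(compare_models(1000)))
```

Any mismatch points to a misreading of the specification in one of the two models, and the failing vector shows where to look.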
Physical design
Early in the design cycle, chip- and block-level netlists are typically unavailable,
but with a hierarchical design structure, physical designers using the architectural
specification for functional-block, pin, and bus information can begin floorplanning using
estimated block sizes and form factors. They can do this floorplanning in parallel with
logic design early in the design cycle. If an earlier version of this or a similar design
exists, you can use the earlier design as an initial floorplan.
Blocks in the design may come from different sources. Some may exist at the physical
level in a different technology, and others may exist at the RTL. Still others may be new
designs originating at a specification level. You can scale blocks previously implemented
in a different technology to the target technology and use them for an initial block-area
estimate for floorplanning. For those blocks at the physical, netlist, or RTL level,
you can start block layout early in parallel with logic design.
Block retargeting at the physical level requires a design flow different from that of
nonphysical-layout blocks. Physical and logic designers must consider placement, area,
block form factors, chip- and block-level pinout, busing definitions, and timing issues.
Physical designers share timing information from floorplanning or block-layout activities
with logic designers.
Block development begins with area and pinout estimates. With this information,
designers create mock Library Exchange Format (LEF) files for each block that they can use for
chip floorplanning. LEF is a Cadence (www.cadence.com)
design-tool format. In most cases, routing across blocks is not viable: routing at the
chip level is a block-routing exercise, requiring knowledge of block-routing porosity.
Mock LEFs permit early estimation of signal-path delays between blocks. This estimation
requires a floorplanner or routing tool to generate a block-level interconnect at the chip
level for initial global-timing data. As the internal block structure becomes better
defined, you can expand timing analysis. Using timing-driven layout, you place and route
each block, so that the blocks comply with timing constraints in the
design-exchange-format (DEF) file, another Cadence design-tool format.
You need to define your clock strategy early and refine it as the design progresses.
Sometimes, the customer defines clock strategy. One common method is to use a clock-grid
approach, sometimes called a "big-bang" clocking scheme. Another method is to
use a balanced clock tree. If any timing discrepancies in the clock-tree approach remain
after synthesis, you can resolve them with less manual effort than you need for the
big-bang clock method. The clock strategy must recognize the block-based architecture of
the physical design. The clock specification provides requirements for maximum skew and
for minimum and maximum insertion delay. You must consider clock-distribution power
dissipation when setting clock strategy and monitor the estimated power during
implementation. Designs with more than one clock domain require manual tuning to minimize
skew between domains. You often vary metal clock-line widths to manage clock RC delays or
to make a line less susceptible to process variations.
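A clock specification of this kind lends itself to an automated check. The sketch below assumes you already have a per-block clock insertion delay from a timing tool; the block names, the numeric limits, and the function name are hypothetical.

```python
def check_clock_spec(insertion_ns, max_skew_ns, min_insertion_ns, max_insertion_ns):
    """Compare per-block clock insertion delays with the clock specification.
    Returns the measured skew and a list of violated requirements."""
    earliest = min(insertion_ns.values())
    latest = max(insertion_ns.values())
    skew = latest - earliest
    violations = []
    if skew > max_skew_ns:
        violations.append("skew")
    if earliest < min_insertion_ns:
        violations.append("min_insertion")
    if latest > max_insertion_ns:
        violations.append("max_insertion")
    return skew, violations

# Hypothetical insertion delays (ns) for three blocks.
delays = {"blk_a": 1.20, "blk_b": 1.35, "blk_c": 1.55}
skew, violations = check_clock_spec(delays, max_skew_ns=0.25,
                                    min_insertion_ns=1.0, max_insertion_ns=1.5)
print(violations)
```

Running the check after each floorplan or layout iteration makes clock problems visible while they are still cheap to fix.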
The ability to make incremental changes in the late stages of the design cycle is a key
element in maintaining schedules. Once you place and route a design with most timing paths
within specification, you can save redesign and verification time if you replace or
redesign only the impacted portion of the design. In this way, you do not disturb paths
that already meet timing specifications. Even if this process requires manual editing of
the synthesized netlist, you can verify the edited version of the netlist against the same
testbench you use for RTL verification to keep databases synchronized. This process works
for modifications involving functional changes or changes in buffer sizes.
One area of concern is timing closure: getting the physical design's timing to match
synthesis timing. In the past, designers used a cycle that included completing a layout
and running a timing analysis on the extracted layout. Results would prompt another
layout, and this situation could continue for a long time, depending on how far the design
pushes the process technology. Timing-driven design addresses this loop up front by
creating a methodology for taking results of prelayout timing and using these results as
constraints on the place-and-route tool. The price you pay is a slightly increased area,
because you are not necessarily using an optimum placement. However, schedule improvement
because of the reduced number of iterations is worth the price. For designs that use
aggressive clock strategies (for example, designs with clocks as high as 450
MHz), you can use a timing-driven-layout methodology.
Placement is critical in timing-driven-layout designs. Such a design can create a
timing-gridlock condition, with paths so tightly grouped that a small change in clock
frequency can cause large numbers of paths to fail timing specifications. Unless the
place-and-route tool has timing-driven capability, these designs are difficult to execute.
You can implement buffer sizing using a company's proprietary program that increases
buffer sizes to improve performance and decreases them to decrease power.
Today's large designs require multiple timing methods. Because IP blocks can come from
multiple sources, they also take many forms: soft, firm, and hard cores, according to the
definitions of the VSIA. Some blocks are RTL representations, some are netlists, and some
are physical blocks. Timing requires cell- and transistor-level tools. If the IP vendor
characterizes the block for your target technology, then the block may have a timing
model. In some cases, no block-timing model is available. For these situations, you need a
transistor-level timing model. This model may be a Spice model for some critical
asynchronous-style logic, or it may be a timing-analysis- or -simulation-tool model that
works at a pseudo-Spice level.
Whatever the timing method, you have to create a router timing model so that you can
place constraints on the layout. You describe timing requirements to the place-and-route
tool, and the tool works to meet these constraints. The tool then flags paths that fail
the timing constraints. Most often, this failure results from improper cell-drive
strength. Synthesis tools can adjust drive strengths to meet physical layout. You often
need repeater cells on long runs to reduce RC delays. You must characterize each
technology with the target library to find the crossover point at which inserting a
repeater is better than just increasing drive strength. Handle any adjustments to the
circuit as an engineering change order (ECO): Only the portion of the design that has
problems should change. Incremental-timing results tell you that if the design was good
before, then it should still be good after the ECO; don't start over again. Another factor
to consider is buffer size. If the initial buffering selection is for the smallest buffer,
problems may occur later when buffers get bigger, forcing a new placement. However, if you
initially use the largest buffer and then downsize buffers as appropriate, area use may
not be optimum.
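The crossover between upsizing a driver and inserting a repeater can be estimated with a first-order Elmore delay model. The sketch below is only an illustration: the per-millimeter wire values, the repeater input capacitance and intrinsic delay, and the assumption that the repeater has the same drive resistance as the original driver are all hypothetical, and real characterization uses the target library and process data.

```python
# All electrical values are hypothetical; ohm * fF gives femtoseconds.
R_PER_MM = 400.0   # wire resistance per mm (assumed)
C_PER_MM = 200.0   # wire capacitance per mm, in fF (assumed)

def wire_delay_fs(length_mm, r_drv_ohm, c_load_ff):
    """Elmore delay of a driver, a distributed RC wire, and a load."""
    r_wire = R_PER_MM * length_mm
    c_wire = C_PER_MM * length_mm
    return r_drv_ohm * (c_wire + c_load_ff) + r_wire * (c_wire / 2 + c_load_ff)

def upsized_delay_fs(length_mm, r_drv_ohm, c_load_ff, size):
    """Driver made `size` times stronger; the wire itself is unchanged."""
    return wire_delay_fs(length_mm, r_drv_ohm / size, c_load_ff)

def repeated_delay_fs(length_mm, r_drv_ohm, c_load_ff,
                      c_rep_in_ff=10.0, t_rep_fs=30000.0):
    """One repeater at the midpoint, with the same drive as the original driver."""
    half = length_mm / 2
    return (wire_delay_fs(half, r_drv_ohm, c_rep_in_ff) + t_rep_fs
            + wire_delay_fs(half, r_drv_ohm, c_load_ff))

def crossover_length_mm(r_drv_ohm, c_load_ff, size, step_mm=0.1, max_mm=20.0):
    """Shortest scanned length at which the repeater beats the upsized driver."""
    for i in range(1, int(round(max_mm / step_mm)) + 1):
        length = i * step_mm
        if (repeated_delay_fs(length, r_drv_ohm, c_load_ff)
                < upsized_delay_fs(length, r_drv_ohm, c_load_ff, size)):
            return length
    return None

print(crossover_length_mm(r_drv_ohm=1000.0, c_load_ff=10.0, size=2))
```

Upsizing reduces only the driver term, while splitting the wire reduces the term that grows with the square of the length, so the repeater wins once the line is long enough.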
Use scripting to automate the design flow and to ensure that you follow consistent
design processes. Scripts typically include capability for tasks such as building
libraries, setting up a directory for a design, setting up a batch process for executing a
fixed sequence of operations or commands, providing format translations, providing an
interactive prompting checklist, and releasing selected files into the revision-control
system. With diverse blocks, the timing approach depends on the nature of the blocks, and
you must carefully plan hierarchical timing flows (Figure 5).
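A batch script that executes a fixed sequence of operations can be as simple as the following sketch. The step names are hypothetical, and the commands are harmless placeholders that invoke the Python interpreter rather than any real design tool.

```python
import subprocess
import sys

def run_flow(steps):
    """Run (name, argv) steps in order and stop at the first failing step.
    Returns the names of the completed steps and the failed step, if any."""
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            return completed, name
        completed.append(name)
    return completed, None

if __name__ == "__main__":
    # Placeholder commands; a real flow would call the team's own tools here.
    placeholder = [sys.executable, "-c", "pass"]
    flow = [("build_libraries", placeholder),
            ("setup_design_dir", placeholder),
            ("translate_formats", placeholder)]
    done, failed = run_flow(flow)
    print(done, failed)
```

Stopping at the first failure keeps later steps from running on stale or incomplete data, which is much of what makes a scripted flow repeatable.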
Retargeting hard cores
In the cases in which a physical or other hard-core cell representation exists in one
technology and needs to migrate to a second technology, you can use a combination of
commercial tools. In addition, you need technology files, along with process-layer-mapping
definitions, for old and new processes. If the block is too large, break it into smaller
blocks. You can port hard cores three ways. The fastest, but the one with the least
area gain, is optical shrink. With this method, do a design-rule study to find
limiting rules and then create a shrink macro that shrinks the block to the size at which
an acceptable number of physical-design-rule-check violations occurs. The acceptable
number can range from zero errors to a number representing however much time you are
willing to spend cleaning residual design errors after running the process shrink. Unless
the target process was created with a design shrink in mind, the achievable shrink
is smaller than the theoretical maximum-allowable shrink.
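The design-rule study for an optical shrink amounts to finding the rule that limits a uniform linear scale factor. The sketch below assumes you know the minimum drawn dimension in the source layout and the minimum allowed dimension in the target process for each rule; the rule names and values are hypothetical.

```python
def limiting_shrink(drawn_um, target_um):
    """Find the smallest uniform scale factor that violates no rule.
    A factor s maps a drawn dimension d to s * d, so each rule requires
    s >= target / drawn; the largest such ratio is the limiting rule."""
    ratios = {rule: target_um[rule] / drawn_um[rule] for rule in drawn_um}
    rule = max(ratios, key=ratios.get)
    return rule, ratios[rule]

# Hypothetical minimum dimensions (um) in the source layout and target rules.
drawn = {"poly_width": 0.50, "metal1_space": 0.60, "contact_size": 0.50}
target = {"poly_width": 0.35, "metal1_space": 0.45, "contact_size": 0.40}
print(limiting_shrink(drawn, target))
```

In this example the gate-length ratio alone would suggest a 0.7 scale factor, but the contact rule allows only 0.8. Accepting a smaller factor means accepting design-rule violations that someone must clean up by hand.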
The second shrink method uses a commercially available technology-migration tool (Figure 2). A commercial tool performs "smarter"
shrinks. Because such a tool knows technology rules, it can rotate and manipulate
transistors to maximize shrinkage.
The third method is to reverse-engineer the block netlist from the physical file and to
do either a netlist port or a synthesis port. The synthesis port may be more efficient
because of current technology considerations. It is likely that added metal layers or cell
layouts are more efficient in the new technology and, therefore, provide a smaller result.
The theoretical maximum limit of an optical shrink does not limit the synthesis port.
Hierarchical timing strategy
In large designs, the timing approach may vary by block. For example, a synthesis tool
may initially time synthesized blocks that a path-delay tool then evaluates. Blocks that
you import and map from one process to a target process may require a transistor-level
timing tool. It may be impossible to flatten all blocks for a chip-level timing
evaluation; you must develop block-level timing models for blocks that you cannot flatten.
You also have to generate a hierarchical netlist containing those hierarchically described
blocks. You create timing shells for hard cores that a path-analysis tool uses for timing
the next level of the design hierarchy.
You also have to generate timing models for all library elements. These models should
be consistent for the design views used with each tool, although some inconsistencies may
exist. You usually find macro input- and output-pin capacitance in LEFs and in macro views
that the timing analyzer uses. Accurate interconnect-parasitic modeling is difficult but
important in deep-submicron chip design because it represents a larger portion of the
total delay.
As processing geometries decrease, the amount of delay from interconnect increases
dramatically. At 1-µm geometries, a transistor's gate delay represents about 60 to 70% of
total path delay. At 0.35 µm, this number shifts to about 30%. Previous methodologies of
taking a lumped capacitance for a line's interconnect capacitance are no longer adequate.
Accurate RC modeling of lines is essential, especially for clock lines. Special tools can
accurately extract and reduce the RC data for a line. Commercially available design tools
implement new algorithms that work on complex interconnect lines, including all crossovers
and parallelism. The result is a delay number that has the accuracy of an extracted-layout
result, accuracy that you need to achieve an acceptable value for interconnect delay.
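To see why a lumped capacitance is no longer adequate, compare it with an Elmore delay computed on an RC ladder. This is a minimal sketch under assumed values; it is not the algorithm of any particular extraction or reduction tool, and the function names are hypothetical.

```python
def lumped_delay_fs(r_drv_ohm, c_wire_ff):
    """Lumped-capacitance model: wire resistance is ignored. ohm * fF = fs."""
    return r_drv_ohm * c_wire_ff

def ladder_delay_fs(r_drv_ohm, r_wire_ohm, c_wire_ff, n_segments):
    """Elmore delay to the far end of a wire split into equal RC segments.
    Each segment capacitor is charged through the driver resistance plus
    all wire resistance upstream of it."""
    r_seg = r_wire_ohm / n_segments
    c_seg = c_wire_ff / n_segments
    delay = 0.0
    for i in range(1, n_segments + 1):
        delay += (r_drv_ohm + i * r_seg) * c_seg
    return delay

# Hypothetical long line: 2000 ohm and 1000 fF of wire, 500-ohm driver.
print(lumped_delay_fs(500.0, 1000.0))
print(ladder_delay_fs(500.0, 2000.0, 1000.0, 50))
```

As the number of segments grows, the wire's own contribution approaches half the product of its total resistance and capacitance, a term the lumped model omits entirely.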
Data reduction is imperative because the amount of data acquired when you design a
1-million-gate circuit is huge. Designers have successfully used data-reduction techniques
for some time. The accurate extraction of RC parasitics is the key. Shrinking technologies
have resulted in some interconnect heights exceeding their widths. This fact means that
fringing (side) capacitance exceeds plate (vertical) capacitance, and an analysis tool
must account for all parallel lines in its calculations.
Crosstalk, which was present in earlier technologies, is more apparent with the faster
edge rates of new designs. High-speed design requires fast edge rates. Routers are
beginning to look at this problem. One approach is to do an analysis by feeding the
crosstalk delay impact as additional delay to the timing-analysis program. When timing
analysis puts the crosstalk penalty into the path, the analysis can then flag those paths
where a real violation exists. In a synchronous circuit, crosstalk is a problem if data
has not stabilized. The three most common fixes for crosstalk are increasing drive
strength, rerouting the line, or putting a repeater on the line.
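Feeding a crosstalk penalty into timing analysis can be pictured as follows. The sketch assumes each path already has a base delay and an estimated crosstalk penalty; the path names, values, and single-clock-period check are hypothetical simplifications.

```python
def flag_crosstalk_violations(paths, clock_period_ns):
    """Add each path's crosstalk penalty to its base delay and flag paths
    with negative slack. The third field of each result tells whether the
    path fails only because of the crosstalk penalty."""
    flagged = []
    for name, base_delay_ns, xtalk_penalty_ns in paths:
        slack = clock_period_ns - (base_delay_ns + xtalk_penalty_ns)
        if slack < 0:
            fails_without_xtalk = clock_period_ns - base_delay_ns < 0
            flagged.append((name, slack, not fails_without_xtalk))
    return flagged

# Hypothetical paths: (name, base delay in ns, crosstalk penalty in ns).
paths = [("path_a", 4.2, 0.3), ("path_b", 4.8, 0.4), ("path_c", 5.2, 0.0)]
for name, slack, due_to_xtalk in flag_crosstalk_violations(paths, 5.0):
    print(name, due_to_xtalk)
```

Paths that fail only with the penalty applied are the candidates for the crosstalk fixes, while paths that fail without it need ordinary timing work.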
For a device to be manufacturable, it must meet the silicon manufacturer's design
rules, and it must be testable. With large circuits, good coverage is necessary to
discover any manufacturing defects. Such coverage requires a good test strategy from the
beginning of the design.
Visibility into the circuit is essential for chip debugging if something goes wrong and
for viewing block states during operation. If you design with a structured-testability
approach, you can use an automatic approach to generate test patterns. In addition, you
use test insertion to create testable scan chains. This approach assumes that you have
ensured device functionality through verification before generating manufacturing test
vectors. You can create large vector sets in this manner along with a coverage number
telling you what the tools may have missed. With a little analysis, you can also determine
why test-coverage tools are not testing some parts of the chip.
The use of cores or acquired blocks that don't conform to VSIA standards can complicate
this approach. For these blocks, there is no assurance of a structured-test approach or
one that is compatible with the rest of the design. Usually, you must apply functional
vectors to blocks to prove that they are working correctly. This step is difficult because
these chips often have deeply embedded blocks. Boundary scan allows for a slow insertion
of vectors that can test the block, but at-speed testing requires a BIST-type test. You
should also check for compliance with manufacturability rules, such as ESD and
electromigration. Many designers assume that vendors design cell libraries with these
rules in mind, but that is not always the case.
A clear understanding of the issues relating to the development of complex ASICs lets
you design "correct-by-construction" circuits. In other words, you can recognize
obstacles and overcome them early enough in the design so that they do not interfere with
delivery of the final design.