EDN Access


October 8, 1998


The million-gate march:
Tackling today's tough ASIC challenges

Increasing chip density is putting a strain on design engineers. Adhering to a good design-reuse methodology and working within a close-knit design team allow engineers to complete million-plus-gate designs.

Roy Shanks and Jon Levi, Cadence Spectrum Design Center

Because of greater silicon density and chip-fabrication advances, designs are becoming more complex. Design methodologies are also improving to keep pace with the amount of available silicon for logic implementations. For designers to complete complex chip designs within a reasonable time, reuse has become necessary.

A complete design flow starts at the architectural level and continues through the physical database (Figure 1). This design process requires a rich and complex environment for managing large amounts of design data, library models for many libraries, technology files, and other information. For complex submicron designs, such as systems-on-chip (SOC), a hierarchical block-based design approach has many advantages.

Project organization

One way to view the submicron-design environment is to consider a design flow covering the entire ASIC design. The design community is aware of the changing role of gates and interconnect in chip designs. Interconnect is now 70% or more of the propagation delay of paths on many chips and has a large impact on power dissipation.

The number of tasks for designing a million-gate ASIC is beyond the abilities of one individual. Design organizations must form teams to handle the myriad tasks that require special expertise and experience. These teams have five major functions: architecture and logic design, logic verification, design implementation, timing design and analysis, and circuit design. One approach is to form a project team with members possessing the skills to cover each of these functions. Layout and physical verification require special expertise, and the team needs transistor, process, and design expertise for timing analysis.

Customer specifications drive architectural evaluation and design definition (Figure 2). You perform architectural modeling to ensure system functionality and performance. You then review full-design functionality and develop a top-level functional-block diagram. Next, you expand the design hierarchically one level at a time, with each level providing greater detail.

The microfunctions you define in this way may be previously designed, customer-provided functions or industry-standard functions. In this definition phase, you must develop or acquire specifications for microblocks. You write or acquire a behavioral-level description of each microblock and create a chip-level description to link functionality.

To simulate the chip's function and begin implementation, you write a behavioral model of the system. You then define and review design-implementation trade-offs. Next, you do a chip-level verification, including all protocols and interface conditions. Most errors occur during exception conditions, which happen outside normal chip operation. Coverage of these conditions is an engineering judgment unless a set of verified legacy tests exists. For functional verification, you develop an RTL description of each block, along with the testbenches to verify the blocks. You synthesize blocks from RTL code to a gate-level representation. Use constraint files, wire-load models, and synthesis scripts to accelerate the iterative process. Blocks undergo verification using testbenches at both the RTL and the gate level. Designers normally use Verilog throughout these design steps, although some customers may want to use VHDL.

For acquired blocks, such as external intellectual property (IP) or cores, the scenario is more complex (Figure 3). In these cases, you must obtain proper models, some of which may come in encrypted formats. If the vendor has designed these cores without following Virtual Socket Interface Alliance (VSIA) guidelines, you need to design glue logic to provide an interface between the block and the bus. Sometimes, core providers do not design cores in your target technology, and you must migrate them. Migration does not affect functional models; however, you must modify or create timing models for the migrated block.

As you complete block design, you perform chip-level block integration and floorplanning. A design of this complexity represents an SOC, and the verification environment is complex. At the chip level, you must first independently verify the functionality of each block and then verify this functionality with bus-level activity or protocols. In some cases, using bus-level models can accelerate the early stages of verification. You have to generate large numbers of test-vector sets for debugging to perform chip-level verification. This task can be time-consuming and can use many resources.

Verification follows two paths: Behavioral simulation ensures system functionality; gate-level simulation both ensures that the implementation matches behavior and provides timing information. Matching gates and behavior is typically error-prone, because gate-level representations include all state information, but behavioral-level representations may not.

Test coverage is only as good as the vectors you use. If you add test logic, such as JTAG boundary scan or internal scan, that functionality is unavailable in the behavioral model, so you can't run vectors generated by automatic-test-pattern generation (ATPG) against the behavioral model. New formal-verification software provides a method of verifying a design's RTL representation against the design's synthesized gate-level representation.

Functional verification continues until it covers all possible operating conditions. You can also accomplish this design-confidence level by randomly generating operations and fixing bugs until the bug rate reaches a predetermined number, such as one bug per week or none in the last five days.
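The stopping rule described above can be sketched in a few lines of Python. This is only an illustration: the function names, the bug-log format (a list of discovery dates), and the choice to accept either criterion are assumptions, and the thresholds are the example values from the text.

```python
from datetime import date, timedelta

def bugs_in_window(bug_dates, today, window_days):
    """Count bugs found in the last `window_days` days, up to and including today."""
    start = today - timedelta(days=window_days)
    return sum(1 for d in bug_dates if start < d <= today)

def verification_may_stop(bug_dates, today, max_per_week=1, quiet_days=5):
    """Hypothetical exit check: at most one bug in the last week,
    or no bugs at all in the last five days."""
    weekly_ok = bugs_in_window(bug_dates, today, 7) <= max_per_week
    quiet_ok = bugs_in_window(bug_dates, today, quiet_days) == 0
    return weekly_ok or quiet_ok
```

A team would normally agree on these thresholds before random testing starts, so that the exit decision is not made under schedule pressure.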

At the gate-level timing stage, you perform initial timing evaluation using estimated interconnect-delay values, based on typical interconnect length and fan-out considerations (Figure 4). To facilitate timing convergence, you need more accurate estimations as early as possible. You can get these estimations from floorplanning tools or from fast-turnaround place-and-route runs at either the design's block level or the chip level for interconnect delay between blocks.
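As a rough illustration of a fan-out-based estimate, the sketch below looks up an estimated net capacitance from a small wire-load table and adds it to the pin load. The table values, the extrapolation slope, and the simple linear delay model are all hypothetical; a real library supplies its own wire-load models.

```python
# Hypothetical wire-load table: fan-out -> estimated net capacitance (pF).
WIRE_LOAD_PF = {1: 0.02, 2: 0.04, 3: 0.06, 4: 0.09}
SLOPE_PF_PER_FANOUT = 0.03  # assumed extrapolation beyond the table

def estimated_wire_cap_pf(fanout):
    """Estimate net capacitance from fan-out alone (no layout data yet)."""
    if fanout < 1:
        raise ValueError("fan-out must be at least 1")
    if fanout in WIRE_LOAD_PF:
        return WIRE_LOAD_PF[fanout]
    max_fo = max(WIRE_LOAD_PF)
    return WIRE_LOAD_PF[max_fo] + SLOPE_PF_PER_FANOUT * (fanout - max_fo)

def estimated_stage_delay_ns(intrinsic_ns, drive_res_kohm, pin_cap_pf, fanout):
    """Linear delay model: intrinsic delay plus drive resistance times load.
    kOhm multiplied by pF gives ns."""
    load_pf = estimated_wire_cap_pf(fanout) + pin_cap_pf * fanout
    return intrinsic_ns + drive_res_kohm * load_pf
```

Once floorplanning or a fast place-and-route run is available, the table lookup would be replaced by capacitance values taken from the estimated or actual routes.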

You can run a second verification path, one that captures functionality, in parallel with the behavioral and RTL verification. You capture the design using a language such as C. In this way, the team follows two parallel paths from specification through RTL representation. Compare the two solutions in the chip-level bus environment for a final verification. Although testing takes place in a chip-level environment, you verify blocks one at a time using chip-level testbenches that you generate. It is impractical to generate these vectors at both the chip and the block levels. Verification tasks include developing or updating block models, developing the chip-level verification environment, and developing C testbenches to verify the design. You model chip-level blocks in C code. The C-code approach adds one layer of verification to the process because it is unlikely that the RTL-code writer and the C-code writer will make the same mistake in interpreting a specification.
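The idea of two independently written models checking each other can be shown on a toy block. The sketch below uses Python in place of C and an 8-bit adder as a stand-in for a real block; both the block and the function names are hypothetical. One model is written arithmetically, the other bit by bit in the style of a ripple-carry implementation, and random vectors are applied to both.

```python
import random

def reference_add8(a, b):
    """Functional model: 8-bit sum and carry-out, written arithmetically."""
    total = a + b
    return total & 0xFF, total >> 8

def ripple_add8(a, b):
    """Independently written model: bit-level ripple-carry adder."""
    carry = 0
    result = 0
    for i in range(8):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result, carry

def cross_check(n_vectors=1000, seed=0):
    """Apply the same random vectors to both models; return any mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_vectors):
        a, b = rng.randrange(256), rng.randrange(256)
        if reference_add8(a, b) != ripple_add8(a, b):
            mismatches.append((a, b))
    return mismatches
```

A mismatch does not say which model is wrong, only that the two writers read the specification differently, which is the point of the second path.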

Physical design

Early in the design cycle, chip- and block-level netlists are typically unavailable, but with a hierarchical design structure, physical designers using the architectural specification for functional-block, pin, and bus information can begin floorplanning using estimated block sizes and form factors. They can do this floorplanning in parallel with logic design early in the design cycle. If an earlier version of this or a similar design exists, you can use the earlier design as an initial floorplan.

Blocks in the design may come from different sources. Some may exist at the physical level in a different technology, and others may exist at the RTL level. Still others may be new designs originating at the specification level. You can scale blocks previously implemented in a different technology to the target technology and use them for an initial block-area estimate for floorplanning. For blocks that already exist at the physical, netlist, or RTL level, you can start block layout early, in parallel with logic design.
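A first-order area estimate for a block that exists in an older technology might look like the sketch below. It assumes that area scales with the square of the feature-size ratio, which is a simplification; real blocks rarely scale ideally, so the result is only a starting point for floorplanning. The function names and the aspect-ratio convention (width divided by height) are hypothetical.

```python
import math

def scaled_block_area_mm2(old_area_mm2, old_feature_um, new_feature_um):
    """Ideal-scaling estimate: area shrinks with the square of the feature ratio."""
    ratio = new_feature_um / old_feature_um
    return old_area_mm2 * ratio * ratio

def block_dimensions_mm(area_mm2, aspect_ratio):
    """Return (width, height) for a given area and aspect ratio (width/height)."""
    width = math.sqrt(area_mm2 * aspect_ratio)
    return width, area_mm2 / width
```

For example, `scaled_block_area_mm2(4.0, 0.5, 0.25)` estimates the area of a 4-mm² block moved from a 0.5-µm to a 0.25-µm process, and `block_dimensions_mm` turns that area into a candidate form factor.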

Block retargeting at the physical level requires a design flow different from that of nonphysical-layout blocks. Physical and logic designers must consider placement, area, block form factors, chip- and block-level pinout, busing definitions, and timing issues. Physical designers share timing information from floorplanning or block-layout activities with logic designers.

Block development begins with area and pinout estimates. With this information, designers create mock library-exchange formats (LEFs) for each block that they can use for chip floorplanning. LEF is a Cadence (www.cadence.com) design-tool format. In most cases, routing across blocks is not viable: routing at the chip level is a block-routing exercise, requiring knowledge of block-routing porosity. Mock LEFs permit early estimation of signal-path delays between blocks. This estimation requires a floorplanner or routing tool to generate a block-level interconnect at the chip level for initial global-timing data. As the internal block structure becomes better defined, you can expand timing analysis. Using timing-driven layout, you place and route each block, so that the blocks comply with timing constraints in the design-exchange-format (DEF) file, another Cadence design-tool format.

You need to define your clock strategy early and refine it as the design progresses. Sometimes, the customer defines clock strategy. One common method is to use a clock-grid approach, sometimes called a "big-bang" clocking scheme. Another method is to use a balanced clock tree. If any timing discrepancies in the clock-tree approach remain after synthesis, you can resolve them with less manual effort than you need for the big-bang clock method. The clock strategy must recognize the block-based architecture of the physical design. The clock specification provides requirements for maximum skew and for minimum and maximum insertion delay. You must consider clock-distribution power dissipation when setting clock strategy and monitor the estimated power during implementation. Designs with more than one clock domain require manual tuning to minimize skew between domains. You often vary metal clock-line widths to manage clock RC delays or to make a line less susceptible to process variations.
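A clock specification of the kind described above (maximum skew plus minimum and maximum insertion delay) can be checked mechanically once insertion delays are available for each clock sink. The sketch below is illustrative only; the sink names, the dictionary format, and the function name are assumptions.

```python
def check_clock_spec(insertion_delays_ns, max_skew_ns,
                     min_insertion_ns, max_insertion_ns):
    """Return (skew_ns, violations) for a dict of sink name -> insertion delay."""
    latest = max(insertion_delays_ns.values())
    earliest = min(insertion_delays_ns.values())
    skew = latest - earliest
    violations = []
    if skew > max_skew_ns:
        violations.append(f"skew {skew:.3f} ns exceeds {max_skew_ns:.3f} ns")
    for sink, delay in sorted(insertion_delays_ns.items()):
        if not (min_insertion_ns <= delay <= max_insertion_ns):
            violations.append(f"{sink}: insertion delay {delay:.3f} ns out of range")
    return skew, violations
```

Running the check per block and again at the chip level fits the block-based architecture: a block can meet its own skew limit and still contribute to a chip-level violation.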

The ability to make incremental changes in the late stages of the design cycle is a key element in maintaining schedules. Once you place and route a design with most timing paths within specification, you can save redesign and verification time if you replace or redesign only the impacted portion of the design. In this way, you do not disturb paths that already meet timing specifications. Even if this process requires manual editing of the synthesized netlist, you can verify the edited version of the netlist against the same testbench you use for RTL verification to keep databases synchronized. This process works for modifications involving functional changes or changes in buffer sizes.

One area of concern is timing closure: getting the physical design's timing to match synthesis timing. In the past, designers used a cycle that included completing a layout and running a timing analysis on the extracted layout. Results would prompt another layout, and this situation could continue for a long time, depending on how far the design pushes the process technology. Timing-driven design addresses this loop up front by creating a methodology for taking results of prelayout timing and using these results as constraints on the place-and-route tool. The price you pay is a slightly increased area, because you are not necessarily using an optimum placement. However, schedule improvement because of the reduced number of iterations is worth the price. For designs that use aggressive clock strategies—for example, designs with clocks as high as 450 MHz—you can use a timing-driven-layout methodology.

Placement is critical in timing-driven-layout designs. Aggressive designs can create a timing-gridlock condition, in which paths are so tightly grouped that a small change in clock frequency can cause large numbers of paths to fail timing specifications. Unless the place-and-route tool has timing-driven capability, these designs are difficult to execute. You can implement buffer sizing using a company's proprietary program that increases buffer sizes to improve performance and decreases them to reduce power.

Today's large designs require multiple timing methods. Because IP blocks can come from multiple sources, they also take many forms: soft, firm, and hard cores, according to the definitions of the VSIA. Some blocks are RTL representations, some are netlists, and some are physical blocks. Timing requires cell- and transistor-level tools. If the IP vendor characterizes the block for your target technology, then the block may have a timing model. In some cases, no block-timing model is available. For these situations, you need a transistor-level timing model. This model may be a Spice model for some critical asynchronous-style logic, or it may be a timing-analysis- or -simulation-tool model that works at a pseudo-Spice level.

Whatever the timing method, you have to create a router timing model so that you can place constraints on the layout. You describe timing requirements to the place-and-route tool, and the tool works to meet these constraints. The tool then flags paths that fail the timing constraints. Most often, this failure results from improper cell-drive strength. Synthesis tools can adjust drive strengths to suit the physical layout. You often need repeater cells on long runs to reduce RC delays. You must characterize each technology with the target library to find the crossover point at which inserting a repeater is better than just increasing drive strength. Handle any adjustments to the circuit as an engineering change order (ECO): Only the portion of the design that has problems should change. With incremental timing, a design that met timing before the ECO should still meet it afterward, so you don't have to start over again. Another factor to consider is buffer size. If you initially select the smallest buffers, problems may occur later when buffers get bigger, forcing a new placement. However, if you initially use the largest buffers and then downsize them as appropriate, area use may not be optimum.
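The repeater crossover point can be estimated with a first-order Elmore delay model, as in the sketch below. Every number here (wire resistance and capacitance per millimeter, driver resistance, repeater input capacitance and intrinsic delay, and the upsizing factor) is a made-up placeholder; the point of characterization is to replace them with values from the target library and process. The model also ignores the extra load that a larger driver places on the stage before it.

```python
def elmore_segment_ns(r_drv_kohm, length_mm, c_load_pf,
                      r_per_mm=0.3, c_per_mm=0.2):
    """Elmore delay of a driver plus a distributed RC wire (kOhm * pF = ns)."""
    r_wire = r_per_mm * length_mm
    c_wire = c_per_mm * length_mm
    return r_drv_kohm * (c_wire + c_load_pf) + r_wire * (c_wire / 2 + c_load_pf)

def upsized_driver_delay_ns(length_mm, r_drv=1.0, upsize=2.0, c_load=0.05):
    """Option 1: no repeater, driver made `upsize` times stronger."""
    return elmore_segment_ns(r_drv / upsize, length_mm, c_load)

def repeater_delay_ns(length_mm, r_drv=1.0, c_load=0.05,
                      c_rep_in=0.02, t_rep=0.1):
    """Option 2: original driver with one repeater at the midpoint."""
    half = length_mm / 2
    first = elmore_segment_ns(r_drv, half, c_rep_in)
    second = elmore_segment_ns(r_drv, half, c_load)
    return first + t_rep + second

def crossover_length_mm(max_len_mm=20.0, step_mm=0.1):
    """Shortest scanned length at which the repeater beats upsizing, or None."""
    steps = int(round(max_len_mm / step_mm))
    for i in range(1, steps + 1):
        length = i * step_mm
        if repeater_delay_ns(length) < upsized_driver_delay_ns(length):
            return length
    return None
```

The crossover exists because the wire's own RC term grows with the square of its length: a stronger driver does nothing to reduce that term, whereas splitting the line roughly halves it.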

Use scripting to automate the design flow and to ensure that you follow consistent design processes. Scripts typically include capability for tasks such as building libraries, setting up a directory for a design, setting up a batch process for executing a fixed sequence of operations or commands, providing format translations, providing an interactive prompting checklist, and releasing selected files into the revision-control system. With diverse blocks, the timing approach depends on the nature of the blocks, and you must carefully plan hierarchical timing flows (Figure 5).
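A batch process that executes a fixed sequence of operations can be as simple as the runner sketched below. The step names are hypothetical, and each step is represented by a Python callable that returns True on success; in practice each would wrap an invocation of a design tool.

```python
def run_flow(steps):
    """Run (name, action) pairs in order; stop at the first failing step.
    Returns a log of (name, succeeded) for every step that was attempted."""
    log = []
    for name, action in steps:
        succeeded = bool(action())
        log.append((name, succeeded))
        if not succeeded:
            break  # later steps depend on earlier ones, so do not continue
    return log

if __name__ == "__main__":
    flow = [
        ("build_libraries", lambda: True),
        ("synthesize_block", lambda: True),
        ("place_and_route", lambda: True),
    ]
    for name, ok in run_flow(flow):
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Keeping the sequence in one list makes the process repeatable across blocks and engineers, which is the consistency the scripts are meant to enforce.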

Retargeting hard cores

When a physical or other hard-core cell representation exists in one technology and needs to migrate to a second technology, you can use a combination of commercial tools. You also need technology files, along with process-layer-mapping definitions, for the old and new processes. If the block is too large, break it into smaller blocks. You can port hard cores in three ways. The fastest, but the one with the least area gain, is an optical shrink. With this method, you do a design-rule study to find the limiting rules and then create a shrink macro that shrinks the block to the size at which an acceptable number of physical-design-rule-check violations occurs. The acceptable number can range from zero to however many errors you are willing to spend time cleaning up after running the process shrink. Unless the target process was created with a design shrink in mind, the achievable shrink is smaller than the theoretical maximum-allowable shrink.
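The design-rule study for an optical shrink amounts to finding the rule that runs out of margin first. The sketch below assumes a zero-violation target and a uniform linear shrink; the rule names and dimensions are invented for illustration.

```python
def limiting_shrink(old_drawn_um, new_min_um):
    """For each rule, the smallest legal scale factor is new_min / old_drawn.
    The largest of these factors limits a uniform, violation-free shrink.
    Returns (limiting_rule, shrink_factor)."""
    factors = {rule: new_min_um[rule] / old_drawn_um[rule]
               for rule in old_drawn_um}
    rule = max(factors, key=factors.get)
    return rule, factors[rule]

if __name__ == "__main__":
    # Hypothetical as-drawn dimensions in the old layout and new-process minimums.
    old = {"metal1_width": 0.6, "metal1_space": 0.6, "contact_size": 0.5}
    new = {"metal1_width": 0.4, "metal1_space": 0.45, "contact_size": 0.3}
    rule, factor = limiting_shrink(old, new)
    print(f"limited by {rule}: linear shrink {factor:.2f}, area {factor**2:.2f}")
```

Accepting some violations corresponds to choosing a factor below the one this function returns and then fixing by hand the rules that break.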

The second shrink method uses a commercially available technology-migration tool (Figure 2). A commercial tool performs "smarter" shrinks. Because such a tool knows technology rules, it can rotate and manipulate transistors to maximize shrinkage.

The third method is to reverse-engineer the block netlist from the physical file and to do either a netlist port or a synthesis port. The synthesis port may be more efficient because of current technology considerations. It is likely that added metal layers or cell layouts are more efficient in the new technology and, therefore, provide a smaller result. The theoretical maximum limit of an optical shrink does not limit the synthesis port.

Hierarchical timing strategy

In large designs, the timing approach may vary by block. For example, a synthesis tool may initially time synthesized blocks that a path-delay tool then evaluates. Blocks that you import and map from one process to a target process may require a transistor-level timing tool. It may be impossible to flatten all blocks for a chip-level timing evaluation; you must develop block-level timing models for blocks that you cannot flatten. You also have to generate a hierarchical netlist containing those hierarchically described blocks. You create timing shells for hard cores that a path-analysis tool uses for timing the next level of the design hierarchy.

You also have to generate timing models for all library elements. These models should be consistent across the design views used with each tool, although some inconsistencies may exist. You usually find macro input- and output-pin capacitance in LEFs and in the macro views that the timing analyzer uses. Accurate interconnect-parasitic modeling is difficult but important in deep-submicron chip design because interconnect represents a larger portion of the total delay.

As processing geometries decrease, the amount of delay from interconnect increases dramatically. At 1-µm geometries, a transistor's gate delay represents about 60 to 70% of total path delay. At 0.35 µm, this number shifts to about 30%. Previous methodologies of taking a lumped capacitance for a line's interconnect capacitance are no longer adequate. Accurate RC modeling of lines is essential, especially for clock lines. Special tools can accurately extract and reduce the RC data for a line. Commercially available design tools implement new algorithms that work on complex interconnect lines, including all crossovers and parallelism. The result is a delay number that has the accuracy of an extracted-layout result—accuracy that you need to achieve an acceptable value for interconnect delay.
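The gap between a lumped-capacitance estimate and an RC model can be seen with a simple Elmore-delay calculation. The sketch below models the line as a ladder of equal RC segments; the segment count and all component values are illustrative assumptions, and commercial extraction tools use far more detailed models.

```python
def lumped_c_delay_ns(r_drv_kohm, c_wire_pf, c_load_pf):
    """Older approach: the wire is a single capacitor with no resistance."""
    return r_drv_kohm * (c_wire_pf + c_load_pf)

def elmore_ladder_delay_ns(r_drv_kohm, r_wire_kohm, c_wire_pf, c_load_pf,
                           segments=20):
    """Elmore delay of a driver feeding a wire split into equal RC segments.
    Each capacitor contributes (total upstream resistance) * (its capacitance)."""
    r_seg = r_wire_kohm / segments
    c_seg = c_wire_pf / segments
    r_upstream = r_drv_kohm
    delay = 0.0
    for _ in range(segments):
        r_upstream += r_seg
        delay += r_upstream * c_seg
    return delay + r_upstream * c_load_pf
```

With zero wire resistance the two functions agree. As wire resistance grows relative to the driver's resistance, the lumped estimate becomes increasingly optimistic.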

Data reduction is imperative because the amount of data acquired when you design a 1 million-gate circuit is huge. Designers have successfully used data-reduction techniques for some time. The accurate extraction of RC parasitics is the key. Shrinking technologies have resulted in some interconnect heights exceeding their widths. This fact means that fringing (side) capacitance exceeds plate (vertical) capacitance, and an analysis tool must account for all parallel lines in its calculations.

Crosstalk, which was present in earlier technologies, is more apparent with the faster edge rates of new designs, and high-speed design requires fast edges. Routers are beginning to address this problem. One approach is to feed the crosstalk delay impact as additional delay to the timing-analysis program. When timing analysis puts the crosstalk penalty into the path, the analysis can then flag those paths where a real violation exists. In a synchronous circuit, crosstalk is a problem only if data has not stabilized by the time it is sampled. The three most common fixes for crosstalk are increasing drive strength, rerouting the line, and putting a repeater on the line.
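Feeding crosstalk into timing analysis as extra delay can be sketched as a slack calculation. The path names, the data format, and the single-clock setup check below are assumptions made for illustration; a real analyzer also considers hold checks, clock skew, and setup time.

```python
def flag_crosstalk_violations(paths, clock_period_ns):
    """paths maps a path name to (nominal_delay_ns, crosstalk_penalty_ns).
    Returns (name, slack_ns) for each path that fails once the penalty is added,
    worst slack first."""
    flagged = []
    for name, (delay_ns, penalty_ns) in paths.items():
        slack_ns = clock_period_ns - (delay_ns + penalty_ns)
        if slack_ns < 0:
            flagged.append((name, slack_ns))
    return sorted(flagged, key=lambda item: item[1])
```

Paths whose slack stays positive after the penalty need no fix, even if they couple strongly to a neighbor, because the data still stabilizes before it is sampled.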

For a device to be manufacturable, it must meet the silicon manufacturer's design rules, and it must be testable. With large circuits, good coverage is necessary to discover any manufacturing defects. Achieving such coverage requires a good test strategy from the beginning of the design.

Visibility into the circuit is essential for chip debugging if something goes wrong and for viewing block states during operation. If you design with a structured-testability approach, you can use an automatic approach to generate test patterns. In addition, you use test insertion to create testable scan chains. This approach assumes that you have ensured device functionality through verification before generating manufacturing test vectors. You can create large vector sets in this manner along with a coverage number telling you what the tools may have missed. With a little analysis, you can also determine why test-coverage tools are not testing some parts of the chip.
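The coverage number and the follow-up analysis can be illustrated with a small helper. The sketch assumes that the test tool can export a list of undetected faults as hierarchical path strings separated by "/", which is a hypothetical format; grouping them by top-level block shows where coverage is being lost.

```python
from collections import Counter

def fault_coverage_percent(detected, total):
    """Fault coverage as a percentage of all modeled faults."""
    if total <= 0:
        raise ValueError("total fault count must be positive")
    return 100.0 * detected / total

def undetected_by_block(undetected_faults):
    """Count undetected faults per top-level block, worst block first."""
    counts = Counter(path.split("/")[0] for path in undetected_faults)
    return counts.most_common()
```

A block that dominates this list is often one without scan access, such as an acquired core, which points to where functional vectors or built-in self-test may be needed.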

The use of cores or acquired blocks that don't conform to VSIA standards can complicate this approach. For these blocks, there is no assurance of a structured-test approach or one that is compatible with the rest of the design. Usually, you must apply functional vectors to blocks to prove that they are working correctly. This step is difficult because these chips often have deeply embedded blocks. Boundary scan allows for a slow insertion of vectors that can test the block, but at-speed testing requires a BIST-type test. You should also check for compliance with manufacturability rules, such as ESD and electromigration. Many designers assume that vendors design cell libraries with these rules in mind, but that is not always the case.

A clear understanding of the issues relating to the development of complex ASICs lets you design "correct-by-construction" circuits. In other words, you can recognize obstacles and overcome them early enough in the design so that they do not interfere with delivery of the final design.


Authors' biographies

Roy Shanks, service director for Cadence Design Systems (www.cadence.com), has 23 years of experience with Burroughs, Unisys, and Cadence. He currently manages an IC-design group and has developed processors, I/O devices, and caches. Shanks is a member of the IEEE and has a BSEE degree from Cleveland State University (Cleveland) and an MSEE from Oakland University (Rochester, MI).

Jonathan Levi, a Cadence (www.cadence.com) project manager, has 19 years of experience with Burroughs, Unisys, and Cadence. He manages ASIC-design projects and has worked on designs having as many as 3 million gates. Levi has a BS in physics from Kalamazoo College (Kalamazoo, MI).




Copyright © 1998 EDN Magazine, EDN Access. EDN is a registered trademark of Reed Properties Inc, used under license. EDN is published by Cahners Business Information, a unit of Reed Elsevier Inc.