
How to multithread a timing analyzer

January 11, 2010

As server and workstation CPUs have embraced multicore configurations, EDA vendors have set out to adapt their tools to the new computing environment. That means parallelizing the applications so that there are separate tasks to run on separate processors. Tool developers are using quite a range of techniques, from the obvious to real rocket science, and we can classify them by criteria such as the level of interaction between tasks or the granularity of the tasks.

The simplest level of parallelization is done by users themselves, and has been around for years. There is no coupling between tasks. You just launch different scenarios on different servers. All the scenarios run in parallel, each at single-processor speed. But if you don’t have a bunch of scenarios—different process corners or operating modes, say—to run, this approach isn’t much help.
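As a minimal sketch of this user-level approach, the Python below fans a set of independent scenarios out to a pool of workers. The `run_scenario` function, the corner and mode names, and the use of local worker threads in place of separate servers are all assumptions made for illustration; a real flow would launch the actual tool on each server.

```python
from concurrent.futures import ThreadPoolExecutor

def run_scenario(scenario):
    """Hypothetical stand-in for one complete, single-processor tool run."""
    corner, mode = scenario
    # A real flow would launch the tool on a separate server here.
    return {"corner": corner, "mode": mode, "status": "done"}

def run_all(scenarios, max_workers=4):
    # Scenarios share no data, so each one can go to its own worker.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_scenario, scenarios))

# Three process corners times two operating modes gives six independent runs.
scenarios = [(corner, mode)
             for corner in ("slow", "typical", "fast")
             for mode in ("functional", "test")]
results = run_all(scenarios)
print(len(results), "scenarios finished")
```

Because nothing is shared between runs, the only speedup comes from having more than one scenario to launch, which is exactly the limitation noted above.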

A more challenging approach for tool designers is to exploit data parallelism to partition one run of the tool across multiple machines. For instance, you can turn one run into many parallel runs by simply dividing the design up into chunks and running each chunk on a separate machine. This is really straightforward for some tools, such as design-rule checkers, where there is little or no interaction between adjacent design objects. You just break your design up into pieces, leaving a little overlap between your chunks, run each piece of the design on a separate server, and resolve any ambiguities that come up in the overlap areas after the run.
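The chunk-with-overlap idea can be sketched in a few lines. The example below is a deliberately simplified, one-dimensional stand-in for a design-rule check: shapes are intervals on a line, the only rule is a minimum spacing, and the function names (`check_spacing`, `tile_shapes`, `tiled_check`) are hypothetical. It is not how any commercial checker works; it only shows the halo and the clean-up of duplicate results in the overlap areas.

```python
from itertools import combinations

def check_spacing(shapes, min_space):
    """Return id pairs whose gap is positive but below min_space (brute force)."""
    violations = set()
    for (id_a, lo_a, hi_a), (id_b, lo_b, hi_b) in combinations(shapes, 2):
        gap = max(lo_a, lo_b) - min(hi_a, hi_b)  # negative when intervals overlap
        if 0 < gap < min_space:
            violations.add(tuple(sorted((id_a, id_b))))
    return violations

def tile_shapes(shapes, width, n_tiles, halo):
    """Split [0, width] into equal tiles, each widened by a halo on both sides."""
    tile_w = width / n_tiles
    tiles = []
    for i in range(n_tiles):
        lo, hi = i * tile_w - halo, (i + 1) * tile_w + halo
        tiles.append([s for s in shapes if s[1] <= hi and s[2] >= lo])
    return tiles

def tiled_check(shapes, width, n_tiles, min_space):
    merged = set()
    # Each tile is independent, so each call could run on its own server.
    for tile in tile_shapes(shapes, width, n_tiles, halo=min_space):
        # The set union drops violations reported twice in overlap areas.
        merged |= check_spacing(tile, min_space)
    return merged

shapes = [("a", 0, 9), ("b", 11, 18), ("c", 22, 28), ("d", 30, 38)]
flat = check_spacing(shapes, min_space=3)
tiled = tiled_check(shapes, width=40, n_tiles=4, min_space=3)
print(sorted(tiled), tiled == flat)
```

A halo equal to the minimum spacing is enough here because any violating pair lies entirely within one widened tile. Pairs that straddle a tile boundary show up in two tiles, and the merge step removes the duplicate.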

The principle still works for tasks that are less independent, but it gets harder. Some time ago, in Release 2008.12, Synopsys offered a version of its PrimeTime static timing analyzer that allows you to partition the design across multiple communicating servers. This is a bit more difficult than DRC, since the tool has to deal with paths that cross the boundaries of the partitions. Ken Rousseau, Synopsys vice president of engineering for PrimeTime, explains that for timing analysis the partitioning and reintegrating process is somewhat complex, and tends to get more difficult as the chunks of the design (the granularity, if you will) get smaller. It becomes difficult to avoid having large amounts of information flow between separate tasks, using up inter-server bandwidth and potentially stalling some tasks while they wait for others. Yet this level of parallelization is worth the trouble: Synopsys claims about a doubling of throughput by spreading one run across multiple servers.
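One rough way to see why smaller chunks mean more inter-task traffic is to count the connections that cross partition boundaries. The sketch below uses a toy n-by-n grid of nodes, each connected to its right and lower neighbors, as a stand-in for a design. That regular structure, the square-block partitioning, and the function name `cut_edges` are assumptions for illustration only; real netlists and real partitioners are far less tidy.

```python
def cut_edges(n, block):
    """Count grid edges whose endpoints fall in different block x block partitions."""
    def part(row, col):
        return (row // block, col // block)

    cut = 0
    for row in range(n):
        for col in range(n):
            # Edge to the right-hand neighbor.
            if col + 1 < n and part(row, col) != part(row, col + 1):
                cut += 1
            # Edge to the neighbor below.
            if row + 1 < n and part(row, col) != part(row + 1, col):
                cut += 1
    return cut

for block in (8, 4, 2, 1):
    print(f"block size {block}: {cut_edges(8, block)} boundary-crossing edges")
```

On this 8-by-8 grid, one partition has no crossing edges, four partitions have 16, sixteen partitions have 48, and one node per partition makes all 112 edges cross a boundary. Every crossing edge is a value that one task has to hand to another.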

Today, with the 2009.12 release, the company is moving to a significantly finer level of granularity in its partitioning. The new PrimeTime release not only allows you to distribute a design across multiple servers, but it multithreads the run on an individual server as well. The multithreading, according to the company, achieves an additional factor of two in performance.

"The real challenge in multithreading at this level is carving up one timing graph, and then dealing with what goes on at the boundaries," Rousseau says. With the finer-grained partitioning there is likely to be more inter-task communication, making the throughput more sensitive to inter-task latencies. So Synopsys uses this technique only for multiple cores on a single machine.
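Synopsys does not describe its algorithm here, so the following is only a generic illustration of multithreading one timing graph, not PrimeTime's method. It uses a common textbook scheme: sort the graph into levels so that every edge runs from an earlier level to a later one, then compute the latest arrival time of all nodes in a level concurrently. The graph, the delays, and the names `levelize` and `arrival_times` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def levelize(nodes, edges):
    """Group nodes into levels; every edge goes from an earlier level to a later one."""
    fanin = {n: [] for n in nodes}
    fanout = {n: [] for n in nodes}
    pending = {n: 0 for n in nodes}
    for src, dst, delay in edges:
        fanin[dst].append((src, delay))
        fanout[src].append(dst)
        pending[dst] += 1
    levels = []
    level = [n for n in nodes if pending[n] == 0]
    while level:
        levels.append(level)
        nxt = []
        for n in level:
            for dst in fanout[n]:
                pending[dst] -= 1
                if pending[dst] == 0:  # all of dst's inputs are now placed
                    nxt.append(dst)
        level = nxt
    return levels, fanin

def arrival_times(nodes, edges, workers=4):
    levels, fanin = levelize(nodes, edges)
    arrival = {}

    def latest(node):
        # Reads only results from earlier levels, which are already final.
        return max((arrival[src] + d for src, d in fanin[node]), default=0.0)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in levels:
            # Nodes in one level are independent of each other.
            times = list(pool.map(latest, level))
            arrival.update(zip(level, times))
    return arrival

nodes = ["in1", "in2", "g1", "g2", "out"]
edges = [("in1", "g1", 2.0), ("in2", "g1", 3.0), ("g1", "g2", 1.0),
         ("in2", "g2", 5.0), ("g2", "out", 1.0)]
print(arrival_times(nodes, edges))
```

The barrier between levels is the price of this scheme: every thread waits for the slowest node in the level, and all threads read shared results from earlier levels. That kind of frequent synchronization is cheap between cores on one machine and costly between servers, which is consistent with the point about inter-task latency above.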

The two techniques, distributing and multithreading, are largely complementary, Synopsys says. So applying both at once to distribute a large design across a number of multicore servers can produce an aggregate five-times increase in throughput. This figure, like the overall performance of PrimeTime, may depend on both the structure of the design and the design style. But based on evaluation with about a hundred benchmark designs, the company is happy with the five-times figure as an estimate. Given that many design shops today are delaying or cancelling orders for computing hardware, that figure is enough to get one’s attention.

Posted by Ron Wilson on January 11, 2010 | Comments (4)

January 20, 2010
In response to: How to multithread a timing analyzer
bethM commented:

Multithreading timing functions is very difficult, and hard to implement on older software architectures. So, how far can PT go without a total rewrite? Also, they cite the performance improvement as a five-times increase, but what is that compared to? Is it 5x on 8 cores? On 16? Stand-alone CLK tools have newer SW architectures and are multithreaded. Mentor's place&route tool is also a newer program and has scalable multi-core timing analysis. Their customers reported 7x speedup in timing on 8 cores. Quite valuable to have that in the implementation loop.


January 13, 2010
In response to: How to multithread a timing analyzer
rharding64 commented:

With the increased ease of programming FPGAs and CPLDs for end applications, true parallel programming can be realized cheaply by using such devices. I am currently learning FPGAs on my own and will, in the future, use them at the heart of my automated test fixture designs. With the FPGA as the workhorse and a desktop MS .NET GUI control application on the PC, this forms the ultimate reconfigurable test platform. After everything is considered, this does implement what Microsoft likes to call the 'distributed computing model'. My two cents.


January 11, 2010
In response to: How to multithread a timing analyzer
JGeada commented:

Both CLK and Extreme have had multithreaded timing analyzers for years now. CLK even gave talks at TAU a couple of years back describing how this could be done with close to linear performance scaling all the way to 16 cores.


January 11, 2010
In response to: How to multithread a timing analyzer
EdaGuy commented:

Isn't it funny how an EDA vendor's story changes with their tool releases to fit the features in their current release? Everyone has had multicore machines for years and they are just now getting around to deciding it is important and having working code? Boy, who could have seen this trend happening... ya, I know it was difficult to predict. BTW, extraction, place and route and DRC all have "halo effects" when chopping up data as well, that's one part of dealing with multithreading/distributing.
