How to multithread a timing analyzer
As server and workstation CPUs have embraced multicore configurations, EDA vendors have set out to adapt their tools to the new computing environment. That means parallelizing the applications so that separate tasks can run on separate processors. Tool designers are using quite a range of techniques, from the obvious to real rocket science. We can order these techniques by several criteria, including the level of interaction between tasks and the granularity of the tasks.
The simplest level of parallelization is done by users themselves, and has been around for years. There is no coupling between tasks. You just launch different scenarios on different servers. All the scenarios run in parallel, each at single-processor speed. But if you don’t have a bunch of scenarios—different process corners or operating modes, say—to run, this approach isn’t much help.
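As a rough illustration of this uncoupled style, the sketch below fans a handful of scenarios out to a pool of workers. The scenario list and the `analyze_scenario` function are hypothetical stand-ins for complete tool runs, and the thread pool merely plays the role of a server farm; a real flow would more likely submit batch jobs to separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scenarios: (process corner, operating mode) pairs.
SCENARIOS = [
    ("slow", "functional"),
    ("slow", "test"),
    ("fast", "functional"),
    ("fast", "test"),
]

def analyze_scenario(corner, mode):
    """Stand-in for one complete, single-processor tool run (hypothetical)."""
    # A real flow would launch the analyzer here, for example as a batch job.
    return f"{corner}/{mode}: done"

def run_all(scenarios, max_workers=4):
    # No task reads another task's data, so no synchronization is needed
    # beyond collecting the results at the end.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(analyze_scenario, c, m) for c, m in scenarios]
        return [f.result() for f in futures]

results = run_all(SCENARIOS)
```

Each run still takes as long as it would alone; only the total turnaround for the whole set of scenarios shrinks.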
A more challenging approach for tool designers is to exploit data parallelism to partition one run of the tool across multiple machines. For instance, you can turn one run into many parallel runs by simply dividing the design up into chunks and running each chunk on a separate machine. This is really straightforward for some tools, such as design-rule checkers, where there is little or no interaction between adjacent design objects. You just break your design up into pieces, leaving a little overlap between your chunks, run each piece of the design on a separate server, and resolve any ambiguities that come up in the overlap areas after the run.
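The chunk-with-overlap idea can be sketched in a few lines. This is a toy, one-dimensional example under stated assumptions: shapes are reduced to positions on a line, the only rule is a hypothetical minimum spacing, and the function names are invented for illustration. It is not how any particular design-rule checker works.

```python
from itertools import combinations

MIN_SPACING = 3  # hypothetical rule: shapes must sit at least 3 units apart

def check_chunk(points):
    """Return every pair in this chunk that violates the spacing rule."""
    return {(a, b) for a, b in combinations(sorted(points), 2)
            if b - a < MIN_SPACING}

def partitioned_check(points, chunk_width):
    lo, hi = min(points), max(points)
    violations = set()
    start = lo
    while start <= hi:
        end = start + chunk_width
        # Overlap: also pull in neighbors within MIN_SPACING of the chunk
        # edges, so a violating pair that straddles a boundary is still seen.
        chunk = [p for p in points
                 if start - MIN_SPACING <= p < end + MIN_SPACING]
        # Each check_chunk call is independent and could run on its own server.
        # The set union resolves duplicates reported from the overlap areas.
        violations |= check_chunk(chunk)
        start = end
    return violations

layout = [0, 2, 9, 10, 11, 20, 30, 31]
whole = check_chunk(layout)              # one run over the full design
pieces = partitioned_check(layout, 5)    # same design, cut into chunks
```

The overlap has to be at least as wide as the reach of the rule being checked; with that condition met, the chunked run reports the same violations as the single run.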
The principle still works for tasks that are less independent, but it gets harder. Some time ago, in Release 2008.12, Synopsys offered a version of its PrimeTime static timing analyzer that lets you partition the design across multiple communicating servers. This is more difficult than DRC, because the tool has to deal with paths that cross the boundaries of the partitions. Ken Rousseau, Synopsys vice president of engineering for PrimeTime, explains that for timing analysis the partitioning and reintegration process is somewhat complex, and tends to get more difficult as the chunks of the design (the granularity, if you will) get smaller. It becomes hard to avoid large amounts of information flowing between separate tasks, which uses up inter-server bandwidth and can stall some tasks while they wait for others. Yet this level of parallelization is worth the trouble: Synopsys claims about a doubling of throughput by spreading one run across multiple servers.
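To make the boundary problem concrete, here is a minimal sketch of arrival-time propagation over a timing graph split into two partitions. The graph, the delays, the partition assignment, and the function names are all hypothetical, and the scheme shown (pass arrival times on cut edges as a message) is a generic illustration rather than a description of how PrimeTime does it.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical timing graph: node -> {fanin node: edge delay}.
FANIN = {
    "in1": {}, "in2": {},
    "u1": {"in1": 2, "in2": 3},
    "u2": {"u1": 4},
    "u3": {"u1": 1, "u2": 2},
    "out": {"u3": 5},
}
# Hypothetical assignment of nodes to two servers.
PARTITION = {"in1": 0, "in2": 0, "u1": 0, "u2": 1, "u3": 1, "out": 1}

def cut_edges():
    """Edges whose endpoints live in different partitions."""
    return [(src, node) for node, fanin in FANIN.items()
            for src in fanin if PARTITION[src] != PARTITION[node]]

def analyze_partition(part, boundary_arrivals):
    """Propagate latest arrival times through one partition only.

    boundary_arrivals holds the times received from other partitions.
    """
    arrival = dict(boundary_arrivals)
    order = TopologicalSorter({n: FANIN[n].keys() for n in FANIN}).static_order()
    for node in order:
        if PARTITION[node] != part:
            continue
        arrival[node] = max(
            (arrival[src] + delay for src, delay in FANIN[node].items()),
            default=0,  # primary inputs arrive at time 0
        )
    return {n: t for n, t in arrival.items() if PARTITION[n] == part}

part0 = analyze_partition(0, {})
# Only values on cut edges have to travel between servers.
message = {src: part0[src] for src, _ in cut_edges()}
# Partition 1 cannot start until this message arrives: the stall in question.
part1 = analyze_partition(1, message)
```

In this toy case the data flows one way and the message is a single number. In a real design, partitions typically depend on each other in both directions, and the number of cut edges grows as the chunks shrink.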
Today, with the 2009.12 release, the company is moving to a significantly finer level of granularity in its partitioning. The new PrimeTime release not only allows you to distribute a design across multiple servers, but it multithreads the run on an individual server as well. The multithreading, according to the company, achieves an additional factor of two in performance.
"The real challenge in multithreading at this level is carving up one timing graph, and then dealing with what goes on at the boundaries," Rousseau says. With the finer-grained partitioning there is likely to be more inter-task communication, making the throughput more sensitive to inter-task latencies. So Synopsys uses this technique only for multiple cores on a single machine.
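One common way to carve a single timing graph into thread-sized tasks is levelization: group nodes by their depth from the inputs, then evaluate each level in parallel. The sketch below shows that technique on a hypothetical graph with invented function names. Synopsys has not said that this is its method, so treat it purely as an illustration of fine-grained tasks that share one memory space.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical timing graph: node -> {fanin node: edge delay}.
FANIN = {
    "a": {}, "b": {}, "c": {},
    "g1": {"a": 1, "b": 2},
    "g2": {"b": 1, "c": 3},
    "g3": {"g1": 2, "g2": 1},
    "out": {"g3": 1},
}

def levelize(fanin):
    """Group nodes by depth; a node's level is one more than its deepest fanin."""
    level = {}

    def depth(node):
        if node not in level:
            level[node] = 1 + max((depth(s) for s in fanin[node]), default=-1)
        return level[node]

    for node in fanin:
        depth(node)
    groups = {}
    for node, lvl in level.items():
        groups.setdefault(lvl, []).append(node)
    return [groups[lvl] for lvl in sorted(groups)]

def arrival_times(fanin, workers=4):
    arrival = {}

    def evaluate(node):
        # Reads only arrival times written by earlier levels.
        latest = max((arrival[s] + d for s, d in fanin[node].items()), default=0)
        return node, latest

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for nodes in levelize(fanin):
            # Nodes within a level are independent of one another. Collecting
            # the results acts as a barrier before the next level starts.
            arrival.update(list(pool.map(evaluate, nodes)))
    return arrival

levels = levelize(FANIN)
arrival = arrival_times(FANIN)
```

Every level boundary is a synchronization point, which is why latency between tasks matters so much at this granularity. Note that CPython threads will not actually speed up CPU-bound work like this; the code shows the task structure, not the performance.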
The two techniques, distributing and multithreading, are largely complementary, Synopsys says. So applying both at once to distribute a large design across a number of multicore servers can produce an aggregate five-times increase in throughput. This figure, like the overall performance of PrimeTime, may depend on both the structure of the design and the design style. But based on evaluation with about a hundred benchmark designs, the company is happy with the five-times figure as an estimate. Given that many design shops today are delaying or cancelling orders for computing hardware, that figure is enough to get one’s attention.