Sunday, April 12, 2009
Magma, Multicore, and Exploiting Parallelism in EDA Tools
Patrick Groeneveld, Chief Technologist at Magma Design Automation, gave a terrific keynote speech with great technical eloquence about exploiting parallelism for EDA tools at last week’s Electronic Design Processes Workshop in Monterey. Everyone’s interested in trying to use multicore processors to speed up the process of designing chips and Patrick’s looked into the problem. As soon as you look in this direction, Amdahl’s Law rears its head. Algorithms are overwhelmingly written as single-threaded task chains and Amdahl’s Law tells you that you the parts of the algorithm you can’t parallelize will limit the amount of speedup you can get.
Patrick converted Amdahl’s Law into a resistor network for his talk. Every task is a series resistance. When you parallelize a task, you add several resistances in parallel with the resistor that represents the task your parallelizing, which drops the effective resistance of that resistor but leaves the rest of the resistors alone, like this:

Then you can look at the cold, hard, unhappy numbers—and Patrick has. Even if you can parallelize 95% of the sequential tasks and make them infinitely fast, the best you can get is a 20x speedup. That would be great for a lot of IC-design tasks, but there is overhead associated with parallelization—primarily setting up the parallel subtasks and then merging the results. For example, consider parallelizing some image-processing tasks for a photo. First, you slice up the photo into regions. That takes time. Then you hand out each region to a processor and let it process that region. Then you reassemble the results, knit the photo back together, and fix up any problems with the boundaries between the split-up regions. All of these set-up and tear-down tasks take time that wasn’t in the original single-threaded code. EDA design is just like this, but even harder due to the complexity.
As a result, says Patrick, the real effective speedup appears to be about 2x to 3x and it doesn’t take many processors before you’ve gotten all the benefit you’re going to see from parallelization. In the real world of his customers, says Patrick, people prefer to wait two days to get a good chip design rather than get a poor one in a couple of hours.
I asked Patrick a biased question. I know that it’s possible to write C code that smoothly passes through a compiler’s optimizer. (I co-wrote an article with Tensilica's Senior Director of Software Engineering Dror Maydan about this a few years ago: Optimizing C programs for embedded SoC applications.) It’s also possible to write poorly organized source code that the compiler cannot optimize. So I asked Patrick if the same wasn’t true about chip design. Would it not be possible to design chips so that the physical layout and design of the chip was more easily parallelizable?
Yes, said Patrick, it is. Creating a well-organized chip with well-defined block boundaries helps a lot. Making sure that every block’s inputs and outputs are registered also helps a lot. However, said Patrick, EDA vendors must develop tools that accept any sort of design—good, bad, or indifferent—because EDA customers do not want the EDA vendors telling them how to design chips so that they can pass through the tools quicker, more easily, and with better results. Aye, there’s the rub. Neither do C programmers, as it turns out.
Here are Patrick’s secrets of parallel programming:
- Writing parallel code is very hard and writing parallel code is not compatible with using cheap, less-experienced programmers.
- There is significant overhead needed to partition problems into parallel tasks and there’s also post-processing overhead required to re-integrate the results.
- There’s a narrow sweet spot for parallelism for EDA analysis tools but synthesis tools appear more resistant to parallelization because synthesis tools deal with too many problem and data interdependencies.
- Amdahl’s Law holds and 4x appears to be the absolute maximum possible speedup available though parallelization.
- Parallelism costs in terms of quality of result.
- Parallel EDA tool startups have, so far, been spectacular failures (Monterey, Athena, Liga).
All is not bleak however. Here are Patrick’s solid secrets for faster chip design without parallelization (by reducing the number of times through the synthesis/analysis loop):
- Avoid stupid mistakes (typos, using the wrong data sets, setting command-line switches incorrectly). Check all of the setup conditions and commands before you set a tool in motion.
- Rely more on correct-by-construction techniques.
- Back off on constraints and give the tools more leeway. Things will run a lot faster even without the application of brute-force parallelism.
- Register all block inputs and outputs so that critical paths don’t run through several blocks. Buses may be an exception to this rule.
- Don’t believe that standard-cell optimization necessarily produces optimal chip-level results. Reducing the size of your standard cells by 5% doesn’t necessarily lead to a 5% reduction in overall chip size. For example, if the standard-cell size optimization forces interconnect vias off grid or into hard-to-reach positions, you may lose all the silicon that you saved and more when your EDA tool tries to stitch those reduced-size standard cells together.
© Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.
