
Steve Leibson

Leibson's Law: It takes 10 years for any disruptive technology to become pervasive in the design community. This blog is about the disruptive technologies that either have or will win over electronic engineers, some that won't, and why. Please feel free to link to these blog entries! Written by Steve Leibson, a marketing consultant specializing in lead generation and content creation for high-tech companies, former VP of Content for Reed Business, and former Editor in Chief of EDN. See my consulting Web site at www.sleibson.com and my history site at www.hp9825.com. You can email me at steven.leibson followed by the magic email symbol @ followed by att.net.


Saturday, April 19, 2008

Multicore Parallel Programming: Can We Please do it Right This Time? – IEEE Electronic Design Processes Workshop 2008



Tim Mattson, a principal engineer at Intel’s Applications Research Laboratory, describes himself as an old-school parallel programmer. He’s been slinging code for impressively parallel machines since 1985. That makes him an authority in my book. He sounds like one too. Mattson was the keynote speaker at last week’s IEEE DATC Electronic Design Processes Workshop held in Monterey, California. He had a lot to say about programming parallel machines (increasingly common in the multicore SOC age) and he said it with passion and authority—a potent combination. But first, his disclaimer (which I loved): Mattson was speaking his mind and his opinions do not necessarily reflect those of Intel or its lawyers.

Mattson first reviewed the main reason behind the trend toward using more processors in SOC and system design. I’ve discussed these drivers many times in this blog, but it doesn’t hurt to repeat them briefly. Moore’s Law continues unabated. We get more transistors onto a chip every year. Mattson laid out the timeline for future lithographic advances in IC fabrication:

Year                Lithography

2007                45nm
2009                32nm
2011                22nm
2013                16nm
2015                11nm
2017                8nm

In 10 years, we’ll be capable of putting 32 billion transistors on a chip. By then, we’d better not be designing chips and systems the same way we do today.
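The 32-billion figure is consistent with simple area scaling. Here is a rough sketch in Python, assuming that transistor count grows with the inverse square of the feature size and assuming a baseline of roughly one billion transistors at 45nm. The baseline and the function names are assumptions made for illustration; they are not numbers or terms from Mattson's talk.

```python
# Feature sizes from the roadmap above (nanometers), keyed by year.
ROADMAP_NM = {2007: 45, 2009: 32, 2011: 22, 2013: 16, 2015: 11, 2017: 8}

# Assumed baseline, for illustration only: about 1 billion transistors at 45nm.
BASELINE_TRANSISTORS = 1e9


def density_gain(old_nm, new_nm):
    """Ideal area-scaling gain: transistor count scales with 1 / feature_size**2."""
    return (old_nm / new_nm) ** 2


def projected_transistors(year):
    """Project the transistor budget for a roadmap year from the 2007 baseline."""
    return BASELINE_TRANSISTORS * density_gain(ROADMAP_NM[2007], ROADMAP_NM[year])


if __name__ == "__main__":
    for year in sorted(ROADMAP_NM):
        billions = projected_transistors(year) / 1e9
        print(year, f"{ROADMAP_NM[year]}nm", f"{billions:.1f} billion")
```

Under these assumptions, the shrink from 45nm to 8nm gives a gain of (45/8) squared, or roughly 32 times the starting transistor budget. Real processes do not scale this cleanly, so treat the result as an order-of-magnitude check.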

Dennard scaling died at 90nm, so we’re not seeing the power reductions we formerly got with every process shrink. As a result, power considerations are curbing the formerly rapid climb in clock rates, which had served as our principal tool for boosting processor performance. Now we’re turning to multiple processors to replace GHz clock rates as the tool of choice for boosting performance. All of this is cold, hard reality. Anyone who argues that we will continue to see clock-rate improvement ramps like those that have come before is delusional. (Those are my words, not Mattson’s.)

Consequently, parallel programming is about to become a way of life in our business and Mattson wants us to be ready. “Write your software for 100 processor cores,” says Mattson, “and then run it on fewer cores. Your code will work.” Mattson points out that Intel has been in the parallel computing business since 1985, when it started building parallel supercomputers based on (gulp) 80286/80287 microprocessors and floating-point coprocessors. Since then, Intel has underwritten the creation of a variety of parallel architectures, which has gotten Intel membership in the Dead Architecture Society according to Mattson. That guy can sure turn a phrase.
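One way to read Mattson's advice is to split the work into many more pieces than you have cores and let the runtime map the pieces onto whatever hardware exists. The sketch below is a minimal illustration of that idea using Python's standard thread pool. The names NUM_TASKS, partial_sum, and parallel_sum are hypothetical, and the summing workload is only a stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 100  # decompose for many cores, whatever the machine actually has


def partial_sum(task_id, data, num_tasks=NUM_TASKS):
    """Sum the slice of data owned by one task (strided decomposition)."""
    return sum(data[task_id::num_tasks])


def parallel_sum(data, workers):
    """Run the same 100-way decomposition on however many workers exist."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda task_id: partial_sum(task_id, data), range(NUM_TASKS))
        return sum(parts)


if __name__ == "__main__":
    data = list(range(10_000))
    for workers in (1, 4, 16):
        print(workers, parallel_sum(data, workers))
```

The answer does not depend on the worker count, which is the point of the exercise. In CPython, threads will not speed up this CPU-bound loop because of the global interpreter lock, so the sketch shows the program structure rather than a performance result.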

During a quarter century of parallel computing development, Mattson has seen many parallel programming fads come and go. He’d like for us to avoid going down roads that we have already established as dead ends. For one thing, says Mattson, we are not going to find a way to automatically distribute an arbitrary chunk of software across multiple processors. At least not in the foreseeable future. This is the famous “parallel programming problem” and Mattson says that our only hope is to get programmers to write parallel software “by hand. After 25+ years of research, we are no closer to solving the parallel programming problem.” Yet only a tiny fraction of programmers write parallel code.

“History repeats itself while historians repeat each other” (Philip Guedalla 1889-1944)

Computer scientists are dismal at remembering the past, says Mattson. They can’t even spell the word literature. So Mattson raises the banner of computer history to try to remind people of the noble, failed experiments in massively parallel programming with the hope that we can avoid remaking these same mistakes. Here are his five key points:

  1. Parallel systems are useless without parallel software.
  2. “If you build it, they will come,” is not true here.
  3. We can’t generate parallel software automatically, even though people still try.
  4. Our only hope is to get programmers to write parallel software by hand.
  5. After 25 years of research, we’re no closer to solving the “parallel programming problem.”

Next, Mattson demolished another favorite idol of computer science: the parallel programming language. It’s a well-known trait of computer scientists that they will try to solve every problem with a new programming language. The parallel programming problem is no exception. Mattson displayed an eye chart listing the names of nearly 250 parallel programming languages developed just during the 1990s. “This is silly,” said Mattson, “If creating a new language was the solution, the problem would already be solved. This is not the path to a solution.”

Then Mattson cited the Draeger’s grocery store experiment to prove his point. In this experiment, two jelly and jam displays were set up in one of the San Francisco Bay area’s Draeger’s grocery stores. One display had six types of jellies and jams. The other displayed 24 types. Although consumers examined more product in the display with more choices, they bought more product from the display with fewer choices. The conclusion: fewer choices lead to more sales. For parallel programming, the lesson is that having 250 programming language choices leads to paralysis. Yet there is a parallel programming paradigm that works: MPI, a message-passing API that works with the existing, non-parallel programming languages that programmers already know how to use.
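For readers who have not seen the message-passing style, here is a small sketch of the idea. It is not MPI and does not use any MPI library. It assumes Python's standard queue and threading modules as stand-ins for send and receive, and the names worker and scatter_and_reduce are hypothetical. Each worker owns its data and communicates only through explicit messages.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive one chunk of work, compute on it, and send the result back."""
    chunk = inbox.get()               # blocking receive
    outbox.put((rank, sum(chunk)))    # send the result, tagged with the sender's rank


def scatter_and_reduce(data, num_workers):
    """Scatter slices to workers, then gather and combine their partial sums."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for rank in range(num_workers):
        inboxes[rank].put(data[rank::num_workers])   # send work to each rank
    total = sum(results.get()[1] for _ in range(num_workers))   # gather
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(scatter_and_reduce(list(range(100)), 4))
```

Nothing is shared between workers except the messages themselves, which is what lets this style sit on top of an ordinary sequential language.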

So here’s Mattson’s bottom line: Let’s do it right this time. “Let’s stop taking potshots at each other’s APIs and adopt a more disciplined, scientific approach. Science is a community process. If we want to make progress on programmability, we need to:

  • Develop a systematic, human-centered model of how programmers solve parallel programs.
  • Define a human-language of programmability—so we can objectively discuss the pros and cons of different programming technologies.
  • Define metrics so we can track progress and make systematic comparisons between APIs.”

To that end, Mattson points to his book, Patterns for Parallel Programming, as a starting point. He then invited the workshop to join in peer review and experiments to validate his theories.


Related entries in: Computers, boards, buses | SOC | Software Development Tools | 


Reader Comments



at 4/19/2008 5:35:29 PM, Mapou said:
Great article, one that is sure to strike a few excitable nerves. Tim Mattson said, “If creating a new language was the solution, the problem would already be solved. This is not the path to a solution.”

I agree with Mattson. Note also that the same can be said of multithreading, which has been around for as long as multi-processing if not longer. So, in my view, any solution to the parallel programming problem should be neither linguistic nor thread-based. I wonder if Mattson sees the folly of taking the multithreading route. I suspect that this is not yet a priority at Intel since Mattson is still talking about the need to program for 100 cores so as to construct applications that will run on fewer cores. In a correct parallel programming environment, in my opinion, the programmer should not have to think about cores in the first place. One programmer's opinion, of course.

Steve, it's always a pleasure to read your articles. Your style is clear, concise, intelligent and focused. I hope you continue to pound on the parallel programming issue as it is the greatest single problem facing the industry in a long time. We are at a crossroads. It is almost as if we're back in the pioneering days of the late 70s and early 80s. It's an exciting time to be in this business. Thanks.



at 4/20/2008 12:07:36 AM, Steve Leibson said:
Mapou, many thanks for the kind comments on my writing. I can be clear and insightful only to the extent that presenters like Mattson do the same. In these articles, I mostly serve as eyes, ears, and scribe for EDN's readers. I toss in a few of my own observations if appropriate. I hope to write a lot more on multicore SOC design, which includes software design of course.



at 4/21/2008 5:21:43 AM, Barry said:
Steve, I think Tim's ideas show that the parallel programming problem is mostly about education. Fundamentally the software engineer's job is to efficiently divide a program into blocks that can run concurrently and to define the communication between those blocks. This has little to do with the API or language used and everything to do with the approach taken at design time. It's a case of what you do, not how you do it.



at 4/21/2008 8:00:07 AM, Vivek said:
If MPI is the preferred communication API for parallel processors in a cluster, it is not exactly true that such an API can exactly emulate such a system on a multi-core multi-processor (MCMP) system. However, I agree that with the multitude of the available parallel programming languages, nothing is simplified for an average scientist or researcher to exploit the available MCMP system. IMHO, the need of the hour is a simple and unified API that can target the cores for parallel computation and communication.
--Just another grad student



at 4/21/2008 2:09:47 PM, Robert Leif said:
You might look at Ada, which was designed to work with multiprocessors. A quick search of Google for Ada multiprocessors provided 142,000 hits.



at 4/21/2008 2:28:16 PM, rus said:
Excuse my limited understanding, but wasn't/isn't the MPI a short-term solution for the multi-core processors?... For 'true parallel processing' there is need to start from ground up and build up... Starting from understanding, then language, then education, then development optimization, then development, (many years) but that's just the microprocessor, what about the end user OS? Adds more years... Well, you get the point. (Oh, and add to that the fact that technology is not waiting around for this...) -- just an EE -- vivek frm ithaca,ny?



at 4/21/2008 2:28:18 PM, desert rat said:
Interesting here that you're talking about parallelized code on homogeneous cores rather than talking about segmented code on heterogeneous cores. We did this in the mainframe biz years ago....data comm processors, I/O processors, math processors, matrix processors (Vector-mode processors), etc. Also, this thread seems to say that no one remembers Gene Amdahl's law...
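For reference, Amdahl's law bounds the speedup of a program when only a fraction of it can run in parallel. A minimal sketch follows; the function name amdahl_speedup and the 95 percent figure in the usage loop are chosen purely for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (2, 8, 100):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Even with 95 percent of the work parallelized, 100 cores yield a speedup of a little under 17, and no number of cores can push it past 20. The serial remainder dominates quickly, which is why the law matters for many-core designs.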



at 4/21/2008 3:21:32 PM, Policebox said:
I don't know what Mapou is getting at about multithreading, since the only way to take advantage of multiple processors is to divide a task into smaller tasks and farm the smaller tasks out to the processors.
This can be done by assigning different stages of the task to different processors (pipelining) or by assigning different parts of the data to different processors (parallel processing). Either way, it seems to me that the natural way to manage the subtasks is by making them threads.



at 4/21/2008 4:04:06 PM, rogerdq said:
Interesting view of the problem - weren't the Inmos Transputer and its associated language/OS Occam supposed to address at least some of these problems?

Robert Leif's post on Ada rang a few bells, too.

As I recall, the Forth community was addressing this problem as well.



at 4/21/2008 8:11:07 PM, Patrick H. Madden said:
Thanks for writing on EDP; we had some great talks this year, and Tim Mattson's was a stand-out.

@Desert Rat -- Mattson is certainly aware (and worried about) Amdahl's Law, but there are a lot of other folks who sort of gloss over it. There's the "official" position from the marketers at Intel, but that's not exactly what most of the research people believe.

Tim will be giving a talk at DAC this year. I don't know if he can be as blunt as he was at EDP, but I don't expect him to pull any punches.

A subset of Tim's slides are up on the EDP site; other talks should be appearing there shortly as well.



at 4/22/2008 12:12:25 AM, Sumit said:
Hi Steve

Have you looked at NVIDIA's many core GPU architecture and the CUDA C programming environment? A recent Microprocessor Report article titled "Parallel Processing with CUDA" (available on the NVIDIA website -- google it) gives an overview of both CUDA and NVIDIA's GPU architecture.

GPUs have come a long way and have become fully programmable architectures and are now being applied to general purpose computing. NVIDIA launched the Tesla products based on GPUs to target the high-performance computing market.

It is interesting how Intel talks about many core being the architecture of the future, when an NVIDIA GPU today already has 128 processor cores on it.

CUDA probably has one of the biggest active developer communities for any many core architecture right now. Since CUDA is mostly plain C with some extensions to map things to a GPU, it is easy to imagine that it could be used as the programming environment of choice for multi-core SOC architectures as well.

Regards
Sumit
(disclaimer: I am a NVIDIA marketing person, but my opinions expressed here are to highlight a technology that is relevant to the topic under discussion that has been largely overlooked by the multi-core SOC community).



at 4/22/2008 1:26:12 AM, Peter said:
I think multi-threading is the solution added to multiple VM's. We did it on ACP in the 70's. The real problem was debugging the stuff - my eyesight was ruined looking through endless core dumps....it was exponentially complex the more real time threads running. Still we soon learnt to desk check the code better first....



at 4/22/2008 8:53:01 AM, bk said:
I find that the serialization & waiting for semaphores is what causes the most issues for keeping performance up with multiple threads. I will often get better elapsed times if I use a macro that uses a busy wait loop for 10,000+ iterations and only does an actual wait for the semaphore when the loop count expires. This shows that the cost to wake up a thread is way too expensive. We need to improve the wake-up overhead for threads so the serialization costs can come down.
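The spin-then-block pattern described in this comment can be sketched as follows. This is an illustration only: the commenter's macro is not shown, so the function name spin_then_block_acquire, the SPIN_LIMIT constant, and the use of Python's threading.Semaphore are all assumptions.

```python
import threading

SPIN_LIMIT = 10_000  # assumed spin budget, echoing the 10,000+ iterations mentioned above


def spin_then_block_acquire(sem, spin_limit=SPIN_LIMIT):
    """Try non-blocking acquires in a busy loop; block only if the budget runs out.

    Returns True if the semaphore was obtained while spinning, and False if
    the caller had to fall back to a blocking wait.
    """
    for _ in range(spin_limit):
        if sem.acquire(blocking=False):
            return True
    sem.acquire()  # budget expired: pay the cost of a real wait
    return False


if __name__ == "__main__":
    sem = threading.Semaphore(1)
    print(spin_then_block_acquire(sem))  # semaphore is free, so spinning succeeds
    sem.release()
```

Spinning only pays off when the semaphore is usually released within the spin budget and a spare core is available to do the releasing. In CPython the interpreter lock makes busy-waiting less attractive than it is in C, so the sketch shows the control flow, not a tuned implementation.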



at 4/23/2008 3:59:46 AM, Jüri Põldre said:
One possible source for these "Parallel educated engineers" comes from programmable hardware. FPGAs make it possible to dump parts of an algorithm into HW and are becoming viable solutions price/perf wise. Also, students today are accustomed to thinking about programming in a wider range - after all HW is also a program (VHDL/Verilog), not a schematic any more.
The parallel programming enters from the HW side, where things happen in parallel inherently. This thinking is carried up to SW through HW/SW codesign.



at 4/23/2008 8:01:01 AM, Steve Leibson said:
Jüri Põldre, I cannot agree that putting an algorithm into a hardware block versus a block based on a single-tasking processor makes any difference whatsoever to a system's design. The hardware block is not multithreaded. It serves to divide the problem into smaller, more comprehensible parts with each part in its own little block. A block based on a single-tasked processor core does the same. Since you bring up FPGAs, I'll point out that FPGA fabrics are 10-20x less efficient at implementing hardware compared to an ASIC. Thus you'll be miles (kilometers) ahead using a firmware-programmable processor in an ASIC to implement a block. It's not the processor that's a problem, it's saddling many tasks on that processor that's a problem.



at 4/26/2008 10:48:58 AM, Anonymous said:
I'm not trying to brag, but I research transactional memory (it's a small, but rapidly growing research area), which is essentially research-in-progress towards a simplified parallel programming model. The main thrust is to remove from the programmer the responsibility of managing locks, which is the hardest part of robustly building a realistic application. CUDA is also a good approach. MPI doesn't really add much to improve multi-core programming.

You may want to do a google search on transactional memory; I certainly think it's a very good idea. And so do Sun (building processors with support for transactional memory), Intel (have released production-quality compilers for it), and Microsoft (are now hiring programmers to build a product based on transactional memory).



at 4/26/2008 4:14:07 PM, rhb said:
Once again we're being told by the hardware designers to write the application to suit the hardware.

How about trying something different, such as looking at what is done in existing software and designing hardware to run it quickly?

Very often the nature of the problem doesn't allow doing the things the hardware designers want us to do. And if it is possible, it's a major expense to do so.



at 4/26/2008 9:58:22 PM, Grant Martin said:
rhb makes an interesting point, but I think the path of trying to design hardware to run existing software more quickly was the path of much mainstream processor design for many years until nonlinear increases in power and energy consumption made that a very difficult path to continue. Techniques such as superscalar processing, VLIW, speculative execution and the like all attempted to recover concurrency from software that was expressed in sequential form - and although these provided some gains in performance, the main performance gain came from cranking up the frequency until power made that difficult to continue. Hence the move to multiprocessor and multicore, arguably at a point when we were not well prepared for it. The last few years of multicore indicate that there are neither any free lunches nor silver bullets, and making hardware that will speed up existing software further seems a path that no-one really knows how to follow. Inevitably, work in either modifying software to exploit new hardware (if it can be done) or returning to first principles - the algorithms that we are trying to execute - and looking afresh at natural concurrency in these algorithms and hardware that can exploit that concurrency better - seem the only viable known paths. Of course, new research ideas might emerge that will allow the existing legacy software base to be sped up - but it doesn't make much sense to wait and hope for a research breakthrough when there are useful viable methods to explore today, albeit costing effort and expense. (Disclosure: I am a colleague of Steve Leibson's at Tensilica, working in engineering).



at 4/27/2008 12:01:48 PM, Steve Leibson said:
RHB, I believe hardware designers have spent the last 35 years of microprocessor design (and 25 years of CPU design before that) developing hardware ways to run code faster. The latest tricks included superscalar execution and speculative trace execution. Superscalar machines try to eke out instruction-level parallelism and the magic number seems to be around three simultaneous instructions. Speculative trace execution simply executes every possible code path speculatively in parallel and throws away the results from the paths not taken. It makes little sense to me to say that the hardware should work hard to make code run faster unless there's some sort of effort made in the software and system designs to streamline the effort. To use an analogy, it makes little sense to say that designers of internal combustion engines should make their designs more energy efficient without also reducing vehicle weight and without paying attention to aerodynamics. Hardware and software are the Yin and Yang of modern system design. They work in harmony, not as independent parts.



at 4/28/2008 11:53:03 AM, MattR said:
Tim, I've also been writing code for big iron since the 80's and I agree with your bottom line (there needs to be some solid analysis of how parallel codes are written) but a few notes about existing tools: MPI is not the issue and it's not the solution, it's one of dozens of issues and that's the problem. The complexity of the hardware requires that the code, in order to get performance, will be difficult to maintain or port. I can improve my code performance 10 fold by working on single processor performance. But the resulting code is a rat's nest. So, along with patterns and education there is still room for tools to aid the process. It might be domain specific languages or something for working on legacy code. I don't know, but it's going to be more than a text book.



at 4/29/2008 12:19:17 AM, Steve Leibson said:
Very well said MattR.



at 8/6/2008 9:30:57 AM, many core said:
How many cores do we have to be at before parallel code writing becomes mainstream? Hopefully it will catch on before we reach 100 or so cores.



at 8/6/2008 9:46:20 AM, Steve Leibson said:
many core: before parallel programming techniques for, er, many cores catch on, the industry must come to accept that many inexpensive processors can get more work done for less energy than one big honking supercomputing processor. You would think that good old, logical engineers would adopt this design approach after looking at the cold hard facts. In reality, I think this will be difficult in our societies, which continue to drool over and worship supercars with 1000-hp V12 engines that can go 253 miles/hour...for 12 minutes before draining the gas tank dry. The same visceral adoration of things fast drives people to ignore the practicality.



at 11/19/2008 7:56:55 AM, Jimmy two times. said:
Who needs more compute power nowadays? Our PCs have more than most of us need already!
Maybe all this stuff has value in Grid-Computing service centres - to aggregate a lot of compute power to serve a bunch of users? - We've got desktop apps, music, video all done now. And Hi-Def isn't worth the $$$ for the masses. Any opportunities must lie in the lower-end embedded space - new ideas and automation, etc....

©1997-2009 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.