Subscribe to EDN

Multicore

March 11, 2009

Earlier I discussed power in integrated circuits. As most people know, power is the main reason that PC processors have had to move away from single core chips with increasingly fast clock rates and towards multi-core chips. Embedded chips are starting to go in the same direction too; modern cell-phones often contain three or more processors even without counting any special purpose ones used for a dedicated purpose like mp3 decode. The ARM Cortex is multicore.

Of course this moves the problem from the IC companies, how to design increasingly fast processors, to the software side, how to write code for multi-core chips. The IC companies have completely underestimated the difficulty of this.

The IC side of the house has assumed that this is a problem that just requires some effort for the software people to write the appropriate compilers or libraries. But in fact this has been a research problem for over forty years: how do you build a really big powerful computer out of lots of small cheap ones? It is unlikely to be solved immediately, although clearly a lot more research is going on in this area now.

There are some problems, traditionally known as “embarrassingly parallel,” which are fairly easy to handle. Far from being embarrassing, the parallelism is so simple that it is easy to make use of large numbers of processors at least in principle. Problems like ray-tracing, where each pixel is calculated independently, are the archetypal example. In fact nVidia and ATI graphics processors are essentially multi-core processors for calculating how a scene should be rendered (although they don’t use ray-tracing, they use cheaper polygon-based algorithms). In the EDA world, design rule checking or RET decoration are algorithms where it is (fairly) easy to parallelize them: divide the chip up into lots of areas, run the algorithm on each one in parallel and take a lot of care on stitching the bits back together again at the end.

But most problems are more like Verilog simulation. It is hard to get away from having a global timebase, and then the processors have to run in lock-step and the communication overhead is a killer. Yes, in limited cases processors can run ahead somewhat without violating causality (such as simulating fast processors connected by slow Ethernet) and so reduce the amount of required synchronization but the typical chip is not like that.

Years ago Gene Amdahl noted that the amount of speedup that you can get by building a parallel computer of some sort is limited not by what can be made parallel but what cannot. If, say, 10% of the code cannot be parallelized, then even if we take the limiting case that the parallel code finishes instantaneously, the maximum speedup is just 10 times. This has come to be known as Amdahl’s law. So that is the first limitation on how to use multi-core. To use hundreds of cores effectively then the amount of code that cannot be completely parallelized must be tiny.

The next problem is that it is not possible to divide up the problem at compile time and capture that decision in the binary. If you have a loop that you are going to go around 1000 times to calculate something for 1000 elements, then one way is to unroll the loop, spawn the calculation simultaneously on 1000 threads on 1000 processors and accumulate the results. If the amount of calculation is very large compared to the overhead of spawning and accumulating, this might be good. But if you only have two processors, then the first two threads will go ahead and the next 998 will block waiting for a processor to become available. All the overhead of spawning and accumulation and blocking is just that, overhead that slows down the overall computation. How to map computation to processors must be done partially at run-time when the resources available are known.

The other big problem is that most code already exists in libraries and in legacy applications. Even if a new programming paradigm is invented, it will take a long time to be universally used. Adding a little multi-threading is a lot simpler than completely re-writing Design Compiler in a new unfamiliar language, which is probably at least a hundred man-years of effort even given that the test suites already exist.

There are some hardware issues too. Even if it is possible to use hundreds of cores, the memory architecture needs to support enough bandwidth of the right type. Otherwise most of the cores will simply be waiting for relatively slow access to the main memory of the server, as shown in more detail in this IEEE Spectrum article. Of course it is possible to give each processor local memory, but if that is going to be effective those local memories cannot be kept coherent. And programming parallel algorithms in that kind of environment is known to be something only gods should attempt.

I’ve completely ignored the fact that it is known to be a hard problem to write parallel code correctly, and even harder when there really are multiple processors involved not just the pseudo-parallelism of multiple threads or processes. As it happens, despite spending my career in EDA, I’ve got a PhD in operating system design so I speak from experience here. Threads and locks, monitors, message passing, wait and signal, all that stuff we use in operating systems is not the answer.

Even if the programming problem is solved with clever programming languages, better education and improved parallel algorithms, the fundamental problems remain. Amdahl’s law limiting speedup, the bottleneck moving from the processor to the memory subsystem, and the need to dynamically handle the parallelism without introducing significant overhead. They are all hard problems to overcome. Meanwhile, although the numbers are small now, the number of cores per die is increasing exponentially; it just hasn’t got steep yet.

Our brains manage to be highly parallel though, and without our heads melting, so there is some sort of existence proof of what is possible. But, on the downside, we are really slow at calculating most things and also pretty error-proon.

Posted by Paul McLellan on March 11, 2009 | Comments (11)

March 28, 2009
In response to: Multicore
John McGehee commented:

Ah, Multicore solder... been using it since high school.


March 18, 2009
In response to: Multicore
Meredith Poor commented:

Three examples of CPU intensive activity come to mind where multicore CPUs can actually help: ~~~ First is the plain old sort problem, which is typically invoked with a single statement or command line. If there are four cores, each can be assigned 1/4th of the problem: sort one quarter of the records, then merge the sorted keys back through the 'host' CPU. ~~~ Second is the compression problem, such as JPEG. Depending on the nature of the problem, it may be constructive to use a multi-core efficient compression algorithm (satellites beaming back image data would be a good example). Some compression schemese sort various assemblages of data, so the sort issue may emerge along with compression. Rather than trying to divide a 'straight line' problem, use a more appropriate output. ~~~ Third might be described as the 'emergency stop' monitoring scenario. This occurs if, for example, a six axis robot is spray painting a car body, and some fault develops in a motor or something falls into the workcell and blocks the arm. Each of the motors is generating feedback that has to be monitored. Having one CPU do this may require the developer to be conscious of the sampling order. Having an 8 core system dedicate one core each to one motor means that the individual processors concern themselves with their voltage/current/temperature/rotation envelope, with some input from the immediate neighbor(s) (i.e., the previous and next degree of freedom). If their degree develops a fault, they raise an interrupt to the 'base processor'. This processor then executes a plan for retraction or controlled power-down, such that the robot is not either dripping paint on a ventilator, or left in a position where it might injure someone coming in to work on it.


March 18, 2009
In response to: Multicore
SQLGuy commented:

I''m not sure the market is big enough for real parallel operations. I work with databases (Oracle, SQL Server, DB2...) and for the most parts, these "enterprise" applications do a pretty good job with parallel ops. Over the years it''s been optimized significantly and actually works very efficiently today. The bigger challenge for the average user is to run multiple applications that are not related or loosely related without having one walk all over the other. For instance, I want to be able to work on a script file while another is compiling in the background while I''m copying 5 different data files (1GB to 300GB in size) across the wire. Of course, I want to do this without the stuttering, delays and intermittent freezes as the OS does its housekeeping to juggle the workloads. This seemingly simple multitasking still causes me varying levels of grief regardless of platform (Solaris, UX, Windows, Redhat). Works great for "isolated" workloads like VMware (as another reader indicated) but that because we explicitly allocate specific resources to specific VMs. When leaving things to the OS, multiple apps running in parallel could still use some major refinement.


March 16, 2009
In response to: Multicore
Wiltshire lad commented:

Yes Finian transputer would be nice. With a spot of OCCAM as well? It would be interesting to see what a ''modern'' rendition would look like, wouldn''t it?


March 14, 2009
In response to: Multicore
CharlieM commented:

The "parallel problem" can be traced to the exclusive use of linear-sequential (LS) technology. All known processors are LS Turing machines (TMs) with shared resources. The flexibility granted by software comes at the expense of linear-sequential operation. Any ?massively parallel? machine is composed of many LS machines. Everyone in this discipline is aware of these facts. The source of the limitations of current hardware and software is the exclusive use of Boolean algebra. The system of Boolean logic is static and can?t accommodate change. All relationships must therefore be viewed and evaluated in fixed frames. The frames can only be judged one at a time, or in pairs being compared. The logic used thus limits the mode of operation to a linear sequence of frames, or program. The common denominators of all computer languages and hardware constructions are the Boolean operations and memory functions. True parallel-concurrent operation can?t be based upon linear-sequential activity, but must be parallel-concurrent in a fundamental way. A system of dynamic logic, expressed in a parallel-concurrent algebra can do the trick. The results are control systems that are fast, self-monitoring, flexible, asynchronous, parallel-concurrent, and easy to understand and create. cmoel888@aol.com


March 13, 2009
In response to: Multicore
Finian commented:

The Transputer anyone ?


March 12, 2009
In response to: Multicore
M. Simon commented:

The data flow architecture of the SeaFORTH chips solves the waiting problem AUTOMATICALLY. Cores shut down when waiting for data and AUTOMATICALLY wake up when the data arrives. And there is no long latency either. Wake up takes about a nanosecond. A fully awake core that runs at 700 MIPS takes less than 10 mW. An asleep core takes less than 10 uW.


March 12, 2009
In response to: Multicore
M. Simon commented:

You need to stop eating those error proons. They will make you sick.


March 12, 2009
In response to: Multicore
Sunguy commented:

Sun Micro has done a great job with Solaris and apps such as oracle using its multicore-multithread systems. I like the way they get around the memory latency issues by allowing registers in the proc to store virtual address info while waiting on memory and then at the same time ( clock cycle ) spawn another thread for the proc core to work on.


March 11, 2009
In response to: Multicore
MP commented:

Another aspect of parallelism is debugging (for C/C++). Reproducing issues is a nightmare and drain on development resources. It is not easy to break every single algorithm into autonomous tasks that do not require communication.


March 11, 2009
In response to: Multicore
Meredith Poor commented:

Most of the time people look at parallelism as something one does within one computer. If you look at your 'farm' of servers, workstations, printers, etc. obviously they all run in parallel, however most of them are 'at rest' waiting for someone to give them something to do. In this respect overall most computer power is hopelessly wasted. Rather than trying to solve 'one problem' with multiple computing elements, it usually makes sense to simply give each core a relatively autonomous task (AKA virtual machine) where the computing elements are barely aware of each other's existence. Embedded control problems often have huge intervals (computationally speaking) between external events that trigger input and response cycles. The value of multi-cores in this situation is particularly vacuous.

POST A COMMENT
Display Name
captcha

Before submitting this form, please type the characters displayed above. Note the letters are case sensitive:

Advertisement
Advertisement
Advertisement
About EDN   |   Site Map   |   Contact Us   |   Subscription   |   RSS
© 2012 UBM Electronics. All rights reserved.
Use of this Web site is subject to its Terms of Use | Privacy Policy

Please visit these other UBM Canon sites

UBM Canon | Design News | Test & Measurement World | Packaging Digest | EDN | Qmed | Pharmalive | Appliance Magazine | Plastics Today | Powder Bulk Solids | Canon Trade Shows