Multicore programming: easy or difficult?
By Barry O'Rourke, CriticalBlue - March 25, 2009
The advent of homogeneous, shared memory multicore platforms is seen as both a threat and an opportunity for the software industry. Many commentators are concerned that efficiently and correctly porting existing code onto platforms with four or more cores is beyond the capabilities of many engineers. Others simply state that this is a solved problem and that mature SMP operating systems and threading libraries already exist and are well understood. So who is right? The answer, of course, is "it depends."
The promise of multicore in embedded systems is power reduction; however, this can only be achieved if existing sequential applications can be efficiently refactored to take advantage of parallel platforms. Most multicore platforms currently in the embedded market (ARM, MIPS) are 2- to 4-core versions of application-class processors, which are supplied with full SMP OS support such as Linux. This allows engineers to immediately take advantage of the additional cores simply by running more than one program at a time, with the operating system handling the details. Combine this with clock scaling or power gating on any unused cores and power consumption drops dramatically.
To allow the clock frequency to be dropped in more computationally intensive applications, some refactoring of the actual program code is required. Again the OS makes life easier by supplying mature multithreading support and APIs, particularly POSIX Threads (PThreads), allowing the engineer to quickly parallelize sequential code, as well as giving access to a wide range of existing multithreaded applications available as open source. Once the program has been threaded, the OS will take care of allocating the threads at runtime between the available cores without further input from the developer. The ease of this refactoring process, however, depends heavily on the application in question.
Getting it running
A number of different applications fall into the "easy" category. An example would be communications and networking, where a single program needs to deal with a large number of different, and largely independent, channels of interaction. Listing 1, a simple top level for a server application, illustrates this and demonstrates how writing multi-threaded software can be relatively easy. A new thread is created for each incoming connection, and that thread exists for the duration of the connection. There is no interaction between threads; each one is a world unto itself. The operating system deals with assigning these threads to processors and scheduling their execution.
So far, so good. Our investigation suggests that multicore programming requires little more from an application than using existing multithreading APIs, along with some careful design of the parallel partitioning to ensure that shared data is protected. So what is all the fuss about?
Issues begin to arise when tackling more performance-driven embedded applications such as signal and media processing. In this case there is a strong drive to reduce the power consumption, and therefore the clock frequency, of the individual cores to pay back the area spent on going multicore in the first place. The programmer must deliver.
Converting what is often highly optimized code into finer grain threads can be achieved fairly straightforwardly by taking a data parallel approach. Listing 2 shows code that processes a set of complex data types. Careful reading of this code satisfies us that each iteration of the loop is independent of the next since each call to process(...) only accesses a single element of the data array with all other data local to the function call.
When targeting a four-core system, it would seem sensible to allow four iterations of the loop to execute in parallel, with the expectation of achieving better than 3x acceleration. The theoretical maximum of 4x is impossible to attain because of the overhead of thread management; the system must actually perform more processing to complete the same amount of work, which is yet another reason why it is vital to make the most of the available processors in the multicore system.
Listing 3 shows what a parallel implementation might look like using PThreads. This is unarguably more convoluted, but still understandable nonetheless. The main loop now steps through the data in strides of PROCESSOR_COUNT (in this case four). In each iteration, we create a set of threads and then wait for them to complete with pthread_join before continuing with the next iteration. However, when this code is run on the multicore target, the performance is substantially less than expected. In fact, it is barely better than the sequential implementation. Obviously something is not behaving as expected, but what?
This is the sort of situation that gives multicore programming its reputation for being difficult, and a programmer can easily spend substantial time searching through the code for the cause of the problem. Is the problem inefficient use of the threading library, a poor approach to parallelizing the loop, or even a misunderstanding of the sequential program's behavior? This is the point at which multicore software analysis tools become a useful resource.
The ability to capture and graphically view the schedule of execution of the functions and threads in the system quickly highlights three distinct issues. First, a lot of the runtime is spent in calls to the underlying thread management library. Next, the actual runtime of the user functions varies dramatically between iterations of the loop; clearly the program does not treat all of the elements in the data set equally and has some severe load balancing issues. Finally, the load balancing problem is exacerbated because there are only four threads, so once a core has finished executing its assigned thread there is nothing else for it to do.
The first issue, spending a lot of runtime in calls to the underlying thread management library, is a common beginner's mistake, but it can easily catch out more experienced PThreads programmers when unexpected program behavior results in more frequent thread creation than anticipated. The standard solution is to create a fixed number of worker threads known as a thread pool. The exact implementation varies depending upon the application, but generally the threads continue to exist and perform their tasks, either cooperatively or under the control of a boss thread, until some end condition is met, at which point the threads either exit or are shut down by the boss thread, allowing the main program to continue.
The second issue, the dramatic variance in the runtime of the user functions, is resolved by examining the function trace provided by the tools. Here we see that the code must have been heavily optimized for some common input data values, allowing these to execute much faster, typically at the expense of increased code size from lookup tables. While this may be good for single-core performance, it makes life difficult for the multicore programmer. Because the four threads are spread across four cores, the unoptimized case only needs to occur 25% of the time to affect every iteration of the loop. Since the problem is that three of the processors are idle while the fourth completes the larger task, the solution must ensure that additional work is ready for the idle cores. Developers can implement this by ensuring that each thread is free to process another data element as soon as it has completed the previous one, rather than waiting for the entire set to complete as enforced previously by the pthread_join(...) call.
Matching the number of threads to the number of cores is rarely a good idea, for various reasons. It is far too easy to end up with the sort of load balancing problems this example is experiencing. Moreover, it limits the future scalability of the application if it is ever run on a machine with more cores; given the continuing progression of Moore's law, this is almost a certainty. It is best to create enough threads to keep any realistic near-future processor busy, but not so many that they clog up the system. Something on the order of tens of threads probably makes sense; this way the OS scheduler can take more of the strain and keep the cores busy.
Fortunately the solutions to these problems are complementary, but they still require substantial refactoring of the top-level loop of the code fragment (Listing 4). The most striking change is that the top-level control loop has been removed and replaced by the loop in the wrapped_process(...) function. As a result, the incrementing of the index variable is distributed across the threads, with each thread responsible for updating the index of the next data set element to be processed and for detecting when it should exit. Because we now have a shared index variable, gDataIndex, it is necessary to ensure that only a single thread can access it at a time to prevent a data race. This is achieved by the use of a mutex variable and further enhanced by encapsulating the data access within the function fetchAndIncrementIndex(...). Analyzing the resulting schedule shows that performance is now approaching the optimal 4x acceleration.
Clearly, developing multicore software can be far trickier than our initial analysis suggested. What started as a two-line sequential loop required the addition of two functions, two thread management loops, a mutex variable and a major restructuring of the top-level control flow in order to release the available performance. Fortunately this was a straightforward example without any iteration-dependent variables other than the loop index. Programmers are rarely so lucky in practice.
Does it still work?
Whenever code is modified there is the possibility of introducing errors. With extensive modifications, such as those outlined above, there is the near certainty of introducing them. The move to parallel execution massively increases the scope of possible failure modes. When a program is executed sequentially, certain assumptions can be made, the most important being that the code will execute in the order presented in the program. Once the program is split across processors this assumption is no longer valid.
Consider the situation where part of a program running on one processor writes a value to a variable. Now imagine a second section running on a second processor reading that variable. Which value does the second thread read? It all depends on whether the read occurs before, during, or after the write. This situation is called a data race: the two threads are racing each other to access the variable first.
This is a well-known problem with threaded programming, and it also exists in time-sliced thread implementations on single-processor systems. A number of techniques have therefore been designed to control access to a variable, ensuring that there are no conflicts from different threads trying to access it at the same time. However, although techniques are available to solve the problem, the onus is still on the programmer to apply them, and this is often by no means easy or obvious. What is easy is to miss a variable access and therefore fail to include it in the race protection. This leads to a race condition which can be very hard to track down.
Looking again at the code fragment in Listing 4, the programmer has carefully wrapped up access to the gDataIndex global with functions that provide thread-safe access using a mutex. They also pass copies of this variable, meaning that any alteration to a copy will not affect the original. This rigorous application of good programming practice goes a long way toward minimizing the chances of bugs entering the code.
To verify that the code still exhibits the same behavior, the program also runs a testbench comparing the output of the multicore and sequential versions of the code. The results match, so everything must be OK, right? Unfortunately the testbench has missed a direct access to the global variable deep within the call hierarchy of the program (Listing 5). This access does not use the thread-safe functions and as such causes a race condition. So why is there no error during testing?
Data races are very much dependent on timing. If the two competing accesses to the shared variable occur close together in time, then small changes in the execution time of one thread or the other can change the access order, suddenly triggering a previously unnoticed data race. So it is entirely possible for a program to pass all of its testing and work perfectly well for years before some set of conditions, perhaps running on a system with even more cores or a different memory speed, combines to activate the bug.
Fortunately it is possible to detect this sort of problem using tools which observe the memory accesses made by the program at run time. If conflicting accesses to the same memory location by different threads are detected, they are flagged as potential problems when suitable synchronization mechanisms (such as a mutex) are not in place. Once the problem has been highlighted, the programmer can make the required changes to ensure correctness.
Unfortunately not all types of data race are directly detectable with tools. The type described above is caused by instructions interleaving in an unpredictable manner because things happen at the same time. The other issue is more insidious. Once software is running in parallel, any shared variable could be modified at any time in a largely unpredictable manner. This is a major difficulty when parallelizing sequential code. For example, the code above clearly assumes that gDataIndex was used to fetch pData, which will work sequentially, despite being bad programming practice, but is likely to cause problems when run in parallel, even when synchronized.
The best defense against these types of issues is to take a methodical approach: find the shared variables (data race detection tools will do this) and then carefully analyze the context of each to establish whether it can simply be protected by a mutex to ensure atomicity or whether more substantial code changes are required to ensure correct ordering.
The other great bugbear of threaded programming is deadlock. Deadlocks can occur when two separate thread-safe resources are accessed together. Everything is fine provided the locks are claimed in the same order; when they are not, the system can grind to a halt (Listing 6). The two functions are executed by different threads. If a thread reaches function_in_thread_1 first and acquires both locks before the second thread executes the locks in function_in_thread_2, then everything will work correctly: the second thread will simply block until the resources become free.
It is when both threads reach their respective functions at the same time that problems occur. Both threads grab the first lock they need, but when they come to acquire the second they find that the other thread already holds it. Neither thread can run to completion, since each is waiting on the lock held by the other, thereby locking up the entire system. This may seem like an easy problem to avoid, but in even a moderately complex program it is all too easy to forget, or, in a multi-person development team, simply not to know, which order to lock in.
To help guard against deadlock, tools can highlight potential problem areas in the code, but it is difficult to say automatically, and for certain, whether a given sequence of locks can ever cause a deadlock. The best approach is to enforce a programming standard on a project so that everyone working on an application knows what order to lock and unlock in. This can be further automated by implementing lock management libraries that provide a higher-level API for other programmers to use.
From this discussion, it appears that multicore does present many challenges to the software developer. While there are many applications, such as servers and user interfaces, which are often already multithreaded on a single core and will naturally scale to many cores, it is clear that these represent a minority of the software back catalog, particularly in embedded systems.
Many high-performance embedded applications have been massively optimized over many years to run in as few cycles as possible on a single core. This often results in code that is difficult to port efficiently onto a multicore platform, or worse, almost impossible to verify as safe and correct once ported. Fortunately the multicore developer has access to an increasing number of tools to help reduce these difficulties.
Foremost among these tools has to be education. Whether designing new applications or porting existing code, a designer's move from sequential to parallel processing requires a substantial change in thinking, beyond merely learning a new set of APIs. Fortunately a number of industry and academic groups are producing books, courses and online resources, such as the Multicore Association's MPP (multicore programming practices) project, to help developers upskill.
Alongside this, a number of programming language extensions and libraries are becoming available that aim to prevent the developer from becoming bogged down in the management of their multithreaded programs by providing a more abstract API. These include OpenMP, Intel's Threading Building Blocks and MPI, among others.
Finally, there are the emerging multicore- and threading-aware analysis tools, which provide automated analysis of large, complex programs that is nearly impossible to perform manually. While tools are unlikely to entirely automate the parallelization process, they are an indispensable part of the design process.
So, is multicore development easy or difficult? Beyond the low hanging fruit, I'm going to have to say it is difficult, but thanks to the existing and emerging tools and methodologies it is at least not going to be impossible.