Feature
Optimizing multiprocessor systems
Efficient optimization of multiprocessor systems begins with avoiding inefficiency in the first place.
By Nicholas Cravotta, Contributing Technical Editor -- EDN, 9/4/2003
|
Optimizing a system means more than getting as close as you can to 100% usage. It also involves getting close to 0% inefficiency. Designing systems using multiple processors, each with multiple cores running multiple threads, presents a difficult optimization challenge. Higher order languages, framework environments, APIs, resource managers, and a slew of other abstractions enable you to step back from the complexity of a problem and view it from a higher, simpler, and more comprehensible level.
Abstraction's great advantage is that you can quickly sketch out your design and fine-tune it where you need better performance. In other words, optimizing a system means determining where abstraction introduces the most inefficiencies and then selectively replacing those abstractions with more concrete descriptions to remove those inefficiencies.
One plus one is much more than twoOne challenge of designing multiprocessor systems is that, although you can fairly easily prove that an individual task falls within system constraints, doing so in the presence of additional tasks contending for the same resources, such as computational cycles or memory bandwidth, can be difficult.
Multiprocessor architectures solve small and large problems. In a network-processing application, for example, in which several processors concurrently process packets so that one of the packets completes before a new one arrives, you may need to solve a lot of little problems with several processors. You might also see such an architecture in a DSP or an encryption farm, which sends small problems to a central location for processing. With large problems, such as image or signal processing, you need to break the problem into manageable pieces, with a separate processor core solving each piece. Note that some architectures employ both techniques: They break many small problems into pieces that separate processors solve.
Breaking a problem into pieces has the advantage of enabling more efficient use of resources. You can define how many cores run a process or task based on how performance-intensive the task is. A key disadvantage of breaking up a problem is that you have to stitch the pieces back together—a process that requires more overall processing than a single processor. It usually involves passing intermediate results to the next task and resolving seams; when you break an image into four quadrants, you have to resolve the seams that result between the quadrants.
Software partitioning is often arbitrary; thus, it artificially constrains the system. You base it on boundaries logical to human understanding rather than practical boundaries, such as the number of cycles to execute. It takes a lot of effort to create a software partition and even more effort to later shift boundaries. If you discover an imbalanced task—one that requires 2.5 cores' worth of performance—it may become difficult to repartition tasks to recapture the unused half core (see sidebar "Quantum math: 2.5+3.5=7"). You have to focus on other, unrelated abstractions that might be more efficient than the task-boundary abstraction. Optimizing such abstractions becomes more challenging, because you have to remove more abstractions to remove more inefficiencies.
What's inside? Who cares?Some environments use software constructs to abstractly access resources. For example, several NPUs (network-processing units) abstract table searches through an API, shielding the programmer from having to know which coprocessor the hardware team may select. It also enables code to be portable in case you need to change coprocessors. However, even a thin API comes at a cost. You don't use all coprocessors in the same way. If the API successfully captures differences, you can be sure it'll cost you. Even if you're talking only two extra instructions per access, using the resource five times in 200 lines of executed code results in a 5% inefficiency cost.
A similar form of abstraction is using library functions. The more general the code, the more applications the library can service. Such generality comes at the cost of the inefficiency of entering and leaving the function, as well as telling it exactly what you need to do in each use. For established protocols or well-defined algorithms, for example, libraries give you robust code for a part of your system that offers little differentiation. But, unless you optimize the code for your processor of choice instead of merely porting it by altering a few internal I/O functions, the code will be less efficient than it could be. Additionally, consider the cost of using a function. Instead of directly using parameters in their original memory location, you need to pass them or pointers to them to the function and then retrieve results using additional cycles.
Does abstracting resources make sense? Consider that each generation of coprocessor offers new features and commands that change how you use it. Current-generation coprocessors support macros, in which a single command executes several commands. The only way to take advantage of the macro optimization is to rewrite the code using the legacy API. The old API will still work, but using it will cost you dearly. Consider that you're going to need to reoptimize your system anyway, because all of the task-contention dependencies will change. The API is worth using if you want to support a second coprocessor as a backup, but you have to consider the thickness of the API and how much inefficiency it adds.
Intercore communicationAbstracting resources hides many of the relevant characteristics of accessing that resource. For example, you could store a value in local core memory, another core's memory, another processor's memory, or external memory. To the programmer, all accesses appear equal, but they are not. Each access has a different latency.
Working solely at the abstraction level, you solve problems on only a logical level, leaving your tools to solve them at the physical level. In many cases, the tools can manage resources with more detail than can a person, but only if the abstracted representation doesn't cripple optimization efforts. For example, say you need to allocate three cores for the processing of Task A. If the tool clumps Task A onto three of a processor's four cores, you may overload any core resources that this task uses intensively. Additionally, when Task A hands off intermediate results to Task B residing on another processor, it may have to pass through memory. This technique is less efficient than sending results to another core on the same processor using registers. By collocating Task A and Task B together onto three separate processors, you improve task handoff.
Note the added complexity that working on the abstraction level brings to profiling. Say that Task B requires only two cores. The first two instantiations of Task A can use core registers to hand off intermediate results to Task B on the same processor. The third Task A might have to use memory to hand off intermediate results to either of the Task B instantiations. Your profiling tools must be able to treat the three instantiations of Task A as individual tasks. If the tool aggregates or averages results, you lose the ability to see where inefficiency lies—in the handoff and in certain instantiations. Breaking the processing into many tasks running on their own cores makes the allocation problem extremely complex (see sidebar "Simulation across time and space").
Profiling is one area in which abstraction really does help. Because Task A supports two kinds of handoff, being able to abstractly define the handoffs and compile different versions of the task based on its allocation frees you from having to manually manage the process. Even if your tools don't directly support such abstraction, and, depending on where a task resides, you can indirectly use macros to create alternative code and compile several versions of the code as necessary.
Changing the physical allocation of tasks between cores and processors creates different profiles (see sidebar "Tackling allocation"). If the design tools, not the programmer, handle intercore communication, then the tools hide from the programmer the true latency of communication and the ability to optimize that communication. Thus, abstracted tools need their own tools—in this case, tools that enable them to optimize the physical configuration. In other words, the compiler can no longer look at code as a static list of instructions; it must view the code as dynamic instructions with a history and performance profile. The compiler must be a companion tool with the debugger, simulator, and so on. You no longer use tools at only a single point in time; you need to be able to use all of them together.
Profiling intercore communication is complex. Your ability to monitor messages between internal cores depends on how much visibility the processor offers, as well as how intrusive such monitoring is. Measuring communication efficiency between processors presents a different problem. Four processors in a mesh communicate over six channels, requiring several logic analyzers to simultaneously look at all traffic. You might consider using an FPGA inline during development to analyze traffic.
Keeping common values local not only reduces latency, but also conserves bus bandwidth. Unless the simulator can profile variable use for the compiler to optimally map memory, you'll be stuck trying to define an efficient map by hand. Consider two processors, P1 and P4, connected through processors P2 and P3. If tasks running on P1 and P4 exchange a lot of data, you can reduce latency by physically connecting P1 to P4. Although profiling tools may help you uncover information that suggests optimizing allocations in this way, most do not assist in the optimization. The compiler does not take into account the physical relationship between cores, and quickly testing new maps (optimizing using an abstract map of task dependencies to eliminate "obviously" inefficient combinations rather than completely simulating the system for each combination) is not an option.
You need to be able to redefine, repartition, and reallocate the system as you simulate, debug, profile, and learn new information. Current tools resist repartitioning by making it difficult to carry concrete optimizations back to abstracted models. Optimized code pierces the abstraction bubble, and the more concrete details you add, the more difficult it is to maintain the abstraction. At some point, your investment in the hybrid model forces you to commit to it.
Too much and not enoughAs you increase the number of processors or cores, each processing unit becomes more of a black box. Collecting aggregate results is relatively straight-forward, but being able to examine these results at the individual data-point level can open the door to more efficiency. For example, one data type may stall more readily the system than another. Knowing what kind of data causes these stalls gives you better insight than simply knowing where the stalls occur.
Note that, if you base task execution on a scheduling algorithm, a task contends with all other tasks in myriad configurations, so the aggregate does give you a useful measure. However, it is difficult to optimize over a wide range of configurations. Consider, then, limiting the tasks a cluster of cores can handle. You can reduce the number of combinations and begin to identify which configurations present the most inefficiencies. You can eliminate these configurations by differently limiting the tasks. The configurations you leave will occur more often, so optimizations you make for a task combination will more often yield benefits.
Efficient simulation is a trade-off between accuracy, observability, and simulation speed. As you increase one component, you reduce the others. Given the processing speeds of devices in multiprocessor systems, it is generally unfeasible to collect overall system-trace data while hunting down a bug or creating a profile, because there is more data available than you can collect. Thus, even though you can more quickly exercise a system with hardware, you lose the ability to see the details of what's happening.
For single-processor systems, you can set a breakpoint and take a snapshot of the system. You might even be able to unobtrusively perform this task, so that it has no impact on system execution. If the processor is running at high speeds, unless you stop it, you can collect only so much data before more processing takes place and the snapshot changes.
In multiprocessor architectures, you may be unable to stop all processors on the same cycle. In this kind of breakpoint, called chaining, each processor passes the breakpoint on to the next processor. Thus, a processor stopped late in the chain may alter relevant data. Additionally, you may be unable to restart execution from the breakpoint and so must reset the system. You also face the issue of uploading tremendous amounts of data from so many processors. The ability to limit what you upload may save you significant time over the course of a day's debugging.
Some processor vendors targeting multiple-processor applications offer design tools beyond compiler and simulator. For example, Intel's Architectural Development Tool helps you budget and estimate memory bandwidth, pipeline performance, core efficiency, and overhead before you begin modeling and simulation. Most processors form the core of their own simulation environment. If you are using different processors in a system, such as control and data (microprocessor unit plus DSP or NPU plus coprocessor), determine how much you'll need to massage the models to simulate processors together.
Several nonprocessor companies also offer multiprocessor-design tools. MLDesigner, from MLDesign, uses detailed or abstract models for system-performance modeling. Matlab and Simulink from The Mathworks are common platforms for DSP applications. And Axys offers MaxSim, a tool set for modeling and verifying multicore systems on chips.
You'll soon be able to leave your compiler instead of just your simulator on all weekend. Tools arriving next year promise to iteratively recompile code based on profiles run on previous compilations, thus testing out multiple configurations. For example, the compiler could determine the critical code path, identify the most common variables and registers, and place dependent values contiguously in memory so that you can retrieve them with one memory look-up rather than two.
Note that, no matter how comprehensive your profiling tools are, the results are only as accurate and relevant as the completeness of your code.
The true cost of abstractionWhat is the true cost of abstraction? To determine it, you have to consider all of the levels of abstraction of which you might take advantage. Again, abstraction is not bad. Ease of design is worth introducing some inefficiency into your design. But if an abstraction causes you to puncture the sphere of ease of design, you might question whether it's costing you more than you'll gain.
Understand the trade-off that each abstraction entails. For example, a design environment may abstract resources that could physically be software or a hardware-accelerator engine, depending on your application. Thus, the final code could be a set of instructions, such as a macro; a single instruction; or the setting of a few parameters, such as a configurable resource.
Development tools may appear to handle the physical allocation of these resources, but they often do not. You might still need to arbitrarily define thread boundaries, how elements will communicate, how to share resources, and so on. In others words, you still have to manage all of the hard stuff. The development environment gives you a limited ability to profile how you define these characteristics, but it does not directly help you determine the best way to define them. You pay the full abstraction penalty to use the tool without gaining the full benefits possible from such an abstraction.
As the tools evolve, they bring new challenges to accurately determining the real profile of a design (see sidebar "More on tools"). However, each generation of tools more tightly bounds the limits of inefficiency. One day, you'll reach the point when you won't care about inefficiencies, because you'll be willing to accept them at their worst. Who knows? One day, the ultimate abstraction tool might emerge, which would allow marketing executives to design products directly in PowerPoint, bypassing the engineers altogether.
Most vendors admit, once you push them to the wall, that programming in C costs you a certain percentage of efficiency (see sidebar "What's the metric for 'easy to program'?"). But programming in C is only one of the myriad abstractions that this article mentions (see sidebar "Other considerations"). If the cumulative cost of abstraction were a mere 10%, not choosing abstraction would be insane. However, it is unclear just how costly abstraction is, and the tools to accurately measure the cost are simply not available today.
The bottom line is that you can trade performance efficiency for the potential ease of design that comes through abstraction. You'll have your product up and running faster, but you'll also have to overprovide your design to accommodate all of the inefficiencies abstraction brings.
Ask yourself how much time to market costs you. And consider that 50% inefficiency in a six-processor system means you could optimize down to five or four processors if you're willing to make the effort. The question is not whether the abstraction trade-off is worth making. You need to look at the real cost of abstraction on a case-by-case basis.
| For more information... | ||
| For more information on products such as those discussed in this article, contact any of the following manufacturers directly, and please let them know you read about their products in EDN. | ||
| ACE www.ace.nl | Analog Devices www.adi.com | ARC www.arc.com |
| Axys Design Automation www.axysdesign.com | Consystant www.consystant.com | Cypress www.cypress.com |
| Express Logic www.expresslogic.com | Green Hills Software www.ghs.com | IDT www.idt.com |
| Intel www.intel.com | LSI Logic www.lsilogic.com | Mathworks www.mathworks.com |
| MLDesign Technologies www.mldesigner.com/ | Teja Technologies www.teja.com | Tensilica www.tensilica.com |
| Texas Instruments www.ti.com | ||
| Author Information |
You can reach Contributing Technical Editor Nicholas Cravotta at e-mail editor@nicholascravotta.com. |
|
|















You can reach Contributing Technical Editor Nicholas Cravotta at e-mail 