Using MCAPI to lighten an MPI load
Use MCAPI to less expensively deliver MPI performance in a system with both limited resources and limited requirements.
Sven Brehmer, Polycore Software Inc, Markus Levy, The Multicore Association, and Bryon Moyer, Independent Consultant -- EDN, November 17, 2011
HPC (high-performance computing) relies on large numbers of computers to perform tough jobs. One computer often acts as a master, parceling out data to processes that may be located anywhere in the world. The MPI (message-passing interface) provides a way to move the data from one place to the next. Normally, MPI would be implemented once in each server to handle the messaging traffic. With servers using many cores, however, it can be expensive to use a complete MPI implementation because MPI would have to run on each core in the computer in an asymmetric-multiprocessing configuration. On the other hand, the MCAPI (Multicore Communications API), a protocol designed with embedded systems in mind, more efficiently moves MPI messages around the computer.
Heavyweight champion
The well-established MPI HPC protocol is robust enough to handle the problems that might be encountered in a dynamic network of computers. For example, such networks are rarely static. MPI must be able to handle a variable number of nodes in the network—due to updates, maintenance, the purchase of additional machines, or even a user’s inadvertent unplugging of a physical network cable. Even with a constant number of servers, those servers run processes that may start or stop at any time. MPI thus includes the ability to discover who is on the network.
At the programming level, MPI reflects nothing about computers or cores. It knows only about processes. Processes start at initialization, and the discovery mechanism builds a picture of how the processes are arranged. MPI allows for flexibility in the creation of a topology. Once everything is up and running, however, a map of processes can be used to exchange data. A given program can exchange messages with one process inside or outside a group or with every process in a group. The program itself has no idea whether it is talking to the computer next to it or to one on another continent. A program doesn’t care whether a computer running a process with which it’s communicating is single-core or multicore, homogeneous or heterogeneous, or symmetric or asymmetric. It knows only that it wants to send a message to a process. The MPI implementation on the computer must ensure that the messages reach the targeted processes.
Due to the architectural homogeneity of symmetric multicore implementations, achieving this goal is simple. An OS instance runs over a group of cores and manages them as a set of identical resources, naturally spreading a process over the cores. A multithreaded process can take advantage of the cores to improve computing performance; nothing else must be done.
However, symmetric multiprocessing starts to bog down as cores are added, because every additional core increases contention for the shared bus and memory. For computers designed to help solve big problems as quickly as possible, it stands to reason that more cores in a box is better, but only if the computer can effectively use them. To avoid the limitations of symmetric multiprocessing, you can instead use asymmetric multiprocessing for systems with multiple cores.
With asymmetric multiprocessing, each core or subgroup of cores runs its own independent OS instance, and some might even have no OS at all, running on “bare metal.” Because a process cannot span more than one OS instance, each OS instance and, potentially, each core runs its own processes. So, whereas a symmetric-multiprocessing configuration can still look like one process, asymmetric multiprocessing looks like many processes, even if they are multiple instances of the same process.
In this configuration, each OS must run its own instance of MPI to ensure that the network represents its processes and feeds it any messages coming its way. The environment connecting the cores within a closed box, or even on one chip, is smaller than the network within which MPI must operate. It also typically has fewer resources than a network does. MPI thus has more features than communication within a server requires.
Different roles
Although they may look similar in spirit, MPI and MCAPI play different roles. MPI comes from the HPC world; MCAPI, from the embedded-system world. They thus have different characteristics, including topology, coupling and locality, resources, and timing, which are complementary to each other (Table 1).
The network over which MPI runs may change configuration at any moment either physically or by starting and stopping processes. In contrast, an embedded system is static. For the most part, it is physically impossible to disconnect the components in an embedded system. Even when you use something like a PCI card to add computing power, it’s not a plug-and-play configuration: The PCI slot makes it possible to add a board, but, once added, it’s generally expected that the board will remain there. Thus, MCAPI doesn’t need the overhead of dealing with topology changes.
Coupling refers to the strength with which two systems interconnect. Networks are loosely coupled, so breaking the network shouldn’t affect a computer’s ability to function. The exception is that a computer can’t get anything it needs from across the network, and you must hope that the programmer created a graceful way to handle this situation. At the other end of the scale, an embedded system is typically restricted to one box. If the system has multiple cores, the cores of the processor connect tightly because they share a hard-wired bus, perhaps some memory, and the same piece of silicon.
Coupling closely ties to the concept of “locality”: A network may connect you to a computer halfway around the world; two cores are typically separated by microns. Whereas MPI must handle loosely coupled nonlocal nodes, MCAPI can assume tight coupling and close proximity.
The resources available to handle message passing also scale as you go from the network level down to the processor. It’s a straightforward matter to add storage to a network; it’s impossible to add on-chip RAM to a processor. The storage you can add to a network is huge; the fixed storage on a processor is limited. Thus, the resources available for managing MPI tend to be greater; MCAPI must operate on a budget.
Response time is also a consideration. Moving a message around the world takes time, and that time is not deterministic. Send the same message multiple times, and it may take different routes that have different delays. By contrast, many embedded systems have stringent real-time requirements that must be met. Milliseconds matter. MCAPI must therefore be quick and responsive in a way that MPI can’t be.
A featherweight steps in
Unlike MPI, the MCAPI specification was designed by The Multicore Association to be lightweight so that it can handle interprocess communication in embedded systems, which usually have considerably more limited resources. Although MCAPI works differently from MPI, it still provides a basic, simple means of getting a message from one core to another. You can thus use MCAPI to less expensively deliver MPI performance in a system with both limited resources and limited requirements.
To bring MCAPI into an MPI design, consider a program that uses only a few MPI constructs, just sending and receiving simple messages. The idea is to designate one master core within the server to run a full MPI service plus a translator for all other accelerator cores in the box. The accelerator cores will run MCAPI instead of MPI, meaning that MPI messages will run between the servers but MCAPI messages will run between the cores in the server (Figure 1).
For those program instances running on the accelerator cores, you then replace the MPI calls with the equivalent MCAPI calls. For that reason, this approach works only for simpler uses of MPI; many MPI constructs have no MCAPI equivalents. A translator converts any messages moving between the MPI and the MCAPI domains (Figure 2). The cost of this arrangement lies in the fact that the program must be edited and recompiled to use MCAPI instead of MPI for the accelerator cores. This approach also complicates program maintenance due to the existence of two versions of the program: one using MPI and one using MCAPI.
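One piece of such a translator is a lookup from MPI process ranks to MCAPI endpoints, which MCAPI broadly identifies by node and port. The following Python sketch is only an illustration of that idea. The `Endpoint` type, the `LOCAL_ENDPOINTS` table, and the `translate` function are hypothetical names, not part of either specification, and the layout (ranks 2 and 3 sharing one accelerator core) is an assumption chosen to match the message scenarios discussed in this article.

```python
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    """Hypothetical stand-in for an MCAPI endpoint address."""
    node: int  # the core (MCAPI node) that hosts the process
    port: int  # the port on that node where the process receives


# Assumed layout: ranks 2 and 3 both run on accelerator core 1.
LOCAL_ENDPOINTS = {
    2: Endpoint(node=1, port=0),
    3: Endpoint(node=1, port=1),
}


def translate(dest_rank: int) -> Optional[Endpoint]:
    """Map an MPI rank to its MCAPI-side address.

    Returns None when the rank is not on an accelerator core in this
    box, meaning the message stays in the MPI domain.
    """
    return LOCAL_ENDPOINTS.get(dest_rank)
```

A master core could consult such a table for each incoming message: a hit means the message crosses into the MCAPI domain, and a miss means ordinary MPI delivery.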

An alternative that leaves the program unmodified is to wrap each MPI message inside an MCAPI message rather than replace the MPI calls. The trick is that this wrapping service must replicate the MPI API, even though it’s simply stubbing out the actual MPI functions. The wrapper then drops onto the accelerator cores as a library that masquerades as the MPI library so that the processes running in the accelerator cores feel as if the system is meeting their MPI needs. In the master core, the wrapper must additionally represent the processes running on the accelerator cores so that the MPI messages route properly when going to the accelerator cores.
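To make the stub idea concrete, here is a minimal Python sketch of a shim that offers MPI-like `send` and `recv` calls but forwards every message in an envelope to the master core. Everything in it is an assumption for illustration: the `ShimComm` class, the `wrap` and `unwrap` helpers, and the 16-byte header layout are hypothetical, and a standard-library `queue.Queue` stands in for the MCAPI channel. A real wrapper would reproduce the actual MPI function signatures and call a real MCAPI implementation.

```python
import queue
import struct

# Assumed envelope header: source rank, destination rank, tag, payload length.
_HEADER = struct.Struct("!iiiI")


def wrap(src: int, dest: int, tag: int, payload: bytes) -> bytes:
    """Prefix the payload with the routing fields of the MPI message."""
    return _HEADER.pack(src, dest, tag, len(payload)) + payload


def unwrap(frame: bytes):
    """Recover (src, dest, tag, payload) from a wrapped frame."""
    src, dest, tag, length = _HEADER.unpack_from(frame)
    body = frame[_HEADER.size:_HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return src, dest, tag, body


class ShimComm:
    """Hypothetical stub library for a process on an accelerator core."""

    def __init__(self, rank: int, to_master: queue.Queue, from_master: queue.Queue):
        self.rank = rank
        self._to_master = to_master      # stand-in for the MCAPI link to the master
        self._from_master = from_master  # stand-in for the MCAPI link from the master

    def send(self, payload: bytes, dest: int, tag: int = 0) -> None:
        # Every outgoing message goes to the master core, which routes it.
        self._to_master.put(wrap(self.rank, dest, tag, payload))

    def recv(self):
        # Block until the master forwards a frame, then strip the envelope.
        src, _dest, tag, body = unwrap(self._from_master.get())
        return src, tag, body
```

The calling process sees only `send` and `recv`; whether the bytes travel over MPI or over the stand-in channel is hidden behind the shim, which is the point of the masquerade.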
A message can follow several paths. For example, Process N sends a message to Process 0. In this case, Process 0 is running on the master core with full MPI service, so this scenario can be handled as a standard MPI message. In another case, Process N sends a message to Process 3. Here, Process 3 is running on an accelerator core, so the master, which recognizes that Process 3 is on another core, receives the message. The master wraps the MPI message in an MCAPI message and sends it to the other core. The accelerator core accepts the MCAPI message and unwraps the MPI message, which is now available to Process 3.
In a third scenario, Process 1 sends a message to Process 2. In this case, both processes are in the same server but on different cores. The master core wraps the MPI message in an MCAPI message, which the accelerator core unwraps for consumption by Process 2. Another case has Process 2 sending a message to Process 3. Here, both processes are on the same core, but the core doesn’t have MPI running. So the message goes over MCAPI to the full MPI implementation on the master core, which routes the message right back. This approach may sound inefficient, but it’s faster than having the messages cross the Internet between servers.
In yet another alternative, Process 2 sends a message to Process N. The accelerator core wraps the message as MCAPI and sends it to the master core. Once the master core unwraps the message, the MPI service can route the message to the other server. Again, only one core needs to run MPI; the other cores run MCAPI. Although MCAPI assists in moving MPI messages, the processes exchanging MPI messages have no idea that anything but MPI is running. Using any of these approaches, MCAPI lets you effectively use the extra cores in modern servers in an HPC network.
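The five paths above reduce to two questions: how does the message reach the master core, and how does it leave? The Python sketch below encodes that reasoning under stated assumptions. The `route` function and the `LOCAL_CORE_OF` table are hypothetical names; ranks 0 and 1 are assumed to run on the master core and ranks 2 and 3 on one accelerator core, and any rank missing from the table is treated as living on another server.

```python
MASTER_CORE = 0

# Assumed placement of local processes: rank -> core number.
LOCAL_CORE_OF = {0: 0, 1: 0, 2: 1, 3: 1}


def route(src: int, dest: int) -> list:
    """Return the transports a message uses, in order, via the master core."""
    src_core = LOCAL_CORE_OF.get(src)    # None means another server
    dest_core = LOCAL_CORE_OF.get(dest)  # None means another server
    if src_core is None and dest_core is None:
        raise ValueError("neither process runs in this server")

    hops = []
    # Leg 1: bring the message to the MPI service on the master core.
    if src_core is None:
        hops.append("mpi")      # arrives over the network
    elif src_core != MASTER_CORE:
        hops.append("mcapi")    # wrapped by the accelerator core
    # Leg 2: forward the message from the master core to its target.
    if dest_core is None:
        hops.append("mpi")      # leaves over the network
    elif dest_core != MASTER_CORE:
        hops.append("mcapi")    # wrapped by the master core
    return hops
```

With a remote rank written as 99, `route(99, 3)` gives one MPI hop followed by one MCAPI hop, and `route(2, 3)` gives two MCAPI hops, reflecting the round trip through the master core even though both processes share a core.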
Acknowledgment
This article originally appeared on EDN’s sister site, MCU Designline.
Authors’ biographies
Sven Brehmer is president of Polycore Software Inc.
Markus Levy is president of The Multicore Association.
Bryon Moyer is an independent consultant.