What’s so cool about Boost.MPI?

At some point after a client brings us a project proposal, we usually have a conversation about separating the domain-specific part of the job from the infrastructure bits, so we can release the latter part as open source. While we at BoostPro enjoy getting our work out there as FOSS, it can often be a huge win for the customer, reducing long-term maintenance cost and improving code quality, not to mention being good PR. Well, a few years back I got a call from Daniel Egloff, a statistician doing high-performance financial simulations in a Swiss bank—the results of which were of crucial importance to the bank’s future.

Daniel was something of a renegade. He had to be—the official policy at his bank was that everyone was to use Windows and program in Java, so putting together a Debian cluster and assembling a team that could drive it with C++ required swimming against the corporate current, to say the least. But Daniel was also a visionary. It isn’t every day that a new client is so tuned-in to the value and economy of open source libraries that they not only want to release everything we do that way, but they will pay us to shepherd the library through the Boost review process. As a result of our work with Daniel, BoostPro produced three new Boost libraries: Boost.Accumulators, Boost.Time_series, and Boost.MPI. What a year!

Today I want to write a little about Boost.MPI, because it is some very cool technology and because lots of people don’t seem to understand what’s so cool about it.

What’s MPI?

MPI (which stands for Message Passing Interface) is a C-like library API for synchronization and communication between parallel processes, usually running on separate networked computers, sometimes with heterogeneous architectures. If you need to compute something that requires more juice than you can squeeze out of the most powerful single machine, MPI can be a great technology around which to build your application. The library handles all kinds of basic issues, abstracting away details like

  • the OS
  • the communication substrate (ethernet, infiniband, shared memory…)
  • the existence of multiple network interfaces
  • machine architecture (including endian-ness)

Without having to deal with any of these low-level concerns, the MPI user can write straightforward, portable code that orchestrates large-scale parallel computations. There’s lots more to MPI, but those are the basics.

So What’s Boost.MPI?

Boost.MPI is a full-on C++ library wrapper over the MPI API that—quoting from its web page—“better supports modern C++ development styles… and the use of modern C++ library techniques to maintain maximal efficiency.” Which sounds nice-to-have-but-not-too-exciting at first blush. To really appreciate what’s so cool about it, you have to care about making the most of your cluster hardware, and you’ll need to delve into a few of the details about how that hardware works.

Into the Details

Network cards have a fixed-size buffer; sending anything to another process involves getting it into that buffer.1 If the buffer fills up, packets are waiting to go out, and you can’t send anything further until that happens.

One mission of MPI (the non-boost variety) is to provide a portable high-level API for sending out these messages. Therefore, MPI deals with the low-level stuff and has/needs direct access to the network buffers, but code written on top of MPI does not. Actually, some clusters even have multiple network connections, such as connections with tree structure in addition to a channel to six nearest neighbors on a cube, and MPI picks the most appropriate avenue for your communication pattern. So not only don’t you need to, but you actually never want to access the network buffer directly.

MPI DataTypes and Type Maps

Now let’s assume a heterogeneous cluster, where size, alignment, and endian-ness may not match from machine to machine. In that case, you can’t just “blit the bytes;” somebody needs to figure out how to encode data for transmission and decode it upon receipt so that it has the same meaning on both ends. Let’s further assume a system that transmits in little-endian, so before sending from a big-endian machine one needs to swap bytes (all the other possible schemes have the same consequences—assuming this one just allows us to work with specific examples).

The code doing the encoding and decoding has to know about the data structure, rather than operating on the data as raw bytes only. For example, if the data structure is a sequence of 32-bit integers, you need to reverse each group of 4 consecutive bytes. If it’s a sequence of 16-bit integers, you need to reverse pairs of bytes, and if it’s a sequence of chars, you don’t need to do anything.

Now, suppose MPI only knew about byte sequences. MPI messages would be a lot like files, and it would be our job, as users of MPI, to do the encoding and decoding. How could we approach that? We’d probably use something like Boost.Serialization to marshall our data. We’d serialize our source data into a flat, portable representation that could be passed to MPI, which would then copy the bytes into the network buffer as needed. That’s two copies of every byte. One copy to serialize, and another copy into the network buffer.

Fortunately, MPI knows about way more than just bytes. MPI has datatypes, which aren’t actually types at all, but constants that identify C/C++ primitive types such as int and long double to the library. With a base address, a length, and the appropriate datatype, we can tell the library to transmit an array across the network:

// Use of raw, unimproved MPI interface.
err = MPI_Send(&vec[0], vec.size(), MPI_DOUBLE, dest, tag, comm);

Not all data is organized into contiguous arrays of primitives, though; so MPI also provides type maps, which allow us to define the sequence and offsets of fields in a struct:

struct particle
{
    char spin;
    char color;
    double position[3];
    double speed[3];
};
 
particle particles[1000]; // a bunch of particles to send
 
namespace particle_mpi // define the MPI datatype for particles
{
  // Constituent member types
  MPI_Datatype types[2] = {MPI_CHAR, MPI_DOUBLE}; 
 
  // repetition counts
  int reps[2] = {2, 6};
 
  // Prepare offsets
  MPI_AInt offsets[2];
  MPI_Address( &particles[0].spin, offsets );
  MPI_Address( &particles[0].position, offsets + 1 );
  for (int i = 1; i >= 0; --i) 
      offsets[i] -= offsets[0];
 
  // Finally, create the new datatype
  MPI_Datatype datatype;
  MPI_Type_struct( 2, reps, offsets, types, &datatype );
  MPI_Type_commit( &datatype );
}

Presto: Efficiency!

MPI type maps are great for efficiency in three ways:

  1. First, if there’s padding in your data structure, the type map captures that fact, and padding bytes aren’t sent.

  2. Once we tell MPI about the “shape” of the data structure, it can put a serialized representation directly into the network buffer. That’s just one copy of every byte.

  3. Using type maps saves memory. If you’re in a resource-constrained environment—and cluster nodes often do run very near their memory capacity—you might not have space to spare for an additional serialized representation of the data you’re sending. In fact, that was the case with Daniel’s simulation.

These efficiency gains are multiplied when you have to send the same data structure (with new values) multiple times. In the worst case, MPI type maps are on the order of the same size as the data structure they describe, so creating one might be roughly the same cost as a copy. So using the same “shape” over and over again can be important. Fortunately, that’s a natural pattern for many large-scale parallel computations.

So What’s The Catch?

I don’t know whether you noticed from our example, but an MPI type map is a pain to create! It’s even more painful to maintain as data structures evolve. As a result, in a typical application, type maps get created for a very few structs, and more complex structures are typically sent with a series of separate MPI calls (some of which use type maps) or with the extra serialization step. Fortunately, Michael Gauckler (Daniel’s protégé) had a brilliant idea that we implemented in Boost.MPI.

Skeleton and Content

The genius of Michael’s idea is in three realizations:

  1. You can represent all the values in an arbitrarily complex non-contiguous data structure with a single type map. It’s like a struct that extends from the data’s minimum address to its maximum, probably with lots of padding.

  2. When you serialize a complex data structure with Boost.Serialization, the Archive sees the type and address of every datum.

  3. You can treat addresses as byte offsets (e.g. from address zero), and build a type map that way.

So Boost.MPI has a Boost.Serialization Archive type that creates an MPI type map by treating addresses as offsets and translating fundamental C++ types into MPI datatypes. This step involves no actual data copying; it’s just sending the “bones” of the data structure with no real “meat.” Then it uses the type map and asks the underlying MPI library to send the “giant struct beginning at address zero,” thus avoiding an expensive intermediate serialization phase before MPI actually gets its paws on your data. And as long as you don’t change the layout of your data structure, you can send new “meat” without ever repeating the “bones” step. This approach became known as the “skeleton and content” technique. Michael and Daniel wrote a paper about it, which you can read here.

Conclusion

When were able to make the efficient use of MPI as simple as making the types in question serializable, that was a huge win. There simply wasn’t enough memory in the systems to do the most important communications with an intermediate serialization step, and the programmers didn’t have time to manually maintain MPI type maps, so this library was a crucial part of the project’s technical success.

Acknowledgements

Thanks very much to Matthias Troyer for working on the Boost.MPI project with us and for checking (and correcting) my facts.


  1. There are typically multiple buffering levels, some of which are in main memory (usually one for each target node, so you if your communication with node A is blocked you can still send to node B), but the basic facts remain the same: there is a limited amount of buffering available. Actually, the same applies to shared memory, in case that’s how your processes communicate. 

  • http://www.deanberris.com/ Dean Michael Berris

    I’ve used Boost.MPI in an analytics solution before and all I can say is that it makes parallel computing on a Beowulf cluster more joyful. It allows you to not worry about the MPI parts and just model your solution in idiomatic modern C++.

  • Hicham

    Having started trying using Boost.MPI over the past couple of days, it would be totally great if some sort of Best Practice article was posted somewhere. I have been especially struggling with broadcasting polymorphic objects.

  • Tim

    MPI_AInt offsets[2]; //… for (int i = 3; i != 0; –i) offsets[i] -= offsets[0];

    Isn’t this a buffer overrun …

  • http://daveabrahams.com Dave Abrahams

    Quite so! Thanks for catching that; I’ve fixed it.

  • Pingback: Boost: Making C++ a little nicer

  • Rrossi

    Well… i use boost mpi and i am very happy with it. Unfortunately i would like to report a bug and i don’t know where to. Even the simplest boost mpi program fails when using  openmpi >1.2  as it comes with common distributions (at least on ubuntu and redhat).

    by googling you can find it is a known issue with openmpi…which will require maintenance on the boos side

    the error is:

    rrossi@rrossi-notebook:~/Libraries/boost_1_47_0/lib$ python Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2] on linux2 Type “help”, “copyright”, “credits” or “license” for more information.

    import mpi python: symbol lookup error: /usr/lib/openmpi/lib/openmpi/mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

    please feel free to contact me for details, i will also try to write you an email about this rrossi@cimne.upc.edu