Parallel processing in C/C++
1 Overview
Two long-standing tools for parallelizing C, C++, and Fortran code are OpenMP,
for writing threaded code to run in parallel on one machine, and MPI,
for writing message-passing code to run in parallel across (usually) multiple nodes.
3 MPI
3.1 MPI Overview
There are multiple MPI implementations, of which Open MPI and MPICH are the most common. We’ll use Open MPI.
In MPI programming, the same code runs on all the machines. This is called SPMD (single program, multiple data). As we saw a bit with the pbdR code, one invokes the same code (same program) multiple times, but the behavior of the code can differ based on querying the rank (ID) of the process. Since MPI operates in a distributed fashion, any transfer of information between processes must be done explicitly via send and receive calls (e.g., MPI_Send, MPI_Recv, MPI_Isend, and MPI_Irecv). (The “MPI_” prefix is for C code; the C++ bindings just have Send, Recv, etc.)
The latter two of these functions (MPI_Isend and MPI_Irecv) are so-called non-blocking calls. One important concept to understand is the difference between blocking and non-blocking calls. Blocking calls wait until the call finishes, while non-blocking calls return and allow the code to continue. Non-blocking calls can be more efficient, but can lead to problems with synchronization between processes.
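As a sketch of the difference (an illustrative example I’ve written, not from the course materials; it assumes at least two processes), process 0 posts a non-blocking send, is free to do other work, and then waits for completion, while process 1 uses a blocking receive:

```c
/* Sketch of non-blocking communication (illustrative, not from the course
   materials). Assumes at least two processes, e.g., "mpirun -np 2". */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, nprocs;
    double buf = 0.0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs >= 2) {
        if (rank == 0) {
            buf = 3.14;
            /* Returns immediately; the send proceeds in the background. */
            MPI_Isend(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... other work can overlap with the communication here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE); /* safe to reuse buf after this */
        } else if (rank == 1) {
            /* Blocks until the message from process 0 arrives. */
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 received %f\n", buf);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Note that without the MPI_Wait, process 0 could overwrite buf while the send is still in progress; this is exactly the kind of synchronization problem that non-blocking calls can introduce.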
In addition to the send and receive calls that transfer data to and from specific processes, there are collective calls that distribute portions of data across all processes (MPI_Scatter), gather data back (MPI_Gather), and perform reduction operations (MPI_Reduce).
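As a sketch of how the collective calls fit together (a minimal example I’ve written for illustration, not from the course materials), here the root scatters a vector, each process sums its own chunk, and a reduction combines the partial sums:

```c
/* Sketch: scatter a vector across processes, sum each chunk locally, and
   reduce the partial sums back to process 0. Run with, e.g., "mpirun -np 4". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int per_proc = 3;            /* elements per process */
    double *full = NULL;
    if (rank == 0) {                   /* only the root holds the full vector */
        full = malloc(per_proc * nprocs * sizeof(double));
        for (int i = 0; i < per_proc * nprocs; i++)
            full[i] = i + 1.0;
    }

    double local[3];
    /* Each process receives its own contiguous chunk of `full`. */
    MPI_Scatter(full, per_proc, MPI_DOUBLE, local, per_proc, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    double local_sum = 0.0, total = 0.0;
    for (int i = 0; i < per_proc; i++)
        local_sum += local[i];

    /* Combine the partial sums on process 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Total = %f\n", total); /* sum of 1..(3*nprocs) */
        free(full);
    }
    MPI_Finalize();
    return 0;
}
```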
Debugging MPI code can be tricky because communication can hang, error messages from the workers may not be seen or readily accessible, and it can be difficult to assess the state of the worker processes.
3.2 Basic syntax for MPI in C
Here’s a basic hello world example. The code is also in mpiHello.c.
// see mpiHello.c
#include <stdio.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char* argv[]) {
int myrank, nprocs, namelen;
char process_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Get_processor_name(process_name, &namelen);
printf("Hello from process %d of %d on %s\n",
myrank, nprocs, process_name);
MPI_Finalize();
return 0;
}
There are C (mpicc) and C++ (mpic++) compilers for MPI programs (mpicxx and mpiCC are synonyms). I’ll use the MPI C++ compiler even though the code is all plain C code.
Then we’ll run the executable via mpirun. Here the code will just run on my single machine, called arwen. See Section 3.3 for details on how to run on multiple machines.
mpicxx mpiHello.c -o mpiHello
mpirun -np 4 mpiHello
Here’s the output we would expect:
## Hello from process 0 of 4 on arwen
## Hello from process 1 of 4 on arwen
## Hello from process 2 of 4 on arwen
## Hello from process 3 of 4 on arwen
To write real MPI code, you’ll need to learn more of the MPI syntax. See quad_mpi.c and quad_mpi.cpp, which are example C and C++ programs (for approximating an integral via quadrature) that show some of the basic MPI functions. Compilation and running are as above:
mpicxx quad_mpi.cpp -o quad_mpi
mpirun -machinefile .hosts -np 4 quad_mpi
And here’s the output we would expect:
23 November 2021 03:28:25 PM
QUAD_MPI
C++/MPI version
Estimate an integral of f(x) from A to B.
f(x) = 50 / (pi * ( 2500 * x * x + 1 ) )
A = 0
B = 10
N = 999999999
EXACT = 0.4993633810764567
Use MPI to divide the computation among 4 total processes,
of which one is the main process and does not do core computations.
Process 1 contributed MY_TOTAL = 0.49809
Process 2 contributed MY_TOTAL = 0.00095491
Process 3 contributed MY_TOTAL = 0.000318308
Estimate = 0.4993634591634721
Error = 7.808701535383378e-08
Time = 10.03146505355835
QUAD_MPI:
Normal end of execution.
23 November 2021 03:28:36 PM
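The overall structure of such a program can be sketched as follows. This is a simplified illustration I’ve written, not the actual quad_mpi.cpp; in particular, here every process (including the main one) computes a midpoint-rule sum over its own share of the subintervals, and MPI_Reduce combines the partial sums:

```c
/* Sketch: approximate the integral of f over [A, B] by splitting the
   midpoint-rule sum across processes and combining with MPI_Reduce.
   (Simplified illustration, not the actual quad_mpi.cpp.) */
#include <mpi.h>
#include <stdio.h>
#include <math.h>

static double f(double x) {
    return 50.0 / (M_PI * (2500.0 * x * x + 1.0));
}

int main(int argc, char* argv[]) {
    const double a = 0.0, b = 10.0;
    const long n = 1000000;    /* subintervals (fewer than the real example) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const double h = (b - a) / n;
    double my_total = 0.0;
    /* Each process handles every nprocs-th subinterval (a cyclic split). */
    for (long i = rank; i < n; i += nprocs)
        my_total += h * f(a + (i + 0.5) * h);

    double total = 0.0;
    MPI_Reduce(&my_total, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Estimate = %.15f\n", total);

    MPI_Finalize();
    return 0;
}
```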
3.3 Starting MPI-based jobs
MPI-based executables require that you start your process(es) in a special way via the mpirun command. Note that mpirun, mpiexec, and orterun are synonyms under Open MPI.
The basic requirements for starting such a job are that you specify the number of processes you want to run and that you indicate what machines those processes should run on. Those machines should be networked together such that MPI can ssh to the various machines without any password required.
3.3.1 Running an MPI job with machines specified manually
There are two ways to tell mpirun the machines on which to run the worker processes.
First, we can pass the machine names directly, replicating the name if we want multiple processes on a single machine. In the example here, these are machines accessible to me, and you would need to replace those names with the names of machines you have access to. You’ll need to set up SSH keys so that you can access the machines without a password.
mpirun --host gandalf,radagast,arwen,arwen -np 4 hostname
Alternatively, we can create a file with the relevant information.
echo 'gandalf slots=1' > .hosts
echo 'radagast slots=1' >> .hosts
echo 'arwen slots=2' >> .hosts
mpirun -machinefile .hosts -np 4 hostname
One can also just duplicate a given machine name as many times as desired, rather than using slots.
3.3.2 Running an MPI job within a Slurm job
If you are running your code as part of a job submitted to Slurm, you generally won’t need to pass the machinefile or np arguments, as MPI will get that information from Slurm. So you can simply run your executable, in this case first checking which machines mpirun is using:
mpirun hostname
mpirun quad_mpi
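For example, a minimal Slurm submission script might look like the following. This is a sketch; the job name, resource requests, and time limit are placeholders you would adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=quad_mpi
#SBATCH --ntasks=4          # number of MPI processes; mpirun picks this up
#SBATCH --nodes=2           # spread the tasks over two nodes
#SBATCH --time=00:10:00

# No -np or -machinefile needed: mpirun queries Slurm for the allocation.
mpirun quad_mpi
```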
3.3.3 Additional details
To limit the number of threads for each process, we can tell mpirun to export the value of OMP_NUM_THREADS to the processes. E.g., calling a C program, quad_mpi:
export OMP_NUM_THREADS=2
mpirun -machinefile .hosts -np 4 -x OMP_NUM_THREADS quad_mpi
There are additional details involved in carefully controlling how processes are allocated to nodes, but the default arguments for mpirun should do a reasonable job in many situations.
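For instance, Open MPI’s mpirun provides flags such as --map-by and --bind-to for finer control over placement (shown here with the same .hosts file as above):

```shell
# Place processes round-robin across nodes rather than filling one node first:
mpirun -machinefile .hosts -np 4 --map-by node hostname
# Bind each process to a single core so it is not migrated between cores:
mpirun -machinefile .hosts -np 4 --bind-to core hostname
```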