Forrest M. Hoffman and William W. Hargrove
Parallel computer systems have been available commercially for many years. Used primarily in federal defense organizations and research laboratories, these costly behemoths were difficult to use and program, often requiring specialized skills and intimate knowledge of the unique architecture of each machine. With so few customers and such specialized equipment, the supercomputer industry floundered until the recent trend of constructing supercomputers from clusters of workstations and commodity PCs.
Authors Forrest Hoffman (standing) and Bill Hargrove sit "inside" the computer they constructed from commodity PCs.
The Beowulf project (http://www.beowulf.org/), begun at NASA's Goddard Space Flight Center, has helped solidify this trend through the use of Linux and other Open Source software running on inexpensive, off-the-shelf PCs (Becker et al. 1995). This work has opened the door for low-cost, high performance cluster computing. In addition, standards and tools have been developed for such distributed memory parallel computer systems making it easier for programmers to build scalable and portable parallel computer applications.
Because of the low costs involved in building a Beowulf-style cluster from commodity PCs, many research organizations, universities, and businesses have done just that in an effort to hold down costs while increasing computational performance for complex problems. Having built our own Beowulf-style cluster (Hoffman and Hargrove 1999), we often hear from students or researchers who have built Beowulf clusters but are not sure where to go from there. We hope this article will provide some direction for those who need help getting started with programming their own parallel computer applications.
High performance parallel computing is accomplished by splitting up large and complex tasks across multiple processors. During World War II, well before the advent of the electronic computer, a similar technique was used for carrying out long calculations associated with the design of the atomic bomb for the Manhattan Project. To significantly reduce the amount of time it took to solve a large mathematical problem, each part of the problem was performed by a different person. Interestingly enough, the people who performed these calculations were called computers. Today electronic computers can work in harmony to solve scientific problems not dreamed of even a decade ago.
While a variety of methods can be used to achieve improved performance on multiple computers, the most common way to organize and orchestrate parallel processing is to write code which automatically decomposes the problem at hand and allows the processors to communicate with each other when necessary while performing their work.
Not every computational problem is amenable to parallel computing. If no sub-tasks can be performed simultaneously or if the system being modeled is highly interdependent (i.e., the problem is "fine grained"), attempts to parallelize such codes may result in increased time-to-solution. For the finest-grained problems computational performance is limited by the speed of the fastest single CPU available on the market.
Luckily, many complex scientific problems can be decomposed either by isolating separate tasks which can be performed independently and simultaneously by multiple processors or more often by splitting up the space (or time) coordinates of the system being modeled so that values for each sub-region (or time interval) can be calculated simultaneously. For example, many image processing applications perform calculations on individual cells or pixels without requiring knowledge of the state of any other pixel in the frame. These kinds of applications are "coarse grained," are generally easy to parallelize, and attain the most benefit from parallel processing. Such applications are often referred to as "embarrassingly parallel."
Most scientific applications, however, fall somewhere in between coarse and fine granularity. These applications usually require some amount of interaction between sub-regions so individual processors must be able to communicate with each other to exchange calculated values, a technique called message passing. For example, values for cells on a map may depend on the values of their nearest neighboring cells. If the map is decomposed into two pieces, each being processed on a separate CPU, the processors must exchange cell values on adjacent edges of the map.
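As a deliberately simplified sketch of this kind of decomposition (the names NROWS, nprocs, and rank are ours; the real values would come from the application and the message passing library), the fragment below splits the rows of a map into contiguous bands, one band per process. A real code would also exchange the edge rows with neighboring processes, as described above.

   /* Sketch: divide NROWS map rows into nearly equal bands, one per process.
      rank is this process's number (0..nprocs-1); nprocs is the process count. */
   int rows_per_proc = NROWS / nprocs;   /* base number of rows per process */
   int extra         = NROWS % nprocs;   /* leftover rows to spread around  */
   int first_row     = rank * rows_per_proc + (rank < extra ? rank : extra);
   int last_row      = first_row + rows_per_proc + (rank < extra ? 1 : 0) - 1;
   int row;

   for (row = first_row; row <= last_row; row++) {
      /* ... compute new values for the cells in this row ... */
   }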
Spatial decompositions must be made with care and attention to the spatial interdependency of the problem. A common pitfall is to turn what was a computational problem on a single processor into a communications problem on multiple processors. This occurs most frequently when a problem is not properly decomposed or when the amount of interaction across space (or time) is so great that the code spends more time communicating than calculating. Striking the right balance between computation and communication for any particular problem on any particular parallel computing platform is more art than science.
While techniques for problem decomposition will not be considered here, it is important that programmers give careful thought to how they split up the work inside their applications. Once a good decomposition strategy is found, it is time to begin programming.
Many different techniques can be used for message passing on a Beowulf cluster or other parallel computer platform, including Threads or Inter-Process Communication (IPC) on a single node with multiple processors, or TCP sockets, Remote Procedure Calls (RPCs), or less sophisticated exchanging of messages through files visible on multiple nodes. But the best and easiest strategy is to use software libraries specifically designed for message passing on parallel computers. The two most popular libraries or applications program interfaces (APIs) are PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). Because of the wide availability of these two APIs, parallel code which performs message passing using the PVM or MPI libraries can be run on everything from laptops to CRAYs. Further, code developed on Beowulf clusters can be easily moved to very large commercial parallel computers, often without changing a single line of code.
PVM (http://www.epm.ornl.gov/pvm/) is available on a wide range of computer platforms and has bindings for C, C++, and FORTRAN as well as implementations for Java and Python. It includes many advanced features for building complex distributed computing applications.
MPI is really a specification created by a committee, the Message Passing Interface Forum (MPIF), initially in 1994. The specification describes the features and syntax of the API, but leaves the details and techniques of the actual implementation of these features to developers who want to create libraries meeting the MPI specification. Various implementations of MPI are available on a wide variety of computer platforms, but the two most popular ones are MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) (Gropp et al. 1996) and LAM (Local Area Multicomputer)/MPI (http://www.mpi.nd.edu/lam/). Both offer C, C++, and FORTRAN bindings and are widely used on Beowulf clusters. All vendors of commercial parallel computers provide their own implementation of MPI optimized for their systems.
While MPI does not offer some of the specialized features available in PVM, it is based on agreed-upon standards, is increasingly used for code development, and has adequate features for most parallel applications. Hence, the coding examples presented here will be written in C using MPI. The codes have been tested on a Beowulf cluster using the GNU C compiler (gcc) with the MPICH implementation of MPI.
The first step is to download and install the desired message passing library on all the different computer architectures which will be used. On a Beowulf cluster with a shared filesystem, a single installation should do the trick. After installing the software, the machines file should be populated with the list of available nodes. Check the documentation to find out where this file should reside.
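With MPICH, for example, the machines file is ordinarily just a list of hostnames, one per line. A minimal example is shown below; the hostnames are placeholders for your own nodes, and the exact file name and location vary with the MPI implementation and how it was configured.

   beowulf001
   beowulf002
   beowulf003
   beowulf004
   beowulf005
   beowulf006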
Next, one must become familiar with the syntax and semantics of using MPI calls. The most common method of programming parallel applications today is called Single Program Multiple Data (SPMD). While different programs could be written for each processor working on a single problem in parallel, SPMD codes, which consist of a single code base and resulting executable, are easier to write and maintain. Only SPMD code examples will be presented here.
Program 1 is a "Hello World!" program that illustrates the basic MPI calls necessary to startup and end an MPI program.
Program 1 (hello.c):

#include <stdio.h>
#include "mpi.h"

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("Hello World! I'm process %d of %d on %s\n",
      me, nprocs, processor_name);

   MPI_Finalize();
}
In order to successfully compile the code, the MPI header file (mpi.h) must be included at the top of the code. Just inside main(), MPI_Init() must be called and handed the command line arguments so that the environment is set up correctly for the program to run in parallel.
The next three MPI routines in Program 1 return information about the parallel environment for use later in the code. In this example, we merely print out the information, but in most parallel codes this information is used to do automatic problem decomposition and to set up communications between processes. MPI_Comm_size() provides the number of processes, which is subsequently stored in nprocs, in the communicator group MPI_COMM_WORLD. MPI_COMM_WORLD is a special communicator which denotes all of the processes available at initialization. MPI_Comm_rank() provides the rank or process number (ranging from 0 to nprocs-1) of the calling process. The rank is subsequently stored in me. MPI_Get_processor_name() provides the hostname of the node (not the individual processor) being used, stored in processor_name, as well as the length of this hostname, stored in namelen.
Next, the code prints "Hello World!" and the values of the variables obtained in the three previous MPI calls. Finally, MPI_Finalize() is called to terminate the parallel environment.
MPI programs can be compiled in many ways, but most MPI implementations provide an easy-to-use script which will set desired compiler flags, point the compiler at the right directory for MPI header files, and include the necessary libraries for the linker. The MPICH implementation provides a script called mpicc which will use the desired compiler, in this case gcc, and will pass the other command line arguments to it. Output 1 shows how to compile Program 1 with mpicc.
Output 1:

[forrest@beowulf hello]$ mpicc -O -o hello hello.c
[forrest@beowulf hello]$ mpirun -np 6 hello
Hello World! I'm process 4 of 6 on beowulf005
Hello World! I'm process 1 of 6 on beowulf002
Hello World! I'm process 5 of 6 on beowulf006
Hello World! I'm process 2 of 6 on beowulf003
Hello World! I'm process 3 of 6 on beowulf004
Hello World! I'm process 0 of 6 on beowulf001
To run the code in parallel, a special command must be used to start up the program on each processor. This command, called mpirun, makes one or more network connections to each node (usually using rsh or ssh) to initiate the run on each processor. Here we will assume there is one process running on each processor and one or more processors available in each node (individual computer). Each process makes its own network connection to the communicator group and, once all the processes have "checked in," the rest of the program begins to execute on each processor.
Output 1 shows how mpirun can be used to run the hello program. The -np flag is used to tell mpirun how many processes to start. Unless told otherwise, mpirun starts one process on the local node and starts the remaining processes on the nodes listed in the machines file, which should have been configured when MPI was installed. Last, the name of the executable file is provided along with any command line flags used by that program.
Program 1 was run using mpirun and the results are shown in Output 1. Each process was started on a different node (beowulf001 through beowulf006) and each process figured out its own rank (0 through 5) as well as the total number of processes involved in the job (6). While each process started at the same time, the printed output appears in no particular order. This is normal behavior when multiple processes all print at the same time.
With these basics out of the way, we can start doing something a little more interesting. Program 2A does nothing of value, but demonstrates the most fundamental means of message passing using MPI: the MPI_Bcast(), MPI_Send(), and MPI_Recv() function calls.
Program 2A (prog2a.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100
#define PI    3.141592653589793238462643

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int i;
   double seed, init_val[ASIZE], val[ASIZE], sum, tsum;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   if (!me) {   /* Only the first process in the group */
      printf("Enter some kind of seed value:\n");
      scanf("%lf", &seed);
      for (i = 0; i < ASIZE; i++)
         init_val[i] = (double)i * seed * PI;
   }

   /* Broadcast computed initial values to all other processes */
   if (MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD)
         != MPI_SUCCESS)
      fprintf(stderr, "Oops! An error occurred in MPI_Bcast()\n");

   for (i = 0, sum = 0.0; i < ASIZE; i++) {
      val[i] = init_val[i] * me;
      sum += val[i];
   }

   printf("%d: My sum is %lf\n", me, sum);

   /* Send sum back to the first process */
   if (me) {   /* All processes except the one of rank 0 */
      MPI_Send(&sum, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
   }
   else {
      tsum = sum;
      for (i = 1; i < nprocs; i++) {
         MPI_Recv(&sum, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
            MPI_COMM_WORLD, &status);
         tsum += sum;
      }
      printf("%d: Total sum is %lf\n", me, tsum);
   }

   MPI_Finalize();
}
The MPI_Bcast() routine is used to send data from one process to all other processes taking part in the parallel program. In Program 2A, initial values are calculated by a single process (the process of rank 0) using a seed value input by the user. Now to get these initial values to the other processes, the MPI_Bcast() routine is called, and it is given the address of the buffer to send (in this case init_val), the size of the buffer (ASIZE), the data type (MPI_DOUBLE, which denotes the C type double), the rank of the originator (in this case the zeroth process), and the communicator (MPI_COMM_WORLD).

In this example, we check the return value of MPI_Bcast() to be sure it was successful, i.e., that it returned the value MPI_SUCCESS. If a failure occurs, an error message is printed.
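One caveat about this style of error checking: by default, errors on MPI_COMM_WORLD are treated as fatal in most implementations, so a failing call may abort the job before its return code is ever examined. To have error codes returned to the caller, the MPI_ERRORS_RETURN error handler can be installed with MPI_Errhandler_set() after MPI_Init(), and MPI_Error_string() can translate a code into a readable message. The helper below is only a sketch of this idea (the function name check_mpi() is ours, not part of the programs in this article).

#include <stdio.h>
#include "mpi.h"

/* Print a readable message if an MPI call did not return MPI_SUCCESS */
void check_mpi(int rc, const char *what)
{
   char msg[MPI_MAX_ERROR_STRING];
   int len;

   if (rc != MPI_SUCCESS) {
      MPI_Error_string(rc, msg, &len);
      fprintf(stderr, "%s failed: %s\n", what, msg);
   }
}

After calling MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN), the broadcast above could then be checked with check_mpi(MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD), "MPI_Bcast()").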
Now that each process has the same init_val array, each value in the array is multiplied by the process rank and summed. This sum, unique to each process, is then printed along with the rank of the process which computed it. In order to sum the sums across all processors, we might use the MPI_Send() and MPI_Recv() routines. In Program 2A, each process except process 0 calls the MPI_Send() routine to send its sum to process 0. In the meantime, process 0 calls MPI_Recv() a total of nprocs-1 times and accumulates the received values in tsum.

The MPI_Send() routine is passed the address of the buffer to send (&sum), the number of elements in the buffer (1), the data type of the elements of the buffer (MPI_DOUBLE), the destination rank (0), a message tag (1), and the communicator (MPI_COMM_WORLD). The MPI_Recv() routine is passed the address of the buffer in which to receive (&sum), the number of elements in the buffer (1), the data type of the elements of the buffer (MPI_DOUBLE), the rank of the originator (in this case MPI_ANY_SOURCE, which acts like a wild card allowing messages to be received from any sender), a message tag (1), the communicator (MPI_COMM_WORLD), and a status structure which contains information about the message received.
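For reference, these two routines have the following C prototypes in MPI-1 (newer versions of the standard declare the send buffer as const, but the argument order is unchanged):

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
   int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
   int tag, MPI_Comm comm, MPI_Status *status);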
Message tags can be handy when processes pass lots of different kinds of messages around. Message tags can be used to selectively receive messages of a particular type. Messages can be selectively received also by specifying a particular originator rank.
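For instance, using the variables declared in Program 2A (the tag value 7 and source rank 3 below are arbitrary choices for illustration), a receive can be restricted to one sender and one tag, or opened up with wild cards:

/* Accept only a message carrying tag 7 sent by process 3 */
MPI_Recv(&sum, 1, MPI_DOUBLE, 3, 7, MPI_COMM_WORLD, &status);

/* Accept a message with any tag from any sender; the actual source and
   tag are then available in status.MPI_SOURCE and status.MPI_TAG */
MPI_Recv(&sum, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
   MPI_COMM_WORLD, &status);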
Output 2A shows the results of compiling and running Program 2A. The program, prog2a, is run on six processors using the mpirun command. The first process (of rank 0) prints a message and the user types in a value, 1.2345. Next, each process prints its local sum; notice that these lines are in no particular order. Finally, the first process prints the global sum.
Output 2A:

[forrest@beowulf001 prog2]$ mpicc -O -o prog2a prog2a.c
[forrest@beowulf001 prog2]$ mpirun -np 6 prog2a
Enter some kind of seed value:
1.2345
0: My sum is 0.000000
1: My sum is 19197.565848
5: My sum is 95987.829239
4: My sum is 76790.263391
3: My sum is 57592.697543
2: My sum is 38395.131695
0: Total sum is 287963.487716
It happens that many commonly-needed operations have their own routines in MPI. Such is the case with computing a global sum as was done in Program 2A. The MPI_Reduce() function performs a number of global operations, including computing a global sum. Program 2B is the same as Program 2A, except the MPI_Send() and MPI_Recv() calls are replaced by a single MPI_Reduce() call.
Program 2B (prog2b.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100
#define PI    3.141592653589793238462643

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int i;
   double seed, init_val[ASIZE], val[ASIZE], sum, tsum;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   if (!me) {   /* Only the first process in the group */
      printf("Enter some kind of seed value:\n");
      scanf("%lf", &seed);
      for (i = 0; i < ASIZE; i++)
         init_val[i] = (double)i * seed * PI;
   }

   /* Broadcast computed initial values to all other processes */
   if (MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD)
         != MPI_SUCCESS)
      fprintf(stderr, "Oops! An error occurred in MPI_Bcast()\n");

   for (i = 0, sum = 0.0; i < ASIZE; i++) {
      val[i] = init_val[i] * me;
      sum += val[i];
   }

   printf("%d: My sum is %lf\n", me, sum);

   /* Add the value of sum from all processes and store the
      total in tsum on process 0. */
   MPI_Reduce(&sum, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (!me)
      printf("%d: Total sum is %lf\n", me, tsum);

   MPI_Finalize();
}
The MPI_Reduce() routine is provided the send buffer (&sum), the receive buffer (&tsum), the number of elements in the send buffer (1), the data type of the elements in the send buffer (MPI_DOUBLE), the operation (in this case MPI_SUM), the rank of the destination process (0), and the communicator (MPI_COMM_WORLD). This routine accumulates the values in the sum variable on each processor and stores the result in the tsum variable on processor 0. A similar routine, called MPI_Allreduce(), performs identical operations but returns the results to all processes in the group.
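For example, if every process (not just process 0) needed the global total, the reduction in Program 2B could be replaced with a call like the following sketch; note that there is no destination rank argument, and tsum ends up holding the same value on every process.

/* Sum the local values of sum and store the total in tsum on
   every process in the communicator */
MPI_Allreduce(&sum, &tsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);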
Output 2B shows the results of compiling and running Program 2B. The program, prog2b, is run on six processors using the same seed value as before. Each processor prints its local sum, in some order, and process 0 prints the global sum. As you can see, the result is identical to Output 2A and it was easier to program.
Output 2B:

[forrest@beowulf001 prog2]$ mpicc -O -o prog2b prog2b.c
[forrest@beowulf001 prog2]$ mpirun -np 6 prog2b
Enter some kind of seed value:
1.2345
0: My sum is 0.000000
1: My sum is 19197.565848
5: My sum is 95987.829239
4: My sum is 76790.263391
3: My sum is 57592.697543
2: My sum is 38395.131695
0: Total sum is 287963.487716
It should already be evident from these simple examples that programming parallel applications using MPI is not terribly difficult, but the resulting applications can be quite powerful. Large and complex applications can see incredible performance gains by harnessing the power of tens or hundreds of processors which may otherwise be sitting idle on someone's desk. There are, however, many ways a programmer can abuse the power of MPI or totally screw up application code if he is not careful. A few of the most common pitfalls are described below.
The MPI_Recv() and MPI_Send() routines used in Program 2A are provided for synchronous communication. These routines are blocking: MPI_Recv() returns only after the receive buffer contains the new message, and MPI_Send() returns only after a matching receive has been posted or after the outgoing message has been copied to a temporary system buffer. This behavior can lead to deadlock in poorly designed programs.
Program 3A creates in each process an array that it attempts to pass to the next highest numbered process. The process of rank nprocs-1 attempts to pass its array to the process of rank 0. These send and receive operations are performed nprocs times so that at the end of the loop each process ends up with a copy of its own original array. The program then verifies that the array it received last is the same as the one it sent originally.
Program 3A (prog3a.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Receive values from neighboring process */
      MPI_Recv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &status);
      printf("%d: Received a message from process %d\n",
         me, status.MPI_SOURCE);
      /* Send values to neighboring process */
      printf("%d: Will send to process %d\n",
         me, (me < (nprocs-1) ? (me+1) : 0));
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
Program 3A first calls MPI_Recv() in an attempt to receive the array which should be passed by its lower-ranked neighbor. Next, MPI_Send() is called in an attempt to pass its array to its higher-ranked neighbor. The problem is that MPI_Recv() blocks until the matching send is executed by the neighboring process. Because all processes are blocking on the MPI_Recv() call, none of them gets down to calling MPI_Send() and deadlock results. All processes wait forever.
Output 3A shows the results of compiling and running Program 3A. When run, the program printed "I'm process 0 of 6" and then appeared to hang until it was interrupted with a control-c. All processes were waiting to receive messages and no messages were ever sent.
Output 3A:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3a prog3a.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3a
I'm process 0 of 6
bm_list_4912: p4_error: interrupt SIGINT: 2
rm_l_1_6207: p4_error: interrupt SIGINT: 2
rm_l_5_20338: p4_error: interrupt SIGINT: 2
rm_l_4_9232: p4_error: interrupt SIGINT: 2
I'm process 1 of 6
p1_6206: p4_error: interrupt SIGINT: 2
rm_l_2_4917: p4_error: interrupt SIGINT: 2
I'm process 4 of 6
p4_9231: p4_error: interrupt SIGINT: 2
I'm process 5 of 6
p5_20337: p4_error: interrupt SIGINT: 2
rm_l_3_9303: p4_error: interrupt SIGINT: 2
I'm process 3 of 6
p3_9302: p4_error: interrupt SIGINT: 2
p0_4911: p4_error: interrupt SIGINT: 2
I'm process 2 of 6
p2_4916: p4_error: interrupt SIGINT: 2
Program 3B is just like Program 3A except the send and receive calls are switched around. This program will work on some parallel systems because of buffering, but it could result in deadlock under a number of conditions. Although MPI_Send() is blocking, it may return after the message has been copied to a system buffer but before it is received at its destination. This approach may work if messages are small enough, system buffers are large enough, and the chosen MPI implementation allows for it, but it is considered unsafe. In addition, it reduces the scalability and portability of the code. MPI assumes that safe programs will not rely on system buffering.
Program 3B (prog3b.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Send values to neighboring process */
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      /* Receive values from neighboring process */
      MPI_Recv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &status);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
Output 3B shows the results of compiling and running Program 3B. The program ran to completion and produced the desired result even though Program 3B is unsafe.
Output 3B:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3b prog3b.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3b
I'm process 0 of 6
I'm process 1 of 6
1: No values were changed.
I'm process 5 of 6
5: No values were changed.
I'm process 3 of 6
3: No values were changed.
I'm process 4 of 6
4: No values were changed.
I'm process 2 of 6
2: No values were changed.
0: No values were changed.
Program 3C offers a better alternative. It uses a non-blocking receive call followed by a blocking send. MPI_Irecv() "posts up" a receive so that if a message comes in, it is copied into the receive buffer even though the program may be off doing something else. This is a much more efficient way to do message exchanges. An MPI_Isend() could have been used followed by a blocking receive; however, this can be less efficient under some circumstances.
Program 3C (prog3c.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;
   MPI_Request request;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Post up receive for values from neighboring process */
      MPI_Irecv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &request);
      /* Send values to neighboring process */
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      /* Wait until the receive request has completed */
      MPI_Wait(&request, &status);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
MPI_Irecv() is passed the address of the receive buffer (rval), the size of the buffer (ASIZE), the data type (MPI_INT), the rank of the source (MPI_ANY_SOURCE), a message tag (i), a communicator (MPI_COMM_WORLD), and a request handle (request) which can be used to check on the status of the receive later. Notice that MPI_Wait() is called just before rval is needed. This is to ensure that the receive has completed before trying to use its results. MPI_Wait() is passed the previously-filled request handle from the MPI_Irecv() call and a pointer to a status structure. Forgetting to wait for non-blocking communications to complete can result in indeterminate and undesirable, but often entertaining, behavior.
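A related routine, MPI_Test(), checks whether a posted request has completed without blocking. A common pattern, sketched below using the same request and status variables as Program 3C, is to poll the request and do other useful local work until the message actually arrives.

int done = 0;

/* Poll the outstanding receive; keep working until it completes */
MPI_Test(&request, &done, &status);
while (!done) {
   /* ... do other useful local work here ... */
   MPI_Test(&request, &done, &status);
}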
Output 3C shows the successful result obtained by running Program 3C.
Output 3C:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3c prog3c.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3c
I'm process 0 of 6
I'm process 2 of 6
2: No values were changed.
I'm process 3 of 6
3: No values were changed.
I'm process 5 of 6
5: No values were changed.
I'm process 4 of 6
4: No values were changed.
0: No values were changed.
I'm process 1 of 6
1: No values were changed.
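It is also worth knowing about MPI_Sendrecv(), which performs a blocking send and the matching receive as a single operation and is designed so that shift-style exchanges like the one in these examples do not deadlock. As a sketch only (using the same variables as Program 3C), the body of the exchange loop could be written as a single call:

/* Exchange arrays in one call: send sval to the higher-ranked neighbor
   and receive rval from any sender, both with message tag i */
MPI_Sendrecv(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0), i,
   rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
   MPI_COMM_WORLD, &status);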
We have presented some of the theory behind parallel computing, discussed problem decomposition and code granularity, provided an introduction to the most frequently-used MPI message passing routines, and described many of the common pitfalls associated with using message passing. The conscientious programmer has much to consider when beginning to write parallel code, but the communication among processes is made much simpler through the use of MPI message passing routines.
Forrest M. Hoffman is a computer specialist at Oak Ridge National Laboratory in Oak Ridge, Tennessee.
William W. Hargrove is an ecologist and spatial modeler at Oak Ridge National Laboratory.