Forrest M. Hoffman and William W. Hargrove
Parallel computer systems have been available commercially for many years. Used primarily in federal defense organizations and research laboratories, these costly behemoths were difficult to use and program, often requiring specialized skills and intimate knowledge of the unique architecture of each machine. With so few customers and such specialized equipment, the supercomputer industry floundered until the recent trend of constructing supercomputers from clusters of workstations and commodity PCs.
Authors Forrest Hoffman (standing) and Bill Hargrove sit "inside" the computer they constructed from commodity PCs.
The Beowulf project (http://www.beowulf.org/), begun at NASA's Goddard Space Flight Center, has helped solidify this trend through the use of Linux and other Open Source software running on inexpensive, off-the-shelf PCs (Becker et al. 1995). This work has opened the door for low-cost, high performance cluster computing. In addition, standards and tools have been developed for such distributed memory parallel computer systems making it easier for programmers to build scalable and portable parallel computer applications.
Because of the low costs involved in building a Beowulf-style cluster from commodity PCs, many research organizations, universities, and businesses have done just that in an effort to hold down costs while increasing computational performance for complex problems. Having built our own Beowulf-style cluster (Hoffman and Hargrove 1999), we often hear from students or researchers who have built Beowulf clusters but are not sure where to go from there. We hope this article will provide some direction for those who need help getting started with programming their own parallel computer applications.
High performance parallel computing is accomplished by splitting up large and complex tasks across multiple processors. During World War II, well before the advent of the electronic computer, a similar technique was used for carrying out long calculations associated with the design of the atomic bomb for the Manhattan Project. To significantly reduce the amount of time it took to solve a large mathematical problem, each part of the problem was performed by a different person. Interestingly enough, the people who performed these calculations were called computers. Today electronic computers can work in harmony to solve scientific problems not dreamed of even a decade ago.
While a variety of methods can be used to achieve improved performance on multiple computers, the most common way to organize and orchestrate parallel processing is to write code which automatically decomposes the problem at hand and allows the processors to communicate with each other when necessary while performing their work.
Not every computational problem is amenable to parallel computing. If no sub-tasks can be performed simultaneously or if the system being modeled is highly interdependent (i.e., the problem is "fine grained"), attempts to parallelize such codes may result in increased time-to-solution. For the finest-grained problems computational performance is limited by the speed of the fastest single CPU available on the market.
Luckily, many complex scientific problems can be decomposed either by isolating separate tasks which can be performed independently and simultaneously by multiple processors or more often by splitting up the space (or time) coordinates of the system being modeled so that values for each sub-region (or time interval) can be calculated simultaneously. For example, many image processing applications perform calculations on individual cells or pixels without requiring knowledge of the state of any other pixel in the frame. These kinds of applications are "coarse grained," are generally easy to parallelize, and attain the most benefit from parallel processing. Such applications are often referred to as "embarrassingly parallel."
Most scientific applications, however, fall somewhere in between coarse and fine granularity. These applications usually require some amount of interaction between sub-regions so individual processors must be able to communicate with each other to exchange calculated values, a technique called message passing. For example, values for cells on a map may depend on the values of their nearest neighboring cells. If the map is decomposed into two pieces, each being processed on a separate CPU, the processors must exchange cell values on adjacent edges of the map.
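As a deliberately simplified sketch of this kind of decomposition (the names NROWS, nprocs, and rank are ours; the real values would come from the application and the message passing library), the fragment below splits the rows of a map into contiguous bands, one band per process. A real code would also exchange the edge rows with neighboring processes, as described above.

   /* Sketch: divide NROWS map rows into nearly equal bands, one per process.
      rank is this process's number (0..nprocs-1); nprocs is the process count. */
   int rows_per_proc = NROWS / nprocs;   /* base number of rows per process */
   int extra         = NROWS % nprocs;   /* leftover rows to spread around  */
   int first_row     = rank * rows_per_proc + (rank < extra ? rank : extra);
   int last_row      = first_row + rows_per_proc + (rank < extra ? 1 : 0) - 1;
   int row;

   for (row = first_row; row <= last_row; row++) {
      /* ... compute new values for the cells in this row ... */
   }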
Spatial decompositions must be made with care and attention to the spatial interdependency of the problem. A common pitfall is to turn what was a computational problem on a single processor into a communications problem on multiple processors. This occurs most frequently when a problem is not properly decomposed or when the amount of interaction across space (or time) is so great that the code spends more time communicating than calculating. Striking the right balance between computation and communication for any particular problem on any particular parallel computing platform is more art than science.
While techniques for problem decomposition will not be considered here, it is important that programmers give careful thought to how they split up the work inside their applications. Once a good decomposition strategy is found, it is time to begin programming.
Many different techniques can be used for message passing on a Beowulf cluster or other parallel computer platform, including Threads or Inter-Process Communication (IPC) on a single node with multiple processors, or TCP sockets, Remote Procedure Calls (RPCs), or less sophisticated exchanging of messages through files visible on multiple nodes. But the best and easiest strategy is to use software libraries specifically designed for message passing on parallel computers. The two most popular libraries or applications program interfaces (APIs) are PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). Because of the wide availability of these two APIs, parallel code which performs message passing using the PVM or MPI libraries can be run on everything from laptops to CRAYs. Further, code developed on Beowulf clusters can be easily moved to very large commercial parallel computers, often without changing a single line of code.
PVM (http://www.epm.ornl.gov/pvm/) is available on a wide range of computer platforms and has bindings for C, C++, and FORTRAN as well as implementations for Java and Python. It includes many advanced features for building complex distributed computing applications.
MPI is really a specification created by a committee, the Message Passing Interface Forum (MPIF), initially in 1994. The specification describes the features and syntax of the API, but leaves the details and techniques of the actual implementation of these features to developers who want to create libraries meeting the MPI specification. Various implementations of MPI are available on a wide variety of computer platforms, but the two most popular ones are MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) (Gropp et al. 1996) and LAM (Local Area Multicomputer)/MPI (http://www.mpi.nd.edu/lam/). Both offer C, C++, and FORTRAN bindings and are widely used on Beowulf clusters. All vendors of commercial parallel computers provide their own implementation of MPI optimized for their systems.
While MPI does not offer some of the specialized features available in PVM, it is based on agreed-upon standards, is increasingly used for code development, and has adequate features for most parallel applications. Hence, the coding examples presented here will be written in C using MPI. The codes have been tested on a Beowulf cluster using the GNU C compiler (gcc) with the MPICH implementation of MPI.
The first step is to download and install the desired message passing library on all the different computer architectures which will be used. On a Beowulf cluster with a shared filesystem, a single installation should do the trick. After installing the software, the machines file should be populated with the list of available nodes. Check the documentation to find out where this file should reside.
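With MPICH, for example, the machines file is ordinarily just a list of hostnames, one per line. A minimal example is shown below; the hostnames are placeholders for your own nodes, and the exact file name and location vary with the MPI implementation and how it was configured.

   beowulf001
   beowulf002
   beowulf003
   beowulf004
   beowulf005
   beowulf006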
Next, one must become familiar with the syntax and semantics of using MPI calls. The most common method of programming parallel applications today is called Single Program Multiple Data (SPMD). While different programs could be written for each processor working on a single problem in parallel, SPMD codes, which consist of a single code base and resulting executable, are easier to write and maintain. Only SPMD code examples will be presented here.
Program 1 is a "Hello World!" program that illustrates the basic MPI calls necessary to startup and end an MPI program.
Program 1 (hello.c):

#include <stdio.h>
#include "mpi.h"

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("Hello World! I'm process %d of %d on %s\n",
      me, nprocs, processor_name);

   MPI_Finalize();
}
In order to successfully compile the code, the MPI header file (mpi.h) must be included at the top of the code. Just inside main(), MPI_Init() must be called and handed the command line arguments so that the environment is set up correctly for the program to run in parallel.
The next three MPI routines in Program 1 return information about the parallel environment for use later in the code. In this example, we merely print out the information, but in most parallel codes this information is used to do automatic problem decomposition and to set up communications between processes. MPI_Comm_size() provides the number of processes, which is subsequently stored in nprocs, in the communicator group MPI_COMM_WORLD. MPI_COMM_WORLD is a special communicator which denotes all of the processes available at initialization. MPI_Comm_rank() provides the rank or process number (ranging from 0 to nprocs-1) of the calling process. The rank is subsequently stored in me. MPI_Get_processor_name() provides the hostname of the node (not the individual processor) being used, stored in processor_name, as well as the length of this hostname, stored in namelen.
Next, the code prints "Hello World!" and the values of the variables obtained in the three previous MPI calls. Finally, MPI_Finalize() is called to terminate the parallel environment.
MPI programs can be compiled in many ways, but most MPI implementations provide an easy-to-use script which will set desired compiler flags, point the compiler at the right directory for MPI header files, and include the necessary libraries for the linker. The MPICH implementation provides a script called mpicc which will use the desired compiler, in this case gcc, and will pass the other command line arguments to it. Output 1 shows how to compile Program 1 with mpicc.
Output 1:

[forrest@beowulf hello]$ mpicc -O -o hello hello.c
[forrest@beowulf hello]$ mpirun -np 6 hello
Hello World! I'm process 4 of 6 on beowulf005
Hello World! I'm process 1 of 6 on beowulf002
Hello World! I'm process 5 of 6 on beowulf006
Hello World! I'm process 2 of 6 on beowulf003
Hello World! I'm process 3 of 6 on beowulf004
Hello World! I'm process 0 of 6 on beowulf001
To run the code in parallel, a special command must be used to start up the program on each processor. This command, called mpirun, makes one or more network connections to each node (usually using rsh or ssh) to initiate the run on each processor. Here we will assume there is one process running on each processor and one or more processors available in each node (individual computer). Each process makes its own network connection to the communicator group and, once all the processes have "checked in," the rest of the program begins to execute on each processor.
Output 1 shows how mpirun can be used to run the hello program. The -np flag is used to tell mpirun how many processes to start. Unless told otherwise, mpirun starts one process on the local node and starts the remaining processes on the nodes listed in the machines file, which should have been configured when MPI was installed. Last, the name of the executable file is provided along with any command line flags used by that program.
Program 1 was run using mpirun and the results are shown in Output 1. Each process was started on a different node (beowulf001 through beowulf006) and each process figured out its own rank (0 through 5) as well as the total number of processes involved in the job (6). While each process started at the same time, the printed output appears in no particular order. This is normal behavior when multiple processes all print at the same time.
With these basics out of the way, we can start doing something a little more interesting. Program 2A does nothing of value, but demonstrates the most fundamental means of message passing using MPI: the MPI_Bcast(), MPI_Send(), and MPI_Recv() function calls.
Program 2A (prog2a.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100
#define PI    3.141592653589793238462643

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int i;
   double seed, init_val[ASIZE], val[ASIZE], sum, tsum;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   if (!me) {   /* Only the first process in the group */
      printf("Enter some kind of seed value:\n");
      scanf("%lf", &seed);
      for (i = 0; i < ASIZE; i++)
         init_val[i] = (double)i * seed * PI;
   }

   /* Broadcast computed initial values to all other processes */
   if (MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD)
         != MPI_SUCCESS)
      fprintf(stderr, "Oops! An error occurred in MPI_Bcast()\n");

   for (i = 0, sum = 0.0; i < ASIZE; i++) {
      val[i] = init_val[i] * me;
      sum += val[i];
   }

   printf("%d: My sum is %lf\n", me, sum);

   /* Send sum back to the first process */
   if (me) {   /* All processes except the one of rank 0 */
      MPI_Send(&sum, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
   }
   else {
      tsum = sum;
      for (i = 1; i < nprocs; i++) {
         MPI_Recv(&sum, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
            MPI_COMM_WORLD, &status);
         tsum += sum;
      }
      printf("%d: Total sum is %lf\n", me, tsum);
   }

   MPI_Finalize();
}
The MPI_Bcast() routine is used to send data from one process to all other processes taking part in the parallel program. In Program 2A, initial values are calculated by a single process (the process of rank 0) using a seed value input by the user. Now to get these initial values to the other processes, the MPI_Bcast() routine is called, and it is given the address of the buffer to send (in this case init_val), the size of the buffer (ASIZE), the data type (MPI_DOUBLE, which denotes the C type double), the rank of the originator (in this case the zeroth process), and the communicator (MPI_COMM_WORLD).

In this example, we check the return value of MPI_Bcast() to be sure it was successful, i.e., that it returned the value MPI_SUCCESS. If a failure occurs, an error message is printed.
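One caveat about this style of error checking: by default, errors on MPI_COMM_WORLD are treated as fatal in most implementations, so a failing call may abort the job before its return code is ever examined. To have error codes returned to the caller, the MPI_ERRORS_RETURN error handler can be installed with MPI_Errhandler_set() after MPI_Init(), and MPI_Error_string() can translate a code into a readable message. The helper below is only a sketch of this idea (the function name check_mpi() is ours, not part of the programs in this article).

#include <stdio.h>
#include "mpi.h"

/* Print a readable message if an MPI call did not return MPI_SUCCESS */
void check_mpi(int rc, const char *what)
{
   char msg[MPI_MAX_ERROR_STRING];
   int len;

   if (rc != MPI_SUCCESS) {
      MPI_Error_string(rc, msg, &len);
      fprintf(stderr, "%s failed: %s\n", what, msg);
   }
}

After calling MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN), the broadcast above could then be checked with check_mpi(MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD), "MPI_Bcast()").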
Now that each process has the same init_val array, each value in the array is multiplied by the process rank and summed. This sum, unique to each process, is then printed along with the rank of the process which computed it. In order to sum the sums across all processors, we might use the MPI_Send() and MPI_Recv() routines. In Program 2A, each process except process 0 calls the MPI_Send() routine to send its sum to process 0. In the meantime, process 0 calls MPI_Recv() a total of nprocs-1 times and accumulates the received values in tsum.

The MPI_Send() routine is passed the address of the buffer to send (&sum), the number of elements in the buffer (1), the data type of the elements of the buffer (MPI_DOUBLE), the destination rank (0), a message tag (1), and the communicator (MPI_COMM_WORLD). The MPI_Recv() routine is passed the address of the buffer in which to receive (&sum), the number of elements in the buffer (1), the data type of the elements of the buffer (MPI_DOUBLE), the rank of the originator (in this case MPI_ANY_SOURCE, which acts like a wild card allowing messages to be received from any sender), a message tag (1), the communicator (MPI_COMM_WORLD), and a status structure which contains information about the message received.
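For reference, these two routines have the following C prototypes in MPI-1 (newer versions of the standard declare the send buffer as const, but the argument order is unchanged):

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
   int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
   int tag, MPI_Comm comm, MPI_Status *status);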
Message tags can be handy when processes pass lots of different kinds of messages around. Message tags can be used to selectively receive messages of a particular type. Messages can be selectively received also by specifying a particular originator rank.
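For instance, using the variables declared in Program 2A (the tag value 7 and source rank 3 below are arbitrary choices for illustration), a receive can be restricted to one sender and one tag, or opened up with wild cards:

/* Accept only a message carrying tag 7 sent by process 3 */
MPI_Recv(&sum, 1, MPI_DOUBLE, 3, 7, MPI_COMM_WORLD, &status);

/* Accept a message with any tag from any sender; the actual source and
   tag are then available in status.MPI_SOURCE and status.MPI_TAG */
MPI_Recv(&sum, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
   MPI_COMM_WORLD, &status);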
Output 2A shows the results of compiling and running Program 2A. The program, prog2a, is run on six processors using the mpirun command. The first process (of rank 0) prints a message and the user types in a value, 1.2345. Next, each process prints its local sum; notice that these lines are in no particular order. Finally, the first process prints the global sum.
Output 2A:

[forrest@beowulf001 prog2]$ mpicc -O -o prog2a prog2a.c
[forrest@beowulf001 prog2]$ mpirun -np 6 prog2a
Enter some kind of seed value:
1.2345
0: My sum is 0.000000
1: My sum is 19197.565848
5: My sum is 95987.829239
4: My sum is 76790.263391
3: My sum is 57592.697543
2: My sum is 38395.131695
0: Total sum is 287963.487716
It happens that many commonly-needed operations have their own routines in MPI. Such is the case with computing a global sum as was done in Program 2A. The MPI_Reduce() function performs a number of global operations, including computing a global sum. Program 2B is the same as Program 2A, except the MPI_Send() and MPI_Recv() calls are replaced by a single MPI_Reduce() call.
Program 2B (prog2b.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100
#define PI    3.141592653589793238462643

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int i;
   double seed, init_val[ASIZE], val[ASIZE], sum, tsum;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   if (!me) {   /* Only the first process in the group */
      printf("Enter some kind of seed value:\n");
      scanf("%lf", &seed);
      for (i = 0; i < ASIZE; i++)
         init_val[i] = (double)i * seed * PI;
   }

   /* Broadcast computed initial values to all other processes */
   if (MPI_Bcast(init_val, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD)
         != MPI_SUCCESS)
      fprintf(stderr, "Oops! An error occurred in MPI_Bcast()\n");

   for (i = 0, sum = 0.0; i < ASIZE; i++) {
      val[i] = init_val[i] * me;
      sum += val[i];
   }

   printf("%d: My sum is %lf\n", me, sum);

   /* Add the value of sum from all processes and store the
      total in tsum on process 0. */
   MPI_Reduce(&sum, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

   if (!me)
      printf("%d: Total sum is %lf\n", me, tsum);

   MPI_Finalize();
}
The MPI_Reduce() routine is provided the send buffer (&sum), the receive buffer (&tsum), the number of elements in the send buffer (1), the data type of the elements in the send buffer (MPI_DOUBLE), the operation (in this case MPI_SUM), the rank of the destination process (0), and the communicator (MPI_COMM_WORLD). This routine accumulates the values in the sum variable on each processor and stores the result in the tsum variable on processor 0. A similar routine, called MPI_Allreduce(), performs identical operations but returns the results to all processes in the group.
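For example, if every process (not just process 0) needed the global total, the reduction in Program 2B could be replaced with a call like the following sketch; note that there is no destination rank argument, and tsum ends up holding the same value on every process.

/* Sum the local values of sum and store the total in tsum on
   every process in the communicator */
MPI_Allreduce(&sum, &tsum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);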
Output 2B shows the results of compiling and running Program 2B. The program, prog2b, is run on six processors using the same seed value as before. Each processor prints its local sum, in some order, and process 0 prints the global sum. As you can see, the result is identical to Output 2A and it was easier to program.
Output 2B:

[forrest@beowulf001 prog2]$ mpicc -O -o prog2b prog2b.c
[forrest@beowulf001 prog2]$ mpirun -np 6 prog2b
Enter some kind of seed value:
1.2345
0: My sum is 0.000000
1: My sum is 19197.565848
5: My sum is 95987.829239
4: My sum is 76790.263391
3: My sum is 57592.697543
2: My sum is 38395.131695
0: Total sum is 287963.487716
It should already be evident from these simple examples that programming parallel applications using MPI is not terribly difficult, but the resulting applications can be quite powerful. Large and complex applications can see incredible performance gains by harnessing the power of tens or hundreds of processors which may otherwise be sitting idle on someone's desk. There are, however, many ways a programmer can abuse the power of MPI or totally screw up application code if he is not careful. A few of the most common pitfalls are described below.
The MPI_Recv() and MPI_Send() routines used in Program 2A are provided for synchronous communication. These routines are blocking: MPI_Recv() returns only after the receive buffer contains the new message, and MPI_Send() returns only after a matching receive has been posted or after the outgoing message has been copied to a temporary system buffer. This behavior can lead to deadlock in poorly designed programs.
Program 3A creates in each process an array that it attempts to pass to the next highest numbered process. The process of rank nprocs-1 attempts to pass its array to the process of rank 0. These send and receive operations are performed nprocs times so that at the end of the loop each process ends up with a copy of its own original array. The program then verifies that the array it received last is the same as the one it sent originally.
Program 3A (prog3a.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Receive values from neighboring process */
      MPI_Recv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &status);
      printf("%d: Received a message from process %d\n",
         me, status.MPI_SOURCE);
      /* Send values to neighboring process */
      printf("%d: Will send to process %d\n",
         me, (me < (nprocs-1) ? (me+1) : 0));
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
Program 3A first calls MPI_Recv() in an attempt to receive the array which should be passed by its lower-ranked neighbor. Next, MPI_Send() is called in an attempt to pass its array to its higher-ranked neighbor. The problem is that MPI_Recv() blocks until the matching send is executed by the neighboring process. Because all processes are blocking on the MPI_Recv() call, none of them gets down to calling MPI_Send() and deadlock results. All processes wait forever.
Output 3A shows the results of compiling and running Program 3A. When run, the program printed "I'm process 0 of 6" and then appeared to hang until it was interrupted with a control-c. All processes were waiting to receive messages and no messages were ever sent.
Output 3A:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3a prog3a.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3a
I'm process 0 of 6
bm_list_4912: p4_error: interrupt SIGINT: 2
rm_l_1_6207: p4_error: interrupt SIGINT: 2
rm_l_5_20338: p4_error: interrupt SIGINT: 2
rm_l_4_9232: p4_error: interrupt SIGINT: 2
I'm process 1 of 6
p1_6206: p4_error: interrupt SIGINT: 2
rm_l_2_4917: p4_error: interrupt SIGINT: 2
I'm process 4 of 6
p4_9231: p4_error: interrupt SIGINT: 2
I'm process 5 of 6
p5_20337: p4_error: interrupt SIGINT: 2
rm_l_3_9303: p4_error: interrupt SIGINT: 2
I'm process 3 of 6
p3_9302: p4_error: interrupt SIGINT: 2
p0_4911: p4_error: interrupt SIGINT: 2
I'm process 2 of 6
p2_4916: p4_error: interrupt SIGINT: 2
Program 3B is just like Program 3A except the send and receive calls are switched around. This program will work on some parallel systems because of buffering, but it could result in deadlock under a number of conditions. Although MPI_Send() is blocking, it may return after the message has been copied to a system buffer but before it is received at its destination. This approach may work if messages are small enough, system buffers are large enough, and the chosen MPI implementation allows for it, but it is considered unsafe. In addition, it reduces the scalability and portability of the code. MPI assumes that safe programs will not rely on system buffering.
Program 3B (prog3b.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Send values to neighboring process */
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      /* Receive values from neighboring process */
      MPI_Recv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &status);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
Output 3B shows the results of compiling and running Program 3B. The program ran to completion and produced the desired result even though Program 3B is unsafe.
Output 3B:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3b prog3b.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3b
I'm process 0 of 6
I'm process 1 of 6
1: No values were changed.
I'm process 5 of 6
5: No values were changed.
I'm process 3 of 6
3: No values were changed.
I'm process 4 of 6
4: No values were changed.
I'm process 2 of 6
2: No values were changed.
0: No values were changed.
Program 3C offers a better alternative. It uses a non-blocking receive call followed by a blocking send. MPI_Irecv() "posts up" a receive so that if a message comes in, it is copied into the receive buffer even though the program may be off doing something else. This is a much more efficient way to do message exchanges. An MPI_Isend() could have been used followed by a blocking receive; however, this can be less efficient under some circumstances.
Program 3C (prog3c.c):

#include <stdio.h>
#include "mpi.h"

#define ASIZE 100

void main(int argc, char **argv)
{
   int me, nprocs, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   int val[ASIZE], sval[ASIZE], rval[ASIZE], i, j, flag;
   MPI_Status status;
   MPI_Request request;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);
   MPI_Get_processor_name(processor_name, &namelen);

   printf("I'm process %d of %d\n", me, nprocs);

   /* Initialize: stuff some bogus values in an array */
   for (j = 0; j < ASIZE; j++)
      val[j] = sval[j] = j * me;

   for (i = 0; i < nprocs; i++) {
      /* Post up receive for values from neighboring process */
      MPI_Irecv(rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
         MPI_COMM_WORLD, &request);
      /* Send values to neighboring process */
      MPI_Send(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0),
         i, MPI_COMM_WORLD);
      /* Wait until the receive request has completed */
      MPI_Wait(&request, &status);
      for (j = 0; j < ASIZE; j++)
         sval[j] = rval[j];
   }

   for (j = flag = 0; j < ASIZE; j++)
      if (rval[j] != val[j]) flag++;
   if (flag)
      printf("%d: %d values were different!\n", me, flag);
   else
      printf("%d: No values were changed.\n", me);

   MPI_Finalize();
}
MPI_Irecv() is passed the address of the receive buffer (rval), the size of the buffer (ASIZE), the data type (MPI_INT), the rank of the source (MPI_ANY_SOURCE), a message tag (i), a communicator (MPI_COMM_WORLD), and a request handle (request) which can be used to check on the status of the receive later. Notice that MPI_Wait() is called just before rval is needed. This is to ensure that the receive has completed before trying to use its results. MPI_Wait() is passed the previously-filled request handle from the MPI_Irecv() call and a pointer to a status structure. Forgetting to wait for non-blocking communications to complete can result in indeterminate and undesirable, but often entertaining, behavior.
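A related routine, MPI_Test(), checks whether a posted request has completed without blocking. A common pattern, sketched below using the same request and status variables as Program 3C, is to poll the request and do other useful local work until the message actually arrives.

int done = 0;

/* Poll the outstanding receive; keep working until it completes */
MPI_Test(&request, &done, &status);
while (!done) {
   /* ... do other useful local work here ... */
   MPI_Test(&request, &done, &status);
}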
Output 3C shows the successful result obtained by running Program 3C.
Output 3C:

[forrest@beowulf001 prog3]$ mpicc -O -o prog3c prog3c.c
[forrest@beowulf001 prog3]$ mpirun -np 6 prog3c
I'm process 0 of 6
I'm process 2 of 6
2: No values were changed.
I'm process 3 of 6
3: No values were changed.
I'm process 5 of 6
5: No values were changed.
I'm process 4 of 6
4: No values were changed.
0: No values were changed.
I'm process 1 of 6
1: No values were changed.
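It is also worth knowing about MPI_Sendrecv(), which performs a blocking send and the matching receive as a single operation and is designed so that shift-style exchanges like the one in these examples do not deadlock. As a sketch only (using the same variables as Program 3C), the body of the exchange loop could be written as a single call:

/* Exchange arrays in one call: send sval to the higher-ranked neighbor
   and receive rval from any sender, both with message tag i */
MPI_Sendrecv(sval, ASIZE, MPI_INT, (me < (nprocs-1) ? (me+1) : 0), i,
   rval, ASIZE, MPI_INT, MPI_ANY_SOURCE, i,
   MPI_COMM_WORLD, &status);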
We have presented some of the theory behind parallel computing, discussed problem decomposition and code granularity, provided an introduction to the most frequently-used MPI message passing routines, and described many of the common pitfalls associated with using message passing. The conscientious programmer has much to consider when beginning to write parallel code, but the communication among processes is made much simpler through the use of MPI message passing routines.
Forrest M. Hoffman is a computer specialist at Oak Ridge National Laboratory in Oak Ridge, Tennessee.
William W. Hargrove is an ecologist and spatial modeler at Oak Ridge National Laboratory.