Review of MPI Message Passing
It is not safe to modify or use the application buffer until a non-blocking send has completed. It is the programmer's responsibility to ensure that the application buffer is free for reuse.
Non-blocking communications are primarily used to overlap computation with communication in order to achieve performance gains.
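As a reminder of the basic pattern, the sketch below is not taken from the tutorial's example codes; the buffer names, counts, and the do_independent_work() routine are illustrative placeholders. It overlaps independent computation with a non-blocking exchange and reuses the buffers only after MPI_Waitall completes:

```c
#include <mpi.h>

extern void do_independent_work(void);   /* placeholder: work that touches neither buffer */

/* Post the exchange, compute while it proceeds, then wait before reusing the buffers. */
void exchange(double *sendbuf, double *recvbuf, int n, int partner)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    do_independent_work();               /* overlap: communication can progress during this */

    MPI_Waitall(2, reqs, stats);         /* both buffers are safe to reuse only after this */
}
```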
Blocking Point-to-Point Routines

Routine | Description
---|---
MPI_Bsend | Buffered send
MPI_Recv | Receive
MPI_Rsend | Ready send
MPI_Send | Standard send
MPI_Sendrecv | Combined send and receive
MPI_Sendrecv_replace | Combined send and receive using a common buffer
MPI_Ssend | Synchronous send
Non-Blocking Point-to-Point Routines

Routine | Description
---|---
MPI_Ibsend | Buffered send
MPI_Irecv | Receive
MPI_Irsend | Ready send
MPI_Isend | Standard send
MPI_Issend | Synchronous send
Persistent Communications Point-to-Point Routines

Routine | Description
---|---
MPI_Bsend_init | Creates a persistent buffered send request
MPI_Recv_init | Creates a persistent receive request
MPI_Rsend_init | Creates a persistent ready send request
MPI_Send_init | Creates a persistent standard send request
MPI_Ssend_init | Creates a persistent synchronous send request
MPI_Start | Activates a persistent request operation
MPI_Startall | Activates a collection of persistent request operations
Completion / Testing Point-to-Point Routines

Routine | Description
---|---
MPI_Iprobe | Non-blocking query for a message's arrival
MPI_Probe | Blocking query for a message's arrival
MPI_Test, MPI_Testall, MPI_Testany, MPI_Testsome | Non-blocking tests for the completion of non-blocking operation request(s)
MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome | Blocking waits for the completion of non-blocking operation request(s)
Collective Communication Routines

Routine | Description
---|---
MPI_Allgather | Gathers individual messages from each process in the communicator and distributes the resulting message to each process.
MPI_Allgatherv | Same as MPI_Allgather but allows for messages to be of different sizes and displacements.
MPI_Allreduce | Performs the specified reduction operation across all tasks in the communicator and then distributes the result to all tasks.
MPI_Alltoall | Sends a distinct message from each process to every process.
MPI_Alltoallv | Same as MPI_Alltoall but allows for messages to be of different sizes and displacements.
MPI_Barrier | Creates a barrier synchronization in the communicator.
MPI_Bcast | Broadcasts a message from one process to all other processes in the communicator.
MPI_Gather | Gathers distinct messages from each task in the communicator to a single destination task.
MPI_Gatherv | Same as MPI_Gather but allows for messages to be of different sizes and displacements.
MPI_Reduce | Performs a reduction operation across all tasks in the communicator and places the result in a single task.
MPI_Reduce_scatter | First does an element-wise reduction on a vector across all tasks in the group. The result vector is then split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
MPI_Scan | Performs a parallel prefix reduction on data distributed across the communicator.
MPI_Scatter | Distributes distinct messages from a single source task to each task in the group.
MPI_Scatterv | Same as MPI_Scatter but allows for messages to be of different sizes and displacements.
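As a brief illustration of two of the collective routines listed above (not taken from the tutorial; the value of param and the partial computation are arbitrary), rank 0 broadcasts a parameter and then collects a sum with MPI_Reduce:

```c
/* Sketch: broadcast a parameter from rank 0, then reduce partial results back to rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double param = 0.0, partial, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) param = 3.14;                 /* illustrative value */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    partial = param * rank;                      /* each task computes a partial result */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```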
Factors Affecting MPI Performance

- Platform / Architecture Related
- Network Related
- Application Related
- MPI Implementation Related
Message Buffering

Implementations Differ:

- MPI_Buffer_attach - Allocates user buffer space
- MPI_Buffer_detach - Frees user buffer space
- MPI_Bsend - Buffered send, blocking
- MPI_Ibsend - Buffered send, non-blocking
Advantages:
Disadvantages:
Correctness:
Implementation Notes:
For IBM's MPI, using a buffered send will not resolve the receive buffer exhaustion problems in the unsafe example program shown above.
The environment variable MP_BUFFER_MEM allows the user to specify how much system buffer space is available for the receive process. Defaults are IP = 2.8 MB and US = 64 MB. The maximum which may be specified is 64 MB. Using this with MP_EAGER_LIMIT may improve performance for some programs (discussed in the next section).
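To make the buffer management routines listed above concrete, here is a minimal sketch (the message count, length, and datatype are assumptions for illustration, not code from the tutorial) of attaching a user buffer, issuing buffered sends, and detaching:

```c
/* Sketch: attach a user buffer, post buffered sends, then detach. */
#include <mpi.h>
#include <stdlib.h>

#define NMSGS  10
#define MSGLEN 1000                              /* illustrative sizes */

void buffered_sends(double msg[NMSGS][MSGLEN], int dest)
{
    int   bufsize = NMSGS * (MSGLEN * sizeof(double) + MPI_BSEND_OVERHEAD);
    char *buf = malloc(bufsize);
    int   i;

    MPI_Buffer_attach(buf, bufsize);
    for (i = 0; i < NMSGS; i++)
        MPI_Bsend(msg[i], MSGLEN, MPI_DOUBLE, dest, i, MPI_COMM_WORLD);

    /* Detach blocks until all buffered messages have been delivered. */
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
}
```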
MPI Message Passing Protocols
Two Common Message Passing Protocols: eager and rendezvous (discussed below)
Implementations Differ:
MPI Message Passing Protocols: Eager Protocol
Potential Advantages:
Potential Disadvantages:
Implementation Notes:
The eager protocol message size is set with the MP_EAGER_LIMIT environment variable. Its default values guarantee that at least 32 messages can be outstanding between any two tasks.
The maximum user-specified MP_EAGER_LIMIT value is 64K bytes, and may require increasing the default receive buffer size with MP_BUFFER_MEM. See Buffering for details.
MPI Message Passing Protocols: Rendezvous Protocol
Potential Advantages:
Potential Disadvantages:
MPI Message Passing Protocols: Sender-Receiver Synchronization (Polling vs. Interrupt)
Implementation Dependent:
Polling Mode:
Interrupt Mode:
Sometimes Polling is Better:
Sometimes Interrupt is Better:
At time t=1, each task executes:

    DO LOOP
        MPI_IRECV (B ... request[0])
        MPI_ISEND (A ... request[1])
        MPI_WAIT (request[0])
        COMPUTE LOOP1 (uses B)
        MPI_WAIT (request[1])
        COMPUTE LOOP2 (modifies A)
    ENDDO
In the above scenario, using interrupt mode would allow task 1 to be interrupted during t=2, while it is computing in LOOP1. Its data could then be sent to task 0, allowing that task to continue working sooner.
Implementation Notes:
Setting the MP_CSS_INTERRUPT environment variable determines whether or not interrupts from the high-performance switch are enabled at MPI initialization. Valid values are yes and no.
Setting MP_CSS_INTERRUPT to yes allows the process to be interrupted earlier if a message arrives or can be sent. If not set, the default is no. This environment variable has no associated command-line flag.
The default interval for handling communication calls is 400 milliseconds.
Message Size
Example code for message size timing results
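The linked example is not reproduced here; as a rough indication of the approach, the sketch below (the REPS count and use of MPI_CHAR are assumptions, not the tutorial's code) times round trips for a given message size between ranks 0 and 1 using MPI_Wtime:

```c
/* Sketch: time REPS round trips for a given message size between ranks 0 and 1. */
#include <mpi.h>

#define REPS 1000                                /* illustrative repetition count */

double time_roundtrip(char *buf, int nbytes, int rank)
{
    MPI_Status stat;
    double t0;
    int i;

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &stat);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / REPS;            /* average round-trip time in seconds */
}
```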
Point-to-Point Communications

- Send routines (match any receive or probe; non-blocking sends can match any completion/testing routine)
- Receive routines (match any send)
- Probe routines (match any send; see the sketch below)
- Completion / testing routines (match any non-blocking send/receive)
Example code for point-to-point routines timing results
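As a side note on the probe routines listed above, the following sketch (the source rank, tag, and MPI_DOUBLE payload are hypothetical) uses MPI_Probe with MPI_Get_count to receive a message whose length is not known in advance:

```c
/* Sketch: probe for an incoming message, size the receive buffer, then receive it. */
#include <mpi.h>
#include <stdlib.h>

void recv_unknown_size(int src, int tag)
{
    MPI_Status stat;
    int count;
    double *buf;

    MPI_Probe(src, tag, MPI_COMM_WORLD, &stat);   /* blocks until a matching message arrives */
    MPI_Get_count(&stat, MPI_DOUBLE, &count);     /* how many doubles were sent */

    buf = malloc(count * sizeof(double));
    MPI_Recv(buf, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &stat);
    /* ... use buf ... */
    free(buf);
}
```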
Persistent Communications

Step 1: Create persistent requests
The desired routine is called to set up the buffer location(s) that will be sent or received. The five available routines are:
Routine | Description
---|---
MPI_Recv_init | Creates a persistent receive request
MPI_Bsend_init | Creates a persistent buffered send request
MPI_Rsend_init | Creates a persistent ready send request
MPI_Send_init | Creates a persistent standard send request
MPI_Ssend_init | Creates a persistent synchronous send request
Step 2: Start communication transmission
Data transmission is begun by calling either of the MPI_Start routines.
Routine | Description
---|---
MPI_Start | Activates a persistent request operation
MPI_Startall | Activates a collection of persistent request operations
Step 3: Wait for communication completion
Because persistent operations are non-blocking, the appropriate MPI_Wait or MPI_Test routine must be used to ensure their completion.
Step 4: Deallocate persistent request objects
When there is no longer a need for persistent communications, the programmer should explicitly free the persistent request objects by using the MPI_Request_free() routine.
    MPI_Recv_init (&rbuff, n, MPI_CHAR, src, tag, comm, &reqs[0]);
    MPI_Send_init (&sbuff, n, MPI_CHAR, dest, tag, comm, &reqs[1]);

    for (i = 1; i <= REPS; i++) {
        ...
        MPI_Startall (2, reqs);
        ...
        MPI_Waitall (2, reqs, stats);
        ...
    }

    MPI_Request_free (&reqs[0]);
    MPI_Request_free (&reqs[1]);
Example code using persistent communications
Comparison code using MPI_Irecv and MPI_Isend
Collective Communications

Implementation Notes:
For IBM's MPI, non-blocking collective communication routines are described in the MPI Subroutine Reference.
Derived Datatypes

    /* Some declarations */
    typedef struct {
        float f1, f2, f3, f4;
        int   i1, i2;
    } f4i2;

    f4i2 rbuff, sbuff;
    MPI_Datatype newtype, oldtypes[2];
    int blockcounts[2];
    MPI_Aint offsets[2], extent;
    MPI_Status stat;
    ....

    /* Setup MPI structured type for the 4 floats and 2 ints */
    offsets[0] = 0;
    oldtypes[0] = MPI_FLOAT;
    blockcounts[0] = 4;

    MPI_Type_extent(MPI_FLOAT, &extent);
    offsets[1] = 4 * extent;
    oldtypes[1] = MPI_INT;
    blockcounts[1] = 2;

    MPI_Type_struct(2, blockcounts, offsets, oldtypes, &newtype);
    MPI_Type_commit(&newtype);
    ...

    /* Send/Receive 4 floats and 2 ints as a single element of derived datatype */
    for (i = 1; i <= REPS; i++) {
        MPI_Send(&sbuff, 1, newtype, 1, tag, MPI_COMM_WORLD);
        MPI_Recv(&rbuff, 1, newtype, 1, tag, MPI_COMM_WORLD, &stat);
    }
    ...

    /* Send/Receive 4 floats and then 2 ints individually */
    for (i = 1; i <= REPS; i++) {
        MPI_Send(&sbuff.f1, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
        MPI_Send(&sbuff.i1, 2, MPI_INT, 1, tag, MPI_COMM_WORLD);
        MPI_Recv(&rbuff.f1, 4, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &stat);
        MPI_Recv(&rbuff.i1, 2, MPI_INT, 1, tag, MPI_COMM_WORLD, &stat);
    }
The simple example code provided below demonstrated a performance improvement of between 26% [1] and 38% [3] when a derived datatype was used instead of individual send/receive operations.
Derived datatype vs. individual sends/receives example code
Non-Contiguous Datatype Performance

Method | IBM 604e Bandwidth (MB/sec) | IBM POWER3 NH-2 Bandwidth (MB/sec)
---|---|---
MPI_Type_vector | 12.1 | 38.6
MPI_Type_struct | 9.6 | 38.6
User pack/unpack | 20.2 | 24.7
Individual send/receive | 0.3 | 0.4
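For comparison with the MPI_Type_struct example shown earlier, the sketch below (array dimensions are assumed, not from the tutorial) builds a strided MPI_Type_vector describing one column of a two-dimensional array, which is the kind of non-contiguous pattern measured in the table above:

```c
/* Sketch: send one column of a ROWS x COLS array as a single strided datatype. */
#include <mpi.h>

#define ROWS 100
#define COLS 100                                 /* illustrative dimensions */

void send_column(double a[ROWS][COLS], int col, int dest)
{
    MPI_Datatype coltype;

    /* ROWS blocks of 1 element each, separated by a stride of COLS elements */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    MPI_Send(&a[0][col], 1, coltype, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&coltype);
}
```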
Network Contention

SP Switch Network Contention

# Processors | Bandwidth (MB/sec)
---|---
2 | 34
4 | 34
8 | 34
16 | 31
32 | 25
64 | 22
The above results were reported by William Gropp and Ewing Lusk in their Supercomputing '96 tutorial, "Tuning MPI Applications for Peak Performance".
IBM SP Specific Factors

Type of Switch and Switch Adapter:

Switch Type | Peak Bi-directional Switch Bandwidth | Adapter Type | Comments
---|---|---|---
HPS Switch | 80 MB/s per node | HiPS-1, HiPS-2 | No longer supported in IBM's parallel system software. Primarily seen on older POWER2 SP systems.
SP Switch | 300 MB/s per node | SPS | For MCA nodes (P2SC)
 | | SPS MX | For PowerPC (604e) nodes
 | | SPS MX2 | For POWER3 nodes
SP Switch2 | 1 GB/s per node | SPS2 (Colony) | For POWER3 nodes
Federation | 4 GB/s per node | | For POWER4 nodes
User Space vs. IP Communications:

Switch / Adapter / Node | Protocol | Latency (usec) | Pt-to-Pt Bandwidth (MB/sec)
---|---|---|---
SP Switch, SPS MX adapter, 604e 332 MHz node | IP | 138 | 33
SP Switch, SPS MX adapter, 604e 332 MHz node | US | 28 | 84
SP Switch2, SPS2 adapter, double-single POWER3 NH-2 375 MHz node | IP | 105 | 77
SP Switch2, SPS2 adapter, double-single POWER3 NH-2 375 MHz node | US | 20 | 390

Bandwidth is measured between two MPI tasks on different nodes.
Communications Network Used:

- MP_EUIDEVICE=csss - double adapter nodes
- MP_EUIDEVICE=css0 - all other node types
Number of MPI Tasks on an SMP Node:

SMP Processor Type | Number of MPI Tasks Per Node | Avg Per Task Bandwidth (MB/sec) | Avg Aggregate Bandwidth (MB/sec)
---|---|---|---
POWER3 375 MHz (16 cpus) [2] | 1 | 371 | 371
 | 2 | 304 | 608
 | 3 | 257 | 771
 | 4 | 210 | 840
 | 5 | 175 | 875
 | 6 | 161 | 966
 | 7 | 143 | 1001
 | 8 | 133 | 1064
 | 9 | 117 | 1053
 | 10 | 107 | 1070
 | 11 | 88 | 968
 | 12 | 88 | 1056
 | 13 | 81 | 1053
 | 14 | 74 | 1036
 | 15 | 68 | 1020
 | 16 | 62 | 992
604e 332 MHz (4 cpus) [3] | 1 | 84 | 84
 | 2 | 67 | 134
 | 3 | 45 | 135
 | 4 | 32 | 128
Use of MP_SHARED_MEMORY Environment Variable:

SMP Processor Type | Number of MPI Tasks Per Node | Per Task Bandwidth (MB/sec) with MP_SHARED_MEMORY=yes
---|---|---
604e 332 MHz | 4 | 75 - 85
POWER3 375 MHz | 16 | 257 - 283
POWER4 1.3 GHz | 32 | ~500 - 1300
Miscellaneous Environment Variables:
This completes the tutorial.
Please complete the online evaluation form.
Notes
[1] Timing results were obtained on LLNL's ASCI White system using two IBM SP nodes. Each node was a 16-way SMP, 375 MHz, Nighthawk II with 16 GB of memory, running under AIX 4.3 and PSSP 3.3 software. All communications were conducted over the SP Switch with User Space protocol. Executions were performed in a production batch system with one MPI task per node.

[2] Timing results were obtained on LLNL's ASCI White system using two IBM SP nodes. Each node was a 16-way SMP, 375 MHz, Nighthawk II with 16 GB of memory, running under AIX 4.3 and PSSP 3.3 software. All communications were conducted over the SP Switch with User Space protocol. Executions were performed in a production batch system with 1-16 MPI tasks per node.

[3] Timing results were obtained on LLNL's ASCI Blue system using two IBM SP nodes. Each node was a 4-way SMP, 332 MHz, 604e with 1.5 GB of memory, running under AIX 4.3 and PSSP 3.2 software. All communications were conducted over the SP Switch with User Space protocol. Executions were performed in a production batch system.