MPI Basics
Also known as: Message Passing Interface, MPI, parallel computing MPI
The standard API for parallel programs that run across many compute nodes, communicating by explicitly sending and receiving messages — the infrastructure behind most scientific supercomputing.
- Primary domain
- Concurrency & Parallelism
- Sub-category
- Concurrent & Parallel Computing
In simple terms
When a simulation needs 10,000 CPU cores working together, something must coordinate them. MPI (Message Passing Interface) is the standard that all of high-performance computing agreed on: each process has its own memory (no shared state), and processes communicate by explicitly calling MPI_Send and MPI_Recv. Write one program, launch it on N nodes, and each instance receives a unique rank and knows how to talk to the others.
The Visual Map
flowchart TB
L["mpirun -n 4 ./sim"] --> R0 & R1 & R2 & R3
subgraph world["MPI_COMM_WORLD"]
R0["rank 0<br/>own memory"]
R1["rank 1<br/>own memory"]
R2["rank 2<br/>own memory"]
R3["rank 3<br/>own memory"]
end
R0 <-->|MPI_Send / MPI_Recv| R1
R1 <-->|nearest-neighbour exchange| R2
R2 <--> R3
R0 & R1 & R2 & R3 -->|MPI_Allreduce: sum| ALL["every rank gets the global sum"]
More detail
MPI is a library specification (not an implementation) — implementations include OpenMPI, MPICH, and vendor-optimised variants for supercomputer interconnects (Cray, IBM Spectrum MPI, Intel MPI). The programming model:
Point-to-point communication:
MPI_Send(buf, count, type, dest, tag, comm)— send to rankdest.MPI_Recv(buf, count, type, src, tag, comm, status)— receive.- Non-blocking variants (
MPI_Isend,MPI_Irecv) allow overlap of computation and communication.
Collective operations (all processes participate simultaneously):
MPI_Bcast— root process broadcasts to all.MPI_Reduce / MPI_Allreduce— combine values from all processes (sum, max, etc.) and deliver to root or all.MPI_Scatter / MPI_Gather— distribute a dataset across processes or collect it back.MPI_Barrier— synchronise all processes at a point.
Communicators and topologies: MPI_COMM_WORLD contains all processes. Sub-communicators group related processes. Cartesian or graph topologies label processes with coordinates, enabling structured communication patterns (e.g. nearest-neighbour exchange in a 3D grid simulation).
One-sided communication (RMA): MPI_Put, MPI_Get, MPI_Accumulate — write directly to a remote memory window without the remote process calling a receive. Used for fine-grained irregular communication.
MPI is typically combined with OpenMP in a hybrid MPI+OpenMP model: one MPI process per node, multiple OpenMP threads per process — MPI handles inter-node communication; OpenMP handles intra-node parallelism on shared-memory cores. It is the de facto standard for HPC: weather forecasting, molecular dynamics, quantum chemistry, and climate modelling all run on MPI, and large-scale ML training uses collectives conceptually identical to MPI_Allreduce.
Under the Hood
The canonical first MPI program — SPMD (single program, multiple data): every rank runs identical code but branches on its rank:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank); // who am I? (0..size-1)
MPI_Comm_size(MPI_COMM_WORLD, &size); // how many of us?
int my_value = (rank + 1) * 10; // each rank holds different data
int total = 0;
// every rank contributes my_value; every rank receives the sum
MPI_Allreduce(&my_value, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf("rank %d of %d: my=%d, global sum=%d\n", rank, size, my_value, total);
MPI_Finalize();
return 0;
}
// mpicc sum.c -o sum && mpirun -n 4 ./sum
// rank 0 of 4: my=10, global sum=100 (10+20+30+40)
// rank 1 of 4: my=20, global sum=100 ...
One source file, launched as N processes; rank is the only thing that differs between them, and MPI_Allreduce is a single call that internally runs a whole communication algorithm (often the ring below) to combine and redistribute.
Engineering Trade-offs
- Explicit messages vs shared-memory ease. Forcing every exchange to be a
Send/Recvmakes data movement visible and tunable — the basis of HPC’s extreme scaling — at the cost of far more code and care than shared memory, where communication is implicit. - Communication is the scaling wall. Computation parallelises easily; the limit is interconnect latency and bandwidth. Past some node count,
Allreducecost dominates, which is why algorithm choice (ring vs tree vs hierarchical) and computation/communication overlap (Isend+ compute) decide real-world efficiency. - Collectives vs hand-rolled point-to-point. Library collectives are tuned to the machine’s topology and almost always beat naive loops of sends — but they’re synchronisation points: one slow rank stalls everyone (the straggler problem that haunts large jobs).
- Static SPMD vs elasticity. Classic MPI fixes the process count at launch and assumes no node fails — brutally efficient on dedicated supercomputers, a poor fit for the cloud’s elastic, failure-prone fleets, which is why cloud-native workloads favour different models.
Real-world examples
- GROMACS and NAMD simulate molecular dynamics on thousands of CPUs using MPI for domain decomposition.
- PyTorch distributed training uses
dist.all_reduce(backed by NCCL on GPUs or Gloo on CPUs) — the same collective concept asMPI_Allreduce. - The TOP500 list of fastest supercomputers benchmarks systems with the HPL Linpack benchmark, which is a pure MPI workload.
- WRF (Weather Research and Forecasting model) uses MPI to decompose the atmosphere across thousands of cores.
Common misconceptions
- “MPI is only for Fortran and C.” MPI has bindings for Python (mpi4py), Java, and R. Many scientific Python packages (petsc4py, h5py with parallel HDF5) use MPI under the hood.
- “Shared memory within a node makes MPI unnecessary.” For intra-node communication, shared memory is faster; that’s why hybrid MPI+OpenMP is preferred. MPI is for inter-node — shared memory doesn’t cross network cards.
Try it yourself
MPI_Allreduce isn’t magic — it’s the ring-allreduce algorithm (the same one NCCL uses for GPU training). Simulate it with plain processes and watch each rank end with the global sum:
python3 -c "
# ring-allreduce: P ranks, each starts with one value, all end with the sum.
# each rank only ever talks to its ring neighbour, for P-1 steps.
P = 4
start = [10, 20, 30, 40]
acc = start[:] # what each rank has accumulated so far
parcel = start[:] # the value currently arriving at each rank
for step in range(P - 1):
parcel = [parcel[(r - 1) % P] for r in range(P)] # pass right around the ring
acc = [acc[r] + parcel[r] for r in range(P)] # add what just arrived
print(f'after step {step+1}: {acc}')
print('every rank holds the sum:', acc, '(=', sum(start), ')')
print('each rank sent', P - 1, 'messages, not', P * (P - 1), 'for all-to-all')
"
The point is the communication pattern: a ring means every rank moves the same amount of data regardless of P, which is exactly why ring-allreduce scales to thousands of GPUs where a naive all-to-all would choke.
Learn next
- OpenMP — the intra-node, shared-memory half of hybrid HPC.
- Actor model — message passing for latency-tolerant distributed apps.
- Distributed system — the broader field MPI specialises for tightly-coupled compute.
Relationships
- Requires
- Related
Neighborhood
A visual companion to the relationships above. Click any node to visit that topic.