Computer Atlas

MPI Basics

Also known as: Message Passing Interface, MPI, parallel computing MPI

supplemental advanced technology 6 min read · Updated 2026-06-11

The standard API for parallel programs that run across many compute nodes, communicating by explicitly sending and receiving messages — the infrastructure behind most scientific supercomputing.

Primary domain
Concurrency & Parallelism
Sub-category
Concurrent & Parallel Computing

In simple terms

When a simulation needs 10,000 CPU cores working together, something must coordinate them. MPI (Message Passing Interface) is the standard that all of high-performance computing agreed on: each process has its own memory (no shared state), and processes communicate by explicitly calling MPI_Send and MPI_Recv. Write one program, launch it on N nodes, and each instance receives a unique rank and knows how to talk to the others.

The Visual Map

flowchart TB
  L["mpirun -n 4 ./sim"] --> R0 & R1 & R2 & R3
  subgraph world["MPI_COMM_WORLD"]
    R0["rank 0<br/>own memory"]
    R1["rank 1<br/>own memory"]
    R2["rank 2<br/>own memory"]
    R3["rank 3<br/>own memory"]
  end
  R0 <-->|MPI_Send / MPI_Recv| R1
  R1 <-->|nearest-neighbour exchange| R2
  R2 <--> R3
  R0 & R1 & R2 & R3 -->|MPI_Allreduce: sum| ALL["every rank gets the global sum"]

More detail

MPI is a library specification (not an implementation) — implementations include OpenMPI, MPICH, and vendor-optimised variants for supercomputer interconnects (Cray, IBM Spectrum MPI, Intel MPI). The programming model:

Point-to-point communication:

  • MPI_Send(buf, count, type, dest, tag, comm) — send to rank dest.
  • MPI_Recv(buf, count, type, src, tag, comm, status) — receive.
  • Non-blocking variants (MPI_Isend, MPI_Irecv) allow overlap of computation and communication.

Collective operations (all processes participate simultaneously):

  • MPI_Bcast — root process broadcasts to all.
  • MPI_Reduce / MPI_Allreduce — combine values from all processes (sum, max, etc.) and deliver to root or all.
  • MPI_Scatter / MPI_Gather — distribute a dataset across processes or collect it back.
  • MPI_Barrier — synchronise all processes at a point.

Communicators and topologies: MPI_COMM_WORLD contains all processes. Sub-communicators group related processes. Cartesian or graph topologies label processes with coordinates, enabling structured communication patterns (e.g. nearest-neighbour exchange in a 3D grid simulation).

One-sided communication (RMA): MPI_Put, MPI_Get, MPI_Accumulate — write directly to a remote memory window without the remote process calling a receive. Used for fine-grained irregular communication.

MPI is typically combined with OpenMP in a hybrid MPI+OpenMP model: one MPI process per node, multiple OpenMP threads per process — MPI handles inter-node communication; OpenMP handles intra-node parallelism on shared-memory cores. It is the de facto standard for HPC: weather forecasting, molecular dynamics, quantum chemistry, and climate modelling all run on MPI, and large-scale ML training uses collectives conceptually identical to MPI_Allreduce.

Under the Hood

The canonical first MPI program — SPMD (single program, multiple data): every rank runs identical code but branches on its rank:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // who am I? (0..size-1)
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many of us?

    int my_value = (rank + 1) * 10;         // each rank holds different data
    int total = 0;

    // every rank contributes my_value; every rank receives the sum
    MPI_Allreduce(&my_value, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: my=%d, global sum=%d\n", rank, size, my_value, total);

    MPI_Finalize();
    return 0;
}
// mpicc sum.c -o sum && mpirun -n 4 ./sum
//   rank 0 of 4: my=10, global sum=100   (10+20+30+40)
//   rank 1 of 4: my=20, global sum=100   ...

One source file, launched as N processes; rank is the only thing that differs between them, and MPI_Allreduce is a single call that internally runs a whole communication algorithm (often the ring below) to combine and redistribute.

Engineering Trade-offs

  • Explicit messages vs shared-memory ease. Forcing every exchange to be a Send/Recv makes data movement visible and tunable — the basis of HPC’s extreme scaling — at the cost of far more code and care than shared memory, where communication is implicit.
  • Communication is the scaling wall. Computation parallelises easily; the limit is interconnect latency and bandwidth. Past some node count, Allreduce cost dominates, which is why algorithm choice (ring vs tree vs hierarchical) and computation/communication overlap (Isend + compute) decide real-world efficiency.
  • Collectives vs hand-rolled point-to-point. Library collectives are tuned to the machine’s topology and almost always beat naive loops of sends — but they’re synchronisation points: one slow rank stalls everyone (the straggler problem that haunts large jobs).
  • Static SPMD vs elasticity. Classic MPI fixes the process count at launch and assumes no node fails — brutally efficient on dedicated supercomputers, a poor fit for the cloud’s elastic, failure-prone fleets, which is why cloud-native workloads favour different models.

Real-world examples

  • GROMACS and NAMD simulate molecular dynamics on thousands of CPUs using MPI for domain decomposition.
  • PyTorch distributed training uses dist.all_reduce (backed by NCCL on GPUs or Gloo on CPUs) — the same collective concept as MPI_Allreduce.
  • The TOP500 list of fastest supercomputers benchmarks systems with the HPL Linpack benchmark, which is a pure MPI workload.
  • WRF (Weather Research and Forecasting model) uses MPI to decompose the atmosphere across thousands of cores.

Common misconceptions

  • “MPI is only for Fortran and C.” MPI has bindings for Python (mpi4py), Java, and R. Many scientific Python packages (petsc4py, h5py with parallel HDF5) use MPI under the hood.
  • “Shared memory within a node makes MPI unnecessary.” For intra-node communication, shared memory is faster; that’s why hybrid MPI+OpenMP is preferred. MPI is for inter-node — shared memory doesn’t cross network cards.

Try it yourself

MPI_Allreduce isn’t magic — it’s the ring-allreduce algorithm (the same one NCCL uses for GPU training). Simulate it with plain processes and watch each rank end with the global sum:

python3 -c "
# ring-allreduce: P ranks, each starts with one value, all end with the sum.
# each rank only ever talks to its ring neighbour, for P-1 steps.
P = 4
start = [10, 20, 30, 40]
acc    = start[:]          # what each rank has accumulated so far
parcel = start[:]          # the value currently arriving at each rank

for step in range(P - 1):
    parcel = [parcel[(r - 1) % P] for r in range(P)]   # pass right around the ring
    acc    = [acc[r] + parcel[r] for r in range(P)]    # add what just arrived
    print(f'after step {step+1}: {acc}')

print('every rank holds the sum:', acc, '(=', sum(start), ')')
print('each rank sent', P - 1, 'messages, not', P * (P - 1), 'for all-to-all')
"

The point is the communication pattern: a ring means every rank moves the same amount of data regardless of P, which is exactly why ring-allreduce scales to thousands of GPUs where a naive all-to-all would choke.

Learn next

  • OpenMP — the intra-node, shared-memory half of hybrid HPC.
  • Actor model — message passing for latency-tolerant distributed apps.
  • Distributed system — the broader field MPI specialises for tightly-coupled compute.

Neighborhood

A visual companion to the relationships above. Click any node to visit that topic.