OpenMP
Also known as: Open Multi-Processing, shared memory parallelism
A set of compiler directives and library routines for shared-memory parallelism in C, C++, and Fortran — add a pragma above a loop and the compiler parallelises it across all cores on one machine.
- Primary domain: Concurrency & Parallelism
- Sub-category: Multithreading & Multiprocessing
In simple terms
OpenMP makes it almost trivial to parallelise a loop in C or C++ (Fortran has the equivalent !$omp parallel do directive): add #pragma omp parallel for above a for loop, compile with -fopenmp, and the OpenMP runtime starts a team of threads (by default usually one per logical core) and divides the loop iterations among them. No mutexes, no thread creation code, no work queue: just the pragma. For compute-bound loops over independent iterations, this can achieve near-linear speedup with one added line and one compiler flag.
The Visual Map
flowchart TB
M["main thread (serial)"] --> F{"#pragma omp parallel for"}
F -->|fork| T0["thread 0<br/>i = 0..249"]
F --> T1["thread 1<br/>i = 250..499"]
F --> T2["thread 2<br/>i = 500..749"]
F --> T3["thread 3<br/>i = 750..999"]
T0 & T1 & T2 & T3 -->|"join + reduction(+:sum)"| J["combine private partials<br/>-> one result"]
J --> S["main thread (serial)"]
More detail
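OpenMP is a directive-based extension: the compiler sees the directives and generates the threading code. A compiler without OpenMP support (or with it switched off) ignores the pragmas in C and C++, treats the directives as comments in Fortran, and compiles serial code, provided the program does not call omp_* runtime functions unguarded. This portability is intentional.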
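A minimal sketch of that portability, assuming two hypothetical helpers (`scale_in_place` and `built_with_openmp`, neither from the original text). The same source is meant to build with or without `-fopenmp`; `_OPENMP` is the macro the OpenMP specification defines when OpenMP is enabled.

```c
/* Hypothetical helper: multiply every element of v by k.
   With OpenMP enabled the iterations are shared among threads;
   without it the pragma is ignored and the loop runs serially. */
void scale_in_place(double *v, int n, double k) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= k;
}

/* Returns 1 when this file was compiled with OpenMP enabled, else 0. */
int built_with_openmp(void) {
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

Either build should produce the same array contents; only the elapsed time differs.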
Core constructs:
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += a[i] * b[i]; // dot product, parallelised
}
- parallel: fork a team of threads; join at the end of the block.
- for: distribute loop iterations across the team (requires no loop-carried dependencies).
- reduction(+:sum): each thread accumulates into a private copy of sum; copies are combined at the end.
- private(x): each thread gets its own copy of x.
- shared(y): all threads share y (the default); reads are fine, concurrent writes need synchronisation.
- critical: a mutex-protected section; all threads may execute it, one at a time.
- atomic: cheaper than critical for simple operations (x++, x += v).
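The data-sharing clauses can be sketched together in one small routine. This is an illustration under assumptions: `histogram` and `NUM_BINS` are hypothetical names, and the input values are assumed to lie in [0, 1).

```c
#define NUM_BINS 4

/* Hypothetical helper: count how many values fall into each of
   NUM_BINS equal-width bins over [0, 1). */
void histogram(const double *x, int n, long bins[NUM_BINS]) {
    for (int k = 0; k < NUM_BINS; k++)
        bins[k] = 0;

    int b;  /* scratch index: private, or threads would overwrite each other's value */
    #pragma omp parallel for private(b) shared(bins, x)
    for (int i = 0; i < n; i++) {
        b = (int)(x[i] * NUM_BINS);
        if (b < 0) b = 0;                      /* clamp out-of-range input */
        if (b >= NUM_BINS) b = NUM_BINS - 1;
        /* bins is shared and several threads may hit the same bin,
           so the increment has to be indivisible. */
        #pragma omp atomic
        bins[b]++;
    }
}
```

With many threads and few bins the atomic increments contend with each other; giving each thread its own bin array and merging at the end is the usual next step.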
Scheduling: schedule(static) divides iterations evenly upfront; schedule(dynamic) hands out chunks on demand, each thread fetching the next chunk when it finishes its current one, which is better for irregular-cost iterations.
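A small sketch of a loop where the schedule matters, assuming hypothetical helpers `is_prime` and `count_primes`. Trial division gets more expensive as the candidate grows, so an even static split would leave the thread holding the largest numbers with the most work.

```c
/* Trial-division primality test: cost grows with k, so iterations are uneven. */
static int is_prime(long k) {
    if (k < 2) return 0;
    for (long d = 2; d * d <= k; d++)
        if (k % d == 0) return 0;
    return 1;
}

/* Hypothetical helper: count the primes in [0, limit). */
long count_primes(long limit) {
    long count = 0;
    /* dynamic, 64: a thread that runs out of work fetches the next
       64 iterations. The chunk size is illustrative, not tuned. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (long i = 0; i < limit; i++)
        count += is_prime(i);
    return count;
}
```

Swapping the clause to `schedule(static)` leaves the result unchanged; only the load balance across threads differs.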
Task parallelism: #pragma omp task creates an explicit task (useful for recursive or graph-structured work like tree traversal that doesn’t fit a parallel for loop).
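As a rough sketch of the tree-traversal case, assuming a hypothetical `struct node` and helper names chosen only for illustration: each child subtree becomes a task, and `taskwait` holds the parent until both children have finished.

```c
#include <stddef.h>

/* Hypothetical binary-tree node used only for this sketch. */
struct node {
    long value;
    struct node *left, *right;
};

/* Sum a subtree; each child subtree becomes a task that any
   thread in the team may pick up. */
static long sum_subtree(const struct node *t) {
    if (t == NULL) return 0;
    long l = 0, r = 0;
    #pragma omp task shared(l)
    l = sum_subtree(t->left);
    #pragma omp task shared(r)
    r = sum_subtree(t->right);
    /* Wait for both child tasks before reading l and r. */
    #pragma omp taskwait
    return t->value + l + r;
}

long tree_sum(const struct node *root) {
    long total = 0;
    #pragma omp parallel
    {
        /* One thread seeds the recursion; the others execute tasks. */
        #pragma omp single
        total = sum_subtree(root);
    }
    return total;
}
```

A production version would normally stop creating tasks below some subtree size, since a task per node costs more than the addition it performs.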
SIMD integration: #pragma omp simd asks the compiler to vectorise a loop with SIMD instructions; the combined form #pragma omp parallel for simd applies that vectorisation in addition to thread-level parallelism.
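A minimal sketch of stacking the two levels, assuming a hypothetical `saxpy` helper and that `x` and `y` do not overlap in memory:

```c
/* Hypothetical helper: y[i] = a * x[i] + y[i]. */
void saxpy(int n, float a, const float *x, float *y) {
    /* parallel for: split the index range across threads.
       simd: ask the compiler to vectorise each thread's share. */
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether the loop is actually vectorised depends on the compiler, optimisation level, and target; the directive is a request rather than a guarantee of speedup.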
OpenMP is typically combined with MPI in hybrid programs: OpenMP parallelises across cores on one node; MPI parallelises across nodes. It reduces the barrier to shared-memory parallelism from “implement a thread pool” to “add a pragma”, and is the shared-memory parallel model most scientists learn first.
Under the Hood
The reduction clause is where OpenMP earns its keep — it silently solves the race that a naive sum += from many threads would create:
#include <stdio.h>
#include <omp.h>
int main(void) {
    long n = 100000000, sum = 0;
    // WITHOUT reduction this would be a data race: every thread
    // read-modify-writes the same `sum`. The clause gives each thread
    // a PRIVATE partial sum, then adds them once at the join.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += i;
    // omp_get_max_threads() reports the team size a parallel region may use.
    printf("sum 1..%ld = %ld (max threads: %d)\n",
           n, sum, omp_get_max_threads());
    return 0;
}
// gcc -fopenmp sum.c -o sum && ./sum
// the compiler turned one pragma into: fork team, split the range,
// per-thread accumulation, barrier, combine. No lock in sight.
What looks like one annotated loop becomes, through the compiler and its OpenMP runtime library, a thread team, range partitioning, private accumulators, a join barrier, and the final combine: the threading boilerplate you’d otherwise hand-write, generated by the compiler and managed at run time by the library.
Engineering Trade-offs
- Incremental parallelism vs hidden ceilings. One pragma can near-linearly speed up a clean loop — but only if iterations are truly independent and the work is compute-bound. A loop-carried dependency makes the pragma silently wrong; a memory-bandwidth-bound loop makes it silently useless past a couple of cores.
- reduction/private vs critical. Reductions and private variables give each thread its own state and combine cheaply at the end; falling back to critical/atomic serialises threads at that point and can erase the speedup. Choosing the right data-sharing clause is the performance work.
- Static vs dynamic scheduling. Static splitting has almost no runtime overhead and good cache locality for uniform iterations; dynamic scheduling handles irregular per-iteration cost but adds synchronisation every time a thread fetches a chunk. The wrong choice shows up as idle cores waiting on one slow chunk.
- Shared-memory only — by design. All threads share one address space, which is what makes it so simple and confines it to a single node. Scaling past one machine requires MPI; the hybrid model exists precisely because neither alone covers both axes.
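The loop-carried dependency from the first trade-off can be sketched with two hypothetical helpers, `fill_serial` and `fill_parallel`. The first carries a running value from one iteration to the next and must stay serial; the second computes the same quantity from the loop index alone, which makes the iterations independent.

```c
/* Serial only: every iteration needs the x left behind by the
   previous one, so a `parallel for` here would give wrong results. */
void fill_serial(double *out, int n, double start, double step) {
    double x = start;
    for (int i = 0; i < n; i++) {
        out[i] = x;
        x += step;  /* loop-carried dependency */
    }
}

/* Same sequence computed from the index: no iteration reads what
   another wrote, so the pragma is safe. */
void fill_parallel(double *out, int n, double start, double step) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = start + i * step;
}
```

For steps that are not exactly representable in binary floating point, the two versions may differ in the last bits, because repeated addition and a single multiplication round differently.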
Real-world examples
- GROMACS molecular dynamics simulator uses OpenMP for within-node parallelism alongside MPI for between-node.
- GCC and Clang both support OpenMP; -fopenmp is a one-flag change.
- FFTW (the fast Fourier transform library used everywhere) supports OpenMP for threaded transforms.
- NumPy’s BLAS backends (OpenBLAS, MKL) can use OpenMP threads internally for matrix operations, depending on how the library was built.
Common misconceptions
- “OpenMP scales to multiple machines.” OpenMP is shared-memory only — all threads share one address space. For multi-node scaling, use MPI.
- “Just add the pragma and it gets faster.” Only if iterations are independent (no loop-carried dependencies) and work is compute-bound rather than memory-bound. Parallelising a memory-bandwidth-limited loop across 16 cores does nothing if RAM is the bottleneck.
Try it yourself
Compile a real OpenMP program and measure the speedup as you add threads (the gcc packages on Ubuntu/WSL include OpenMP support, so no extra library is needed):
# requires: gcc
cat > /tmp/omp.c <<'EOF'
#include <stdio.h>
#include <omp.h>
int main(void) {
    long n = 500000000, sum = 0;
    double t = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++) sum += i % 7;
    printf("threads=%d sum=%ld time=%.3fs\n",
           omp_get_max_threads(), sum, omp_get_wtime() - t);
    return 0;
}
EOF
gcc -fopenmp -O2 /tmp/omp.c -o /tmp/omp
for t in 1 2 4 8; do OMP_NUM_THREADS=$t /tmp/omp; done
Watch the time roughly halve from 1 to 2 to 4 threads, then flatten as you run out of physical cores or hit memory bandwidth — the practical ceiling every parallel speedup eventually meets.
Learn next
- MPI basics — the across-nodes counterpart in hybrid HPC.
- Thread — the OS primitive OpenMP manages for you.
- SIMD — data-level parallelism that stacks on top of OpenMP’s thread-level.