Low-Latency Systems

Engineering for sub-millisecond execution — cache-aware data layout, lock-free concurrency, and the hardware-software co-design that squeezes the last microseconds out of a machine.

Most software is fast enough. Some is not: trading engines, game loops, packet processors, and real-time audio all live or die by the microsecond. Low-Latency Systems covers the techniques that get there — laying data out to match the cache, avoiding locks and unpredictable branches, pinning work to cores, and bypassing the kernel when it gets in the way.

This category extends the CS2023 SPD (Systems / Parallel & Distributed) area toward the high-performance-computing track. It treats performance as a first-class design constraint rather than an afterthought.

Core

The essentials. Start here.

Cache-Line Alignment

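Laying data out to match the CPU's cache lines (typically 64 bytes), so that fields accessed together share a line and fields written by different threads do not. The processor then wastes fewer fetches, and cores stop contending for ownership of the same line (false sharing).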

core advanced concept
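The layout idea described above can be sketched in a few lines of C++. This is a minimal illustration, not a definitive implementation: the 64-byte line size is an assumption (common on x86-64, not universal), and `PaddedCounter`, `Stats`, and `same_cache_line` are hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common but should be checked per target.
constexpr std::size_t kCacheLine = 64;

// alignas pads each counter out to a full line, so two counters never share one.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Two counters intended for two different writer threads.
struct Stats {
    PaddedCounter produced;  // written by a producer thread
    PaddedCounter consumed;  // written by a consumer thread
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "counter starts on a line boundary");

// Hypothetical helper: true if both addresses fall in the same assumed cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

Without the `alignas`, both counters would occupy 16 adjacent bytes and could land on one line, so every write by one thread could invalidate the other thread's cached copy. C++17 also defines `std::hardware_destructive_interference_size` for this purpose, though compiler support varies.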
Lock-Free Programming

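Coordinating threads with atomic hardware operations such as compare-and-swap instead of locks, so that a stalled or preempted thread cannot stop the others from making progress. This removes lock contention and the latency spikes it causes, although an individual operation may still have to retry.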

core advanced concept
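As a small illustration of the compare-and-swap retry pattern that most lock-free code is built on, the sketch below keeps a shared high-water mark up to date without a mutex. The function name `atomic_store_max` and the choice of memory orderings are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Raise `target` to at least `sample` without taking a lock.
// Safe to call from many threads at once.
inline void atomic_store_max(std::atomic<std::uint64_t>& target,
                             std::uint64_t sample) {
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `seen` with the current value,
    // so each retry compares against whatever another thread just wrote.
    while (seen < sample &&
           !target.compare_exchange_weak(seen, sample,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
        // Retry: either another thread won the race or the CAS failed spuriously.
    }
}
```

A thread that loses the race simply loops again; no thread ever waits on another thread to release anything, which is the property the description above relies on.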
Memory Pool

Pre-allocating a block of memory and handing out fixed-size chunks from it — trading flexibility for speed and predictability by sidestepping the general-purpose allocator.

core advanced concept
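One way to sketch the idea is a fixed-capacity pool that threads a free list through its unused slots. This is a simplified, single-threaded illustration; the class name `FixedPool` and its `create`/`destroy` interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

template <typename T>
class FixedPool {
    // A slot holds either a free-list link or the storage for one T.
    union Slot {
        Slot* next;                                   // used while the slot is free
        alignas(T) unsigned char storage[sizeof(T)];  // used while the slot is live
    };
    std::vector<Slot> slots_;   // allocated once, never resized
    Slot* free_ = nullptr;      // head of the free list

public:
    explicit FixedPool(std::size_t capacity) : slots_(capacity) {
        for (auto& s : slots_) { s.next = free_; free_ = &s; }
    }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop a slot and construct a T in it; nullptr when the pool is empty.
    template <typename... Args>
    T* create(Args&&... args) {
        if (!free_) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        return ::new (static_cast<void*>(s->storage)) T(std::forward<Args>(args)...);
    }

    // O(1): destroy the object and push its slot back on the free list.
    void destroy(T* p) {
        if (!p) return;
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }
};
```

Both operations are a couple of pointer moves with no system call and no search, which is where the predictability comes from. The cost is the fixed capacity and fixed chunk size.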

Important

What you'll meet next.

Branchless Programming

Replacing conditional branches with arithmetic and bit manipulation so there is no branch for the CPU to mispredict, trading an occasional misprediction penalty for a small, fixed amount of computation.

advanced concept
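Two common forms of the technique are sketched below: selecting between values with a bit mask, and counting matches by adding a comparison result. The function names are hypothetical, and whether this beats the compiler's own output (which may already emit a conditional move) is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Return a when cond is true, otherwise b, with no if/else in the source.
inline std::uint32_t select_u32(bool cond, std::uint32_t a, std::uint32_t b) {
    // mask is 0xFFFFFFFF when cond is true and 0x00000000 when it is false.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    return (a & mask) | (b & ~mask);
}

inline std::uint32_t min_u32(std::uint32_t a, std::uint32_t b) {
    return select_u32(a < b, a, b);
}

// Count elements above a threshold by adding the comparison result (0 or 1)
// instead of branching on it. Only the loop branch remains, and that one
// is highly predictable.
inline std::size_t count_above(const std::uint32_t* data, std::size_t n,
                               std::uint32_t threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += static_cast<std::size_t>(data[i] > threshold);
    return count;
}
```

The payoff is largest when the condition is data-dependent and close to random, since that is where a predictor does worst.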
Core Affinity

Binding a thread to a specific CPU core so the scheduler cannot migrate it, which removes migration overhead, keeps the core's private caches warm, and makes execution timing more predictable.

advanced concept
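On Linux, pinning can be sketched with `sched_getaffinity` and `sched_setaffinity`, where a pid of 0 means the calling thread. The helper names below are hypothetical, the code is Linux-specific (other platforms fall through to a stub), and it assumes a machine with no more CPUs than the fixed-size `cpu_set_t` can describe.

```cpp
#include <cassert>
#if defined(__linux__)
#include <sched.h>
#endif

// Index of the first CPU the calling thread may run on, or -1 if unknown.
inline int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed)) return cpu;
#endif
    return -1;
}

// Restrict the calling thread to a single CPU. Returns false on failure
// or on platforms this sketch does not support.
inline bool pin_current_thread(int cpu) {
#if defined(__linux__)
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one) == 0;
#else
    (void)cpu;
    return false;
#endif
}
```

A latency-critical thread would typically call `pin_current_thread` once at start-up, before it touches its working set, so that its data is pulled into the caches of the core it will stay on.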
Data-Oriented Design

Organising code around how data is laid out and accessed in memory rather than around object-oriented abstractions, so the CPU spends its time computing instead of waiting on cache misses.

advanced concept
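A frequently used illustration is the move from an array of structures to a structure of arrays. The particle types and the `advance` function below are hypothetical, chosen only to show the layout change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Object-style layout: a position update loads the whole record,
// including fields (mass, colour) the loop never reads.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int   colour;
};

// Data-oriented layout: one contiguous array per field, so the update loop
// streams through exactly the data it uses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;

    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n) {}

    std::size_t size() const { return x.size(); }
};

// Move every particle forward by dt using only position and velocity arrays.
inline void advance(ParticlesSoA& p, float dt) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```

With the structure-of-arrays form every cache line fetched by `advance` is full of values the loop needs, and the simple indexed loop is also easier for a compiler to vectorise.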
NUMA Awareness

On multi-socket servers each CPU socket has fast local memory and slower remote memory attached to the other sockets. NUMA-aware code allocates memory on the same node as the thread that uses it, avoiding the extra latency of cross-socket access.

advanced concept
SIMD Intrinsics

C/C++ functions that map directly to vector CPU instructions, bypassing the auto-vectoriser to write hand-tuned code that processes 4, 8, or 16 values with a single instruction.

advanced concept
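As a small example, the sketch below adds two float arrays four lanes at a time using x86 SSE intrinsics. The function name `add_arrays` is hypothetical, and the code assumes an x86 target with SSE; on other targets only the scalar loop is compiled.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// out[i] = a[i] + b[i] for i in [0, n).
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    // Main loop: each _mm_add_ps adds four floats with one instruction.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for the last n % 4 elements (and the non-SSE fallback).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Wider instruction sets follow the same pattern with wider register types, which is where the 8- and 16-value figures come from.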

Supplemental

Niche, historical, or specialized.

eBPF

A Linux kernel technology that safely runs sandboxed programs inside the kernel at near-native speed — enabling programmable networking, observability, and security without writing kernel modules or rebooting.

supplemental advanced concept
Huge Pages

Memory pages of 2 MB or 1 GB rather than the default 4 KB — reducing TLB misses for large working sets (databases, JVM heaps) by increasing the coverage each TLB entry provides, at the cost of internal fragmentation.

supplemental advanced concept
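One low-friction way to request huge pages on Linux is to allocate a 2 MB-aligned region and hint the kernel with `madvise(..., MADV_HUGEPAGE)`, which relies on transparent huge pages. The sketch below assumes a 2 MB huge-page size and a POSIX system; `alloc_thp_hinted` is a hypothetical helper name. The hint is advisory, so the memory is usable either way. Explicitly reserved huge pages are a separate mechanism and are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>     // posix_memalign, free
#if defined(__linux__)
#include <sys/mman.h>   // madvise, MADV_HUGEPAGE
#endif

constexpr std::size_t kHugePage = 2u * 1024u * 1024u;  // assumed 2 MB huge page

// Allocate at least `bytes`, rounded up to a multiple of 2 MB and aligned to
// 2 MB, then ask the kernel to back the range with huge pages if it can.
// Returns nullptr on failure; release with free().
inline void* alloc_thp_hinted(std::size_t bytes) {
    if (bytes == 0) return nullptr;
    const std::size_t rounded = (bytes + kHugePage - 1) / kHugePage * kHugePage;
    void* p = nullptr;
    if (posix_memalign(&p, kHugePage, rounded) != 0) return nullptr;
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    // Advisory only: if the kernel refuses, the range keeps normal 4 KB pages.
    (void)madvise(p, rounded, MADV_HUGEPAGE);
#endif
    return p;
}
```

Rounding every request up to 2 MB is the internal-fragmentation cost mentioned above: a 1-byte request still consumes a whole huge page's worth of address space.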
Kernel Bypass

A technique that allows applications to access hardware (network cards, storage) directly from userspace, bypassing the OS kernel — eliminating system call overhead, context switches, and interrupt processing to achieve single-digit microsecond latency.

supplemental advanced concept
RDMA

A technology that allows one computer to read or write another computer's memory directly over a network, with the network adapters moving the data and neither the remote CPU nor either host's kernel on the data path. It achieves latencies on the order of 1 µs and near-line-rate bandwidth for distributed computing, ML training, and storage.

supplemental advanced concept
