Low-Latency Systems
Engineering for sub-millisecond execution — cache-aware data layout, lock-free concurrency, and the hardware-software co-design that squeezes the last microseconds out of a machine.
Most software is fast enough. Some is not: trading engines, game loops, packet processors, and real-time audio all live or die by the microsecond. Low-Latency Systems covers the techniques that get there — laying data out to match the cache, avoiding locks and unpredictable branches, pinning work to cores, and bypassing the kernel when it gets in the way.
This category extends the CS2023 SPD (Systems / Parallel & Distributed) area toward the high-performance-computing track. It treats performance as a first-class design constraint rather than an afterthought.
Core
The essentials. Start here.
Cache-Line Alignment
Laying data out to match the CPU's cache lines — so hot fields share a line, unrelated fields don't, and the processor stops wasting fetches and fighting over ownership.
core advanced concept
Lock-Free Programming
Coordinating threads with atomic hardware operations instead of locks, so a stalled or descheduled thread can never hold up the others. This removes lock contention and the latency spikes it causes, although heavily contended atomics still pay for cache-line traffic.
core advanced concept
Memory Pool
Pre-allocating a block of memory and handing out fixed-size chunks from it — trading flexibility for speed and predictability by sidestepping the general-purpose allocator.
core advanced concept
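To make the Memory Pool entry above concrete, here is a minimal single-threaded sketch of a fixed-size pool built on an intrusive free list. The class name `FixedPool` and its interface are hypothetical, chosen only for illustration. It assumes the backing `std::vector` buffer is aligned for `std::max_align_t`, which holds for common allocators but is worth checking on your platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size pool. Not thread-safe; illustrative only.
class FixedPool {
public:
    FixedPool(std::size_t slot_size, std::size_t slot_count)
        : slot_size_(round_up(slot_size)),
          storage_(slot_size_ * slot_count) {
        // Thread every slot onto the free list up front, so allocate()
        // never has to call the general-purpose allocator.
        for (std::size_t i = 0; i < slot_count; ++i) {
            deallocate(storage_.data() + i * slot_size_);
        }
    }

    // Copying would leave the copy's free list pointing into the original.
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop the head of the free list; nullptr when the pool is exhausted.
    void* allocate() noexcept {
        if (free_list_ == nullptr) return nullptr;
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }

    // O(1): push the slot back. The pointer must have come from this pool.
    void deallocate(void* slot) noexcept {
        assert(slot != nullptr);
        free_list_ = new (slot) Node{free_list_};
    }

private:
    struct Node { Node* next; };

    // Each slot must be able to hold a Node and stay suitably aligned.
    static std::size_t round_up(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t needed = std::max(size, sizeof(Node));
        return (needed + align - 1) / align * align;
    }

    std::size_t slot_size_;
    std::vector<unsigned char> storage_;  // one up-front allocation
    Node* free_list_ = nullptr;
};
```

Both operations are a couple of pointer moves, which is where the speed and predictability come from. The trade-off is the one the entry names: every slot has the same size, and the pool cannot grow.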
Important
What you'll meet next.
Branchless Programming
Replacing conditional branches with arithmetic and bit manipulation so there is no branch for the CPU to mispredict, trading an unpredictable misprediction penalty for a small, fixed amount of computation.
advanced concept
Core Affinity
Binding a thread permanently to a specific CPU core — eliminating migration overhead, warming the core's private caches, and making execution timing predictable.
advanced concept
Data-Oriented Design
Organising code around the memory layout the data needs rather than around object-oriented abstractions, so the CPU spends its time computing instead of waiting on cache misses.
advanced concept
NUMA Awareness
On multi-socket servers each CPU has fast local memory and slower remote memory. NUMA-aware code allocates memory on the same socket as the thread that uses it, avoiding the extra latency of cross-socket access.
advanced concept
SIMD Intrinsics
C/C++ functions that map directly to vector CPU instructions, bypassing the auto-vectoriser to write hand-tuned code that processes 4, 8, or 16 values with a single instruction.
advanced concept
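As a small illustration of the Branchless Programming entry above, the sketch below replaces an if/else with a bit mask and a counting `if` with a comparison added as 0 or 1. The function names `select` and `count_below` are hypothetical. Whether this beats the branching version depends on the compiler and the data: optimisers often emit conditional moves on their own, so treat this as something to measure rather than a guaranteed win.

```cpp
#include <cstddef>
#include <cstdint>

// Returns a when cond is true, otherwise b, without an if/else.
// -1 is all ones in two's complement, so the mask keeps exactly one operand.
inline std::int32_t select(bool cond, std::int32_t a, std::int32_t b) {
    const std::int32_t mask = -static_cast<std::int32_t>(cond);
    return (a & mask) | (b & ~mask);
}

// Counts the elements below threshold. The comparison result (0 or 1)
// is added directly, so the loop body has no data-dependent branch.
inline std::size_t count_below(const std::int32_t* data, std::size_t n,
                               std::int32_t threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(data[i] < threshold);
    }
    return count;
}
```

For example, `count_below` over the values 5, 1, 9, -2, 4 with a threshold of 4 counts 1 and -2. The work per element is the same whether the input is sorted or random, which is the property that matters for predictable latency.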
Supplemental
Niche, historical, or specialized.
eBPF
A Linux kernel technology that safely runs sandboxed programs inside the kernel at near-native speed — enabling programmable networking, observability, and security without writing kernel modules or rebooting.
supplemental advanced concept
Huge Pages
Memory pages of 2 MB or 1 GB rather than the default 4 KB — reducing TLB misses for large working sets (databases, JVM heaps) by increasing the coverage each TLB entry provides, at the cost of internal fragmentation.
supplemental advanced concept
Kernel Bypass
A technique that allows applications to access hardware (network cards, storage) directly from userspace, bypassing the OS kernel — eliminating system call overhead, context switches, and interrupt processing to achieve single-digit microsecond latency.
supplemental advanced concept
RDMA
A technology that allows one computer to directly read or write another computer's memory over a network without involving the remote CPU or either operating system's kernel on the data path, achieving latency on the order of 1 µs and near-line-rate bandwidth for distributed computing, ML training, and storage.
supplemental advanced concept
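The Huge Pages entry above turns on how much memory the TLB can cover, which is simple arithmetic: entries times page size. The sketch below works it through at compile time. The entry count of 1536 is an assumed figure for illustration, not a property of any particular CPU, and real TLBs often have different capacities for different page sizes. The helper names are hypothetical, and the message-less `static_assert` needs C++17.

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;

// Memory reachable without a TLB miss: one page per entry.
constexpr std::uint64_t tlb_reach(std::uint64_t entries, std::uint64_t page_size) {
    return entries * page_size;
}

// Number of pages (and so translations) needed to map a working set.
constexpr std::uint64_t pages_needed(std::uint64_t working_set, std::uint64_t page_size) {
    return (working_set + page_size - 1) / page_size;
}

// Assumption for illustration only; check your CPU's documentation.
constexpr std::uint64_t kAssumedTlbEntries = 1536;

// With 4 KiB pages the assumed TLB covers 6 MiB; with 2 MiB pages, 3 GiB.
static_assert(tlb_reach(kAssumedTlbEntries, 4 * KiB) == 6 * MiB);
static_assert(tlb_reach(kAssumedTlbEntries, 2 * MiB) == 3 * GiB);

// An 8 GiB working set needs 2,097,152 small pages but only 4,096 huge pages.
static_assert(pages_needed(8 * GiB, 4 * KiB) == 2097152);
static_assert(pages_needed(8 * GiB, 2 * MiB) == 4096);
```

Under these assumptions, a multi-gigabyte working set overflows the TLB many times over with 4 KiB pages and fits far more comfortably with 2 MiB pages. The rounding up in `pages_needed` is also where the internal fragmentation comes from: a partly used huge page still occupies the full page.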