Markov Chains

In simple terms

A Markov chain is a system that hops between states at random, where the probability of the next state depends only on the current state — not on how you got there. This “Markovian” (memoryless) property makes the math tractable. A biased random walk (stepping left or right with fixed probabilities), a board game (the probability of your next square depends only on your current square and dice roll), and Google’s PageRank (a random web surfer clicking links) are all Markov chains. They’re everywhere in probability, statistics, and machine learning.

More detail

Formal definition: a sequence of random variables X_0, X_1, X_2, ... taking values in a state space S, with the Markov property: P(X_{t+1} = j | X_t = i, X_{t-1}, ..., X_0) = P(X_{t+1} = j | X_t = i) = P_{ij}

The transition matrix P (where P_{ij} = probability of moving from state i to state j, with rows summing to 1) encodes the entire chain.

Key properties:

Irreducibility: every state can be reached from every other state (the chain doesn’t get stuck in a subset).

Aperiodicity: the chain doesn’t cycle with a fixed period.

Stationary distribution: a probability distribution π such that πP = π — if you start with distribution π and take a step, you remain in π. An irreducible, aperiodic chain on a finite state space always converges to its unique stationary distribution π, regardless of the starting state. This is the ergodic theorem for Markov chains.

Convergence: the distribution P^t(x_0, ·) after t steps converges to π as t → ∞. The mixing time measures how quickly this happens.

Examples:

Random walk on a graph: at each step, move to a uniformly random neighbour. The stationary distribution is proportional to the degree of each node.

PageRank: Google’s original ranking algorithm models a random web surfer: follow a random hyperlink with probability 0.85; jump to a random page with probability 0.15 (the “teleportation” factor that ensures aperiodicity and irreducibility). The stationary distribution π is the PageRank vector. Computed by power iteration: π = πP.

Weather model: states = Sunny, Cloudy, Rainy. Transition probabilities based on historical weather data. What’s the probability of sun in 10 days? → π₀ P^{10}.

Queueing theory: M/M/1 queue (arrivals ~ Poisson, service ~ Exponential, 1 server) is a Markov chain on states 0, 1, 2, … (queue length). Stationary distribution: geometric with parameter ρ = arrival_rate / service_rate. Little’s Law emerges from Markov chain analysis.

Continuous-time Markov chains (CTMC): transitions happen at continuous times, governed by a rate matrix Q. Used in reliability engineering, chemistry (reaction kinetics), and biology (gene expression).

MCMC (Markov Chain Monte Carlo): a method for sampling from complex probability distributions. Design a Markov chain whose stationary distribution is the target distribution π; run the chain long enough to mix, then use the samples. Key algorithms:

Metropolis-Hastings: at each step, propose a new state from a proposal distribution; accept with probability min(1, π(proposal)/π(current)). Ensures detailed balance (reversibility) → π is stationary.
Gibbs sampling: a special case where you sample each variable conditioned on all others. Used extensively in Bayesian inference.

MCMC makes Bayesian inference feasible for complex models — the posterior may not have a closed form, but you can sample from it.

Hidden Markov Model (HMM): a Markov chain with hidden states — you observe outputs generated by the states, but not the states themselves. The Forward-Backward algorithm and Viterbi algorithm solve HMM inference. HMMs underpin speech recognition (phones as hidden states, acoustic observations), protein sequence analysis, and NLP tagging.

Why it matters

Markov chains appear throughout computer science, statistics, and applied mathematics. PageRank (computed as the stationary distribution of a Markov chain) shaped internet search for two decades. MCMC made Bayesian statistics practical. HMMs drove speech recognition before deep learning. Modern language models (GPT, Claude) can be viewed as learned conditional distributions over next tokens — a learned (non-stationary) Markov-like process. Understanding Markov chains is essential for probabilistic modelling, reinforcement learning (MDPs are Markov Decision Processes), queueing theory, and Monte Carlo simulation.

Real-world examples

Google PageRank: stationary distribution of a random walk on the web graph; powered Google Search for ~2004–2011.
Speech recognition (HMM era): CMU Sphinx, HTK — HMMs with phone-level states; dominated ASR before deep learning.
MCMC for Bayesian inference: Stan and PyMC use NUTS (No-U-Turn Sampler, a variant of Hamiltonian Monte Carlo — an MCMC method) for posterior sampling.
Monopoly: the probability of landing on each square is the stationary distribution of a Markov chain (board positions + jail mechanics).

Common misconceptions

“Markov chains are memoryless, so they can’t model sequences with long-range dependencies.” True for first-order chains. Higher-order chains (conditioning on the last k states) and HMMs extend the reach. Modern LLMs overcome this limitation with attention mechanisms.
“MCMC is exact sampling.” MCMC samples have correlations and require burn-in; they are approximately distributed as π after mixing. Diagnostics (R-hat, effective sample size) assess whether the chain has mixed.

Learn next

Markov chains rest on probability and statistics. Hidden Markov models extend them for sequence modelling. Bayesian inference uses MCMC as its computational workhorse. Optimization theory is the other major tool in probabilistic ML.