Markov Chain Monte Carlo

Difficulty: Advanced

Reading Time: 25 min

Track: Practitioner

A family of sampling algorithms that approximate hard probability distributions by constructing a Markov chain.

TL;DR

MCMC approximates a hard target distribution $p(x)$ by simulating a Markov chain whose stationary distribution is exactly $p(x)$; after warm-up, the visited states behave like (correlated) samples from $p$.
It shines when you can evaluate $p(x)$ only up to a normalizing constant, which is typical of Bayesian posteriors $p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$ whose evidence integral is intractable.
The Metropolis-Hastings acceptance ratio $A = \min\left(1, \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right)$ uses only ratios of $p$, so the unknown normalizer cancels.
Correctness comes from detailed balance: the constructed transition leaves $p$ invariant. Provided the chain is also irreducible and aperiodic, it converges to the target regardless of where it starts.
Practical use requires discarding burn-in, monitoring convergence ($\hat{R}$, trace plots), and remembering that successive draws are autocorrelated, so the effective sample size is typically well below the raw count.
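
The cancellation and detailed-balance claims above can be checked in two lines. This is a derivation sketch, not a full proof; the symbols $\tilde{p}$ (unnormalized density), $Z$ (unknown normalizing constant), and $A(x \to x')$ (acceptance probability for a move from $x$ to $x'$) are notation introduced here for illustration.

```latex
% Write the target as p(x) = \tilde{p}(x) / Z with Z unknown; Z cancels in the ratio.
\[
\frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}
= \frac{\tilde{p}(x')/Z}{\tilde{p}(x)/Z}\cdot\frac{q(x \mid x')}{q(x' \mid x)}
= \frac{\tilde{p}(x')\,q(x \mid x')}{\tilde{p}(x)\,q(x' \mid x)}
\]
% Detailed balance for x \neq x': the probability flow x -> x' equals the flow x' -> x.
\[
p(x)\,q(x' \mid x)\,A(x \to x')
= \min\bigl(p(x)\,q(x' \mid x),\; p(x')\,q(x \mid x')\bigr)
= p(x')\,q(x \mid x')\,A(x' \to x)
\]
```

The middle expression in the second line is symmetric in $x$ and $x'$, which is why the equality holds in both directions.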

Learning Objectives

Explain the mathematical foundation of Markov Chain Monte Carlo sampling
Describe the Metropolis-Hastings acceptance ratio and its rationale
Identify diagnostic metrics for MCMC convergence such as Gelman-Rubin $\hat{R}$ and Effective Sample Size
Explain the concept of burn-in or warm-up phase in MCMC

Intuition

How to think conceptually about this topic

Imagine a landscape where height represents probability density. The sampler proposes nearby moves. Moves to higher-density regions are usually accepted, but some lower-density moves are accepted too.

That occasional willingness to move downhill matters. It keeps the chain from getting trapped at one local peak and lets it explore the distribution. The result is not a perfect map, but an approximation that improves with better proposals, diagnostics, and enough effective samples.

In Depth

Detailed explanations and context

Markov Chain Monte Carlo (MCMC) is used when a probability distribution is known up to proportionality, but direct integration or exact sampling is too difficult. This is common in Bayesian inference, where the posterior may be high-dimensional and analytically intractable.

MCMC builds a Markov chain whose stationary distribution is the target distribution. After warm-up, the visited states behave like dependent samples from that target. Histograms, expectations, credible intervals, and posterior summaries can then be estimated from those samples.
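
As a rough illustration of how such summaries come out of a chain, the sketch below runs a small random-walk Metropolis sampler and computes a posterior mean, a 95% credible interval, and a tail probability from the retained draws. The target (an unnormalized normal with mean 1 and standard deviation 0.5), the step size, and the warm-up length are assumptions chosen only for this example.

```python
import numpy as np

def metropolis(log_density, x0, n_steps, step, rng):
    """Random-walk Metropolis for a 1-D target given by its log-density."""
    x, logp = x0, log_density(x0)
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        logp_prop = log_density(prop)
        # Accept with probability min(1, p(prop) / p(x)), evaluated in log space
        if np.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
        out[i] = x  # a rejected proposal repeats the current state
    return out

rng = np.random.default_rng(0)

# Illustrative target (assumption): unnormalized N(1, 0.5^2)
def log_density(t):
    return -0.5 * ((t - 1.0) / 0.5) ** 2

chain = metropolis(log_density, x0=0.0, n_steps=20_000, step=1.0, rng=rng)
draws = chain[2_000:]  # discard an assumed warm-up of 2,000 iterations

post_mean = draws.mean()                               # estimate of E[x]
ci_low, ci_high = np.quantile(draws, [0.025, 0.975])   # 95% credible interval
prob_positive = np.mean(draws > 0)                     # estimate of P(x > 0)

print(f"mean={post_mean:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f}), "
      f"P(x>0)={prob_positive:.3f}")
```

Every summary is just an average or quantile over the retained draws, which is why the quality of the estimates depends on the effective number of draws rather than the raw count.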

Where is it used?

MCMC is used in Bayesian modeling, physics simulation, hierarchical medical models, ecological inference, and risk analysis when closed-form calculations are not available.

How It Compares

MCMC vs Variational Inference vs Exact/Conjugate Inference

Dimension	MCMC (Metropolis-Hastings)	Variational Inference	Exact / Conjugate
Asymptotic accuracy	Exact in the limit of infinite samples	Approximate — biased by the chosen variational family	Exact (closed form)
Speed	Slow — many serial iterations	Fast — cast as an optimization problem	Instant once the formula is derived
Scalability to high dimensions / big data	Poor — mixing degrades and cost grows	Good — scales with stochastic gradient methods	N/A — only special models qualify
Gives the true posterior?	Yes (asymptotically, as samples)	No — an approximation, often under-dispersed	Yes — the exact posterior
Requires only an unnormalized density?	Yes — normalizer cancels	Yes — uses the unnormalized joint via the ELBO	No — relies on a known conjugate form

Takeaway: Use exact/conjugate inference whenever the model admits it; otherwise trade accuracy for speed: MCMC for asymptotically exact (but slow) posteriors, variational inference for fast (but approximate) ones.
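
To make the exact-versus-MCMC trade-off concrete, here is a minimal sketch on a model where both routes are available: a Beta prior with Bernoulli data. The data (7 successes in 20 trials), the Beta(2, 2) prior, and the sampler settings are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical data and prior (assumptions for this example)
a, b = 2.0, 2.0      # Beta(a, b) prior
k, n = 7, 20         # k successes in n trials

# Exact / conjugate route: the posterior is Beta(a + k, b + n - k)
exact_mean = (a + k) / (a + b + n)

def log_post(theta):
    """Unnormalized log-posterior; -inf outside the support (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (a + k - 1) * np.log(theta) + (b + n - k - 1) * np.log1p(-theta)

# MCMC route: random-walk Metropolis on the same unnormalized posterior
rng = np.random.default_rng(1)
theta, lp = 0.5, log_post(0.5)
draws = []
for _ in range(20_000):
    prop = theta + 0.2 * rng.normal()
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    draws.append(theta)

mcmc_mean = np.mean(draws[2_000:])  # drop an assumed warm-up

print(f"exact posterior mean = {exact_mean:.4f}")
print(f"MCMC  posterior mean = {mcmc_mean:.4f}")
```

The conjugate answer costs one line of arithmetic, while the sampler needs thousands of iterations to approximate the same number, which is the point of the takeaway above.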

When to Use It

Reach for this when

The posterior is analytically intractable (no conjugate form) and you need full uncertainty quantification, not just a point estimate.
You can evaluate the target only up to a normalizing constant, e.g. an unnormalized Bayesian posterior $p(\theta)\,p(D \mid \theta)$.
The posterior may be skewed, correlated, or multi-modal, so a Gaussian approximation would be misleading.

Avoid it when

The model has a conjugate / closed-form posterior — exact inference is faster and exact.
You face very high-dimensional parameter spaces or massive datasets where MCMC mixes too slowly — consider variational inference or stochastic-gradient methods.
You need an answer in real time / under a tight latency budget — MCMC requires many serial iterations and convergence checks.

Rules of thumb

Always run multiple chains from dispersed starts and check Gelman-Rubin $\hat{R} < 1.01$ before trusting results.
For random-walk Metropolis, tune the proposal so the acceptance rate lands near the theoretically optimal $\approx 23\%$ (in high dimensions).
Report effective sample size, not raw iteration count, as the measure of estimate precision.
When mixing is poor or dimension is high, switch from random-walk Metropolis to gradient-based samplers like Hamiltonian Monte Carlo / NUTS.
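
The multiple-chain check in the first rule can be sketched as follows. This computes the classic (non-split) Gelman-Rubin statistic; mature libraries such as Stan and ArviZ use more robust split, rank-normalized variants, so treat this as a simplified illustration. The function names `gelman_rubin` and `run_chain`, the standard normal target, and all tuning values are assumptions made for this example.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return np.sqrt(var_plus / within)

def run_chain(x0, n_steps, step, rng):
    """Random-walk Metropolis on an unnormalized standard normal target."""
    x = x0
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.random()) < -0.5 * (prop**2 - x**2):
            x = prop
        out[i] = x
    return out

rng = np.random.default_rng(0)
starts = [-10.0, -3.0, 3.0, 10.0]   # dispersed starting points
chains = np.array([run_chain(s, 5_000, 2.4, rng)[1_000:] for s in starts])
rhat_mixed = gelman_rubin(chains)

# Simulated failure: shift two chains so they disagree about the mean
stuck = chains + np.array([[0.0], [0.0], [3.0], [3.0]])
rhat_stuck = gelman_rubin(stuck)

print(f"R-hat, well-mixed chains:  {rhat_mixed:.3f}")
print(f"R-hat, disagreeing chains: {rhat_stuck:.3f}")
```

Values close to 1 indicate that between-chain and within-chain variability agree; values well above 1 mean the chains are still exploring different regions.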

Implementation

Reference code implementation

Python

model_fitting.py

import numpy as np

# Seeded generator so the run is reproducible
rng = np.random.default_rng(0)

# A simple random-walk Metropolis sampler
def target_density(x):
    # Unnormalized two-peaked target: Gaussian bumps at x = +2 and x = -2
    return np.exp(-0.5 * (x - 2)**2) + 0.5 * np.exp(-0.5 * (x + 2)**2)

current_x = 0.0
samples = []
n_iterations = 10000

for _ in range(n_iterations):
    # Propose a step from a symmetric Gaussian random walk
    proposed_x = current_x + rng.normal(0, 1.5)

    # Metropolis ratio; the proposal terms cancel because q is symmetric
    acceptance_ratio = target_density(proposed_x) / target_density(current_x)

    # Accept with probability min(1, acceptance_ratio)
    if rng.random() < acceptance_ratio:
        current_x = proposed_x

    # Record the current state, even when the proposal was rejected
    samples.append(current_x)

print(f"Collected {len(samples)} samples; sample mean = {np.mean(samples):.3f}")
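
The listing above uses a symmetric proposal, so the $q$ terms in the Metropolis-Hastings ratio cancel. The sketch below shows a case where they do not: a multiplicative (log-normal) proposal on a positive-valued target, with the ratio evaluated in log space to avoid numerical underflow. The Gamma-shaped target, the step size, and the warm-up length are assumptions chosen for illustration.

```python
import numpy as np

def log_target(x):
    # Unnormalized Gamma(shape=3, rate=1) density on x > 0: x^2 * exp(-x)
    return 2.0 * np.log(x) - x

rng = np.random.default_rng(0)
x = 1.0
step = 0.8                       # log-scale step size (illustrative)
draws = np.empty(30_000)

for i in range(draws.size):
    # Multiplicative proposal: always positive, but not symmetric in x and x'
    prop = x * np.exp(step * rng.normal())

    # log of p(x') q(x | x') / (p(x) q(x' | x));
    # for this proposal the Hastings correction q(x | x') / q(x' | x) equals x' / x
    log_ratio = log_target(prop) - log_target(x) + np.log(prop) - np.log(x)

    if np.log(rng.random()) < log_ratio:
        x = prop
    draws[i] = x

est_mean = draws[3_000:].mean()  # the true mean of Gamma(3, 1) is 3
print(f"estimated mean = {est_mean:.3f}")
```

Dropping the `np.log(prop) - np.log(x)` term would leave a valid-looking sampler that targets the wrong distribution, which is why the full ratio matters whenever the proposal is asymmetric.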

Strengths & Advantages

It can estimate posterior quantities for models where direct integration is not feasible.
It does not require a Gaussian posterior and can represent skewed, correlated, or multi-modal distributions if the chain mixes well.

Limitations & Drawbacks

It is computationally expensive: chains often need many thousands to millions of serial iterations, each requiring a density evaluation, so it is slow compared with optimization-based alternatives.
Convergence is hard to verify. There is no definitive signal that the chain has reached its stationary distribution, so in practice you run multiple chains from different starting points and check that they agree.

Real-World Case Studies

Optimal scaling of random-walk Metropolis

Computational statistics / Bayesian inference

Scenario

A practitioner sampling a high-dimensional posterior with random-walk Metropolis must choose the variance of the Gaussian proposal. Too small a step and the chain crawls (high acceptance but tiny moves); too large and almost every proposal is rejected (the chain stalls). The question is what acceptance rate to target.

Approach

Roberts, Gelman and Gilks (1997) analyzed the diffusion limit of random-walk Metropolis as the dimension $d \to \infty$ for product targets, deriving the proposal scaling that maximizes the efficiency (effective sample size per iteration) of the resulting chain.

Outcome

They showed the asymptotically optimal acceptance rate is $\approx 0.234$ (about 23.4%), achieved by scaling the proposal standard deviation like $2.38/\sqrt{d}$ (for a target with unit-scale components). Tuning a sampler toward this $\approx 23\%$ acceptance target is now a standard practitioner rule of thumb, although the result is derived for product-form targets in the high-dimensional limit and serves only as a guideline elsewhere.

Source: Weak convergence and optimal scaling of random walk Metropolis algorithms (Annals of Applied Probability, 1997) — Roberts, G.O., Gelman, A. and Gilks, W.R.
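
The trade-off described in this case study can be observed empirically. The sketch below measures the acceptance rate of random-walk Metropolis on a $d$-dimensional standard normal target for three proposal scales of the form $\ell/\sqrt{d}$. The dimension, the values of $\ell$, and the run length are assumptions chosen for illustration, and at finite $d$ the measured rate only approximates the asymptotic 0.234.

```python
import numpy as np

def rwm_acceptance_rate(d, scale, n_steps, rng):
    """Acceptance rate of random-walk Metropolis on a d-dim standard normal."""
    x = rng.normal(size=d)           # start from a draw of the target itself
    logp = -0.5 * x @ x
    accepted = 0
    for _ in range(n_steps):
        prop = x + scale * rng.normal(size=d)
        logp_prop = -0.5 * prop @ prop
        if np.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
            accepted += 1
    return accepted / n_steps

rng = np.random.default_rng(0)
d = 50
rates = {}
for ell in (0.5, 2.38, 6.0):
    rates[ell] = rwm_acceptance_rate(d, ell / np.sqrt(d), 5_000, rng)
    print(f"proposal sd = {ell}/sqrt(d): acceptance rate = {rates[ell]:.3f}")
```

Small steps are accepted most of the time but barely move, large steps are almost always rejected, and the intermediate scale lands near the acceptance rate the case study recommends.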

Common Misconceptions

Misconception: MCMC samples are completely independent of each other.

Correction: Because each new state is proposed from the current one (and a rejected proposal repeats the current state), successive samples are correlated, often strongly. This autocorrelation reduces the effective sample size.
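
To see how much autocorrelation can cost, the sketch below estimates the effective sample size as $n / (1 + 2\sum_k \rho_k)$, truncating the sum at the first non-positive autocorrelation. This is a simplified estimator (production tools use more careful truncation rules), and the AR(1) series with coefficient 0.9 is only a stand-in for a correlated MCMC chain, chosen because its true autocorrelation is known.

```python
import numpy as np

def autocorr(x):
    """Sample autocorrelation at all lags, computed with a zero-padded FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    f = np.fft.rfft(x, n=2 * n)                   # zero-pad to avoid wrap-around
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    return acov / acov[0]

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    rho = autocorr(x)
    nonpos = np.where(rho[1:] <= 0)[0]
    cutoff = nonpos[0] + 1 if nonpos.size else len(rho)
    tau = 1.0 + 2.0 * rho[1:cutoff].sum()
    return len(x) / tau

rng = np.random.default_rng(0)
n, phi = 20_000, 0.9

# Independent draws: ESS should be close to n
iid = rng.normal(size=n)

# AR(1) stand-in for a correlated chain: x_t = phi * x_{t-1} + noise
ar1 = np.empty(n)
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + np.sqrt(1 - phi**2) * rng.normal()

ess_iid = effective_sample_size(iid)
ess_ar1 = effective_sample_size(ar1)
print(f"ESS of {n} independent draws: {ess_iid:.0f}")
print(f"ESS of {n} correlated draws:  {ess_ar1:.0f}")
```

For this AR(1) series the theoretical ratio is $(1 - 0.9)/(1 + 0.9) \approx 0.053$, so 20,000 correlated draws carry roughly the information of about a thousand independent ones.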

Misconception: The chain always converges to the true distribution immediately.

Correction: A chain started from an arbitrary point needs a "burn-in" or "warm-up" phase to approach its stationary distribution; the samples drawn during that phase are discarded.
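
The effect of burn-in is easy to see when the chain starts far from the bulk of the target. In the sketch below, a random-walk Metropolis chain on a standard normal target starts at $x = 50$; the early iterations are spent travelling toward the mode and would bias any average that included them. The starting point, step size, and burn-in length are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = 50.0                      # deliberately poor starting point
chain = np.empty(10_000)
for i in range(chain.size):
    prop = x + rng.normal()
    # Metropolis acceptance for an unnormalized standard normal target
    if np.log(rng.random()) < -0.5 * (prop**2 - x**2):
        x = prop
    chain[i] = x

burn_in = 1_000
early_mean = chain[:200].mean()       # dominated by the trip from x = 50
kept_mean = chain[burn_in:].mean()    # should be close to the true mean, 0

print(f"mean of first 200 draws: {early_mean:.2f}")
print(f"mean after burn-in:      {kept_mean:.2f}")
```

A trace plot of `chain` would show the same thing visually: a steep initial descent followed by stationary fluctuation around zero.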

References & Further Reading

Markov Chains and Mixing Times (textbook)
By Levin, D.A. and Peres, Y.
Handbook of Markov Chain Monte Carlo (textbook)
By Brooks, S. et al.