Bias-Variance Tradeoff

The bias-variance tradeoff explains why model capacity matters. Under squared-error loss, expected prediction error decomposes into squared bias, variance, and irreducible noise.

  • Bias: Error introduced by approximating a complex real-world process with a simple model (e.g. fitting a straight line to quadratic data). High bias leads to underfitting.

  • Variance: Error introduced by the model's sensitivity to small fluctuations in the training dataset. High variance leads to overfitting, where the model learns the noise rather than the signal.

  • Irreducible Error ($\sigma^2$): The noise floor inherent in the data-generating process, which no model can eliminate.

Best use

Provides a clear diagnostic roadmap for improving models (e.g., add features if bias is high; add data/regularization if variance is high).

Watch out for

In practice, the bias and variance terms cannot be computed exactly because the true function $f(x)$ and the data distribution are unknown.

Intuition

How to think about this concept

Imagine a target board at an archery range:

  1. Low Bias, Low Variance: Arrows are tightly clustered in the bullseye. (The ideal model: accurate and consistent).

  2. High Bias, Low Variance: Arrows are tightly clustered, but far off target in the corner. (Underfitting: consistent, but consistently wrong).

  3. Low Bias, High Variance: Arrows are spread widely across the entire board, but their average center is close to the bullseye. (Overfitting: highly inconsistent, chasing individual training noise).

  4. High Bias, High Variance: Arrows are scattered and completely off target. (The worst case).

As model capacity increases, training error usually falls. Validation error often falls at first, then rises when the model starts fitting noise. The best model is usually near the bottom of that validation curve.
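The capacity curve described above can be reproduced on synthetic data. The following is a minimal sketch, not a definitive recipe: the sine-shaped `true_f`, the noise level, and the sample sizes are all assumptions chosen for illustration, and `np.polyfit` stands in for whatever model family is being tuned.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    # Hypothetical ground-truth signal, chosen only for illustration.
    return np.sin(2 * np.pi * x)


# Synthetic training and validation sets: noisy observations of true_f.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
x_val = rng.uniform(0, 1, 200)
y_val = true_f(x_val) + rng.normal(0, 0.3, 200)

degrees = list(range(1, 11))
train_mse, val_mse = [], []
for d in degrees:
    # Least-squares polynomial fit of degree d on the training set only.
    coefs = np.polyfit(x_train, y_train, d)
    train_mse.append(float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2)))
    val_mse.append(float(np.mean((np.polyval(coefs, x_val) - y_val) ** 2)))

best_degree = degrees[int(np.argmin(val_mse))]
for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree={d:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
print("degree with lowest validation MSE:", best_degree)
```

Training error is computed on the points the polynomial was fitted to, so it keeps shrinking as the degree grows; the validation column is the one to read when choosing capacity.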

Interactive Diagram

Bias-Variance Tradeoff (Polynomial Fitting)

[Interactive plot: click to place training points and slide the polynomial degree d to change model capacity. Panels: Polynomial Fit, Generalization Curves. Readouts: training MSE and leave-one-out cross-validation (LOOCV) error.]

Degree 1 is a straight line; degrees 2 and 3 are quadratic and cubic. Higher degrees add flexibility, letting the curve bend sharply to pass through outlying training points.

Key Insight: Low degrees suffer from high bias (underfitting). High degrees can fit the training points almost perfectly but suffer from high variance (overfitting). The minimum of the cross-validation curve marks the degree that best balances the two.
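The leave-one-out cross-validation (LOOCV) score mentioned in the diagram can be computed by hand for polynomial fits. The sketch below is an assumption-laden illustration: `loocv_mse` is a hypothetical helper, the quadratic data are synthetic, and `np.polyfit` is used in place of the page's own fitting code.

```python
import numpy as np


def loocv_mse(x, y, degree):
    """Leave-one-out CV: refit n times, each time predicting the held-out point."""
    n = len(x)
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i  # every point except the i-th
        coefs = np.polyfit(x[keep], y[keep], degree)
        prediction = np.polyval(coefs, x[i])
        squared_errors.append((prediction - y[i]) ** 2)
    return float(np.mean(squared_errors))


# Synthetic data (assumed for illustration): a quadratic signal plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, size=x.size)

scores = {d: loocv_mse(x, y, d) for d in range(1, 9)}
best_degree = min(scores, key=scores.get)
for d, s in scores.items():
    print(f"degree={d}  LOOCV MSE={s:.3f}")
print("degree with lowest LOOCV MSE:", best_degree)
```

Because each held-out point is never seen during its own fit, the LOOCV score penalizes degrees that only memorize the training points.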

The Logic

Mathematical core for bias-variance tradeoff

1. Mathematical Decomposition

Let the true data-generating process be $y = f(x) + \epsilon$, where $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$ (irreducible noise).

If we fit a model $\hat{f}(x)$ on a training set, the expected squared error at a query point $x$, averaged over training sets and noise, is:

$$E\left[(y - \hat{f}(x))^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Variance}\left[\hat{f}(x)\right] + \sigma^2$$

Where:

2. Bias Term

The difference between the expected prediction of our model and the true function:

$$\text{Bias}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)\right] - f(x)$$

3. Variance Term

The variance of the model's prediction over different training set samples:

$$\text{Variance}\left[\hat{f}(x)\right] = E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right]$$
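With both terms defined, the decomposition can be derived in a few lines. The sketch below assumes, in addition to the conditions stated in step 1, that the noise $\epsilon$ at the query point is independent of the training set used to fit $\hat{f}$; this is what makes the cross-terms vanish.

```latex
\begin{aligned}
E\left[(y - \hat{f}(x))^2\right]
  &= E\left[(f(x) + \epsilon - \hat{f}(x))^2\right] \\
  &= E\left[(f(x) - \hat{f}(x))^2\right]
     + 2\,E[\epsilon]\,E\left[f(x) - \hat{f}(x)\right]
     + E\left[\epsilon^2\right] \\
  &= E\left[(f(x) - \hat{f}(x))^2\right] + \sigma^2, \\[6pt]
E\left[(f(x) - \hat{f}(x))^2\right]
  &= \left(E\left[\hat{f}(x)\right] - f(x)\right)^2
     + E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right] \\
  &= \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Variance}\left[\hat{f}(x)\right].
\end{aligned}
```

The second identity follows from adding and subtracting $E[\hat{f}(x)]$ inside the square; the resulting cross-term is zero because $E\left[\hat{f}(x) - E[\hat{f}(x)]\right] = 0$.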

Code Example

Python (scikit-learn example)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def get_poly_model(degree):
    """Return a pipeline that fits a polynomial of the given degree by least squares."""
    return make_pipeline(
        PolynomialFeatures(degree=degree),  # expands x into [1, x, ..., x^degree]
        LinearRegression(),
    )


# Fit the model on training coordinates (X, y).
# X must be 2-D with shape (n_samples, n_features); y has shape (n_samples,).
# model = get_poly_model(degree=3)
# model.fit(X, y)

Strengths

  • Provides a clear diagnostic roadmap for improving models (e.g., add features if bias is high; add data/regularization if variance is high).

  • Separates avoidable modeling error from the irreducible noise floor $\sigma^2$.

  • Guides feature selection, model capacity decisions, and validation strategy.
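The regularization remedy in the first bullet can be illustrated with a small simulation. The sketch below makes several assumptions that are not from the text: a hand-picked `true_f`, fixed inputs with only the noise resampled, a closed-form ridge fit that also penalizes the intercept, and a penalty of 0.1. It compares the spread of predictions at one query point with and without the penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
degree, n_train, n_repeats, sigma = 9, 15, 500, 0.3

x = np.linspace(-1, 1, n_train)  # fixed inputs; only the noise is resampled
X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
x0 = 0.93  # query point near the edge of the data
phi0 = np.vander(np.array([x0]), degree + 1, increasing=True)[0]


def true_f(t):
    # Hypothetical ground-truth signal, chosen only for illustration.
    return np.sin(np.pi * t)


def ridge_predict(y, lam):
    # Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y, evaluated at x0.
    # lam = 0 reduces to ordinary least squares.
    beta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return float(phi0 @ beta)


lams = (0.0, 0.1)
preds = {lam: np.empty(n_repeats) for lam in lams}
for r in range(n_repeats):
    y = true_f(x) + rng.normal(0, sigma, n_train)  # a fresh noisy training set
    for lam in lams:
        preds[lam][r] = ridge_predict(y, lam)

summary = {
    lam: {
        "bias_sq": float((p.mean() - true_f(x0)) ** 2),
        "variance": float(p.var()),
    }
    for lam, p in preds.items()
}
for lam, terms in summary.items():
    print(f"lambda={lam}: bias^2={terms['bias_sq']:.4f}  variance={terms['variance']:.4f}")
```

The penalty shrinks the fitted coefficients, which lowers the variance of the prediction, typically at the cost of some added bias; whether the trade is worthwhile is what validation error decides.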

Limitations

  • In practice, the bias and variance terms cannot be computed exactly because the true function $f(x)$ and the data distribution are unknown.

  • Modern deep learning can exhibit 'double descent': past the point where the model interpolates the training data, test error may fall again, so heavily overparameterized models can generalize well despite the classical U-shaped curve.

  • Does not choose hyperparameters directly; validation or cross-validation is still required.
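The first limitation applies to real data. In a simulation where the true function is chosen by hand, both terms can be estimated by refitting on many independent training sets. The following is a minimal sketch under that assumption; `true_f`, the noise level `sigma`, the query point `x0`, and the helper `estimate_terms` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # assumed noise standard deviation
x0 = 0.25    # query point at which the terms are estimated


def true_f(x):
    # Known only because this is a simulation.
    return np.sin(2 * np.pi * x)


def estimate_terms(degree, n_train=30, n_repeats=2000):
    """Monte Carlo estimates of bias^2, variance, and expected squared error at x0."""
    preds = np.empty(n_repeats)
    sq_err = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set and refit.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
        # Draw an independent noisy observation at the query point.
        y0 = true_f(x0) + rng.normal(0, sigma)
        sq_err[r] = (y0 - preds[r]) ** 2
    bias_sq = float((preds.mean() - true_f(x0)) ** 2)
    variance = float(preds.var())
    return bias_sq, variance, float(sq_err.mean())


results = {d: estimate_terms(d) for d in (1, 5, 9)}
for d, (bias_sq, variance, mse) in results.items():
    total = bias_sq + variance + sigma**2
    print(
        f"degree={d}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
        f"bias^2+variance+sigma^2={total:.4f}  simulated error={mse:.4f}"
    )
```

The last two printed columns should agree up to Monte Carlo error, which is the decomposition checked numerically. On real data none of this is available, which is why validation error is used as the practical proxy.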

Key Assumptions

Scope conditions and interpretation notes

  1. The training and test sets are sampled from the identical underlying probability distribution.

  2. The irreducible noise floor is stationary and independent of model parameters.

References

Books and papers for deeper study

  • Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning. 2nd edn. New York: Springer.