L1 & L2 Regularization

Regularization reduces overfitting by adding a penalty for overly large or complex model parameters. The model must now fit the data while keeping its weights under control.

  • L1 Regularization (Lasso): Adds a penalty proportional to the absolute values of the weights. With a strong enough penalty, this drives some parameters to exactly zero, producing sparse models and performing automatic feature selection.

  • L2 Regularization (Ridge / Weight Decay): Adds a penalty proportional to the squared values of the weights. This shrinks parameters toward zero but keeps them non-zero, distributing weights smoothly.
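The difference between "exactly zero" and "small but non-zero" can be seen in a one-weight toy problem. The sketch below assumes a simplified loss of the form 0.5 * (w - z)^2, where z is the unpenalized weight; the function names and the sample values are illustrative, not part of the original text.

```python
def shrink_l1(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any |z| <= lam is set to exactly zero


def shrink_l2(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


# Hypothetical unpenalized weights and penalty strength
unpenalized = [2.0, 0.3, -0.05, -1.2]
lam = 0.5

l1_weights = [shrink_l1(z, lam) for z in unpenalized]
l2_weights = [shrink_l2(z, lam) for z in unpenalized]

print("L1:", l1_weights)  # the two small weights become exactly 0.0
print("L2:", l2_weights)  # with lam = 0.5 every weight is halved, none is zero
```

In this simplified setting, L1 subtracts a fixed amount from each weight's magnitude (and clips at zero), while L2 rescales every weight by the same factor.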

Best use

L1 performs feature selection, identifying the most predictive variables by zeroing out irrelevant ones.

Watch out for

Choosing the hyperparameter $\lambda$ requires validation or cross-validation.

Intuition

How to think about this algorithm

Imagine fitting a model to noisy coordinates. Without penalties, the model can use extreme weights to chase every small fluctuation in the training set. That can reduce training loss while hurting generalization.

Regularization makes that behavior expensive. Geometrically:

  • L1 constraint forms a diamond ($|w_1| + |w_2| \le C$). Contours of the unregularized loss often first touch the diamond at one of its sharp corners, which lie directly on the axes (meaning one weight is exactly $0$).

  • L2 constraint forms a circle/hypersphere ($w_1^2 + w_2^2 \le C^2$). The circle has no corners, so loss contours typically touch it at a point where both weights are non-zero.
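The geometric picture can be checked numerically in a special case. The sketch below assumes the unregularized loss has circular contours around its optimum, so the constrained optimum is simply the Euclidean projection of that optimum onto the constraint set; for elliptical contours a general solver would be needed. The function names and the sample point are illustrative assumptions.

```python
import math


def project_l2_ball(w, c):
    """Euclidean projection of w onto {v : sum(v_i**2) <= c**2}."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    return [x * c / norm for x in w]


def project_l1_ball(w, c):
    """Euclidean projection of w onto {v : sum(|v_i|) <= c}, sort-based."""
    if sum(abs(x) for x in w) <= c:
        return list(w)
    magnitudes = sorted((abs(x) for x in w), reverse=True)
    running, theta = 0.0, 0.0
    for j, u in enumerate(magnitudes, start=1):
        running += u
        candidate = (running - c) / j
        if u - candidate > 0:
            theta = candidate  # threshold from the largest valid prefix
    # Soft-threshold every coordinate by theta
    result = []
    for x in w:
        if abs(x) <= theta:
            result.append(0.0)
        else:
            result.append(x - theta if x > 0 else x + theta)
    return result


w_star = [0.5, 3.0]  # hypothetical unregularized optimum
budget = 1.0         # the constraint size C

w_l1 = project_l1_ball(w_star, budget)
w_l2 = project_l2_ball(w_star, budget)
print("L1:", w_l1)  # lands on a corner of the diamond: first weight is 0.0
print("L2:", w_l2)  # lands on the circle with both weights non-zero
```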

Interactive Diagram: L1 vs L2 Parameter Constraints

Toggle between L1 (Lasso) and L2 (Ridge), and slide C to shrink the parameter budget. Observe how the loss contours intersect the constraint boundary. A smaller C enforces tighter regularization and shrinks the parameters; a larger C expands the budget. The plot marks the unregularized optimum w*, the regularized optimum w_reg, the constraint bound C, and the intersecting contour.

Key Insight: L1 regularization (diamond constraint) has sharp corners on the axes, so the optimum often lands with one coefficient exactly zero. L2 (circle constraint) shrinks parameters smoothly without that sparsity pressure.

The Logic

Mathematical core for L1 & L2 regularization

Let the original loss function (e.g., MSE) be $L_0(\mathbf{w})$.

1. L1 Regularization (Lasso Loss)

Adds the $L_1$ norm penalty of the weight vector:

$$L(\mathbf{w}) = L_0(\mathbf{w}) + \lambda \sum_{j=1}^{d} |w_j| = L_0(\mathbf{w}) + \lambda \|\mathbf{w}\|_1$$

where $\lambda \ge 0$ is the regularization strength.

2. L2 Regularization (Ridge Regression Loss)

Adds the squared $L_2$ norm penalty of the weight vector:

$$L(\mathbf{w}) = L_0(\mathbf{w}) + \lambda \sum_{j=1}^{d} w_j^2 = L_0(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2$$
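As a concrete reading of the two loss definitions, the sketch below evaluates both penalized objectives in plain Python. It assumes $L_0$ is the mean squared error of a linear model without an intercept; the helper names and the toy data are illustrative.

```python
def mse(weights, rows, targets):
    """Mean squared error of a linear model without intercept (L_0)."""
    total = 0.0
    for x, y in zip(rows, targets):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (y - prediction) ** 2
    return total / len(targets)


def l1_objective(weights, rows, targets, lam):
    """L_0(w) + lam * sum(|w_j|)."""
    return mse(weights, rows, targets) + lam * sum(abs(w) for w in weights)


def l2_objective(weights, rows, targets, lam):
    """L_0(w) + lam * sum(w_j**2)."""
    return mse(weights, rows, targets) + lam * sum(w * w for w in weights)


# Toy data that the weights below fit exactly, so L_0 = 0
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, -1.0, 1.0]
weights = [2.0, -1.0]
lam = 0.1

l1_value = l1_objective(weights, rows, targets, lam)  # 0.1 * (2 + 1)
l2_value = l2_objective(weights, rows, targets, lam)  # 0.1 * (4 + 1)
print(l1_value, l2_value)
```

Because the data term is zero here, the printed values are the penalties alone, which makes the difference between summing absolute values and summing squares easy to see.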

3. Gradient Analysis (Weight Decay)

Under L2 regularization, the gradient update step for weight $w_j$ with learning rate $\alpha$ becomes:

$$w_j \leftarrow w_j - \alpha \left( \frac{\partial L_0}{\partial w_j} + 2\lambda w_j \right) = (1 - 2\alpha\lambda)w_j - \alpha \frac{\partial L_0}{\partial w_j}$$

The factor $(1 - 2\alpha\lambda)$ is less than 1 whenever $\alpha\lambda > 0$, so it shrinks the weight at every iteration, which is why L2 regularization is also known as weight decay. In practice $2\alpha\lambda < 1$ is required so that the factor stays positive.
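The two forms of the update can be compared directly. The following sketch handles a single scalar weight, with the data gradient passed in as a number; the function names and the values of the learning rate and penalty are illustrative assumptions.

```python
def l2_step(w, grad_l0, alpha, lam):
    """One gradient step on L_0(w) + lam * w**2 for a single weight."""
    return w - alpha * (grad_l0 + 2.0 * lam * w)


def decay_step(w, grad_l0, alpha, lam):
    """The same step written as multiplicative decay plus the usual update."""
    return (1.0 - 2.0 * alpha * lam) * w - alpha * grad_l0


alpha, lam = 0.1, 0.5  # decay factor is 1 - 2 * 0.1 * 0.5 = 0.9

# Both forms give the same result for an arbitrary data gradient
same = abs(l2_step(1.0, 0.3, alpha, lam) - decay_step(1.0, 0.3, alpha, lam)) < 1e-12

# With a zero data gradient, only the decay acts on the weight
w = 1.0
history = []
for _ in range(3):
    w = decay_step(w, 0.0, alpha, lam)
    history.append(w)

print(same, history)  # the weight shrinks by a factor of 0.9 per step
```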

Code Example

l1_&_l2_regularization.py · scikit-learn example

Python
model_fitting.py
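import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data for illustration (an assumption: the original snippet left X and y undefined)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

# L1 Regularization (Lasso)
# alpha plays the role of lambda. scikit-learn's Lasso scales the squared-error
# term by 1 / (2 * n_samples), while Ridge uses the unscaled sum of squares,
# so the same alpha does not mean the same strength in both models.
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)

# L2 Regularization (Ridge)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)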
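The alpha values in the example are fixed by hand, whereas in practice the strength is usually chosen by validation. The sketch below shows the idea with k-fold cross-validation for a one-feature ridge model, written in plain Python so that it has no dependencies. It is not scikit-learn's implementation; the helper names, the synthetic data, and the candidate grid are assumptions made for illustration.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2, one feature."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        valid = [i for i in range(n) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in valid)
    return total / n  # every sample is validated exactly once


# Synthetic data: y = 2 * x plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)
print("selected lambda:", best_lam)
```

The largest candidate shrinks the weight far below its true value, so its validation error is much higher than that of the unpenalized fit; this is the underfitting regime.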

Strengths

  • L1 performs feature selection, identifying the most predictive variables by zeroing out irrelevant ones.

  • L2 handles collinearity (correlated features) effectively, sharing predictive weight among them.

  • Often improves generalization on small or noisy datasets, where unregularized models overfit most easily.
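The collinearity point can be illustrated with the extreme case of two identical features. The sketch below solves the two-feature ridge normal equations $(X^\top X + \lambda I)\mathbf{w} = X^\top \mathbf{y}$ by Cramer's rule in plain Python; the function name and the toy data are illustrative assumptions.

```python
def ridge_2d(rows, targets, lam):
    """Solve (X^T X + lam * I) w = X^T y for two features via Cramer's rule."""
    a = sum(r[0] * r[0] for r in rows) + lam
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows) + lam
    p = sum(r[0] * y for r, y in zip(rows, targets))
    q = sum(r[1] * y for r, y in zip(rows, targets))
    det = a * d - b * b  # zero when lam = 0 and the features are identical
    return [(p * d - b * q) / det, (a * q - b * p) / det]


xs = [1.0, 2.0, 3.0, 4.0]
rows = [[x, x] for x in xs]      # second feature duplicates the first
targets = [3.0 * x for x in xs]  # underlying relationship: y = 3 * x

w = ridge_2d(rows, targets, lam=1.0)
print(w, sum(w))  # two equal weights whose sum is slightly below 3
```

Without the penalty the system is singular, because any pair of weights summing to 3 fits equally well. The L2 penalty picks the pair with equal weights.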

Limitations

  • Choosing the hyperparameter $\lambda$ requires validation or cross-validation.

  • Lasso (L1) has no general closed-form solution because the absolute-value penalty is not differentiable at zero, so it is fit iteratively (for example, by coordinate descent).

  • Extreme regularization causes high bias (underfitting), flattening the model predictions.
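To make the coordinate descent remark concrete, here is a minimal sketch of cyclic coordinate descent for the Lasso in plain Python. It assumes the data term is (1/(2n)) times the sum of squared residuals with no intercept; the function names are illustrative. The toy design is deliberately orthogonal so the result can be checked by hand, although the routine itself does not require that.

```python
import itertools


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, lam, sweeps=100):
    """Minimize (1/(2n)) * sum(residual**2) + lam * sum(|w_j|), no intercept."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for x, y in zip(rows, targets):
                partial = sum(w[k] * x[k] for k in range(d) if k != j)
                rho += x[j] * (y - partial)
                z += x[j] * x[j]
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Full factorial design: three mutually orthogonal +/-1 features
rows = [list(r) for r in itertools.product([-1.0, 1.0], repeat=3)]
targets = [2.0 * r[0] + 0.1 * r[1] - 0.2 * r[2] for r in rows]

w = lasso_coordinate_descent(rows, targets, lam=0.5)
print(w)  # the strong feature is shrunk, the two weak ones are exactly 0.0
```

Each coordinate update is a one-dimensional Lasso problem with a closed-form soft-thresholding solution, which is why the method suits this penalty.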

Key Assumptions

Scope conditions and interpretation notes

  1. Features are normalized or scaled before fitting, so that the penalty treats all coefficients comparably regardless of each feature's units.

  2. For L1 (Lasso), the underlying true model is assumed to be sparse (most true weights are zero).
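The scaling assumption matters because a coefficient's size depends on its feature's units, and the penalty acts on coefficient size. The sketch below standardizes a feature in plain Python and shows that the same measurement in two different units becomes identical after standardization. The function name and the height data are hypothetical.

```python
import statistics


def standardize(column):
    """Rescale a feature to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


height_m = [1.60, 1.70, 1.80, 1.90]
height_mm = [h * 1000 for h in height_m]  # same information, different units

scaled_m = standardize(height_m)
scaled_mm = standardize(height_mm)
print(scaled_m)
print(scaled_mm)  # matches scaled_m up to floating-point rounding
```

On the raw columns, a model would need a coefficient 1000 times larger for metres than for millimetres to make the same predictions, so the penalty would treat the two encodings very differently.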
