L1 & L2 Regularization
Regularization reduces overfitting by adding a penalty for overly large or complex model parameters. The model must now fit the data while keeping its weights under control.
-
L1 Regularization (Lasso): Adds a penalty proportional to the absolute values of the weights. This drives many parameters to exactly zero, producing sparse models and performing automatic feature selection.
-
L2 Regularization (Ridge / Weight Decay): Adds a penalty proportional to the squared values of the weights. This shrinks parameters toward zero but keeps them non-zero, distributing weights smoothly.
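The contrast between the two penalties is easy to check empirically. The sketch below is illustrative only: the synthetic dataset (10 features, of which only the first two influence the target) and the alpha values are assumptions chosen for the demonstration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical synthetic data: 200 samples, 10 features, only the first 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

On data like this, Lasso is expected to zero out most of the uninformative features, while Ridge leaves every coefficient small but non-zero.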
Intuition
How to think about this algorithm
Imagine fitting a model to noisy coordinates. Without penalties, the model can use extreme weights to chase every small fluctuation in the training set. That can reduce training loss while hurting generalization.
Regularization makes that behavior expensive. Geometrically:
-
L1 constraint forms a diamond ($|w_1| + |w_2| \le C$). Contours of the unregularized loss are likely to first touch the diamond at one of its sharp corners, which lie on the axes (meaning one weight is exactly $0$).
-
L2 constraint forms a circle/hypersphere ($w_1^2 + w_2^2 \le C$). Loss contours typically touch the circle at a point where both weights are non-zero.
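The corner effect also shows up in a one-dimensional version of the problem. As a minimal sketch (the helper names `l1_shrink` and `l2_shrink` and the sample values are made up for illustration), take an unregularized weight z and minimize 0.5 (w - z)^2 plus each penalty. The L1 solution is soft-thresholding, which snaps small weights to exactly zero. The L2 solution only rescales the weight.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # illustrative unregularized weights
lam = 0.5

w_l1 = l1_shrink(z, lam)  # entries with |z| <= lam become exactly zero
w_l2 = l2_shrink(z, lam)  # every entry is scaled by 1 / (1 + 2 * lam)
print("L1:", w_l1)
print("L2:", w_l2)
```

Weights whose magnitude is below the threshold vanish under L1, whereas L2 keeps all of them at a reduced size.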
[Interactive figure: L1 vs L2 Parameter Constraints. Loss contours shown against the L1 (Lasso) diamond or the L2 (Ridge) circle as the parameter budget C changes.]
Smaller C enforces tighter regularization, shrinking parameters. Larger C expands the budget.
The Logic
Mathematical core for L1 & L2 regularization
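Let the original loss function (e.g., MSE) be $L_0(w)$.
1. L1 Regularization (Lasso Loss)
Adds the $\ell_1$ norm penalty of the weight vector:
$$L_{\text{L1}}(w) = L_0(w) + \lambda \lVert w \rVert_1 = L_0(w) + \lambda \sum_i |w_i|$$
Where $\lambda \ge 0$ is the regularization strength.
2. L2 Regularization (Ridge Regression Loss)
Adds the squared $\ell_2$ norm penalty of the weight vector:
$$L_{\text{L2}}(w) = L_0(w) + \lambda \lVert w \rVert_2^2 = L_0(w) + \lambda \sum_i w_i^2$$
3. Gradient Analysis (Weight Decay)
Under L2 regularization, the gradient update step for weight $w_i$ with learning rate $\eta$ becomes:
$$w_i \leftarrow w_i - \eta \left( \frac{\partial L_0}{\partial w_i} + 2 \lambda w_i \right) = (1 - 2 \eta \lambda)\, w_i - \eta \frac{\partial L_0}{\partial w_i}$$
Where the factor $(1 - 2 \eta \lambda)$ shrinks the weight at every iteration (known as weight decay).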
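The weight-decay update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the base loss is MSE for a linear model, the L2 penalty is written as lam * ||w||^2, and the function name `ridge_gradient_descent`, the learning rate, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, lr=0.05, n_steps=500):
    """Minimize MSE(w) + lam * ||w||^2 by plain gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        # Gradient of the unregularized MSE term.
        grad_data = (2.0 / n_samples) * X.T @ (X @ w - y)
        # Same as w - lr * (grad_data + 2 * lam * w): decay first, then step.
        w = (1.0 - 2.0 * lr * lam) * w - lr * grad_data
    return w

# Illustrative synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w_plain = ridge_gradient_descent(X, y, lam=0.0)  # no penalty
w_decay = ridge_gradient_descent(X, y, lam=0.5)  # with weight decay
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

With a positive penalty, the learned weight vector should have a smaller norm than the unpenalized one, since each step multiplies the weights by a factor below one before applying the data gradient.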
Code Example
l1_&_l2_regularization.py · scikit-learn example
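import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data for illustration only: 100 samples, 5 features, 2 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# L1 Regularization (Lasso)
# alpha plays the role of the regularization strength lambda
# (scikit-learn scales the Lasso data-fit term by 1 / (2 * n_samples))
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)

# L2 Regularization (Ridge)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)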
Strengths
L1 performs feature selection, identifying the most predictive variables by zeroing out irrelevant ones.
L2 handles collinearity (correlated features) effectively, sharing predictive weight among them.
Both penalties often improve generalization on small or noisy datasets, where overfitting is most likely.
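The collinearity point can be illustrated with an extreme case. In the sketch below, which uses made-up data, a single feature is duplicated so the two columns are perfectly correlated. Ridge is expected to split the weight evenly between the copies.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])  # two perfectly correlated (duplicated) features
y = 2.0 * x + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)
```

The two coefficients should come out nearly identical and sum to roughly the true effect of 2, rather than one copy receiving an arbitrarily large weight and the other a compensating one.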
Limitations
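Choosing the regularization strength $\lambda$ requires validation or cross-validation.
Lasso (L1) has no closed-form solution in general, because the absolute-value penalty is not differentiable at zero. It is fitted with an iterative method such as coordinate descent.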
Extreme regularization causes high bias (underfitting), flattening the model predictions.
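One common way to pick the regularization strength is cross-validation over a grid of candidates. The following is a sketch using scikit-learn's `LassoCV` and `RidgeCV`. The grid, the fold count, and the synthetic data are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Illustrative synthetic data: only the first two features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 1, 20)  # candidate strengths (an arbitrary grid)

# Each estimator evaluates every candidate with 5-fold cross-validation
# and refits on the full data using the best one.
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("Lasso alpha:", lasso_cv.alpha_)
print("Ridge alpha:", ridge_cv.alpha_)
```

The selected values are stored in the `alpha_` attribute of each fitted estimator. A value at either end of the grid usually suggests widening the search range.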
Key Assumptions
Scope conditions and interpretation notes
- 1
Features are normalized or scaled before fitting to ensure regularizers penalize parameters equally.
- 2
For L1 (Lasso), the underlying true model is assumed to be sparse (most true weights are zero).
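The scaling assumption is usually handled by standardizing features inside a pipeline, so the same transformation is applied at fit and predict time. A minimal sketch, assuming made-up data in which one feature is measured on a scale 1000 times larger than the other:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] *= 1000.0  # second feature on a much larger scale
y = X[:, 0] + X[:, 1] / 1000.0 + rng.normal(scale=0.1, size=200)

# Standardize each feature to zero mean and unit variance, then fit Ridge.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

coefs = model.named_steps["ridge"].coef_
print(coefs)
```

Both features contribute equally to the target here, and after standardization their coefficients should be of similar size, so the penalty treats them alike.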
References
Books and papers for deeper study
Tibshirani, R. (1996) 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society: Series B (Methodological), 58(1), pp. 267-288.