L1 & L2 Regularization

Difficulty: Intermediate

Reading Time: 20 min

Track: Practitioner

Preventing overfitting by adding a penalty on parameter size (L1 for Lasso, L2 for Ridge) to the loss objective.

Prerequisites

Linear Regression
Applied ML Workflow & Concepts


TL;DR

Regularization adds a penalty on the weights to the loss, trading a little extra bias for a lot less variance.
Ridge (L2) penalizes $\sum w_j^2$ , giving a closed-form solution $\hat w = (X^TX + \lambda I)^{-1}X^Ty$ that shrinks weights smoothly toward zero but, in general, does not set them exactly to $0$ .
Lasso (L1) penalizes $\sum |w_j|$ , which has corners on the axes — this geometry drives many weights to exactly $0$ , giving automatic feature selection and sparse models.
Elastic Net mixes L1 and L2 ( $\alpha$ controls the blend), keeping Lasso’s sparsity while sharing weight across correlated features like Ridge.
The strength $\lambda$ controls the bias-variance dial: $\lambda = 0$ recovers plain OLS, while $\lambda \to \infty$ shrinks all weights to (near) zero, causing severe underfitting.
Always standardize features before regularizing. Otherwise the penalty depends on each feature's units: a feature measured on a small scale needs a large coefficient and is penalized more heavily than one measured on a large scale.

Learning Objectives

Explain the purpose of regularization in reducing variance
Differentiate between L1 (Lasso) and L2 (Ridge) regularization mathematically and geometrically
Describe why L1 regularization leads to parameter sparsity and feature selection
List other regularization techniques such as dropout, early stopping, and data augmentation

Intuition

How to think conceptually about this topic

Imagine fitting a model to noisy coordinates. Without penalties, the model can use extreme weights to chase every small fluctuation in the training set. That can reduce training loss while hurting generalization.

Regularization makes that behavior expensive. Geometrically:

L1 constraint forms a diamond ( $|w_1| + |w_2| \le C$ ), or a cross-polytope in higher dimensions. The contours of the unregularized loss often first touch the diamond at one of its sharp corners, which lie on the axes, so one weight is exactly $0$ .
L2 constraint forms a circle ( $w_1^2 + w_2^2 \le C^2$ ), or a hypersphere in higher dimensions. It has no corners, so the loss contours typically touch it at a point off the axes, where both weights are shrunk but non-zero.
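The geometric argument above can be checked numerically. The sketch below is a minimal illustration on synthetic data (the data, the true weights, and the penalty strengths are all assumptions chosen for the example): it fits Lasso and Ridge to the same problem and counts how many weights end up exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed for illustration): 10 features, only the first 3 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.2).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

# Count weights that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Lasso weights exactly zero:", n_zero_lasso)
print("Ridge weights exactly zero:", n_zero_ridge)
```

Lasso should report several exact zeros among the irrelevant features, while Ridge leaves every weight small but non-zero.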

In Depth

Detailed explanations, contexts, and details

Regularization reduces overfitting by adding a penalty for overly large or complex model parameters. The model must now fit the data while keeping its weights under control.

L1 Regularization (Lasso): Adds a penalty proportional to the absolute values of the weights. This drives many parameters to exactly zero, producing sparse models and performing automatic feature selection.
L2 Regularization (Ridge / Weight Decay): Adds a penalty proportional to the squared values of the weights. This shrinks parameters toward zero but keeps them non-zero, distributing weights smoothly.
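For reference, the two penalized objectives can be written out explicitly. The sketch below assumes a squared-error loss with no intercept and reuses this module's symbols ( $w$ , $\lambda$ ). The per-weight formulas in the second half hold only in the special case of orthonormal features ( $X^TX = I$ ), an assumption made here purely to obtain a closed form for Lasso.

```latex
\begin{aligned}
\hat w^{\text{ridge}} &= \arg\min_w \; \lVert y - Xw \rVert_2^2 + \lambda \sum_j w_j^2 \\
\hat w^{\text{lasso}} &= \arg\min_w \; \lVert y - Xw \rVert_2^2 + \lambda \sum_j \lvert w_j \rvert \\[6pt]
\text{If } X^TX = I \text{ and } \hat w^{\text{ols}} = X^Ty: \quad
\hat w_j^{\text{ridge}} &= \frac{\hat w_j^{\text{ols}}}{1 + \lambda} \\
\hat w_j^{\text{lasso}} &= \operatorname{sign}\!\left(\hat w_j^{\text{ols}}\right)\,
  \max\!\left(\lvert \hat w_j^{\text{ols}} \rvert - \tfrac{\lambda}{2},\; 0\right)
\end{aligned}
```

In this special case Ridge rescales every weight by the same factor $1/(1+\lambda)$ , so a weight reaches zero only if its OLS value was already zero. Lasso instead subtracts a fixed amount and clips at zero, so any weight with $|\hat w_j^{\text{ols}}| \le \lambda/2$ becomes exactly $0$ .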

How It Compares

Ridge vs Lasso vs Elastic Net

| Dimension | Ridge | Lasso | Elastic Net |
| --- | --- | --- | --- |
| Penalty term | L2: $\lambda \sum w_j^2$ | L1: $\lambda \sum \lvert w_j \rvert$ | Mix: $\lambda\left(\alpha \sum \lvert w_j \rvert + (1-\alpha)\sum w_j^2\right)$ |
| Feature selection | No: shrinks weights but does not set them to exactly $0$ | Yes: drives many weights to exactly $0$ | Yes: sparse, but less aggressively than pure Lasso |
| Handling correlated features | Excellent: splits weight roughly evenly across correlated features ("grouping effect") | Unstable: tends to pick one of a correlated group somewhat arbitrarily, zeroing the rest | Good: groups correlated features like Ridge while still allowing sparsity |
| Solution method | Closed form: $\hat w = (X^TX + \lambda I)^{-1}X^Ty$ | No closed form in general; coordinate descent or LARS | No closed form in general; coordinate descent |
| When to use | Many correlated features, want stable shrinkage, keep all features | Suspect a sparse true model, want automatic feature selection | High-dimensional + correlated features, want sparsity without Lasso's instability |

Takeaway: Reach for Ridge when you want stable shrinkage and believe most features matter a little; reach for Lasso when you believe only a few features truly matter and want them named explicitly; reach for Elastic Net when both correlated structure and sparsity are present, which is the common case in high-dimensional real-world data.
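The contrast in how Ridge and Lasso treat correlated features is easiest to see in the extreme case of two identical columns. The sketch below uses synthetic data and arbitrary penalty strengths (both are assumptions for illustration). Which of the two duplicates Lasso keeps depends on solver details, so treat the specific choice as incidental.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed): one underlying signal, duplicated into two columns.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
X = np.column_stack([x, x])  # perfectly correlated features
y = 2.0 * x + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge weights:", ridge.coef_)  # weight shared between the duplicates
print("Lasso weights:", lasso.coef_)  # weight concentrated on one duplicate
```

Ridge splits the weight between the two copies because the squared penalty is smallest when the weight is shared evenly. The L1 penalty is indifferent to how the total is split, so the solver is free to put all of it on one column.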

When to Use It

Reach for this when

You have more features than you trust, or more features than samples ( $d > n$ ), where plain OLS is unstable or has no unique solution.
Your features are correlated and OLS coefficients are unstable or wildly large in magnitude (a sign of high variance).
You want a model with built-in feature selection (Lasso/Elastic Net) to identify the small subset of variables that matter.
You are seeing a large gap between training and validation error — i.e. the model is overfitting — and need a principled way to add bias and reduce variance.
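The $d > n$ case in the list above can be made concrete with a few lines of NumPy. This is a minimal sketch on random synthetic data, with no intercept and an arbitrary $\lambda = 1$ (all assumptions for illustration). It shows that $X^TX$ is rank-deficient, so OLS has no unique solution, while the Ridge system remains solvable.

```python
import numpy as np

# Synthetic data (assumed): more features (d) than samples (n).
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

gram = X.T @ X                       # d x d matrix with rank at most n
rank = np.linalg.matrix_rank(gram)
print(f"rank of X^T X: {rank} (needs {d} to be invertible)")

# Adding lambda * I makes the matrix positive definite, hence invertible.
lam = 1.0
w_ridge = np.linalg.solve(gram + lam * np.eye(d), X.T @ y)
print("ridge weight norm:", np.linalg.norm(w_ridge))
```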

Avoid it when

You have abundant data relative to the number of features and OLS is already stable and low-variance — added bias would only hurt.
Interpretability of unbiased coefficients (e.g. for causal inference or formal statistical hypothesis testing) is essential, since regularized coefficients are intentionally biased.
You haven't standardized your features. The penalty then depends on each feature's units: features on small scales need large coefficients and are shrunk more heavily than features on large scales, distorting which variables look "important".
The true relationship is highly non-linear and no amount of weight shrinkage will fix a fundamentally mis-specified linear model — consider tree ensembles or kernel methods instead.

Rules of thumb

Always standardize (zero mean, unit variance) features before applying L1/L2 penalties, since the penalty is scale-sensitive.
Select $\lambda$ (and the Elastic Net mixing parameter $\alpha$ ) via k-fold cross-validation, picking the largest $\lambda$ within one standard error of the minimum CV error ("one-standard-error rule") for a simpler, more robust model.
If you need feature selection but have many correlated predictors, prefer Elastic Net over pure Lasso to avoid its arbitrary, unstable feature-picking among correlated groups.
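The first two rules of thumb can be sketched with scikit-learn as follows. The data is synthetic with deliberately mixed feature scales, and the one-standard-error selection is computed by hand from `LassoCV`'s `alphas_` and `mse_path_` attributes. The variable names (`alpha_min`, `alpha_1se`) are illustrative, and scikit-learn's `alpha` plays the role of $\lambda$ here.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 20 features on very different scales, 3 informative.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 20))
X = Z * rng.uniform(0.1, 100.0, size=20)
y = Z[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=1.0, size=200)

# Standardize, then cross-validate over a path of penalty strengths.
cv_model = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
lasso_cv = cv_model.named_steps["lassocv"]

# One-standard-error rule: largest penalty whose mean CV error is within
# one standard error of the minimum mean CV error.
mean_mse = lasso_cv.mse_path_.mean(axis=1)
n_folds = lasso_cv.mse_path_.shape[1]
se_mse = lasso_cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)
best = int(np.argmin(mean_mse))
threshold = mean_mse[best] + se_mse[best]

alpha_min = lasso_cv.alpha_
alpha_1se = lasso_cv.alphas_[mean_mse <= threshold].max()
print("alpha at minimum CV error:", alpha_min)
print("alpha from one-SE rule:   ", alpha_1se)

# Refit with the more strongly regularized choice.
final_model = make_pipeline(StandardScaler(), Lasso(alpha=alpha_1se)).fit(X, y)
```

One simplification to be aware of: the scaler here is fit once on all rows before `LassoCV` splits its internal folds. Cross-validating the whole pipeline (for example with `GridSearchCV`) would refit the scaler inside each fold.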

Implementation

Reference code implementation

Python

model_fitting.py

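import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data so the snippet runs end to end (replace with your own X, y)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# L1 regularization (Lasso)
# scikit-learn's `alpha` is the penalty strength (lambda in this module),
# not the Elastic Net mixing weight. Its scaling relative to the data term
# differs between Lasso and Ridge, so the two values are not directly comparable.
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)

# L2 regularization (Ridge)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)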
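Elastic Net follows the same pattern, with one naming trap worth flagging. In scikit-learn's `ElasticNet`, `alpha` is the overall penalty strength (this module's $\lambda$ ) and `l1_ratio` is the L1/L2 blend (this module's $\alpha$ ). The sketch below uses synthetic data and arbitrary settings, both assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (assumed): 5 features, two of them irrelevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha    -> overall strength (lambda in this module)
# l1_ratio -> share of the penalty that is L1 (alpha in this module);
#             l1_ratio=1.0 corresponds to Lasso, values near 0 behave like Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net weights:", enet.coef_)
```

scikit-learn also scales the squared-error and L2 terms by constant factors, so its `alpha` matches $\lambda$ in role rather than in exact numerical value.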

Strengths & Advantages

L1 performs feature selection, identifying the most predictive variables by zeroing out irrelevant ones.
L2 handles collinearity (correlated features) effectively, sharing predictive weight among them.
Highly effective at improving generalization performance on small or noisy datasets.

Limitations & Drawbacks

Choosing the hyperparameter $\lambda$ requires validation or cross-validation.
Lasso (L1) has no closed-form solution in general, so it must be solved iteratively (for example with coordinate descent).
Extreme regularization causes high bias (underfitting), flattening the model predictions.
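The underfitting described in the last point can be observed by sweeping the penalty strength. The sketch below is a minimal example on synthetic data with an arbitrary grid of strengths (both assumptions for illustration). It tracks the norm of the Ridge weights and the spread of the model's predictions as $\lambda$ grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed): 10 features, all carrying some signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + rng.normal(scale=0.5, size=100)

lambdas = [1e-2, 1e0, 1e2, 1e4, 1e6]
weight_norms, prediction_spreads = [], []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X, y)
    weight_norms.append(float(np.linalg.norm(model.coef_)))
    # Standard deviation of predictions: near zero means a flat model.
    prediction_spreads.append(float(np.std(model.predict(X))))
    print(f"lambda={lam:g}  ||w||={weight_norms[-1]:.4f}  "
          f"std(pred)={prediction_spreads[-1]:.4f}")
```

At the largest strengths the weights collapse toward zero and the predictions barely vary, so the model effectively outputs the mean of `y` regardless of the input.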

Real-World Case Studies

Gene selection from microarray expression data

Genomics / bioinformatics

Scenario

A cancer classification study has gene-expression measurements for roughly 20,000 genes but only a few hundred patient samples ( $d \gg n$ ). This is the classic high-dimensional, low-sample-size setting, where ordinary least squares has no unique solution because $X^TX$ is singular.

Approach

Researchers fit an L1-penalized (Lasso) regression/logistic model relating gene expression to a clinical outcome (e.g. tumor recurrence), sweeping $\lambda$ via cross-validation. Elastic Net is often preferred in practice over pure Lasso because many genes in the same biological pathway are highly co-expressed (correlated), and pure Lasso would arbitrarily select just one gene per pathway while ignoring its correlated partners.

Outcome

The penalty drives the vast majority of the ~20,000 gene coefficients to exactly zero, typically leaving a panel of only a few dozen genes (often 20-100) with non-zero weight. The result is both statistically tractable and biologically interpretable as a candidate biomarker panel, whereas an unregularized fit is not even well-defined in this setting.

Source: The Elements of Statistical Learning (Ch. 18, High-Dimensional Problems) — Hastie, T., Tibshirani, R. and Friedman, J.

Common Misconceptions

Misconception: Regularization is used to reduce model training error.

Correction: Regularization adds a penalty that deliberately increases training error (adds bias) in order to decrease generalization/validation error on unseen data (reduces variance).

Misconception: L1 and L2 regularization cannot be combined.

Correction: They are combined in a method called Elastic Net regularization, which balances the benefits of both sparsity and weight distribution.
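The first misconception above can be checked directly. The sketch below compares plain OLS with Ridge on a small synthetic problem (the data and the penalty strength are assumptions for illustration). Because OLS minimizes training error by definition, Ridge's training error can never be lower. The held-out numbers are printed for comparison, but how much Ridge helps there depends on the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed): 25 features, only the first carries signal,
# and few training rows, so OLS is prone to overfitting.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 25))
y_train = X_train[:, 0] + rng.normal(scale=1.0, size=40)
X_new = rng.standard_normal((1000, 25))
y_new = X_new[:, 0] + rng.normal(scale=1.0, size=1000)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

train_mse_ols = mean_squared_error(y_train, ols.predict(X_train))
train_mse_ridge = mean_squared_error(y_train, ridge.predict(X_train))
new_mse_ols = mean_squared_error(y_new, ols.predict(X_new))
new_mse_ridge = mean_squared_error(y_new, ridge.predict(X_new))

print(f"OLS   train MSE={train_mse_ols:.3f}  held-out MSE={new_mse_ols:.3f}")
print(f"Ridge train MSE={train_mse_ridge:.3f}  held-out MSE={new_mse_ridge:.3f}")
```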

References & Further Reading

Regression Shrinkage and Selection via the Lasso (journal article)
By Tibshirani, R.
Deep Learning, Chapter 7 (textbook)
By Goodfellow, I. et al.
