Linear Regression

Difficulty: Intermediate

Reading Time: 25 min

Track: Practitioner

A baseline model that predicts a continuous numeric value by fitting a straight line through the data.

TL;DR

Linear regression predicts a continuous target as $\hat{y} = w^T x + b$ and fits $w, b$ by minimizing mean squared error.
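Squared loss is convex, so any minimum is a global one; when $X^T X$ is invertible (no perfectly collinear features), the minimizer is unique and given by the closed-form normal equation $\hat{w} = (X^T X)^{-1} X^T y$, with $b$ absorbed into $w$ through a constant feature.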
Coefficients are directly interpretable: each one is the expected change in $y$ per unit change in a feature, holding the others fixed.
It is a fast, strong baseline, but assumes linearity in the parameters and is sensitive to outliers and multicollinearity.
When features are correlated or you want feature selection, reach for the regularized cousins Ridge (L2) and Lasso (L1).

Learning Objectives

Formulate the linear regression prediction equation
Define the Mean Squared Error (MSE) loss function
Derive the closed-form normal equation for ordinary least squares
Explain the impact of outliers and multicollinearity on linear regression performance

Intuition

How to think conceptually about this topic

A useful mental picture for linear regression is the "ruler problem": placing a straight edge through a scatter of points. Each data point has a vertical residual: the distance between the observed value and the line's prediction. Squaring these residuals makes large misses quadratically more expensive, so a residual 10 times larger costs 100 times more. As a result, a single extreme outlier can pull the entire fitted line toward it. This is not a flaw in the fitting procedure; it is exactly what the squared-error objective asks the model to do.
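The pull of a single outlier can be checked numerically. The sketch below uses made-up data (a perfectly linear set of ten points, which is an assumption for illustration only) and corrupts one observation before refitting.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 exactly.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


clean_slope, clean_intercept = fit_line(x, y)

# Corrupt a single point with a large vertical error.
y_outlier = y.copy()
y_outlier[-1] += 50.0
out_slope, out_intercept = fit_line(x, y_outlier)

print(f"clean fit:    slope={clean_slope:.3f}, intercept={clean_intercept:.3f}")
print(f"with outlier: slope={out_slope:.3f}, intercept={out_intercept:.3f}")
```

Only one of the ten points changed, yet both the slope and the intercept move noticeably, because the squared residual at that point dominates the loss.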

In Depth

Detailed explanations and context

Linear regression is one of the most fundamental algorithms in statistics and machine learning. It models the relationship between a dependent continuous variable (the target) and one or more independent variables (the features) by fitting a linear equation to the observed data.
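For reference, here is a compact sketch of the model, the loss, and the least-squares solution. It assumes $X$ is the $n \times d$ design matrix with one row per observation, that the bias $b$ has been absorbed into $w$ by appending a constant $1$ to every feature vector, and that $X^T X$ is invertible.

```latex
\begin{aligned}
\hat{y}_i &= w^T x_i + b \\
\mathrm{MSE}(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
% absorb b into w by appending a constant 1 to every x_i
\mathrm{MSE}(w) &= \frac{1}{n} \lVert y - Xw \rVert^2 \\
\nabla_w \mathrm{MSE}(w) &= -\frac{2}{n} X^T (y - Xw) = 0 \\
X^T X \hat{w} &= X^T y
  \quad\Longrightarrow\quad
  \hat{w} = (X^T X)^{-1} X^T y
\end{aligned}
```

The last step is where the invertibility assumption is used; with perfectly collinear features the normal equations still have solutions, but not a unique one.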

Pedagogically, linear regression exposes the entire machine learning workflow in miniature: choosing a model family (linear functions of the features), defining a loss function (mean squared error), and solving for the parameters that make that loss as small as possible (via closed-form ordinary least squares or gradient descent).
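The two solution routes mentioned above can be compared side by side. This is a minimal NumPy sketch on synthetic data; the data-generating weights, the learning rate, and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 3*x1 - 2*x2 + 0.5 + small noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=n)

# Append a column of ones so the intercept is learned as the last weight.
Xb = np.hstack([X, np.ones((n, 1))])

# 1) Closed form: solve the normal equations X^T X w = X^T y.
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# 2) Gradient descent on the mean squared error.
w_gd = np.zeros(Xb.shape[1])
learning_rate = 0.1
for _ in range(2000):
    grad = -(2.0 / n) * Xb.T @ (y - Xb @ w_gd)
    w_gd -= learning_rate * grad


def mse(w):
    """Mean squared error of weights w on the training data."""
    return np.mean((y - Xb @ w) ** 2)


print("closed form:     ", np.round(w_closed, 3), "MSE:", round(mse(w_closed), 4))
print("gradient descent:", np.round(w_gd, 3), "MSE:", round(mse(w_gd), 4))
```

Both routes should land on essentially the same weights here. Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse, which is generally the more numerically stable choice.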

Where is it used?

Linear regression is used to predict continuous numeric values: housing prices from square footage, product demand from seasonal advertising spend, temperature, or a customer risk score.

How It Compares

OLS vs Ridge vs Lasso

Dimension	OLS	Ridge	Lasso
Penalty term	None	L2: $\lambda\sum_j w_j^2$	L1: $\lambda\sum_j |w_j|$
Handles multicollinearity	Poorly: coefficients can blow up	Well: shrinks correlated weights together	Partly: tends to keep one of a correlated group, somewhat arbitrarily
Feature selection	No	No: shrinks weights but never to exactly zero	Yes: drives some weights to exactly $0$
Closed-form solution	Yes	Yes: $(X^T X + \lambda I)^{-1}X^T y$	No: needs an iterative solver such as coordinate descent
Use it when	Few, roughly independent features	Many correlated features	You suspect a sparse true model

Takeaway: Start with OLS as a baseline; switch to Ridge when features are correlated, and to Lasso when you also want automatic feature selection.
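To see the three estimators behave differently, the sketch below fits them on synthetic data with two nearly identical features and one irrelevant feature. The data, the `alpha` values, and the variable names are assumptions for illustration; scikit-learn's `alpha` plays the role of $\lambda$ (up to the scaling each estimator uses in its objective).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Synthetic features (assumption): x2 is almost a copy of x1, x3 is pure noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Only x1 truly drives the target.
y = 2.0 * x1 + rng.normal(scale=0.3, size=n)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_
    print(f"{name:5s} coefficients:", np.round(model.coef_, 3))
```

On data like this one would expect Ridge to return a shorter coefficient vector than OLS and Lasso to set at least the irrelevant feature to exactly zero. The exact numbers depend on the seed and on the penalty strength, so treat the printout as a qualitative check.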

When to Use It

Reach for this when

You expect an approximately linear relationship between features and target, or you can engineer features (logs, polynomials) that make it linear.
You need an interpretable model whose coefficients you can explain to stakeholders.
You want a fast, low-variance baseline to benchmark more complex models against.

Avoid it when

The relationship is strongly non-linear and resists feature engineering — prefer trees or neural networks.
Features are highly collinear — coefficients become unstable; use Ridge/Lasso instead.
The data has heavy outliers or non-constant error variance (heteroscedasticity) that violate OLS assumptions — consider robust or weighted regression.
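As one example of the robust alternatives mentioned in the last point, the sketch below compares ordinary least squares with scikit-learn's `HuberRegressor` on synthetic data containing two corrupted observations. Huber regression is only one possible robust method, and the data and default settings used here are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumption): true line y = 2x + 1 with mild noise.
x = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=30)

# Corrupt the last two observations with large errors.
y[-2:] += 40.0

ols = LinearRegression().fit(x, y)
huber = HuberRegressor().fit(x, y)

ols_slope = ols.coef_[0]
huber_slope = huber.coef_[0]
print(f"OLS slope:   {ols_slope:.3f}")
print(f"Huber slope: {huber_slope:.3f}")
```

The Huber loss is quadratic for small residuals and linear for large ones, so the corrupted points have bounded influence and the fitted slope stays closer to the true value of 2.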

Rules of thumb

Standardize features before comparing coefficient magnitudes.
If a variance inflation factor (VIF) exceeds ~10, treat multicollinearity seriously and regularize.
Always plot residuals against fitted values to check linearity and constant variance.
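The first two rules can be sketched in a few lines. The helper `vif` below is a hypothetical name, not a library function: it computes each feature's variance inflation factor as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 from regressing feature j on all other features.
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)  # strongly correlated with a
c = rng.normal(size=n)            # independent of both
X = np.column_stack([a, b, c])

vifs = vif(X)
print("VIF per feature:", np.round(vifs, 1))

# Standardize so that coefficient magnitudes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("column means after scaling:", np.round(X_std.mean(axis=0), 6))
print("column stds after scaling: ", np.round(X_std.std(axis=0), 6))
```

In this setup the two correlated columns should show VIF values far above 10, while the independent column stays close to 1.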

Implementation

Reference code implementation

Python

model_fitting.py

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Predict a numeric score from study hours
hours = np.array([[1.1], [2.1], [3.0], [4.0], [5.2], [6.1], [7.0], [8.4]])
score = np.array([2.2, 2.8, 4.1, 4.6, 5.9, 6.7, 7.6, 8.7])

lin = LinearRegression()
lin.fit(hours, score)
score_hat = lin.predict(hours)

print("Linear slope:", lin.coef_[0])
print("Linear intercept:", lin.intercept_)
print("MSE:", mean_squared_error(score, score_hat))

Strengths & Advantages

Extremely easy to interpret; coefficients show the direct impact of each feature.
Very fast to train and predict, serving as an excellent baseline.
Has a closed-form analytical solution.

Limitations & Drawbacks

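Assumes the target is linear in the chosen features; underfits when the true pattern is non-linear and the features are not transformed.
Highly sensitive to outliers, which pull the regression line disproportionately.
Coefficient estimates become unstable when input features are highly correlated (multicollinearity), because $X^T X$ is then close to singular.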

Real-World Case Studies

Allocating an advertising budget across channels

Marketing analytics

Scenario

A firm spends on TV, radio, and newspaper advertising across 200 markets and wants to know which channels actually drive sales (the Advertising dataset popularized in An Introduction to Statistical Learning). Looked at individually, all three channels appear positively correlated with sales.

Approach

Fit a multiple linear regression $\text{sales} \sim \text{TV} + \text{radio} + \text{newspaper}$, then inspect each coefficient, its statistical significance, and the overall fit, checking whether channels that look useful alone remain useful once the others are controlled for.

Outcome

TV and radio have strong, statistically significant positive coefficients, while the newspaper coefficient becomes small and not significant once TV and radio are included; the model explains roughly 90% of the variance in sales ($R^2 \approx 0.90$). The actionable lesson: a feature that correlates with the target on its own (newspaper) can turn out to be redundant once the other channels are accounted for, so budget is better shifted away from newspaper and toward TV and radio.
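The same pattern can be reproduced on simulated data. The sketch below does NOT use the real Advertising dataset: every number in it (budget ranges, effect sizes, noise levels) is a made-up assumption, chosen so that newspaper spend is correlated with radio spend but has no direct effect on sales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200  # same number of markets as the case study; all values are simulated

# Simulated budgets (assumptions). Newspaper spend tracks radio spend.
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
newspaper = 30.0 + 0.8 * radio + rng.normal(scale=8.0, size=n)

# Simulated sales depend on TV and radio only, never on newspaper.
sales = 3.0 + 0.05 * tv + 0.2 * radio + rng.normal(scale=1.0, size=n)

# Simple regression: newspaper alone looks useful.
simple = LinearRegression().fit(newspaper.reshape(-1, 1), sales)
simple_slope = simple.coef_[0]

# Multiple regression: newspaper's coefficient shrinks toward zero.
X = np.column_stack([tv, radio, newspaper])
multi = LinearRegression().fit(X, sales)
tv_coef, radio_coef, newspaper_coef = multi.coef_

print(f"newspaper alone:         {simple_slope:.3f}")
print(f"newspaper, TV and radio: {newspaper_coef:.3f}")
print(f"multiple-regression R^2: {multi.score(X, sales):.3f}")
```

In the simple regression, newspaper acts as a stand-in for radio; once radio is in the model, there is nothing left for newspaper to explain.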

Source: An Introduction to Statistical Learning (Ch. 3, Advertising data) — James, G., Witten, D., Hastie, T. and Tibshirani, R.

Common Misconceptions

Misconception: Linear regression can only fit straight lines.

Correction: Linear regression is linear in its parameters, not necessarily in the raw inputs, so you can fit curves by transforming the features (e.g. polynomial features like $x^2$).
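A short sketch of fitting a curve with a linear model: the synthetic data below follow a quadratic (an assumption for illustration), and the only change between the two fits is the addition of an $x^2$ column via scikit-learn's `PolynomialFeatures`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 1 + 0.5*x + 2*x^2 + noise.
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1.0 + 0.5 * x[:, 0] + 2.0 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=100)

# Straight-line fit on the raw feature.
straight = LinearRegression().fit(x, y)
r2_straight = straight.score(x, y)

# Same estimator, but on the expanded features [x, x^2].
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
curved = LinearRegression().fit(x_poly, y)
r2_curved = curved.score(x_poly, y)

print(f"R^2 with x only:    {r2_straight:.3f}")
print(f"R^2 with x and x^2: {r2_curved:.3f}")
print("coefficients on [x, x^2]:", np.round(curved.coef_, 3))
```

The model is still linear in its weights; only the feature matrix changed.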

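Misconception: A high R-squared value always means the model is good.

Correction: On the training data, R-squared never decreases when you add features, even irrelevant ones, so it can be inflated artificially. It does not tell you whether the model is overfitted; check adjusted R-squared or performance on held-out data instead.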
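The inflation is easy to demonstrate. In the sketch below (synthetic data, with sample sizes and the number of junk columns chosen as assumptions for illustration), 25 pure-noise features are appended to a one-feature model and R-squared is compared on training and held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_junk = 40, 1000, 25


def make_data(n):
    """One informative feature; y = 2x + unit-variance noise."""
    x = rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(size=n)
    return x, y


x_train, y_train = make_data(n_train)
x_test, y_test = make_data(n_test)

# Pure-noise columns with no relation to y.
big_train = np.hstack([x_train, rng.normal(size=(n_train, n_junk))])
big_test = np.hstack([x_test, rng.normal(size=(n_test, n_junk))])

small = LinearRegression().fit(x_train, y_train)
big = LinearRegression().fit(big_train, y_train)

r2_small_train = small.score(x_train, y_train)
r2_big_train = big.score(big_train, y_train)
r2_small_test = small.score(x_test, y_test)
r2_big_test = big.score(big_test, y_test)

print(f"train R^2: 1 feature {r2_small_train:.3f}, with junk {r2_big_train:.3f}")
print(f"test  R^2: 1 feature {r2_small_test:.3f}, with junk {r2_big_test:.3f}")
```

Training R-squared can only go up when columns are added to an ordinary least-squares fit; the held-out score is what reveals that the extra columns hurt.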

References & Further Reading

The Elements of Statistical Learning (textbook). Hastie, T., Tibshirani, R. and Friedman, J.
An Introduction to Statistical Learning (textbook). James, G., Witten, D., Hastie, T. and Tibshirani, R.

Related Topics

Logistic Regression

L1 & L2 Regularization