Linear & Logistic Regression

Linear and logistic regression are often the first models worth fitting because they expose the entire machine-learning workflow without hiding it behind many layers. You choose a simple model family, measure its mistakes with a loss function, and solve for parameters that make that loss small.

Linear regression predicts a continuous value such as price, demand, temperature, or risk score. Logistic regression predicts the probability of a class such as churn/no churn, fraud/not fraud, pass/fail, or disease/no disease. The important shared idea is the linear score:

$$z = w^T x + b$$

Linear regression uses that score directly as the prediction. Logistic regression passes it through a sigmoid so the output lies strictly between 0 and 1, that is, in the interval $(0, 1)$.
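The shared score and the two ways of using it can be sketched in a few lines of NumPy. The weights, intercept, and input below are hypothetical values chosen only for illustration, and the helper names `linear_score` and `sigmoid` are not from any library.

```python
import numpy as np

def linear_score(x, w, b):
    """Return z = w^T x + b for one feature vector x."""
    return float(np.dot(w, x) + b)

def sigmoid(z):
    """Map a real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

z = linear_score(x, w, b)            # 0.5*2.0 - 1.0*1.0 + 0.25
linear_prediction = z                # linear regression: the score is the prediction
logistic_probability = sigmoid(z)    # logistic regression: squash the score

print("score:", z)
print("probability:", logistic_probability)
```

The only difference between the two models at prediction time is the final line: one reports the score, the other reports the sigmoid of the score.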

Where is it used?

Use linear regression when the target is numeric and changes approximately linearly with the features. Use logistic regression when the target is categorical but you still want interpretable coefficients and probabilities. In practice, both are strong baselines: if a complex model only barely beats them, the extra complexity may not be buying much.

Best use

Interpretability. The fitted weights show the direction and size of each feature's effect on the prediction.

Watch out for

Both models assume a linear relationship. When the real relationship is strongly curved or driven by interactions, they underfit.

Intuition

How to think about this algorithm

In the interactive lab, linear regression is the ruler problem. Each point has a vertical residual: the gap between the observed value and the line's prediction. Squaring those gaps makes large misses expensive, so one outlier can visibly pull the fitted line. That is not a UI trick; it is exactly what the squared-error objective asks the model to do.
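The pull of a single outlier is easy to reproduce outside the lab. The sketch below uses hypothetical data that lies exactly on a line, moves one point upward, and refits with `np.polyfit`, which solves the same least-squares problem for a degree-1 polynomial.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x + 1.0

# Same data, with the last observation moved far above the line.
y_outlier = y_clean.copy()
y_outlier[-1] += 30.0

# np.polyfit with deg=1 returns the least-squares [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"clean:   slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

Five of the six points are unchanged, yet the fitted slope more than doubles, because the squared residual of the moved point dominates the objective.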

Logistic regression is the boundary problem. A linear score says which side of a boundary a point sits on. The sigmoid converts distance from that boundary into confidence: far on one side means probability near 1, far on the other side means probability near 0, and the boundary itself is the uncertain region around 0.5.

The useful mental model is this: regression is not "drawing a line"; it is choosing parameters that minimize a specific loss under a specific assumption about the shape of the relationship.

Interactive Diagram: Method of Least Squares Regression

[Interactive plot omitted. The slope (m) and intercept (c) of a line are adjusted manually against user-added observations. Each observation is drawn with its vertical residual and a square built on that residual, so the squared error is visible directly; minimising the total area of the squares gives the optimal parameters. Example state: y = 0.65x + 1.80, total squared loss 0.77.]

Key Insight: Every vertical line represents a residual. Drawing a literal square attached to it highlights how heavily squaring the error penalises outlying points.

The Logic

Mathematical core for linear & logistic regression

1. Linear regression model

For a row of features $x_i$, the model predicts:

$$\hat{y}_i = w^T x_i + b$$

The residual is the signed error:

$$r_i = y_i - \hat{y}_i$$

Ordinary least squares chooses parameters that minimize mean squared error:

$$\mathcal{L}(w,b) = \frac{1}{n}\sum_{i=1}^n (y_i - (w^T x_i + b))^2$$

With a full-rank design matrix $X$, in which a column of ones is appended so that the intercept $b$ is estimated as one more entry of $w$, the closed-form solution is:

$$\hat{w} = (X^T X)^{-1}X^T y$$
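A minimal NumPy sketch of the closed-form solution follows. The data is hypothetical and noise-free, generated from known parameters so the result can be checked. It assumes the intercept is handled by appending a column of ones to the feature matrix, and it solves the linear system rather than forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

# Hypothetical data: 5 rows, 2 features, generated from known parameters.
X = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])
true_w = np.array([1.5, -0.5])
true_b = 2.0
y = X @ true_w + true_b  # noise-free, so the fit should recover the parameters

# Append a column of ones so the intercept is estimated as the last weight.
X_aug = np.column_stack([X, np.ones(len(X))])

# Normal equations: solve (X^T X) theta = X^T y instead of forming the inverse.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_hat, b_hat = theta[:-1], theta[-1]

# Cross-check against NumPy's general least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("weights:", w_hat, "intercept:", b_hat)
```

In practice a solver such as `np.linalg.lstsq` is preferable to the normal equations when the features are nearly collinear, because it does not square the condition number of the problem.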

2. Logistic regression model

For binary labels $y_i \in \{0,1\}$, logistic regression starts with the same linear score:

$$z_i = w^T x_i + b$$

Then it maps the score to a probability:

$$\hat{p}_i = P(y_i=1 \mid x_i) = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}$$

The decision boundary at threshold $t$ is where:

$$\sigma(w^T x + b) = t$$

For the common threshold $t=0.5$, this simplifies to:

$$w^T x + b = 0$$
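The step from the sigmoid equation to a linear one can be written out for any threshold $0 < t < 1$, not only $0.5$. Writing $z = w^T x + b$ and inverting the sigmoid gives:

```latex
\begin{aligned}
\sigma(z) = t
&\iff \frac{1}{1 + e^{-z}} = t \\
&\iff e^{-z} = \frac{1 - t}{t} \\
&\iff z = \log\frac{t}{1 - t}
\end{aligned}
```

So the boundary is always the hyperplane $w^T x + b = \log\frac{t}{1-t}$. Changing the threshold shifts the boundary parallel to itself without rotating it, and at $t = 0.5$ the right-hand side is $\log 1 = 0$.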

3. Cross-entropy loss

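Squared error is a poor objective for class probabilities: combined with the sigmoid it is no longer convex in the parameters, and its gradient shrinks toward zero for predictions that are confidently wrong. Logistic regression instead uses binary cross-entropy, which is the negative log-likelihood of the labels under a Bernoulli model:

$$\mathcal{L}(w,b) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$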

Its gradient has a compact form:

$$\nabla_w \mathcal{L} = \frac{1}{n}X^T(\hat{p} - y)$$

This is why the algorithm is so useful pedagogically: the update direction is literally driven by predicted probability minus observed label.
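The gradient formula is enough to train the model with plain gradient descent. The sketch below is one minimal way to do it, not how any particular library implements it: the data is hypothetical, the learning rate and step count are illustrative and untuned, and the intercept gradient is taken to be the mean of the same error term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical one-feature data with overlapping classes.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

w = np.zeros(X.shape[1])
b = 0.0
learning_rate = 0.1  # illustrative value, not tuned
n = len(y)

losses = []
for step in range(500):
    p_hat = sigmoid(X @ w + b)
    losses.append(cross_entropy(y, p_hat))
    error = p_hat - y              # predicted probability minus observed label
    grad_w = X.T @ error / n       # (1/n) X^T (p_hat - y)
    grad_b = error.mean()          # same error term, averaged, for the intercept
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
print("weight:", w, "intercept:", b)
```

With all parameters at zero every prediction is 0.5, so the first recorded loss is $\log 2$; each update then moves the parameters against the average of "prediction minus label", weighted by the feature values.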

Code Example

scikit-learn example (model_fitting.py)

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, log_loss

# Linear regression: predict a numeric score from one feature.
hours = np.array([[1.1], [2.1], [3.0], [4.0], [5.2], [6.1], [7.0], [8.4]])
score = np.array([2.2, 2.8, 4.1, 4.6, 5.9, 6.7, 7.6, 8.7])

lin = LinearRegression()
lin.fit(hours, score)
score_hat = lin.predict(hours)

print("Linear slope:", lin.coef_[0])
print("Linear intercept:", lin.intercept_)
print("MSE:", mean_squared_error(score, score_hat))

# Logistic regression: predict probability of passing from two features.
# Features: [hours studied, practice-test average]
X = np.array([
    [1.2, 2.0],
    [2.0, 3.2],
    [2.8, 2.4],
    [3.6, 4.0],
    [5.2, 5.1],
    [6.4, 5.8],
    [7.1, 7.4],
    [8.2, 6.8],
    [8.8, 8.5],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

# scikit-learn applies L2 regularization by default (C=1.0), which keeps the
# weights finite even though this toy data is linearly separable.
clf = LogisticRegression()
clf.fit(X, y)
prob = clf.predict_proba(X)[:, 1]  # column 1 holds P(y = 1)

print("Logistic weights:", clf.coef_[0])
print("Logistic intercept:", clf.intercept_[0])
print("Cross-entropy:", log_loss(y, prob))
print("New example pass probability:", clf.predict_proba([[5.5, 6.0]])[0, 1])
```

Strengths

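  • Easy to interpret. Each fitted weight tells you how much the prediction (for logistic regression, the log-odds) changes per unit increase in that feature, with the other features held fixed. Weights are only comparable across features when the features are on similar scales.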

  • Fast to train, which makes them a natural baseline to fit before moving on to more complex models such as neural networks.

Limitations

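  • Both models assume the target (for logistic regression, the log-odds) is a linear function of the features. If the true relationship is strongly curved or depends on interactions between features, they will underfit unless you add transformed or interaction features by hand.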

  • Least-squares linear regression is highly sensitive to outliers. Because errors are squared, a single extreme data point can drag the entire line out of place.

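  • If two of your input features are highly correlated (like 'years alive' and 'age'), the matrix $X^T X$ becomes singular or nearly so. The individual weights are then unstable or not uniquely determined, even when the predictions themselves remain reasonable.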
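The effect of correlated features can be seen in the condition number of $X^T X$, which grows as the matrix approaches singularity. The sketch below uses synthetic, hypothetical features: `x2` is a near-copy of `x1`, and `x3` is an unrelated feature included for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: x2 is almost an exact copy of x1.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-4, size=200)
x3 = rng.normal(size=200)  # an unrelated feature for comparison

X_collinear = np.column_stack([x1, x2])
X_independent = np.column_stack([x1, x3])

# The condition number of X^T X measures how close it is to singular.
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)
cond_independent = np.linalg.cond(X_independent.T @ X_independent)

print(f"near-duplicate features: {cond_collinear:.3g}")
print(f"independent features:    {cond_independent:.3g}")
```

A very large condition number means small changes in the data can produce large changes in the fitted weights. Dropping one of the correlated features or adding regularization are common remedies.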

Key Assumptions

Scope conditions and interpretation notes

  1. The conditional mean of the target is approximately linear in the features.

  2. Residual variance is roughly constant across the prediction range.

  3. Residuals are not strongly correlated after accounting for the features.
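The second and third assumptions can be checked informally from the residuals of a fitted line. The sketch below is a rough diagnostic under stated assumptions, not a formal statistical test: the data is synthetic and built to satisfy the assumptions, and the two summary numbers (`spread_ratio`, `lag1`) are illustrative names rather than standard statistics from a library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data that satisfies the assumptions: linear mean, constant noise.
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = 3.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Assumption 2: compare residual spread in the lower and upper half of the fit.
order = np.argsort(fitted)
low, high = residuals[order[:100]], residuals[order[100:]]
spread_ratio = high.std() / low.std()  # near 1 suggests constant variance

# Assumption 3: lag-1 correlation of residuals taken in order of x.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]  # near 0 suggests no serial correlation

print(f"spread ratio: {spread_ratio:.2f}, lag-1 correlation: {lag1:.2f}")
```

On real data, a spread ratio far from 1 or a lag-1 correlation far from 0 is a prompt to plot the residuals against the fitted values and look more closely.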
