Dimensionality Reduction

Difficulty: Advanced

Reading Time: 20 min

Track: Practitioner

Techniques to shrink massive datasets down to their most important core features, making them easier to visualize and faster to process.

Prerequisites

Clustering (K-Means, EM, GMM)

ML Practitioner, Module 14 of 17

TL;DR

PCA finds new axes (principal components) that are orthogonal directions of maximum variance in the data.
Those directions are the eigenvectors of the covariance matrix $\Sigma$; each eigenvalue $\lambda_i$ equals the variance captured along its component.
The explained variance ratio of component $i$ is $\lambda_i / \sum_j \lambda_j$; keep the top $k$ components until the cumulative ratio crosses your threshold (e.g. 95%).
Always center the data (and usually standardize it) first, otherwise components chase the mean or the largest-unit feature instead of the true structure.
PCA is linear and great for compression and decorrelation; for nonlinear visualization reach for t-SNE or UMAP.
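
The eigenvalue claims above can be checked numerically. The following is a minimal sketch on synthetic data (the mixing matrix and the random seed are arbitrary assumptions, not part of the lesson): it eigendecomposes the sample covariance matrix with NumPy and compares the result with scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumption): 500 samples of 3 correlated features.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
X = rng.standard_normal((500, 3)) @ mixing.T

# Center, then form the sample covariance matrix (n - 1 in the denominator).
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reorder to put the largest-variance direction first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Explained variance ratio: lambda_i / sum_j lambda_j.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

pca = PCA().fit(X)
print("eigenvalues of covariance:  ", eigenvalues)
print("sklearn explained_variance_:", pca.explained_variance_)
print("explained variance ratio:   ", explained_variance_ratio)
```

The columns of `eigenvectors` should agree with the rows of `pca.components_` up to sign, since an eigenvector is only defined up to a factor of $\pm 1$.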

Learning Objectives

Explain the mathematical purpose of covariance matrices in Principal Component Analysis
Formulate the eigenvalue problem and relate eigenvalues to explained variance
Distinguish between PCA (linear projection) and non-linear methods (like t-SNE or UMAP)
Interpret the explained variance ratio of principal components

Intuition

How to think conceptually about this topic

Imagine you are trying to take a photograph of a complex 3D object, like a bicycle. If you take the photo from the front, it just looks like a thin line (a tire and some handlebars). You've lost almost all the information about what the object is.

But if you walk around to the side and take a photo, you capture the wheels, the frame, the pedals, and the seat. You have successfully compressed a 3D object into a 2D photograph while keeping the maximum amount of useful visual information.

PCA does the same thing with math. It rotates your data until it finds the "camera angle" along which the data is most spread out, that is, the view that preserves the most variance.

In Depth

Detailed explanations and context

Imagine you have a spreadsheet with 1,000 columns describing a house (square footage, number of windows, color of the front door, distance to the nearest coffee shop, etc.). "Dimensionality Reduction" algorithms figure out how to compress those 1,000 columns down to, say, 10 columns, without losing the core "meaning" of the data.

The most famous technique is Principal Component Analysis (PCA). PCA looks at all your data and finds the combinations of features that account for most of its variation. For example, it might discover that "number of bedrooms," "number of bathrooms," and "square footage" largely measure the same thing: "House Size." It combines them into one new super-feature and discards the redundancy.
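
To make the "House Size" example concrete, here is a small sketch on made-up data. The column values, noise levels, and the hidden `size` factor are all hypothetical, chosen only to illustrate how three correlated columns collapse onto one component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_houses = 300

# Hypothetical data: one hidden "size" factor drives all three columns.
size = rng.normal(0.0, 1.0, n_houses)
square_footage = 1800 + 600 * size + rng.normal(0, 80, n_houses)
bedrooms = 3 + 1.0 * size + rng.normal(0, 0.3, n_houses)
bathrooms = 2 + 0.7 * size + rng.normal(0, 0.25, n_houses)
X = np.column_stack([square_footage, bedrooms, bathrooms])

# Standardize so square footage does not dominate purely because of its units.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("loadings of component 1: ", pca.components_[0].round(3))

# The first component should track the hidden size factor (up to sign).
house_size_score = pca.transform(X_scaled)[:, 0]
correlation = np.corrcoef(house_size_score, size)[0, 1]
print("|correlation with hidden size factor|:", abs(correlation).round(3))
```

In this setup the first component carries most of the variance and weights all three columns with the same sign, which is what "one new super-feature" means in practice.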

Where is it used?

It is heavily used for data visualization. Humans can't visualize a 1,000-dimensional graph, but PCA can compress that data down to 2 or 3 dimensions so we can actually look at it on a screen. It's also used to compress images, speed up facial recognition systems, and clean up messy data before feeding it into other machine learning models.

How It Compares

PCA vs t-SNE vs UMAP

Aspect	PCA	t-SNE	UMAP
Linear or nonlinear	Linear projection	Nonlinear manifold	Nonlinear manifold
Structure preserved	Global variance / distances	Mostly local neighborhoods	Local with some global structure
Deterministic?	Yes (up to sign)	No (random init, stochastic)	No (random init, stochastic)
Main use	Compression & decorrelation	2D/3D visualization	2D/3D visualization
New points / inverse transform	Easy: apply the linear map; the inverse is exact only if all components are kept, otherwise an approximation	No native transform for new points	Supports transforming new points

Takeaway: Use PCA when you need a fast, deterministic compression of linearly correlated features that can be applied to new data and approximately inverted; use t-SNE or UMAP only to visualize cluster structure, never as preprocessing whose axes you intend to interpret.
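
The "new points / inverse transform" row for PCA can be illustrated with scikit-learn's `transform` and `inverse_transform`. This is a sketch on synthetic data (the 3 latent factors, noise level, and train/new split are assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data (assumption): 10 features driven by 3 latent factors plus noise.
latent = rng.standard_normal((400, 3))
mixing = rng.standard_normal((3, 10))
X = latent @ mixing + 0.05 * rng.standard_normal((400, 10))
X_train, X_new = X[:300], X[300:]

pca = PCA(n_components=3).fit(X_train)

# New points are projected with the linear map learned on the training data.
Z_new = pca.transform(X_new)

# inverse_transform maps back to the original 10-D space. Because 7
# components were discarded, this is an approximation, not an exact inverse.
X_new_reconstructed = pca.inverse_transform(Z_new)
relative_error = (np.linalg.norm(X_new - X_new_reconstructed)
                  / np.linalg.norm(X_new - pca.mean_))

print("projected shape:", Z_new.shape)
print("relative reconstruction error:", round(float(relative_error), 4))
```

The projection is nothing more than `(X_new - pca.mean_) @ pca.components_.T`, which is why applying PCA to unseen data is cheap and repeatable.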

When to Use It

Reach for this when

Features are numerous and linearly correlated, and you want to compress them into a few uncorrelated components.
You need a fast, deterministic, reversible transform that also works on new incoming data (e.g. as a preprocessing step before a classifier).
You want to decorrelate features or remove redundancy / noise before feeding another model.

Avoid it when

The structure is strongly nonlinear (curved manifolds like a Swiss roll) — PCA flattens it; use Kernel PCA, t-SNE, UMAP, or autoencoders.
You need interpretable original features — principal components are abstract linear combinations, not real-world variables.
You only care about a pretty 2D cluster picture — t-SNE/UMAP usually separate clusters more clearly for visualization.

Rules of thumb

Always center, and almost always standardize (z-score) features before PCA, especially when units differ.
Choose $k$ from the cumulative explained-variance curve — a 95% or 99% threshold is a common default.
Treat t-SNE and UMAP as visualization tools only; their axes have no consistent meaning and distances between far-apart clusters are not reliable.
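
The first two rules of thumb can be sketched as follows. The data is synthetic (20 features built from 4 latent factors and rescaled to mimic mixed units, all of which are assumptions), and the 95% threshold is just the default mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data (assumption): 20 features generated from 4 latent factors,
# with columns on very different scales to mimic mixed units.
latent = rng.standard_normal((1000, 4))
mixing = rng.standard_normal((4, 20))
X = latent @ mixing + 0.1 * rng.standard_normal((1000, 20))
X = X * np.logspace(0, 3, 20)  # column scales range from 1 to 1000

# Center and z-score every column before PCA.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", cumulative[:6].round(3))
print(f"components needed for {threshold:.0%}: {k}")
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make this choice for you; computing the curve by hand is mainly useful when you want to plot it or compare several thresholds.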

Implementation

Reference code implementation

Python

model_fitting.py

import numpy as np
from sklearn.decomposition import PCA

# Create a fake dataset: 100 rows, 5 columns (seeded for reproducibility).
# Note: these columns are independent, so PCA has little redundancy to remove
# and will keep most or all of them.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

# A float n_components in (0, 1) keeps the fewest components whose cumulative
# explained variance exceeds that fraction (here 95%)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Fraction of variance explained by each component: {pca.explained_variance_ratio_}")

Strengths & Advantages

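It removes redundancy among highly correlated columns in a principled way: the components are uncorrelated by construction, and the top $k$ of them give the best rank-$k$ linear approximation of the centered data in the squared-error sense.
It can speed up downstream machine learning algorithms by giving them fewer, less redundant features to process.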
It allows you to visualize incredibly complex datasets on a standard 2D screen.

Limitations & Drawbacks

PCA only captures linear relationships. If your data lies on a curved or twisted surface, standard PCA cannot unfold it and may mix together points that are far apart along that surface.
The new 'super-features' it creates are mathematically abstract. You might compress 10 columns into 'Component 1', but it becomes very hard to explain to a human what 'Component 1' actually represents in the real world.
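
The linearity limitation is easy to see on a toy dataset. The sketch below uses scikit-learn's `make_circles` (two concentric rings) and compares a linear classifier trained on ordinary PCA output with one trained on Kernel PCA output. The RBF kernel and `gamma=10` are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two concentric rings: the class depends on the radius, a nonlinear feature.
X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

def linear_accuracy(reducer):
    """Fit the reducer, then score a linear classifier on its output."""
    Z_train = reducer.fit_transform(X_train)
    Z_test = reducer.transform(X_test)
    clf = LogisticRegression().fit(Z_train, y_train)
    return clf.score(Z_test, y_test)

# Linear PCA only rotates the plane, so the rings stay nested.
acc_pca = linear_accuracy(PCA(n_components=2))

# Kernel PCA with an RBF kernel (gamma=10 is an illustrative choice).
acc_kpca = linear_accuracy(KernelPCA(n_components=2, kernel="rbf", gamma=10))

print(f"linear classifier after PCA:        {acc_pca:.2f}")
print(f"linear classifier after kernel PCA: {acc_kpca:.2f}")
```

No straight line can separate an inner ring from an outer one, and rotating the axes does not change that, so the PCA features leave the linear classifier far from perfect, while the kernel features make the rings separable.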

Real-World Case Studies

Eigenfaces: compressing face images for recognition

Computer vision / face recognition

Scenario

Turk and Pentland needed to recognize human faces from grayscale images. A modest $128 \times 128$ image is a point in a 16,384-dimensional pixel space — far too high-dimensional to compare or classify directly, and dominated by redundant, correlated pixels.

Approach

Center the training faces by subtracting the average face, then run PCA on the image set. The top eigenvectors of the covariance matrix — the "eigenfaces" — span a low-dimensional "face space." Each face is then represented by its coordinates (weights) along the top $k$ eigenfaces, and recognition reduces to a nearest-neighbour comparison in that compact space.

Outcome

A small handful of eigenfaces captured most of the variation across faces: roughly 7 of the top eigenfaces sufficed to characterize the face set, and in their experiments about 40 eigenfaces were enough for reliable recognition — collapsing the original 16,384-dimensional representation by over two orders of magnitude while preserving the discriminative variance. On their test database the system recognized faces with around 96% accuracy under varying lighting.

Source: Eigenfaces for Recognition — Turk, M. and Pentland, A.
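
The Approach steps in this case study can be sketched in a few lines. This is not Turk and Pentland's code or data: as a stand-in, it uses the small 8x8 digit images bundled with scikit-learn (`load_digits`) instead of face photographs, and the choice of `k = 16` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 8x8 digit images, i.e. 64-dimensional pixel vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 1. Center the training images by subtracting the average image.
mean_image = X_train.mean(axis=0)
A = X_train - mean_image

# 2. The right singular vectors of the centered data are the eigenvectors of
#    its covariance matrix; keep the top k as the "eigen-images".
k = 16
_, _, Vt = np.linalg.svd(A, full_matrices=False)
eigen_images = Vt[:k]  # shape (k, 64)

# 3. Represent every image by its k weights along the eigen-images.
W_train = A @ eigen_images.T
W_test = (X_test - mean_image) @ eigen_images.T

# 4. Recognition is a nearest-neighbour lookup in the k-dimensional space.
knn = KNeighborsClassifier(n_neighbors=1).fit(W_train, y_train)
accuracy = knn.score(W_test, y_test)

print(f"{X.shape[1]} pixels -> {k} weights per image")
print(f"nearest-neighbour accuracy: {accuracy:.3f}")
```

Reshaping any row of `eigen_images` to 8x8 gives the digit analogue of an eigenface. Test images are centered with the training mean, mirroring the "subtract the average face" step.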

Common Misconceptions

Misconception: PCA is a feature selection technique.

Correction: Feature selection chooses a subset of the original features. PCA is feature extraction; it creates entirely new features that are linear combinations of the original ones.

Misconception: PCA does not require centering the data.

Correction: Without centering the columns to have zero mean, the first principal component tends to point toward the mean vector instead of along the direction of maximum variance, especially when the data sits far from the origin.
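
The centering correction follows from the identity $\frac{1}{n} X^\top X = \Sigma + \mu\mu^\top$ (with $\Sigma$ using a $1/n$ normalization): when the mean $\mu$ is large, the $\mu\mu^\top$ term dominates the leading direction. The sketch below demonstrates this on synthetic 2-D data; the offset of 50 and the two standard deviations are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-D data (assumption): most variance lies along [1, -1],
# but the cloud sits far from the origin, around the point (50, 50).
true_direction = np.array([1.0, -1.0]) / np.sqrt(2)
minor_direction = np.array([1.0, 1.0]) / np.sqrt(2)
X = (50.0
     + 3.0 * rng.standard_normal((500, 1)) * true_direction
     + 0.3 * rng.standard_normal((500, 1)) * minor_direction)

def first_direction(data):
    """Leading right singular vector of the data matrix."""
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    return Vt[0]

mean_direction = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
v_uncentered = first_direction(X)
v_centered = first_direction(X - X.mean(axis=0))

# Absolute values because a direction is only defined up to sign.
print("uncentered vs mean direction:", abs(v_uncentered @ mean_direction).round(3))
print("centered vs true direction:  ", abs(v_centered @ true_direction).round(3))
```

scikit-learn's `PCA` subtracts the column means internally, so this pitfall mostly arises when computing PCA by hand with an SVD or eigendecomposition.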

References & Further Reading

Principal Component Analysis (textbook)
By Jolliffe, I.T.
UMAP: Uniform Manifold Approximation and Projection (paper)
By McInnes, L., Healy, J. and Melville, J.

Related Topics

Clustering (K-Means, EM, GMM)