Clustering Methods

Clustering (K-Means, EM, GMM)

Difficulty: Intermediate

Reading Time: 25 min

Track: Practitioner

Algorithms that automatically group similar data points together without needing human labels.

TL;DR

K-Means partitions data into $K$ clusters by minimizing inertia (within-cluster sum of squares), the total squared distance from each point to its assigned centroid.
It runs Lloyd's algorithm: alternate between assigning each point to its nearest centroid and updating each centroid to the mean of its assigned points.
For fixed assignments, the cluster mean is the provably optimal centroid — which is exactly why the update step uses the mean.
The objective is non-convex, so K-Means converges only to a local optimum; the result depends on initialization — use k-means++ and multiple restarts.
It assumes spherical, similarly sized clusters and is sensitive to feature scale, so standardize features first; for arbitrary shapes or noise prefer DBSCAN.
Choose $K$ with diagnostics such as the elbow method (inertia vs. $K$) or silhouette analysis; K-Means will not pick $K$ for you.
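
Why the mean? A short derivation, using notation introduced here for illustration ($x_i$ for a data point, $C_k$ for the set of points assigned to cluster $k$, $\mu_k$ for its centroid). Inertia is

$$
J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2 .
$$

Holding the assignments fixed and setting the gradient with respect to $\mu_k$ to zero gives

$$
\nabla_{\mu_k} J = -2 \sum_{i \in C_k} (x_i - \mu_k) = 0
\quad\Longrightarrow\quad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i .
$$

Because $J$ is a convex quadratic in $\mu_k$, this stationary point is the minimum. The assignment step cannot increase $J$ either, so $J$ is non-increasing across iterations, which is why Lloyd's algorithm converges (though only to a local optimum).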

Learning Objectives

Explain the mechanics of the K-Means clustering algorithm
Evaluate cluster quality using inertia and silhouette scores
Compare K-Means and Gaussian Mixture Models (GMM)
Formulate the steps of the Expectation-Maximization (EM) algorithm

Intuition

How to think conceptually about this topic

K-Means: Imagine you have a map of a city with pins for every coffee shop, and you want to open 3 new distribution centers. K-Means draws hard borders on the map. Every coffee shop is assigned to exactly one distribution center—whichever one is closest in a straight line.

GMMs: Now imagine the distribution centers share delivery zones. A coffee shop right on the border isn't forced to choose just one. A GMM uses "soft boundaries." It might say, "This coffee shop is 70% likely to be served by Center A, and 30% likely to be served by Center B." It embraces the uncertainty of the real world.

In Depth

Detailed explanations and context

Clustering algorithms are the detectives of machine learning. You hand them a pile of unlabelled data, and they organize it into groups of similar points. Whether those groups are meaningful depends on the features and the distance measure you choose.

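K-Means is the best-known clustering algorithm. It places a chosen number of "center points" (centroids) in the data, then repeatedly assigns each point to its nearest centroid and moves each centroid to the mean of its assigned points, stopping when the assignments no longer change. It works best when the clusters are compact and roughly spherical.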
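
The assign-then-update loop can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `lloyd_kmeans` is hypothetical, and it starts from randomly chosen data points rather than k-means++.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random-point initialization (illustrative)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia for the returned centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()
    return labels, centroids, inertia

X = np.array([
    [1.5, 2.0], [1.1, 1.8], [2.1, 2.2],
    [8.0, 8.5], [8.2, 8.1], [8.8, 9.0],
])
labels, centroids, inertia = lloyd_kmeans(X, k=2)
print("labels:", labels)
print("centroids:\n", centroids)
print("inertia:", inertia)
```

On data this well separated the loop settles within a few iterations; on harder data the outcome depends on the starting centroids.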

Gaussian Mixture Models (GMMs) are a more flexible, probabilistic relative of K-Means. Instead of assuming every cluster is roughly spherical, a GMM models the data as a mixture of several Gaussian distributions (bell curves), each with its own mean, covariance, and mixing weight. This lets it fit clusters that are stretched out, differ in spread, or overlap. The parameters are fitted with the Expectation-Maximization (EM) algorithm, which alternates between an E-step (compute each point's membership probability for every component) and an M-step (re-estimate each component's parameters from those probabilities).
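
To make the EM idea concrete, here is a minimal sketch for a two-component mixture in one dimension. The function name `em_gmm_1d`, the crude initialization, and the synthetic data are assumptions made for illustration; a production implementation such as scikit-learn's `GaussianMixture` works in log space and handles many more cases.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (illustrative only)."""
    # Crude initialization: means at the data extremes, shared variance, equal weights.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        weighted = weights * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the soft counts.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # resp holds the responsibilities from the last E-step.
    return weights, mu, var, resp

# Synthetic data: two overlapping bell curves centered at 0 and 6.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])

weights, mu, var, resp = em_gmm_1d(x)
print("mixing weights:", weights)
print("means:", mu)
print("variances:", var)
```

Each row of `resp` is the soft assignment for one point: the entries sum to 1 and play the role of the "70% Center A, 30% Center B" split from the intuition section.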

Where is it used?

Clustering is used everywhere you need to find hidden structure. It's used in marketing to automatically segment customers into different buying personas, in finance to detect unusual spending patterns (anomalies), in biology to group genes with similar behaviors, and in computer vision to separate the foreground of an image from the background.

How It Compares

K-Means vs Hierarchical vs DBSCAN

Dimension	K-Means	Hierarchical (Agglomerative)	DBSCAN
Must specify number of clusters?	Yes: choose $K$ up front	No: cut the dendrogram at any level afterward	No: emerges from density (set $\varepsilon$ and minPts)
Cluster shape assumption	Spherical, similarly sized (convex)	Depends on linkage; can capture nested structure	Arbitrary shapes; follows dense regions
Handling of outliers / noise	Poor: every point is forced into a cluster	Poor: outliers attach to the nearest merge	Excellent: labels sparse points as noise
Scalability	High: roughly $O(nKdi)$ for $n$ points, $d$ features, $i$ iterations	Low: naive $O(n^3)$ time and $O(n^2)$ memory	Moderate: about $O(n \log n)$ with a spatial index in low dimensions, $O(n^2)$ worst case

Takeaway: Reach for K-Means when you expect a known number of compact, roughly spherical clusters and need speed; use hierarchical clustering to explore nested structure without fixing $K$; choose DBSCAN when clusters have arbitrary shapes or the data contains noise you want flagged rather than absorbed.
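
The shape assumption is easy to see on two concentric rings. The sketch below uses scikit-learn's `make_circles` toy dataset; the ring spacing, noise level, and the DBSCAN `eps` value are hand-picked assumptions for this example, not general recommendations.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: a shape K-Means is not expected to recover.
X, y_true = make_circles(n_samples=400, factor=0.4, noise=0.03, random_state=0)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "single_linkage": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "dbscan": DBSCAN(eps=0.2, min_samples=5),  # eps chosen by hand for this toy data
}

# Adjusted Rand index: 1.0 means the rings were recovered, about 0 means chance level.
scores = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

K-Means splits the plane with a straight boundary, so each of its clusters contains part of both rings, while the density-based and single-linkage methods can follow the ring shape.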

When to Use It

Reach for this when

You expect a known, modest number of compact, roughly spherical clusters of similar size.
You need a fast, scalable partitioning that handles large datasets in near-linear time per iteration.
You want a simple, interpretable baseline for segmentation (e.g. grouping customers or quantizing colors).

Avoid it when

Clusters have arbitrary, elongated, or nested shapes (e.g. concentric rings) — prefer DBSCAN or spectral clustering.
The data contains heavy outliers or noise you do not want forced into clusters — DBSCAN flags them instead.
Clusters have very different densities or sizes, which violate the equal-variance, spherical assumption.
You cannot meaningfully scale features, since unscaled large-range features will dominate the Euclidean distance.

Rules of thumb

Always standardize features (zero mean, unit variance) before K-Means so no single feature dominates the distance.
Use k-means++ initialization and several restarts (sklearn n_init), keeping the run with the lowest inertia.
Pick $K$ with the elbow method on inertia and corroborate with silhouette scores rather than a single heuristic.
Remember inertia always decreases as $K$ grows, so never choose $K$ by minimizing inertia alone.
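
The rules above can be combined into one short workflow. This is a sketch on synthetic data: the blob centers, the artificial rescaling of one feature, and the range of candidate $K$ values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy data: three blobs, with the second feature blown up to a much larger range.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 8]],
                  cluster_std=0.7, random_state=0)
X[:, 1] *= 1000.0

# Standardize so both features contribute comparably to the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

candidate_ks = list(range(2, 7))
inertias, silhouettes = [], []
for k in candidate_ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))
    print(f"K={k}: inertia={km.inertia_:.1f}, silhouette={silhouettes[-1]:.3f}")

# Inertia keeps falling as K grows, so use the silhouette peak to corroborate the elbow.
best_k = candidate_ks[int(np.argmax(silhouettes))]
print("K with the highest silhouette:", best_k)
```

Reading the printed table, the inertia column shows where the drop flattens out (the elbow), and the silhouette column gives a second opinion on the same choice.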

Implementation

Reference code implementation

Python

model_fitting.py

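import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two well-separated groups of 2D points
X = np.array([
    [1.5, 2.0], [1.1, 1.8], [2.1, 2.2],
    [8.0, 8.5], [8.2, 8.1], [8.8, 9.0],
])

# Hard assignments: each point gets exactly one cluster label.
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
print(f"K-Means labels: {kmeans_labels}")

# Soft assignments: one diagonal-covariance Gaussian per cluster.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=42)
gmm.fit(X)

# Membership probability of each component for a new point near the first group.
probs = gmm.predict_proba([[2.0, 2.0]])[0]
print(f"GMM membership probabilities for [2.0, 2.0]: {probs}")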

Strengths & Advantages

It works without labels, so you don't need to spend hours manually labeling data before you can look for structure.
GMMs provide 'soft' assignments: a probability of belonging to each cluster rather than a single hard label.
It can reveal patterns in your data that human analysts might miss.

Limitations & Drawbacks

K-Means and GMMs require you to choose the number of clusters ($K$) before fitting, which usually means trying several values and comparing diagnostics.
K-Means handles non-spherical clusters poorly (for example, a ring of points surrounding another cluster).
The final result depends on the starting centroids, which are chosen at random. A poor start can leave the algorithm stuck in a poor local optimum, which is why k-means++ initialization and multiple restarts are recommended.
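
The sensitivity to starting points can be observed directly by comparing single runs from random starts with one multi-restart run. The dataset below is a synthetic assumption made for this sketch; how large the spread is will vary from dataset to dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, cluster_std=0.5, random_state=1)

# Ten single runs, each from purely random starting centroids.
single_inertias = [
    KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# One call that runs k-means++ ten times and keeps the lowest-inertia result.
multi = KMeans(n_clusters=6, init="k-means++", n_init=10, random_state=0).fit(X)

print(f"single runs: best={min(single_inertias):.1f}, worst={max(single_inertias):.1f}")
print(f"k-means++ with 10 restarts: {multi.inertia_:.1f}")
```

Any gap between the best and worst single runs is the cost of an unlucky start; keeping the lowest-inertia run out of several restarts is a cheap way to reduce that risk.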

Real-World Case Studies

Color quantization: compressing an image to K colors

Computer vision / image compression

Scenario

A photograph stored in 24-bit color can contain up to $2^{24} \approx 16.7$ million distinct colors. Displaying or transmitting it cheaply calls for a much smaller palette while keeping the image visually faithful.

Approach

Treat each pixel as a point in 3D RGB space and run K-Means with $K = 64$ over the pixels. Each centroid becomes a palette color, and every pixel is recolored to its nearest centroid. This minimizes the within-cluster squared color distance, which is the squared reconstruction error in RGB space (only a proxy for perceived error, since RGB is not perceptually uniform).

Outcome

The palette collapses from ~16.7 million possible colors to 64, a $> 99.99\%$ reduction in distinct colors, while the image stays visually close to the original (smooth gradients can show some banding). A 64-color palette needs only $\log_2 64 = 6$ bits per pixel for the index versus 24 bits for full color, roughly a 4x reduction in raw pixel storage, plus a small fixed cost for storing the palette itself. This setup follows the color-quantization example in the scikit-learn documentation.

Source: scikit-learn: Color Quantization using K-Means — scikit-learn developers
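
The approach can be sketched as follows. This is not the scikit-learn example's code: to stay self-contained it uses a small random image as a stand-in for a photo (an assumption), and it fits on all pixels rather than a subsample.

```python
import math
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a photo: a small random RGB image (an assumption for this sketch).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

k = 64
pixels = image.reshape(-1, 3).astype(np.float64)  # one row per pixel, in RGB space
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(pixels)

# Each centroid is a palette color; each pixel stores only its palette index.
palette = np.clip(np.rint(km.cluster_centers_), 0, 255).astype(np.uint8)
indices = km.labels_
quantized = palette[indices].reshape(image.shape)

bits_per_index = math.ceil(math.log2(k))  # 6 bits for a 64-entry palette
n_colors = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(f"distinct colors after quantization: {n_colors}")
print(f"index bits per pixel: {bits_per_index} (vs 24 for raw RGB)")
```

With a real photograph, `image` would be loaded from disk instead, and fitting on a random subsample of pixels is a common way to keep the clustering step fast.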

Common Misconceptions

Misconception: K-Means will automatically choose the best number of clusters $K$.

Correction: K-Means requires the user to specify $K$ beforehand. Techniques like the Elbow Method or Silhouette analysis are needed to guide this choice.

Misconception: K-Means works well on all cluster shapes.

Correction: K-Means assumes clusters are spherical and of similar size. It performs poorly on elongated, nested, or highly irregular cluster shapes.

References & Further Reading

Pattern Recognition and Machine Learning (textbook)
By Bishop, C. M.
Data Clustering: Algorithms and Applications (textbook)
By Aggarwal, C. C. and Reddy, C. K.
