Clustering Methods

Clustering (K-Means, EM, GMM)

Clustering algorithms are the detectives of machine learning. You hand them a massive pile of completely unlabelled data, and they automatically organize it into distinct, meaningful groups based on hidden patterns.

K-Means is the most famous clustering algorithm. It works by dropping a set number of "center points" (centroids) into the data and shifting them around until they sit perfectly in the middle of distinct, spherical clusters.

Gaussian Mixture Models (GMMs) are the smarter, more flexible upgrade to K-Means. Instead of assuming every cluster is a perfect circle, GMMs assume the data is made up of several overlapping bell curves (Gaussian distributions). This allows them to find clusters that are stretched out, dense in the middle, or overlapping. To figure out where these bell curves are, GMMs use a clever mathematical trick called the Expectation-Maximization (EM) algorithm.

Where is it used?

Clustering is used everywhere you need to find hidden structure. It's used in marketing to automatically segment customers into different buying personas, in finance to detect unusual spending patterns (anomalies), in biology to group genes with similar behaviors, and in computer vision to separate the foreground of an image from the background.

Best use

It works completely unsupervised. You don't need to spend hundreds of hours manually labeling data for the AI to learn.

Watch out for

You usually have to tell the algorithm exactly how many clusters (KK) to look for before it starts, which involves a lot of guesswork.

i

Intuition

How to think about this algorithm

K-Means: Imagine you have a map of a city with pins for every coffee shop, and you want to open 3 new distribution centers. K-Means draws hard borders on the map. Every coffee shop is assigned to exactly one distribution center—whichever one is closest in a straight line.

GMMs: Now imagine the distribution centers share delivery zones. A coffee shop right on the border isn't forced to choose just one. A GMM uses "soft boundaries." It might say, "This coffee shop is 70% likely to be served by Center A, and 30% likely to be served by Center B." It embraces the uncertainty of the real world.

Interactive Diagram

Lloyd's Clustering Optimization

Click Iterate step-by-step. Centroids move to mean coordinates of assigned clusters, updating Voronoi borders. Click canvas to add custom points.

Centroid 1
Centroid 2
Centroid 3
[Click plot space to add data points]
Key InsightStep 1: Assign (points snap to closest centroid). Step 2: Update (centroids slide to mean position). Convergence is reached when updates are zero.

The Logic

Mathematical core for clustering (k-means, em, gmm)

1. K-Means Objective Function

K-Means tries to minimize the "Within-Cluster Sum of Squares" (WCSS). In plain English: it wants the distance between every data point xix_i and its assigned center point μj\mu_j to be as small as possible:

J=j=1KiCjxiμj2J = \sum_{j=1}^{K} \sum_{i \in C_j} \|x_i - \mu_j\|^2

2. Gaussian Mixture Models (GMM)

A GMM assumes that your data was generated by KK different Gaussian (bell curve) distributions mixed together. The total probability of seeing a specific data point xx is the sum of the probabilities from each of those individual bell curves:

P(x)=k=1KπkN(xμk,Σk)P(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k)

3. Expectation-Maximization (EM)

Because GMMs are complex, you can't solve them with a single equation. Instead, the EM algorithm takes turns guessing and checking:

E-Step (Expectation): Guess which cluster each data point belongs to. It calculates the "responsibility" γ(znk)\gamma(z_{nk})—the probability that cluster kk is responsible for creating data point xnx_n:

γ(znk)=πkN(xnμk,Σk)j=1KπjN(xnμj,Σj)\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}

M-Step (Maximization): Now that we have a good guess of who belongs to what, we update the center (mean) and shape (variance) of the clusters to better fit the data:

Nk=n=1Nγ(znk)N_k = \sum_{n=1}^{N} \gamma(z_{nk})

μknew=1Nkn=1Nγ(znk)xn\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n

Code Example

clustering_(k-means,_em,_gmm).py · scikit-learn example

Python
model_fitting.py
1import numpy as np
2from sklearn.mixture import GaussianMixture
3from sklearn.cluster import KMeans
4
5# 2 Distinct geometric spatial groups
6X = np.array([
7    [1.5, 2.0], [1.1, 1.8], [2.1, 2.2], 
8    [8.0, 8.5], [8.2, 8.1], [8.8, 9.0]
9])
10
11kmeans = KMeans(n_clusters=2, random_state=42)
12kmeans_labels = kmeans.fit_predict(X)
13
14gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=42)
15gmm.fit(X)
16
17print(f"Computed GMM Probabilities for structural coordinate [2.0, 2.0]: {gmm.predict_proba([[2.0, 2.0]])[0]}")

Strengths

  • It works completely unsupervised. You don't need to spend hundreds of hours manually labeling data for the AI to learn.

  • GMMs provide 'soft' assignments, giving you a percentage probability of belonging to a cluster rather than a rigid Yes/No.

  • It can reveal hidden patterns in your data that human analysts might completely miss.

!

Limitations

  • You usually have to tell the algorithm exactly how many clusters (KK) to look for before it starts, which involves a lot of guesswork.

  • K-Means is terrible at finding clusters that aren't perfectly spherical (like a ring of data surrounding another cluster).

  • The final result heavily depends on where the algorithm randomly places its starting points. A bad start leads to a bad result.

A

Key Assumptions

Scope conditions and interpretation notes

  • 1

    K-Means works best when clusters are compact and roughly spherical.

  • 2

    GMMs assume clusters can be approximated by a mixture of Gaussian distributions.

R

References

Books and papers for deeper study

  • Bishop, C. M. (2006) Pattern Recognition and Machine Learning. New York: Springer.