Support Vector Machines

Difficulty: Advanced

Reading Time: 20 min

Track: Practitioner

A margin-based classifier that chooses the separating boundary with the largest distance to the nearest training points.

Prerequisites

Logistic Regression

TL;DR

An SVM is a margin-based classifier: among all boundaries that separate the classes, it picks the one whose distance to the nearest points is largest.
The decision boundary is determined only by the support vectors: the points on or inside the margin. Moving any other point leaves the fit unchanged, as long as that point stays outside the margin.
For labels $y_i \in \{-1, +1\}$ and linearly separable data, maximizing the margin is equivalent to minimizing $\frac{1}{2}\lVert w \rVert^2$ subject to $y_i(w^T x_i + b) \ge 1$ for every $i$, which is a convex quadratic program with a global optimum.
The kernel trick replaces dot products $x_i^T x_j$ with a kernel $K(x_i, x_j)$, producing nonlinear boundaries without ever forming the high-dimensional features.
The penalty $C$ controls the soft margin: large $C$ punishes misclassifications hard (narrow margin, risk of overfit); small $C$ tolerates errors for a wider, smoother margin.
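The effect of $C$ can be checked empirically. The sketch below is illustrative only: it assumes scikit-learn is available and uses synthetic, overlapping Gaussian blobs (not data from this lesson) to compare the margin width $2 / \lVert w \rVert$ and the support-vector count for a small and a large $C$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs (synthetic data, for illustration only).
X = np.vstack([
    rng.normal(loc=[-1.5, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[1.5, 0.0], scale=1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_)      # margin width 2 / ||w||
    n_sv = int(clf.n_support_.sum())             # total support vectors
    results[C] = (width, n_sv)
    print(f"C={C}: margin width={width:.2f}, support vectors={n_sv}")
```

With the small $C$, the margin should come out wider and more points should sit on or inside it, so more of them become support vectors.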

Learning Objectives

Explain the concept of margins and why SVM maximizes the margin
Distinguish between hard-margin and soft-margin SVM formulations
Explain the role of support vectors in defining the decision boundary
Describe the Kernel Trick and compute the Radial Basis Function (RBF) kernel

Intuition

How to think conceptually about this topic

The intuition is geometric. A boundary that barely separates the training data is fragile: a small measurement error can flip the prediction. A wider margin is more stable because new points can move slightly without crossing the boundary.
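To make "fragile" versus "stable" concrete, here is a minimal sketch that measures the margin of two boundaries on the same toy data. The data values and the helper name `geometric_margin` are hypothetical, chosen only for illustration; labels are assumed to be in $\{-1, +1\}$.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels y must be in {-1, +1}. A positive result means every point
    lies on its correct side of the boundary.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy data: class -1 at x1 = 0, class +1 at x1 = 3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# Two vertical boundaries that both classify the training data correctly.
barely = geometric_margin(np.array([1.0, 0.0]), -0.1, X, y)    # line x1 = 0.1
centered = geometric_margin(np.array([1.0, 0.0]), -1.5, X, y)  # line x1 = 1.5

print(f"barely separating: {barely:.2f}, centered: {centered:.2f}")
```

Both lines separate the classes, but a point from the negative class needs to shift by only 0.1 to cross the first boundary, compared with 1.5 for the centered one. The centered line is the one a maximum-margin classifier would prefer.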

When a straight boundary is not enough, kernels let the SVM compute dot products in a richer feature space without explicitly constructing every transformed feature. In the original input space, that can produce curved decision boundaries.
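A small numerical check can make this less abstract. The sketch below assumes a degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ on 2-D inputs, for which an explicit feature map `phi` is easy to write down, and also evaluates an RBF kernel $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The function names and the value of `gamma` are illustrative choices, not part of any library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Same inner product as phi(x) . phi(z), computed in the input space."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: similarity decays with squared Euclidean distance."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # build the features, then take the dot product
implicit = poly2_kernel(x, z)       # never builds the features

print(explicit, implicit)
print(rbf_kernel(x, z), rbf_kernel(x, x))
```

The two polynomial values agree, which is the kernel trick in miniature. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit form is practical.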

In Depth

Detailed explanations, contexts, and details

Support Vector Machines (SVMs) are margin-based classifiers. If many separating lines classify the training data correctly, an SVM chooses the one with the largest margin: the greatest distance to the closest training examples.
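The margin can also be written out explicitly. The block below is a sketch in the standard textbook notation, assuming labels $y_i \in \{-1, +1\}$; the slack variables $\xi_i$ are introduced here and do not appear elsewhere in this lesson.

```latex
% Distance from a point x_i to the hyperplane w^T x + b = 0
d_i = \frac{\lvert w^T x_i + b \rvert}{\lVert w \rVert}

% Rescale (w, b) so the closest points satisfy y_i (w^T x_i + b) = 1.
% The gap between the two classes is then
\text{margin width} = \frac{2}{\lVert w \rVert}

% Hard margin: maximizing 2 / ||w|| is the same as minimizing ||w||^2 / 2
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 \quad \forall i

% Soft margin: slack xi_i lets point i violate the margin at cost C * xi_i
\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i} \xi_i
\quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i
```

The hard-margin problem only has a solution when the classes are linearly separable. The soft-margin version always has one, and $C$ sets how much each margin violation costs.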

Those closest examples are the support vectors. They matter because moving them changes the boundary, while points that lie strictly outside the margin have no effect on the fitted classifier as long as they stay outside it.
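One way to see this is to refit the model on the support vectors alone and compare the result. The sketch below assumes scikit-learn and uses a linear kernel on a small toy dataset so the hyperplane coefficients can be inspected directly. The two fits should agree up to solver tolerance.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two well-separated clusters.
X = np.array([[1, 2], [2, 3], [1.5, 1.8], [8, 8], [9, 10], [8.5, 9.2]])
y = np.array([0, 0, 0, 1, 1, 1])

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Keep only the points the first fit identified as support vectors.
sv_idx = full.support_
reduced = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

print("support vector indices:", sv_idx)
print("full fit    w, b:", full.coef_[0], full.intercept_[0])
print("reduced fit w, b:", reduced.coef_[0], reduced.intercept_[0])
```

Dropping the remaining points leaves the hyperplane essentially unchanged, which is why the fitted model only needs to store its support vectors.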

Where is it used?

SVMs are useful when the dataset is medium-sized and the feature representation is strong. They are historically common in text classification, handwritten digit recognition, and biological classification tasks.

How It Compares

SVM vs Logistic Regression vs K-Nearest Neighbors

Dimension	SVM	Logistic Regression	K-Nearest Neighbors
Decision boundary	Maximum-margin hyperplane (curved via kernels)	Single linear hyperplane	Implicit, locally defined by neighbors
Margin / regularization	Explicit margin maximization; $C$ controls slack	Penalized log-loss; L1/L2 regularization	No margin; smoothing controlled by $k$
Kernels / nonlinearity	Native via the kernel trick (RBF, polynomial)	Needs manual feature engineering	Naturally nonlinear from local geometry
Scalability to large $N$	Poor — kernel cost grows with pairwise similarities	Excellent — trains fast, scales well	Slow at predict time; stores all data
Probability outputs	Not native; needs Platt scaling/calibration	Native, well-calibrated probabilities	Estimated from neighbor vote fractions

Takeaway: Reach for SVM on medium-sized, high-dimensional data with a clear margin; logistic regression when you need fast training and calibrated probabilities; KNN when the boundary is irregular and the dataset is small.

When to Use It

Reach for this when

You have medium-sized data with a fairly clear margin between classes — SVMs shine where the separation is geometric.
The problem is high-dimensional, such as text classification, where features often outnumber samples and an SVM (often with just a linear kernel) stays effective.
You need a nonlinear boundary but want a convex objective with a global optimum — use an RBF or polynomial kernel.

Avoid it when

The dataset is very large ($N$ in the hundreds of thousands or more): kernel training scales poorly, so prefer a linear SVM, logistic regression, or another linear model.
You need well-calibrated probabilities out of the box — SVMs only give scores and require extra calibration.
Features are on wildly different scales and you cannot standardize them — distance-based kernels become dominated by large-magnitude features.

Rules of thumb

Always scale or standardize features before training; SVMs are not scale-invariant.
Start with a linear kernel for high-dimensional sparse data (e.g. text); try RBF when you need nonlinearity.
Tune $C$ and $\gamma$ jointly with cross-validated grid search; they interact strongly.
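These rules of thumb can be combined into one workflow. The sketch below assumes scikit-learn and a synthetic dataset from `make_classification`; the grid values are illustrative starting points, not recommendations from this lesson. Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so no information leaks from the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Standardize features, then fit an RBF-kernel SVM.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Search C and gamma jointly on a logarithmic grid.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The `svc__` prefix is how `make_pipeline` exposes the parameters of the `SVC` step. A logarithmic grid is a common first pass that can be refined around the best cell.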

Implementation

Reference code implementation

Python

model_fitting.py

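import numpy as np
from sklearn.svm import SVC

# Toy data: two well-separated clusters, one per class.
X = np.array([[1, 2], [2, 3], [1.5, 1.8], [8, 8], [9, 10], [8.5, 9.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# RBF-kernel SVM; C is the soft-margin penalty.
clf = SVC(kernel='rbf', C=1.0)
clf.fit(X, y)

# n_support_ holds the number of support vectors for each class.
print(f"Support vectors per class: {clf.n_support_}")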

Strengths & Advantages

The convex training objective has a global optimum, which makes the optimization behavior easier to reason about than many neural models.
Kernels allow nonlinear decision boundaries while keeping the optimization problem in terms of pairwise similarities.
SVMs remain effective even when there are more features (columns) than data points (rows), because maximizing the margin acts as a form of regularization.

Limitations & Drawbacks

Kernel SVMs can be slow and memory-heavy on very large datasets because they depend on many pairwise similarities.
They do not naturally produce calibrated probabilities without an additional calibration step.
Performance is sensitive to kernel choice, feature scaling, and hyperparameters such as $C$ and $\gamma$.
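The calibration step can be sketched as follows, assuming scikit-learn and synthetic data. `CalibratedClassifierCV` with `method="sigmoid"` fits a Platt-style sigmoid on top of the SVM scores using cross-validation. The raw `decision_function` output is shown alongside for contrast.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Raw SVM: decision_function returns unbounded signed scores, not probabilities.
raw = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
scores = raw.decision_function(X_test)

# Calibrated SVM: a sigmoid maps scores to probabilities (Platt scaling).
calibrated = CalibratedClassifierCV(SVC(kernel="rbf", C=1.0), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)

print("score range:", scores.min(), scores.max())
print("first probabilities:", proba[:3])
```

Calibration adds training cost, because the SVM is refit once per fold. Whether the resulting probabilities are good enough should be checked on held-out data.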

Real-World Case Studies

Handwritten digit recognition on USPS / MNIST

Computer vision

Scenario

Recognizing handwritten digits (0-9) from scanned images was a benchmark problem where each image is a high-dimensional pixel vector and classes overlap in subtle, nonlinear ways. SVMs with nonlinear kernels became a leading approach in the 1990s, competitive with the best neural networks of the era.

Approach

Train a kernel SVM (polynomial or RBF) on the pixel vectors using a one-vs-rest or one-vs-one scheme for the ten digit classes, letting the kernel induce a nonlinear boundary without explicit feature construction.

Outcome

On the USPS digit benchmark in Cortes & Vapnik (1995), the soft-margin support-vector network achieved a test error of about 4.0% (roughly 96% accuracy), matching or beating contemporary neural-network classifiers and establishing SVMs as state-of-the-art for the task.

Source: Support-Vector Networks — Cortes, C. and Vapnik, V.

Common Misconceptions

Misconception: SVMs only work for binary classification.

Correction: SVMs can be extended to multi-class classification using One-vs-One (OvO) or One-vs-Rest (OvR) strategies.
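Both strategies can be tried on scikit-learn's bundled 8x8 digits dataset. The sketch below assumes scikit-learn. `SVC` handles multi-class problems with one-vs-one internally, and wrapping it in `OneVsRestClassifier` gives one-vs-rest. Dividing by 16 rescales the pixel values to the range 0 to 1.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel intensities in this dataset range from 0 to 16

# One-vs-One: one binary SVM per pair of classes.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# One-vs-Rest: one binary SVM per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

n_pairwise = ovo.decision_function(X[:1]).shape[1]
n_per_class = len(ovr.estimators_)
print(n_pairwise, n_per_class)
```

For ten classes, one-vs-one yields 10 * 9 / 2 = 45 pairwise scores, while one-vs-rest trains 10 classifiers. One-vs-one trains more models, but each one sees only the data for two classes.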

Misconception: Mapping the data into a higher-dimensional feature space will always make an SVM slower to evaluate.

Correction: Thanks to the kernel trick, prediction cost depends on the number of support vectors and the cost of one kernel evaluation on the original inputs, not on the dimensionality of the implicit feature space.
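The role of the support vectors at prediction time can be made explicit by recomputing the decision function by hand. The sketch below assumes scikit-learn's binary `SVC` with an RBF kernel and a fixed, illustrative `gamma`. `manual_decision` is a hypothetical helper, not a library function.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [1.5, 1.8], [8, 8], [9, 10], [8.5, 9.2]])
y = np.array([0, 0, 0, 1, 1, 1])

gamma = 0.1  # fixed so the kernel can be reproduced by hand
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def manual_decision(x):
    """f(x) = sum_i dual_coef_i * K(sv_i, x) + b, summed over support vectors only."""
    sq_dists = np.sum((clf.support_vectors_ - x) ** 2, axis=1)
    kernel_values = np.exp(-gamma * sq_dists)
    return float(clf.dual_coef_[0] @ kernel_values + clf.intercept_[0])

x_new = np.array([5.0, 5.0])
manual = manual_decision(x_new)
library = clf.decision_function(x_new.reshape(1, -1))[0]

print(manual, library)
```

The loop over training data has been replaced by a sum over `clf.support_vectors_`, so evaluation takes one kernel computation per support vector. The implicit feature space never appears.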

References & Further Reading

A Training Algorithm for Optimal Margin Classifiers (paper)
By Boser, B.E., Guyon, I.M. and Vapnik, V.N.
Understanding Machine Learning: From Theory to Algorithms (textbook)
By Shalev-Shwartz, S. and Ben-David, S.

Related Topics

Linear Regression

Logistic Regression