Support Vector Machines

Support Vector Machines (SVMs) are margin-based classifiers. If many separating lines classify the training data correctly, an SVM chooses the one with the largest margin: the greatest distance to the closest training examples.

Those closest examples are the support vectors. They matter because moving them changes the boundary, while points far from the margin often have no effect on the fitted classifier.
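A minimal sketch of this claim, assuming scikit-learn's SVC with a linear kernel and a very large C to approximate a hard margin. The toy points are made up for illustration. Refitting on the support vectors alone should leave the boundary essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (not from any real dataset).
X = np.array([[1, 2], [2, 3], [1.5, 1.8], [0, 0],
              [8, 8], [9, 10], [8.5, 9.2], [12, 12]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A very large C approximates the hard-margin SVM on separable data.
full = SVC(kernel="linear", C=1e6, tol=1e-6).fit(X, y)

# Refit using only the support vectors found by the first model.
sv_idx = full.support_
reduced = SVC(kernel="linear", C=1e6, tol=1e-6).fit(X[sv_idx], y[sv_idx])

print("support vector indices:", sv_idx)
print("w, b (all points):          ", full.coef_[0], full.intercept_[0])
print("w, b (support vectors only):", reduced.coef_[0], reduced.intercept_[0])
```

The two fits should agree up to solver tolerance, even though the second one never sees the points that lie far from the margin.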

Where is it used?

SVMs are useful when the dataset is medium-sized and the feature representation is strong. They are historically common in text classification, handwritten digit recognition, and biological classification tasks.

Best use

The convex training objective has a global optimum, which makes the optimization behavior easier to reason about than many neural models.

Watch out for

Kernel SVMs can be slow and memory-heavy on very large datasets because they depend on many pairwise similarities.

Intuition

How to think about this algorithm

The intuition is geometric. A boundary that barely separates the training data is fragile: a small measurement error can flip the prediction. A wider margin is more stable because new points can move slightly without crossing the boundary.

When a straight boundary is not enough, kernels let the SVM compute dot products in a richer feature space without explicitly constructing every transformed feature. In the original input space, that can produce curved decision boundaries.
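To make the "dot products in a richer feature space" idea concrete, here is a small sketch using a degree-2 polynomial kernel, chosen only because its feature map is easy to write out. The helper names phi and poly2_kernel are illustrative, not part of any library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)      # same value, computed in the 2-D input space

print(explicit, implicit)
```

Both numbers agree, but the kernel version never builds the transformed features. For kernels such as the RBF, the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.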

Interactive Diagram

Hyperplane Margins & Support Vectors (SVM)

[Interactive figure: a 2-D plot showing the separating hyperplane, the margin band on either side of it, and the highlighted support vectors. Readers place Class A and Class B points and adjust the penalty cost C.]

Lower C widens the margin by tolerating some classification errors. Higher C penalizes margin violations more heavily and pushes toward strict separation.

Key Insight

Only the support vectors (observations on or inside the margin boundaries) determine the separating boundary. Moving any other point has no effect as long as it stays outside the margin.
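The effect of C on the margin can be checked numerically. The sketch below assumes scikit-learn's SVC with a linear kernel and synthetic overlapping data. The margin_width helper is hypothetical and uses the fact that a linear SVM's margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, overlapping 2-D classes (hypothetical data for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

def margin_width(C):
    """Fit a linear soft-margin SVM; return (margin width, support vector count)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    return width, int(clf.n_support_.sum())

width_low_c, n_sv_low_c = margin_width(0.01)
width_high_c, n_sv_high_c = margin_width(100.0)

print(f"C=0.01: margin width {width_low_c:.3f}, {n_sv_low_c} support vectors")
print(f"C=100:  margin width {width_high_c:.3f}, {n_sv_high_c} support vectors")
```

The low-C fit should report the wider margin, since a smaller penalty lets the optimizer shrink ||w|| at the cost of more margin violations.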

The Logic

Mathematical core for support vector machines

1. The Margin

For separable data, the hard-margin SVM minimizes \frac{1}{2}\|w\|^2 over w and b, subject to every training example lying on the correct side of the margin:

y_i (w^T x_i + b) \ge 1 \quad \forall i
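As a short worked step (standard geometry, given here as a sketch rather than taken from the page itself): the constraint places the closest points of each class on the hyperplanes w^T x + b = +1 and w^T x + b = -1. The distance between those two hyperplanes is the margin width, which is why maximizing the margin is the same as minimizing the norm of w.

```latex
\text{margin width}
  = \frac{|(+1) - (-1)|}{\|w\|}
  = \frac{2}{\|w\|},
\qquad\text{so}\qquad
\max_{w,b} \frac{2}{\|w\|}
\;\Longleftrightarrow\;
\min_{w,b} \frac{1}{2}\|w\|^2
\quad \text{subject to } y_i (w^T x_i + b) \ge 1 \ \ \forall i .
```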

2. The Kernel Trick

Real data is rarely perfectly separable, so practical SVMs use slack variables and a penalty parameter C to trade off margin width against classification mistakes. Kernels replace the inner products x_i^T x_j with a function K(x_i, x_j).

One of the most widely used kernels is the Radial Basis Function (RBF). Its value decays with the squared distance between two points x_i and x_j, which produces smooth, curved boundaries. The parameter \gamma controls how quickly the similarity falls off:

K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)
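The sketch below shows how the RBF formula enters prediction, assuming scikit-learn's SVC on a binary problem with toy data. The helper rbf_kernel_matrix is illustrative, and gamma is fixed explicitly so the hand-computed kernel matches the fitted model. The decision score is rebuilt from the fitted attributes dual_coef_, support_vectors_, and intercept_.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), via NumPy broadcasting."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Hypothetical toy data; gamma is an assumed value chosen for illustration.
X = np.array([[1, 2], [2, 3], [1.5, 1.8], [8, 8], [9, 10], [8.5, 9.2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
gamma = 0.1

clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

X_new = np.array([[2.0, 2.0], [7.0, 9.0]])
K = rbf_kernel_matrix(X_new, clf.support_vectors_, gamma)

# Binary decision score: weighted kernel similarities to the support vectors.
manual_scores = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(manual_scores)
print(clf.decision_function(X_new))
```

The two printed arrays should match. Only similarities to the support vectors are needed at prediction time, never the transformed features themselves.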

Code Example

support_vector_machines.py · scikit-learn example

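import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: three points per class, well separated.
X = np.array([[1, 2], [2, 3], [1.5, 1.8], [8, 8], [9, 10], [8.5, 9.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# RBF-kernel SVM; gamma is left at scikit-learn's default ("scale").
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# n_support_ holds one count per class, in the order of clf.classes_.
print(f"Support vectors per class: {clf.n_support_}")
print(f"Total support vectors: {clf.n_support_.sum()}")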

Strengths

  • The convex training objective has a global optimum, which makes the optimization behavior easier to reason about than many neural models.

  • Kernels allow nonlinear decision boundaries while keeping the optimization problem in terms of pairwise similarities.

  • SVMs can remain effective when there are more features (columns) than training examples (rows), since margin maximization acts as a form of regularization.

Limitations

  • Kernel SVMs can be slow and memory-heavy on very large datasets because they depend on many pairwise similarities.

  • They do not naturally produce calibrated probabilities without an additional calibration step.

  • Performance is sensitive to kernel choice, feature scaling, and hyperparameters such as C and gamma.
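The calibration step mentioned above can be sketched with scikit-learn's CalibratedClassifierCV, here using Platt scaling (method="sigmoid"). The data is synthetic, and the choices of 5 folds and an RBF kernel are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Synthetic, overlapping 2-D classes (hypothetical data for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# On its own, SVC.decision_function returns unbounded scores, not probabilities.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# Platt scaling: fit a sigmoid on held-out SVM scores using 5-fold cross-validation.
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba)
```

Each row of proba holds the estimated class probabilities for one example and sums to 1.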

Key Assumptions

Scope conditions and interpretation notes

  1. Features are scaled so distances and dot products are meaningful.

  2. The selected kernel and regularization parameter match the data geometry.
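One common way to address both assumptions is to scale inside a pipeline and choose C and gamma by cross-validation. The sketch below assumes scikit-learn's Pipeline, StandardScaler, and GridSearchCV. The data and the parameter grid are hypothetical and would need adapting to a real problem.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with features on very different scales (hypothetical).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
X[:, 1] *= 1000.0  # second feature dominates distances unless rescaled
y = np.array([0] * 100 + [1] * 100)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training split.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Illustrative grid over the penalty C and the RBF width gamma.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

Keeping the scaler in the pipeline avoids leaking statistics from validation folds into training, which would otherwise make the cross-validated score optimistic.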

References

Books and papers for deeper study

  • Boser, B.E., Guyon, I.M. and Vapnik, V.N. (1992) 'A training algorithm for optimal margin classifiers', in Proceedings of the fifth annual workshop on Computational learning theory. Pittsburgh, Pennsylvania: ACM, pp. 144-152.