Support Vector Machines

Imagine drawing a line on a piece of paper to separate blue dots from red dots. You could draw many different lines that all technically separate them. But which line is best? The answer is the one that sits as far as possible from both groups, so even if a new dot lands slightly off-centre, it still gets sorted correctly. That is exactly the idea behind a Support Vector Machine.


What is a Support Vector Machine?

A Support Vector Machine (often called an SVM) is a model that finds the boundary between two categories by placing it as far away as possible from both groups. The gap between the boundary and the closest points on either side is called the margin (which means the safety zone). A wider margin means the model is less likely to make a mistake when it sees a new example.

The data points closest to the boundary, which are the ones that actually define where the boundary goes, are called support vectors (which means the examples “supporting” or holding up the boundary line). Once those points are found, everything else in the training data is irrelevant.
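To make this concrete, here is a minimal sketch using scikit-learn (the same library used in the code later on this page; the tiny dataset is made up purely for illustration). After training, the model can tell you exactly which points became its support vectors:

import numpy as np
from sklearn.svm import SVC

# Six made-up points: three per category, clearly separated
X = np.array([[1, 1], [2, 1], [1, 2],    # category 0
              [5, 5], [6, 5], [5, 6]])   # category 1
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel='linear')              # a straight-line boundary
model.fit(X, y)

print(model.support_vectors_)             # coordinates of the support vectors
print(model.support_)                     # their positions (indices) in X

Any point not listed in that output could be deleted from the training data without moving the boundary at all.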


A simple way to think about it

Imagine you are drawing a safety corridor down the middle of a road to separate oncoming traffic. You want the corridor to be as wide as possible, so cars on both sides have the most room for error.

The SVM does the same thing with data. It searches for the widest possible corridor that separates two categories. The edges of the corridor are defined by the closest examples from each category, the support vectors. The actual decision boundary runs down the middle of that corridor.

When the data cannot be separated by a straight line (for example, one category is surrounded by the other), the SVM uses a mathematical trick to solve the problem. It mentally re-imagines the data in a higher number of dimensions, where a flat boundary does separate the categories. This trick is called a kernel (which means a function that measures similarity between examples in a transformed space, without actually doing the full transformation). The most common kernel is called the RBF kernel (Radial Basis Function), and it works well on a wide range of problems.
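Here is a small sketch of that idea, assuming scikit-learn is available. make_circles generates exactly the situation described above: one category surrounded by the other. A straight line fails, while the RBF kernel separates them easily:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One category forms a ring around the other: no straight line can separate them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

linear_model = SVC(kernel='linear').fit(X, y)
rbf_model = SVC(kernel='rbf').fit(X, y)

print(f"Straight-line boundary accuracy: {linear_model.score(X, y):.2f}")  # roughly coin-flip
print(f"RBF kernel accuracy:             {rbf_model.score(X, y):.2f}")     # close to 1.00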


How it works, step by step

  1. Find all the data points that are closest to the boundary between the two categories
  2. Draw the widest possible corridor between the two categories, using those closest points as guides
  3. Place the decision boundary down the middle of that corridor
  4. For non-linear problems, use a kernel to transform the data so a flat boundary can separate it
  5. For a new example, check which side of the boundary it falls on and return that category
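Step 5 boils down to a single sign check. A minimal sketch (scikit-learn assumed, with a made-up toy dataset): the decision function returns a signed score, and the sign says which side of the boundary a new example falls on.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [5, 5], [6, 5]])   # toy training data
y = np.array([0, 0, 1, 1])
model = SVC(kernel='linear').fit(X, y)

new_point = np.array([[1.5, 1.0]])
score = model.decision_function(new_point)   # signed score for the new example
print(score)                                 # negative here, so the point is on the class-0 side
print(model.predict(new_point))              # prints [0]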

See it visually

[Figure: Class A and Class B points separated by a decision boundary, with the margin marked and the support vectors ringed]

The solid black line is the decision boundary. The dashed lines on either side mark the edges of the margin (the safety corridor). The ringed points are the support vectors: the only examples that matter once training is done. All the other training points are irrelevant to the final boundary.
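If you would like to reproduce a figure like this yourself, here is a sketch, assuming matplotlib and scikit-learn are installed (the blob dataset and the large C value are illustrative choices, not requirements):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points in 2D
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
model = SVC(kernel='linear', C=1000).fit(X, y)   # a very large C approximates a hard margin

# Plot the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')

# Evaluate the decision function on a grid covering the plot
ax = plt.gca()
xx = np.linspace(*ax.get_xlim(), 50)
yy = np.linspace(*ax.get_ylim(), 50)
XX, YY = np.meshgrid(xx, yy)
Z = model.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)

# Solid line = boundary (level 0), dashed lines = margin edges (levels -1 and +1)
ax.contour(XX, YY, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])

# Ring the support vectors
ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
           s=120, facecolors='none', edgecolors='k')
plt.show()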


The maths (do not panic)

Here is the formula that makes this work. We will break down every part.

\[\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \text{ for all } i\]

In plain English: The model finds the weights ($\mathbf{w}$) and offset ($b$) that define the boundary line. Making $\|\mathbf{w}\|^2$ small is mathematically equivalent to making the margin wide. The second part of the formula says that every example must end up on the correct side of the margin, not just on the correct side of the boundary.
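You can check the margin claim numerically. A minimal sketch, assuming scikit-learn and NumPy: for a linear kernel the learned $\mathbf{w}$ is exposed as model.coef_, so the margin width is just $2/\|\mathbf{w}\|$.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 0], [3, 3], [4, 3]])   # made-up separable points
y = np.array([0, 0, 1, 1])

model = SVC(kernel='linear', C=1e6).fit(X, y)    # a very large C approximates a hard margin

w = model.coef_[0]                               # the learned weights
print(f"Margin width = 2 / ||w|| = {2 / np.linalg.norm(w):.3f}")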

More detail

The distance from the boundary to the nearest point on either side turns out to be $1/\|\mathbf{w}\|$, so the total margin width is $2/\|\mathbf{w}\|$. Making the margin as wide as possible is therefore the same as making $\|\mathbf{w}\|^2$ as small as possible. Minimising $\|\mathbf{w}\|^2$ under these constraints is a standard shape of maths problem (called a convex quadratic programme) that can be solved reliably.

The solution is found using a technique called Lagrange multipliers. Each training example gets an associated number $\alpha_i \geq 0$. After solving, most of these numbers turn out to be zero. Only the support vectors (the examples on the margin edges) end up with $\alpha_i > 0$. That is why only those points define the final boundary.

Real-world data is rarely perfectly separable. The soft-margin version of SVM allows some examples to cross the margin, controlled by the setting $C$:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\xi_i \quad \text{subject to } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i,\; \xi_i \geq 0 \text{ for all } i$$

The $\xi_i$ values (called slack variables) measure how much each example is allowed to violate the margin. The setting $C$ controls the trade-off: a large $C$ punishes violations heavily and creates a narrow margin, while a small $C$ tolerates more violations in exchange for a wider, more robust margin.
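To see that trade-off in action, here is a small sketch (scikit-learn assumed) that trains the same model with three different $C$ values and counts the support vectors each one keeps. A smaller $C$ means a wider margin, so more points sit on or inside it and become support vectors:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=42)

for C in [0.01, 1.0, 100.0]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C = {C:>6}: {model.n_support_.sum()} support vectors")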

Run the code yourself

This code trains an SVM on a breast cancer dataset. Notice that we scale the features (which means rescaling each one to a comparable range) before training. This step is essential with SVMs and must not be skipped.

Step 1: Open Google Colab and create a new notebook. (Or use Jupyter if you followed the Get Started guide.)

Step 2: Copy this code into a cell:

from sklearn.datasets import load_breast_cancer       # medical dataset: 569 tumours, 2 classes
from sklearn.svm import SVC                           # the Support Vector Classifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler      # rescales each feature to a similar range
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features: SVMs are very sensitive to features that are on very different scales
# For example, one feature might range from 0 to 1 and another from 0 to 1000
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn the scaling from training data and apply it
X_test  = scaler.transform(X_test)        # apply the same scaling to test data (never learn from test data)

# Train the SVM using the RBF kernel (good for non-linear problems)
model = SVC(kernel='rbf', C=1.0, random_state=42)
model.fit(X_train, y_train)

# Check accuracy on examples the model never saw during training
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions) * 100:.1f}%")
print(f"Number of support vectors per class: {model.n_support_}")

Step 3: Press Shift + Enter to run it.

You should see:

Accuracy: 97.4%
Number of support vectors per class: [37 60]

What just happened?

The model found the widest possible boundary between malignant and benign tumours in a high-dimensional space. It only needed 97 support vectors from the entire training set of 455 examples. Once those 97 critical points were found, all the other training examples became irrelevant. That is what makes SVMs elegant: a small number of well-chosen examples define the entire decision boundary.
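You can test that claim with a quick experiment in a new cell. This sketch reuses X_train, y_train, X_test and model from the code above: retrain on only the 97 support vectors and check that the smaller model predicts (almost) identically.

import numpy as np
from sklearn.svm import SVC

# Keep only the support vectors found by the original model
sv_idx = model.support_                          # indices of the support vectors in X_train

small_model = SVC(kernel='rbf', C=1.0, random_state=42)
small_model.fit(X_train[sv_idx], y_train[sv_idx])

agreement = np.mean(small_model.predict(X_test) == model.predict(X_test))
print(f"Agreement with the original model: {agreement * 100:.1f}%")   # expect about 100%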


Quick recap

An SVM separates two categories by finding the widest possible margin between them. Only the support vectors, the training points closest to the boundary, determine where that boundary goes. When a straight line cannot separate the categories, a kernel (most commonly RBF) lets a flat boundary do the job in a higher-dimensional space. The setting $C$ trades margin width against tolerance for misclassified examples, and features should always be scaled before training.
