K-Nearest Neighbours

Have you ever asked a friend “should I watch this show?” and they said “well, your three closest friends all love it, so you probably will too”? That is exactly how this algorithm works.


What is K-Nearest Neighbours?

K-Nearest Neighbours (KNN) is a way for a computer to make decisions by looking at similar examples it has already seen. When it needs to classify something new, it finds the $k$ most similar items in its memory and takes a vote.

New word: classify means sorting something into a group or category. For example, deciding if an email is spam or not spam is a classification task.

New word: k is just a number you choose. If $k = 3$, the algorithm looks at the 3 most similar examples. If $k = 5$, it looks at 5.
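If you are curious how that choice looks in code, the scikit-learn library used later in this chapter sets it with the n_neighbors parameter (the specific values below are just for illustration):

from sklearn.neighbors import KNeighborsClassifier

model_3 = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 most similar examples
model_5 = KNeighborsClassifier(n_neighbors=5)   # vote among the 5 most similar examples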


A simple way to think about it

Imagine you move to a new town and want to know if a neighbourhood is friendly. You do not know anyone yet.

So you walk around and find the 5 houses closest to the one you are considering. You knock on each door and ask: “Is this a friendly street?” Four say yes, one says no. You conclude: probably friendly.

That is KNN. You did not memorise a rule. You just looked at nearby examples and took a vote.

The only two things you need to decide are:

  1. How many neighbours to ask. That is the number $k$.
  2. What counts as "close". That is how you measure distance.


How it works, step by step

  1. The algorithm stores every training example it has seen, including its label (like “spam” or “not spam”).
  2. A new, unlabelled example arrives.
  3. The algorithm measures the distance from that new example to every stored example.
  4. It picks the $k$ closest examples (the nearest neighbours).
  5. It takes a vote among those $k$ neighbours and predicts the most common answer.
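
Those five steps are short enough to sketch from scratch. Here is a minimal, illustrative version in Python (the tiny dataset and the helper name knn_predict are made up for this example, and it assumes numpy is installed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3: measure the distance from x_new to every stored example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 4: pick the k closest examples
    nearest = np.argsort(distances)[:k]
    # Step 5: vote among those k neighbours and return the most common label
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Steps 1 and 2: store some labelled examples, then a new unlabelled one arrives
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))   # prints "A": two of its three nearest neighbours are labelled A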

See it visually

[Figure: points plotted by Feature 1 and Feature 2 — blue Class A dots, orange Class B dots, and a query point with its 3 nearest neighbours circled: 2 blue, 1 orange → predicts Class A]

The green diamond is the new unknown point. The dashed circle shows the 3 nearest neighbours. Two are blue and one is orange, so the algorithm votes: Class A wins. The new point gets labelled blue.


The maths (do not panic)

Here is how the algorithm measures “distance” between two points:

\[d(\mathbf{x}, \mathbf{x'}) = \sqrt{\sum_{j=1}^{p}(x_j - x'_j)^2}\]

In plain English: To find how far apart two examples are, subtract their values feature by feature, square each difference so negatives become positive, add them all up, then take the square root. This is just the ruler distance you learned in school, extended to as many features as you have.
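
To make the formula concrete, here is the same calculation done step by step for two made-up examples with three features each (numpy is used only for the arithmetic):

import numpy as np

x = np.array([2.0, 5.0, 1.0])        # first example, three features
x_prime = np.array([4.0, 1.0, 1.0])  # second example

differences = x - x_prime            # subtract feature by feature: [-2, 4, 0]
squared = differences ** 2           # square so negatives become positive: [4, 16, 0]
distance = np.sqrt(squared.sum())    # add them up and take the square root
print(distance)                      # sqrt(20) ≈ 4.47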

This formula is called Euclidean distance. It is the same formula you use to find the straight-line distance between two points on a map. There are other ways to measure distance too. Manhattan distance adds up the differences without squaring them first. Cosine similarity measures the angle between two examples rather than their separation. Different tasks call for different measures, but Euclidean is the most common starting point.

One important warning: if one feature uses very large numbers (like salary in thousands) and another uses very small numbers (like age), the large-number feature will dominate the distance calculation unfairly. Always rescale your features before using KNN so every feature contributes equally.
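
Here is a small illustration of that warning, with numbers invented for the example. Measured on raw values, the salary column swamps the age column, so a 55-year-old can look "closer" to a 25-year-old than another 25-ish-year-old does:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Features: [age in years, salary in pounds] -- wildly different scales
people = np.array([[25.0, 30000.0],
                   [26.0, 31000.0],
                   [55.0, 30500.0]])

# Raw distances from person 0: salary dominates, so the 55-year-old looks closer
print(np.sqrt(((people[0] - people[1]) ** 2).sum()))   # ≈ 1000.0
print(np.sqrt(((people[0] - people[2]) ** 2).sum()))   # ≈ 500.9

# After rescaling, both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
print(np.sqrt(((scaled[0] - scaled[1]) ** 2).sum()))   # ≈ 2.45
print(np.sqrt(((scaled[0] - scaled[2]) ** 2).sum()))   # ≈ 2.48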

Run the code yourself

This code teaches KNN to identify which species of flower an iris belongs to, based on measurements of its petals and sepals. The model never writes a rule. It simply memorises all the training flowers and checks which ones are most similar when a new flower arrives.

Step 1: Open Google Colab and create a new notebook.

Step 2: Copy this code into a cell:

# Import the tools we need
from sklearn.datasets import load_iris                  # a built-in dataset of 150 flowers
from sklearn.neighbors import KNeighborsClassifier      # the KNN model
from sklearn.model_selection import train_test_split    # splits data into train and test
from sklearn.preprocessing import StandardScaler        # rescales features to be fair
from sklearn.metrics import accuracy_score              # measures how often we are right

# Load the Iris dataset (150 flowers, 4 measurements each, 3 species)
data = load_iris()
X, y = data.data, data.target

# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rescale features so no single measurement dominates the distance calculation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn the scale from training data
X_test = scaler.transform(X_test)         # apply the same scale to test data

# Create a KNN model that votes among 5 nearest neighbours
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)               # "training" just stores all the data

# Ask the model to predict species for the test flowers
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions) * 100:.1f}%")

Step 3: Press Shift + Enter to run it.

You should see:

Accuracy: 100.0%

What each line does:

  1. load_iris() loads 150 flower measurements together with their species labels.
  2. train_test_split keeps 20% of the flowers aside so we can test on data the model has never seen.
  3. StandardScaler rescales every feature so no single measurement dominates the distance calculation.
  4. KNeighborsClassifier(n_neighbors=5) creates a model that votes among the 5 nearest neighbours.
  5. fit stores the training flowers, and predict finds each test flower's neighbours and takes the vote.
  6. accuracy_score reports how many of the test flowers were labelled correctly.

What just happened?

The model memorised 120 training flowers. When it saw each of the 30 test flowers, it found the 5 most similar training flowers and predicted the most common species among them. It got every single one right. Try changing n_neighbors to 1 or to 20 and see how the accuracy changes. That is the key trade-off: too few neighbours and one unusual example can throw you off; too many and you start including flowers that are not really similar at all.
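
If you want to see that trade-off without editing the cell by hand each time, this small loop (written for this chapter, not part of the original exercise) retrains the model for several values of $k$ and prints each accuracy. Run it in a new cell after the code above, so the variables and imports already exist:

# Try several values of k and compare the accuracy on the same test set
for k in [1, 3, 5, 10, 20]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"k = {k:2d}  accuracy = {accuracy * 100:.1f}%")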


Quick recap

  1. KNN classifies a new example by finding the $k$ most similar stored examples and taking a vote.
  2. "Training" just means memorising the data; all the real work happens at prediction time.
  3. You only choose two things: the number of neighbours $k$ and how to measure distance (Euclidean is the usual starting point).
  4. Always rescale your features first, or large-valued features will dominate the distance.
  5. Too few neighbours makes the model sensitive to one odd example; too many drags in examples that are not really similar.