K-Nearest Neighbours

Have you ever asked a friend “should I watch this show?” and they said “well, your three closest friends all love it, so you probably will too”? That is exactly how this algorithm works.


What is K-Nearest Neighbours?

K-Nearest Neighbours (KNN) is a way for a computer to make decisions by looking at similar examples it has already seen. When it needs to classify something new, it finds the $k$ most similar items in its memory and takes a vote.

New word: classify means sorting something into a group or category. For example, deciding if an email is spam or not spam is a classification task.

New word: k is just a number you choose. If $k = 3$, the algorithm looks at the 3 most similar examples. If $k = 5$, it looks at 5.
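If you are curious how that choice looks in code, the scikit-learn library used later in this chapter sets it with the n_neighbors parameter (the specific values below are just for illustration):

from sklearn.neighbors import KNeighborsClassifier

model_3 = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 most similar examples
model_5 = KNeighborsClassifier(n_neighbors=5)   # vote among the 5 most similar examples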


A simple way to think about it

Imagine you move to a new town and want to know if a neighbourhood is friendly. You do not know anyone yet.

So you walk around and find the 5 houses closest to the one you are considering. You knock on each door and ask: “Is this a friendly street?” Four say yes, one says no. You conclude: probably friendly.

That is KNN. You did not memorise a rule. You just looked at nearby examples and took a vote.

The only two things you need to decide are:

  1. How many neighbours to ask. That is the number $k$.
  2. What counts as "close". That is how you measure distance.


How it works, step by step

  1. The algorithm stores every training example it has seen, including its label (like “spam” or “not spam”).
  2. A new, unlabelled example arrives.
  3. The algorithm measures the distance from that new example to every stored example.
  4. It picks the $k$ closest examples (the nearest neighbours).
  5. It takes a vote among those $k$ neighbours and predicts the most common answer.
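
Those five steps are short enough to sketch from scratch. Here is a minimal, illustrative version in Python (the tiny dataset and the helper name knn_predict are made up for this example, and it assumes numpy is installed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3: measure the distance from x_new to every stored example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 4: pick the k closest examples
    nearest = np.argsort(distances)[:k]
    # Step 5: vote among those k neighbours and return the most common label
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Steps 1 and 2: store some labelled examples, then a new unlabelled one arrives
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))   # prints "A": two of its three nearest neighbours are labelled A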

See it visually

[Figure: points plotted by Feature 1 and Feature 2 — blue Class A dots, orange Class B dots, and a query point with its 3 nearest neighbours circled: 2 blue, 1 orange → predicts Class A]

The green diamond is the new unknown point. The dashed circle shows the 3 nearest neighbours. Two are blue and one is orange, so the algorithm votes: Class A wins. The new point gets labelled blue.


The maths (do not panic)

Here is how the algorithm measures “distance” between two points:

\[d(\mathbf{x}, \mathbf{x'}) = \sqrt{\sum_{j=1}^{p}(x_j - x'_j)^2}\]

In plain English: To find how far apart two examples are, subtract their values feature by feature, square each difference so negatives become positive, add them all up, then take the square root. This is just the ruler distance you learned in school, extended to as many features as you have.
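
To make the formula concrete, here is the same calculation done step by step for two made-up examples with three features each (numpy is used only for the arithmetic):

import numpy as np

x = np.array([2.0, 5.0, 1.0])        # first example, three features
x_prime = np.array([4.0, 1.0, 1.0])  # second example

differences = x - x_prime            # subtract feature by feature: [-2, 4, 0]
squared = differences ** 2           # square so negatives become positive: [4, 16, 0]
distance = np.sqrt(squared.sum())    # add them up and take the square root
print(distance)                      # sqrt(20) ≈ 4.47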

This formula is called Euclidean distance. It is the same formula you use to find the straight-line distance between two points on a map. There are other ways to measure distance too. Manhattan distance adds up the differences without squaring them first. Cosine similarity measures the angle between two examples rather than their separation. Different tasks call for different measures, but Euclidean is the most common starting point.

One important warning: if one feature uses very large numbers (like salary in thousands) and another uses very small numbers (like age), the large-number feature will dominate the distance calculation unfairly. Always rescale your features before using KNN so every feature contributes equally.
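
Here is a small illustration of that warning, with numbers invented for the example. Measured on raw values, the salary column swamps the age column, so a 55-year-old can look "closer" to a 25-year-old than another 25-ish-year-old does:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Features: [age in years, salary in pounds] -- wildly different scales
people = np.array([[25.0, 30000.0],
                   [26.0, 31000.0],
                   [55.0, 30500.0]])

# Raw distances from person 0: salary dominates, so the 55-year-old looks closer
print(np.sqrt(((people[0] - people[1]) ** 2).sum()))   # ≈ 1000.0
print(np.sqrt(((people[0] - people[2]) ** 2).sum()))   # ≈ 500.9

# After rescaling, both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
print(np.sqrt(((scaled[0] - scaled[1]) ** 2).sum()))   # ≈ 2.45
print(np.sqrt(((scaled[0] - scaled[2]) ** 2).sum()))   # ≈ 2.48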

Run the code yourself

This code teaches KNN to identify which species of flower an iris belongs to, based on measurements of its petals and sepals. The model never writes a rule. It simply memorises all the training flowers and checks which ones are most similar when a new flower arrives.

Step 1: Open Google Colab and create a new notebook.

Step 2: Copy this code into a cell:

# Import the tools we need
from sklearn.datasets import load_iris                  # a built-in dataset of 150 flowers
from sklearn.neighbors import KNeighborsClassifier      # the KNN model
from sklearn.model_selection import train_test_split    # splits data into train and test
from sklearn.preprocessing import StandardScaler        # rescales features to be fair
from sklearn.metrics import accuracy_score              # measures how often we are right

# Load the Iris dataset (150 flowers, 4 measurements each, 3 species)
data = load_iris()
X, y = data.data, data.target

# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rescale features so no single measurement dominates the distance calculation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn the scale from training data
X_test = scaler.transform(X_test)         # apply the same scale to test data

# Create a KNN model that votes among 5 nearest neighbours
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)               # "training" just stores all the data

# Ask the model to predict species for the test flowers
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions) * 100:.1f}%")

Step 3: Press Shift + Enter to run it.

You should see:

Accuracy: 100.0%

What each line does:

  1. load_iris() loads 150 flower measurements together with their species labels.
  2. train_test_split keeps 20% of the flowers aside so we can test on data the model has never seen.
  3. StandardScaler rescales every feature so no single measurement dominates the distance calculation.
  4. KNeighborsClassifier(n_neighbors=5) creates a model that votes among the 5 nearest neighbours.
  5. fit stores the training flowers, and predict finds each test flower's neighbours and takes the vote.
  6. accuracy_score reports how many of the test flowers were labelled correctly.

What just happened?

The model memorised 120 training flowers. When it saw each of the 30 test flowers, it found the 5 most similar training flowers and predicted the most common species among them. It got every single one right. Try changing n_neighbors to 1 or to 20 and see how the accuracy changes. That is the key trade-off: too few neighbours and one unusual example can throw you off; too many and you start including flowers that are not really similar at all.
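
If you want to see that trade-off without editing the cell by hand each time, this small loop (written for this chapter, not part of the original exercise) retrains the model for several values of $k$ and prints each accuracy. Run it in a new cell after the code above, so the variables and imports already exist:

# Try several values of k and compare the accuracy on the same test set
for k in [1, 3, 5, 10, 20]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"k = {k:2d}  accuracy = {accuracy * 100:.1f}%")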


Quick recap

  1. KNN classifies a new example by finding the $k$ most similar stored examples and taking a vote.
  2. "Training" just means memorising the data; all the real work happens at prediction time.
  3. You only choose two things: the number of neighbours $k$ and how to measure distance (Euclidean is the usual starting point).
  4. Always rescale your features first, or large-valued features will dominate the distance.
  5. Too few neighbours makes the model sensitive to one odd example; too many drags in examples that are not really similar.