K-Means Clustering
Spotify groups its 600 million users into listener types so it can recommend the right playlists. Nobody labelled each user manually. The algorithm found the groups on its own. That is clustering, and K-Means is the most popular way to do it.
What is K-Means Clustering?
K-Means Clustering is a way to automatically group data into $k$ clusters, where similar items end up in the same group. You do not tell it what the groups should be. You just say how many groups you want, and the algorithm figures out the rest.
New word: unsupervised means the algorithm works without labels. It does not know what the “right answer” is. It just looks for natural groups in the data.
New word: centroid is the centre point of a cluster. Think of it as the average member of the group.
A simple way to think about it
Imagine dropping 100 grains of rice onto a table at random. You want to sort them into 3 piles.
Here is what K-Means does. It places 3 imaginary pins on the table at random positions. Each grain of rice joins the pile belonging to the nearest pin. Then each pin moves to the middle of its pile. Then every grain re-checks which pin is now closest and rejoins accordingly. The pins keep moving until they stop.
The result is 3 tidy piles where each grain is grouped with the grains most similar to it. Nobody told the algorithm which grains belong together. It worked it out by repeatedly asking: “which centre is closest?”
The only question you have to answer beforehand is: how many piles do you want? That number is $k$.
How it works, step by step
1. Choose how many clusters you want ($k$) and place $k$ starting centre points at random positions.
2. Assign every data point to the nearest centre.
3. Move each centre to the average position of all the points assigned to it.
4. Repeat steps 2 and 3 until no point changes its cluster assignment.
5. At that point the algorithm has converged and the clusters are ready.
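If you find code easier to read than prose, here is a minimal from-scratch sketch of that loop in NumPy. It is meant to mirror the steps above, not to compete with a library implementation: the function name, the choice of random data points as starting centres, and the fixed iteration cap are all illustrative.
import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: use k random data points as the starting centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centre
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centre to the average of its assigned points
        # (assumes no cluster ends up empty, which a real library would handle)
        centres_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centres, and hence the assignments, settle
        if np.allclose(centres_new, centres):
            break
        centres = centres_new
    return labels, centres
Calling kmeans_sketch(X, 3) on any NumPy array of points returns a cluster label for every point plus the final centres.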
See it visually
K=3: each colour is a cluster, stars are centroids
Each colour is a cluster. The star in the middle of each cluster is the centroid, the average position of all points in that group. Every dot has been assigned to the nearest star.
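If you would like to recreate a picture like this yourself, the sketch below makes some blobby two-dimensional data, clusters it, and draws each cluster in its own colour with a star on each centroid. The make_blobs settings and random seeds are arbitrary choices for illustration.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 points in 3 loose groups (illustrative settings)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

# One colour per cluster, a black star on each centroid
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=15)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker='*', s=300, c='black')
plt.title('K=3: each colour is a cluster, stars are centroids')
plt.show()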
The maths (do not panic)
K-Means tries to minimise this:
\[\arg\min_S \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2\]
In plain English: for each cluster $S_i$, measure the distance from every point in it to the cluster’s centre $\boldsymbol{\mu}_i$. Square those distances and add them all up. K-Means tries to find the grouping that makes this total as small as possible. Smaller means every point is close to its cluster’s centre, which means tighter, cleaner clusters.
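To make the formula concrete, here is a tiny worked computation with six made-up two-dimensional points split into two clusters. The numbers are invented purely for illustration.
import numpy as np

# Two made-up clusters of three 2-D points each
S1 = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
S2 = np.array([[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])

# Each centre (mu) is the average of its cluster's points
mu1, mu2 = S1.mean(axis=0), S2.mean(axis=0)

# The objective: squared distances from each point to its own centre, summed
total = ((S1 - mu1) ** 2).sum() + ((S2 - mu2) ** 2).sum()
print(total)  # about 2.67 here, a small total because both clusters are tight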
More detail
Finding the perfect grouping is extremely hard to do in one go. K-Means takes a shortcut by alternating between two steps, each of which improves the answer. In the assignment step, every point is sent to its nearest centre. This is the best possible assignment given the current centres. In the update step, each centre moves to the average position of its assigned points. This is the best possible position for a centre given the current assignment. Because each step makes the total squared distance smaller (or keeps it the same), and because the total cannot go below zero, the loop is guaranteed to stop eventually. One practical tip: always run K-Means several times with different random starting positions and keep the best result. This reduces the chance of landing in a poor solution caused by unlucky starting positions.
Run the code yourself
This code groups iris flowers into 3 clusters without looking at the species labels. Then it checks how well those discovered groups match the true species.
Step 1: Open Google Colab and create a new notebook.
Step 2: Copy this code into a cell:
# Import the tools we need
from sklearn.cluster import KMeans # the clustering model
from sklearn.datasets import load_iris # 150 flowers, 4 measurements each
from sklearn.preprocessing import StandardScaler # rescale features before clustering
from sklearn.metrics import adjusted_rand_score # measures how good the clusters are
# Load iris data - we will NOT use the species labels during clustering
iris = load_iris()
X = iris.data
y_true = iris.target # we will only use this at the end to check our results
# Rescale features: K-Means uses distances, so fair scaling matters
X_scaled = StandardScaler().fit_transform(X)
# Create and run K-Means: find 3 clusters, try 10 different starting positions
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km.fit(X_scaled) # runs the assign-then-move loop until clusters stop changing
# Report how tight the clusters are and how well they match true species
print(f"Inertia: {km.inertia_:.2f}") # lower = tighter clusters
print(f"Adjusted Rand Score: {adjusted_rand_score(y_true, km.labels_):.3f}") # 1.0 = perfect match
Step 3: Press Shift + Enter to run it.
You should see:
Inertia: 139.82
Adjusted Rand Score: 0.730
What each line does:
- load_iris(): loads 150 flower measurements (the labels are set aside and not used during clustering)
- StandardScaler().fit_transform(X): rescales all measurements so none dominate the distance calculation
- KMeans(n_clusters=3, init='k-means++'): creates a K-Means model with smarter-than-random starting positions
- n_init=10: runs the whole process 10 times with different starts, then keeps the best result
- km.fit(X_scaled): runs the assign-and-move loop until the clusters stop changing
- km.inertia_: the total squared distance from each point to its cluster centre (lower is better)
- adjusted_rand_score(...): compares the discovered clusters to the true species labels. A score of 1.0 is a perfect match and 0 is no better than random.
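As a follow-on, a fitted model can place brand-new flowers into the clusters it already found. This sketch assumes the variables from the cell above are still defined; the two flower measurements are invented for illustration, and note that new data must be rescaled the same way as the training data.
# Refit the scaler so new measurements get the same rescaling as the training data
scaler = StandardScaler().fit(X)
new_flowers = [[5.1, 3.5, 1.4, 0.2],  # made-up measurements, same 4 columns as iris
               [6.7, 3.0, 5.2, 2.3]]
print(km.predict(scaler.transform(new_flowers)))  # prints a cluster index for each flower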
What just happened?
The algorithm never saw the flower species labels. It just looked at 4 measurements and grouped similar flowers together. The Adjusted Rand Score of 0.73 means its groupings match the true species much better than chance. It is not perfect because two of the three iris species look very similar in measurement space, but it found real structure without any guidance.
Quick recap
- K-Means groups data into $k$ clusters by repeatedly assigning each point to the nearest centre and then moving the centre to the average of its group.
- You must choose $k$ before you start. A common approach, known as the elbow method, is to try several values and pick the one where adding more clusters stops helping much (see the sketch after this list).
- Always rescale your features first. K-Means uses distances, so features with large numbers will dominate otherwise.
- It works best when the natural groups in the data are roughly round and similar in size.
- It is usually the first clustering method to try on any new dataset because it is fast and easy to understand.
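For the second bullet, here is a minimal sketch of the elbow method, reusing X_scaled from the earlier cell: fit K-Means for several values of $k$, plot the inertia, and look for the bend where the curve stops dropping steeply.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia for k = 1 to 8 on the scaled iris data from the earlier cell
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
            for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia (total squared distance)')
plt.show()  # the bend (the "elbow") marks where extra clusters stop helping much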