Linear Regression
Every time a real estate website shows you an estimated house price, it is using something very close to what you are about to learn. The bigger the house, the higher the price. Linear Regression finds that relationship in data and uses it to make predictions.
What is Linear Regression?
Linear Regression is a way for a computer to learn the relationship between two things (like house size and house price) and then use that relationship to predict new values it has never seen before.
Here is the key idea: if you plot your data on a graph (size on one axis, price on the other), you will see a pattern. Bigger houses cost more. Linear Regression draws the single best straight line through all those points. Once you have that line, you can predict the price of any house just by finding where its size lands on the line.
New word, model: in machine learning, a “model” is just the thing that makes predictions. Here, the model is the straight line the algorithm learned.
A simple way to think about it
Imagine you are tutoring 10 students. You notice that students who study more hours tend to score higher on tests. You write down each student’s study hours and their test score.
Now a new student asks: “If I study for 6 hours, what score might I get?”
You look at your notes, spot the pattern, and estimate. That is exactly what Linear Regression does, but with maths, so it is precise and consistent every time.
The line it draws has two numbers:
- Slope: how much the prediction goes up for each extra unit of input (for every extra hour of study, score goes up by X points)
- Intercept: the starting point (a student who studied 0 hours would still get Y points, on average)
Once those two numbers are locked in, you can answer any “what if” question instantly.
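Those two numbers really are the whole model. Here is a tiny sketch of the tutoring example in Python; the slope of 5 points per hour and intercept of 40 are made-up values for illustration, not numbers learned from real data:

```python
# Hypothetical line for the tutoring example:
slope = 5       # each extra hour of study adds 5 points (assumed)
intercept = 40  # expected score after 0 hours of study (assumed)

def predict_score(hours):
    """Answer a 'what if' question using the line's two numbers."""
    return slope * hours + intercept

print(predict_score(6))  # the new student who plans to study 6 hours -> 70
print(predict_score(0))  # no study at all -> 40
```

Once `slope` and `intercept` are fixed, every prediction is just one multiplication and one addition.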
How it works, step by step
- You collect data: pairs of inputs and outputs, such as size and price, study hours and scores, or age and salary
- The algorithm draws a line through your data and measures how far off it is from every point
- It adjusts the line to reduce those errors
- It keeps adjusting, thousands of times, until the line fits as well as possible
- Training is done. The line is fixed and ready to use for new predictions
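The adjust-and-repeat loop above describes a technique called gradient descent. Here is a minimal sketch on made-up data where the true line is y = 2x + 1; the learning rate and iteration count are arbitrary choices for this toy example:

```python
import numpy as np

# Toy data that follows y = 2x + 1 exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w, b = 0.0, 0.0  # start with a flat line at zero
lr = 0.01        # learning rate: how big each adjustment is

for _ in range(5000):                     # keep adjusting, thousands of times
    y_hat = w * x + b                     # current line's predictions
    error = y_hat - y                     # how far off it is from every point
    w -= lr * 2 * np.mean(error * x)      # nudge the slope to shrink the error
    b -= lr * 2 * np.mean(error)          # nudge the intercept too

print(round(w, 2), round(b, 2))  # converges to roughly 2.0 and 1.0
```

Worth knowing: scikit-learn's `LinearRegression`, used later in this lesson, skips the loop entirely and solves for the best `w` and `b` in one algebraic step, but the goal is the same: make the errors as small as possible.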
See it visually
The blue dots are your actual data. The red line is what Linear Regression learned. You can use that line to predict the price of any house size, even sizes that were not in your original data.
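If you want to draw that chart yourself, here is a short sketch using matplotlib with the same seven houses used in the exercise below (the plot is saved to a file name chosen here for illustration; in a notebook you would call `plt.show()` instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw without opening a window; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

sizes = np.array([500, 750, 1000, 1250, 1500, 1750, 2000]).reshape(-1, 1)
prices = np.array([150, 200, 250, 310, 350, 400, 450])

model = LinearRegression().fit(sizes, prices)

plt.scatter(sizes, prices, color="blue", label="Actual houses")       # the blue dots
plt.plot(sizes, model.predict(sizes), color="red", label="Learned line")  # the red line
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($ thousands)")
plt.legend()
plt.savefig("regression_line.png")
```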
The maths (do not panic)
Here is the formula the line uses to make a prediction:
\[\hat{y} = w \cdot x + b\]

In plain English: the predicted value ($\hat{y}$) equals your input ($x$) multiplied by a weight ($w$, which is the slope), plus a bias ($b$, which is the intercept). The algorithm’s job is to find the best values of $w$ and $b$ from your data.
Show more detail
To find the best line, the algorithm minimises something called **Mean Squared Error (MSE)**. MSE measures the average of the squared gaps between the line's predictions and the real values:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Think of it as: take the gap between each prediction and the real answer, square it (so positive and negative gaps both count), then average them all. A perfect line would have an MSE of zero. The algorithm adjusts $w$ and $b$ to make MSE as small as possible.
Run the code yourself
You are going to train a model that predicts house prices based on size. The model will study seven examples and then predict the price of a 1600 sq ft house it has never seen.
Step 1: Open Google Colab and create a new notebook. (Or use Jupyter if you followed the Get Started guide.)
Step 2: Copy this code into a cell:
```python
# Import the tools we need
import numpy as np  # numpy helps us work with lists of numbers
from sklearn.linear_model import LinearRegression  # this is the model we will train

# Our training data
# Each house has a size (in sq ft) and a price (in $thousands)
house_sizes = np.array([500, 750, 1000, 1250, 1500, 1750, 2000]).reshape(-1, 1)
house_prices = np.array([150, 200, 250, 310, 350, 400, 450])

# Create a blank Linear Regression model (it knows nothing yet)
model = LinearRegression()

# Train the model: show it all 7 houses so it can learn the pattern
model.fit(house_sizes, house_prices)

# See what it learned
print(f"Slope (how much price rises per sq ft): {model.coef_[0]:.4f}")
print(f"Intercept (base price when size is zero): {model.intercept_:.2f}")

# Now ask the model to predict a house it has never seen: 1600 sq ft
predicted_price = model.predict([[1600]])[0]
print(f"Predicted price for 1600 sq ft: ${predicted_price:.1f}k")
```
Step 3: Press Shift + Enter to run it.
You should see:

```
Slope (how much price rises per sq ft): 0.2000
Intercept (base price when size is zero): 51.43
Predicted price for 1600 sq ft: $371.4k
```
What each line does:
- `import numpy as np`: loads a tool called numpy that makes working with lists of numbers fast and easy
- `from sklearn.linear_model import LinearRegression`: loads the pre-built Linear Regression model from a library called scikit-learn
- `house_sizes = ...`: creates our list of house sizes. The `.reshape(-1, 1)` just puts them into the shape scikit-learn expects
- `model = LinearRegression()`: creates a blank model that knows nothing yet
- `model.fit(...)`: this is the training step. The model studies all 7 examples and finds the best slope and intercept
- `model.coef_[0]`: the slope the model learned (0.2 means every extra sq ft adds about $200, since prices are in thousands)
- `model.intercept_`: the intercept it learned (about $51,430 at size zero)
- `model.predict([[1600]])`: use the learned line to predict the price of a 1600 sq ft house
What just happened?
The model studied 7 houses and spotted the pattern: roughly $200 more for every extra square foot. You never told it that rule. It worked it out on its own. Now it can estimate the price of any house size, including ones it has never seen. That is exactly what machine learning means: the computer learns the rule from the data instead of you writing it by hand.
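As a small extension of the exercise, the same trained model can price a whole list of sizes in one call. The sizes below are arbitrary examples chosen for illustration; note that the further a size sits outside the training range (like 2500 sq ft here), the less trustworthy the estimate becomes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same seven houses as in the exercise
house_sizes = np.array([500, 750, 1000, 1250, 1500, 1750, 2000]).reshape(-1, 1)
house_prices = np.array([150, 200, 250, 310, 350, 400, 450])

model = LinearRegression().fit(house_sizes, house_prices)

# Predict several unseen sizes in a single call
new_sizes = np.array([[800], [1600], [2500]])
for size, price in zip(new_sizes.ravel(), model.predict(new_sizes)):
    print(f"{size} sq ft -> ${price:.1f}k")
```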
Quick recap
- Linear Regression draws the best straight line through your data to find a relationship between input and output
- The line has two numbers: a slope (how fast the output rises) and an intercept (the starting point)
- Once trained, you can give it any new input and it will predict an output
- It works best when the relationship is roughly a straight line (bigger input = proportionally bigger output)
- It is usually the first thing you try in any prediction problem because it is fast, simple, and easy to understand