Picture a student preparing for an exam. You notice that students who study more hours tend to score higher. If you plot hours on the x-axis and scores on the y-axis, the points form a rough upward-sloping cloud. Draw the best possible straight line through that cloud, and you have linear regression.

The line lets you make predictions: "A student who studies 7 hours will probably score around 78." You're not memorizing every student's exact result; you're learning the underlying relationship. That ability to generalize from examples to predictions is the heart of machine learning.

Linear regression is one of the simplest supervised learning algorithms, and many more complex methods build on its ideas. Understanding it deeply builds intuition for neural networks, regularization, and beyond.

Why Linear Regression Works

The core idea is to find a line — or a hyperplane in multiple dimensions — that best explains the relationship between input features and a continuous output. "Best" means the line that minimizes prediction error across all training examples.

The equation for simple linear regression is:

y = mx + b

Where y is the predicted output, x is the input feature, m is the slope (how much y changes per unit of x), and b is the intercept (the y-value when x is zero).

The model learns m and b by minimizing Mean Squared Error (MSE):

MSE = (1/n) * Σ (y_actual - y_predicted)²

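Squaring the errors keeps positive and negative errors from cancelling each other out, penalizes large mistakes more than small ones, and gives a smooth, differentiable loss that can be minimized either in closed form or with gradient descent.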
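As a small worked illustration of the formula, the sketch below computes MSE by hand for four students. The data points and the candidate line (m = 7.5, b = 30) are made-up values chosen for easy arithmetic, not fitted results.

```python
import numpy as np

# Hypothetical toy data: hours studied and the scores actually observed.
hours = np.array([2.0, 4.0, 6.0, 8.0])
actual = np.array([46.0, 58.0, 77.0, 90.0])

# An assumed candidate line (illustrative values, not a fitted model).
m, b = 7.5, 30.0
predicted = m * hours + b      # y = mx + b for every student

errors = actual - predicted    # signed errors: [ 1. -2.  2.  0.]
squared = errors ** 2          # squared errors: [1. 4. 4. 0.]
mse = squared.mean()           # (1/n) * sum of squared errors

print("sum of signed errors:", errors.sum())  # 1.0
print("MSE:", mse)                            # 2.25
```

The signed errors nearly cancel (they sum to 1), while the squared errors do not. An error of 2 contributes 4 to the total and an error of 1 contributes only 1, which is the sense in which large mistakes are penalized more.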

Gradient Descent Intuition

Imagine you are blindfolded on a hilly landscape, and your goal is to reach the lowest valley. You can only feel the slope beneath your feet. At each step, you move in the downhill direction. The size of your step is the learning rate. Too large a step and you overshoot the valley. Too small and it takes forever. Gradient descent is exactly this process — iteratively adjusting the model parameters in the direction that reduces error.
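The walk downhill can be sketched in a few lines of NumPy. This is a minimal batch gradient descent for y = mx + b under assumed settings: the function name, the learning rate of 0.01, the step count, and the synthetic data are all illustrative choices, and np.polyfit is used only as a reference answer to compare against.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, n_steps=5000):
    """Fit y ≈ m*x + b by batch gradient descent on the MSE."""
    m, b = 0.0, 0.0                              # start somewhere on the landscape
    n = len(x)
    for _ in range(n_steps):
        error = (m * x + b) - y                  # prediction minus actual
        grad_m = (2.0 / n) * np.sum(error * x)   # d(MSE)/dm: slope felt along m
        grad_b = (2.0 / n) * np.sum(error)       # d(MSE)/db: slope felt along b
        m -= learning_rate * grad_m              # step downhill
        b -= learning_rate * grad_b
    return m, b

# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 20)
y = 7.5 * x + 30 + rng.normal(0, 5, size=x.size)

m_gd, b_gd = gradient_descent(x, y)
m_ref, b_ref = np.polyfit(x, y, deg=1)           # direct least-squares fit for comparison

print(f"gradient descent: m={m_gd:.4f}, b={b_gd:.4f}")
print(f"direct solution:  m={m_ref:.4f}, b={b_ref:.4f}")
```

With these settings the two answers should agree to several decimal places. Raising learning_rate far enough makes the updates overshoot and diverge, and lowering it means more steps are needed to get equally close.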

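Scikit-learn's LinearRegression does not use gradient descent. It solves the least-squares problem directly with a linear-algebra solver, reaching the same optimum that the Normal Equation describes in closed form, so there is no learning rate to tune for standard linear regression. Gradient descent matters when a direct solve becomes impractical, for example when the dataset has so many rows or features that it no longer fits in memory; scikit-learn's SGDRegressor covers that case.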

From Scratch vs Scikit-Learn

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Dataset: hours studied vs exam score
np.random.seed(42)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                  1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 3])
scores = hours * 7.5 + 30 + np.random.normal(0, 5, len(hours))

X = hours.reshape(-1, 1)
y = scores

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Scikit-learn approach ---
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Slope (m):     {model.coef_[0]:.4f}")
print(f"Intercept (b): {model.intercept_:.4f}")
print(f"MSE:           {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE:          {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R² Score:      {r2_score(y_test, y_pred):.4f}")

Output:

Slope (m):     7.6231
Intercept (b): 29.4817
MSE:           22.3156
RMSE:          4.7240
R² Score:      0.9312

# --- From-scratch NumPy approach ---
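def linear_regression_numpy(X, y):
    """Closed-form solution (Normal Equation): theta = (X^T X)^(-1) X^T y"""
    X_b = np.c_[np.ones((len(X), 1)), X]  # Prepend a column of ones for the intercept
    # Solve (X^T X) theta = X^T y instead of forming an explicit inverse,
    # which gives the same answer and is numerically more stable
    theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
    return theta[0], theta[1]  # intercept, slope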

intercept, slope = linear_regression_numpy(X_train, y_train)
print(f"\nNumPy from-scratch:")
print(f"Slope:     {slope:.4f}")
print(f"Intercept: {intercept:.4f}")

# Predict and score
X_test_b = np.c_[np.ones((len(X_test), 1)), X_test]
y_pred_np = X_test_b @ np.array([intercept, slope])
print(f"R² Score:  {r2_score(y_test, y_pred_np):.4f}")

Output:

NumPy from-scratch:
Slope:     7.6231
Intercept: 29.4817
R² Score:  0.9312

Both approaches produce the same results (up to floating-point rounding), because both solve the same least-squares problem. Scikit-learn is preferred in practice; the NumPy version builds intuition.
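One way to check the agreement programmatically, rather than by comparing printed digits, is np.allclose. The sketch below uses its own synthetic data (an assumption, so it runs on its own) and compares three routes to the same parameters: the Normal Equation, NumPy's least-squares solver, and scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative): one feature, noisy linear target.
rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(50, 1))
y = 7.5 * X[:, 0] + 30 + rng.normal(0, 5, size=50)

X_b = np.c_[np.ones((len(X), 1)), X]             # column of ones for the intercept

# Route 1: Normal Equation, solved as a linear system.
theta_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Route 2: NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Route 3: scikit-learn.
model = LinearRegression().fit(X, y)
theta_sklearn = np.array([model.intercept_, model.coef_[0]])

print(np.allclose(theta_normal, theta_sklearn))  # True
print(np.allclose(theta_lstsq, theta_sklearn))   # True
```

Comparing with a tolerance is the safer habit here, since the three routes perform the arithmetic in different orders and can differ in the last few bits.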

Multiple Linear Regression

When you have more than one feature — say, hours studied AND hours of sleep — the equation extends to:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn
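In code, this equation is usually evaluated as a matrix-vector product instead of being written out term by term. The sketch below shows the two forms giving the same predictions; the intercept, coefficients, and student rows are hypothetical values picked for easy arithmetic.

```python
import numpy as np

# Hypothetical parameters: intercept b0, then b1 (hours studied), b2 (hours of sleep).
b0 = 20.0
coefs = np.array([6.0, 2.5])

# Each row is one student: [hours studied, hours of sleep].
X = np.array([
    [5.0, 8.0],
    [7.0, 6.0],
    [2.0, 7.0],
])

# Term by term, exactly as the equation is written.
y_terms = np.array([b0 + coefs[0] * x1 + coefs[1] * x2 for x1, x2 in X])

# The same computation as one matrix-vector product.
y_matrix = b0 + X @ coefs

print(y_terms)   # [70.  77.  49.5]
print(y_matrix)  # [70.  77.  49.5]
```

The matrix form is what lets the same code handle two features or two hundred without changes.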

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
X_house = housing.data[['MedInc', 'AveRooms', 'AveOccup', 'HouseAge']]
y_house = housing.target

X_tr, X_te, y_tr, y_te = train_test_split(X_house, y_house, test_size=0.2, random_state=42)

mlr = LinearRegression()
mlr.fit(X_tr, y_tr)

coeff_df = pd.DataFrame({'Feature': X_house.columns, 'Coefficient': mlr.coef_})
print(coeff_df.to_string(index=False))
print(f"\nR² on test set: {mlr.score(X_te, y_te):.4f}")

Output:

   Feature  Coefficient
    MedInc      0.4517
  AveRooms     -0.0523
  AveOccup     -0.0031
  HouseAge      0.0098

R² on test set: 0.5143

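Each coefficient answers one question: holding all other features constant, how much does the predicted target change per one-unit increase in this feature? In this dataset the target is the median house value in units of $100,000, so the MedInc coefficient of 0.4517 corresponds to roughly $45,000 of predicted value per one-unit increase in MedInc.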
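The "holding all other features constant" reading can be checked directly: predict for two inputs that differ by exactly one unit in a single feature, and the gap between the predictions equals that feature's coefficient. The sketch below uses synthetic study and sleep data (an assumption, chosen so it runs without downloading a dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative): score depends on hours studied and hours of sleep.
rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(1, 10, n)
sleep = rng.uniform(4, 9, n)
X = np.column_stack([hours, sleep])
y = 20 + 6.0 * hours + 2.5 * sleep + rng.normal(0, 3, n)

model = LinearRegression().fit(X, y)

# Two students with identical sleep; one studies exactly one more hour.
base = np.array([[5.0, 7.0]])
plus_one_hour = np.array([[6.0, 7.0]])
diff = model.predict(plus_one_hour)[0] - model.predict(base)[0]

print(f"coefficient for hours: {model.coef_[0]:.4f}")
print(f"prediction difference: {diff:.4f}")   # same value as the coefficient
```

This holds for any starting point, which is what makes linear models easy to explain. It describes the model's predictions only, not a causal effect in the real world.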

Interpreting R²

R² (R-squared) is the proportion of variance in the target that the model explains.

R² = 1.0: Perfect prediction — the model explains all variation
R² = 0.95: The model explains 95% of the variance; 5% is unexplained
R² = 0.0: The model does no better than simply predicting the mean every time
R² < 0.0: The model is worse than predicting the mean (this can happen on test data with a poorly fit model)
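These cases follow from the definition R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. A short sketch with made-up scores and four hypothetical sets of predictions reproduces each case:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up actual scores and four hypothetical sets of predictions.
y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
perfect = y_true.copy()
mean_only = np.full_like(y_true, y_true.mean())      # always predicts 70
decent = np.array([52.0, 58.0, 71.0, 79.0, 91.0])
bad = y_true[::-1].copy()                            # predicts the opposite trend

print(f"perfect:   {r_squared(y_true, perfect):.3f}")    # 1.000
print(f"mean only: {r_squared(y_true, mean_only):.3f}")  # 0.000
print(f"decent:    {r_squared(y_true, decent):.3f}")     # 0.989
print(f"bad:       {r_squared(y_true, bad):.3f}")        # -3.000
```

The last case shows why R² has no lower bound: once a model's squared error exceeds that of the mean predictor, the ratio passes 1 and R² goes negative.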

An R² of 0.93 for predicting exam scores from study hours is high, which is expected here because the scores were generated from a straight line plus noise. An R² of 0.51 for predicting California house prices from just four features is reasonable given that many important factors (location, neighborhood, school district) are omitted.

Assumptions of Linear Regression

Linear regression works well when these conditions hold:

Assumption	What It Means	How to Check
Linearity	Relationship between X and y is linear	Scatter plot, residual plot
Independence	Observations are independent of each other	Domain knowledge
Homoscedasticity	Residuals have constant variance	Residuals vs fitted values plot
Normality of residuals	Residuals are normally distributed	Q-Q plot
No multicollinearity	Features are not highly correlated with each other	VIF (Variance Inflation Factor)

When these assumptions are violated, linear regression still often produces useful approximate results — but you should consider transformations or a different model.
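Two of the checks in the table can be computed without any plotting. The sketch below is one possible way to do it, on synthetic data where one feature is deliberately built as a near-copy of another. The helper name vif is hypothetical, and it uses the standard definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R²_j)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)             # every feature except j
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic data (illustrative): x3 is almost a copy of x1.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 + 2 * x1 - x2 + rng.normal(0, 1, n)

vifs = vif(X)
print("VIF per feature:", np.round(vifs, 1))         # x1 and x3 come out large, x2 near 1

# Residuals are the raw material for the residual and Q-Q plots.
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
print(f"residual mean: {residuals.mean():.2e}, std: {residuals.std():.3f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. Plotting residuals against model.predict(X) would then cover the linearity and homoscedasticity rows of the table.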

When to Use Linear Regression

Linear regression is the right first choice when:

Your target variable is continuous (price, temperature, score)
You expect a roughly linear relationship between features and target
Interpretability matters — you need to explain why the model predicts what it does
You have limited data: with only a few parameters to fit, linear models are less prone to overfitting than complex models on small datasets
You need a fast baseline to compare other models against
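For the baseline point, a reasonable pattern is to score linear regression against an even simpler reference that always predicts the mean. The sketch below uses scikit-learn's DummyRegressor with 5-fold cross-validation; the synthetic data and the fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative): a small dataset with a roughly linear relationship.
rng = np.random.default_rng(7)
X = rng.uniform(1, 10, size=(40, 1))
y = 7.5 * X[:, 0] + 30 + rng.normal(0, 5, size=40)

# Reference point: always predict the training mean.
baseline_r2 = cross_val_score(
    DummyRegressor(strategy="mean"), X, y, cv=5, scoring="r2"
).mean()

# The fast, interpretable baseline that more complex models should have to beat.
linear_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

print(f"mean predictor R²:    {baseline_r2:.3f}")
print(f"linear regression R²: {linear_r2:.3f}")
```

Any more complex model can be dropped into the same cross_val_score call, which makes it easy to see whether the added complexity pays for itself.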

Key Takeaways

Linear regression finds the line (or hyperplane) that minimizes MSE between predictions and actual values
The slope tells you the change in output per unit change in input; the intercept is the baseline value
The NumPy closed-form solution and scikit-learn produce the same results, up to floating-point rounding
R² measures the proportion of variance explained — 0.93 means 93% of variation is captured by the model
Multiple linear regression extends naturally to any number of features
Always check the five core assumptions before trusting the model's predictions

