Linear Regression Deep Dive

Linear regression is one of the most foundational algorithms in machine learning. It's often dismissed as "too simple" by people who don't realize that its core ideas (weights, a loss function, gradient-based training) reappear in neural networks and many of the more advanced models you'll build. Understand it deeply, and most other algorithms become easier to grasp.

What Linear Regression Does

Linear regression finds the best-fit line through a set of data points to predict a continuous output. With more than one feature, the line becomes a plane or hyperplane, but the idea is the same.

The equation you probably remember from school: y = mx + b

In ML, we write it slightly differently:

y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

y  = prediction (what we want to know)
x  = features (what we know)
w  = weights (what the model learns)
b  = bias/intercept (baseline value)

The goal of training is to find the weights w and bias b that minimize the prediction error across all training examples.
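
As a minimal sketch of the equation above, the prediction for one example is the dot product of the weights and the features, plus the bias. The weight, bias, and feature values here are made up for illustration; they are not learned from any data.

```python
import numpy as np

# Hypothetical weights and bias, chosen by hand for illustration only
w = np.array([0.1, 20.0, -2.0])   # one weight per feature
b = 50.0                          # bias / intercept

# One example with three features: size (sqft), bedrooms, age (years)
x = np.array([1500, 3, 10])

# y = w1*x1 + w2*x2 + w3*x3 + b
y_hat = np.dot(w, x) + b
print(y_hat)

# The same formula applied to several examples at once (one row per example)
X = np.array([[1500, 3, 10],
              [2000, 4, 5]])
y_hat_all = X @ w + b
print(y_hat_all)
```

Training does not change this formula; it only searches for better values of w and b.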

Why "Best Fit"?

Given the same data, infinitely many lines could pass through or near the points. "Best fit" means the line that minimizes the Mean Squared Error (MSE):

MSE = (1/n) × Σ(yᵢ - ŷᵢ)²

yᵢ  = actual value
ŷᵢ  = predicted value
n   = number of samples

We square the errors for two reasons:

Positive and negative errors don't cancel out
Large errors get penalized more heavily
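
To see both effects in numbers, here is a small sketch with made-up actual and predicted values (illustrative only). The raw errors average to zero even though every prediction is off, and the two larger errors account for most of the squared error.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred            # -10, +10, -20, +20
squared = errors ** 2               # 100, 100, 400, 400

mean_error = errors.mean()          # signed errors cancel each other out
mse = squared.mean()                # squaring removes the sign

# Share of the total squared error that comes from the two errors of size 20
share_of_large = squared[2:].sum() / squared.sum()

print(f"Mean error: {mean_error:.1f}")
print(f"MSE: {mse:.1f}")
print(f"Share of squared error from the two largest errors: {share_of_large:.0%}")
```

The larger errors are only twice the size of the smaller ones, yet each contributes four times as much to the MSE.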

Implementing Linear Regression from Scratch

Before using scikit-learn, let's build it manually to understand what's happening:
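
The implementation in this section relies on gradient descent, which needs the partial derivatives of the MSE with respect to w and b. For a single feature they follow from the chain rule, as sketched here. The symbol α for the learning rate is a notational choice made for this sketch, not something defined in the text above.

```latex
\begin{aligned}
\hat{y}_i &= w x_i + b \\
\mathrm{MSE}(w, b) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial w} &= -\frac{2}{n} \sum_{i=1}^{n} x_i \,(y_i - \hat{y}_i) \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
w &\leftarrow w - \alpha \,\frac{\partial\,\mathrm{MSE}}{\partial w}, \qquad
b \leftarrow b - \alpha \,\frac{\partial\,\mathrm{MSE}}{\partial b}
\end{aligned}
```

The two derivative lines correspond to `dw` and `db` in the code, and the last line is the update step.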

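import numpy as np

# House size (sqft) → Price ($1000s)
X = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000], dtype=float)
y = np.array([150, 180, 210, 270, 300, 330, 360, 410, 450], dtype=float)

# Standardize the feature (mean 0, standard deviation 1). With raw square
# footage in the thousands, the gradient for w is so large that gradient
# descent overshoots and diverges unless the learning rate is tiny, and a
# tiny learning rate leaves b almost unchanged.
X_mean, X_std = X.mean(), X.std()
X_scaled = (X - X_mean) / X_std

# Gradient descent: how the model learns
def train_linear_regression(X, y, learning_rate=0.1, epochs=1000):
    w, b = 0.0, 0.0
    n = len(X)

    for epoch in range(epochs):
        # Predictions with the current parameters
        y_pred = w * X + b

        # Gradients of the MSE with respect to w and b
        dw = (-2/n) * np.dot(X, (y - y_pred))
        db = (-2/n) * np.sum(y - y_pred)

        # Step against the gradient
        w -= learning_rate * dw
        b -= learning_rate * db

        if epoch % 200 == 0:
            mse = np.mean((y - y_pred) ** 2)
            print(f"Epoch {epoch}: MSE = {mse:.2f}, w = {w:.4f}, b = {b:.2f}")

    return w, b

w_scaled, b_scaled = train_linear_regression(X_scaled, y)

# Convert the learned parameters back to the original units (sqft)
w = w_scaled / X_std
b = b_scaled - w * X_mean
print(f"\nFinal equation: Price = {w:.2f} × SqFt + {b:.2f}")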
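
As a sanity check on gradient descent, ordinary least squares also has an exact solution for a problem this small. The sketch below uses NumPy's `np.linalg.lstsq` on the same house data. It is an alternative way to obtain the best-fit line, not the method used above, and a gradient descent run that has converged should land close to these values.

```python
import numpy as np

# Same house data as above: size (sqft) -> price ($1000s)
X = np.array([1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800, 3000], dtype=float)
y = np.array([150, 180, 210, 270, 300, 330, 360, 410, 450], dtype=float)

# Design matrix with a column of ones so the intercept b is fitted too
A = np.column_stack([X, np.ones_like(X)])

# Solve min ||A @ [w, b] - y||^2 exactly
(w_exact, b_exact), *_ = np.linalg.lstsq(A, y, rcond=None)

mse_exact = np.mean((y - (w_exact * X + b_exact)) ** 2)
print(f"Exact solution: Price = {w_exact:.4f} × SqFt + {b_exact:.2f}")
print(f"MSE at the exact solution: {mse_exact:.2f}")
```

If gradient descent reports a noticeably higher MSE than this, it has not converged yet (or has diverged), which usually points to the learning rate or the feature scale.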

Using Scikit-Learn (The Real-World Way)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Multiple features: size, bedrooms, age
X = np.array([
    [1200, 2, 10], [1500, 3, 5],  [1800, 3, 8],
    [2200, 4, 3],  [2500, 4, 1],  [1000, 2, 20],
    [3000, 5, 2],  [1700, 3, 15], [2800, 4, 7],
])
y = np.array([180, 220, 275, 350, 390, 140, 480, 250, 420])

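# Hold out 25% of the rows for testing. With only 9 houses that is 3 rows,
# so the scores printed below are very noisy on this toy dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)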

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)):.1f}K")
print("\nCoefficients:")
features = ['Size (sqft)', 'Bedrooms', 'Age (years)']
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.2f}")
print(f"  Intercept: {model.intercept_:.2f}")

Understanding the Coefficients

The coefficient for each feature tells you: "For a 1-unit increase in this feature, with all other features held constant, how much does the prediction change?"

Example output (illustrative values; price is in $1000s):

Coefficients:
  Size (sqft): 0.12     → Each extra sqft adds $120 to price
  Bedrooms: 18.5        → Each bedroom adds $18,500
  Age (years): -4.2     → Each year of age reduces price by $4,200

This interpretability is one of linear regression's greatest strengths. You can explain every prediction.
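
To make "explain every prediction" concrete, the sketch below breaks one prediction into per-feature contributions and checks that they add up to what `model.predict` returns. It reuses the toy housing data from the scikit-learn example but, as a simplifying assumption, fits on all nine rows rather than a train split. The house being explained is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as the scikit-learn example: size (sqft), bedrooms, age (years)
X = np.array([
    [1200, 2, 10], [1500, 3, 5],  [1800, 3, 8],
    [2200, 4, 3],  [2500, 4, 1],  [1000, 2, 20],
    [3000, 5, 2],  [1700, 3, 15], [2800, 4, 7],
])
y = np.array([180, 220, 275, 350, 390, 140, 480, 250, 420])  # price in $1000s

model = LinearRegression().fit(X, y)

# A hypothetical house to explain (not part of the data above)
house = np.array([2000, 3, 10])

# Each feature's contribution is its coefficient times the feature value
contributions = model.coef_ * house
manual_prediction = contributions.sum() + model.intercept_
sklearn_prediction = model.predict(house.reshape(1, -1))[0]

features = ['Size (sqft)', 'Bedrooms', 'Age (years)']
for feat, contrib in zip(features, contributions):
    print(f"  {feat}: {contrib:+.1f}")
print(f"  Intercept: {model.intercept_:+.1f}")
print(f"Manual sum:      {manual_prediction:.1f}")
print(f"model.predict(): {sklearn_prediction:.1f}")
```

The two final numbers agree because a linear model's prediction is nothing more than this sum.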

When Linear Regression Works (and When It Doesn't)

Works well when:

The relationship between features and target is roughly linear
Features have relatively low correlation with each other (no multicollinearity)
Errors are roughly normally distributed with constant variance (this matters most for confidence intervals and significance tests)
You need an interpretable model
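
One quick, rough check for the multicollinearity condition above is to look at pairwise Pearson correlations between features. The sketch below does this with `np.corrcoef` on the toy housing data from the scikit-learn example, where size and bedroom count turn out to be strongly correlated. Pairwise correlation is only a first look; it will not catch a feature that is a combination of several others.

```python
import numpy as np

# Same toy data as the scikit-learn example: size (sqft), bedrooms, age (years)
X = np.array([
    [1200, 2, 10], [1500, 3, 5],  [1800, 3, 8],
    [2200, 4, 3],  [2500, 4, 1],  [1000, 2, 20],
    [3000, 5, 2],  [1700, 3, 15], [2800, 4, 7],
])
features = ['Size (sqft)', 'Bedrooms', 'Age (years)']

# rowvar=False: each column is a variable, each row is an observation
corr = np.corrcoef(X, rowvar=False)

# Print each pair of features once
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        print(f"{features[i]} vs {features[j]}: r = {corr[i, j]:.2f}")
```

When two features move together this closely, their individual coefficients become unstable and harder to interpret, even if the predictions themselves stay reasonable.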

Struggles when:

The true relationship is non-linear (e.g., exponential growth)
There are complex interactions between features
Outliers dominate the data (consider Huber regression instead)
You have thousands of features, especially more features than training examples (use regularization, covered in the next lesson)
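
The first failure mode is easy to reproduce. In the sketch below, the target depends on the input exactly (y = x²), yet a straight line explains essentially none of the variance. The data is synthetic and chosen to make the effect as stark as possible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration: y is fully determined by x, but not linearly
x = np.linspace(-3, 3, 50)
y = x ** 2

X = x.reshape(-1, 1)            # scikit-learn expects a 2D feature array
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

slope = model.coef_[0]
r2 = r2_score(y, y_pred)

print(f"Slope: {slope:.4f}")
print(f"R² of the straight-line fit: {r2:.3f}")
```

Because the parabola is symmetric around zero, the best straight line is flat at the mean of y, so the model does no better than the mean-only baseline.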

R² Score: Measuring Model Quality

R² (R-squared) tells you: "What fraction of the variance in the target does the model explain?"

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares)

R² = 1.0  → Perfect prediction
R² = 0.8  → Model explains 80% of variance
R² = 0.5  → Model explains 50% — mediocre
R² = 0.0  → No better than predicting the mean
R² < 0    → Worse than predicting the mean (something is wrong)
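
The formula can be checked by hand against scikit-learn's `r2_score`. The actual and predicted values below are made up for illustration. The last lines show why 0.0 is the baseline: always predicting the mean gives an R² of zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([180.0, 220.0, 275.0, 350.0, 390.0])
y_pred = np.array([190.0, 215.0, 260.0, 340.0, 400.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # Sum of Squared Residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {r2_manual:.3f}")
print(f"r2_score(): {r2_score(y_true, y_pred):.3f}")

# Baseline: always predict the mean of the actual values
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)
print(f"Mean-only baseline R²: {r2_baseline:.3f}")
```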

Key Takeaway

Linear regression is elegant because of its simplicity. Every prediction is a weighted sum of your inputs — completely transparent, fast to compute, and often surprisingly accurate.

More importantly, the concepts you just learned — weights, gradients, MSE, R² — appear everywhere in machine learning. You've just built the foundation you'll stand on for the rest of this course.