Overfitting & Regularization

Overfitting is one of the most common failure modes in machine learning. A model that overfits performs beautifully on training data but fails in production. Understanding it, and knowing how to prevent it, is fundamental to building ML systems that actually work.

What Is Overfitting?

A model overfits when it learns the training data too well — including its noise and random fluctuations — rather than the underlying patterns.

Underfitting:  Model is too simple, misses the pattern
               Training accuracy: 60%, Test accuracy: 58%
               
Good fit:      Model captures the real pattern
               Training accuracy: 90%, Test accuracy: 88%
               
Overfitting:   Model memorized training data including noise
               Training accuracy: 99%, Test accuracy: 65%

Imagine studying for an exam by memorizing all previous exams verbatim. You'll score 100% on practice tests but fail when new questions appear. That's overfitting.

Detecting Overfitting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Generate some data
np.random.seed(42)
X = np.linspace(0, 4, 100).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.2 * np.random.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Compare models of increasing complexity
for degree in [1, 3, 15]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    print(f"Degree {degree:2d}: Train R²={train_score:.3f}, Test R²={test_score:.3f}")

# Example output (illustrative; exact values depend on the split and library versions):
# Degree  1: Train R²=0.611, Test R²=0.574  ← Underfitting
# Degree  3: Train R²=0.882, Test R²=0.871  ← Good fit
# Degree 15: Train R²=0.987, Test R²=0.421  ← Overfitting

The gap between training and test performance is the overfitting signal: the larger the gap, the more the model has fit noise that is specific to the training set.
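
A single train/test split can be lucky or unlucky, so it can help to average the gap over several folds. The sketch below does this with scikit-learn's validation_curve; the fresh random seed, the fold count, and the variable names are assumptions chosen for illustration, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same kind of synthetic data as above: a noisy sine wave
rng = np.random.default_rng(0)
X = np.linspace(0, 4, 100).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.2 * rng.standard_normal(100)

degrees = [1, 3, 15]
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Shuffled folds, because X is sorted and contiguous folds would force extrapolation
cv = KFold(n_splits=5, shuffle=True, random_state=0)
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",  # step name generated by make_pipeline
    param_range=degrees,
    cv=cv,
)

# Average over folds, then report the train/validation gap per degree
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean
for d, tr, va, gap in zip(degrees, train_mean, val_mean, gaps):
    print(f"Degree {d:2d}: train R²={tr:.3f}, val R²={va:.3f}, gap={gap:.3f}")
```

How large the gap turns out for each degree depends on the noise level and the amount of data, so treat the printed numbers as a diagnostic rather than a fixed result.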

The Learning Curve: A Diagnostic Tool

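from sklearn.model_selection import KFold, learning_curve

def plot_learning_curve(model, X, y, title):
    # Shuffle the folds and the training subsets: the example data is sorted by X,
    # so contiguous folds would score the model on ranges it never saw
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        shuffle=True, random_state=42
    )
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), label='Cross-validation score')
    # Shaded bands show one standard deviation across folds
    plt.fill_between(train_sizes, 
                     train_scores.mean(axis=1) - train_scores.std(axis=1),
                     train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
    plt.fill_between(train_sizes, 
                     test_scores.mean(axis=1) - test_scores.std(axis=1),
                     test_scores.mean(axis=1) + test_scores.std(axis=1), alpha=0.1)
    plt.title(title)
    plt.xlabel('Training Set Size')
    plt.ylabel('Score')
    plt.legend()
    plt.show()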

Reading learning curves:

Overfitting: Large gap between training and CV scores; the CV score is often still rising, so more data may help
Underfitting: Both scores low and close together; they have already converged, so more data is unlikely to help
Good fit: Both scores high, gap is small
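
These rules of thumb can be written down as a small helper. This is only a sketch: the function name diagnose and the thresholds low_score and max_gap are hypothetical, chosen here for illustration, and reasonable cutoffs depend on the task and the metric.

```python
def diagnose(train_score, val_score, low_score=0.7, max_gap=0.1):
    """Label a fit from its training and validation scores.

    low_score and max_gap are illustrative thresholds, not standard values.
    """
    gap = train_score - val_score
    if gap > max_gap:
        return "overfitting"      # training score far above validation score
    if train_score < low_score:
        return "underfitting"     # both scores low and close together
    return "good fit"             # both scores high, small gap

# The three scenarios from the start of this lesson
for train, val in [(0.60, 0.58), (0.90, 0.88), (0.99, 0.65)]:
    print(f"train={train:.2f}, val={val:.2f} -> {diagnose(train, val)}")
```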

Regularization Techniques

Regularization adds a penalty for complexity, forcing the model to stay simpler.

L2 Regularization (Ridge Regression)

Adds the sum of squared weights to the loss function:

Loss = MSE + λ × Σ(wᵢ²)

Large weights are penalized → model uses smaller weights → smoother predictions.
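
To see the shrinking effect directly, here is a minimal NumPy sketch of the closed-form ridge solution on made-up data. It minimizes the sum of squared errors plus λ × Σ(wᵢ²), which is the scaling scikit-learn's Ridge uses; with a mean instead of a sum, λ is simply rescaled by the number of samples. The function name ridge_weights and the synthetic weights are assumptions for illustration, and the intercept is left out to keep the algebra short.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution w = (XᵀX + λI)⁻¹ Xᵀy, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known weights
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(50)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"λ={lam:6.1f}: ‖w‖={norms[-1]:.3f}, w={np.round(w, 2)}")
```

The norm of the weight vector falls as λ grows, which is the "smaller weights" effect described above.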

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

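# Alpha = λ, the regularization strength
# Higher alpha = more regularization = simpler model
# alpha=0 is plain least squares, kept here as an unregularized baseline;
# scikit-learn advises using LinearRegression for that case in practice

alphas = [0, 0.01, 0.1, 1.0, 10.0, 100.0]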

for alpha in alphas:
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=10)),
        ('ridge', Ridge(alpha=alpha))
    ])
    model.fit(X_train, y_train)
    
    train = model.score(X_train, y_train)
    test = model.score(X_test, y_test)
    print(f"α={alpha:6.2f}: Train={train:.3f}, Test={test:.3f}")

Use Ridge when you believe all features are relevant but want to prevent any single feature from dominating.

L1 Regularization (Lasso Regression)

Adds the sum of absolute weights to the loss function:

Loss = MSE + λ × Σ|wᵢ|

Key difference: Lasso can drive the weights of irrelevant features to exactly zero, so it performs automatic feature selection.

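from sklearn.linear_model import Lasso

# X_train has a single column, so expand it into polynomial features first;
# with only one feature there would be just one weight to inspect
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=10, include_bias=False)),
    ('lasso', Lasso(alpha=0.1, max_iter=10000))
])
lasso.fit(X_train, y_train)

# Typically many weights come out exactly 0
coef = lasso.named_steps['lasso'].coef_
print("Zero weights:", (coef == 0).sum())
print("Non-zero weights:", (coef != 0).sum())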

Use Lasso when you suspect many features are irrelevant — it automatically selects the important ones.
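
The contrast with Ridge is easiest to see on data where the relevant features are known in advance. The sketch below builds such a dataset by hand (20 features, only the first three used; all of these numbers are assumptions for illustration) and counts the weights that each model sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.standard_normal((n_samples, n_features))

# Only the first three features influence the target (synthetic assumption)
true_w = np.zeros(n_features)
true_w[:3] = [5.0, -3.0, 2.0]
y = X @ true_w + 0.1 * rng.standard_normal(n_samples)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks weights but leaves them non-zero; Lasso removes features outright
n_zero_ridge = int((ridge.coef_ == 0).sum())
n_zero_lasso = int((lasso.coef_ == 0).sum())
print("Ridge zero weights:", n_zero_ridge)
print("Lasso zero weights:", n_zero_lasso)
print("Features Lasso kept:", np.flatnonzero(lasso.coef_))
```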

Elastic Net (L1 + L2 Combined)

from sklearn.linear_model import ElasticNet

# l1_ratio: 0 = Ridge, 1 = Lasso, 0.5 = equal mix
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)

Elastic Net combines both penalties: it keeps some of Lasso's feature selection while staying more stable than Lasso when features are correlated.
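
One way to see the behaviour with correlated features is to feed both models two nearly identical columns. In this sketch (synthetic data; the noise levels and tolerances are assumptions) Lasso has little reason to prefer one copy over the other, so its split between them can be lopsided, while the L2 part of Elastic Net pulls the two weights toward each other.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(n)

# A tight tolerance, because coordinate descent converges slowly on correlated columns
lasso = Lasso(alpha=0.1, tol=1e-6, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-6, max_iter=100_000).fit(X, y)

print("Lasso weights:      ", np.round(lasso.coef_, 3))
print("Elastic Net weights:", np.round(enet.coef_, 3))
```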

Regularization for Neural Networks

Dropout

Randomly zeroes a fraction of the activations at each training step, which pushes the network to learn redundant representations instead of relying on any single neuron.

import torch.nn as nn

class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Dropout(p=0.3),    # Zero each activation with probability 0.3 during training
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(32, 1)
        )
    
    def forward(self, x):
        return self.net(x)

Dropout is only active in training mode: call model.eval() to disable it for inference, and model.train() to switch it back on before further training.
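
The mechanism itself fits in a few lines. The following is a NumPy sketch of "inverted" dropout, the variant nn.Dropout implements: surviving activations are scaled by 1/(1-p) during training so that their expected value matches inference. The function name and its arguments are hypothetical; this is not PyTorch's actual code.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p while training
    and scale the survivors by 1/(1-p); return x unchanged at inference."""
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p   # True where the activation survives
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

train_out = dropout(activations, p=0.3, training=True, rng=rng)
eval_out = dropout(activations, p=0.3, training=False, rng=rng)

print("Fraction zeroed in training:", np.mean(train_out == 0))
print("Mean activation in training:", train_out.mean())
print("Mean activation at inference:", eval_out.mean())
```

Because of the rescaling, the mean activation stays close to 1 in both modes even though roughly 30% of the values are zeroed during training.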

Early Stopping

Stop training when validation performance stops improving:

from sklearn.neural_network import MLPRegressor

model = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # 10% of training data for validation
    n_iter_no_change=10,       # Stop after 10 epochs with no improvement
    max_iter=1000
)
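
Under the hood, early stopping is a patience counter wrapped around the training loop. Below is a hand-rolled sketch using plain gradient descent on polynomial features; the data, learning rate, patience of 10, and the 1e-6 improvement threshold are all assumptions for illustration and do not reproduce MLPRegressor's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(60)
X = np.vander(x, 10, increasing=True)      # degree-9 polynomial features

# Hold out the last 20 points as a validation set
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

w = np.zeros(X.shape[1])
best_w, best_val, bad_epochs = w.copy(), np.inf, 0
patience, lr = 10, 0.05

for epoch in range(5000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of training MSE
    w -= lr * grad
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val - 1e-6:          # meaningful improvement: remember these weights
        best_w, best_val, bad_epochs = w.copy(), val_mse, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # no improvement for `patience` epochs
            break

print(f"Stopped after epoch {epoch}, best validation MSE = {best_val:.4f}")
```

Keeping best_w matters: the weights at the moment training stops are already past the best point, so the saved copy is the one to use.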

The Bias-Variance Tradeoff

The theoretical framework behind overfitting:

Total Error = Bias² + Variance + Irreducible Noise

Bias: Error from wrong assumptions (underfitting — model too simple)
Variance: Error from sensitivity to training data (overfitting — model too complex)

High bias → model can't capture the pattern → need more complexity
High variance → model is too sensitive to training data → need regularization
Goal: find the sweet spot with low bias AND low variance
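
With synthetic data the true function is known, so both terms can be estimated by refitting the same model on many fresh training sets. The sketch below does this for polynomial fits to a noisy sine wave; the sample sizes, degrees, and the function name bias_variance are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_test = np.linspace(0.2, 3.8, 50)
f_test = np.sin(x_test)          # true function, known only because the data is synthetic
noise_std, n_train, n_repeats = 0.2, 30, 200

def bias_variance(degree):
    """Estimate bias² and variance of a polynomial fit by refitting on fresh training sets."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        x = rng.uniform(0, 4, n_train)
        y = np.sin(x) + noise_std * rng.standard_normal(n_train)
        preds[i] = Polynomial.fit(x, y, degree)(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)   # average fit vs. truth
    variance = np.mean(preds.var(axis=0))                   # spread across training sets
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"Degree {d}: bias²={b:.4f}, variance={v:.4f}, noise={noise_std**2:.4f}")
```

The straight line carries most of its error as bias, while the degree-9 fit trades that bias for higher variance.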

The practical checklist when you see overfitting:

Get more training data (often the most effective fix, but not always feasible)
Reduce model complexity (fewer layers, lower degree polynomial)
Add regularization (L1, L2, dropout)
Use early stopping
Feature selection — remove noisy features

Next lesson: Cross-Validation Techniques — evaluating models reliably on limited data.