Overfitting & Regularization
Overfitting is one of the most common failure modes in machine learning. A model that overfits performs beautifully on training data but fails in production. Understanding it, and knowing how to prevent it, is fundamental to building ML systems that actually work.
What Is Overfitting?
A model overfits when it learns the training data too well — including its noise and random fluctuations — rather than the underlying patterns.
- Underfitting: Model is too simple, misses the pattern. Training accuracy: 60%, Test accuracy: 58%
- Good fit: Model captures the real pattern. Training accuracy: 90%, Test accuracy: 88%
- Overfitting: Model memorized training data including noise. Training accuracy: 99%, Test accuracy: 65%
Imagine studying for an exam by memorizing all previous exams verbatim. You'll score 100% on practice tests but fail when new questions appear. That's overfitting.
Detecting Overfitting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Generate some data
np.random.seed(42)
X = np.linspace(0, 4, 100).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.2 * np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Compare models of increasing complexity
for degree in [1, 3, 15]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"Degree {degree:2d}: Train R²={train_score:.3f}, Test R²={test_score:.3f}")
# Illustrative output (exact values depend on the random split and library version):
# Degree  1: Train R²=0.611, Test R²=0.574 ← Underfitting
# Degree  3: Train R²=0.882, Test R²=0.871 ← Good fit
# Degree 15: Train R²=0.987, Test R²=0.421 ← Overfitting
The gap between training and test performance is the overfitting signal: the larger the gap, the more the model has fit noise in the training set rather than the underlying pattern.
The Learning Curve: A Diagnostic Tool
from sklearn.model_selection import learning_curve
def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10)
    )
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), label='Cross-validation score')
    plt.fill_between(train_sizes,
                     train_scores.mean(axis=1) - train_scores.std(axis=1),
                     train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
    plt.title(title)
    plt.xlabel('Training Set Size')
    plt.ylabel('Score')
    plt.legend()
    plt.show()
Reading learning curves:
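- Overfitting: Large gap between training and CV scores; the training score stays high while the CV score lags well behind. More data usually narrows the gap
- Underfitting: Both scores low and close together; both plateau early, so more data helps little and the model needs more capacity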
- Good fit: Both scores high, gap is small
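The reading rules above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming synthetic sine data and a hypothetical helper named `summarize_learning_curve` (neither is part of the lesson's dataset or API); it prints the training/CV gap at each training size for a simple and a complex model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption for illustration): noisy sine on [0, 4]
rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(200, 1))
y = np.sin(X.ravel()) + 0.2 * rng.standard_normal(200)

def summarize_learning_curve(degree):
    """Return sizes, mean train score, mean CV score, and their gap."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    sizes, train_scores, cv_scores = learning_curve(
        model, X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=0),
        train_sizes=np.linspace(0.2, 1.0, 5),
    )
    train_mean = train_scores.mean(axis=1)
    cv_mean = cv_scores.mean(axis=1)
    return sizes, train_mean, cv_mean, train_mean - cv_mean

for degree in (1, 10):
    sizes, train_mean, cv_mean, gap = summarize_learning_curve(degree)
    print(f"degree={degree}")
    for n, tr, cv, g in zip(sizes, train_mean, cv_mean, gap):
        print(f"  n={n:3d}  train={tr:.3f}  cv={cv:.3f}  gap={g:.3f}")
```

A gap that shrinks as `n` grows points toward overfitting that more data can fix; two low scores that sit close together at every size point toward underfitting.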
Regularization Techniques
Regularization adds a penalty for complexity, forcing the model to stay simpler.
L2 Regularization (Ridge Regression)
Adds the sum of squared weights to the loss function:
Loss = MSE + λ × Σ(wᵢ²)
Large weights are penalized → model uses smaller weights → smoother predictions.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Alpha = λ — the regularization strength
# Higher alpha = more regularization = simpler model
alphas = [0, 0.01, 0.1, 1.0, 10.0, 100.0]
for alpha in alphas:
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=10)),
        ('ridge', Ridge(alpha=alpha))
    ])
    model.fit(X_train, y_train)
    train = model.score(X_train, y_train)
    test = model.score(X_test, y_test)
    print(f"α={alpha:6.2f}: Train={train:.3f}, Test={test:.3f}")
Use Ridge when you believe all features are relevant but want to prevent any single feature from dominating.
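To see the shrinkage effect directly, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. The helper `ridge_weights` and the data are assumptions made for illustration. It penalizes the sum of squared errors rather than the mean (the convention scikit-learn's Ridge also uses), and it omits the intercept for brevity.

```python
import numpy as np

# Synthetic data (assumption): 5 features, two of which have zero true weight
rng = np.random.default_rng(0)
n_samples, n_features = 50, 5
X = rng.standard_normal((n_samples, n_features))
true_w = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_w + 0.5 * rng.standard_normal(n_samples)

def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 via the normal equations."""
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + lam * identity, X.T @ y)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_weights(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:6.1f}  ||w||={norms[-1]:.3f}  w={np.round(w, 2)}")
```

The weight norm falls as lambda grows, but the individual weights shrink toward zero without landing exactly on it. That is the main practical contrast with Lasso.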
L1 Regularization (Lasso Regression)
Adds the sum of absolute weights to the loss function:
Loss = MSE + λ × Σ|wᵢ|
Key difference: the L1 penalty can drive the weights of uninformative features to exactly zero, so Lasso performs a form of automatic feature selection.
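from sklearn.linear_model import Lasso
# X_train has a single feature, so expand it into polynomial terms first;
# otherwise there is only one weight to inspect
lasso_model = Pipeline([
    ('poly', PolynomialFeatures(degree=10, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
lasso_model.fit(X_train, y_train)
coef = lasso_model.named_steps['lasso'].coef_
# Count the weights the L1 penalty has driven to exactly 0
print("Zero weights:", (coef == 0).sum())
print("Non-zero weights:", (coef != 0).sum())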
Use Lasso when you suspect many features are irrelevant — it automatically selects the important ones.
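The selection behaviour is easiest to see when the irrelevant features are known in advance. The sketch below assumes a synthetic dataset with 20 features, of which only the first three affect the target; the data and the choice of `alpha=0.2` are illustrative, not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.standard_normal((n_samples, n_features))

# Only the first 3 features influence the target; the other 17 are noise
true_w = np.zeros(n_features)
true_w[:3] = [4.0, -3.0, 2.0]
y = X @ true_w + 0.5 * rng.standard_normal(n_samples)

lasso = Lasso(alpha=0.2)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print("Selected feature indices:", selected)
print("Number of zeroed weights:", int((lasso.coef_ == 0).sum()))
print("Weights of true features:", np.round(lasso.coef_[:3], 2))
```

The surviving weights are typically a little smaller in magnitude than the true values, because the L1 penalty shrinks every weight it keeps as well as removing the ones it drops.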
Elastic Net (L1 + L2 Combined)
from sklearn.linear_model import ElasticNet
# l1_ratio: 0 = Ridge, 1 = Lasso, 0.5 = equal mix
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
Elastic Net combines both penalties: it still zeroes out some features, and it is more stable than Lasso when features are correlated.
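The behaviour with correlated features can be sketched on a small synthetic example. The data below is an assumption made for illustration: two nearly identical features carry the signal and three more are pure noise. Lasso often concentrates the weight on one of the twins, while Elastic Net tends to split it between them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n_samples = 200

# Two nearly identical features plus three pure-noise features (synthetic data)
z = rng.standard_normal(n_samples)
x1 = z
x2 = z + 0.05 * rng.standard_normal(n_samples)
noise_features = rng.standard_normal((n_samples, 3))
X = np.column_stack([x1, x2, noise_features])
y = 3.0 * z + 0.1 * rng.standard_normal(n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

Compare the first two entries of each coefficient vector: roughly equal values mean the model treats the correlated pair as a group, which makes the fit less sensitive to which of the two happens to look slightly better in a given sample.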
Regularization for Neural Networks
Dropout
Randomly "drops" neurons during training — forces the network to learn redundant representations.
import torch.nn as nn
class RegularizedNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Dropout(p=0.3),  # Drop 30% of neurons during training
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        return self.net(x)
Dropout is only active during training — call model.eval() to disable it for inference.
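What the training/inference switch does can be shown without a deep learning framework. The following is a NumPy sketch of "inverted" dropout, the scheme in which surviving activations are scaled by 1/(1 - p) during training so that nothing needs rescaling at inference. The function `dropout` here is a hypothetical stand-in written for illustration, not PyTorch's implementation.

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if not training or p == 0.0:
        return activations  # inference: identity, no scaling needed
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 64))  # stand-in for a batch of hidden activations

train_out = dropout(h, p=0.3, training=True, rng=rng)
eval_out = dropout(h, p=0.3, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0))
print("mean activation (train):", train_out.mean())
print("mean activation (eval): ", eval_out.mean())
```

The rescaling keeps the expected activation the same in both modes, which is why the network's outputs stay on a comparable scale once dropout is switched off.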
Early Stopping
Stop training when validation performance stops improving:
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # 10% of training data for validation
    n_iter_no_change=10,       # Stop after 10 epochs with no improvement
    max_iter=1000
)
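The logic behind those parameters is a patience counter wrapped around the training loop. Below is a minimal sketch, assuming plain gradient descent on a synthetic linear regression problem; the variable names and the absence of an improvement tolerance are simplifications, so this illustrates the idea rather than reproducing what MLPRegressor does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy dataset with many features relative to samples: easy to overfit
n_train, n_val, n_features = 40, 40, 30
X = rng.standard_normal((n_train + n_val, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 1.0 * rng.standard_normal(n_train + n_val)
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(n_features)
best_w, best_val, best_epoch = w.copy(), mse(X_val, y_val, w), 0
patience, epochs_without_improvement = 10, 0
learning_rate, max_epochs = 0.01, 5000

for epoch in range(1, max_epochs + 1):
    # One full-batch gradient step on the training loss
    grad = 2.0 / n_train * X_train.T @ (X_train @ w - y_train)
    w -= learning_rate * grad
    val_loss = mse(X_val, y_val, w)
    if val_loss < best_val:
        # New best: remember these weights and reset the counter
        best_w, best_val, best_epoch = w.copy(), val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break

print(f"stopped at epoch {epoch}, best epoch {best_epoch}")
print(f"best val MSE={best_val:.3f}, final val MSE={mse(X_val, y_val, w):.3f}")
```

Keeping `best_w` matters: the weights at the moment training stops are already `patience` epochs past the best point, so the saved copy is the one to use for prediction.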
The Bias-Variance Tradeoff
The theoretical framework behind overfitting:
Total Error = Bias² + Variance + Irreducible Noise
Bias: Error from wrong assumptions (underfitting — model too simple)
Variance: Error from sensitivity to training data (overfitting — model too complex)
- High bias → model can't capture the pattern → need more complexity
- High variance → model is too sensitive to training data → need regularization
- Goal: find the sweet spot with low bias AND low variance
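The decomposition can be estimated by simulation: refit the same model on many training sets drawn from the same process and look at how the predictions behave. The sketch below assumes a known true function (a sine), fixed inputs with fresh noise on each repeat, and polynomial fits of three degrees. All of these are illustrative choices, and `bias_variance` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 4, 30)     # fixed inputs; only the noise is redrawn
x_grid = np.linspace(0, 4, 50)      # points where predictions are compared
true_f = np.sin(x_grid)
noise_std, n_repeats = 0.2, 200

def bias_variance(degree):
    """Estimate bias^2 and variance of a polynomial fit by refitting on fresh noise."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_train = np.sin(x_train) + noise_std * rng.standard_normal(x_train.size)
        poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[r] = poly(x_grid)
    # Bias: how far the average prediction is from the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f) ** 2)
    # Variance: how much predictions move between training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 10)}
for degree, (bias_sq, variance) in results.items():
    total = bias_sq + variance + noise_std ** 2
    print(f"degree={degree:2d}  bias²={bias_sq:.4f}  variance={variance:.4f}  "
          f"expected error≈{total:.4f}")
```

Raising the degree trades bias for variance: the straight line is stable but systematically wrong, while the high-degree fit is right on average but moves more from one training set to the next.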
The practical checklist when you see overfitting:
- Get more training data (most effective, often not feasible)
- Reduce model complexity (fewer layers, lower degree polynomial)
- Add regularization (L1, L2, dropout)
- Use early stopping
- Feature selection — remove noisy features
Next lesson: Cross-Validation Techniques — evaluating models reliably on limited data.