Overfitting in Machine Learning: How to Detect and Fix It
Overfitting explained — how to detect it with learning curves, fix it with regularization, dropout, and cross-validation, and build ML models that generalize to new data.
The first model I built that achieved 99% accuracy was a disaster.
It was a classification model for a client's dataset. I reported the result proudly. When they applied it to new data in production, accuracy dropped to 64%. The model had memorized the training data rather than learning anything generalizable.
Overfitting is one of the most common problems in machine learning, and it causes some of the most expensive real-world failures. The reason: high training accuracy is easy to achieve. What matters is performance on data the model has never seen.
This guide gives you the conceptual framework for understanding overfitting, practical tools for detecting it early, and the full toolkit of remedies — from regularization to data augmentation.
The Core Intuition
Imagine trying to predict a stock's price. You have 10 years of daily prices — 2,500 data points.
Underfitting model: "The stock goes up 0.1% per year on average." Simple, but it misses all the real patterns — seasonal effects, momentum, market correlations.
Overfitting model: A 2,500-term polynomial that exactly fits every data point in the training set. It "explains" the training data perfectly but predicts complete nonsense for tomorrow's price.
Good model: One that captures real patterns (seasonal trends, momentum) without memorizing noise (random daily fluctuations). It makes useful predictions on days it hasn't seen.
The key insight: we don't care how well the model performs on data it was trained on. We care how it performs on data it hasn't seen.
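The three cases can be reproduced on a toy problem. The sketch below is an illustration rather than a stock-price model: it assumes a noisy sine wave as the "real pattern" and fits polynomials of degree 1, 3, and 14 to 15 training points. The data, degrees, and noise level are all assumptions chosen for the demonstration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_signal(x):
    # The underlying pattern the model should learn
    return np.sin(2 * np.pi * x)

# 15 noisy training points; a dense noise-free grid stands in for unseen data
x_train = np.linspace(0.0, 1.0, 15)
y_train = true_signal(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_signal(x_test)

train_mse, test_mse = {}, {}
for degree in (1, 3, 14):
    # Degree 14 has 15 coefficients, enough to pass through all 15 points
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((poly(x_test) - y_test) ** 2))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

Training error keeps falling as the degree grows, while error on the unseen grid is lowest for the middle model. The degree-14 fit reproduces the training noise and pays for it between the points.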
Detecting Overfitting
Learning Curves
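Learning curves are the most direct diagnostic. Plot training and validation performance as the training set grows (for iteratively trained models, the same plot over epochs is read the same way):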
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

def plot_learning_curves(model, X, y, cv=5):
    # Cross-validated scores at 10 training-set sizes, from 10% to 100% of the data
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=cv,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        n_jobs=-1
    )
    # Average over the CV folds (axis=1) for each training-set size
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training accuracy', color='blue')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, val_mean, label='Validation accuracy', color='orange')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15, color='orange')
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.title('Learning Curves')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# X, y: your feature matrix and label vector (not defined in this snippet)
model = RandomForestClassifier(n_estimators=100, random_state=42)
plot_learning_curves(model, X, y)
Reading the learning curves:
Overfitting:
    Training accuracy: 0.98
    Validation accuracy: 0.73
    Large gap = model is too complex

Underfitting:
    Training accuracy: 0.71
    Validation accuracy: 0.69
    Both low = model is too simple

Good fit:
    Training accuracy: 0.87
    Validation accuracy: 0.84
    Small gap, both acceptable
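The reading rules above can be written as a small helper. This is a minimal sketch: the function name `diagnose_fit` and both thresholds (`max_gap`, `min_acceptable`) are hypothetical and should be tuned to the problem, since what counts as "acceptable" accuracy depends on the task.

```python
def diagnose_fit(train_acc, val_acc, max_gap=0.05, min_acceptable=0.80):
    """Classify a train/validation accuracy pair using rough rules of thumb.

    max_gap and min_acceptable are illustrative defaults, not universal constants.
    """
    gap = train_acc - val_acc
    if gap > max_gap:
        return "overfitting"      # large gap: model is too complex
    if val_acc < min_acceptable:
        return "underfitting"     # small gap but both scores low
    return "good fit"             # small gap, both acceptable

# The three example readings from the text
for train_acc, val_acc in [(0.98, 0.73), (0.71, 0.69), (0.87, 0.84)]:
    print(train_acc, val_acc, "->", diagnose_fit(train_acc, val_acc))
```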
Validation Curve
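A validation curve shows how training and validation scores change as one hyperparameter varies. Note that for random forests, adding trees (n_estimators) rarely causes overfitting by itself; a capacity parameter such as max_depth usually shows the effect more clearly.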
from sklearn.model_selection import validation_curve

param_range = [1, 5, 10, 20, 50, 100, 200]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

plt.plot(param_range, train_scores.mean(axis=1), label='Train')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation')
plt.xlabel('n_estimators')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
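A validation curve over a capacity parameter tends to show the overfitting pattern most clearly. The sketch below is an assumption-laden illustration: it uses a synthetic dataset from `make_classification` with 10% label noise and sweeps `max_depth` on a single decision tree, printing the train/validation gap at each depth instead of plotting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y), so deep trees have noise to memorize
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean
for depth, tr, va, gap in zip(depths, train_mean, val_mean, gaps):
    print(f"max_depth={depth:2d}  train={tr:.3f}  val={va:.3f}  gap={gap:.3f}")
```

Training accuracy climbs toward 1.0 as depth grows while validation accuracy levels off or drops, so the gap widens with capacity.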
Remedies
1. Regularization
Regularization adds a penalty to the loss function for large weights, discouraging the model from fitting noise:
L2 Regularization (Ridge / Weight Decay):
# scikit-learn: LogisticRegression uses C = 1/lambda, so smaller C = more regularization
from sklearn.linear_model import LogisticRegression, Ridge
lr_model = LogisticRegression(C=0.1, random_state=42)  # Stronger regularization than the default C=1.0
ridge = Ridge(alpha=1.0)  # alpha = lambda (higher = more regularization)

# PyTorch: weight_decay parameter in the optimizer
# (here `model` is a torch.nn.Module, not the scikit-learn model above)
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
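To see what the L2 penalty does to the weights, here is a minimal NumPy sketch of ridge regression in closed form, w = (XᵀX + λI)⁻¹Xᵀy, without an intercept. The data is synthetic and the helper name `ridge_weights` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 10))
true_w = rng.normal(size=10)
y_demo = X_demo @ true_w + rng.normal(0.0, 0.5, size=50)

def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_weights(X_demo, y_demo, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:6.1f}  ||w||={norm:.3f}")
```

The weight norm shrinks as lambda grows: a stronger penalty pulls every coefficient toward zero, which limits how closely the model can chase noise.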
L1 Regularization (Lasso):
from sklearn.linear_model import Lasso, LogisticRegression
lasso = Lasso(alpha=0.01) # Higher alpha = more regularization, more sparsity
lr_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
Elastic Net (combines L1 and L2):
from sklearn.linear_model import ElasticNet
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5) # l1_ratio: balance between L1 and L2
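The practical difference between the two penalties is sparsity. The sketch below assumes a synthetic regression problem where only 5 of 20 features carry signal, then counts how many coefficients each model sets to exactly zero. The dataset and the alpha values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

lasso = Lasso(alpha=2.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=2.0).fit(X_demo, y_demo)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Lasso: {n_zero_lasso} of 20 coefficients are exactly zero")
print(f"Ridge: {n_zero_ridge} of 20 coefficients are exactly zero")
```

Lasso drops some features entirely, while Ridge keeps all of them with smaller weights.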
2. Dropout (Neural Networks)
Dropout randomly disables neurons during training, preventing co-adaptation:
import torch.nn as nn

class RegularizedNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # Randomly zero 50% of neurons during training
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(p=0.3),  # Less aggressive in later layers
            nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        return self.network(x)
Key: Dropout is only active during training. model.train() enables it; model.eval() disables it (all neurons active, predictions are deterministic).
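The train/eval difference is easier to see in a few lines of NumPy. This is a sketch of "inverted" dropout, the variant nn.Dropout implements: surviving activations are scaled by 1/(1-p) during training so the expected output matches evaluation mode. The function name and signature here are illustrative, not PyTorch's API.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        return x                          # eval mode: identity, fully deterministic
    keep_mask = rng.random(x.shape) >= p  # True for elements that survive
    return x * keep_mask / (1.0 - p)      # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("fraction zeroed in training:", (train_out == 0).mean())
print("mean activation in training:", train_out.mean())
print("eval output unchanged:", np.array_equal(eval_out, activations))
```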
3. Early Stopping
Stop training when validation performance stops improving:
class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience      # Epochs to wait after the last improvement
        self.min_delta = min_delta    # Minimum decrease that counts as an improvement
        self.counter = 0
        self.best_loss = float('inf')

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return False
        else:
            self.counter += 1
            return self.counter >= self.patience

# Usage in training loop
# (train_one_epoch and evaluate are placeholders for your own training and validation code)
early_stopper = EarlyStopping(patience=15)
for epoch in range(200):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    if early_stopper.should_stop(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
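One detail the loop above leaves out: when training stops, the model holds the weights from the last epoch, not the best one. A common extension is to snapshot the weights at each improvement and restore them at the end. The sketch below simulates that with a hard-coded list of validation losses; the `state` dictionary is a stand-in for something like `model.state_dict()`, and the function name is hypothetical.

```python
import copy

def train_with_early_stopping(val_losses, patience=3, min_delta=1e-4):
    """Simulate a run from per-epoch validation losses.

    Returns (epoch at which training stopped, best epoch, snapshot from best epoch).
    """
    best_loss = float("inf")
    best_epoch = None
    best_state = None
    epochs_without_improvement = 0

    for epoch, val_loss in enumerate(val_losses):
        state = {"epoch": epoch}  # stand-in for the model's weights at this epoch
        if val_loss < best_loss - min_delta:
            best_loss, best_epoch = val_loss, epoch
            best_state = copy.deepcopy(state)  # snapshot the best weights so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch, best_state
    return len(val_losses) - 1, best_epoch, best_state

simulated_losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.67, 0.70, 0.72, 0.75]
stopped_at, best_epoch, best_state = train_with_early_stopping(simulated_losses)
print(f"stopped at epoch {stopped_at}, restoring weights from epoch {best_epoch}")
```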
4. Data Augmentation
Augmentation increases effective dataset size by creating modified versions of training examples. Apply it to the training split only, so that validation scores still reflect unmodified data:
# Image augmentation with torchvision
from torchvision import transforms

aggressive_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Tabular data augmentation: add Gaussian noise
import numpy as np

def augment_tabular(X, noise_scale=0.05):
    noise = np.random.normal(0, noise_scale * X.std(axis=0), X.shape)
    return X + noise
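For tabular data, the noisy copies are usually stacked onto the original training rows, with labels repeated to match. The sketch below shows one way to do that; `augment_training_set`, `n_copies`, and the demo arrays are assumptions for illustration. Validation and test data are left untouched.

```python
import numpy as np

def augment_training_set(X_train, y_train, n_copies=2, noise_scale=0.05, seed=0):
    """Return the original rows plus n_copies noisy versions, labels repeated."""
    rng = np.random.default_rng(seed)
    feature_std = X_train.std(axis=0)  # per-feature scale for the noise
    X_parts, y_parts = [X_train], [y_train]
    for _ in range(n_copies):
        noise = rng.normal(0.0, 1.0, size=X_train.shape) * (noise_scale * feature_std)
        X_parts.append(X_train + noise)
        y_parts.append(y_train)        # noise does not change the label
    return np.vstack(X_parts), np.concatenate(y_parts)

X_train_demo = np.random.default_rng(1).normal(size=(100, 4))
y_train_demo = np.arange(100) % 2
X_aug, y_aug = augment_training_set(X_train_demo, y_train_demo)
print(X_aug.shape, y_aug.shape)
```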
5. Cross-Validation
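Cross-validation does not reduce overfitting by itself, but it measures the train/validation gap far more reliably than a single train/test split. Use k-fold cross-validation: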
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier()

scores = cross_validate(
    model, X, y,
    cv=cv,
    scoring=['accuracy', 'f1'],  # 'f1' assumes binary labels; use e.g. 'f1_macro' for multiclass
    return_train_score=True
)

# cross_validate reports validation-fold results under the 'test_' prefix
print(f"Train Accuracy: {scores['train_accuracy'].mean():.4f} ± {scores['train_accuracy'].std():.4f}")
print(f"Val Accuracy: {scores['test_accuracy'].mean():.4f} ± {scores['test_accuracy'].std():.4f}")
print(f"Gap: {(scores['train_accuracy'].mean() - scores['test_accuracy'].mean()):.4f}")
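To make the mechanics concrete, here is a minimal sketch of how k-fold splitting works, written with NumPy only. It is plain (unstratified) k-fold, so unlike StratifiedKFold it does not preserve class proportions; the function name `k_fold_indices` is made up for this example.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)  # fold sizes differ by at most 1
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=23, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```

Every sample is used for validation once and for training k-1 times, which is why the averaged score is steadier than one split.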
6. Reduce Model Complexity
Simpler models are less prone to overfitting:
# Random Forest: control tree depth and minimum samples
rf_constrained = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,            # Limit tree depth
    min_samples_leaf=10,    # Each leaf needs 10+ samples
    min_samples_split=20,   # Splits need 20+ samples
    random_state=42
)

# Neural Network: use fewer layers and neurons
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 32),  # Fewer neurons
            nn.ReLU(),
            nn.Linear(32, 2)    # Only one hidden layer: less capacity to overfit
        )

    def forward(self, x):
        return self.network(x)
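A quick way to compare capacity is to count trainable parameters. The sketch below does this for fully connected networks given only their layer sizes; the helper name is hypothetical, and the larger architecture (hidden layers of 128 and 64 units) is just an example for contrast.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for a fully connected network, e.g. [20, 32, 2]."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector per layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

small = count_mlp_parameters([20, 32, 2])        # the SimpleNet layout
large = count_mlp_parameters([20, 128, 64, 2])   # a wider, deeper alternative
print(f"small network: {small} parameters")
print(f"large network: {large} parameters")
```

With 738 parameters against 11,074, the smaller network has roughly 15 times less room to memorize individual training examples.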
7. Batch Normalization
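Batch normalization was originally motivated as a way to reduce internal covariate shift, an explanation that later work has questioned. In practice it stabilizes training and can act as a mild regularizer: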
class BNRegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 128),
            nn.BatchNorm1d(128),  # Normalize layer outputs
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x):
        return self.network(x)
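What a batch normalization layer computes in training mode fits in a few lines. This NumPy sketch normalizes each feature with the statistics of the current batch and then applies a learnable scale (`gamma`) and shift (`beta`). It omits the running averages that a real layer keeps for evaluation mode, and the function name is illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over a (batch, features) array."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature (biased) batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learnable rescale and shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
out = batch_norm_train(batch, gamma=np.ones(8), beta=np.zeros(8))

print("input mean/std: ", batch.mean().round(2), batch.std().round(2))
print("output mean/std:", out.mean().round(2), out.std().round(2))
```

Because the statistics come from whichever examples share a batch, each example's normalized value shifts slightly between batches. That noise is the usual explanation for the mild regularizing effect.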
The Bias-Variance Tradeoff Visualized
              High Bias (Underfitting)   Sweet Spot                 High Variance (Overfitting)
Train:        0.65                       0.88                       0.98
Val:          0.64                       0.85                       0.72
Gap:          0.01                       0.03                       0.26
Complexity:   ← Too simple               Just right                 Too complex →
Model:        Linear Regression          Random Forest with         Deep Neural Net with
              (few features)             moderate regularization    no regularization
Practical Checklist
When you suspect overfitting:
□ Check the training vs. validation accuracy gap (rough rules of thumb, in percentage points)
    - Gap < 5%: acceptable
    - Gap 5-15%: mild overfitting, work through the steps below
    - Gap > 15%: significant overfitting, needs attention
□ Start with: more data (or data augmentation)
□ Add regularization (weight_decay, dropout)
□ Try early stopping
□ Reduce model complexity (fewer layers, shallower trees)
□ Check for data leakage (future data in training features?)
□ Use cross-validation for more reliable evaluation
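One leakage source that is easy to rule out is preprocessing fitted on the full dataset before splitting. This is a different leak from the "future data in features" case in the checklist, but it inflates validation scores in the same way. A minimal sketch, assuming synthetic data and a simple scaler-plus-classifier setup: wrapping the steps in a scikit-learn pipeline makes cross-validation refit the scaler on each training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fitted inside each CV training fold, never on validation rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Leak-free CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```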
Conclusion
Overfitting is not a failure — it's information. A model that overfits tells you it has enough capacity to learn; it just needs constraints or more data to generalize correctly. The learning curve is your best diagnostic tool: plot it for every model.
The remedies are a progression: start with regularization (easy to add), try dropout for neural nets, and reach for more data or reduced complexity when regularization isn't enough.
The goal is never the best training accuracy. The goal is the best performance on data the model hasn't seen.
For the broader ML workflow context, see our scikit-learn tutorial and machine learning beginners guide.
Frequently Asked Questions
What is overfitting in machine learning?
Overfitting is when a model learns the training data too specifically, including noise and random variations, rather than the underlying pattern. The result is high training accuracy but poor performance on new data. It is one of the most common ML failure modes and the reason held-out test evaluation is essential.
What is the bias-variance tradeoff?
The fundamental tension between two error sources: bias (error from wrong simplifying assumptions — underfitting) and variance (error from sensitivity to training data fluctuations — overfitting). Simple models: high bias, low variance. Complex models: low bias, high variance. The goal is minimizing total error.
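For squared-error regression the tradeoff can be written out exactly. Under the assumptions that the data follow y = f(x) + ε with zero-mean noise of variance σ², and that the expectation is taken over training sets (and noise), the expected prediction error at a point x splits into three terms. The symbols here are introduced for this illustration, and the decomposition does not carry over to classification accuracy in this simple form.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the first term and raises the second; the noise term is a floor that no model can go below.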
What is the difference between overfitting and underfitting?
Overfitting: high training accuracy, much lower validation accuracy (large gap). Underfitting: both training and validation accuracy are low. Quick diagnostic: large train/val gap = overfitting; both low = underfitting; both similar and acceptable = good fit.
Does getting more training data help with overfitting?
Yes — more data is one of the most effective overfitting remedies. With more examples, the model can't memorize them all and must generalize. Data augmentation creates modified versions of existing data when collection is expensive. More data doesn't help if data quality is poor or the model architecture is fundamentally wrong.
What is L1 vs L2 regularization?
L1 (Lasso): drives some weights to exactly zero — useful for feature selection. L2 (Ridge/weight decay): shrinks all weights but rarely to zero — standard general regularizer. Elastic Net combines both. For neural networks, L2 via weight_decay is the standard; for linear models, both have specific uses.
AiTechWorlds Team