
Overfitting in Machine Learning: How to Detect and Fix It

Overfitting explained — how to detect it with learning curves, fix it with regularization, dropout, and cross-validation, and build ML models that generalize to new data.

AiTechWorlds Team
May 27, 2026 8 min read


The first model I built that achieved 99% accuracy was a disaster.

It was a classification model for a client's dataset. I reported the result proudly. When they applied it to new data in production, accuracy dropped to 64%. The model had memorized the training data rather than learning anything generalizable.

Overfitting is one of the most common problems in machine learning, and it causes some of the most expensive real-world failures. The reason: high training accuracy is easy to achieve. What matters is performance on data the model has never seen.

This guide gives you the conceptual framework for understanding overfitting, practical tools for detecting it early, and the full toolkit of remedies — from regularization to data augmentation.


The Core Intuition

Imagine trying to predict a stock's price. You have 10 years of daily prices — 2,500 data points.

Underfitting model: "The stock goes up 0.1% per year on average." Simple, but it misses all the real patterns — seasonal effects, momentum, market correlations.

Overfitting model: A 2,500-term polynomial that exactly fits every data point in the training set. It "explains" the training data perfectly but predicts complete nonsense for tomorrow's price.

Good model: One that captures real patterns (seasonal trends, momentum) without memorizing noise (random daily fluctuations). It makes useful predictions on days it hasn't seen.

The key insight: we don't care how well the model performs on data it was trained on. We care how it performs on data it hasn't seen.
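To make the intuition concrete, here is a minimal sketch that fits polynomials of increasing degree to a small noisy sample. The sine-wave ground truth, the noise level, and the chosen degrees are all illustrative assumptions, not part of the stock example above. Training error can only go down as the degree grows, and the number to watch is the error on the held-out sample.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_samples(n):
    # Hypothetical ground truth: one sine period plus Gaussian noise.
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = noisy_samples(20)    # small training sample
x_test, y_test = noisy_samples(200)     # data the model never sees while fitting

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Polynomial.fit rescales x internally, which keeps higher degrees stable.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = {
        "train": mse(model, x_train, y_train),
        "test": mse(model, x_test, y_test),
    }
    print(f"degree={degree:2d}  "
          f"train MSE={results[degree]['train']:.4f}  "
          f"test MSE={results[degree]['test']:.4f}")
```

Compare the two columns: the degree with the lowest training error is usually not the one with the lowest test error.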


Detecting Overfitting

Learning Curves

The most useful diagnostic tool. Plot training and validation performance as the training set grows (for neural networks, the analogous plot tracks both metrics per epoch):

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def plot_learning_curves(model, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, 
        cv=cv,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training accuracy', color='blue')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, val_mean, label='Validation accuracy', color='orange')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15, color='orange')
    
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.title('Learning Curves')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# X, y: feature matrix and label vector, assumed to be loaded already
model = RandomForestClassifier(n_estimators=100, random_state=42)
plot_learning_curves(model, X, y)

Reading the learning curves:

Overfitting:
  Training accuracy: 0.98
  Validation accuracy: 0.73
  Large gap = model is too complex

Underfitting:
  Training accuracy: 0.71
  Validation accuracy: 0.69
  Both low = model is too simple

Good fit:
  Training accuracy: 0.87
  Validation accuracy: 0.84
  Small gap, both acceptable
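The three patterns above can be turned into a rough triage helper. This is only a sketch: the function name `diagnose_fit` and both thresholds (`max_gap`, `min_acceptable`) are hypothetical, and a sensible "acceptable" accuracy depends entirely on the task.

```python
def diagnose_fit(train_acc, val_acc, max_gap=0.05, min_acceptable=0.80):
    """Label a (train, validation) accuracy pair.

    max_gap and min_acceptable are illustrative defaults, not universal rules.
    """
    gap = train_acc - val_acc
    if gap > max_gap:
        return "overfitting"    # large train/validation gap
    if train_acc < min_acceptable:
        return "underfitting"   # small gap, but both scores are low
    return "good fit"           # small gap, both scores acceptable

# The three example readings from above
for train_acc, val_acc in [(0.98, 0.73), (0.71, 0.69), (0.87, 0.84)]:
    print(train_acc, val_acc, diagnose_fit(train_acc, val_acc))
```

Treat the label as a prompt to look at the curves, not as a verdict.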

Validation Curve

Shows how model performance changes with a key hyperparameter:

from sklearn.model_selection import validation_curve

param_range = [1, 5, 10, 20, 50, 100, 200]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5,
    scoring='accuracy'
)

plt.plot(param_range, train_scores.mean(axis=1), label='Train')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation')
plt.xlabel('n_estimators')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
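A validation curve shows overfitting most clearly when the hyperparameter controls model capacity. As a hedged illustration, the sketch below sweeps `max_depth` on a single decision tree using synthetic data from `make_classification`. The dataset, the depth grid, and the `_demo` names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so deep trees have noise to memorize
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

depths = [1, 2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Pick the depth with the best mean validation score, not the best training score
best_depth = depths[int(np.argmax(val_mean))]

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:2d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
print("Best max_depth by validation accuracy:", best_depth)
```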

Remedies

1. Regularization

Regularization adds a penalty to the loss function for large weights, discouraging the model from fitting noise:

L2 Regularization (Ridge / Weight Decay):

# scikit-learn: LogisticRegression uses C = 1/lambda, so smaller C = more regularization
from sklearn.linear_model import LogisticRegression, Ridge

lr_model = LogisticRegression(C=0.1, random_state=42)   # Stronger regularization
ridge = Ridge(alpha=1.0)  # alpha = lambda (higher = more regularization)

# PyTorch: weight_decay parameter in optimizer (model is an existing nn.Module)
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

L1 Regularization (Lasso):

from sklearn.linear_model import Lasso, LogisticRegression

lasso = Lasso(alpha=0.01)  # Higher alpha = more regularization, more sparsity
lr_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')

Elastic Net (combines L1 and L2):

from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio: balance between L1 and L2
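To see the practical difference between the two penalties, the following sketch fits Ridge and Lasso to the same synthetic regression problem and counts coefficients that are exactly zero. The dataset (30 features, 5 of them informative) and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: only 5 of the 30 features actually influence y
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# L2 shrinks coefficients toward zero; L1 can set them to exactly zero
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(f"Ridge: {n_zero_ridge} of {ridge.coef_.size} coefficients are exactly zero")
print(f"Lasso: {n_zero_lasso} of {lasso.coef_.size} coefficients are exactly zero")
```

The zeroed coefficients are why Lasso doubles as a feature-selection tool.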

2. Dropout (Neural Networks)

Dropout randomly disables neurons during training, preventing co-adaptation:

import torch.nn as nn

class RegularizedNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # Randomly zero 50% of neurons during training
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(p=0.3),  # Less aggressive in later layers
            nn.Linear(hidden_size, output_size)
        )
    
    def forward(self, x):
        return self.network(x)

Key: Dropout is only active during training. model.train() enables it; model.eval() disables it (all neurons active, predictions are deterministic).
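The train/eval difference is easier to see in a hand-written version. Below is a NumPy sketch of "inverted" dropout, the same scheme `nn.Dropout` uses: surviving activations are scaled by 1/(1-p) during training so the expected activation stays unchanged, and evaluation becomes a plain pass-through. The function name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return activations                       # eval mode: nothing is dropped
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob        # rescale so the mean is preserved

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                         # stand-in for a layer's activations

train_out = dropout_forward(h, p=0.5, training=True, rng=rng)
eval_out = dropout_forward(h, p=0.5, training=False, rng=rng)

print("train mode values:", np.unique(train_out))   # zeros and rescaled survivors
print("train mode mean:  ", train_out.mean())       # close to the original mean
print("eval unchanged:   ", np.array_equal(eval_out, h))
```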

3. Early Stopping

Stop training when validation performance stops improving:

class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')
    
    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return False
        else:
            self.counter += 1
            return self.counter >= self.patience

# Usage in training loop
early_stopper = EarlyStopping(patience=15)

for epoch in range(200):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    
    if early_stopper.should_stop(val_loss):
        print(f"Early stopping at epoch {epoch}")
        break
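One detail the loop above leaves open is which epoch's weights to keep, since the model at the stopping point is already `patience` epochs past its best. The sketch below uses a hypothetical `BestEpochTracker` and a made-up list of validation losses to show the bookkeeping. In a real PyTorch loop you would typically also store a copy of `model.state_dict()` whenever the loss improves and reload it after stopping.

```python
class BestEpochTracker:
    """Early stopping that also remembers which epoch was best."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch     # this is where a checkpoint would be saved
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience

# Simulated validation losses: improve, then drift upward as overfitting sets in
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]

tracker = BestEpochTracker(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if tracker.update(epoch, loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best epoch was {tracker.best_epoch} "
      f"(val_loss={tracker.best_loss:.2f})")
```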

4. Data Augmentation

Augmentation increases effective dataset size by creating modified versions of training examples:

# Image augmentation with torchvision
from torchvision import transforms

aggressive_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Tabular data augmentation: add Gaussian noise
import numpy as np

def augment_tabular(X, noise_scale=0.05):
    noise = np.random.normal(0, noise_scale * X.std(axis=0), X.shape)
    return X + noise
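A point worth making explicit: augmentation should be applied to the training split only, so the validation score still measures performance on real, unmodified data. The sketch below assumes a random toy dataset and a seeded variant of `augment_tabular` that takes an `rng` argument. Both are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def augment_tabular(X, noise_scale=0.05, rng=None):
    # Per-feature Gaussian noise, scaled by each feature's standard deviation
    rng = np.random.default_rng() if rng is None else rng
    return X + rng.normal(0.0, noise_scale * X.std(axis=0), size=X.shape)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))              # toy features
y_demo = (X_demo[:, 0] > 0).astype(int)         # toy labels

# Split first, then augment only the training portion
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

X_train_aug = np.vstack([X_train, augment_tabular(X_train, rng=rng)])
y_train_aug = np.concatenate([y_train, y_train])    # noisy copies keep their labels

print("original train:", X_train.shape, "augmented train:", X_train_aug.shape)
print("validation (untouched):", X_val.shape)
```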

5. Cross-Validation

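Cross-validation does not reduce overfitting by itself. It gives a more reliable estimate of the train/validation gap than a single train/test split, so you can tell whether the other remedies are working: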

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier()

scores = cross_validate(
    model, X, y, 
    cv=cv, 
    scoring=['accuracy', 'f1'],  # 'f1' assumes binary labels; use 'f1_macro' for multiclass
    return_train_score=True
)

print(f"Train Accuracy: {scores['train_accuracy'].mean():.4f} ± {scores['train_accuracy'].std():.4f}")
print(f"Val Accuracy:   {scores['test_accuracy'].mean():.4f} ± {scores['test_accuracy'].std():.4f}")
print(f"Gap: {(scores['train_accuracy'].mean() - scores['test_accuracy'].mean()):.4f}")

6. Reduce Model Complexity

Simpler models are less prone to overfitting:

# Random Forest: control tree depth and minimum samples
rf_constrained = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # Limit tree depth
    min_samples_leaf=10,  # Each leaf needs 10+ samples
    min_samples_split=20, # Splits need 20+ samples
    random_state=42
)

# Neural Network: use fewer layers and neurons
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 32),    # Fewer neurons
            nn.ReLU(),
            nn.Linear(32, 2)      # Only one small hidden layer
        )

    def forward(self, x):
        return self.network(x)
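To check what the constraints buy you, compare the train/validation gap with and without them. This is a sketch on synthetic data. The dataset and the helper name `train_val_accuracy` are assumptions, and the constrained settings mirror the ones shown above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with label noise so there is something to memorize
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

def train_val_accuracy(model):
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)

unconstrained = RandomForestClassifier(n_estimators=100, random_state=42)
constrained = RandomForestClassifier(
    n_estimators=100, max_depth=5, min_samples_leaf=10,
    min_samples_split=20, random_state=42,
)

free_train, free_val = train_val_accuracy(unconstrained)
small_train, small_val = train_val_accuracy(constrained)

print(f"Unconstrained: train={free_train:.3f} val={free_val:.3f} gap={free_train - free_val:.3f}")
print(f"Constrained:   train={small_train:.3f} val={small_val:.3f} gap={small_train - small_val:.3f}")
```

A smaller gap is only an improvement if validation accuracy holds up, so read both numbers together.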

7. Batch Normalization

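Batch normalization was originally motivated as a way to reduce internal covariate shift (the exact mechanism is still debated). In practice it stabilizes training and acts as a mild regularizer: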

class BNRegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 128),
            nn.BatchNorm1d(128),   # Normalize layer outputs
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x):
        return self.network(x)

The Bias-Variance Tradeoff Visualized

                 High Bias (Underfitting)   Sweet Spot                High Variance (Overfitting)

Train:           0.65                       0.88                      0.98
Val:             0.64                       0.85                      0.72
Gap:             0.01                       0.03                      0.26

Complexity:      ← Too simple               Just right                Too complex →

Model:           Linear Regression          Random Forest with        Deep Neural Net with
                 (few features)             moderate regularization   no regularization

Practical Checklist

When you suspect overfitting:

□ Check training vs. validation accuracy gap (rough rules of thumb, in percentage points)
  - Gap < 5: acceptable
  - Gap 5-15: mild overfitting, work through the steps below
  - Gap > 15: significant overfitting, needs attention

□ Start with: more data (or data augmentation)
□ Add regularization (weight_decay, dropout)
□ Try early stopping
□ Reduce model complexity (fewer layers, shallower trees)
□ Check for data leakage (future data in training features?)
□ Use cross-validation for more reliable evaluation
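One common and easy-to-miss form of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before cross-validating. A hedged sketch of the usual safeguard: wrap preprocessing and model in a scikit-learn pipeline so each fold fits the scaler on its own training portion only. The synthetic dataset and the choice of `StandardScaler` with `LogisticRegression` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is re-fit inside every fold, so validation rows never influence it
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.4f}  Std: {scores.std():.4f}")
```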

Conclusion

Overfitting is not a failure — it's information. A model that overfits tells you it has enough capacity to learn; it just needs constraints or more data to generalize correctly. The learning curve is your best diagnostic tool: plot it for every model.

The remedies are a progression: start with regularization (easy to add), try dropout for neural nets, and reach for more data or reduced complexity when regularization isn't enough.

The goal is never the best training accuracy. The goal is the best performance on data the model hasn't seen.

For the broader ML workflow context, see our scikit-learn tutorial and machine learning beginners guide.


Frequently Asked Questions

What is overfitting in machine learning?

Overfitting is when a model learns training data too specifically, including noise and random variations, rather than the underlying pattern. This results in high training accuracy but poor performance on new data. It is one of the most common ML failure modes and the reason held-out test evaluation is essential.

What is the bias-variance tradeoff?

The fundamental tension between two error sources: bias (error from wrong simplifying assumptions — underfitting) and variance (error from sensitivity to training data fluctuations — overfitting). Simple models: high bias, low variance. Complex models: low bias, high variance. The goal is minimizing total error.
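For squared-error regression the "total error" has a standard decomposition. As a sketch, assume the data follow y = f(x) + ε with noise variance σ², and let f̂ be the model fitted on a random training set (the expectation is taken over training sets and noise):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the first term and raises the second. The third term is a floor that no model can get below.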

What is the difference between overfitting and underfitting?

Overfitting: high training accuracy, much lower validation accuracy (large gap). Underfitting: both training and validation accuracy are low. Quick diagnostic: large train/val gap = overfitting; both low = underfitting; both similar and acceptable = good fit.

Does getting more training data help with overfitting?

Yes — more data is one of the most effective overfitting remedies. With more examples, the model can't memorize them all and must generalize. Data augmentation creates modified versions of existing data when collection is expensive. More data doesn't help if data quality is poor or the model architecture is fundamentally wrong.

What is L1 vs L2 regularization?

L1 (Lasso): drives some weights to exactly zero — useful for feature selection. L2 (Ridge/weight decay): shrinks all weights but rarely to zero — standard general regularizer. Elastic Net combines both. For neural networks, L2 via weight_decay is the standard; for linear models, both have specific uses.
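Written as penalties added to a base loss, the three variants differ only in the norm applied to the weights. In this sketch the symbols are assumed notation: the base loss is written as a calligraphic L, λ is the regularization strength (`alpha` in scikit-learn's linear models), and ρ is the L1/L2 mix (`l1_ratio`). Exact scaling constants vary between libraries.

```latex
\begin{aligned}
\text{L2 (Ridge):}  \quad & \mathcal{L}(w) + \lambda \sum_j w_j^2 \\
\text{L1 (Lasso):}  \quad & \mathcal{L}(w) + \lambda \sum_j |w_j| \\
\text{Elastic Net:} \quad & \mathcal{L}(w) + \lambda \Big( \rho \sum_j |w_j| + \tfrac{1-\rho}{2} \sum_j w_j^2 \Big)
\end{aligned}
```

The absolute value in the L1 term is what produces exact zeros: its gradient does not shrink as a weight approaches zero, whereas the L2 gradient does.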
