Picture a student preparing for a national exam. Every night she drills the past papers from one specific school — same format, same question styles, same tricks. On exam day from that school, she scores 98%. Then she sits a paper from a different school. The format shifts, the phrasing changes, and she scores 61%.

She did not learn the subject. She memorized one version of it.

This is what happens when you evaluate a machine learning model on the same data it was trained on, or on a single fixed test split that you kept consulting during development. The model, or your choices about it, end up fitted to the quirks of that particular split rather than to the underlying pattern. Cross-validation addresses this by making the model prove itself on several unseen exam papers.

The Problem with a Single Train/Test Split

When you randomly split your data once (say 80/20), you introduce randomness into your evaluation. That specific 20% might be unusually easy or unusually hard. If you tune hyperparameters based on this single split, you are now leaking information about the test set into your development process.

The result: your reported accuracy is optimistic. The model has been — subtly or directly — fit to that test set. In the real world, performance drops.
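To see how much a single split can move the reported number, the sketch below repeats the same 80/20 split with ten different seeds on the breast cancer dataset used later in this lesson. The helper name single_split_accuracy and the choice of ten seeds are illustrative assumptions, not part of the original lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def single_split_accuracy(seed):
    """Accuracy from one 80/20 split drawn with the given seed (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # The scaler is fit on the training portion only, via the pipeline.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same data, same model, ten different random splits.
accuracies = [single_split_accuracy(seed) for seed in range(10)]
spread = max(accuracies) - min(accuracies)
print(f"min={min(accuracies):.4f}  max={max(accuracies):.4f}  spread={spread:.4f}")
```

Whatever spread is printed comes purely from which samples landed in the test set, since nothing else changed between runs.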

K-Fold Cross-Validation

K-Fold solves this by rotating the test set. The algorithm:

1. Shuffle the dataset and split it into K folds (groups) of roughly equal size.
2. For each fold i:
   - Use fold i as the test set.
   - Use the remaining K-1 folds as the training set.
   - Train the model and record the score.
3. Report the mean and standard deviation of all K scores.

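Every data point serves as a test sample exactly once. The standard choices are K=5 (fast, usually reliable) and K=10 (each model trains on 90% of the data, so the estimate is closer to what a model trained on everything would achieve, but it costs about twice as many fits). The mean score is a more trustworthy performance estimate than any single split, and the standard deviation shows how much it moves from fold to fold.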
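The steps above can be written out by hand in a few lines. This is a minimal sketch, assuming NumPy for the shuffling and a scaled logistic regression as the model; the function name manual_k_fold_scores is hypothetical, and in practice scikit-learn's KFold and cross_val_score do the same job.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def manual_k_fold_scores(X, y, k=5, seed=0):
    """Hand-rolled K-Fold: returns one accuracy per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # shuffle the sample indices
    folds = np.array_split(indices, k)      # K folds of near-equal size
    scores = []
    for i in range(k):                      # rotate which fold is the test set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

X, y = load_breast_cancer(return_X_y=True)
scores = manual_k_fold_scores(X, y, k=5)
print(f"mean={scores.mean():.4f}  std={scores.std():.4f}")
```

A fresh model is built inside the loop on purpose: reusing a fitted model across folds would let earlier training data influence later scores.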

Stratified K-Fold for Imbalanced Classes

Regular K-Fold ignores class labels when it assigns samples to folds. With imbalanced data (e.g., 95% class 0, 5% class 1), a fold might contain zero or very few minority-class samples, which makes the score on that fold nearly meaningless.

Stratified K-Fold preserves the class ratio in every fold. If the overall dataset is 95/5, each fold will also be approximately 95/5. Prefer stratified K-Fold for classification problems; in scikit-learn, passing an integer cv to cross_val_score with a classifier already stratifies.
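A small, made-up label vector shows the difference. The 95/5 labels below are hypothetical and sorted by class, and KFold is used without shuffling (its default), so it cuts the data into contiguous blocks; that is the worst case, chosen here to make the effect obvious.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 95 negatives followed by 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # features do not matter for the split itself

def positives_per_fold(splitter):
    """Count minority-class samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))            # contiguous blocks
strat = positives_per_fold(StratifiedKFold(n_splits=5))  # class ratio preserved

print("KFold:          ", plain)   # → [0, 0, 0, 0, 5]
print("StratifiedKFold:", strat)   # → [1, 1, 1, 1, 1]
```

With plain KFold, four of the five test folds contain no positives at all, so metrics such as recall or F1 are undefined or zero on them.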

The Bias-Variance Tradeoff

This is one of the most important concepts in machine learning. Apart from irreducible noise in the data itself, a model's error on unseen data comes from two sources:

Bias (Underfitting): The model is too simple to capture the true pattern. A linear model trying to fit a curved relationship will always be wrong — not because of the data, but because of its own rigid assumptions. High bias means high training error.

Variance (Overfitting): The model is too complex and memorizes the training data — including its noise. It fits the training set perfectly but fails on anything new. High variance means low training error but high test error.

The tradeoff: as you increase model complexity, bias decreases (good) but variance increases (bad). The sweet spot is a model complex enough to capture real patterns, but not so complex that it memorizes noise.

Model State	Training Error	Validation Error	Fix
Underfitting	High	High	More complex model, more features
Good fit	Low	Low	You're done
Overfitting	Very Low	High	Regularization, more data, simpler model
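The table can be reproduced on a toy problem by sweeping model complexity. This is a sketch under stated assumptions: the data is synthetic (a noisy sine curve), and the polynomial degrees 1, 4, and 15 are arbitrary picks meant to stand for "too simple", "about right", and "too flexible".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): one period of a sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 40)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    out = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_mean_squared_error", return_train_score=True,
    )
    # Scores are negated MSE, so flip the sign to get errors.
    train_mse = -out["train_score"].mean()
    val_mse = -out["test_score"].mean()
    results[degree] = (train_mse, val_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  validation MSE={val_mse:.4f}")
```

Read each printed row against the table: a straight line cannot follow the sine curve, so both errors stay high, while the comparison between training and validation error at the highest degree shows how far the fit has drifted toward the noise.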

Learning Curves

A learning curve plots training and validation error as a function of training set size. It is one of the most useful diagnostics for telling bias from variance.

Underfitting signature: Both training and validation error are high and converge close together. Adding data does not help much.
Overfitting signature: Training error is very low; validation error is much higher. A large gap persists. Adding more data usually helps.
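The two signatures can be checked numerically by comparing the last point of each curve. The sketch below is an illustration, not part of the original lesson: it uses a depth-1 decision tree as a deliberately limited model and an unpruned tree as a deliberately flexible one, and it reports accuracy, so error is 1 minus each value.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def final_train_val_accuracy(model):
    """Mean training and validation accuracy at the largest training size."""
    _, train_scores, val_scores = learning_curve(
        model, X, y, cv=5,
        train_sizes=np.linspace(0.2, 1.0, 5),
        scoring="accuracy", shuffle=True, random_state=0,
    )
    return train_scores[-1].mean(), val_scores[-1].mean()

# Depth-1 tree: a single threshold on one feature.
stump_train, stump_val = final_train_val_accuracy(
    DecisionTreeClassifier(max_depth=1, random_state=0)
)
# Unpruned tree: grows until the training data is fit.
deep_train, deep_val = final_train_val_accuracy(
    DecisionTreeClassifier(random_state=0)
)

print(f"stump: train={stump_train:.4f}  val={stump_val:.4f}  gap={stump_train - stump_val:.4f}")
print(f"deep:  train={deep_train:.4f}  val={deep_val:.4f}  gap={deep_train - deep_val:.4f}")
```

The unpruned tree scores essentially perfectly on its own training data, so any shortfall on the validation folds is the gap the overfitting signature describes.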

Complete sklearn Example

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, learning_curve
import matplotlib.pyplot as plt

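# Load data
X, y = load_breast_cancer(return_X_y=True)

# Standardize features. Note: fitting the scaler on the full dataset lets
# validation-fold statistics influence preprocessing (a mild leak); wrapping
# the scaler and model in a Pipeline would refit it inside each training fold.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# max_iter is raised from the default so the solver converges
model = LogisticRegression(max_iter=1000)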

# --- 5-fold CV (an integer cv with a classifier uses StratifiedKFold, unshuffled) ---
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
print(f"5-Fold CV Accuracy: {np.round(cv_scores, 4)}")
print(f"Mean: {cv_scores.mean():.4f}  |  Std: {cv_scores.std():.4f}")
# Output:
# 5-Fold CV Accuracy: [0.9561 0.9737 0.9649 0.9561 0.9823]
# Mean: 0.9666  |  Std: 0.0096

# --- Stratified K-Fold ---
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
strat_scores = cross_val_score(model, X_scaled, y, cv=skf, scoring='f1')
print(f"\n10-Fold Stratified F1:  {strat_scores.mean():.4f} ± {strat_scores.std():.4f}")
# Output: 10-Fold Stratified F1:  0.9681 ± 0.0138

# --- Learning Curve ---
train_sizes, train_scores, val_scores = learning_curve(
    model, X_scaled, y, cv=skf,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy', n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
val_mean   = val_scores.mean(axis=1)

plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_mean, 'o-', label='Training Accuracy', color='steelblue')
plt.plot(train_sizes, val_mean,   'o-', label='Validation Accuracy', color='tomato')
plt.fill_between(train_sizes,
                 train_scores.mean(1) - train_scores.std(1),
                 train_scores.mean(1) + train_scores.std(1), alpha=0.1, color='steelblue')
plt.fill_between(train_sizes,
                 val_scores.mean(1) - val_scores.std(1),
                 val_scores.mean(1) + val_scores.std(1), alpha=0.1, color='tomato')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve — Logistic Regression (Breast Cancer)')
plt.legend()
plt.tight_layout()
plt.savefig('learning_curve.png', dpi=150)
plt.show()
# The two curves converge → this model is well-fitted, not overfitting

GridSearchCV: Hyperparameter Tuning + Cross-Validation Combined

When you tune hyperparameters, you need cross-validation during the tuning — not just after. GridSearchCV does both:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C':     [0.01, 0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(
    SVC(), param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1', n_jobs=-1, verbose=1
)
grid_search.fit(X_scaled, y)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1:  {grid_search.best_score_:.4f}")
# Output:
# Best params: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
# Best CV F1:  0.9789

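GridSearchCV evaluates every combination of parameters with K-fold cross-validation and selects the one that scores best on the held-out folds, not on the training data. Two caveats apply to the example above. First, the scaler was fit on the full dataset before the search, so each validation fold has already influenced the preprocessing; putting the scaler and the model in a Pipeline and passing the pipeline to GridSearchCV removes that leak. Second, best_score_ is the maximum over all candidates and is therefore slightly optimistic; keep a separate test set that the search never sees for the final estimate.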
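One way to keep preprocessing inside the folds is to search over a Pipeline and hold back a test set for the final number. This is a sketch, not the lesson's original setup: the step names "scale" and "svc", the reduced grid, and the 80/20 hold-out are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that the search never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler is a pipeline step, so it is refit on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]}

search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
)
search.fit(X_dev, y_dev)

# By default the best pipeline is refit on all of X_dev, so predict() works directly.
test_f1 = f1_score(y_test, search.predict(X_test))
print(f"Best params: {search.best_params_}")
print(f"CV F1 (used for selection): {search.best_score_:.4f}")
print(f"Held-out test F1:           {test_f1:.4f}")
```

The cross-validated score is what chose the parameters, so the held-out F1 is the figure to report.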

Key Takeaways

A single train/test split gives a noisy estimate that depends on which samples land in the test set. Cross-validation reduces that noise by averaging over K splits.
K=5 or K=10 are standard choices; K=10 gives a slightly better estimate at higher cost.
Always use Stratified K-Fold for classification to preserve class proportions.
Bias = underfitting (model too simple). Variance = overfitting (model too complex).
Learning curves are a clear diagnostic: a large, persistent gap between training and validation error means overfitting; both errors high and close together means underfitting.
GridSearchCV integrates hyperparameter search with cross-validation, so tuning never has to touch your final test set.
