Train/Validation/Test Splits: Evaluating Models Without Lying to Yourself

The way you split your data determines how honest your evaluation is. Get it wrong and your model looks great in development, then quietly fails in production. This is one of the most misunderstood topics in applied ML.

The Three-Set System

All Data
    │
    ├── Training Set (60-70%)
    │   └── Model learns from this
    │
    ├── Validation Set (15-20%)
    │   └── You tune hyperparameters here
    │   └── Model selection happens here
    │
    └── Test Set (15-20%)
        └── Final evaluation — touch ONCE
        └── Simulates performance on unseen data

The test set is sacred. The moment you look at test performance and make a decision based on it, it becomes a second validation set — and you need a new test set.

Why You Need All Three

Train only: You have no idea how the model performs on unseen data.

Train + Test: You tune hyperparameters, look at test error, and adjust your model. The test set has now leaked into model selection, so the score you report from it is optimistically biased.

Train + Validation + Test: You tune using validation, make final model selection, then touch test exactly once for the final reported number.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

# Split off test set first — put it away
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split remaining into train/validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
    # 0.25 of 0.8 = 0.2 of total → 60/20/20 split
)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
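The second `test_size` above has to be expressed as a fraction of what is left after the first split, which is easy to get wrong for ratios other than 60/20/20. One way to sidestep that arithmetic is to pass absolute row counts. The sketch below assumes a hypothetical helper, `three_way_split`, which is not part of scikit-learn; it only wraps two `train_test_split` calls and relies on the fact that an integer `test_size` is read as a number of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_frac=0.15, test_frac=0.15, seed=42):
    """Hypothetical helper: stratified train/val/test split by row counts."""
    n = len(X)
    n_test = round(test_frac * n)
    n_val = round(val_frac * n)
    # An integer test_size is treated as an absolute number of rows.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed, stratify=y
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Placeholder data with the same shape as the example above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))  # 700 150 150
```

Working in counts avoids rounding surprises that can appear when a fraction such as 0.15 / 0.85 is multiplied back by the number of remaining rows.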

Cross-Validation: The Better Validation Strategy

With a fixed validation split, your results depend on which specific examples end up in validation. Cross-validation reduces this variance by averaging the score over several different splits.
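To get a feel for how much a single split can move the number, the following sketch refits the same model on the same data and changes only the split seed. The dataset is synthetic (`make_classification`) and the choice of logistic regression is an assumption made for illustration; the size of the spread will differ on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

val_scores = []
for seed in range(10):
    # Same data and model each time; only the split changes.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    val_scores.append(model.score(X_va, y_va))

print(f"min={min(val_scores):.3f} max={max(val_scores):.3f} "
      f"std={np.std(val_scores):.3f}")
```

Any gap between the minimum and maximum here comes from the split alone, which is the noise that averaging over folds is meant to dampen.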

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# K-fold cross-validation (k=5 means 5 different splits)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    model, X_train, y_train,
    cv=cv,
    scoring='accuracy'
)

print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
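# Illustrative output on a real dataset: CV Accuracy: 0.847 ± 0.012
# (the random X, y above carry no signal, so expect roughly 0.5 there)
# The ± value is the spread across folds: smaller means a more stable estimate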

How it works:

Fold 1: [TEST] [TRAIN] [TRAIN] [TRAIN] [TRAIN]  → Score 1
Fold 2: [TRAIN] [TEST] [TRAIN] [TRAIN] [TRAIN]  → Score 2
Fold 3: [TRAIN] [TRAIN] [TEST] [TRAIN] [TRAIN]  → Score 3
Fold 4: [TRAIN] [TRAIN] [TRAIN] [TEST] [TRAIN]  → Score 4
Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST]  → Score 5
                                                   
Final: Mean(Score 1-5) ± Std(Score 1-5)

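Every example is in the held-out fold (marked [TEST] above) exactly once, so all the data you pass to cross-validation is used for both training and evaluation. That held-out fold acts as a validation set; the final test set from the three-set split stays untouched.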
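The "exactly once" property is easy to check directly. This small sketch counts how often each row index shows up in the held-out part of a 5-fold split; the 20-row array is placeholder data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder data

held_out_counts = np.zeros(n_samples, dtype=int)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X_demo):
    # No row may appear in both parts of the same fold.
    assert len(np.intersect1d(train_idx, val_idx)) == 0
    held_out_counts[val_idx] += 1

print(held_out_counts)  # every entry is 1
```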

Stratified Splits for Classification

When classes are imbalanced, random splits can produce validation sets with different class ratios than training.

import numpy as np
from sklearn.model_selection import train_test_split

# Check class distribution
print("Original:", np.bincount(y) / len(y))

# Random split (bad for imbalanced data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Train:", np.bincount(y_train) / len(y_train))
print("Test:", np.bincount(y_test) / len(y_test))

# Stratified split (preserves class ratios)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train (stratified):", np.bincount(y_train) / len(y_train))
print("Test (stratified):", np.bincount(y_test) / len(y_test))

Always use stratify=y for classification problems with class imbalance.
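The same idea carries over to cross-validation. As a minimal sketch, assuming a synthetic label vector with 10% positives, the code below compares the positive rate in each held-out fold for plain `KFold` and for `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 1000 labels with 10% positives (synthetic, for illustration).
y_imb = np.array([1] * 100 + [0] * 900)
X_imb = np.zeros((len(y_imb), 1))  # features are irrelevant for splitting

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain_rates = [y_imb[val].mean() for _, val in plain.split(X_imb)]
strat_rates = [y_imb[val].mean() for _, val in strat.split(X_imb, y_imb)]

print("KFold positive rate per fold:          ", np.round(plain_rates, 3))
print("StratifiedKFold positive rate per fold:", np.round(strat_rates, 3))
```

With stratification every fold keeps the 10% positive rate, while the plain folds drift around it. The drift matters more as the minority class gets rarer.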

Time Series: Never Shuffle

For time series data, shuffling destroys the temporal structure. A shuffled split lets the model train on observations that come after the ones it is evaluated on, which never happens in production.

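import numpy as np
from sklearn.model_selection import train_test_split

X_ts = np.arange(100).reshape(-1, 1)  # placeholder series: rows ordered oldest to newest

# WRONG: Shuffled split leaks future into past
X_train, X_test = train_test_split(X_ts, test_size=0.2, shuffle=True)  # DON'T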

# CORRECT: Temporal split
split_point = int(len(X_ts) * 0.8)
X_train = X_ts[:split_point]
X_test = X_ts[split_point:]

# For time-series cross-validation
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_ts):
    X_fold_train, X_fold_test = X_ts[train_idx], X_ts[test_idx]
    # Train size grows each fold, test is always "future"

TimeSeriesSplit:
Fold 1: [TRAIN────] [TEST─]
Fold 2: [TRAIN────────] [TEST─]
Fold 3: [TRAIN────────────] [TEST─]
Fold 4: [TRAIN────────────────] [TEST─]
Fold 5: [TRAIN────────────────────] [TEST─]
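Printing the actual indices makes the diagram concrete. The sketch below uses a 12-step placeholder series and checks the property that matters: in every fold, all training indices come before all test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time steps, oldest first (placeholder data).
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_demo))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # The test block always starts after the last training index.
    assert train_idx.max() < test_idx.min()
    print(f"Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```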

Choosing Split Ratios

Dataset Size     │  Recommended Split
─────────────────┼──────────────────────────────
< 1,000 rows     │  Use cross-validation (no fixed test)
1K – 10K rows    │  70/15/15 with CV on train+val
10K – 100K rows  │  70/15/15 or 80/10/10
> 100K rows      │  80/10/10 or 90/5/5

With large datasets, a smaller fraction is enough for validation and test, because even 5% is thousands of examples and the estimates are already stable. With small datasets, cross-validation uses the data more efficiently than a fixed validation split.
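A rough way to quantify "stable" is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes independent examples and a true accuracy of 0.85; both are assumptions chosen only to illustrate how the uncertainty shrinks as the validation set grows.

```python
import math


def accuracy_std_error(p, n):
    """Binomial standard error of an accuracy estimate from n examples."""
    return math.sqrt(p * (1 - p) / n)


p = 0.85  # assumed true accuracy, chosen only for illustration
for n_val in (150, 1_500, 15_000):
    se = accuracy_std_error(p, n_val)
    print(f"n_val={n_val:>6}: accuracy ≈ {p:.2f} ± {se:.4f}")
```

Because the error falls with the square root of n, a validation set 100 times larger gives an estimate about 10 times tighter. That is why 5% of a very large dataset can be plenty, while 15% of 1,000 rows is still noisy.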

Nested Cross-Validation: Honest Hyperparameter Tuning

If you tune hyperparameters using cross-validation and then report that same CV score as your final result, the number is optimistically biased: the folds that scored the winning configuration are the same folds that picked it. The solution is nested CV.

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.svm import SVC

# Outer loop: performance estimation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: hyperparameter selection
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}

# Nested CV
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV Accuracy: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

Nested CV gives a nearly unbiased estimate of how well your whole model selection procedure generalizes, not just how well the single best configuration happened to score.
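To compare the two numbers side by side, the sketch below runs the same grid search once on its own and once wrapped in an outer loop. The dataset is synthetic and the grid mirrors the one above. On any single run the gap can be small or even reversed, so treat the output as an illustration rather than a guarantee.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=3, random_state=0
)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Non-nested: score of the winning configuration on the same folds
# that were used to pick it.
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
search.fit(X_demo, y_demo)
non_nested = search.best_score_

# Nested: the whole search is repeated inside each outer training fold,
# and each outer held-out fold is scored by a model that never saw it.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)

print(f"Non-nested best score: {non_nested:.3f}")
print(f"Nested CV score:       {nested_scores.mean():.3f}")
```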

Data Leakage: The Silent Model Killer

Leakage means information that will not be available at prediction time, such as test-set statistics or the target itself, contaminates training. Your model looks amazing in evaluation, then fails completely in production.

Common leakage patterns:

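from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# BAD: Fit scaler on full dataset before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Mean and std are computed from test rows too
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# GOOD: Fit scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply training stats to test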

# EVEN BETTER: Use Pipeline (keeps preprocessing inside each CV fold)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
# Pipeline correctly fits scaler only on training folds during CV
scores = cross_val_score(pipeline, X, y, cv=5)

Other leakage sources:

Imputing missing values before splitting
Feature engineering that uses statistics from the whole dataset
Including future features (for time series)
Including the target variable as a feature (target leakage)
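The first two items in the list above have the same fix as scaling: put the step inside the pipeline so it is refit on each training fold. A minimal sketch, assuming synthetic data with about 10% of values removed at random and median imputation as the (arbitrary) strategy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 10% of values knocked out (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_missing = X_demo.copy()
X_missing[rng.random(X_missing.shape) < 0.1] = np.nan

leak_free = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# In each fold, the medians, means, and standard deviations are computed
# from the training part only, then applied to the held-out part.
scores = cross_val_score(leak_free, X_missing, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Pipelines do not help with the last two items: future features and target leakage have to be caught by inspecting where each column comes from.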

The Complete Evaluation Workflow

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Step 1: Split off test set — don't touch it until the end
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: Compare models using CV on train+val
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Logistic Regression': Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_val, y_train_val, cv=cv, scoring='f1')
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")

# Step 3: Select the model with the best mean CV score, train on all train+val data
best_name = max(cv_means, key=cv_means.get)
best_model = models[best_name]
best_model.fit(X_train_val, y_train_val)

# Step 4: Evaluate on test set — ONE TIME
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

This workflow gives you honest estimates at each stage and a clean final evaluation.

Next lesson: Decision Trees — the intuitive, interpretable foundation of ensemble methods.