Imagine a spam filter that has learned one perfect rule: mark every single email as spam. On a dataset where 95% of emails are spam, this filter achieves 95% accuracy. That sounds excellent. But your inbox is completely empty. Every legitimate email — your job offers, bank statements, messages from family — is silently discarded.

Accuracy told you 95%. Reality told you the filter is useless.

This is why choosing the right evaluation metric is not a technicality — it is the difference between a model that works and one that silently fails.

The Confusion Matrix: Ground Truth

Every classification model's performance starts here. For a binary problem (Positive / Negative), every prediction falls into one of four cells:

                | Predicted Positive | Predicted Negative |
Actual Positive | True Positive      | False Negative     |
Actual Negative | False Positive     | True Negative      |

True Positive (TP): Model said positive. It was positive. Correct.
True Negative (TN): Model said negative. It was negative. Correct.
False Positive (FP): Model said positive. It was negative. Wrong. (Type I error)
False Negative (FN): Model said negative. It was positive. Wrong. (Type II error)

Every metric you will ever use is derived from these four numbers.
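
As a minimal sketch of how the four cells are counted, the snippet below tallies them by hand in plain Python. The function name `confusion_counts` and the toy labels are invented for illustration; they are not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1          # said positive, was positive
        elif predicted != positive and actual != positive:
            tn += 1          # said negative, was negative
        elif predicted == positive and actual != positive:
            fp += 1          # said positive, was negative (Type I error)
        else:
            fn += 1          # said negative, was positive (Type II error)
    return tp, tn, fp, fn

# Toy labels invented for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```

The four counts always sum to the number of predictions, which is a quick sanity check when computing them by hand.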

The Core Metrics — Formulas and Meaning

Accuracy answers: "Of all predictions, what fraction were correct?"

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Useful when classes are balanced. Misleading on imbalanced data (the spam filter problem).
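
The spam filter problem can be reproduced in a few lines. This is a sketch on a hypothetical 100-email dataset (95 spam, 5 legitimate) built to match the opening example; the label strings are assumptions made for illustration.

```python
# Hypothetical dataset matching the opening example: 95 spam, 5 legitimate
y_true = ["spam"] * 95 + ["ham"] * 5
y_pred = ["spam"] * 100  # the "mark everything as spam" filter

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of legitimate emails that actually reached the inbox
ham_total = sum(t == "ham" for t in y_true)
ham_kept = sum(t == "ham" and p == "ham" for t, p in zip(y_true, y_pred))
ham_delivered = ham_kept / ham_total

print(f"accuracy = {accuracy:.2f}")                       # accuracy = 0.95
print(f"legitimate mail delivered = {ham_delivered:.2f}")  # legitimate mail delivered = 0.00
```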

Precision answers: "Of everything I labeled positive, how many actually were?"

Precision = TP / (TP + FP)

Prioritize precision when false positives are costly. Example: a model that recommends expensive, invasive surgery, where every false alarm means an unnecessary operation.

Recall (Sensitivity) answers: "Of all actual positives, how many did I catch?"

Recall = TP / (TP + FN)

Prioritize recall when false negatives are costly. Example: a disease detection system where missing a sick patient has severe consequences.
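
To see the tradeoff between the two, here is a small sketch comparing two hypothetical models on the same 100 actual positives. The counts are made up for illustration, and returning 0.0 when the denominator is zero is a convention chosen here, not a universal rule.

```python
def precision(tp, fp):
    # With no positive predictions the ratio is undefined; 0.0 is used here by convention
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # With no actual positives the ratio is undefined; 0.0 is used here by convention
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts; both models face the same 100 actual positives
cautious = {"tp": 45, "fp": 5, "fn": 55}   # flags only 50 cases, mostly correct
eager = {"tp": 90, "fp": 60, "fn": 10}     # flags 150 cases, catches most positives

for name, c in [("cautious", cautious), ("eager", eager)]:
    p = precision(c["tp"], c["fp"])
    r = recall(c["tp"], c["fn"])
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# cautious: precision=0.90 recall=0.45
# eager: precision=0.60 recall=0.90
```

The cautious model raises few false alarms but misses more than half the positives; the eager model does the reverse. Which one is preferable depends on which error is more expensive in the domain.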

F1 Score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 punishes a large gap between precision and recall. A model with 100% precision and 1% recall gets F1 ≈ 0.02, not the 50.5% an arithmetic mean would give. Use F1 when you need a single score that balances both.
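
The difference between the harmonic mean and a plain average is easiest to see side by side. The sketch below uses a hypothetical helper, `f1_from`, and three illustrative precision/recall pairs.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs, from balanced to lopsided
pairs = [(0.9, 0.9), (0.9, 0.5), (1.0, 0.01)]

for p, r in pairs:
    arithmetic = (p + r) / 2
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic:.3f}  F1={f1_from(p, r):.3f}")
# P=0.90 R=0.90  arithmetic=0.900  F1=0.900
# P=0.90 R=0.50  arithmetic=0.700  F1=0.643
# P=1.00 R=0.01  arithmetic=0.505  F1=0.020
```

When precision and recall agree, the two averages match; as they diverge, F1 is pulled toward the smaller value.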

ROC Curve and AUC: Threshold-Independent Performance

Most classifiers output a probability (e.g., 0.73 = 73% chance of being positive). The decision threshold determines where you draw the line between positive and negative; the usual default is 0.5.

The ROC Curve (Receiver Operating Characteristic) plots:

True Positive Rate (Recall) on the Y-axis
False Positive Rate (FP / (FP + TN)) on the X-axis

as the threshold varies from 0 to 1. A perfect model reaches the top-left corner (TPR=1, FPR=0). A random model follows the diagonal.
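
Each point on the ROC curve comes from one threshold. As a rough sketch, the snippet below computes (TPR, FPR) at a few thresholds for eight made-up scores; the helper `rates_at_threshold` and the data are assumptions for illustration.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when every score >= threshold is labeled positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Made-up labels and predicted probabilities (4 positives, 4 negatives)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

for threshold in (0.9, 0.5, 0.25, 0.0):
    tpr, fpr = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
# threshold=0.90  TPR=0.25  FPR=0.00
# threshold=0.50  TPR=0.75  FPR=0.25
# threshold=0.25  TPR=1.00  FPR=0.50
# threshold=0.00  TPR=1.00  FPR=1.00
```

Lowering the threshold moves up and to the right along the curve: more positives are caught, at the price of more false alarms.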

AUC (Area Under the Curve) summarizes the ROC curve in one number:

AUC = 1.0: Perfect model
AUC = 0.5: Random guessing (useless)
AUC = 0.85: Generally a strong model, though how good is "good enough" depends on the problem

AUC is threshold-independent — it measures how well the model ranks positives above negatives, regardless of where you set the cutoff.
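
One way to make the ranking interpretation concrete: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way on made-up data; `auc_by_ranking` is a hypothetical helper written for illustration and is quadratic in the number of samples, so it is meant for understanding rather than production use.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities (4 positives, 4 negatives)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

print(auc_by_ranking(y_true, scores))  # 0.8125 (13 of 16 pairs ranked correctly)
```

No threshold appears anywhere in this calculation, which is why the score does not change when the cutoff does.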

Regression Metrics

For problems with continuous outputs (predicting house prices, temperature, etc.):

MAE (Mean Absolute Error): Average absolute difference between predictions and actuals. Less sensitive to outliers than MSE or RMSE, because errors are not squared. Easy to interpret (in the same units as the target).

MAE = (1/n) × Σ |y_i - ŷ_i|

MSE (Mean Squared Error): Average squared difference. Penalizes large errors heavily. Used in many loss functions.

MSE = (1/n) × Σ (y_i - ŷ_i)²

RMSE (Root Mean Squared Error): Square root of MSE. Same units as the target, more sensitive to large errors than MAE.

RMSE = √MSE

R² (R-Squared / Coefficient of Determination): Fraction of variance in the target that the model explains. R²=1 is perfect; R²=0 means the model does no better than predicting the mean; R²<0 means the model is worse than predicting the mean.

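R² = 1 - (SS_residual / SS_total)

where SS_residual = Σ (y_i - ŷ_i)² and SS_total = Σ (y_i - ȳ)², with ȳ the mean of the actual values.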
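
The four formulas can be checked by hand on a tiny example. The sketch below uses made-up targets and two hypothetical sets of predictions: one that is off by 1 everywhere, and one that is exact four times and off by 5 once. Both have the same MAE, which makes the extra sensitivity of RMSE to large errors visible.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mae, mse, rmse, r2) computed directly from the definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

# Made-up data for illustration
y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
small_errors = [11.0, 11.0, 15.0, 15.0, 19.0]   # every prediction off by 1
one_big_miss = [10.0, 12.0, 14.0, 16.0, 13.0]   # four exact, one off by 5

for name, y_pred in [("small_errors", small_errors), ("one_big_miss", one_big_miss)]:
    mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
    print(f"{name}: MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# small_errors: MAE=1.000 MSE=1.000 RMSE=1.000 R2=0.875
# one_big_miss: MAE=1.000 MSE=5.000 RMSE=2.236 R2=0.375
```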

Complete sklearn Example

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)
import matplotlib.pyplot as plt

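# Setup: load the data and hold out 20% for testing, preserving class proportions
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train, then collect hard predictions and class-1 (benign) probabilities
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]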

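# --- Confusion Matrix ---
# sklearn layout: rows = actual, columns = predicted, labels in sorted order
# (0 = malignant, 1 = benign), so the matrix reads [[TN, FP], [FN, TP]]
# when class 1 is treated as the positive class.
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Example output (exact counts may vary with library version):
# [[40  3]
#  [ 2 69]]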

# --- Classification Report ---
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=['Malignant', 'Benign']))
# Output:
#               precision    recall  f1-score   support
#    Malignant       0.95      0.93      0.94        43
#       Benign       0.96      0.97      0.97        71
#     accuracy                           0.96       114
#    macro avg       0.96      0.95      0.95       114
# weighted avg       0.96      0.96      0.96       114

# --- ROC-AUC ---
auc = roc_auc_score(y_test, y_proba)
print(f"\nROC-AUC Score: {auc:.4f}")
# Output: ROC-AUC Score: 0.9957

fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Breast Cancer Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()

Handling Imbalanced Classes

When one class dominates (e.g., fraud detection: 99% non-fraud, 1% fraud), a classifier can reach high accuracy while largely ignoring the minority class. Two practical responses:

class_weight='balanced': Automatically adjusts the loss function so that the minority class gets more weight during training.

model = LogisticRegression(class_weight='balanced', max_iter=1000)

Prefer F1 score or ROC-AUC over accuracy for evaluation on imbalanced data.
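
To give a feel for what class_weight='balanced' does, the scikit-learn documentation describes the weights as n_samples / (n_classes * np.bincount(y)). The snippet below is a plain-Python re-implementation of that formula for illustration only; `balanced_class_weights` is a hypothetical helper, not a library function, and the 99%/1% labels are made up.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up labels: 990 non-fraud (0) and 10 fraud (1)
y = [0] * 990 + [1] * 10

weights = balanced_class_weights(y)
for label in sorted(weights):
    print(f"class {label}: weight = {weights[label]:.3f}")
# class 0: weight = 0.505
# class 1: weight = 50.000
```

Under this weighting, a mistake on one fraud example costs about as much as mistakes on 99 non-fraud examples, so the model can no longer minimize its loss by ignoring the rare class.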

Metric Quick Reference Table

Metric	Formula	When to Prioritize	Real Example
Accuracy	(TP+TN)/Total	Balanced classes	MNIST digit recognition
Precision	TP/(TP+FP)	FP is costly	Surgery recommendation
Recall	TP/(TP+FN)	FN is costly	Disease screening
F1 Score	2×P×R/(P+R)	Imbalanced data, need balance	Fraud detection
ROC-AUC	Area under ROC	Compare models, threshold-free	Credit scoring
MAE	mean(|y-ŷ|)	Outliers present	Temperature forecast
RMSE	√MSE	Large errors matter more	Housing prices
R²	1-SS_res/SS_tot	Explained variance matters	Salary prediction

Key Takeaways

Accuracy is misleading on imbalanced data. Always check the confusion matrix first.
Precision vs Recall is a tradeoff driven by domain: what costs more, false alarms or missed detections?
F1 Score is the go-to single metric for imbalanced classification.
ROC-AUC measures ranking quality independent of threshold — ideal for model comparison.
For regression, MAE is interpretable; RMSE punishes outliers; R² shows explained variance.
Use class_weight='balanced' as a first-line fix for class imbalance.
