AiTechWorlds
Imagine a spam filter that has learned one perfect rule: mark every single email as spam. On a dataset where 95% of emails are spam, this filter achieves 95% accuracy. That sounds excellent. But your inbox is completely empty. Every legitimate email — your job offers, bank statements, messages from family — is silently discarded.
Accuracy told you 95%. Reality told you the filter is useless.
This is why choosing the right evaluation metric is not a technicality — it is the difference between a model that works and one that silently fails.
Every classification model's evaluation starts with the confusion matrix. For a binary problem (Positive / Negative), every prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Every metric you will ever use is derived from these four numbers.
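To make the four cells concrete, here is a minimal sketch that counts them by hand in plain Python. The helper name `confusion_counts` and the toy labels are made up for illustration; in practice `sklearn.metrics.confusion_matrix` does the same job.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels (illustrative helper)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive, correctly flagged
        elif actual == positive:
            fn += 1  # positive, but missed
        elif predicted == positive:
            fp += 1  # negative, but flagged anyway
        else:
            tn += 1  # negative, correctly left alone
    return tp, fn, fp, tn


# Toy labels, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

The four counts always add up to the number of predictions, which is a quick sanity check when building the matrix yourself.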
Accuracy answers: "Of all predictions, what fraction were correct?"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Useful when classes are balanced. Misleading on imbalanced data (the spam filter problem).
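The spam filter problem can be reproduced in a few lines. This is a sketch on a made-up dataset of 100 emails (95 spam, 5 legitimate); the variable names are illustrative only.

```python
# Hypothetical dataset: 95 spam emails (label 1) and 5 legitimate ones (label 0).
y_true = [1] * 95 + [0] * 5

# The "always spam" filter predicts 1 for every email.
y_pred = [1] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of legitimate emails that actually reach the inbox:
# true negatives divided by all actual negatives.
legit_total = sum(t == 0 for t in y_true)
legit_kept = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
legit_delivery_rate = legit_kept / legit_total

print(f"accuracy: {accuracy:.2f}")
print(f"legitimate mail delivered: {legit_delivery_rate:.2f}")
```

Accuracy looks fine because the majority class dominates the count of correct predictions, while the minority class is wrong every single time.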
Precision answers: "Of everything I labeled positive, how many actually were?"
Precision = TP / (TP + FP)
Prioritize precision when false positives are costly. Example: a model whose positive prediction triggers expensive, invasive surgery, where every false alarm harms a healthy patient.
Recall (Sensitivity) answers: "Of all actual positives, how many did I catch?"
Recall = TP / (TP + FN)
Prioritize recall when false negatives are costly. Example: a disease detection system where missing a sick patient has severe consequences.
F1 Score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 punishes a large gap between precision and recall. A model with 100% precision and 10% recall gets F1 ≈ 0.18, far below the arithmetic mean of 0.55. Use F1 when you need a single score that balances both.
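The three formulas can be checked directly from confusion-matrix counts. The sketch below uses the counts from the confusion matrix `[[40, 3], [2, 69]]` printed in the scikit-learn example later in this article, taking Benign (label 1) as the positive class. The helper name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from counts (illustrative helper)."""
    # Assumption: treat an undefined ratio (zero denominator) as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Benign as the positive class: TP=69, FP=3, FN=2.
precision, recall, f1 = precision_recall_f1(tp=69, fp=3, fn=2)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# A lopsided case: perfect precision, but only 10 of 100 positives found.
_, _, lopsided_f1 = precision_recall_f1(tp=10, fp=0, fn=90)
print(f"lopsided f1={lopsided_f1:.4f}")
```

The first result agrees with the Benign row of the classification report (0.96, 0.97, 0.97 after rounding). The second shows how the harmonic mean is pulled toward the weaker of the two values.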
Most classifiers output a probability (e.g., 0.73 = 73% chance of being positive). The decision threshold determines where you draw the line between positive and negative; the usual default is 0.5.
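Moving the threshold trades precision against recall. The sketch below applies three thresholds to a handful of made-up probabilities; `apply_threshold` is a hypothetical helper, and using `>=` at the boundary is an assumption of this example rather than a rule every library follows.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels (illustrative helper)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Toy data, invented for this example.
y_true = [0, 0, 1, 0, 1, 1]
y_proba = [0.10, 0.35, 0.45, 0.60, 0.73, 0.90]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = apply_threshold(y_proba, threshold)
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the lowest threshold catches every positive at the cost of two false alarms, and the highest threshold removes the false alarms but misses one positive.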
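The ROC Curve (Receiver Operating Characteristic) plots the True Positive Rate (TPR = TP / (TP + FN), the same quantity as recall) on the y-axis against the False Positive Rate (FPR = FP / (FP + TN)) on the x-axis as the threshold varies from 0 to 1. A perfect model reaches the top-left corner (TPR=1, FPR=0). A random model follows the diagonal.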
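Each point on the ROC curve is one (FPR, TPR) pair computed at one threshold. The following sketch traces a few points by hand on made-up scores. `roc_point` is a hypothetical helper that assumes both classes are present; `sklearn.metrics.roc_curve` is the usual tool and picks its thresholds from the data.

```python
def roc_point(y_true, y_proba, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for actual, p in zip(y_true, y_proba):
        predicted = 1 if p >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative in y_true.
    return fp / (fp + tn), tp / (tp + fn)


# Toy data, invented for this example.
y_true = [0, 0, 1, 0, 1, 1]
y_proba = [0.10, 0.35, 0.45, 0.60, 0.73, 0.90]

# Sweep from a strict threshold (nothing flagged) to a lax one (everything flagged).
points = [roc_point(y_true, y_proba, t) for t in (1.0, 0.7, 0.5, 0.3, 0.0)]
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

The sweep starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.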
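AUC (Area Under the Curve) summarizes the ROC curve in one number between 0 and 1: 1.0 means every positive is ranked above every negative, and 0.5 is what random guessing achieves.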
AUC is threshold-independent — it measures how well the model ranks positives above negatives, regardless of where you set the cutoff.
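The ranking interpretation can be computed directly: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half. The sketch below does this on made-up scores. `pairwise_auc` is a hypothetical helper with quadratic cost, intended only for illustration; `roc_auc_score` is the practical choice.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data, invented for this example.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.35, 0.45, 0.60, 0.73, 0.90]

auc = pairwise_auc(y_true, y_score)
print(f"AUC = {auc:.4f}")

# Halving every score changes no ranking, so the AUC stays the same.
auc_rescaled = pairwise_auc(y_true, [s / 2 for s in y_score])
print(f"AUC after rescaling = {auc_rescaled:.4f}")
```

Only the order of the scores matters here, which is why no threshold appears anywhere in the computation.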
For problems with continuous outputs (predicting house prices, temperature, etc.):
MAE (Mean Absolute Error): Average absolute difference between predictions and actuals. Less sensitive to outliers than MSE or RMSE. Easy to interpret (in the same units as the target).
MAE = (1/n) × Σ |y_i - ŷ_i|
MSE (Mean Squared Error): Average squared difference. Penalizes large errors heavily. Used in many loss functions.
MSE = (1/n) × Σ (y_i - ŷ_i)²
RMSE (Root Mean Squared Error): Square root of MSE. Same units as the target, more sensitive to large errors than MAE.
RMSE = √MSE
R² (R-Squared / Coefficient of Determination): Fraction of variance in the target that the model explains. R²=1 is perfect; R²=0 means the model does no better than predicting the mean; R²<0 means the model is worse than predicting the mean.
R² = 1 - (SS_residual / SS_total)
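The four regression formulas above can be written out in plain Python. This is a sketch on four made-up values; the helper name `regression_metrics` is hypothetical, and scikit-learn offers equivalents such as `mean_absolute_error`, `mean_squared_error` and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_residual = sum(e ** 2 for e in errors)
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_residual / ss_total  # assumes y_true is not constant
    return mae, mse, rmse, r2


# Toy data, invented for this example.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

RMSE comes out larger than MAE here because the single error of 2.0 is squared before averaging, which is the "penalizes large errors" behavior described above.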
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
# Setup
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
# --- Confusion Matrix ---
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Output:
# [[40 3]
# [ 2 69]]
# --- Classification Report ---
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=['Malignant', 'Benign']))
# Output:
# precision recall f1-score support
# Malignant 0.95 0.93 0.94 43
# Benign 0.96 0.97 0.97 71
# accuracy 0.96 114
# macro avg 0.96 0.95 0.95 114
# weighted avg 0.96 0.96 0.96 114
# --- ROC-AUC ---
auc = roc_auc_score(y_test, y_proba)
print(f"\nROC-AUC Score: {auc:.4f}")
# Output: ROC-AUC Score: 0.9957
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Breast Cancer Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()
When one class dominates (e.g., fraud detection: 99% non-fraud, 1% fraud), most classifiers learn to ignore the minority class. Solutions:
class_weight='balanced': Automatically adjusts the loss function so that the minority class gets more weight during training.
model = LogisticRegression(class_weight='balanced', max_iter=1000)
Prefer F1 score or ROC-AUC over accuracy for evaluation on imbalanced data.
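To see what "more weight" means in numbers, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`: each class gets `n_samples / (n_classes * class_count)`. The helper `balanced_class_weights` and the 99-to-1 toy labels are illustrative only.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical fraud data: 99 legitimate transactions (0), 1 fraudulent (1).
y = [0] * 99 + [1] * 1

weights = balanced_class_weights(y)
for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.4f}")

# Total weight carried by each class after reweighting.
totals = {label: weights[label] * y.count(label) for label in weights}
print(totals)
```

With these weights both classes carry the same total weight in the loss, so a mistake on the single fraud case costs about as much as mistakes on all 99 legitimate ones combined.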
| Metric | Formula | When to Prioritize | Real Example |
|---|---|---|---|
| Accuracy | (TP+TN)/Total | Balanced classes | MNIST digit recognition |
| Precision | TP/(TP+FP) | FP is costly | Surgery recommendation |
| Recall | TP/(TP+FN) | FN is costly | Disease screening |
| F1 Score | 2×P×R/(P+R) | Imbalanced data, need balance | Fraud detection |
| ROC-AUC | Area under ROC | Compare models, threshold-free | Credit scoring |
| MAE | mean(\|y-ŷ\|) | Outliers present | Temperature forecast |
| RMSE | √MSE | Large errors matter more | Housing prices |
| R² | 1-SS_res/SS_tot | Share of variance explained | Salary prediction |
Use class_weight='balanced' as a first-line fix for class imbalance.