Accuracy, Precision, Recall, and F1: Choosing the Right Metric
Accuracy is one of the most misused metrics in machine learning. If only 0.1% of transactions are fraudulent, a fraud detection model that labels everything as "not fraud" achieves 99.9% accuracy and is completely useless. Choosing the right metric is fundamental to building models that solve the problem you actually care about.
The Confusion Matrix
Every classification evaluation starts here.
                 | Predicted Positive | Predicted Negative |
Actual Positive  | TP                 | FN                 |
Actual Negative  | FP                 | TN                 |
TP = True Positive (correctly predicted positive)
TN = True Negative (correctly predicted negative)
FP = False Positive (predicted positive, actually negative) — Type I Error
FN = False Negative (predicted negative, actually positive) — Type II Error
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
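One detail worth checking by hand: the table above lists the positive class first, while scikit-learn's `confusion_matrix` sorts labels in ascending order, so for 0/1 labels the negative class comes first. The sketch below uses a small made-up label set (the arrays `y_true` and `y_hat` are illustrative, not from the lesson's dataset) to show how the four counts can be unpacked and how the positive-first layout can be requested.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# With labels sorted ascending (0, then 1), the layout is
# [[TN, FP],
#  [FN, TP]]
cm_toy = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_toy.ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=2, FN=1, FP=1, TN=4

# Passing labels=[1, 0] puts the positive class first, matching the table:
# [[TP, FN],
#  [FP, TN]]
cm_pos_first = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm_pos_first)
```

In the breast cancer dataset used here, label 0 is malignant and label 1 is benign, which is why the heatmap tick labels are listed in that order.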
The Core Metrics
# Accuracy: (TP + TN) / Total
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
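# Precision: TP / (TP + FP), "of the positives predicted, how many were correct?"
# Note: scikit-learn treats label 1 as the positive class by default; in this
# dataset 1 = benign. Pass pos_label=0 to score malignant as the positive class.
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")
# Recall (sensitivity): TP / (TP + FN), "of the actual positives, how many did we catch?"
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.3f}")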
# F1 Score: 2 × (Precision × Recall) / (Precision + Recall) — harmonic mean
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
# Full report
print(classification_report(y_test, y_pred,
                            target_names=cancer.target_names))
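To make the formulas in the comments concrete, here is a small worked example that computes each metric directly from confusion-matrix counts and then checks the result against scikit-learn. The counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical counts, for illustration only
tp, fn, fp, tn = 80, 20, 10, 890

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # 970 / 1000
precision_manual = tp / (tp + fp)                   # 80 / 90
recall_manual = tp / (tp + fn)                      # 80 / 100
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# Rebuild label arrays with exactly those counts to cross-check with scikit-learn
y_true = np.array([1] * tp + [1] * fn + [0] * fp + [0] * tn)
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

print(f"Accuracy:  {accuracy_manual:.3f} vs {accuracy_score(y_true, y_hat):.3f}")
print(f"Precision: {precision_manual:.3f} vs {precision_score(y_true, y_hat):.3f}")
print(f"Recall:    {recall_manual:.3f} vs {recall_score(y_true, y_hat):.3f}")
print(f"F1:        {f1_manual:.3f} vs {f1_score(y_true, y_hat):.3f}")
```

Accuracy is 0.970 here even though one in five positives is missed, which is the gap that precision, recall, and F1 are meant to expose.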
Understanding the Trade-off
Precision and recall are in tension — improving one often hurts the other.
Low threshold (predict positive aggressively):
  ↑ Recall: catch more true positives
  ↓ Precision: more false positives
High threshold (predict positive conservatively):
  ↑ Precision: predictions are more reliable
  ↓ Recall: miss more true positives
# Control the trade-off with threshold
y_proba = model.predict_proba(X_test)[:, 1] # Probability of positive class
# Default threshold: 0.5
y_pred_default = (y_proba >= 0.5).astype(int)
# Lower threshold: catch more positives (better recall)
y_pred_low = (y_proba >= 0.3).astype(int)
# Higher threshold: be more careful (better precision)
y_pred_high = (y_proba >= 0.7).astype(int)
for name, preds in [('Default (0.5)', y_pred_default),
                    ('Low (0.3)', y_pred_low),
                    ('High (0.7)', y_pred_high)]:
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    f = f1_score(y_test, preds)
    print(f"{name}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")
Precision-Recall Curve
The precision-recall curve shows performance across all possible thresholds, which is more informative than a single number.
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-')
plt.axvline(x=recall[np.argmax(2*precision*recall/(precision+recall + 1e-8))],
            linestyle='--', color='r', label='Best F1 threshold')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
# Area under the PR curve (Average Precision)
ap = average_precision_score(y_test, y_proba)
print(f"Average Precision: {ap:.3f}")
# Find threshold that maximizes F1
f1_scores = 2*precision*recall/(precision+recall + 1e-8)
best_idx = np.argmax(f1_scores[:-1]) # Last value has no threshold
best_threshold = thresholds[best_idx]
print(f"Threshold for best F1: {best_threshold:.3f}")
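The `[:-1]` slice above exists because `precision_recall_curve` returns one more precision/recall pair than thresholds. A minimal sketch on made-up scores (the arrays below are illustrative) shows the shapes involved:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, invented for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

prec, rec, thr = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds
print(len(prec), len(rec), len(thr))

# The final point is fixed at precision = 1, recall = 0 and has no threshold
print(prec[-1], rec[-1])
```

Indexing `thresholds` with an argmax taken over the full precision or recall array could therefore run one position past the end, so dropping the last point first is the safer habit.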
ROC AUC: Model Discrimination
The ROC curve plots True Positive Rate vs False Positive Rate. AUC (Area Under the Curve) measures how well the model discriminates between classes.
from sklearn.metrics import roc_curve, RocCurveDisplay, roc_auc_score
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'r--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
print(f"ROC AUC: {auc:.3f}")
Interpreting AUC (rough rules of thumb; what counts as "good" depends on the domain):
- 0.5: Random (no skill)
- 0.7: Acceptable
- 0.8: Good
- 0.9: Excellent
- 1.0: Perfect (suspect data leakage)
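One way to read these numbers: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks that interpretation against `roc_auc_score` on a handful of made-up scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_true, scores)
print(f"Pairwise ranking: {auc_pairwise:.3f}")
print(f"roc_auc_score:    {auc_sklearn:.3f}")
```

Here 7 of the 9 positive/negative pairs are ranked correctly, so both values come out to about 0.778. An AUC of 0.5 corresponds to getting half of the pairs right, which is what random scores would do.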
Metrics for Imbalanced Data
# Simulate imbalanced dataset (99% negative, 1% positive)
from sklearn.datasets import make_classification
X_imb, y_imb = make_classification(
    n_samples=10000,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    random_state=42
)
# A model that predicts everything as negative:
y_all_negative = np.zeros_like(y_imb)
print("Model predicting ALL negative:")
print(f" Accuracy: {accuracy_score(y_imb, y_all_negative):.3f}") # roughly 0.99, which is misleading!
print(f" Precision: {precision_score(y_imb, y_all_negative, zero_division=0):.3f}") # 0
print(f" Recall: {recall_score(y_imb, y_all_negative):.3f}") # 0
print(f" F1: {f1_score(y_imb, y_all_negative, zero_division=0):.3f}") # 0
For imbalanced problems, use:
- F1 Score instead of accuracy
- ROC AUC for model comparison
- Average Precision when the positive class is rare
- Balanced accuracy: average recall per class
from sklearn.metrics import balanced_accuracy_score
balanced_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Balanced Accuracy: {balanced_acc:.3f}")
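"Average recall per class" can be verified directly. The sketch below uses hypothetical labels (95 negatives, 5 positives) and compares the hand-computed mean of per-class recalls with `balanced_accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A classifier that raises 1 false alarm and finds only 2 of the 5 positives
y_hat = np.array([0] * 94 + [1] * 1 + [1] * 2 + [0] * 3)

# average=None returns one recall value per class, in label order
per_class_recall = recall_score(y_true, y_hat, average=None)
balanced_manual = per_class_recall.mean()

print(f"Accuracy:          {accuracy_score(y_true, y_hat):.3f}")
print(f"Recall per class:  {per_class_recall}")
print(f"Balanced (manual): {balanced_manual:.3f}")
print(f"Balanced (sklearn): {balanced_accuracy_score(y_true, y_hat):.3f}")
```

Plain accuracy is 0.96 on this data, while balanced accuracy is close to 0.69 because the minority class, with a recall of 0.4, counts as much as the majority class.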
Multiclass Metrics
from sklearn.datasets import load_iris
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model_multi = RandomForestClassifier(random_state=42)
model_multi.fit(X_train, y_train)
y_pred_multi = model_multi.predict(X_test)
# Averaging strategies for multiclass
print("Macro (treat all classes equally):")
print(f" Precision: {precision_score(y_test, y_pred_multi, average='macro'):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_multi, average='macro'):.3f}")
print(f" F1: {f1_score(y_test, y_pred_multi, average='macro'):.3f}")
print("\nWeighted (weight by class support):")
print(f" F1: {f1_score(y_test, y_pred_multi, average='weighted'):.3f}")
# Per-class metrics
print("\nFull classification report:")
print(classification_report(y_test, y_pred_multi, target_names=iris.target_names))
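The difference between the two averaging strategies is easier to see on unevenly sized classes than on iris. The following sketch, using made-up labels with class sizes 6, 3, and 1, rebuilds both averages from the per-class F1 scores and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy three-class labels with uneven support, invented for illustration
y_true = np.array([0] * 6 + [1] * 3 + [2] * 1)
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 2])

per_class_f1 = f1_score(y_true, y_hat, average=None)  # one F1 per class
support = np.bincount(y_true)                          # samples per class

# Macro: plain mean, every class counts the same
macro_manual = per_class_f1.mean()
# Weighted: mean weighted by how many true samples each class has
weighted_manual = np.average(per_class_f1, weights=support)

print(f"Per-class F1: {np.round(per_class_f1, 3)}")
print(f"Macro:    {macro_manual:.3f} vs {f1_score(y_true, y_hat, average='macro'):.3f}")
print(f"Weighted: {weighted_manual:.3f} vs {f1_score(y_true, y_hat, average='weighted'):.3f}")
```

Macro averaging lets the single-sample class pull the score up or down as much as the six-sample class does, so it is usually the better choice when rare classes matter.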
Regression Metrics
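from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# The classifiers above cannot be reused for regression, so this stand-in example
# (an assumption for illustration) fits a regressor on the diabetes dataset
diabetes = load_diabetes()
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)
# MAE: average absolute error (same units as target)
mae = mean_absolute_error(y_test_reg, y_pred_reg)
# RMSE: penalizes large errors more
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
# R²: proportion of variance explained (1 = perfect, 0 = no better than predicting the mean, negative = worse)
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"MAE: {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")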
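For reference, the three regression metrics reduce to a few lines of NumPy. The sketch below computes them by hand on four made-up values and compares the results with scikit-learn; it also shows that always predicting the mean of the targets gives an R² of 0.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions, invented for illustration
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

err = y_true - y_hat
mae_manual = np.mean(np.abs(err))
rmse_manual = np.sqrt(np.mean(err ** 2))
# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae_manual:.3f} vs {mean_absolute_error(y_true, y_hat):.3f}")
print(f"RMSE: {rmse_manual:.3f} vs {np.sqrt(mean_squared_error(y_true, y_hat)):.3f}")
print(f"R²:   {r2_manual:.3f} vs {r2_score(y_true, y_hat):.3f}")

# Baseline: predicting the mean for every sample
y_mean = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, y_mean)
print(f"R² of mean predictor: {r2_baseline:.3f}")
```

RMSE (about 1.146) is larger than MAE (0.875) here because the single error of 2.0 is squared before averaging.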
The Business Metric Alignment
Every ML metric needs to connect to a real cost.
Medical diagnosis:
  FN (missing cancer) >> FP (false alarm) → Maximize Recall
  Acceptable: low precision, high recall
Spam filter:
  FP (blocking real email) >> FN (spam gets through) → Maximize Precision
  Acceptable: some spam passes, real email is almost never blocked
Fraud detection:
  Balance both → F1 or custom cost function
  Cost(FP) = cost of inconveniencing a legitimate customer
  Cost(FN) = cost of a fraudulent transaction
Credit scoring:
  Use probability scores, not just classes
  Business sets threshold based on acceptable risk level
The metric you optimize during training should reflect real-world costs — not just the standard default.
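A custom cost function like the one in the fraud example can be turned into a threshold choice with very little code. The sketch below is one possible approach, not a prescribed method: the cost values, the helper `total_cost`, and the toy scores are all hypothetical, and in practice the costs would come from the business and the threshold would be chosen on a validation set.

```python
import numpy as np

# Hypothetical costs per error, for illustration only
COST_FP = 5.0     # e.g. inconveniencing a legitimate customer
COST_FN = 200.0   # e.g. letting a fraudulent transaction through

def total_cost(y_true, proba, threshold, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of the errors made when predicting positive at proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return cost_fp * fp + cost_fn * fn

# Toy labels and predicted probabilities, invented for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.03, 0.22, 0.31, 0.47, 0.38, 0.62, 0.71, 0.93])

# Evaluate a grid of candidate thresholds and keep the cheapest one
candidates = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_true, proba, t) for t in candidates]
best_t = candidates[int(np.argmin(costs))]

print(f"Cost at default 0.5: {total_cost(y_true, proba, 0.5):.0f}")
print(f"Best threshold: {best_t:.2f} with cost {min(costs):.0f}")
```

With false negatives priced at 40 times a false positive, the cheapest threshold on this toy data lands at 0.35, below the default 0.5: accepting two false alarms is cheaper than missing one positive.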
Next lesson: Cross-Validation Techniques — evaluating models reliably on limited data.