Model Evaluation

Accuracy, Precision, Recall, and F1: Choosing the Right Metric

Accuracy is the most misused metric in machine learning. On a dataset where only 0.1% of transactions are fraudulent, a fraud detection model that labels everything as "not fraud" achieves 99.9% accuracy and catches no fraud at all. Choosing the right metric is fundamental to building models that solve the problem you actually have.

The Confusion Matrix

Every classification evaluation starts here.

                    Predicted
                  Positive | Negative
Actual  Positive |   TP    |   FN    |
        Negative |   FP    |   TN    |

TP = True Positive  (correctly predicted positive)
TN = True Negative  (correctly predicted negative)
FP = False Positive (predicted positive, actually negative; Type I error)
FN = False Negative (predicted negative, actually positive; Type II error)

from sklearn.metrics import (confusion_matrix, classification_report,
                              accuracy_score, precision_score, recall_score,
                              f1_score, roc_auc_score, average_precision_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

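# Confusion matrix: rows are actual classes, columns are predicted classes.
# scikit-learn sorts the labels, so row/column 0 is class 0 (malignant)
# and row/column 1 is class 1 (benign). The layout is [[TN, FP], [FN, TP]]
# with class 1 as the positive class.
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)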

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
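The diagram at the top of this section lists the positive class first, while scikit-learn's `confusion_matrix` sorts labels in ascending order and so lists the negative class first. The sketch below uses made-up toy labels (not the breast cancer data) to show both layouts and how to unpack the four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration: 1 = positive, 0 = negative
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# Default layout: labels sorted ascending, so the matrix is [[TN, FP], [FN, TP]]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm_toy.ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# labels=[1, 0] gives the positive-first layout: [[TP, FN], [FP, TN]]
cm_pos_first = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 0])
print(cm_pos_first)
```

Unpacking with `ravel()` only works for binary problems, and the order `tn, fp, fn, tp` applies to the default sorted layout.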

The Core Metrics

# Note: in load_breast_cancer, class 1 is "benign", and scikit-learn treats
# class 1 as the positive class by default. Pass pos_label=0 to the scoring
# functions to treat malignant tumors as the positive class instead.

# Accuracy: (TP + TN) / Total
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Precision: TP / (TP + FP), "of the predicted positives, how many were correct?"
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")

# Recall (Sensitivity): TP / (TP + FN), "of the actual positives, how many did we catch?"
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.3f}")

# F1 Score: 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")

# Full report with per-class precision, recall, and F1
print(classification_report(y_test, y_pred,
                             target_names=cancer.target_names))
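To see why F1 uses the harmonic mean rather than a plain average, it helps to work one lopsided case by hand. The counts below are hypothetical, and `f1_from_counts` is a helper written for this illustration, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical model: every positive prediction is right, but it finds
# only 10 of the 100 actual positives
p_demo, r_demo, f1_demo = f1_from_counts(tp=10, fp=0, fn=90)
arithmetic_mean = (p_demo + r_demo) / 2

print(f"Precision={p_demo:.2f}, Recall={r_demo:.2f}")
print(f"Arithmetic mean={arithmetic_mean:.2f}, F1 (harmonic mean)={f1_demo:.2f}")
```

The arithmetic mean comes out at 0.55, which looks respectable, while F1 is about 0.18. The harmonic mean stays close to the weaker of the two numbers, so a model cannot hide poor recall behind perfect precision.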

Understanding the Trade-off

Precision and recall are in tension — improving one often hurts the other.

Low threshold (predict positive aggressively):
  ↑ Recall: catch more true positives
  ↓ Precision: more false positives

High threshold (predict positive conservatively):
  ↑ Precision: predictions are more reliable
  ↓ Recall: miss more true positives

# Control the trade-off with the decision threshold
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1, the positive class

# Default threshold: 0.5
y_pred_default = (y_proba >= 0.5).astype(int)

# Lower threshold: catch more positives (better recall)
y_pred_low = (y_proba >= 0.3).astype(int)

# Higher threshold: be more careful (better precision)  
y_pred_high = (y_proba >= 0.7).astype(int)

for name, preds in [('Default (0.5)', y_pred_default), 
                     ('Low (0.3)', y_pred_low),
                     ('High (0.7)', y_pred_high)]:
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    f = f1_score(y_test, preds)
    print(f"{name}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")

Precision-Recall Curve

The precision-recall curve shows performance across all possible thresholds, which is more informative than a single number.

from sklearn.metrics import precision_recall_curve

# Plural names keep the scalar `precision` and `recall` computed earlier intact
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# F1 at every point on the curve (the small constant avoids division by zero)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-8)
best_idx = np.argmax(f1_scores[:-1])  # The last point has no threshold
best_threshold = thresholds[best_idx]

plt.figure(figsize=(8, 6))
plt.plot(recalls, precisions, 'b-')
plt.axvline(x=recalls[best_idx],
            linestyle='--', color='r', label='Best F1 threshold')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()

# Area under the PR curve (Average Precision)
ap = average_precision_score(y_test, y_proba)
print(f"Average Precision: {ap:.3f}")

# Threshold that maximizes F1 (in practice, tune this on a validation set,
# not on the test set used for the final score)
print(f"Threshold for best F1: {best_threshold:.3f}")
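Maximizing F1 is only one way to pick a threshold. When the requirement is stated as a recall floor ("catch at least 75% of positives"), the same curve can be searched for the highest threshold that still meets it. This is a minimal sketch on made-up labels and scores; `threshold_for_recall` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40,
                         0.55, 0.60, 0.75, 0.80, 0.95])

def threshold_for_recall(y_true, y_score, min_recall):
    """Highest threshold whose recall is still at least min_recall (<= 1.0)."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # recalls[:-1] lines up with thresholds, and recall never rises as the
    # threshold increases, so the last qualifying index is the highest threshold
    qualifying = np.where(recalls[:-1] >= min_recall)[0]
    return thresholds[qualifying[-1]]

t_demo = threshold_for_recall(y_true_demo, y_score_demo, min_recall=0.75)
preds_demo = (y_score_demo >= t_demo).astype(int)

print(f"Threshold: {t_demo:.2f}")
print(f"Recall: {recall_score(y_true_demo, preds_demo):.2f}")
print(f"Precision: {precision_score(y_true_demo, preds_demo):.2f}")
```

Taking the highest qualifying threshold keeps precision as high as the recall constraint allows. As with any threshold tuning, the search belongs on validation data rather than on the final test set.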

ROC AUC: Model Discrimination

The ROC curve plots True Positive Rate vs False Positive Rate. AUC (Area Under the Curve) measures how well the model discriminates between classes.

from sklearn.metrics import roc_curve  # roc_auc_score was imported earlier

fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'r--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()

print(f"ROC AUC: {auc:.3f}")

Interpreting AUC (rough rules of thumb; what counts as good depends on the problem):

  • 0.5: Random (no skill)
  • 0.7: Acceptable
  • 0.8: Good
  • 0.9: Excellent
  • 1.0: Perfect (on real data, suspect data leakage)
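A concrete way to read these numbers: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks that equivalence on a handful of invented scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores
y_true_demo = np.array([0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

pos_scores = y_score_demo[y_true_demo == 1]
neg_scores = y_score_demo[y_true_demo == 0]

# Compare every positive score with every negative score
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

print(f"Pairwise ranking estimate: {pairwise_auc:.3f}")
print(f"roc_auc_score:             {roc_auc_score(y_true_demo, y_score_demo):.3f}")
```

Here the positives win 6 of the 9 positive-negative pairs, so both lines print 0.667. This also explains why AUC ignores the decision threshold: it depends only on how the scores are ordered.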

Metrics for Imbalanced Data

# Simulate an imbalanced dataset (roughly 99% negative, 1% positive)
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=10000,
    weights=[0.99, 0.01],  # 99% class 0, 1% class 1
    random_state=42
)

# A model that predicts everything as negative:
y_all_negative = np.zeros_like(y_imb)

print("Model predicting ALL negative:")
print(f"  Accuracy: {accuracy_score(y_imb, y_all_negative):.3f}")   # about 0.99, misleading!
print(f"  Precision: {precision_score(y_imb, y_all_negative, zero_division=0):.3f}")  # 0
print(f"  Recall: {recall_score(y_imb, y_all_negative):.3f}")        # 0
print(f"  F1: {f1_score(y_imb, y_all_negative, zero_division=0):.3f}")  # 0

For imbalanced problems, use:

  • F1 Score instead of accuracy
  • ROC AUC for model comparison
  • Average Precision when the positive class is rare
  • Balanced accuracy: average recall per class

from sklearn.metrics import balanced_accuracy_score

balanced_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Balanced Accuracy: {balanced_acc:.3f}")
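Balanced accuracy is easiest to understand by computing it by hand for the "predict everything negative" model. The labels below are a made-up 95/5 split, chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true_demo = np.array([0] * 95 + [1] * 5)
y_all_neg_demo = np.zeros_like(y_true_demo)

# Recall for each class separately: [recall of class 0, recall of class 1]
per_class_recall = recall_score(y_true_demo, y_all_neg_demo, average=None)
manual_balanced = per_class_recall.mean()

print(f"Accuracy:          {accuracy_score(y_true_demo, y_all_neg_demo):.2f}")
print(f"Per-class recall:  {per_class_recall}")
print(f"Balanced accuracy: {manual_balanced:.2f}")
print(f"sklearn:           {balanced_accuracy_score(y_true_demo, y_all_neg_demo):.2f}")
```

Accuracy is 0.95, but the per-class recalls are 1.0 and 0.0, so balanced accuracy is 0.5, the same score a coin flip would earn.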

Multiclass Metrics

from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model_multi = RandomForestClassifier(random_state=42)
model_multi.fit(X_train, y_train)
y_pred_multi = model_multi.predict(X_test)

# Averaging strategies for multiclass
print("Macro (treat all classes equally):")
print(f"  Precision: {precision_score(y_test, y_pred_multi, average='macro'):.3f}")
print(f"  Recall: {recall_score(y_test, y_pred_multi, average='macro'):.3f}")
print(f"  F1: {f1_score(y_test, y_pred_multi, average='macro'):.3f}")

print("\nWeighted (weight by class support):")
print(f"  F1: {f1_score(y_test, y_pred_multi, average='weighted'):.3f}")

# Per-class metrics
print("\nFull classification report:")
print(classification_report(y_test, y_pred_multi, target_names=iris.target_names))
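Iris has equally sized classes, so macro and weighted averages come out nearly the same there. The difference shows up when class sizes differ. The sketch below uses invented labels with a dominant class and a rare one, and rebuilds both averages from the per-class F1 scores.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical 3-class labels: class 0 is common, class 2 is rare and never predicted
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

# average=None returns one F1 per class
per_class_f1 = f1_score(y_true_demo, y_pred_demo, average=None, zero_division=0)
support = np.bincount(y_true_demo)  # number of true examples per class

macro_f1 = per_class_f1.mean()                            # every class counts equally
weighted_f1 = np.average(per_class_f1, weights=support)   # weighted by class size

print("Per-class F1:", np.round(per_class_f1, 3))
print(f"Macro F1:    {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

Macro F1 is pulled down hard by the rare class that the model never predicts, while weighted F1 mostly reflects the dominant class. Macro is the better choice when rare classes matter as much as common ones.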

Regression Metrics

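from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Regression needs its own model and a continuous target; the classifiers
# above cannot be reused. The diabetes dataset is an illustrative choice.
diabetes = load_diabetes()
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

# MAE: average absolute error (same units as the target)
mae = mean_absolute_error(y_test_reg, y_pred_reg)

# RMSE: penalizes large errors more heavily
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))

# R²: proportion of variance explained (1 = perfect, 0 = no better than
# always predicting the mean, negative = worse than that)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"MAE: {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")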
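The difference between MAE and RMSE is clearest on two sets of predictions with the same total absolute error. The numbers below are invented so the arithmetic can be checked by hand, and the helper functions are written out in NumPy rather than taken from scikit-learn.

```python
import numpy as np

y_true_demo = np.array([10.0, 12.0, 11.0, 13.0])
pred_even = np.array([12.0, 14.0, 13.0, 15.0])     # every prediction is off by 2
pred_outlier = np.array([10.0, 12.0, 11.0, 21.0])  # three exact, one off by 8

def mae_manual(y, p):
    return np.mean(np.abs(y - p))

def rmse_manual(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

def r2_manual(y, p):
    ss_res = np.sum((y - p) ** 2)          # error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)   # error of always predicting the mean
    return 1 - ss_res / ss_tot

for name, pred in [("Even errors", pred_even), ("One outlier", pred_outlier)]:
    print(f"{name}: MAE={mae_manual(y_true_demo, pred):.1f}, "
          f"RMSE={rmse_manual(y_true_demo, pred):.1f}")

# Predicting the mean for every row gives R² = 0 by definition
mean_baseline = np.full_like(y_true_demo, y_true_demo.mean())
print(f"R² of mean baseline: {r2_manual(y_true_demo, mean_baseline):.1f}")
```

Both prediction sets have an MAE of 2.0, but RMSE rises from 2.0 to 4.0 when the error is concentrated in a single large miss. Prefer RMSE when large errors are disproportionately costly, and MAE when all errors should count in proportion to their size.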

The Business Metric Alignment

Every ML metric needs to connect to a real cost.

Medical diagnosis:
  FN (missing cancer) >> FP (false alarm) → Maximize Recall
  Acceptable: low precision, high recall

Spam filter:
  FP (blocking real email) >> FN (spam gets through) → Maximize Precision
  Acceptable: some spam passes, real email always delivered

Fraud detection:
  Balance both → F1 or Custom cost function
  Cost(FP) = cost of inconveniencing legitimate customer
  Cost(FN) = cost of fraudulent transaction

Credit scoring:
  Use probability scores, not just classes
  Business sets threshold based on acceptable risk level

The metric you optimize during training should reflect real-world costs — not just the standard default.
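One way to make the cost framing above concrete is to assign a cost to each error type and choose the threshold that minimizes total cost. Everything in this sketch is an assumption made for illustration: the cost figures, the labels and scores, and the `pick_threshold` helper.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40,
                         0.55, 0.60, 0.75, 0.80, 0.95])

def total_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Total cost of the errors made at a given decision threshold."""
    y_pred = (y_score >= threshold).astype(int)
    n_fp = np.sum((y_pred == 1) & (y_true == 0))
    n_fn = np.sum((y_pred == 0) & (y_true == 1))
    return n_fp * cost_fp + n_fn * cost_fn

def pick_threshold(y_true, y_score, cost_fp, cost_fn):
    """Try each observed score as a threshold and return the cheapest one."""
    candidates = np.unique(y_score)
    costs = [total_cost(y_true, y_score, t, cost_fp, cost_fn) for t in candidates]
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Missing a positive is far more expensive (diagnosis-style costs, made up)
t_recall, c_recall = pick_threshold(y_true_demo, y_score_demo, cost_fp=5, cost_fn=200)
# A false alarm is far more expensive (spam-filter-style costs, made up)
t_precision, c_precision = pick_threshold(y_true_demo, y_score_demo, cost_fp=200, cost_fn=5)

print(f"Costly FN: threshold={t_recall:.2f}, total cost={c_recall}")
print(f"Costly FP: threshold={t_precision:.2f}, total cost={c_precision}")
```

With expensive false negatives the chosen threshold drops to 0.40, which catches every positive. With expensive false positives it rises to 0.95, which produces no false alarms. The same model serves both cases, and only the threshold changes with the costs.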

Next lesson: Cross-Validation Techniques — evaluating models reliably on limited data.
