Anomaly Detection: Finding What Doesn't Belong
Anomaly detection — also called outlier detection — is about identifying data points that deviate significantly from expected patterns. It powers fraud detection, network intrusion systems, quality control in manufacturing, and predictive maintenance.
The core challenge: you often have very few examples of anomalies, or none at all during training. You're finding needles in haystacks, often without ever having seen a needle before.
Types of Anomaly Detection Problems
Supervised: You have labeled anomaly examples
    Rare fraud transactions labeled as fraud
    → Treat as imbalanced classification
Semi-supervised: You have only normal examples
    Train on normal behavior; anything different is anomalous
    → One-class SVM, Isolation Forest, Autoencoders
Unsupervised: No labels at all
    Find patterns that don't fit the majority
    → Statistical methods, clustering-based
Most real-world anomaly detection is semi-supervised or unsupervised — anomalies are rare by definition, so labeled datasets are hard to build.
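For the supervised case, "treat as imbalanced classification" can be sketched as follows. This is a minimal illustration on synthetic data, not a recommended fraud model: the data, the choice of logistic regression, and names such as `X_lab` and `clf` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for labeled data: 2000 normal points, 40 labeled anomalies
X_norm = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
X_anom = rng.normal(loc=3.0, scale=1.0, size=(40, 2))
X_lab = np.vstack([X_norm, X_anom])
y_lab = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

# Stratify so the rare class appears in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_lab, y_lab, test_size=0.25, stratify=y_lab, random_state=0
)

# class_weight='balanced' weights each class inversely to its frequency,
# so mistakes on the rare class cost more during training
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_tr, y_tr)

anomaly_recall = recall_score(y_te, clf.predict(X_te))
print(f"Recall on the rare class: {anomaly_recall:.2f}")
```

Recall on the rare class is reported instead of accuracy because a model that predicts "normal" for everything would already be about 98% accurate on this data.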
Statistical Approaches
Z-Score Method
import numpy as np
import pandas as pd
from scipy import stats
# Simple univariate outlier detection
data = np.array([10, 12, 11, 14, 13, 15, 12, 100, 11, 13]) # 100 is the outlier
z_scores = np.abs(stats.zscore(data))
threshold = 2.5 # With n = 10 points, |z| cannot exceed sqrt(n - 1) = 3, so a cutoff of 3.0 would flag nothing here
outliers = np.where(z_scores > threshold)[0]
print(f"Outliers at indices: {outliers}")
print(f"Outlier values: {data[outliers]}")
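Z-score works for roughly Gaussian data and breaks down for skewed distributions. It is also fragile in small samples: an extreme value inflates the mean and standard deviation it is measured against, which can hide the outlier.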
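One common way around this is to replace the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below assumes the conventional 0.6745 scaling constant and a cutoff of 3.5; both are rules of thumb, and `modified_z_scores` is a helper name made up for this example.

```python
import numpy as np

def modified_z_scores(values):
    """Median/MAD-based z-scores; 0.6745 rescales the MAD to match a Gaussian std."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero: at least half of the values equal the median")
    return 0.6745 * (values - median) / mad

sample = np.array([10, 12, 11, 14, 13, 15, 12, 100, 11, 13])
robust_outliers = np.where(np.abs(modified_z_scores(sample)) > 3.5)[0]
print(f"Robust outliers at indices: {robust_outliers}")
```

Because the median and MAD barely move when a single extreme value is added, the outlier cannot mask itself the way it can with the ordinary z-score.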
IQR Method (Robust to Skew)
def iqr_outliers(data, factor=1.5):
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    return (data < lower) | (data > upper)
# Detect in each feature of a DataFrame
df = pd.DataFrame(np.random.randn(100, 5), columns=[f'feat_{i}' for i in range(5)])
df.iloc[5, 0] = 10 # Add outlier
for col in df.columns:
    mask = iqr_outliers(df[col].values)
    if mask.any():
        print(f"{col}: {mask.sum()} outliers detected")
Isolation Forest
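A practical and widely used algorithm for high-dimensional anomaly detection. The key insight: in a forest of randomly built trees, anomalies are separated from the rest of the data after only a few random splits, so they end up on short paths.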
Normal point: buried in dense regions → needs many splits to isolate
Anomalous point: in sparse regions → isolated quickly with few splits
Score = average depth needed to isolate the point
Short average depth → anomaly
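The intuition above can be checked with a toy one-dimensional simulation. This is a simplified sketch, not scikit-learn's implementation: a real isolation forest also picks a random feature at each split and builds each tree on a subsample. `isolation_depth` is a hypothetical helper written for this example.

```python
import numpy as np

def isolation_depth(values, target_idx, rng, max_depth=50):
    """Count the random splits needed before values[target_idx] is alone in its partition."""
    idx = np.arange(len(values))
    depth = 0
    while len(idx) > 1 and depth < max_depth:
        lo, hi = values[idx].min(), values[idx].max()
        if lo == hi:  # remaining points are identical and cannot be separated
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if values[target_idx] < split:
            idx = idx[values[idx] < split]
        else:
            idx = idx[values[idx] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
values = np.array([10, 12, 11, 14, 13, 15, 12, 100, 11, 13], dtype=float)

# Average over many random "trees", as an isolation forest does
depth_outlier = np.mean([isolation_depth(values, 7, rng) for _ in range(200)])  # value 100
depth_normal = np.mean([isolation_depth(values, 3, rng) for _ in range(200)])   # value 14
print(f"Average depth for 100: {depth_outlier:.2f}")
print(f"Average depth for 14:  {depth_normal:.2f}")
```

The value 100 is usually cut off by the very first split, while 14 needs at least two splits to be separated from its neighbors 13 and 15.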
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate data with anomalies
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_anomalies = np.random.uniform(low=-4, high=4, size=(30, 2))
X = np.vstack([X_normal, X_anomalies])
# Fit Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1, # Expected proportion of anomalies
    random_state=42
)
predictions = iso_forest.fit_predict(X) # 1 = normal, -1 = anomaly
# Get anomaly scores
scores = iso_forest.score_samples(X) # More negative = more anomalous
# Results
n_anomalies = (predictions == -1).sum()
print(f"Detected {n_anomalies} anomalies")
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1],
            c='blue', label='Normal', alpha=0.6)
plt.scatter(X[predictions == -1, 0], X[predictions == -1, 1],
            c='red', marker='x', s=100, label='Anomaly', linewidths=2)
plt.legend()
plt.title('Isolation Forest: Anomaly Detection')
plt.show()
One-Class SVM
Learns a boundary around normal data; anything outside the boundary is an anomaly.
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# The RBF kernel is distance-based, so features should be on a comparable scale
ocsvm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', OneClassSVM(
        kernel='rbf',
        gamma='scale',
        nu=0.05 # Upper bound on the fraction of training points treated as outliers
    ))
])
# Train ONLY on normal data
ocsvm.fit(X_normal)
# Predict on mixed data
predictions_svm = ocsvm.predict(X) # 1 = normal, -1 = anomaly
print(f"One-Class SVM detected {(predictions_svm == -1).sum()} anomalies")
One-Class SVM is powerful but slow on large datasets and sensitive to the nu and gamma parameters.
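The sensitivity to nu is easy to see by refitting with a few values and counting how many training points end up outside the boundary. This is a small sketch on synthetic Gaussian data; `X_train_demo` and the chosen nu values are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(500, 2))  # synthetic "normal" data for this demo

flagged_fraction = {}
for nu in (0.01, 0.05, 0.2):
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel='rbf', gamma='scale', nu=nu),
    )
    model.fit(X_train_demo)
    # Fraction of the training data that falls outside the learned boundary
    flagged_fraction[nu] = float((model.predict(X_train_demo) == -1).mean())
    print(f"nu={nu}: {flagged_fraction[nu]:.1%} of training points flagged")
```

The flagged share tracks nu closely, so nu is effectively a statement about how much of the training data you are willing to call anomalous.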
Local Outlier Factor (LOF)
LOF compares the local density of each point to its neighbors. Points with much lower density than their neighbors are anomalies.
from sklearn.neighbors import LocalOutlierFactor
# LOF can also detect contextual anomalies (local outliers)
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,
    novelty=False # Use False for fit_predict, True for predict on new data
)
predictions_lof = lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_ # Higher = more anomalous
print(f"LOF detected {(predictions_lof == -1).sum()} anomalies")
LOF excels at finding local anomalies — points that are unusual for their neighborhood even if they aren't global outliers.
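A small constructed example makes the local/global distinction concrete. The sketch below assumes two synthetic clusters of very different density and a hand-placed `probe` point; all names and values are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
tight = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))    # dense cluster
loose = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(100, 2))  # sparse cluster
probe = np.array([[1.0, 1.0]])  # near the dense cluster, but well outside it
X_demo = np.vstack([tight, loose, probe])

# Global view: per-feature z-score of the probe against the whole dataset
global_z = np.abs((probe - X_demo.mean(axis=0)) / X_demo.std(axis=0)).max()

# Local view: density of the probe relative to its 20 nearest neighbors
lof_demo = LocalOutlierFactor(n_neighbors=20)
labels = lof_demo.fit_predict(X_demo)
probe_lof = -lof_demo.negative_outlier_factor_[-1]

print(f"Max global |z| of probe: {global_z:.2f}")
print(f"LOF of probe: {probe_lof:.1f}, label: {labels[-1]}")
```

Measured against the whole dataset the probe looks unremarkable, but its neighbors in the tight cluster are packed far more densely than it is, so LOF flags it.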
Real-World Application: Credit Card Fraud Detection
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.ensemble import IsolationForest
# Simulate imbalanced fraud dataset
np.random.seed(42)
n_normal = 10000
n_fraud = 200 # ~2% fraud rate; real fraud rates are often much lower
# Normal transactions: clustered around typical values
X_normal_tx = np.random.multivariate_normal(
    mean=[0, 0, 0], cov=np.eye(3), size=n_normal
)
y_normal = np.zeros(n_normal)
# Fraudulent transactions: unusual patterns
X_fraud_tx = np.random.multivariate_normal(
    mean=[3, -2, 4], cov=np.eye(3) * 2, size=n_fraud
)
y_fraud = np.ones(n_fraud)
X_all = np.vstack([X_normal_tx, X_fraud_tx])
y_all = np.concatenate([y_normal, y_fraud])
# Split (stratified so both splits keep the same fraud rate)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)
# Fit Isolation Forest on the full training split (both classes; the labels are never used)
iso = IsolationForest(contamination=0.02, random_state=42)
iso.fit(X_train)
# Predict: Isolation Forest returns 1 (normal) or -1 (anomaly)
y_pred_raw = iso.predict(X_test)
y_pred = (y_pred_raw == -1).astype(int) # Convert to 0/1
print("Fraud Detection Results:")
print(classification_report(y_test.astype(int), y_pred,
                            target_names=['Normal', 'Fraud']))
# Get anomaly scores for ROC curve
scores = -iso.score_samples(X_test)
roc_auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}")
Autoencoder-Based Anomaly Detection
For complex, high-dimensional data (images, time series), autoencoders learn to compress and reconstruct normal data. High reconstruction error = anomaly.
import torch
import torch.nn as nn
class Autoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, input_dim)
        )
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
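def train_autoencoder(X_normal_tensor, encoding_dim=8, epochs=50):
    model = Autoencoder(input_dim=X_normal_tensor.shape[1], encoding_dim=encoding_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()
    # Full-batch training: one gradient step per epoch
    for epoch in range(epochs):
        optimizer.zero_grad()
        output = model(X_normal_tensor)
        loss = criterion(output, X_normal_tensor)
        loss.backward()
        optimizer.step()
    return model
def reconstruction_error(model, X_tensor):
    # Per-sample mean squared reconstruction error
    with torch.no_grad():
        output = model(X_tensor)
        errors = torch.mean((output - X_tensor) ** 2, dim=1)
    return errors.numpy()
# Usage: train on normal transactions only
X_tensor = torch.FloatTensor(X_normal_tx[:100])
# The data has 3 features, so the bottleneck must be narrower than 3 to force compression
model = train_autoencoder(X_tensor, encoding_dim=2)
# Set the threshold on held-out normal data the model has not seen
holdout_errors = reconstruction_error(model, torch.FloatTensor(X_normal_tx[100:500]))
threshold = np.percentile(holdout_errors, 95) # 95th percentile of normal errors
# Score a mix of unseen normal and fraudulent transactions
X_eval = np.vstack([X_normal_tx[500:900], X_fraud_tx])
errors = reconstruction_error(model, torch.FloatTensor(X_eval))
# High reconstruction error indicates anomaly
anomalies = errors > threshold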
Evaluating Anomaly Detection
from sklearn.metrics import precision_recall_curve, average_precision_score
# If you have labels (lucky!)
from sklearn.metrics import roc_auc_score
scores = -iso.score_samples(X_test)
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average Precision: {avg_precision:.3f}")
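# Set threshold based on acceptable false positive rate
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
# Choose the most permissive threshold whose FPR is still acceptable (e.g., <=1%).
# fpr is sorted ascending and starts at 0, so take the last index that qualifies;
# the first index always has FPR 0 and corresponds to flagging nothing.
idx = np.searchsorted(fpr, 0.01, side='right') - 1
best_threshold = thresholds[idx]
print(f"Threshold for <=1% FPR: {best_threshold:.3f} (TPR: {tpr[idx]:.3f})")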
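When anomalies are rare, precision is often the more useful constraint, because even a 1% false positive rate can swamp the true detections. The sketch below picks a threshold from the precision-recall curve instead. It uses synthetic labels and scores so it runs on its own; `target_precision = 0.8` and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic labels and anomaly scores: 5% anomalies, which score higher on average
y_demo = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])
scores_demo = np.concatenate([rng.normal(0, 1, 950), rng.normal(3, 1, 50)])

precision, recall, pr_thresholds = precision_recall_curve(y_demo, scores_demo)

# precision/recall have one more entry than pr_thresholds; drop the final (recall = 0) point
target_precision = 0.8
meets_target = precision[:-1] >= target_precision
if meets_target.any():
    # Among thresholds that reach the target precision, keep the one with the highest recall
    best = np.argmax(np.where(meets_target, recall[:-1], -1.0))
    pr_threshold = pr_thresholds[best]
    flagged = scores_demo >= pr_threshold
    print(f"Threshold {pr_threshold:.3f}: "
          f"precision {precision[best]:.2f}, recall {recall[best]:.2f}")
```

With real data, `y_demo` and `scores_demo` would be replaced by the held-out labels and the detector's anomaly scores.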
Choosing the Right Algorithm
| Algorithm | Speed | Scalability | Use When |
|---|---|---|---|
| IQR/Z-score | Very fast | Excellent | Simple univariate outliers |
| Isolation Forest | Fast | Excellent | General purpose, large datasets |
| One-Class SVM | Slow | Poor | Small datasets, non-linear boundaries |
| LOF | Medium | Medium | Local outliers, varying density |
| Autoencoder | Variable | Good | Images, time series, complex patterns |
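Isolation Forest is a sensible default for many production anomaly detection problems: it is fast, scales to large datasets, and performs well across a wide range of data. Compare it against at least one alternative from the table before committing, since no single detector wins everywhere.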
Next lesson: Backpropagation Explained — understanding how neural networks actually learn.