
Anomaly Detection: Finding What Doesn't Belong

Anomaly detection — also called outlier detection — is about identifying data points that deviate significantly from expected patterns. It powers fraud detection, network intrusion systems, quality control in manufacturing, and predictive maintenance.

The core challenge: you often have very few examples of anomalies, or none at all during training. You're finding needles in haystacks, often without ever having seen a needle before.

Types of Anomaly Detection Problems

Supervised: You have labeled anomaly examples
  Rare fraud transactions labeled as fraud
  → Treat as imbalanced classification

Semi-supervised: You have only normal examples
  Train on normal behavior; anything different is anomalous
  → One-class SVM, Isolation Forest, Autoencoders

Unsupervised: No labels at all
  Find patterns that don't fit the majority
  → Statistical methods, clustering-based

Most real-world anomaly detection is semi-supervised or unsupervised — anomalies are rare by definition, so labeled datasets are hard to build.
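
When labels do exist, the supervised case can be sketched as ordinary classification with class reweighting. The example below is a minimal illustration on synthetic data, not a recommended pipeline: the data, the variable names, and the choice of logistic regression are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 2,000 normal points and 40 labeled anomalies
X_norm = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
X_anom = rng.normal(loc=3.0, scale=1.0, size=(40, 2))
X_sup = np.vstack([X_norm, X_anom])
y_sup = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_sup, y_sup, test_size=0.25, stratify=y_sup, random_state=0
)

# class_weight='balanced' weights errors inversely to class frequency,
# so the rare anomaly class is not drowned out by the majority
clf = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)
anomaly_recall = recall_score(y_te, clf.predict(X_te))
print(f"Recall on the anomaly class: {anomaly_recall:.2f}")
```

Recall on the rare class is usually a more informative number here than accuracy, which stays high even if every anomaly is missed.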

Statistical Approaches

Z-Score Method

import numpy as np
import pandas as pd
from scipy import stats

# Simple univariate outlier detection
data = np.array([10, 12, 11, 14, 13, 15, 12, 100, 11, 13])  # 100 is the outlier

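z_scores = np.abs(stats.zscore(data))
# With only 10 points, |z| can never exceed sqrt(n - 1) = 3 (population std),
# so a cutoff of 3.0 would flag nothing here; the value 100 scores just under 3
threshold = 2.5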

outliers = np.where(z_scores > threshold)[0]
print(f"Outliers at indices: {outliers}")
print(f"Outlier values: {data[outliers]}")

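The z-score assumes roughly Gaussian data and breaks down for skewed distributions. It is also sensitive to the outliers themselves: they inflate the mean and standard deviation the score is computed from, which matters most in small samples.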
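
One common way around that sensitivity is a "modified" z-score built on the median and the median absolute deviation (MAD), which a single extreme value barely moves. The sketch below is a swapped-in variant, not part of the lesson's original code; the function name `modified_z_scores` and the 3.5 cutoff are assumptions (3.5 is a frequently quoted rule of thumb).

```python
import numpy as np

def modified_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Degenerate case: more than half of the values are identical
        return np.zeros_like(values)
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # when the data is normally distributed
    return 0.6745 * (values - median) / mad

sample = np.array([10, 12, 11, 14, 13, 15, 12, 100, 11, 13])
robust_z = np.abs(modified_z_scores(sample))
robust_outliers = np.where(robust_z > 3.5)[0]
print(f"Robust outliers at indices: {robust_outliers}")
```

On this sample the value 100 gets a robust score far above the cutoff, while every other point stays close to 1 or below.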

IQR Method (Robust to Skew)

def iqr_outliers(data, factor=1.5):
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    
    return (data < lower) | (data > upper)

# Detect in each feature of a DataFrame
np.random.seed(0)  # Fixed seed so the reported counts are reproducible
df = pd.DataFrame(np.random.randn(100, 5), columns=[f'feat_{i}' for i in range(5)])
df.iloc[5, 0] = 10  # Add outlier

for col in df.columns:
    mask = iqr_outliers(df[col].values)
    if mask.any():
        print(f"{col}: {mask.sum()} outliers detected")

Isolation Forest

Isolation Forest is one of the most practical algorithms for anomaly detection on high-dimensional tabular data. The key insight: in an ensemble of randomly built trees, anomalies are isolated after fewer splits than normal points.

Normal point: buried in dense regions → needs many splits to isolate
Anomalous point: in sparse regions → isolated quickly with few splits

Score = average depth needed to isolate the point
Short average depth → anomaly
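
The depth intuition above can be sketched in a few lines of NumPy. This is a deliberately simplified, one-dimensional toy, not the scikit-learn implementation: the real algorithm also picks a random feature at each split and builds each tree on a subsample. The helper names `isolation_depth` and `average_depth` are made up for the illustration.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    current = np.asarray(values, dtype=float)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = current.min(), current.max()
        if lo == hi:
            break  # Remaining points are identical and cannot be split
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = current[current < split]
        else:
            current = current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=200)
points = np.append(cluster, 8.0)  # 8.0 sits far away from the cluster

def average_depth(target, n_trees=200):
    """Average the isolation depth over many independent random 'trees'."""
    return np.mean([isolation_depth(points, target, rng) for _ in range(n_trees)])

depth_outlier = average_depth(8.0)
depth_inlier = average_depth(points[0])
print(f"Average depth for the far point: {depth_outlier:.1f}")
print(f"Average depth for a cluster point: {depth_inlier:.1f}")
```

The far point is typically cut off within a handful of splits, while a point inside the cluster needs noticeably more.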

from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

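# Generate data with anomalies
# Center the cluster at the origin so the uniform anomalies scatter around it
# (with centers=1, make_blobs draws a random center that can land far from this range)
X_normal, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=0.5, random_state=42)
rng = np.random.default_rng(42)  # Seeded so the example is reproducible
X_anomalies = rng.uniform(low=-4, high=4, size=(30, 2))
X = np.vstack([X_normal, X_anomalies])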

# Fit Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected proportion of anomalies
    random_state=42
)
predictions = iso_forest.fit_predict(X)  # 1 = normal, -1 = anomaly

# Get anomaly scores
scores = iso_forest.score_samples(X)  # More negative = more anomalous

# Results
n_anomalies = (predictions == -1).sum()
print(f"Detected {n_anomalies} anomalies")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1], 
            c='blue', label='Normal', alpha=0.6)
plt.scatter(X[predictions == -1, 0], X[predictions == -1, 1], 
            c='red', marker='x', s=100, label='Anomaly', linewidths=2)
plt.legend()
plt.title('Isolation Forest: Anomaly Detection')
plt.show()
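
It can help to see how `contamination` turns scores into labels. As far as the scikit-learn API goes, `decision_function` is `score_samples` shifted by the fitted `offset_`, and `predict` flags the points whose shifted score is negative. The sketch below checks that relationship on synthetic data; the data and variable names are assumptions for the illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(300, 2)),   # dense cluster
    rng.uniform(-4.0, 4.0, size=(30, 2)),  # scattered points
])

forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
forest.fit(X_demo)

raw_scores = forest.score_samples(X_demo)   # lower = more anomalous
shifted = forest.decision_function(X_demo)  # raw_scores minus forest.offset_
labels = forest.predict(X_demo)             # -1 where the shifted score is negative

flagged_fraction = np.mean(labels == -1)
print(f"offset_: {forest.offset_:.3f}, flagged fraction: {flagged_fraction:.3f}")
```

In other words, `contamination` does not change the scores themselves, only where the cut is placed, so roughly that fraction of the training data ends up flagged.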

One-Class SVM

Learns a boundary around normal data; anything outside the boundary is an anomaly.

from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# One-Class SVM with an RBF kernel is sensitive to feature scale, so standardize first
ocsvm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', OneClassSVM(
        kernel='rbf',
        gamma='scale',
        nu=0.05  # Upper bound on the fraction of training points treated as outliers
    ))
])

# Train ONLY on normal data
ocsvm.fit(X_normal)

# Predict on mixed data
predictions_svm = ocsvm.predict(X)  # 1 = normal, -1 = anomaly

print(f"One-Class SVM detected {(predictions_svm == -1).sum()} anomalies")

One-Class SVM is powerful but slow on large datasets and sensitive to the nu and gamma parameters.
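
To get a feel for that sensitivity, a small sweep over `nu` shows how the share of flagged training points moves with it. This is a rough sketch on synthetic data; the values tried and the names `X_train_normal` and `flagged_by_nu` are assumptions for the illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train_normal = rng.normal(0.0, 1.0, size=(500, 2))  # illustrative "normal" data

flagged_by_nu = {}
for nu in (0.01, 0.05, 0.20):
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel='rbf', gamma='scale', nu=nu),
    )
    model.fit(X_train_normal)
    # Fraction of the (all-normal) training set that ends up outside the boundary
    flagged_by_nu[nu] = float(np.mean(model.predict(X_train_normal) == -1))
    print(f"nu={nu:.2f}: {flagged_by_nu[nu]:.1%} of training points flagged")
```

Since the training data here is entirely normal, every flagged point is a false alarm, so `nu` works much like a false positive budget.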

Local Outlier Factor (LOF)

LOF compares the local density of each point to its neighbors. Points with much lower density than their neighbors are anomalies.

from sklearn.neighbors import LocalOutlierFactor

# LOF scores each point relative to its neighbors, so it can catch local outliers
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,
    novelty=False  # Use False for fit_predict, True for predict on new data
)

predictions_lof = lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_  # Higher = more anomalous

print(f"LOF detected {(predictions_lof == -1).sum()} anomalies")

LOF excels at finding local anomalies — points that are unusual for their neighborhood even if they aren't global outliers.
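
With `novelty=False`, LOF can only label the data it was fitted on. To score unseen points, scikit-learn's `LocalOutlierFactor` is fitted with `novelty=True`, after which `predict` and `score_samples` are available for new data. A minimal sketch, assuming the reference data is clean; the arrays and names are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_ref = rng.normal(0.0, 1.0, size=(300, 2))  # reference data assumed to be normal
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])  # one typical point, one far away

lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_ref)

new_labels = lof_novelty.predict(X_new)        # 1 = inlier, -1 = outlier
new_scores = lof_novelty.score_samples(X_new)  # lower = more anomalous
print(new_labels, new_scores)
```

In novelty mode, `predict` is meant for new observations only; calling it on the training data itself gives misleading results.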

Real-World Application: Credit Card Fraud Detection

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.ensemble import IsolationForest

# Simulate imbalanced fraud dataset
np.random.seed(42)
n_normal = 10000
n_fraud = 200  # About 2% fraud; real card fraud is usually far rarer

# Normal transactions: clustered around typical values
X_normal_tx = np.random.multivariate_normal(
    mean=[0, 0, 0], cov=np.eye(3), size=n_normal
)
y_normal = np.zeros(n_normal)

# Fraudulent transactions: unusual patterns
X_fraud_tx = np.random.multivariate_normal(
    mean=[3, -2, 4], cov=np.eye(3) * 2, size=n_fraud
)
y_fraud = np.ones(n_fraud)

X_all = np.vstack([X_normal_tx, X_fraud_tx])
y_all = np.concatenate([y_normal, y_fraud])

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# Fit Isolation Forest on the unlabeled training data, which contains both classes
# (fitting on known-normal rows only is an alternative when labels can be trusted)
iso = IsolationForest(contamination=0.02, random_state=42)
iso.fit(X_train)

# Predict: Isolation Forest returns 1 (normal) or -1 (anomaly)
y_pred_raw = iso.predict(X_test)
y_pred = (y_pred_raw == -1).astype(int)  # Convert to 0/1

print("Fraud Detection Results:")
print(classification_report(y_test.astype(int), y_pred, 
                             target_names=['Normal', 'Fraud']))

# Get anomaly scores for ROC curve
scores = -iso.score_samples(X_test)
roc_auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {roc_auc:.3f}")

Autoencoder-Based Anomaly Detection

For complex, high-dimensional data (images, time series), autoencoders learn to compress and reconstruct normal data. High reconstruction error = anomaly.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, input_dim)
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

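def train_autoencoder(X_normal_tensor, epochs=50, encoding_dim=2):
    # The bottleneck must be narrower than the input for the model to compress;
    # the class default of 8 would be wider than the 3 features used in this lesson
    model = Autoencoder(input_dim=X_normal_tensor.shape[1], encoding_dim=encoding_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()
    
    # Full-batch training: one gradient step per epoch on all normal samples
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        output = model(X_normal_tensor)
        loss = criterion(output, X_normal_tensor)
        loss.backward()
        optimizer.step()
    
    model.eval()
    return model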

def reconstruction_error(model, X_tensor):
    with torch.no_grad():
        output = model(X_tensor)
        errors = torch.mean((output - X_tensor) ** 2, dim=1)
    return errors.numpy()

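# Usage: train on normal transactions only
X_tensor = torch.FloatTensor(X_normal_tx[:100])
model = train_autoencoder(X_tensor)

# Threshold: 95th percentile of reconstruction error on the normal training data
train_errors = reconstruction_error(model, X_tensor)
threshold = np.percentile(train_errors, 95)

# Score unseen normal and fraudulent transactions (X_all[:500] holds only normal rows)
X_eval = np.vstack([X_normal_tx[100:500], X_fraud_tx])
errors = reconstruction_error(model, torch.FloatTensor(X_eval))
anomalies = errors > threshold  # High reconstruction error indicates anomaly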

Evaluating Anomaly Detection

from sklearn.metrics import average_precision_score, roc_auc_score

# If you have labels (lucky!)

scores = -iso.score_samples(X_test)
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average Precision: {avg_precision:.3f}")

# Set threshold based on acceptable false positive rate
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
# fpr is non-decreasing, so the last index with FPR <= 1% gives the highest recall
# within that budget (np.argmax would return index 0, a placeholder threshold)
idx = np.where(fpr <= 0.01)[0][-1]
best_threshold = thresholds[idx]
print(f"Threshold for <=1% FPR: {best_threshold:.3f} (TPR: {tpr[idx]:.3f})")
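
When anomalies are very rare, a threshold chosen from the precision-recall curve is often easier to reason about than one chosen from the ROC curve. The sketch below picks the most precise threshold that still reaches a target recall. It is an illustration only: the synthetic data, the 80% target, and the variable names are assumptions, and the detector is scored on the same data it was fitted on to keep the example short.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X_eval = np.vstack([
    rng.normal(0.0, 1.0, size=(2000, 3)),              # normal points
    rng.normal([3.0, -2.0, 4.0], 1.4, size=(40, 3)),   # anomalies
])
y_eval = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])

detector = IsolationForest(random_state=0).fit(X_eval)
anomaly_scores = -detector.score_samples(X_eval)  # higher = more anomalous

precision, recall, pr_thresholds = precision_recall_curve(y_eval, anomaly_scores)

# precision and recall have one more entry than pr_thresholds, so drop the last one
target_recall = 0.80
candidates = np.where(recall[:-1] >= target_recall)[0]
best = candidates[np.argmax(precision[:-1][candidates])]
print(f"threshold={pr_thresholds[best]:.3f}, "
      f"precision={precision[best]:.2f}, recall={recall[best]:.2f}")
```

Points with `anomaly_scores >= pr_thresholds[best]` would then be flagged; in practice the threshold should be chosen on a validation split rather than on the training data.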

Choosing the Right Algorithm

Algorithm	Speed	Scalability	Use When
IQR/Z-score	Very fast	Excellent	Simple univariate outliers
Isolation Forest	Fast	Excellent	General purpose, large datasets
One-Class SVM	Slow	Poor	Small datasets, non-linear boundaries
LOF	Medium	Medium	Local outliers, varying density
Autoencoder	Variable	Good	Images, time series, complex patterns

Isolation Forest is a sensible default for most production anomaly detection problems on tabular data: it is fast, scales to large datasets, and needs little tuning.
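
If labels are available for even a small sample, the candidates can be compared directly before settling on one. The harness below is a rough sketch on synthetic data, assuming scores are negated so that higher means more anomalous; the dataset and the helper name `anomaly_scores_for` are made up for the illustration, and the ranking it produces will not necessarily carry over to other data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_cmp = np.vstack([
    rng.normal(0.0, 1.0, size=(1000, 2)),   # normal cluster
    rng.uniform(-6.0, 6.0, size=(50, 2)),   # scattered anomalies
])
y_cmp = np.concatenate([np.zeros(1000, dtype=int), np.ones(50, dtype=int)])

def anomaly_scores_for(name):
    """Fit the named detector on X_cmp and return scores (higher = more anomalous)."""
    if name == "IsolationForest":
        return -IsolationForest(random_state=0).fit(X_cmp).score_samples(X_cmp)
    if name == "OneClassSVM":
        return -OneClassSVM(gamma='scale', nu=0.05).fit(X_cmp).score_samples(X_cmp)
    if name == "LOF":
        return -LocalOutlierFactor(n_neighbors=20).fit(X_cmp).negative_outlier_factor_
    raise ValueError(f"Unknown detector: {name}")

auc_by_model = {
    name: roc_auc_score(y_cmp, anomaly_scores_for(name))
    for name in ("IsolationForest", "OneClassSVM", "LOF")
}
for name, auc in auc_by_model.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```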

Next lesson: Backpropagation Explained — understanding how neural networks actually learn.