AiTechWorlds
Imagine you have a table covered with red and blue marbles, scattered randomly. Your job is to draw a single straight line between them so that all red marbles are on one side and all blue on the other. There are many possible lines that could work — so which one do you choose?
A nervous person might draw the line just barely separating them, so close to some marbles that a tiny nudge would cause a mistake. A confident separator draws the line exactly in the middle, as far away from both groups as possible. That confidence gap — the widest possible corridor between the two groups — is the margin. Support Vector Machines always find the most confident line: the one with the maximum margin.
Most classifiers just find a boundary. SVM finds the best boundary: the one that is farthest from both classes simultaneously.
This matters because a decision boundary that hugs the data too closely will misclassify new points after even small perturbations. A wide margin acts like a safety buffer.
Given labeled data points, SVM finds a hyperplane (a line in 2D, a plane in 3D, a higher-dimensional surface beyond that) that separates the two classes and lies as far as possible from the nearest points of each class.
Those nearest points are called support vectors — they are the only data points that actually define the boundary. Remove any other point and the boundary stays the same. Remove a support vector and it shifts. This is why the method is named after them.
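To make this concrete, here is a minimal sketch using scikit-learn's `SVC` on a four-point toy dataset. The points, the helper `fit_boundary`, and the very large `C` (used to approximate a hard margin) are assumptions chosen for illustration. The sketch refits the model once after dropping a non-support point and once after dropping a support vector.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class, linearly separable.
X = np.array([[0.0, 0.0], [-2.0, 1.0],   # class 0
              [2.0, 0.0], [4.0, -1.0]])  # class 1
y = np.array([0, 0, 1, 1])

def fit_boundary(X, y):
    """Fit a linear SVM; a very large C approximates a hard margin."""
    model = SVC(kernel="linear", C=1e6).fit(X, y)
    # Return the model and the boundary parameters (w1, w2, b).
    return model, np.append(model.coef_[0], model.intercept_[0])

model, full = fit_boundary(X, y)
support_idx = set(model.support_.tolist())  # indices of the support vectors
print("Support vectors:\n", model.support_vectors_)

# Drop a point that is NOT a support vector: the boundary should not move.
non_support = next(i for i in range(len(X)) if i not in support_idx)
_, without_other = fit_boundary(np.delete(X, non_support, axis=0),
                                np.delete(y, non_support))

# Drop a support vector: the boundary should shift.
support = model.support_[0]
_, without_sv = fit_boundary(np.delete(X, support, axis=0),
                             np.delete(y, support))

print("Unchanged after removing a non-support point:",
      np.allclose(full, without_other, atol=1e-2))
print("Changed after removing a support vector:",
      not np.allclose(full, without_sv, atol=1e-2))
```

The comparison uses a small tolerance because the solver stops at an approximate optimum rather than an exact one.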
Mathematically:
The hyperplane is defined as:
w · x + b = 0
Where w is the weight vector (normal to the hyperplane) and b is the bias. The margin width equals 2 / ||w||. Maximizing the margin means minimizing ||w||, subject to all points being correctly classified.
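The margin formula can be checked numerically. The sketch below assumes a tiny hand-made dataset whose widest corridor is known to be 2 units wide, and uses a very large `C` in scikit-learn's `SVC` to approximate the hard-margin case. It also checks the usual canonical form of "correctly classified", namely y_i (w · x_i + b) ≥ 1 with labels mapped to -1/+1; that scaling convention is standard but is not spelled out in the text above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable data: class 0 sits at x = 0, class 1 at x = 2,
# so the widest possible corridor between them has width 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates the hard-margin case.
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]        # weight vector, normal to the hyperplane
b = model.intercept_[0]   # bias
margin = 2.0 / np.linalg.norm(w)

# Constraint check: with labels mapped to -1/+1, every point should
# satisfy y_i * (w . x_i + b) >= 1, up to solver tolerance.
signed = np.where(y == 1, 1.0, -1.0) * (X @ w + b)

print("w =", w, "b =", b)
print("margin width =", margin)
print("min y_i * (w . x_i + b) =", signed.min())
```

The printed margin should come out close to 2, matching the gap between the two columns of points.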
Hard margin SVM requires perfect separation — every point must be on the correct side. This only works when the data is linearly separable, which is rare in practice.
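Soft margin SVM relaxes this requirement: it lets some points fall inside the margin or on the wrong side, and introduces a parameter C that sets how heavily those violations are penalized.
Think of C as a strictness dial. Low C says "I'll accept a few mistakes for a more robust boundary." High C says "I want as few training mistakes as possible, even if the margin gets narrow."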
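A quick way to see the dial in action is to fit the same overlapping data with a small and a large C and compare the resulting margin widths. This is only a sketch: the synthetic blobs, their centers and spread, and the two C values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data (not linearly separable).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 10):
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(model.coef_[0])  # margin width = 2 / ||w||
    results[C] = (margin, len(model.support_))
    print(f"C={C:<5} margin width={margin:.3f} "
          f"support vectors={len(model.support_)}")
```

The low-C model should report the wider margin, since it trades training mistakes for a broader corridor.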
Real data is often not linearly separable in its original space. The kernel trick projects data into a higher-dimensional space where a linear separator does exist — without explicitly computing the transformation (which would be computationally expensive).
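The "without explicitly computing the transformation" part can be verified by hand for a simple case. The sketch below uses the homogeneous degree-2 polynomial kernel, (x · x')², which corresponds to the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²) for 2D inputs. The function names `phi` and `poly2_kernel` and the two sample points are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point (2D -> 3D)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # transform first, then take the dot product
implicit = poly2_kernel(x, z)      # same value, without ever leaving 2D

print("explicit:", explicit)
print("implicit:", implicit)
print("match:", np.isclose(explicit, implicit))
```

Both routes give the same number, but the kernel version never builds the higher-dimensional vectors. For kernels such as RBF, whose implicit feature space is infinite-dimensional, the kernel route is the only practical one.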
Common Kernels:
| Kernel | Formula | Use Case | Behavior |
|---|---|---|---|
| Linear | x · x' | Text classification, high-dim data | Fast, no transformation |
| RBF (Gaussian) | exp(-γ ‖x - x'‖²) | Default choice for non-linear data | γ sets how far each training point's influence reaches |
| Polynomial | (γ x·x' + r)^d | NLP, structured data | Flexible, degree controls complexity |
| Sigmoid | tanh(γ x·x' + r) | Neural-net-like behavior | Less common, can be unstable |
The RBF (Radial Basis Function) kernel is the default choice. The gamma parameter controls how far the influence of a single training point reaches — high gamma means tight, local influence; low gamma means broad influence.
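To get a feel for what "reach" means, the sketch below evaluates the RBF kernel for two points exactly one unit apart at several gamma values, using scikit-learn's `rbf_kernel` helper. The points and the gamma values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points exactly one unit apart.
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

similarities = {}
for gamma in (0.01, 1.0, 100.0):
    # rbf_kernel computes exp(-gamma * ||a - b||^2)
    k = rbf_kernel(a, b, gamma=gamma)[0, 0]
    similarities[gamma] = k
    print(f"gamma={gamma:<6} K(a, b)={k:.6f}")
```

With low gamma the two points still look almost identical to the kernel. With high gamma their similarity collapses toward zero, so each training point only influences its immediate neighborhood.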
SVM shines in specific scenarios:
```python
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# Load the digits dataset (8x8 pixel images, 10 classes: 0-9)
digits = datasets.load_digits()
X = digits.data    # Shape: (1797, 64), 64 pixel features per image
y = digits.target  # Shape: (1797,), digit label 0-9

# Split into training and test sets: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features: SVM is sensitive to feature magnitude
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on training data only
X_test = scaler.transform(X_test)        # Apply same scaling to test

# Train SVM with RBF kernel
# SVC finds the support vectors and the optimal hyperplane
svm_model = SVC(kernel='rbf', C=10, gamma=0.001, random_state=42)
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Output:
# Accuracy: 0.9889
#
# Classification Report:
#               precision    recall  f1-score   support
#            0       1.00      1.00      1.00        33
#            1       0.97      1.00      0.98        28
#            2       1.00      1.00      1.00        33
#            3       0.97      0.97      0.97        36
#            4       1.00      0.98      0.99        46
#            5       0.98      0.98      0.98        46
#            6       1.00      1.00      1.00        35
#            7       1.00      0.97      0.99        34
#            8       0.97      0.97      0.97        30
#            9       0.97      0.97      0.97        39
```
```python
# Use GridSearchCV to find the best C and gamma combination
# (reuses SVC, GridSearchCV, X_train, and y_train from the code above)
param_grid = {
    'C': [0.1, 1, 10, 100],           # Regularization strength
    'gamma': [0.001, 0.01, 0.1, 1],   # RBF kernel spread
    'kernel': ['rbf']
}

grid_search = GridSearchCV(
    SVC(), param_grid, cv=5,          # 5-fold cross-validation
    scoring='accuracy', n_jobs=-1     # Use all CPU cores
)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

# Output:
# Best Parameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
# Best CV Accuracy: 0.9875
```
Each combination of C and gamma is evaluated across 5 data splits, and the combination with the highest average accuracy is selected. Averaging over splits reduces the risk of choosing parameters that just happen to work on one particular split.
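One caveat with the search above: the scaler was fitted on the whole training set before cross-validation, so every validation fold has already influenced the scaling statistics. A common way to avoid this, sketched here as a suggestion rather than as the lesson's own method, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` so that each fold fits its own scaler. The step names `scaler` and `svc` and the reduced grid are assumptions made to keep the example short and fast.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# The "svc__" prefix routes each parameter to the SVC step.
# The grid is deliberately small to keep the sketch fast.
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.001, 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.4f}")
```

After fitting, `search` refits the best pipeline on the full training set, so `search.score` and `search.predict` apply the scaling and the SVM in one step.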
In the next lesson, we move from supervised learning to unsupervised learning with K-Means Clustering — where there are no labels at all.