Support Vector Machines
Support Vector Machines: Finding the Optimal Boundary
Support Vector Machines have an elegant geometric intuition: find the hyperplane that separates classes with the maximum possible margin. This principle — maximum margin classification — makes SVMs robust and theoretically well-grounded.
The Core Idea: Maximum Margin
Two classes, many possible separating lines:
Line A: ---- passes close to some points (low margin)
Line B: ---- passes far from all points (high margin) ← SVM chooses this
Why maximum margin? Points far from the boundary →
more confident classification → better generalization
The margin is the width of the "street" between the two classes. SVMs find the line (or hyperplane in higher dimensions) that maximizes this street width.
Support vectors are the data points closest to the decision boundary — the ones actually "supporting" it. Only these points determine the final model; all others can be removed without changing anything.
Linear SVM: Hard Margin
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Linearly separable data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=42)
# Hard margin SVM (no errors allowed)
svm = SVC(kernel='linear', C=1e10) # Very large C = hard margin
svm.fit(X, y)
# Support vectors
print(f"Number of support vectors: {svm.n_support_}")
print(f"Support vectors:\n{svm.support_vectors_}")
# Visualize
def plot_svm(svm, X, y):
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.7)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = svm.decision_function(xy).reshape(XX.shape)
# Decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1],
alpha=0.5, linestyles=['--', '-', '--'])
# Support vectors
ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
s=200, linewidth=1, facecolors='none', edgecolors='k')
plt.show()
plot_svm(svm, X, y)
Soft Margin SVM: Handling Real Data
Real data is rarely perfectly separable. The soft margin SVM allows some misclassification, controlled by the parameter C.
# C controls the trade-off between margin size and misclassification
# Low C: Wide margin, allows more errors (underfitting risk)
svm_low_c = SVC(kernel='linear', C=0.01)
# High C: Narrow margin, fewer errors allowed (overfitting risk)
svm_high_c = SVC(kernel='linear', C=100.0)
# The C hyperparameter
# Large C → penalize misclassification heavily → smaller margin, fewer errors
# Small C → tolerate misclassification → larger margin, more errors
The Kernel Trick: Non-Linear SVMs
Linear SVMs fail when data isn't linearly separable. The kernel trick maps data into a higher-dimensional space where it is linearly separable — without explicitly computing the transformation.
from sklearn.datasets import make_circles, make_moons
# Non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
# RBF (Radial Basis Function) kernel — most commonly used
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_circles, y_circles)
print(f"RBF SVM on circles: {svm_rbf.score(X_circles, y_circles):.3f}")
# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0)
svm_poly.fit(X_circles, y_circles)
# Sigmoid kernel (rarely used)
svm_sigmoid = SVC(kernel='sigmoid', C=1.0)
Common kernels:
| Kernel | Formula | Use When |
|---|---|---|
| Linear | K(x,y) = x·y | Data is linearly separable |
| RBF/Gaussian | K(x,y) = exp(-γ‖x-y‖²) | Default for non-linear problems |
| Polynomial | K(x,y) = (γx·y + r)^d | Image classification |
| Sigmoid | K(x,y) = tanh(γx·y + r) | NLP (historically) |
RBF Kernel: The Gamma Parameter
Gamma controls the "reach" of each training example's influence.
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
for gamma in [0.01, 0.1, 1, 10, 100]:
svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
svm.fit(X_train, y_train)
train_acc = svm.score(X_train, y_train)
test_acc = svm.score(X_test, y_test)
print(f"γ={gamma:6.2f}: Train={train_acc:.3f}, Test={test_acc:.3f}")
γ=0.01: Train=0.742, Test=0.733 ← Underfitting (too smooth)
γ=0.10: Train=0.870, Test=0.867 ← Good
γ=1.00: Train=0.963, Test=0.950 ← Good
γ=10.0: Train=0.996, Test=0.933 ← Starting to overfit
γ=100.0: Train=1.000, Test=0.717 ← Severe overfitting
- Large gamma: Small radius of influence → model fits each point too tightly → overfitting
- Small gamma: Large radius of influence → model is too smooth → underfitting
Tuning C and Gamma Together
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100, 1000],
'gamma': [0.001, 0.01, 0.1, 1, 10],
'kernel': ['rbf']
}
grid_search = GridSearchCV(
SVC(), param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print("Best params:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
print("Test score:", grid_search.best_estimator_.score(X_test, y_test))
Feature Scaling is Mandatory for SVMs
SVMs are distance-based — features on different scales will dominate the distance calculation.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# ALWAYS use a pipeline with SVM
svm_pipeline = Pipeline([
('scaler', StandardScaler()), # Scale first
('svm', SVC(kernel='rbf', C=10, gamma='scale'))
])
svm_pipeline.fit(X_train, y_train)
print(f"Accuracy: {svm_pipeline.score(X_test, y_test):.3f}")
SVR: Support Vector Regression
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
housing = fetch_california_housing()
X, y = housing.data[:2000], housing.target[:2000] # Small sample for speed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
svr_pipeline = Pipeline([
('scaler', StandardScaler()),
('svr', SVR(kernel='rbf', C=10, epsilon=0.1))
])
svr_pipeline.fit(X_train, y_train)
y_pred = svr_pipeline.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
The epsilon parameter defines an "insensitive tube" — predictions within ε of the true value incur no penalty.
Probability Estimates
By default, SVMs don't output probabilities. Enable them with probability=True:
svm = SVC(kernel='rbf', C=1.0, probability=True)
svm.fit(X_train, y_train)
# Now you can get probability estimates
proba = svm.predict_proba(X_test)
print(f"Confidence: {proba[:5]}") # Probability per class
Note: Enabling probability estimation significantly increases training time (uses Platt scaling internally).
SVM Strengths and Weaknesses
Strengths:
- Works well in high-dimensional spaces (text classification)
- Effective even when features > samples
- Memory efficient (only support vectors stored)
- Versatile through kernels
- Strong theoretical guarantees
Weaknesses:
- Slow on large datasets (O(n²) to O(n³) training complexity)
- Requires careful feature scaling
- C and gamma tuning is sensitive
- Difficult to interpret (no feature importances)
- Probability estimates are expensive add-ons
SVMs shine when:
- Dataset is small to medium (under 100K samples)
- High-dimensional data (text, genomics)
- Clear margin of separation exists
- You need robust classification in high-dimensional spaces
For most tabular ML today, gradient boosting methods (XGBoost, LightGBM) outperform SVMs in both accuracy and training speed. But SVMs remain the gold standard for text classification and bioinformatics problems.
Next lesson: K-Nearest Neighbors — the simplest non-parametric algorithm in ML.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises