Support Vector Machines | Machine Learning Fundamentals | AiTechWorlds

Support Vector Machines: Finding the Optimal Boundary

Support Vector Machines have an elegant geometric intuition: find the hyperplane that separates classes with the maximum possible margin. This principle — maximum margin classification — makes SVMs robust and theoretically well-grounded.

The Core Idea: Maximum Margin

Two classes, many possible separating lines:

Line A:  ----  passes close to some points (low margin)
Line B:  ----  passes far from all points (high margin) ← SVM chooses this

Why maximum margin? Points far from the boundary → 
more confident classification → better generalization

The margin is the width of the "street" between the two classes. SVMs find the line (or hyperplane in higher dimensions) that maximizes this street width.

Support vectors are the data points closest to the decision boundary — the ones actually "supporting" it. Only these points determine the final model; all others can be removed without changing anything.

Linear SVM: Hard Margin

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Linearly separable data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=42)

# Hard margin SVM (no errors allowed)
svm = SVC(kernel='linear', C=1e10)  # Very large C = hard margin
svm.fit(X, y)

# Support vectors
print(f"Number of support vectors: {svm.n_support_}")
print(f"Support vectors:\n{svm.support_vectors_}")

# Visualize
def plot_svm(svm, X, y):
    plt.figure(figsize=(10, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.7)
    
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = svm.decision_function(xy).reshape(XX.shape)
    
    # Decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], 
               alpha=0.5, linestyles=['--', '-', '--'])
    
    # Support vectors
    ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
               s=200, linewidth=1, facecolors='none', edgecolors='k')
    plt.show()

plot_svm(svm, X, y)

Soft Margin SVM: Handling Real Data

Real data is rarely perfectly separable. The soft margin SVM allows some misclassification, controlled by the parameter C.

# C controls the trade-off between margin size and misclassification

# Low C: Wide margin, allows more errors (underfitting risk)
svm_low_c = SVC(kernel='linear', C=0.01)

# High C: Narrow margin, fewer errors allowed (overfitting risk)  
svm_high_c = SVC(kernel='linear', C=100.0)

# The C hyperparameter
# Large C → penalize misclassification heavily → smaller margin, fewer errors
# Small C → tolerate misclassification → larger margin, more errors

The Kernel Trick: Non-Linear SVMs

Linear SVMs fail when data isn't linearly separable. The kernel trick maps data into a higher-dimensional space where it is linearly separable — without explicitly computing the transformation.

from sklearn.datasets import make_circles, make_moons

# Non-linearly separable data
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, random_state=42)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)

# RBF (Radial Basis Function) kernel — most commonly used
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_circles, y_circles)
print(f"RBF SVM on circles: {svm_rbf.score(X_circles, y_circles):.3f}")

# Polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0)
svm_poly.fit(X_circles, y_circles)

# Sigmoid kernel (rarely used)
svm_sigmoid = SVC(kernel='sigmoid', C=1.0)

Common kernels:

Kernel	Formula	Use When
Linear	K(x,y) = x·y	Data is linearly separable
RBF/Gaussian	K(x,y) = exp(-γ‖x-y‖²)	Default for non-linear problems
Polynomial	K(x,y) = (γx·y + r)^d	Image classification
Sigmoid	K(x,y) = tanh(γx·y + r)	NLP (historically)

RBF Kernel: The Gamma Parameter

Gamma controls the "reach" of each training example's influence.

from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for gamma in [0.01, 0.1, 1, 10, 100]:
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_train, y_train)
    train_acc = svm.score(X_train, y_train)
    test_acc = svm.score(X_test, y_test)
    print(f"γ={gamma:6.2f}: Train={train_acc:.3f}, Test={test_acc:.3f}")

γ=0.01:  Train=0.742, Test=0.733  ← Underfitting (too smooth)
γ=0.10:  Train=0.870, Test=0.867  ← Good
γ=1.00:  Train=0.963, Test=0.950  ← Good
γ=10.0:  Train=0.996, Test=0.933  ← Starting to overfit
γ=100.0: Train=1.000, Test=0.717  ← Severe overfitting

Large gamma: Small radius of influence → model fits each point too tightly → overfitting
Small gamma: Large radius of influence → model is too smooth → underfitting

Tuning C and Gamma Together

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.1, 1, 10],
    'kernel': ['rbf']
}

grid_search = GridSearchCV(
    SVC(), param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
print("Test score:", grid_search.best_estimator_.score(X_test, y_test))

Feature Scaling is Mandatory for SVMs

SVMs are distance-based — features on different scales will dominate the distance calculation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ALWAYS use a pipeline with SVM
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),   # Scale first
    ('svm', SVC(kernel='rbf', C=10, gamma='scale'))
])

svm_pipeline.fit(X_train, y_train)
print(f"Accuracy: {svm_pipeline.score(X_test, y_test):.3f}")

SVR: Support Vector Regression

from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error

housing = fetch_california_housing()
X, y = housing.data[:2000], housing.target[:2000]  # Small sample for speed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

svr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=10, epsilon=0.1))
])

svr_pipeline.fit(X_train, y_train)
y_pred = svr_pipeline.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

The epsilon parameter defines an "insensitive tube" — predictions within ε of the true value incur no penalty.

Probability Estimates

By default, SVMs don't output probabilities. Enable them with probability=True:

svm = SVC(kernel='rbf', C=1.0, probability=True)
svm.fit(X_train, y_train)

# Now you can get probability estimates
proba = svm.predict_proba(X_test)
print(f"Confidence: {proba[:5]}")  # Probability per class

Note: Enabling probability estimation significantly increases training time (uses Platt scaling internally).

SVM Strengths and Weaknesses

Strengths:

Works well in high-dimensional spaces (text classification)
Effective even when features > samples
Memory efficient (only support vectors stored)
Versatile through kernels
Strong theoretical guarantees

Weaknesses:

Slow on large datasets (O(n²) to O(n³) training complexity)
Requires careful feature scaling
C and gamma tuning is sensitive
Difficult to interpret (no feature importances)
Probability estimates are expensive add-ons

SVMs shine when:

Dataset is small to medium (under 100K samples)
High-dimensional data (text, genomics)
Clear margin of separation exists
You need robust classification in high-dimensional spaces

For most tabular ML today, gradient boosting methods (XGBoost, LightGBM) outperform SVMs in both accuracy and training speed. But SVMs remain the gold standard for text classification and bioinformatics problems.

Next lesson: K-Nearest Neighbors — the simplest non-parametric algorithm in ML.