Imagine a master sculptor creates a stunning 3D figure — intricate curves, deep grooves, subtle shadows. You want to photograph it for a poster. You can only pick one angle. Choose wisely and the photograph captures the essence: the face, the posture, the drama. Choose poorly and you get a flat silhouette that tells nothing.

That is exactly what Principal Component Analysis (PCA) does with data. Your dataset lives in many dimensions — maybe 30 features. PCA finds the "best angle" to look at that data: a new set of axes that preserves the most variation, the most information. You lose some depth, but you keep the essential shape.

Why Reduce Dimensions?

High-dimensional data causes real, practical problems. Before looking at how PCA works, it helps to be clear about why you would want fewer dimensions.

Visualization: Humans cannot plot 30 dimensions. Reducing to 2 or 3 lets you see clusters, outliers, and patterns that were invisible before.

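Speed: Training cost grows with the number of features, often linearly and for some algorithms faster than that. A model trained on 2 features instead of 30 can be substantially faster, although the exact speedup depends on the algorithm and the dataset.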

The Curse of Dimensionality: As dimensions increase, data points become increasingly sparse. Distances lose meaning. Algorithms that rely on distance — like K-Nearest Neighbors — degrade badly in high dimensions.

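Noise Removal: Many features capture noise or redundant information. PCA keeps the directions of highest variance and discards the rest, which often removes noise. This works when the signal carries more variance than the noise; a low-variance direction can still matter for a downstream task.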

Avoiding Overfitting: Fewer input features mean fewer parameters to fit, which reduces the risk of a model memorizing training data.
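
The claim that distances lose meaning in high dimensions can be checked numerically. The sketch below is an illustration on assumed synthetic data (uniform random points in a unit hypercube; the helper name relative_contrast is hypothetical). It measures how much farther the farthest point is than the nearest one from a random query point. As the number of dimensions grows, that gap shrinks relative to the distances themselves, which is what hurts distance-based methods such as K-Nearest Neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 30, 1000):
    print(f"{d:>4} dims: relative contrast = {relative_contrast(d):.2f}")
```

A large value means the nearest neighbor is much closer than the farthest point; a value near zero means all points are almost equally far away.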

How PCA Works: Directions of Maximum Variance

PCA is a linear transformation. It rotates your data into a new coordinate system where:

The first axis (first principal component, PC1) points in the direction of maximum variance in the data.
The second axis (PC2) points in the direction of maximum remaining variance, and is perpendicular to PC1.
Each subsequent component captures the next largest chunk of remaining variance and is perpendicular to all the components before it.

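Technically, PCA corresponds to an eigendecomposition of the covariance matrix of the mean-centered data. Each eigenvector is a principal component (a direction), and each eigenvalue is the variance of the data along that direction. In practice, libraries such as scikit-learn typically reach the same result through a singular value decomposition (SVD) of the centered data.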

You do not need to compute this by hand. Sklearn handles it in two lines. But understanding the geometry — you are rotating your axes to align with the spread of the data — makes everything else click.
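
For readers who want to see the geometry once in code, here is a minimal sketch of the eigendecomposition route, checked against sklearn. It assumes a small synthetic dataset invented for illustration; the variable names are not from any library. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# 1. Center the data: PCA works on mean-centered features
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition; eigh handles symmetric matrices and
#    returns eigenvalues in ascending order, so sort descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project onto the top 2 eigenvectors (the first 2 principal components)
X_manual = X_centered @ eigenvectors[:, :2]

# Compare with sklearn (signs of components may differ)
pca = PCA(n_components=2).fit(X)
same_variances = np.allclose(eigenvalues[:2], pca.explained_variance_)
same_projection = np.allclose(np.abs(X_manual), np.abs(pca.transform(X)))
print(f"Variances match: {same_variances}")
print(f"Projections match up to sign: {same_projection}")
```

The eigenvalues line up with sklearn's explained_variance_ attribute, and dividing each eigenvalue by the sum of all eigenvalues gives the explained variance ratio.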

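Critical preprocessing step: Always standardize your features before PCA. PCA maximizes variance, and variance depends on the units of measurement. If one feature is in the tens of thousands (salary) and another is in the tens (age), the first component will point almost entirely along salary simply because its numbers are larger, not because it carries more structure.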
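
The effect of scale is easy to demonstrate. The sketch below uses assumed synthetic data: two independent features, a hypothetical salary and age, generated from normal distributions chosen only for illustration. Without scaling, PC1 is essentially the salary axis. After StandardScaler, the two features contribute on equal terms.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical, independent features on very different scales
salary = rng.normal(60_000, 15_000, size=n)  # tens of thousands
age = rng.normal(40, 10, size=n)             # tens
X = np.column_stack([salary, age])

raw = PCA(n_components=2).fit(X)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("Unscaled PC1 variance ratio:", raw.explained_variance_ratio_[0])
print("Unscaled PC1 loadings (salary, age):", raw.components_[0])
print("Scaled PC1 variance ratio:", scaled.explained_variance_ratio_[0])
```

Because the two features are generated independently, the scaled PC1 ratio should sit near one half, while the unscaled one is close to 1.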

Explained Variance Ratio

After fitting PCA, each component has an explained variance ratio: the fraction of total dataset variance it captures.

If PC1 explains 0.72 and PC2 explains 0.15, together they explain 0.87, or 87% of the variance. You reduced from 30 dimensions to 2 and kept 87% of the variance, which is the sense in which PCA preserves "information."

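Two plots show this visually. A scree plot shows the explained variance of each component in order; you look for the "elbow," the point where adding more components gives diminishing returns. A cumulative explained variance plot shows the running total, which makes it easy to read off how many components are needed to reach a target such as 95%.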
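
To compare the per-component view with the running total, the sketch below draws both side by side. It assumes scikit-learn's built-in breast cancer dataset (30 features) as the input; the figure layout and variable names are illustrative choices, not a prescribed recipe.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components to get the full variance breakdown
ratios = PCA().fit(X_scaled).explained_variance_ratio_
cumulative = np.cumsum(ratios)
component_ids = np.arange(1, len(ratios) + 1)

fig, (ax_scree, ax_cum) = plt.subplots(1, 2, figsize=(11, 4))

# Left: variance explained by each individual component (look for the elbow)
ax_scree.plot(component_ids, ratios, marker="o")
ax_scree.set(xlabel="Component", ylabel="Explained variance ratio",
             title="Per-component explained variance")

# Right: running total, with a reference line at 95%
ax_cum.plot(component_ids, cumulative, marker="o")
ax_cum.axhline(0.95, linestyle="--", color="gray")
ax_cum.set(xlabel="Number of components", ylabel="Cumulative explained variance",
           title="Cumulative explained variance")

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view the figure.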

Choosing the Number of Components: The 95% Rule

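A common rule of thumb: keep enough components to explain 95% of the variance. This is a practical balance: you drop the low-variance directions, which are often noise, while preserving most of the signal. The threshold is a convention rather than a guarantee, so treat it as a starting point and adjust it for your task.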

You can let sklearn choose automatically:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # Keep enough components for 95% variance
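
After fitting, you can check how many components were actually kept. The sketch below assumes scikit-learn's built-in breast cancer dataset (30 features) as input. The fitted n_components_ attribute holds the number chosen for the requested variance fraction.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# A float between 0 and 1 asks for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Components kept: {pca.n_components_}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
print(f"Reduced shape: {X_reduced.shape}")
```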

Complete sklearn PCA Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load dataset: 30 features, 2 classes (malignant / benign)
data = load_breast_cancer()
X, y = data.data, data.target
print(f"Original shape: {X.shape}")
# Output: Original shape: (569, 30)

# Step 1: Standardize (ALWAYS do this before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit PCA — reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_pca.shape}")
# Output: Reduced shape: (569, 2)

# Step 3: Explained variance
print(f"PC1 explains: {pca.explained_variance_ratio_[0]:.2%}")
print(f"PC2 explains: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Total explained: {sum(pca.explained_variance_ratio_):.2%}")
# Output:
# PC1 explains: 44.27%
# PC2 explains: 18.97%
# Total explained: 63.24%

# Step 4: Cumulative variance plot (how many components for 95%?)
pca_full = PCA().fit(X_scaled)
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components needed for 95% variance: {components_95}")
# Output: Components needed for 95% variance: 10

# Step 5: Visualize the 2-component projection
plt.figure(figsize=(8, 6))
# In this dataset, target 0 = malignant and target 1 = benign
colors = ['steelblue', 'tomato']
labels = ['Malignant', 'Benign']
for i, (color, label) in enumerate(zip(colors, labels)):
    mask = y == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                c=color, label=label, alpha=0.6, s=40)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Breast Cancer Dataset — PCA to 2 Components')
plt.legend()
plt.tight_layout()
plt.savefig('pca_visualization.png', dpi=150)
plt.show()
# The two classes form largely separate groups, with some overlap, even with just 2 components

Even with just 2 components out of 30, the two cancer classes are substantially separated. This is the power of PCA — meaningful structure survives compression.
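
When PCA feeds a downstream model rather than a plot, the scaler and the PCA should be fitted on the training data only and then applied to the test data. One way to arrange that is a scikit-learn pipeline. The sketch below is one possible setup, not the only one: the choice of LogisticRegression, the 75/25 split, and the random seed are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline fits the scaler and PCA on the training data only,
# then reuses those fitted transforms when scoring the test data
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
accuracy = model.score(X_test, y_test)
print(f"Components kept: {n_kept}")
print(f"Test accuracy: {accuracy:.3f}")
```

Keeping the transforms inside the pipeline also means cross-validation refits them on each fold, so no information from held-out data leaks into the components.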

t-SNE: When PCA Is Not Enough

PCA is linear. If your data has complex, non-linear structure (like a spiral or a Swiss roll), no linear projection can unroll it, so PCA represents it poorly.

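t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique that excels at visualization. It places points that are close neighbors in the original space close together in the low-dimensional map, focusing on local neighborhoods rather than global distances. The result is often visually well-separated clusters, although the sizes of clusters and the distances between them in a t-SNE plot should not be read literally.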

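Limitation: t-SNE is best treated as a visualization tool. It is a poor preprocessing step for training: the axes have no interpretable meaning, the result changes with the random seed and the perplexity setting, and the standard algorithm learns no mapping that can be applied to new data. Use PCA for preprocessing; use t-SNE for exploration.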
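
A minimal sketch of t-SNE in scikit-learn follows, assuming the built-in breast cancer dataset as input. The perplexity, initialization, and seed are illustrative choices that you would normally tune. The last two lines show the practical difference from PCA: in scikit-learn's implementation at the time of writing, TSNE offers fit_transform but no separate transform method for new data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Embed all samples into 2 dimensions in one pass
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
print(f"Embedded shape: {X_tsne.shape}")

# PCA learns a reusable mapping; t-SNE only embeds the data it was fitted on
print("PCA has transform: ", hasattr(PCA(), "transform"))
print("TSNE has transform:", hasattr(tsne, "transform"))
```

X_tsne can be passed to the same scatter plot code used for the PCA projection to compare the two views.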

Quick Reference Table

Original Features    Components Kept   Variance Retained   Use Case
30                   2                 ~63%                Visualization
30                   10                ~95%                Preprocessing before training
784 (MNIST images)   50                ~85%                Speed up classification
100+                 auto (95% rule)   ≥95%                General dimensionality reduction
Any                  All               100%                No reduction (baseline only)

Key Takeaways

Always standardize before PCA — unscaled features dominate the components.
The explained variance ratio tells you how much information each component holds.
The 95% rule is a practical starting point for choosing how many components to keep.
2 components are enough for visualization; use more (typically 10–50) when preprocessing for a downstream model.
t-SNE is a powerful visualization-only alternative for non-linear structure.
PCA is one of the most widely used algorithms in ML — it is fast, deterministic, and interpretable. Learn it thoroughly before reaching for more complex tools.
