AiTechWorlds
Imagine a master sculptor creates a stunning 3D figure — intricate curves, deep grooves, subtle shadows. You want to photograph it for a poster. You can only pick one angle. Choose wisely and the photograph captures the essence: the face, the posture, the drama. Choose poorly and you get a flat silhouette that tells nothing.
That is exactly what Principal Component Analysis (PCA) does with data. Your dataset lives in many dimensions — maybe 30 features. PCA finds the "best angle" to look at that data: a new set of axes that preserves the most variation, the most information. You lose some depth, but you keep the essential shape.
High-dimensional data causes real, practical problems. Understanding why you want fewer dimensions is the "Why" before the "How."
Visualization: Humans cannot plot 30 dimensions. Reducing to 2 or 3 lets you see clusters, outliers, and patterns that were invisible before.
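Speed: Training cost grows with the number of features, and for some algorithms it grows faster than linearly. Cutting 30 features down to a handful can make training noticeably faster, though the exact speedup depends on the algorithm and the size of the dataset.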
The Curse of Dimensionality: As dimensions increase, data points become increasingly sparse. Distances lose meaning. Algorithms that rely on distance — like K-Nearest Neighbors — degrade badly in high dimensions.
Noise Removal: Many features carry noise or redundant information. PCA keeps the directions with the most variance and drops the low-variance ones, which denoises your data when the noise is small relative to the signal.
Avoiding Overfitting: Fewer input features mean fewer parameters to fit, which reduces the risk of a model memorizing training data.
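The curse-of-dimensionality point above can be made concrete with a small experiment. This is a minimal sketch, assuming uniformly random points in a unit hypercube (synthetic data, not from this lesson) and a hypothetical helper `distance_contrast`. It measures how much farther the farthest point is than the nearest one; as dimensions grow, that gap shrinks and "nearest" stops meaning much.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """Return (max - min) / min over distances from one query point to random points."""
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 30, 1000):
    print(f"{d:>5} dims: relative contrast = {distance_contrast(d):.2f}")
```

A large contrast means the nearest neighbor is much closer than the farthest point. A contrast near zero means all points are roughly equally far away, which is what hurts distance-based methods like K-Nearest Neighbors.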
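PCA is a linear transformation. It rotates your data into a new coordinate system where the first axis (PC1) points in the direction of greatest variance, each later axis captures the most remaining variance while staying perpendicular to the earlier ones, and the resulting components are uncorrelated with each other.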
Technically, PCA performs an eigendecomposition of the covariance matrix. Each eigenvector is a principal component (direction), and each eigenvalue tells you how much variance that component explains.
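To see that the eigendecomposition description matches what sklearn reports, here is a minimal sketch on toy data. The data, the mixing matrix, and the variable names are assumptions made up for illustration. It computes the covariance matrix by hand, takes its eigenvalues and eigenvectors with NumPy, and compares the result with `PCA`. Sklearn may take a different numerical route internally, but the explained variance ratios should agree.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Toy data (an assumption for illustration): 200 samples, 3 correlated features
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)        # 3 x 3 covariance matrix

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]         # re-sort in descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]         # columns are the principal directions

manual_ratio = eigenvalues / eigenvalues.sum()

pca = PCA(n_components=3).fit(X)
print("manual :", np.round(manual_ratio, 4))
print("sklearn:", np.round(pca.explained_variance_ratio_, 4))
```

The directions themselves can differ by a sign flip between the two approaches, since an eigenvector and its negative describe the same axis.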
You do not need to compute this by hand. Sklearn handles it in two lines. But understanding the geometry — you are rotating your axes to align with the spread of the data — makes everything else click.
Critical preprocessing step: Always standardize your features before PCA. If one feature is in thousands (salary) and another is in ones (age), the large-scale feature will dominate the components simply because its raw numbers vary more, not because it carries more information.
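The effect of scale is easy to check. The sketch below assumes two made-up, independent features, a salary in the tens of thousands and an age in years, and fits PCA with and without `StandardScaler`. The numbers are synthetic and only meant to show the pattern.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: salary and age are generated independently of each other
salary = rng.normal(50_000, 15_000, size=300)
age = rng.normal(40, 10, size=300)
X = np.column_stack([salary, age])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("Without scaling:", np.round(raw_ratio, 4))
print("With scaling:   ", np.round(scaled_ratio, 4))
```

Without scaling, PC1 is essentially the salary column and takes almost all of the variance. After scaling, the two independent features share the variance roughly evenly, which is the honest picture.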
After fitting PCA, each component has an explained variance ratio: the fraction of total dataset variance it captures.
If PC1 explains 0.72 and PC2 explains 0.15, together they explain 87% of the variance. You reduced from 30 dimensions to 2 and kept 87% of the information.
The cumulative explained variance plot shows this visually, and a scree plot shows the variance of each component on its own. In either one you look for the "elbow": the point where adding more components gives diminishing returns.
A common rule of thumb: keep enough components to explain 95% of the variance. This is a practical balance — you discard noise while preserving almost all signal.
You can let sklearn choose automatically:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) # Keep enough components for 95% variance
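After fitting, you can ask the model how many components it actually chose. This is a small sketch using the same breast cancer dataset as the full example in this lesson; `n_components_` is the fitted attribute that holds the selected count.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1 is read as a target fraction of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Components selected:", pca.n_components_)
print("Reduced shape:", X_reduced.shape)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```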
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load dataset: 30 features, 2 classes (malignant / benign)
data = load_breast_cancer()
X, y = data.data, data.target
print(f"Original shape: {X.shape}")
# Output: Original shape: (569, 30)
# Step 1: Standardize (ALWAYS do this before PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Fit PCA — reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_pca.shape}")
# Output: Reduced shape: (569, 2)
# Step 3: Explained variance
print(f"PC1 explains: {pca.explained_variance_ratio_[0]:.2%}")
print(f"PC2 explains: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Total explained: {sum(pca.explained_variance_ratio_):.2%}")
# Output:
# PC1 explains: 44.27%
# PC2 explains: 18.97%
# Total explained: 63.24%
# Step 4: Cumulative variance plot (how many components for 95%?)
pca_full = PCA().fit(X_scaled)
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components needed for 95% variance: {components_95}")
# Output: Components needed for 95% variance: 10
# Step 5: Visualize the 2-component projection
plt.figure(figsize=(8, 6))
colors = ['steelblue', 'tomato']
labels = ['Malignant', 'Benign']
for i, (color, label) in enumerate(zip(colors, labels)):
    mask = y == i  # in this dataset, target 0 is malignant and 1 is benign
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                c=color, label=label, alpha=0.6, s=40)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Breast Cancer Dataset — PCA to 2 Components')
plt.legend()
plt.tight_layout()
plt.savefig('pca_visualization.png', dpi=150)
plt.show()
# The two classes form largely separate groups, with some overlap, even with just 2 components
Even with just 2 components out of 30, the two cancer classes are substantially separated. This is the power of PCA — meaningful structure survives compression.
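One way to test that claim beyond eyeballing the plot is to train a simple classifier on only the two components. The sketch below is an illustration, not part of the original walkthrough: the choice of `LogisticRegression` and 5-fold cross-validation are assumptions. Putting the scaler and PCA inside a pipeline means both are refit on each training fold, so nothing from the held-out fold leaks in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale -> project to 2 components -> classify, all refit inside each fold
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print(f"Mean accuracy on 2 components: {scores.mean():.3f}")
```

If the accuracy stays high with only 2 of the 30 dimensions, that is evidence the class structure mostly lives in the top components.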
PCA is linear. If your data has complex, non-linear structure (like a spiral or a Swiss roll), PCA flattens it poorly.
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique that excels at visualization. It pulls similar points together and pushes dissimilar points apart in the low-dimensional space. The result is often beautiful, well-separated clusters.
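Limitation: t-SNE is best treated as a visualization tool, not a preprocessing step for training. The axes have no interpretable meaning, distances between clusters are not reliable, and the standard algorithm (including sklearn's TSNE) cannot project new data into an existing embedding. Use PCA for preprocessing; use t-SNE for exploration.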
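Here is a minimal t-SNE sketch on the same breast cancer data, for comparison with PCA. The `perplexity` and `random_state` values are illustrative choices, not tuned recommendations. The last line checks the limitation directly: sklearn's `TSNE` offers `fit_transform` but no separate `transform` for unseen points.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# perplexity and random_state are illustrative, not tuned
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X_scaled)

print("Embedded shape:", X_embedded.shape)
print("Has transform() for new data:", hasattr(tsne, "transform"))
```

The embedding can be scattered and colored by `y` the same way as the PCA projection. Expect the layout to change with different `perplexity` or seed values, which is another reason not to read much into the axes.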
| Original Features | Components Kept | Variance Retained | Use Case |
|---|---|---|---|
| 30 | 2 | ~63% | Visualization |
| 30 | 10 | ~95% | Preprocessing before training |
| 784 (MNIST images) | 50 | ~85% | Speed up classification |
| 100+ | auto (95% rule) | 95% | General dimensionality reduction |
| Any | All | 100% | No reduction — baseline only |
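To make "variance retained" more tangible, the sketch below rebuilds the standardized breast cancer data from 2, 10, and all 30 components and measures the reconstruction error. It is an illustration under the assumption that the data was standardized with `StandardScaler`, in which case the mean squared error per value equals the fraction of variance that was thrown away.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

results = []
for k in (2, 10, 30):
    pca = PCA(n_components=k).fit(X_scaled)
    # Project down to k components, then map back to the original 30 features
    X_rebuilt = pca.inverse_transform(pca.transform(X_scaled))
    retained = pca.explained_variance_ratio_.sum()
    mse = np.mean((X_scaled - X_rebuilt) ** 2)
    results.append((k, retained, mse))
    print(f"{k:>2} components: retained {retained:.1%}, reconstruction MSE {mse:.4f}")
```

Keeping all 30 components reproduces the data exactly, which is why the last table row is only useful as a baseline.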