AiTechWorlds
Imagine you are dropped on a mountain in thick fog. You cannot see the summit or the valley. You cannot see more than a few steps ahead. Your goal: reach the lowest valley.
Your strategy: feel the slope under your feet. If the ground rises to your left, step right. If it rises ahead, step backward. Always move in the direction the slope decreases. Take small, careful steps. Eventually, step by step, you descend.
This is gradient descent — the algorithm that trains nearly every neural network ever built. The mountain is the loss landscape (a surface showing how wrong your model is for every possible combination of weights). The valley is the minimum loss. The slope is the gradient. And your step size is the learning rate.
Backpropagation solves a specific sub-problem: in a deep network with millions of weights, how do you figure out which direction is "downhill" for each individual weight? The answer is the chain rule of calculus, applied systematically layer by layer.
Before gradient descent can run, you need to define what you are minimizing. The loss function measures how wrong the model's prediction is compared to the true label.
For binary classification, the standard loss is binary cross-entropy:
L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]
If the true label is 1 and the model predicts 0.95, loss is small. If it predicts 0.05, loss is large. The loss function translates prediction errors into a single scalar number that gradient descent can act on.
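The two cases above can be checked numerically. The snippet below is a minimal sketch, assuming the natural logarithm and a single example; the helper name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices, not part of the original text.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for one example, using the natural log."""
    # Clip the prediction so that log(0) can never occur (eps is an assumption).
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

# True label is 1 in both cases; only the prediction changes.
loss_confident_right = binary_cross_entropy(1, 0.95)
loss_confident_wrong = binary_cross_entropy(1, 0.05)

print(f"y=1, y_hat=0.95 -> loss = {loss_confident_right:.4f}")  # about 0.0513
print(f"y=1, y_hat=0.05 -> loss = {loss_confident_wrong:.4f}")  # about 2.9957
```

The confidently wrong prediction costs roughly 58 times more than the confidently right one, which is the signal gradient descent acts on.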
For multi-class classification: categorical cross-entropy. For regression: mean squared error. The choice of loss function is part of the model design — it shapes what the training process optimizes.
The gradient of the loss with respect to a weight is the partial derivative ∂L/∂w. It answers: "If I increase this weight by a tiny amount, does the loss go up or down, and by how much?"
The gradient is a vector (one value per weight). Its direction always points toward steepest increase. To minimize the loss, you step in the negative gradient direction.
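The "tiny amount" idea can be made concrete by nudging each weight and watching the loss. This is an illustrative sketch only: the two-weight linear model, the single training example, and the helper `numerical_gradient` are assumptions chosen to keep the arithmetic easy to follow.

```python
def loss(w, x=(1.0, 2.0), y=3.0):
    """Squared error of a two-weight linear model on one (hypothetical) example."""
    pred = w[0] * x[0] + w[1] * x[1]
    return (pred - y) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h    # nudge weight i up by a tiny amount
        w_minus = list(w)
        w_minus[i] -= h   # nudge weight i down by the same amount
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad

w = [0.5, 0.5]
grad = numerical_gradient(loss, w)
# Analytic gradient is 2 * (pred - y) * x = 2 * (1.5 - 3) * (1, 2) = (-3, -6).
print("gradient:", [round(g, 4) for g in grad])
```

Both components are negative, so increasing either weight lowers the loss; the negative gradient therefore points toward larger weights here.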
The weight update at each step is:
w_new = w_old - α × ∂L/∂w
Where α (alpha) is the learning rate — the size of each step.
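As a minimal sketch of the update rule in action, the loop below minimizes a one-weight toy loss. The loss L(w) = (w - 3)^2, the starting point, and the step count are assumptions picked for illustration; a real model has many weights and a data-dependent loss.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss: dL/dw = 2 * (w - 3)."""
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate
w = 0.0      # arbitrary starting weight

for step in range(50):
    # w_new = w_old - alpha * dL/dw
    w = w - alpha * grad(w)

print(f"w after 50 steps: {w:.5f}, loss: {loss(w):.2e}")
```

Each step shrinks the distance to the minimum by a factor of 0.8, so after 50 steps the weight sits within about 0.0001 of 3.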
Learning rate too large: You overshoot the valley. The loss bounces around and may diverge entirely. The model never converges.
Learning rate too small: Progress is glacially slow. Training takes thousands of extra iterations. You may get stuck in a suboptimal region.
Good learning rate: The loss decreases smoothly and reaches a minimum. Common starting values: 0.001 for Adam optimizer, 0.01 for SGD.
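The three regimes can be reproduced on a toy problem. This sketch assumes the loss L(w) = w^2 with a start at w = 1; the specific rates 1.1, 0.001 and 0.1 are chosen only to make each behaviour visible on this function and are not recommendations for real networks.

```python
def run_gradient_descent(lr, steps=20, w=1.0):
    """Minimize L(w) = w**2 (gradient 2*w) and return the final weight."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

final_w = {
    "too large (1.1)": run_gradient_descent(1.1),
    "too small (0.001)": run_gradient_descent(0.001),
    "reasonable (0.1)": run_gradient_descent(0.1),
}

for label, w in final_w.items():
    print(f"learning rate {label:>18}: w after 20 steps = {w:.4f}")
```

With the largest rate every step overshoots and the weight grows to roughly 38; with the smallest it has barely moved from 1 (about 0.96); with the middle rate it is close to the minimum (about 0.012).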
Gradient descent comes in three variants, which differ in how many training samples are used to compute each gradient update:
Batch Gradient Descent: Computes the gradient over the entire training set before updating. Very stable, but extremely slow for large datasets. Memory-intensive.
Stochastic Gradient Descent (SGD): Computes the gradient and updates the weights for one sample at a time. Each update is cheap and memory-efficient. The updates are noisy, which can help escape local minima but makes convergence erratic.
Mini-Batch Gradient Descent: The practical standard. Computes gradients over a small batch (typically 32 or 64 samples) and updates. Balances speed and stability. GPUs process a whole batch in parallel as matrix operations, so batching uses the hardware efficiently.
In practice, "SGD" in frameworks usually refers to mini-batch SGD.
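The three variants differ only in the batch size passed to the same loop. The following is a rough sketch on synthetic linear-regression data; the data, the helper `train`, the learning rate, and the epoch count are all assumptions made for illustration.

```python
import numpy as np

# Synthetic regression data (hypothetical): y = X @ true_w + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def train(X, y, batch_size, lr=0.05, epochs=20, seed=0):
    """Gradient descent on mean squared error with a configurable batch size."""
    shuffle_rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = shuffle_rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= lr * grad
    return w

w_batch = train(X, y, batch_size=len(X))  # batch GD: 1 update per epoch
w_sgd = train(X, y, batch_size=1)         # SGD: 1000 updates per epoch
w_mini = train(X, y, batch_size=32)       # mini-batch: 32 updates per epoch

for name, w in [("batch", w_batch), ("sgd", w_sgd), ("mini-batch", w_mini)]:
    print(f"{name:>10}: w = {np.round(w, 3)}, error = {np.linalg.norm(w - true_w):.4f}")
```

With the same number of epochs, full-batch descent has made only 20 updates and is still noticeably short of the true weights, while the mini-batch run has made several hundred cheaper updates and is much closer.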
Backpropagation (Rumelhart et al., 1986) is the algorithm that efficiently computes gradients for every weight in a deep network. The core insight: use the chain rule of calculus.
If L depends on z, which depends on w, then:
∂L/∂w = (∂L/∂z) × (∂z/∂w)
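In a neural network, the loss depends on the final layer, which depends on the layer before it, which depends on the layer before that, and so on back to the first layer. Backprop applies the chain rule backwards through the network: a forward pass computes the prediction and the loss, a backward pass propagates the gradient from the output layer to the first layer while reusing each layer's result for the layer before it, and the optimizer then updates every weight.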
This is repeated for each batch until convergence. Modern frameworks (PyTorch, TensorFlow) implement automatic differentiation — you define the forward pass, and they compute all gradients automatically.
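To see what a framework automates, the chain rule can be applied by hand to a very small network. This is a sketch under stated assumptions: one tanh hidden layer with 3 units, a sigmoid output, binary cross-entropy loss, and a single made-up example. The function names `forward` and `backward` are illustrative, and the finite-difference check at the end is only there to confirm the hand-derived gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x, y):
    """Forward pass: input -> tanh hidden layer -> sigmoid output -> BCE loss."""
    W1, b1, W2, b2 = params
    a1 = np.tanh(W1 @ x + b1)      # hidden activations, shape (3,)
    y_hat = sigmoid(W2 @ a1 + b2)  # predicted probability, shape (1,)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss.item(), (a1, y_hat)

def backward(params, x, y, cache):
    """Backward pass: chain rule from the loss back to the first layer."""
    W1, b1, W2, b2 = params
    a1, y_hat = cache
    dz2 = y_hat - y                # dL/dz2 for sigmoid + cross-entropy
    dW2 = np.outer(dz2, a1)        # dL/dW2 = dL/dz2 * dz2/dW2
    db2 = dz2
    da1 = W2.T @ dz2               # gradient flowing into the hidden layer
    dz1 = da1 * (1.0 - a1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, y = np.array([0.5, -1.2]), 1.0  # one hypothetical training example

loss, cache = forward(params, x, y)
grads = backward(params, x, y, cache)

# Gradient check: compare every analytic gradient with a central difference.
h, max_diff = 1e-6, 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + h
        loss_plus, _ = forward(params, x, y)
        p[idx] = old - h
        loss_minus, _ = forward(params, x, y)
        p[idx] = old
        max_diff = max(max_diff, abs((loss_plus - loss_minus) / (2 * h) - g[idx]))

print(f"largest gap between backprop and finite differences: {max_diff:.2e}")
```

The backward pass costs about as much as one forward pass, whereas the finite-difference check needs two forward passes per weight, which is why backprop scales to millions of weights and the naive approach does not.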
Plain gradient descent is slow and sensitive to the choice of learning rate. Modern optimizers add momentum and adaptive learning rates:
SGD with Momentum: Accumulates a velocity vector in directions of persistent gradient, dampening oscillations. Converges faster than vanilla SGD.
RMSprop: Adapts the learning rate for each weight individually. Weights that receive large gradients get smaller learning rates; weights with small gradients get larger learning rates.
Adam (Adaptive Moment Estimation): Combines momentum and RMSprop. Computes first and second moment estimates of the gradients. The most widely used optimizer. Works well across many architectures with a learning rate of 0.001 as the usual default.
| Optimizer | Adaptive LR | Momentum | Best For |
|---|---|---|---|
| SGD | No | Optional | Convex problems, fine-tuning |
| SGD + Momentum | No | Yes | Most training, often matches Adam with tuning |
| RMSprop | Yes | No | Recurrent networks |
| Adam | Yes | Yes | Default choice; robust across problems |
| AdamW | Yes | Yes | Transformers, modern deep learning |
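The update rules behind two rows of the table can be written in a few lines each. The sketch below is not a benchmark: the elongated quadratic loss, the starting point, and the learning rates are assumptions, and the momentum update shown is one common formulation. The Adam constants (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used defaults.

```python
import numpy as np

def loss(w):
    """Elongated bowl: much steeper along w[1] than along w[0]."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([1.0, 25.0]) * w

def sgd(w, lr=0.02, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def sgd_momentum(w, lr=0.02, mu=0.9, steps=200):
    v = np.zeros_like(w)  # velocity: running sum of past gradient steps
    for _ in range(steps):
        v = mu * v - lr * grad(w)
        w = w + v
    return w

def adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(w)  # first moment: running mean of gradients
    v = np.zeros_like(w)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight step size
    return w

w0 = np.array([5.0, 5.0])
results = {"sgd": sgd(w0), "momentum": sgd_momentum(w0), "adam": adam(w0)}
for name, w in results.items():
    print(f"{name:>8}: final loss = {loss(w):.3e}")
```

On this toy problem momentum ends at a lower loss than plain SGD with the same learning rate, matching the description above. Adam's division by the root of the second moment is what gives each weight its own effective step size.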
In deep networks using Sigmoid or Tanh activations, backprop multiplies gradients layer by layer. Each multiplication by a Sigmoid derivative (max 0.25) shrinks the gradient. By the time you reach the first layer, the gradient is near zero.
The weights of early layers barely update, so those layers do not learn. This is the vanishing gradient problem, and it is why networks deeper than a few layers were very hard to train with Sigmoid activations.
The ReLU solution: ReLU's derivative is 1 for positive inputs and 0 for negative. When a neuron is active (positive), the gradient passes through unchanged, with no shrinkage. This simple change was a key step toward modern deep learning: combined with later advances such as residual connections, networks with 50, 100, or 150 layers became trainable.
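The shrinkage is easy to quantify. The sketch below looks only at the activation-derivative factor that each layer contributes and deliberately ignores the weight matrices (equivalent to assuming weights of magnitude 1), so it is an upper-bound illustration rather than a simulation of a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

peak = sigmoid_grad(0.0)  # the Sigmoid derivative is largest at z = 0

sigmoid_factor = {}
relu_factor = {}
for depth in (2, 5, 10, 20):
    # Best case for Sigmoid: every layer sits exactly at its peak derivative.
    sigmoid_factor[depth] = peak ** depth
    # Active ReLU units (positive input) pass the gradient through unchanged.
    relu_factor[depth] = relu_grad(1.0) ** depth
    print(f"depth {depth:>2}: sigmoid <= {sigmoid_factor[depth]:.2e}, "
          f"relu (active path) = {relu_factor[depth]:.1f}")
```

Even in the best case, ten Sigmoid layers scale the gradient by less than one millionth, while an active ReLU path leaves it at full strength.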
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Non-linear 2-class problem (two interlocking half-moons)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train with Adam optimizer
mlp_adam = MLPClassifier(hidden_layer_sizes=(64, 64), activation='relu',
                         solver='adam', learning_rate_init=0.001,
                         max_iter=500, random_state=42)
mlp_adam.fit(X_train, y_train)
print(f"Adam optimizer test accuracy: {mlp_adam.score(X_test, y_test):.4f}")
# Output: Adam optimizer test accuracy: 0.9800

# Train with SGD for comparison
mlp_sgd = MLPClassifier(hidden_layer_sizes=(64, 64), activation='relu',
                        solver='sgd', learning_rate_init=0.01,
                        momentum=0.9, max_iter=500, random_state=42)
mlp_sgd.fit(X_train, y_train)
print(f"SGD optimizer test accuracy: {mlp_sgd.score(X_test, y_test):.4f}")
# Output: SGD optimizer test accuracy: 0.9750

# Plot loss curves
plt.figure(figsize=(9, 4))
plt.plot(mlp_adam.loss_curve_, label='Adam', color='steelblue')
plt.plot(mlp_sgd.loss_curve_, label='SGD+Momentum', color='tomato')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Curves: Adam vs SGD (make_moons dataset)')
plt.legend()
plt.tight_layout()
plt.savefig('loss_curves.png', dpi=150)
plt.show()
# Adam converges faster and to a lower loss than SGD
```