Imagine you are dropped on a mountain in thick fog. You cannot see the summit or the valley. You cannot see more than a few steps ahead. Your goal: reach the lowest valley.

Your strategy: feel the slope under your feet. If the ground rises to your left, step right. If it rises ahead, step backward. Always move in the direction the slope decreases. Take small, careful steps. Eventually, step by step, you descend.

This is gradient descent — the algorithm that trains nearly every neural network ever built. The mountain is the loss landscape (a surface showing how wrong your model is for every possible combination of weights). The valley is the minimum loss. The slope is the gradient. And your step size is the learning rate.

Backpropagation solves a specific sub-problem: in a deep network with millions of weights, how do you figure out which direction is "downhill" for each individual weight? The answer is the chain rule of calculus, applied systematically layer by layer.

The Loss Function: What We Minimize

Before gradient descent can run, you need to define what you are minimizing. The loss function measures how wrong the model's prediction is compared to the true label.

For binary classification, the standard loss is binary cross-entropy:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

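Here y is the true label (0 or 1), ŷ is the predicted probability of class 1, and log is the natural logarithm. If the true label is 1 and the model predicts 0.95, the loss is small (about 0.05). If it predicts 0.05, the loss is large (about 3.0). The loss function translates prediction errors into a single scalar number that gradient descent can act on.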
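The two cases above can be checked numerically. This is a minimal sketch: the function name `binary_cross_entropy` and the `eps` clipping guard are illustrative assumptions, not part of the lesson.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for a single sample.

    eps is an illustrative guard that keeps y_pred away from exactly 0 or 1,
    where log() would be undefined.
    """
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

# True label is 1 in both cases; only the predicted probability changes.
loss_confident_right = binary_cross_entropy(1, 0.95)
loss_confident_wrong = binary_cross_entropy(1, 0.05)

print(f"prediction 0.95 -> loss {loss_confident_right:.4f}")  # 0.0513
print(f"prediction 0.05 -> loss {loss_confident_wrong:.4f}")  # 2.9957
```

A confident wrong prediction costs roughly 58 times more than a confident right one, which is what pushes the model away from overconfident mistakes.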

For multi-class classification, the standard choice is categorical cross-entropy; for regression, it is mean squared error. The choice of loss function is part of the model design, because it determines what the training process actually optimizes.

The Gradient: Direction of Steepest Increase

The gradient of the loss with respect to a weight is the partial derivative ∂L/∂w. It answers: "If I increase this weight by a tiny amount, does the loss go up or down, and by how much?"

Positive gradient → increasing this weight increases the loss → step down by decreasing it.
Negative gradient → increasing this weight decreases the loss → step down by increasing it.

The gradient is a vector (one value per weight). Its direction always points toward steepest increase. To minimize the loss, you step in the negative gradient direction.
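One way to make the "tiny increase" question concrete is to estimate each partial derivative by nudging one weight at a time. The sketch below uses a hypothetical two-weight toy loss; `loss` and `numerical_gradient` are names chosen for illustration.

```python
import numpy as np

def loss(w):
    """Hypothetical toy loss: a bowl whose minimum sits at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Central-difference estimate of dL/dw_i for every weight."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([3.0, 0.0])
grad = numerical_gradient(loss, w)

# Analytic gradient at this point is [2*(3-1), 4*(0+2)] = [4, 8].
print("gradient:", grad)

# Both entries are positive, so decreasing both weights lowers the loss.
w_stepped = w - 0.01 * grad
print("loss before:", loss(w), "loss after:", loss(w_stepped))
```

Real networks never compute gradients this way because it needs two loss evaluations per weight, but it is a useful check on what the gradient means.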

Gradient Descent: The Update Rule

The weight update at each step is:

w_new = w_old - α × ∂L/∂w

Where α (alpha) is the learning rate — the size of each step.

Learning rate too large: You overshoot the valley. The loss bounces around and may diverge entirely. The model never converges.

Learning rate too small: Progress is glacially slow. Training takes thousands of extra iterations. You may get stuck in a suboptimal region.

Good learning rate: The loss decreases smoothly and reaches a minimum. Common starting values: 0.001 for Adam optimizer, 0.01 for SGD.
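The three regimes can be reproduced on a one-weight problem. This sketch assumes a toy loss L(w) = (w - 3)^2 and three learning rates picked only to show each behavior; they are not recommended values for real models.

```python
def gradient_descent(alpha, w_start=0.0, steps=50):
    """Minimize the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = w_start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - alpha * grad  # w_new = w_old - alpha * dL/dw
    return w

w_good = gradient_descent(alpha=0.1)     # settles near the minimum at w = 3
w_small = gradient_descent(alpha=0.001)  # has barely moved after 50 steps
w_large = gradient_descent(alpha=1.1)    # overshoots further on every step

for name, w in [("good", w_good), ("too small", w_small), ("too large", w_large)]:
    print(f"{name:>9}: w = {w:.4f}, distance from minimum = {abs(w - 3.0):.4f}")
```

On this loss each update multiplies the distance to the minimum by (1 - 2α), so any α above 1 makes that factor larger than 1 in magnitude and the iterates diverge.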

Batch vs Mini-Batch vs Stochastic Gradient Descent

These three variants differ in how many training samples are used to compute each gradient update:

Batch Gradient Descent: Computes the gradient over the entire training set before updating. Very stable, but extremely slow for large datasets. Memory-intensive.

Stochastic Gradient Descent (SGD): Computes the gradient and updates weights for one sample at a time. Each update is cheap and memory-efficient. The updates are noisy, which can help escape shallow local minima, but convergence is erratic and single-sample updates make poor use of vectorized hardware.

Mini-Batch Gradient Descent: The practical standard. Computes gradients over a small batch (typically 32 or 64 samples) and updates. Balances speed and stability. GPUs are built for matrix operations, so processing a whole batch at once is much more efficient than processing the same samples one by one.

In practice, "SGD" in frameworks usually refers to mini-batch SGD.
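The three variants differ only in the batch size passed to the same loop. The sketch below assumes synthetic linear-regression data and a hypothetical `train` helper; the learning rate and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise
X = rng.normal(size=(1000, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=1000)

def train(batch_size, epochs=20, alpha=0.05):
    """Fit y ~ w*x + b with mean squared error, updating once per batch."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] + b - y[idx]
            grad_w = 2.0 * np.mean(error * X[idx, 0])
            grad_b = 2.0 * np.mean(error)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

w_batch, b_batch = train(batch_size=len(X))  # batch GD: 1 update per epoch
w_mini, b_mini = train(batch_size=32)        # mini-batch: 32 updates per epoch
w_sgd, b_sgd = train(batch_size=1)           # stochastic: 1000 updates per epoch

print(f"batch      : w = {w_batch:.3f}, b = {b_batch:.3f}")
print(f"mini-batch : w = {w_mini:.3f}, b = {b_mini:.3f}")
print(f"stochastic : w = {w_sgd:.3f}, b = {b_sgd:.3f}")
```

With the same number of epochs, the full-batch run makes only 20 updates in total and is still short of the true values (w = 2, b = 1), while the mini-batch run makes hundreds of cheaper updates and lands close to them.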

Backpropagation: The Chain Rule Layer by Layer

Backpropagation (Rumelhart et al., 1986) is the algorithm that efficiently computes gradients for every weight in a deep network. The core insight: use the chain rule of calculus.

If L depends on z, which depends on w, then:

∂L/∂w = (∂L/∂z) × (∂z/∂w)
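For a single sigmoid neuron trained with binary cross-entropy, the chain has one extra link: L depends on ŷ, ŷ depends on z, and z depends on w. The sketch below multiplies the three local derivatives and compares the result to a finite-difference estimate. The input, label, and weight values are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, x, b, y):
    """Binary cross-entropy of a single sigmoid neuron with z = w*x + b."""
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Arbitrary illustrative values
x, y = 2.0, 1.0
w, b = 0.5, -0.3

# Forward pass
z = w * x + b
y_hat = sigmoid(z)

# Local derivatives, one per link in the chain
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)  # derivative of the sigmoid
dz_dw = x

# Chain rule: multiply the links together
dL_dw = dL_dyhat * dyhat_dz * dz_dw

# Independent check by nudging w directly
h = 1e-6
dL_dw_numeric = (loss_for(w + h, x, b, y) - loss_for(w - h, x, b, y)) / (2 * h)

print(f"chain rule : {dL_dw:.6f}")
print(f"numeric    : {dL_dw_numeric:.6f}")
```

The product simplifies to (ŷ - y)·x for this particular pairing of sigmoid and cross-entropy, which is one reason the two are commonly used together.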

In a neural network, the loss depends on the final layer, which depends on the layer before it, which depends on the layer before that, and so on back to the first layer. Backprop applies the chain rule backwards through the network:

1. Forward pass: Compute and store activations at every layer.
2. Compute loss: Compare final output to true label.
3. Backward pass: Starting from the output layer, compute the gradient of the loss with respect to each layer's weights, using the stored activations and the chain rule.
4. Update weights: Apply the gradient descent update rule to every weight simultaneously.

This is repeated for each batch until convergence. Modern frameworks (PyTorch, TensorFlow) implement automatic differentiation — you define the forward pass, and they compute all gradients automatically.
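To see what the frameworks automate, the loop can be written by hand for a tiny network. This is a minimal sketch, not a reference implementation: the 2-8-1 architecture, the tanh hidden layer, the XOR data, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (XOR), chosen for illustration only
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Hypothetical 2 -> 8 -> 1 network with small random initial weights
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.5
losses = []
for step in range(2000):
    # Forward pass: compute and keep the activations of every layer
    a1 = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # Loss: mean binary cross-entropy (clipped to avoid log(0))
    p = np.clip(y_hat, 1e-12, 1 - 1e-12)
    losses.append(float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))))

    # Backward pass: chain rule from the output layer back to the first layer
    dz2 = (y_hat - y) / len(X)            # dL/dz2 for sigmoid + cross-entropy
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    da1 = dz2 @ W2.T                      # gradient flowing into the hidden layer
    dz1 = da1 * (1.0 - a1 ** 2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Update: apply the gradient descent rule to every parameter
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

print(f"loss at first step: {losses[0]:.4f}")
print(f"loss at last step : {losses[-1]:.4f}")
```

Note how the backward pass reuses `a1` and `y_hat` from the forward pass. That reuse is why activations are stored, and it is what keeps the cost of the backward pass close to the cost of the forward pass.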

Optimizers: Beyond Basic Gradient Descent

Plain gradient descent is slow and sensitive to the choice of learning rate. Modern optimizers add momentum and adaptive learning rates:

SGD with Momentum: Accumulates a velocity vector in directions of persistent gradient, dampening oscillations. Converges faster than vanilla SGD.

RMSprop: Adapts the learning rate for each weight individually. Weights that receive large gradients get smaller learning rates; weights with small gradients get larger learning rates.

Adam (Adaptive Moment Estimation): Combines momentum and RMSprop. Computes first and second moment estimates of gradients. The most widely used optimizer. A learning rate of 0.001 is the usual default and a reasonable starting point for many architectures.

Optimizer	Adaptive LR	Momentum	Best For
SGD	No	Optional	Convex problems, fine-tuning
SGD + Momentum	No	Yes	Most training, often matches Adam with tuning
RMSprop	Yes	No	Recurrent networks
Adam	Yes	Yes	Default choice; robust across problems
AdamW	Yes	Yes	Transformers, modern deep learning
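The update rules behind three of these optimizers fit in a few lines each. The sketch below follows the commonly published formulas and runs them on a hypothetical two-weight quadratic loss. The function names, the learning rate of 0.01, and the step count are illustrative assumptions, and library implementations differ in details.

```python
import numpy as np

def loss(w):
    """Hypothetical toy loss: much steeper along w[1] than along w[0]."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def sgd_momentum(w, steps=500, lr=0.01, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)  # velocity builds up along persistent directions
        w = w - lr * v
    return w

def rmsprop(w, steps=500, lr=0.01, rho=0.9, eps=1e-8):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g ** 2     # running average of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)  # large gradients get smaller steps
    return w

def adam(w, steps=500, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g ** 2  # second moment (RMSprop-style)
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([2.0, 1.0])
results = {
    "SGD + Momentum": sgd_momentum(w0.copy()),
    "RMSprop": rmsprop(w0.copy()),
    "Adam": adam(w0.copy()),
}
for name, w in results.items():
    print(f"{name:>14}: final loss = {loss(w):.6f}")
```

A toy quadratic says little about which optimizer suits a real network; the point is only to show where momentum and the per-weight scaling enter each update.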

The Vanishing Gradient Problem

In deep networks using Sigmoid or Tanh activations, backprop multiplies gradients layer by layer. Each multiplication by a Sigmoid derivative (max 0.25) shrinks the gradient. By the time you reach the first layer, the gradient is near zero.

The weights of early layers barely update, so those layers barely learn. This is the vanishing gradient problem, and it is a major reason why networks more than a few layers deep were very hard to train with Sigmoid activations.

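The ReLU solution: ReLU's derivative is 1 for positive inputs and 0 for negative inputs. When a neuron is active (positive input), the gradient passes through unchanged, with no shrinkage. This simple change made much deeper networks practical. Together with later advances such as better weight initialization, batch normalization, and residual connections, networks with 50, 100, or 150 layers became trainable.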
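The shrinkage is easy to quantify. This sketch multiplies only the activation derivatives along a chain of layers and ignores the weight matrices that backprop also multiplies by, so it is a simplified illustration rather than a full gradient computation. It uses the best case for Sigmoid, where every pre-activation is 0 and the derivative is at its 0.25 peak.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return (z > 0).astype(float)

depths = [1, 5, 10, 20]

# Best case for Sigmoid: derivative is exactly 0.25 at every layer
sigmoid_factors = {d: float(sigmoid_derivative(np.zeros(d)).prod()) for d in depths}

# Active ReLU units: derivative is exactly 1 at every layer
relu_factors = {d: float(relu_derivative(np.ones(d)).prod()) for d in depths}

for d in depths:
    print(f"{d:>2} layers: sigmoid factor = {sigmoid_factors[d]:.3e}, "
          f"relu factor = {relu_factors[d]:.1f}")
```

After 10 Sigmoid layers the factor is 0.25^10, just under one millionth, and that is the best case. Inactive ReLU units have derivative 0 and pass no gradient at all, so ReLU networks have their own failure mode when many units stay inactive.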

Putting It Together: Mini Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Non-linear 2-class problem (two interlocking half-moons)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Train with Adam optimizer
mlp_adam = MLPClassifier(hidden_layer_sizes=(64, 64), activation='relu',
                          solver='adam', learning_rate_init=0.001,
                          max_iter=500, random_state=42)
mlp_adam.fit(X_train, y_train)
print(f"Adam optimizer test accuracy:  {mlp_adam.score(X_test, y_test):.4f}")
# Example output: Adam optimizer test accuracy:  0.9800 (exact value may vary with the scikit-learn version)

# Train with SGD for comparison
mlp_sgd = MLPClassifier(hidden_layer_sizes=(64, 64), activation='relu',
                         solver='sgd', learning_rate_init=0.01,
                         momentum=0.9, max_iter=500, random_state=42)
mlp_sgd.fit(X_train, y_train)
print(f"SGD  optimizer test accuracy:  {mlp_sgd.score(X_test, y_test):.4f}")
# Example output: SGD  optimizer test accuracy:  0.9750 (exact value may vary with the scikit-learn version)

# Plot loss curves
plt.figure(figsize=(9, 4))
plt.plot(mlp_adam.loss_curve_, label='Adam', color='steelblue')
plt.plot(mlp_sgd.loss_curve_,  label='SGD+Momentum', color='tomato')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Curves: Adam vs SGD (make_moons dataset)')
plt.legend()
plt.tight_layout()
plt.savefig('loss_curves.png', dpi=150)
plt.show()
# In this run Adam converges faster and to a lower loss than SGD; results depend on the learning rates chosen

Key Takeaways

Gradient descent minimizes the loss by repeatedly stepping in the negative gradient direction.
The learning rate controls step size. Tune it carefully; it is often the single most important hyperparameter.
Backpropagation applies the chain rule backwards through the network to compute gradients for every weight.
Mini-batch gradient descent (batch size 32–64) is the practical default — fast and stable.
Adam is the default optimizer for most problems. Start there, then try SGD with momentum for fine-tuning.
Vanishing gradients crippled Sigmoid-based deep networks. ReLU largely mitigates this by passing gradients through active neurons unchanged.
