Deep Learning Explained: Neural Networks from Zero to Understanding
Most tutorials teach you the API. This guide teaches you what's actually happening inside a neural network — forward pass, backprop, and why depth matters.
Here is something most tutorials get wrong: they start with the API. You get model.fit() in the first five minutes, and you never actually understand why any of it works. Then your loss plateaus and you have no idea what to do.
This guide does the opposite. We start with the math — not to torture you, but because the math is what makes debugging possible. Once you understand what a neural network is actually computing, every training problem becomes diagnosable.
The Real Definition of a Neural Network
A neural network is a parameterized function. That's it. It takes an input vector, runs it through a series of matrix multiplications and nonlinearities, and produces an output. Training is the process of finding the parameters (weights) that make the output match what you want.
The "learning" happens through calculus — specifically, gradient descent. You measure how wrong the network is (the loss), compute how each weight contributed to that wrongness (the gradient), and nudge each weight slightly in the direction that reduces the error.
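As a minimal sketch of that measure, differentiate, nudge loop, assume a one-parameter model ŷ = w·x with a mean squared error loss. The toy data, the learning rate, and the step count are illustrative assumptions, not values from this guide.

```python
import numpy as np

# Toy data generated from y = 3x, so the weight we hope to recover is 3.0
# (an illustrative assumption).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

def loss_and_grad(w):
    """Mean squared error of the model y_hat = w * x, and its derivative dL/dw."""
    error = w * x - y
    loss = np.mean(error ** 2)
    grad = np.mean(2.0 * error * x)
    return loss, grad

w = 0.0      # initial guess
lr = 0.01    # learning rate (hypothetical value)
history = []
for step in range(200):
    loss, grad = loss_and_grad(w)
    history.append(loss)
    w -= lr * grad   # nudge w against the gradient

print(f"w = {w:.4f}, first loss = {history[0]:.4f}, last loss = {history[-1]:.8f}")
```

A real network repeats exactly this update, only with millions of weights and with the gradient supplied by backpropagation instead of a hand-derived formula.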
This was not a new idea when deep learning took off. Rosenblatt's perceptron in 1957 already adjusted the weights of a single layer based on its errors. What changed in the 1980s was the discovery (or rediscovery) of how to efficiently compute gradients through multiple layers. That algorithm is backpropagation.
Forward Pass: What the Network Computes
Take a single neuron. It computes a weighted sum of its inputs, adds a bias term, and passes the result through an activation function:
output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
In matrix notation, for an entire layer with m neurons receiving n inputs:
z = Wx + b # linear transformation, shape [m]
a = activation(z) # elementwise nonlinearity, shape [m]
Stack several of these layers and you have a "deep" network. Each layer transforms the representation from the previous layer into something more useful for the final prediction.
Here is a minimal NumPy implementation to make this concrete:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class DenseLayer:
    def __init__(self, in_features, out_features):
        # He initialization (scale = sqrt(2 / fan_in)), suited to ReLU layers;
        # helps prevent vanishing/exploding activations and gradients
        scale = np.sqrt(2.0 / in_features)
        self.W = np.random.randn(out_features, in_features) * scale
        self.b = np.zeros(out_features)

    def forward(self, x):
        self.x = x  # cache for backward pass
        self.z = self.W @ x + self.b
        self.a = relu(self.z)
        return self.a
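Why this initialization? If weights are too large, activations explode as you go deeper. Too small, and they vanish to zero. Scaling weights by sqrt(2/fan_in) keeps the variance of activations roughly constant across ReLU layers. Strictly speaking, this scale is He (Kaiming) initialization, derived by He et al. in 2015 for ReLU networks. Xavier (Glorot) initialization, from Glorot and Bengio's 2010 paper, uses a variance of 2/(fan_in + fan_out) and was derived for symmetric activations such as tanh.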
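The effect of the weight scale is easy to observe numerically. The sketch below pushes one random input through a stack of bias-free ReLU layers and reports the spread of the final activations for three scales. The helper name `final_activation_std`, the width of 256, the depth of 20, and the two "bad" scales are all illustrative assumptions.

```python
import numpy as np

def final_activation_std(scale_fn, depth=20, width=256, seed=0):
    """Push one random input through `depth` ReLU layers and return the
    standard deviation of the last layer's activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        a = np.maximum(0, W @ a)
    return a.std()

too_small = final_activation_std(lambda fan_in: 0.01)
too_large = final_activation_std(lambda fan_in: 1.0)
fan_in_scaled = final_activation_std(lambda fan_in: np.sqrt(2.0 / fan_in))

print(f"scale 0.01:        {too_small:.3e}")
print(f"scale 1.0:         {too_large:.3e}")
print(f"sqrt(2 / fan_in):  {fan_in_scaled:.3e}")
```

The first two scales should land many orders of magnitude away from 1, while the sqrt(2 / fan_in) scale should stay within a small factor of it.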
Why Depth Matters: The Hierarchy of Representations
The "deep" in deep learning refers to depth of composition, not depth in any mystical sense. Each layer learns a different level of abstraction.
For an image recognition network, the first layers learn to detect edges and colors. Middle layers detect textures and parts (an eye, a wheel). Final layers detect objects and scenes. This hierarchy emerges automatically from training — nobody tells layer 3 to look for eyes.
This is empirically verified. In their well-known 2014 visualization paper, Zeiler and Fergus used a deconvolutional network to project a CNN's feature activations back to pixel space, and showed alongside them the image patches that most strongly activate each feature, layer by layer.
The depth advantage is computational. A two-layer network that computes a certain function might need exponentially many neurons. A deeper network can compute the same function with far fewer total parameters by reusing intermediate computations across layers. This is the same reason why circuit complexity favors depth over width.
Loss Functions: Defining "Wrong"
Before you can fix the network's mistakes, you need to quantify them. The loss function is that quantification.
For binary classification:
Binary Cross-Entropy = -(y log(ŷ) + (1-y) log(1-ŷ))
For multi-class classification (softmax output):
Categorical Cross-Entropy = -Σᵢ yᵢ log(ŷᵢ)
For regression:
Mean Squared Error = (1/n) Σ (y - ŷ)²
The choice of loss function encodes what you care about. MSE penalizes large errors quadratically: a prediction off by 10 hurts 100x more than one off by 1. For tasks where outliers should not dominate, Huber loss blends L2 and L1: it is quadratic for errors smaller than a threshold δ and linear beyond it, so a single large error cannot dominate the total.
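The formulas above translate almost line for line into NumPy. This is a minimal sketch rather than a reference implementation: the losses are averaged over a batch, the `EPS` clipping constant and the Huber threshold `delta=1.0` are assumed values, and the toy arrays exist only to show how one outlier affects MSE and Huber differently.

```python
import numpy as np

EPS = 1e-12  # keeps log() away from log(0); the exact value is an arbitrary choice

def binary_cross_entropy(y, y_hat):
    y_hat = np.clip(y_hat, EPS, 1 - EPS)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat):
    # y_onehot and y_hat have shape [batch, classes]
    y_hat = np.clip(y_hat, EPS, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2               # used where err <= delta
    linear = delta * (err - 0.5 * delta)     # used where err > delta
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([0.0, 0.0, 0.0, 0.0])
small_miss = np.array([1.0, 1.0, 1.0, 1.0])
one_outlier = np.array([1.0, 1.0, 1.0, 10.0])
print("MSE:  ", mse(y_true, small_miss), "->", mse(y_true, one_outlier))      # 1.0 -> 25.75
print("Huber:", huber(y_true, small_miss), "->", huber(y_true, one_outlier))  # 0.5 -> 2.75
```

Turning one prediction into an outlier multiplies the MSE by about 26 but the Huber loss by only 5.5, which is the practical meaning of "outliers should not dominate".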
Backpropagation: The Chain Rule Applied
This is the part most tutorials skip or hand-wave. Let's not.
The goal: compute ∂L/∂W for every weight matrix W in the network. Gradient descent then updates: W ← W - lr * ∂L/∂W.
The chain rule says: if L = f(g(x)), then dL/dx = (dL/dg) * (dg/dx). With more nested functions, the product simply gains one factor per function.
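For a two-layer network with layers z₁ = W₁x, a₁ = relu(z₁), z₂ = W₂a₁, L = loss(z₂), write δ₂ = ∂L/∂z₂ for the gradient at the output:
∂L/∂W₂ = ∂L/∂z₂ · ∂z₂/∂W₂ = δ₂ a₁ᵀ
∂L/∂W₁ = ∂L/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁
= ((W₂ᵀδ₂) ⊙ relu'(z₁)) xᵀ = δ₁ xᵀ, where δ₁ = (W₂ᵀδ₂) ⊙ relu'(z₁)
Here ⊙ is elementwise multiplication, δ₂ a₁ᵀ and δ₁ xᵀ are outer products, and relu'(z) = 1 if z > 0 else 0 is the gradient of ReLU.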
The critical insight: gradients flow backwards through the same path the forward computation used, reusing cached activations. This is why self.x = x appears in the forward pass above — backprop needs it.
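One way to build trust in these formulas is to compare them against finite differences. The sketch below does that for the bias-free two-layer network described above. The squared-error loss L = ½‖z₂ − target‖², the layer sizes, and the helper name `numerical_grad` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input, shape [4]
target = rng.standard_normal(2)     # regression target, shape [2]
W1 = rng.standard_normal((3, 4))    # first layer, shape [3, 4]
W2 = rng.standard_normal((2, 3))    # second layer, shape [2, 3]

def loss_fn(W1, W2):
    """L = 0.5 * ||z2 - target||^2 for the bias-free two-layer network."""
    z1 = W1 @ x
    a1 = np.maximum(0, z1)
    z2 = W2 @ a1
    return 0.5 * np.sum((z2 - target) ** 2)

# Analytic gradients from the chain-rule formulas
z1 = W1 @ x
a1 = np.maximum(0, z1)
z2 = W2 @ a1
delta2 = z2 - target                    # dL/dz2 for this particular loss
dW2 = np.outer(delta2, a1)              # delta2 a1^T
delta1 = (W2.T @ delta2) * (z1 > 0)     # (W2^T delta2) elementwise relu'(z1)
dW1 = np.outer(delta1, x)               # delta1 x^T

def numerical_grad(f, W, h=1e-6):
    """Central finite differences, perturbing one weight at a time in place."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus = f()
        W[idx] = old - h
        minus = f()
        W[idx] = old                    # restore the original weight
        grad[idx] = (plus - minus) / (2 * h)
    return grad

num_dW1 = numerical_grad(lambda: loss_fn(W1, W2), W1)
num_dW2 = numerical_grad(lambda: loss_fn(W1, W2), W2)
print("max |dW1 - numeric|:", np.abs(dW1 - num_dW1).max())
print("max |dW2 - numeric|:", np.abs(dW2 - num_dW2).max())
```

Both differences should be tiny. The same check is a useful debugging tool whenever you write a backward pass by hand.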
Here is the backward pass for the dense layer:
class DenseLayer:
    # ... (forward pass as above)

    def backward(self, dL_da):
        # Gradient of ReLU: zero where pre-activation was negative
        dL_dz = dL_da * (self.z > 0).astype(float)
        # Gradient w.r.t. weights and bias
        self.dL_dW = np.outer(dL_dz, self.x)  # shape [out, in]
        self.dL_db = dL_dz                    # shape [out]
        # Gradient to pass to previous layer
        dL_dx = self.W.T @ dL_dz              # shape [in]
        return dL_dx
Notice how dL_dx — the gradient passed backward — depends on the transpose of the weight matrix. Geometrically, the forward pass projects through W, and the backward pass projects back through Wᵀ. The symmetry is elegant.
Network Architecture: Visualizing the Flow
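The network used in the PyTorch code below is a standard fully-connected (dense) network for MNIST digit classification, with layer sizes 784 → 256 → 128 → 64 → 10. Each arrow represents a weight matrix. The weights alone number 784×256 + 256×128 + 128×64 + 64×10 = 242,304; adding the 458 bias terms gives 242,762 parameters.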
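The arithmetic in the paragraph above can be checked with a small helper. The function name `count_dense_parameters` is hypothetical; it simply sums one weight matrix and, optionally, one bias vector per layer.

```python
def count_dense_parameters(layer_sizes, include_bias=True):
    """Total parameters of a fully-connected network with the given layer widths."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out      # weight matrix
        if include_bias:
            total += fan_out           # bias vector
    return total

mnist_mlp = [784, 256, 128, 64, 10]
print(count_dense_parameters(mnist_mlp, include_bias=False))  # 242304
print(count_dense_parameters(mnist_mlp))                      # 242762
```

The first layer accounts for over 80% of the total, which is typical when a dense layer sits directly on raw pixels.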
Training Loop in PyTorch
PyTorch handles the gradient computation automatically via autograd. Understanding what it's doing under the hood (as we just covered) lets you debug it when things go wrong.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# --- Model Definition ---
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.3),  # regularization
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
            # No softmax here: CrossEntropyLoss includes it
        )

    def forward(self, x):
        return self.network(x)

# --- Data ---
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean/std
])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# --- Training ---
model = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, correct = 0, 0
    for batch_x, batch_y in loader:
        # Forward pass
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        # Backward pass: PyTorch computes all gradients
        optimizer.zero_grad()  # clear gradients from last step
        loss.backward()        # backprop
        # Gradient clipping prevents exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()       # gradient descent step
        total_loss += loss.item()
        correct += (logits.argmax(dim=1) == batch_y).sum().item()
    n = len(loader.dataset)
    return total_loss / len(loader), correct / n

for epoch in range(10):
    loss, acc = train_epoch(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1:2d} | Loss: {loss:.4f} | Acc: {acc:.4f}")
In summary, each training step follows the same flow: forward pass → loss → zero the gradients → backward pass → gradient clipping → optimizer step.
Activation Functions: Why Not Just Linear?
Without nonlinear activation functions, no matter how many layers you stack, the entire network collapses to a single matrix multiplication. The product of matrices is a matrix. Depth provides nothing.
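The collapse can be verified directly. In this sketch (layer sizes and random values are arbitrary assumptions), two stacked linear layers with biases are rewritten as one layer with W = W₂W₁ and b = W₂b₁ + b₂. Inserting a ReLU between them is what breaks the equivalence in general.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)

# Two stacked layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the single-layer rewrite generally no longer matches
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, one_layer))
```

The second comparison only agrees in the special case where every pre-activation happens to be positive, so the ReLU never clips anything.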
| Activation | Formula | Dead Neurons? | Output Range | Use Case |
|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | Yes (saturation) | (0, 1) | Binary output |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | Yes (saturation) | (-1, 1) | RNNs |
| ReLU | max(0, x) | Yes (x<0) | [0, ∞) | Hidden layers |
| Leaky ReLU | max(0.01x, x) | No | (-∞, ∞) | When dead ReLU is a problem |
| GELU | x·Φ(x) | No | ≈(-0.17, ∞) | Transformers |
| SiLU/Swish | x·sigmoid(x) | No | ≈(-0.28, ∞) | Modern architectures |
The "dead ReLU" problem is worth understanding. When a ReLU neuron's input is always negative, its output is always zero, its gradient is always zero, and it never learns. With a learning rate that's too high, many neurons can become permanently dead. Leaky ReLU and GELU avoid this.
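A quick diagnostic is to measure what fraction of a layer's ReLU units never fire on a batch of data. The helper `dead_fraction`, the toy layer, and the bias of -50 that stands in for "an oversized update pushed this layer too far" are all illustrative assumptions.

```python
import numpy as np

def dead_fraction(W, b, X):
    """Fraction of ReLU units that output zero for every row of X."""
    Z = X @ W.T + b                       # pre-activations, shape [n_samples, n_units]
    never_active = (Z <= 0).all(axis=0)   # True for units that are zero on all samples
    return never_active.mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 32))                     # toy inputs
W = rng.standard_normal((64, 32)) * np.sqrt(2.0 / 32)  # toy layer weights
b_healthy = np.zeros(64)
b_pushed = np.full(64, -50.0)   # mimics biases driven far negative by a bad update

print("healthy layer:", dead_fraction(W, b_healthy, X))
print("pushed layer: ", dead_fraction(W, b_pushed, X))
```

Units counted as dead here receive zero gradient from every sample in the batch, so no later update computed on that data can revive them.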
Regularization: Preventing Memorization
A neural network with enough parameters can memorize the training set perfectly. This is not useful — you want the network to generalize to new data.
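Dropout (Srivastava et al., 2014): randomly zero out a fraction of neurons during training. This prevents neurons from co-adapting, since no neuron can rely on specific other neurons always being present. In the original formulation, outputs are scaled by the keep probability at inference. Modern frameworks such as PyTorch use the equivalent "inverted" form instead: surviving activations are scaled up by 1/(1-p) during training, so inference needs no adjustment.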
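The sketch below shows the "inverted" variant of dropout used by frameworks such as PyTorch. It moves the rescaling into training (dividing the surviving activations by the keep probability) so that inference leaves activations untouched. In expectation this matches scaling by the keep probability at inference. The function name and the test array are illustrative assumptions.

```python
import numpy as np

def dropout(a, p_drop, training, rng):
    """Inverted dropout: rescale during training so inference is the identity."""
    if not training or p_drop == 0.0:
        return a
    keep = 1.0 - p_drop
    mask = rng.random(a.shape) < keep   # True for units that survive
    return a * mask / keep

rng = np.random.default_rng(0)
a = np.ones(100_000)
train_out = dropout(a, p_drop=0.3, training=True, rng=rng)
eval_out = dropout(a, p_drop=0.3, training=False, rng=rng)

print("fraction zeroed in training: ", np.mean(train_out == 0))
print("mean activation in training: ", train_out.mean())
print("mean activation at inference:", eval_out.mean())
```

About 30% of the units are zeroed during training, yet the mean activation stays close to 1, so the next layer sees inputs on the same scale in both modes.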
Weight Decay (L2 Regularization): add λ||W||² to the loss. This penalizes large weights, encouraging the network to spread evidence across many features rather than relying on a few.
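To see why the penalty is called "decay", note that the gradient of λ||W||² is 2λW, so every plain SGD step shrinks the weights toward zero. The sketch below isolates that effect by pretending the data gradient is zero. The function name `sgd_step` and the values of `lr` and `weight_decay` are hypothetical.

```python
import numpy as np

def sgd_step(W, grad, lr, weight_decay):
    # d/dW of weight_decay * ||W||^2 is 2 * weight_decay * W
    return W - lr * (grad + 2.0 * weight_decay * W)

W_start = np.array([[4.0, -2.0], [0.5, 0.0]])
W = W_start.copy()
zero_grad = np.zeros_like(W)    # pretend the data loss is flat here
for _ in range(100):
    W = sgd_step(W, zero_grad, lr=0.1, weight_decay=0.05)

print(W)                        # every entry shrunk by the same factor
print(W / W_start[0, 0])        # relative to the largest starting weight
```

Each step multiplies the weights by (1 - 2 · lr · weight_decay) = 0.99, so after 100 steps they have decayed to roughly 37% of their starting values. Weights survive only if the data gradient keeps pushing them back.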
Batch Normalization (Ioffe & Szegedy, 2015): normalize activations within each mini-batch to have zero mean and unit variance, then apply learnable scale and shift. This smooths the loss landscape dramatically, allowing higher learning rates and reducing sensitivity to initialization.
# BatchNorm in practice: usually placed after Linear, before activation
nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),  # normalize across batch dimension
    nn.ReLU(),
)
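What the normalization layer computes in training mode can be sketched in a few lines of NumPy. This is a simplified illustration: it omits the running statistics that a real layer tracks for use at inference, and `eps=1e-5` and the badly scaled toy input are assumptions.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for x of shape [batch, features]."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature (biased) variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128)) * 5.0 + 3.0   # badly scaled activations
out = batchnorm_forward(x, gamma=np.ones(128), beta=np.zeros(128))

print("per-feature mean close to 0:", np.allclose(out.mean(axis=0), 0.0, atol=1e-6))
print("per-feature std close to 1: ", np.allclose(out.std(axis=0), 1.0, atol=1e-3))
```

With gamma = 1 and beta = 0 the output is purely normalized. During training the network is free to learn other values if a different scale or offset works better.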
Common Benchmark Results
To ground your expectations, here are results on standard benchmarks for fully-connected networks vs. the deeper architectures we'll cover in subsequent articles:
| Model | Dataset | Test Accuracy | Parameters |
|---|---|---|---|
| Logistic Regression | MNIST | 92.6% | 7,850 |
| 2-Layer MLP (256h) | MNIST | 97.7% | 203,530 |
| 4-Layer MLP (512h) | MNIST | 98.3% | 669,706 |
| LeNet-5 (CNN) | MNIST | 99.2% | 60,000 |
| ResNet-18 (CNN) | CIFAR-10 | 93.0% | 11.7M |
| ViT-B/16 (Transformer) | ImageNet | 81.8% | 86M |
The jump from logistic regression to a 4-layer MLP — 5.7 percentage points — comes purely from learning hierarchical nonlinear features. The further jump to CNNs comes from incorporating the spatial structure of images, which we cover in the next article.
Practical Debugging Checklist
When your network is not learning, work through this list:
- Loss is not decreasing at all: check learning rate (try 10x higher/lower), check that optimizer.zero_grad() is called, verify labels are correct
- Loss decreases then plateaus: try learning rate scheduling (torch.optim.lr_scheduler.CosineAnnealingLR), add more capacity, check for data quality issues
- Training loss low but validation loss high: add dropout, weight decay, data augmentation, or collect more training data
- Loss is nan: learning rate too high, check for log(0) in your loss, check for inf in inputs
- Accuracy stuck at 1/num_classes: class imbalance, or the optimizer is not receiving gradients (check requires_grad)
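For the plateau case, it helps to know what a cosine schedule actually does to the learning rate. The sketch below implements the standard closed-form cosine annealing curve in plain Python. The function name `cosine_annealing_lr` and the 10-epoch horizon are illustrative assumptions, and this is a standalone illustration rather than PyTorch's own implementation.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Closed-form cosine schedule: lr_max at step 0, lr_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

for epoch in range(0, 11, 2):
    print(epoch, f"{cosine_annealing_lr(epoch, total_steps=10):.6f}")
```

The rate starts at lr_max, falls slowly at first, fastest in the middle, and flattens out near lr_min. That gives the optimizer large steps early and fine adjustments late.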
These diagnostics come from understanding the math. Without it, you're guessing.
Where to Go From Here
Fully-connected networks are the foundation, but they are rarely the right architecture for specific data types. Images have spatial structure that MLPs ignore — convolutional networks exploit it explicitly. Sequences have temporal dependencies that MLPs cannot handle — recurrent networks and transformers capture them.
Check out the Deep Learning Quiz to test your understanding of these concepts, then continue to Convolutional Neural Networks Explained to see how architecture shapes learning.
For the mathematical foundations, the ML Algorithms Quiz covers the key concepts, and the Machine Learning course provides structured practice.
The ML category has additional articles on specific architectures and techniques, and if you want to understand how transformers fit in, transformer architecture notes are a good companion to this material.
AiTechWorlds Team