Building Your First Deep Learning Model with PyTorch: Practical Guide
Learn to build deep learning models with PyTorch from scratch. Covers tensors, neural networks, training loops, and your first image classifier — hands-on for real beginners.
Most tutorials start with theory. This one starts with working code.
By the end of this guide you'll have trained a neural network that classifies handwritten digits with over 97% accuracy. Along the way you'll understand why every line exists — not just copy-paste code that you can't debug when it breaks.
Deep learning is not magic. It is repeated multiplication, subtraction, and a few non-linear functions applied millions of times. PyTorch makes the plumbing invisible so you can focus on the structure.
What Makes PyTorch Different
Before PyTorch arrived in 2016, building neural networks in frameworks such as Theano or TensorFlow typically meant declaring a static computational graph: you described the entire network architecture as a data structure, then separately ran data through it. Debugging was painful. Changing architecture mid-experiment was painful.
PyTorch uses dynamic computation graphs. Code executes immediately, like normal Python. You can print tensors mid-forward-pass, use Python if statements inside your network, and debug with standard tools.
import torch
# This runs immediately — no 'session', no 'compile' step
x = torch.tensor([1.0, 2.0, 3.0])
y = x * 2 + 1
print(y) # tensor([3., 5., 7.])
That immediacy is why PyTorch dominates research. And why it's easier to learn on.
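To make the point about Python control flow concrete, here is a minimal sketch of a module whose forward pass branches on the data it receives. BranchingNet, its two layers, and the threshold of 1.0 are hypothetical names and values chosen for illustration, not part of any real architecture.

```python
import torch
import torch.nn as nn


class BranchingNet(nn.Module):
    """Hypothetical toy module: picks a branch based on the input values."""

    def __init__(self):
        super().__init__()
        self.small = nn.Linear(3, 1)
        self.large = nn.Linear(3, 1)

    def forward(self, x):
        # Ordinary Python control flow, evaluated on the actual data
        if x.abs().mean() > 1.0:
            branch = "large"
            out = self.large(x)
        else:
            branch = "small"
            out = self.small(x)
        # You can print (or set a breakpoint) here like in any Python function
        print(f"mean |x| = {x.abs().mean().item():.2f}, took branch: {branch}")
        return out, branch


net = BranchingNet()
_, branch_a = net(torch.tensor([[0.1, 0.2, 0.3]]))  # mean |x| = 0.2
_, branch_b = net(torch.tensor([[5.0, 6.0, 7.0]]))  # mean |x| = 6.0
```

The graph is rebuilt on every call, so the two inputs can take different paths through the same model.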
Setup
# Install PyTorch (CPU version — works for this guide)
pip install torch torchvision matplotlib
# GPU version (if you have CUDA-compatible GPU)
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
Verify your installation:
import torch
print(torch.__version__) # e.g., 2.3.0
print(torch.cuda.is_available()) # True if GPU is ready
Tensors: The Foundation
Everything in PyTorch is built on tensors: multi-dimensional arrays that can live on CPU or GPU and can track gradients when you ask them to.
import torch
# 0D tensor (scalar)
scalar = torch.tensor(42.0)
# 1D tensor (vector)
vector = torch.tensor([1.0, 2.0, 3.0])
# 2D tensor (matrix)
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
# 3D tensor (batch of images would be 4D: batch × channels × height × width)
cube = torch.zeros(3, 4, 5) # shape: (3, 4, 5)
print(matrix.shape) # torch.Size([2, 2])
print(matrix.dtype) # torch.float32
print(matrix.device) # cpu
Tensor Operations
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
# Element-wise
print(a + b)
print(a * b)
# Matrix multiplication (the workhorse of neural networks)
print(torch.matmul(a, b)) # or a @ b
# Broadcasting — smaller tensors expand to match larger ones
bias = torch.tensor([10.0, 20.0])
print(a + bias) # bias is added to each row automatically
Autograd: The Magic Behind Training
This is the part that actually matters:
# requires_grad=True tells PyTorch to track operations on this tensor
x = torch.tensor(3.0, requires_grad=True)
# Some computation
y = x ** 2 + 2 * x + 1 # y = (x+1)^2
# Compute gradients — dy/dx = 2x + 2 = 8 when x=3
y.backward()
print(x.grad) # tensor(8.) ← the gradient
When you train a neural network, PyTorch computes gradients of the loss with respect to every weight — automatically. You never write a derivative by hand.
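The same mechanism works when the tracked tensors are weights rather than inputs. The sketch below assumes a hypothetical one-neuron model with a squared-error loss (the values of w, b, x and target are made up) and compares autograd's result with the derivative worked out by hand.

```python
import torch

# Hypothetical one-neuron model: prediction = w . x + b
w = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x = torch.tensor([3.0, 4.0])
target = torch.tensor(10.0)

prediction = torch.dot(w, x) + b      # 1*3 + 2*4 + 0.5 = 11.5
loss = (prediction - target) ** 2     # (11.5 - 10)^2 = 2.25
loss.backward()                       # fills w.grad and b.grad

# Hand-derived gradients for comparison:
#   dloss/dw = 2 * (prediction - target) * x
#   dloss/db = 2 * (prediction - target)
error = (prediction - target).detach()
manual_grad_w = 2 * error * x         # values 9 and 12
manual_grad_b = 2 * error             # value 3

print(w.grad, manual_grad_w)
print(b.grad, manual_grad_b)
```

In a real network the chain of operations is much longer, but backward() applies the same chain rule through every step.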
Building a Neural Network
Neural networks in PyTorch are Python classes that inherit from nn.Module.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size) # fully connected layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.2) # randomly zero 20% of activations during training

    def forward(self, x):
        # Define the forward pass: how data flows through the network
        x = F.relu(self.fc1(x)) # ReLU activation: max(0, x)
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x) # no activation on final layer (softmax applied in loss)
        return x
# Instantiate the network
model = SimpleNet(input_size=784, hidden_size=256, output_size=10)
print(model)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}") # ~270,000 for these sizes
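The dropout layer in SimpleNet behaves differently in training and evaluation mode. The following sketch isolates nn.Dropout(0.2) on a tensor of ones (an arbitrary test input) to show both behaviors. The seed only makes the printed fraction repeatable on one machine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(0.2)
x = torch.ones(10_000)

dropout.train()                      # training mode: dropout is active
train_out = dropout(x)
dropped_fraction = (train_out == 0).float().mean().item()
kept_values = train_out[train_out != 0]

dropout.eval()                       # eval mode: dropout passes input through unchanged
eval_out = dropout(x)

print(f"fraction zeroed in train mode: {dropped_fraction:.3f}")        # close to 0.2
print(f"value of surviving activations: {kept_values[0].item():.2f}")  # 1 / (1 - 0.2) = 1.25
print(f"eval output unchanged: {torch.equal(eval_out, x)}")
```

Survivors are scaled by 1 / (1 - p) during training so the expected activation stays the same, which is why no rescaling is needed at evaluation time.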
What Is a Layer Actually Doing?
nn.Linear(784, 256) contains a weight matrix of shape (256, 784) and a bias vector of shape (256,). The forward pass computes output = input @ weight.T + bias. That is it. Everything else (ReLU, dropout, batch norm) is either a non-linearity or a regularization or normalization trick layered on top of matrix multiplication.
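The formula can be checked directly against the built-in layer. This is a small sketch using a random batch of four flattened images as a stand-in for real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(784, 256)
batch = torch.randn(4, 784)          # hypothetical batch of 4 flattened images

builtin_out = layer(batch)
manual_out = batch @ layer.weight.T + layer.bias

print(layer.weight.shape)            # torch.Size([256, 784])
print(layer.bias.shape)              # torch.Size([256])
print(torch.allclose(builtin_out, manual_out, atol=1e-4))

# Parameter count: 784 * 256 weights + 256 biases = 200,960
n_params = layer.weight.numel() + layer.bias.numel()
print(n_params)
```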
ReLU (max(0, x)) seems almost offensively simple. Why does it work? Because stacking linear layers without activation functions just gives you one big linear layer — no matter how many layers you add. Non-linearity lets the network learn curved decision boundaries.
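The collapse of stacked linear layers can be verified numerically. In this sketch the layer sizes (8, 16, 4) are arbitrary, and float64 is used only to keep rounding error out of the comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc1 = nn.Linear(8, 16).double()
fc2 = nn.Linear(16, 4).double()
x = torch.randn(5, 8, dtype=torch.float64)

stacked = fc2(fc1(x))                      # two layers, no activation between them

# Equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
merged_weight = fc2.weight @ fc1.weight    # shape (4, 8)
merged_bias = fc2.weight @ fc1.bias + fc2.bias
merged = x @ merged_weight.T + merged_bias

with_relu = fc2(torch.relu(fc1(x)))        # the non-linearity breaks the equivalence

print(torch.allclose(stacked, merged))     # True
print(torch.allclose(with_relu, merged))   # False
```

With the ReLU in place, no single weight matrix reproduces the output for all inputs, which is what gives depth its value.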
The Training Loop: Where Learning Happens
This is the section most tutorials rush. Take your time here.
import torch.optim as optim
# Loss function: measures how wrong the model is
# CrossEntropyLoss combines log-softmax + negative log-likelihood
criterion = nn.CrossEntropyLoss()
# Optimizer: decides how to update weights based on gradients
# Adam is usually an easier starting point than plain SGD (less learning-rate tuning)
optimizer = optim.Adam(model.parameters(), lr=0.001)
def train_epoch(model, loader, criterion, optimizer, device):
    model.train() # sets layers like Dropout to training mode
    total_loss = 0
    correct = 0
    for batch_idx, (data, targets) in enumerate(loader):
        data, targets = data.to(device), targets.to(device)
        # CRITICAL: clear gradients from previous batch
        # If you forget this, gradients accumulate and training breaks
        optimizer.zero_grad()
        # Forward pass: compute predictions
        outputs = model(data)
        # Compute loss
        loss = criterion(outputs, targets)
        # Backward pass: compute gradients via autograd
        loss.backward()
        # Update weights using the gradients
        optimizer.step()
        total_loss += loss.item()
        # Get predicted class (highest logit)
        predicted = outputs.argmax(dim=1)
        correct += predicted.eq(targets).sum().item()
    avg_loss = total_loss / len(loader)
    accuracy = 100.0 * correct / len(loader.dataset)
    return avg_loss, accuracy
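The comment about CrossEntropyLoss combining log-softmax and negative log-likelihood can be checked on a tiny example. The logits and targets below are made-up values for a batch of two samples and three classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical raw model outputs (logits) and true class indices
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

combined = nn.CrossEntropyLoss()(logits, targets)
two_step = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# By hand: average of -log(softmax probability assigned to the true class)
probs = torch.softmax(logits, dim=1)
by_hand = -torch.log(probs[torch.arange(2), targets]).mean()

print(combined.item(), two_step.item(), by_hand.item())  # three matching values
```

This is why the network's final layer returns raw logits: applying softmax yourself before this loss would apply it twice.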
The four-line core of every training loop:
- optimizer.zero_grad(): clear old gradients
- outputs = model(data): forward pass
- loss.backward(): compute gradients
- optimizer.step(): update weights
Everything else is bookkeeping.
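To see why the zero_grad() step cannot be skipped, the sketch below calls backward() twice on a hypothetical toy loss (3 * w, whose gradient is always 3) without clearing in between, then repeats after clearing.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)


def toy_loss(weight):
    # Hypothetical loss with an easy-to-read gradient: d/dw (3 * w) = 3
    return 3.0 * weight


toy_loss(w).backward()
first = w.grad.item()          # 3.0

toy_loss(w).backward()         # no zero_grad() in between
accumulated = w.grad.item()    # 6.0: the new gradient was added to the old one

optimizer.zero_grad()          # clear the stored gradient
toy_loss(w).backward()
after_reset = w.grad.item()    # 3.0 again

print(first, accumulated, after_reset)
```

Accumulation is deliberate in PyTorch (it allows summing gradients over several small batches), so clearing is left to you.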
Full Project: MNIST Digit Classifier
Let's put it all together. MNIST is a dataset of handwritten digits (0–9), each a 28×28 grayscale image: 60,000 images for training and 10,000 for testing.
[Figure: training pipeline flow]
Complete Code
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# ── Config ────────────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS = 10
LR = 0.001
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")
# ── Data ──────────────────────────────────────────────────
# Normalize to mean=0.1307, std=0.3081 (pre-computed for MNIST)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(root="./data", train=True,
                               download=True, transform=transform)
test_dataset = datasets.MNIST(root="./data", train=False,
                              download=True, transform=transform)
# num_workers=2 loads batches in background processes; on Windows and macOS,
# run the script under `if __name__ == "__main__":` or set num_workers=0
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         shuffle=False, num_workers=2)
# ── Model ─────────────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(0.2)
        self.bn1 = nn.BatchNorm1d(512) # normalize activations per batch
        self.bn2 = nn.BatchNorm1d(256)

    def forward(self, x):
        x = x.view(-1, 784) # flatten: (batch, 1, 28, 28) → (batch, 784)
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        return self.fc3(x)
model = MNISTNet().to(DEVICE)
# ── Training ──────────────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
# Learning rate scheduler: reduce LR by half every 3 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
def evaluate(model, loader):
    model.eval() # disables dropout, uses running stats for batchnorm
    correct = 0
    with torch.no_grad(): # don't track gradients during evaluation
        for data, targets in loader:
            data, targets = data.to(DEVICE), targets.to(DEVICE)
            outputs = model(data)
            correct += outputs.argmax(dim=1).eq(targets).sum().item()
    return 100.0 * correct / len(loader.dataset)
for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss = 0
    for data, targets in train_loader:
        data, targets = data.to(DEVICE), targets.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    scheduler.step() # update learning rate
    train_acc = evaluate(model, train_loader)
    test_acc = evaluate(model, test_loader)
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch:2d} | Loss: {avg_loss:.4f} | "
          f"Train: {train_acc:.2f}% | Test: {test_acc:.2f}%")
# ── Save model ────────────────────────────────────────────
torch.save(model.state_dict(), "mnist_model.pt")
print("Model saved.")
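The StepLR scheduler in the script halves the learning rate every 3 epochs. The sketch below reproduces that schedule in isolation, using a single dummy parameter with a zero gradient in place of the real model, so you can see which rate each of the 10 epochs trains with.

```python
import torch
import torch.optim as optim

# A single dummy parameter is enough to drive the optimizer and scheduler
dummy = torch.nn.Parameter(torch.zeros(1))
dummy.grad = torch.zeros_like(dummy)       # placeholder gradient so step() has work to do

optimizer = optim.Adam([dummy], lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

lr_per_epoch = []
for epoch in range(1, 11):
    lr_per_epoch.append(optimizer.param_groups[0]["lr"])  # rate used during this epoch
    optimizer.step()       # stands in for all the batches of one epoch
    scheduler.step()       # called once per epoch, as in the script

for epoch, lr in enumerate(lr_per_epoch, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Epochs 1 to 3 run at 0.001, epochs 4 to 6 at 0.0005, epochs 7 to 9 at 0.00025, and epoch 10 at 0.000125.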
Expected Results
| Epoch | Train Loss | Train Acc | Test Acc |
|---|---|---|---|
| 1 | 0.19 | 94.2% | 97.1% |
| 3 | 0.08 | 97.8% | 97.9% |
| 5 | 0.06 | 98.4% | 98.2% |
| 10 | 0.04 | 98.9% | 98.4% |
The test accuracy plateauing slightly below train accuracy is normal; the difference is called the generalization gap. If the gap is large (train 99%, test 85%), your model is overfitting: memorizing training data instead of learning patterns.
Common Bugs and Fixes
Loss is NaN from epoch 1: Usually means your learning rate is too high, or your input data has NaN or infinite values. Print data.isnan().any() before training.
Loss decreases then explodes: Gradient explosion. Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before optimizer.step().
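Model doesn't improve past random chance: Check that you call model.train() before each training epoch and model.eval() before evaluation. In eval mode dropout is switched off so all neurons fire, and batch norm uses its running statistics; leaving the model in the wrong mode is a subtle bug.
RuntimeError: Expected all tensors to be on the same device: Your data is on CPU but your model is on GPU (or vice versa). Call model.to(DEVICE) once, and reassign every batch with data = data.to(DEVICE), because calling .to() on a tensor returns a tensor on the target device rather than moving the original in place.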
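Two of the fixes above, the finite-value check and gradient clipping, can be tried out in isolation. This sketch assumes a hypothetical single-layer model and deliberately oversized inputs to provoke large gradients. It uses torch.isfinite, which catches both NaN and infinite values in one call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)                   # hypothetical stand-in model
data = torch.randn(8, 10) * 1000.0         # deliberately huge inputs -> huge gradients
targets = torch.zeros(8, 1)

# Check 1: inputs contain no NaN or infinite values
inputs_ok = torch.isfinite(data).all().item()

loss = nn.functional.mse_loss(model(data), targets)
loss.backward()

# Check 2: clip the global gradient norm (this goes before optimizer.step())
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"inputs finite: {inputs_ok}")
print(f"grad norm before clipping: {norm_before.item():.1f}")
print(f"grad norm after clipping:  {norm_after.item():.4f}")
```

clip_grad_norm_ returns the norm measured before clipping, so logging its return value is a cheap way to spot exploding gradients early.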
Loading a Saved Model
# Load architecture (the MNISTNet class must be defined or imported first)
model = MNISTNet()
# Load saved weights
model.load_state_dict(torch.load("mnist_model.pt", map_location="cpu"))
model.eval()
# Make a prediction on a single image (test_dataset comes from the training script)
sample_img, label = test_dataset[0]
with torch.no_grad():
    output = model(sample_img.unsqueeze(0)) # add batch dimension
    prediction = output.argmax().item()
print(f"True label: {label}, Predicted: {prediction}")
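If you want to test the save and load mechanics without training MNIST first, the round trip can be sketched with a small stand-in model. make_model and the file name tiny_model.pt are hypothetical; the pattern (same architecture, then load_state_dict) is the one used above.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_model():
    # Hypothetical small architecture; saving and loading must use the same one
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


original = make_model()
restored = make_model()          # fresh instance with different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_model.pt")
    torch.save(original.state_dict(), path)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

original.eval()
restored.eval()
sample = torch.randn(1, 4)
with torch.no_grad():
    same_output = torch.equal(original(sample), restored(sample))

print(f"outputs match after reload: {same_output}")
```

A state dict holds only tensors keyed by parameter name, which is why the class definition has to be available at load time.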
Upgrading to Convolutional Networks
The MLP above flattens each image into a vector of 784 numbers, so it has no built-in notion of which pixels are neighbors. Convolutional networks (CNNs) learn spatial patterns such as edges, curves, and shapes, which is why they dominate image tasks.
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv layers: (in_channels, out_channels, kernel_size)
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2) # halve spatial dimensions
        self.fc1 = nn.Linear(64*7*7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        # x shape: (batch, 1, 28, 28)
        x = F.relu(self.conv1(x)) # → (batch, 32, 28, 28)
        x = self.pool(x) # → (batch, 32, 14, 14)
        x = F.relu(self.conv2(x)) # → (batch, 64, 14, 14)
        x = self.pool(x) # → (batch, 64, 7, 7)
        x = x.view(-1, 64 * 7 * 7) # flatten
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)

# This achieves ~99.3% on MNIST vs 98.4% for the MLP
The CNN beats the MLP by ~0.9% on MNIST — small here, but on complex image datasets the gap is enormous. CNNs for CIFAR-10 hit 95%+; fully connected networks struggle past 60%.
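The 64*7*7 input size of the first fully connected layer follows from the shape comments in the forward pass. A quick way to confirm such numbers is to push a dummy batch through the layers and record the shapes, as in this sketch (the batch size of 5 is arbitrary). With kernel_size=3 and padding=1, a convolution keeps height and width: 28 + 2*1 - 3 + 1 = 28.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2)

x = torch.zeros(5, 1, 28, 28)        # dummy batch of 5 MNIST-sized images
shapes = [tuple(x.shape)]

with torch.no_grad():
    x = F.relu(conv1(x))
    shapes.append(tuple(x.shape))    # padding=1 keeps 28x28
    x = pool(x)
    shapes.append(tuple(x.shape))    # pooling halves to 14x14
    x = F.relu(conv2(x))
    shapes.append(tuple(x.shape))
    x = pool(x)
    shapes.append(tuple(x.shape))    # pooling halves to 7x7
    flat = x.flatten(start_dim=1)    # 64 * 7 * 7 = 3136 features per image

for shape in shapes:
    print(shape)
print(tuple(flat.shape))
```

If you change the kernel size, padding, or number of pooling layers, rerunning a trace like this tells you the new input size for the first linear layer.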
Next Steps
Once you're comfortable with this workflow, the natural progression is:
- Try CIFAR-10: 10 classes of 32×32 color images, much harder than MNIST
- Add data augmentation: random crops, flips, and color jitter via transforms such as transforms.RandomHorizontalFlip()
- Use pre-trained models: torchvision.models.resnet18(weights="DEFAULT") gets you 89%+ on CIFAR-10 with minimal code (see our Transfer Learning guide). Older torchvision releases used pretrained=True, which is now deprecated.
- Experiment with optimizers: try SGD with momentum vs Adam vs AdamW
- Learn torch.utils.tensorboard: visualize training curves in real time
If the math here felt unfamiliar, the Deep Learning Basics quiz will show you which concepts to revisit. For the theoretical foundation of the architectures used here, the transformer architecture notes cover attention mechanisms, which are the next big concept after CNNs.
The Machine Learning course covers gradient descent, loss functions, and regularization in depth — all the concepts that make training loops go from "copy-paste" to "I understand what this is doing."
AiTechWorlds Team