PyTorch vs TensorFlow — which should a beginner choose in 2026?

PyTorch. It's now the dominant framework in research (over 70% of NeurIPS papers use it) and its 'eager execution' model means code runs line by line like normal Python — no computational graphs to wrestle with. TensorFlow has caught up with Keras 3 and eager mode, but PyTorch's debugging experience is cleaner and the community momentum is stronger for new learners. If you're targeting production deployment at scale, TensorFlow/Keras is still worth knowing, but start with PyTorch.

How much GPU do I need to follow this guide?

None, technically. The examples in this guide run fine on CPU — training takes a few minutes instead of seconds. If you want to experiment with larger models or datasets, Google Colab gives you a free T4 GPU. For serious training (ResNets, transformers), a consumer GPU like an RTX 3060 or above helps, but it's not a requirement to learn the fundamentals covered here.

What is the difference between a tensor and a NumPy array?

Functionally they're very similar: both are multi-dimensional arrays. The critical difference is that PyTorch tensors can live on a GPU and can track computational history for automatic differentiation. When you do math on tensors created with requires_grad=True, PyTorch records each operation. Later, calling .backward() uses that history to compute gradients automatically. NumPy arrays have no such machinery. You can convert between them with .numpy() and torch.from_numpy(); for CPU tensors both directions share the same memory, so changing one changes the other.
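
A minimal sketch of that round trip, assuming NumPy is installed alongside PyTorch. The variable names are made up for illustration. It shows that torch.from_numpy() shares memory with the source array, and that a tensor tracking gradients has to be detached before it can be converted back.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])      # NumPy defaults to float64
t = torch.from_numpy(arr)            # no copy: t and arr share one buffer

t[0] = 10.0                          # write through the tensor...
print(arr[0])                        # ...and the array sees the change
print(t.dtype)                       # the float64 dtype carries over

w = torch.tensor([1.0, 2.0], requires_grad=True)
back = (w * 2).detach().numpy()      # detach() first: .numpy() refuses tensors that track gradients
print(back)
```

If you need an independent copy instead of a shared buffer, torch.tensor(arr) copies the data.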

Why is my training loss not going down?

The most common reasons: learning rate is too high (loss bounces or explodes) or too low (loss barely moves), the data isn't normalized (features on wildly different scales confuse gradient descent), or there's a bug in the training loop where you forgot to call optimizer.zero_grad() before loss.backward(), causing gradient accumulation. Check those three things first. Print the loss every 10 batches rather than every epoch so you catch problems early.
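
To see why a missing zero_grad() matters, here is a small sketch on a single hand-made weight (a toy setup, not part of the guide's model). Each backward() call adds to the stored gradient rather than replacing it.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()     # gradient of 3*w with respect to w

(3 * w).backward()        # no reset in between
second = w.grad.item()    # the new gradient was added to the old one

w.grad.zero_()            # reset by hand; optimizer.zero_grad() plays this role in a training loop
(3 * w).backward()
third = w.grad.item()

print(first, second, third)
```

The second value is double the first even though the computation did not change, which is exactly the silent drift a forgotten zero_grad() introduces.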

AiTechWorlds

Building Your First Deep Learning Model with PyTorch: Practical Guide

⚡ Quick Answer

Learn to build deep learning models with PyTorch from scratch. Covers tensors, neural networks, training loops, and your first image classifier — hands-on for real beginners.

Abdullah Al Arman Emon June 5, 2026 10 min read

PyTorch is Python's dominant deep learning framework — it runs your code line by line like normal Python, instead of forcing you to build a separate computational graph first.

By the end of this guide, you'll have trained a neural network that classifies handwritten digits at over 97% accuracy. You'll understand why every line exists — not copy-paste code you can't debug when it breaks.

Deep learning is not magic. It's repeated multiplication, subtraction, and a few non-linear functions applied millions of times. PyTorch makes the plumbing invisible so you can focus on the structure.

What Makes PyTorch Different

Before PyTorch appeared in 2016, the mainstream frameworks (Theano, TensorFlow 1.x) made you declare a static computational graph: you described the entire network as a data structure, then separately ran data through it. Debugging was painful. Changing architecture mid-experiment was painful.

PyTorch uses dynamic computation graphs. Code executes immediately, like normal Python. You can print tensors mid-forward-pass, use Python if statements inside your network, and debug with standard tools.

import torch

# This runs immediately — no 'session', no 'compile' step
x = torch.tensor([1.0, 2.0, 3.0])
y = x * 2 + 1
print(y)  # tensor([3., 5., 7.])

That immediacy is why PyTorch dominates research. And why it's easier to learn on.

Setup

# Install PyTorch (CPU version — works for this guide)
pip install torch torchvision matplotlib

# GPU version (if you have CUDA-compatible GPU)
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Verify your installation:

import torch
print(torch.__version__)        # e.g., 2.3.0
print(torch.cuda.is_available()) # True if GPU is ready

Tensors: The Foundation

Everything in PyTorch is built on the tensor: a multi-dimensional array that can live on CPU or GPU and can track gradients when you ask it to (requires_grad=True).

import torch

# 0D tensor (scalar)
scalar = torch.tensor(42.0)

# 1D tensor (vector)
vector = torch.tensor([1.0, 2.0, 3.0])

# 2D tensor (matrix)
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# 3D tensor (batch of images would be 4D: batch × channels × height × width)
cube = torch.zeros(3, 4, 5)  # shape: (3, 4, 5)

print(matrix.shape)   # torch.Size([2, 2])
print(matrix.dtype)   # torch.float32
print(matrix.device)  # cpu

Tensor Operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Element-wise
print(a + b)
print(a * b)

# Matrix multiplication (the workhorse of neural networks)
print(torch.matmul(a, b))  # or a @ b

# Broadcasting — smaller tensors expand to match larger ones
bias = torch.tensor([10.0, 20.0])
print(a + bias)  # bias is added to each row automatically
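
Broadcasting depends on shape, so it helps to compare a row-shaped bias with a column-shaped one. This is a small sketch; row_bias and col_bias are illustrative names.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)

row_bias = torch.tensor([10.0, 20.0])        # shape (2,): treated as (1, 2), repeated down the rows
col_bias = torch.tensor([[10.0], [20.0]])    # shape (2, 1): repeated across the columns

by_row = a + row_bias   # [[11., 22.], [13., 24.]]
by_col = a + col_bias   # [[11., 12.], [23., 24.]]
print(by_row)
print(by_col)
```

Shapes are compared from the last dimension backwards, and a dimension of size 1 (or a missing one) is stretched to match.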

Autograd: The Magic Behind Training

This is the part that actually matters:

# requires_grad=True tells PyTorch to track operations on this tensor
x = torch.tensor(3.0, requires_grad=True)

# Some computation
y = x ** 2 + 2 * x + 1  # y = (x+1)^2

# Compute gradients — dy/dx = 2x + 2 = 8 when x=3
y.backward()

print(x.grad)  # tensor(8.)  ← the gradient

When you train a neural network, PyTorch computes gradients of the loss with respect to every weight — automatically. You never write a derivative by hand.
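
If you want to convince yourself that autograd's answer is right, you can estimate the same slope numerically. The sketch below uses plain Python and a central difference; central_difference is a helper written for this example, not a PyTorch function.

```python
def f(x):
    return x ** 2 + 2 * x + 1   # same function as the autograd example

def central_difference(func, x, h=1e-5):
    # Slope estimate from two nearby points; no calculus required
    return (func(x + h) - func(x - h)) / (2 * h)

estimate = central_difference(f, 3.0)
print(round(estimate, 4))   # agrees with x.grad from autograd
```

Comparing autograd against a finite-difference estimate like this is a common way to check a custom operation.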

Building a Neural Network

Neural networks in PyTorch are Python classes that inherit from nn.Module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)   # fully connected layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.2)  # randomly zero 20% of activations during training
    
    def forward(self, x):
        # Define the forward pass — how data flows through the network
        x = F.relu(self.fc1(x))    # ReLU activation: max(0, x)
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)            # no activation on final layer (softmax applied in loss)
        return x

# Instantiate the network
model = SimpleNet(input_size=784, hidden_size=256, output_size=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")  # ~270,000 for these sizes

What Is a Layer Actually Doing?

nn.Linear(784, 256) contains a weight matrix of shape (256, 784) and a bias vector of shape (256). The forward pass computes output = input @ weight.T + bias. That is it. Everything else — ReLU, dropout, batch norm — is either a non-linearity or a regularization trick layered on top of matrix multiplication.
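
The weight and bias shapes also explain the parameter count printed for SimpleNet. Here is the arithmetic as a short sketch in plain Python; linear_params is a helper invented for this example.

```python
def linear_params(in_features, out_features):
    weights = out_features * in_features   # weight matrix of shape (out, in)
    biases = out_features                  # one bias per output unit
    return weights + biases

layers = [(784, 256), (256, 256), (256, 10)]   # fc1, fc2, fc3 in SimpleNet
counts = [linear_params(i, o) for i, o in layers]
total = sum(counts)
print(counts, total)   # [200960, 65792, 2570] 269322
```

Dropout contributes nothing to the total because it has no learnable parameters.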

ReLU (max(0, x)) seems almost offensively simple. Why does it work? Because stacking linear layers without activation functions just gives you one big linear layer — no matter how many layers you add. Non-linearity lets the network learn curved decision boundaries.
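
The claim that stacked linear layers collapse into one can be checked by hand. The sketch below uses tiny integer matrices chosen arbitrarily for illustration and plain Python lists, so every number can be verified on paper.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(m, n):
    cols = list(zip(*n))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0, a) for a in v]

W1, b1 = [[1, -2], [3, 1]], [1, -1]
W2, b2 = [[2, 1], [-1, 4]], [0, 5]
x = [1, 2]

# Two linear layers applied one after the other
h = add(matvec(W1, x), b1)
two_layers = add(matvec(W2, h), b2)

# The single layer they collapse into: W = W2 @ W1, b = W2 @ b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# With ReLU in between, the collapse no longer holds
with_relu = add(matvec(W2, relu(h)), b2)

print(two_layers, one_layer, with_relu)   # [0, 23] [0, 23] [4, 21]
```

The first two results match for any input, while the ReLU version differs as soon as a hidden value goes negative.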

The Training Loop: Where Learning Happens

This is the section most tutorials rush. Take your time here.

import torch.optim as optim

# Loss function: measures how wrong the model is
# CrossEntropyLoss combines log-softmax + negative log-likelihood
criterion = nn.CrossEntropyLoss()

# Optimizer: decides how to update weights based on gradients
# Adam is almost always a better starting point than plain SGD
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # sets layers like Dropout to training mode
    total_loss = 0
    correct = 0
    
    for batch_idx, (data, targets) in enumerate(loader):
        data, targets = data.to(device), targets.to(device)
        
        # CRITICAL: clear gradients from previous batch
        # If you forget this, gradients accumulate and training breaks
        optimizer.zero_grad()
        
        # Forward pass: compute predictions
        outputs = model(data)
        
        # Compute loss
        loss = criterion(outputs, targets)
        
        # Backward pass: compute gradients via autograd
        loss.backward()
        
        # Update weights using the gradients
        optimizer.step()
        
        total_loss += loss.item()
        
        # Get predicted class (highest logit)
        predicted = outputs.argmax(dim=1)
        correct += predicted.eq(targets).sum().item()
    
    avg_loss = total_loss / len(loader)
    accuracy = 100.0 * correct / len(loader.dataset)
    return avg_loss, accuracy

The five-line core of every training loop:

optimizer.zero_grad(): clear old gradients
outputs = model(data): forward pass
loss = criterion(outputs, targets): measure the error
loss.backward(): compute gradients
optimizer.step(): update weights

Everything else is bookkeeping.
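
To watch those core steps work without downloading a dataset, here is a minimal sketch that fits a straight line. The toy data (y = 3x + 0.5), the MSE loss, and the SGD settings are assumptions chosen for illustration, not part of the MNIST project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data (an assumption for illustration): points on the line y = 3x + 0.5
x = torch.linspace(-1, 1, 64).unsqueeze(1)   # shape (64, 1)
y = 3 * x + 0.5

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()           # clear old gradients
    outputs = model(x)              # forward pass
    loss = criterion(outputs, y)    # measure the error
    loss.backward()                 # compute gradients
    optimizer.step()                # update weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(model.weight.item(), model.bias.item())   # should approach 3 and 0.5
```

The learned weight and bias end up close to the values used to generate the data, which is a quick way to confirm that a loop is wired correctly before moving on to a real dataset.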

Full Project: MNIST Digit Classifier

Let's put it all together. MNIST is 60,000 training images of handwritten digits (0–9), each 28×28 pixels.

Complete Code

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ── Config ────────────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS = 10
LR = 0.001
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# ── Data ──────────────────────────────────────────────────
# Standardize pixels using the MNIST training-set mean (0.1307) and std (0.3081)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root="./data", train=True,
                                download=True, transform=transform)
test_dataset  = datasets.MNIST(root="./data", train=False,
                                download=True, transform=transform)

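# num_workers=2 loads batches in background processes. On Windows and macOS,
# wrap the script in `if __name__ == "__main__":` or set num_workers=0.
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader  = DataLoader(test_dataset,  batch_size=BATCH_SIZE,
                          shuffle=False, num_workers=2)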

# ── Model ─────────────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1     = nn.Linear(784, 512)
        self.fc2     = nn.Linear(512, 256)
        self.fc3     = nn.Linear(256, 10)
        self.dropout = nn.Dropout(0.2)
        self.bn1     = nn.BatchNorm1d(512)  # normalize activations per batch
        self.bn2     = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = x.view(-1, 784)           # flatten: (batch, 1, 28, 28) → (batch, 784)
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        return self.fc3(x)

model = MNISTNet().to(DEVICE)

# ── Training ──────────────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

# Learning rate scheduler: reduce LR by half every 3 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

def evaluate(model, loader):
    model.eval()  # disables dropout, uses running stats for batchnorm
    correct = 0
    with torch.no_grad():  # don't track gradients during evaluation
        for data, targets in loader:
            data, targets = data.to(DEVICE), targets.to(DEVICE)
            outputs = model(data)
            correct += outputs.argmax(dim=1).eq(targets).sum().item()
    return 100.0 * correct / len(loader.dataset)

for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss = 0
    
    for data, targets in train_loader:
        data, targets = data.to(DEVICE), targets.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    scheduler.step()  # update learning rate
    
    train_acc = evaluate(model, train_loader)
    test_acc  = evaluate(model, test_loader)
    avg_loss  = total_loss / len(train_loader)
    
    print(f"Epoch {epoch:2d} | Loss: {avg_loss:.4f} | "
          f"Train: {train_acc:.2f}% | Test: {test_acc:.2f}%")

# ── Save model ────────────────────────────────────────────
torch.save(model.state_dict(), "mnist_model.pt")
print("Model saved.")
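
The StepLR settings in the script (step_size=3, gamma=0.5, with scheduler.step() called once per epoch) translate into a simple schedule. The sketch below reproduces it in plain Python; lr_for_epoch is a helper written for this example and is not how PyTorch implements the scheduler internally.

```python
LR = 0.001        # starting learning rate, as in the script
STEP_SIZE = 3     # epochs between decays
GAMMA = 0.5       # multiplier applied at each decay
EPOCHS = 10

def lr_for_epoch(epoch):
    # epoch is 1-based; the first decay takes effect from epoch STEP_SIZE + 1
    return LR * GAMMA ** ((epoch - 1) // STEP_SIZE)

schedule = [lr_for_epoch(e) for e in range(1, EPOCHS + 1)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Epochs 1 to 3 train at 0.001, epochs 4 to 6 at 0.0005, epochs 7 to 9 at 0.00025, and epoch 10 at 0.000125.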

Expected Results

Epoch	Train Loss	Train Acc	Test Acc
1	0.19	94.2%	97.1%
3	0.08	97.8%	97.9%
5	0.06	98.4%	98.2%
10	0.04	98.9%	98.4%

Exact numbers vary from run to run. It is normal for test accuracy to plateau slightly below train accuracy; that difference is called the generalization gap. If the gap is large (train 99%, test 85%), your model is overfitting: memorizing training data instead of learning patterns.

Common Bugs and Fixes

Loss is NaN from epoch 1: Usually means your learning rate is too high, or your input data has NaN or infinite values. Print data.isnan().any() before training.

Loss decreases then explodes: Gradient explosion. Add gradient clipping: call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) after loss.backward() and before optimizer.step().

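Model doesn't improve past random chance: Check that you're calling model.train() before the training loop and model.eval() before evaluation. In eval mode dropout is switched off so every neuron fires, and batch norm uses its running statistics. Evaluating in training mode, or training in eval mode, is a subtle bug that skews your numbers.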

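RuntimeError: Expected all tensors to be on the same device: Your data is on CPU but your model is on GPU (or vice versa). Move both to the same device: call model.to(DEVICE) once, and data = data.to(DEVICE) for every batch. Tensor.to() returns a new tensor rather than moving the original in place, so the reassignment matters.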
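
For the gradient clipping fix, placement is what matters: clip after backward() and before step(). The sketch below uses a tiny model and deliberately oversized inputs (both assumptions for illustration) so the gradient norm is clearly above the limit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Deliberately huge inputs (an assumption for illustration) to produce large gradients
data = torch.randn(8, 4) * 1000
targets = torch.zeros(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(data), targets)
loss.backward()

# Clip after backward() and before step(); the return value is the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(f"gradient norm before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

Logging the returned norm for a few hundred batches is a cheap way to pick a sensible max_norm for your own model.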

Loading a Saved Model

# Load architecture (MNISTNet and test_dataset are defined in the training script above)
model = MNISTNet()

# Load saved weights; weights_only=True restricts loading to tensors and plain containers
model.load_state_dict(torch.load("mnist_model.pt", map_location="cpu", weights_only=True))
model.eval()

# Make a prediction on a single image
sample_img, label = test_dataset[0]
with torch.no_grad():
    output = model(sample_img.unsqueeze(0))  # add batch dimension
    prediction = output.argmax().item()

print(f"True label: {label}, Predicted: {prediction}")

Upgrading to Convolutional Networks

The MLP above flattens each image into 784 numbers and ignores which pixels sit next to each other. Convolutional networks (CNNs) learn spatial patterns such as edges, curves, and shapes, which is why they dominate image tasks.

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv layers: (in_channels, out_channels, kernel_size)
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool  = nn.MaxPool2d(2)       # halve spatial dimensions
        self.fc1   = nn.Linear(64*7*7, 128)
        self.fc2   = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.25)
    
    def forward(self, x):
        # x shape: (batch, 1, 28, 28)
        x = F.relu(self.conv1(x))          # → (batch, 32, 28, 28)
        x = self.pool(x)                   # → (batch, 32, 14, 14)
        x = F.relu(self.conv2(x))          # → (batch, 64, 14, 14)
        x = self.pool(x)                   # → (batch, 64, 7, 7)
        x = x.view(-1, 64 * 7 * 7)        # flatten
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)

# This achieves ~99.3% on MNIST vs 98.4% for the MLP

The CNN beats the MLP by about 0.9 percentage points on MNIST. That is small here, but on complex image datasets the gap is much larger: well-tuned CNNs reach 95%+ on CIFAR-10, while plain fully connected networks struggle to get past roughly 60%.
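
The 64*7*7 in ConvNet's first linear layer is not a magic number; it falls out of the layer shapes. Here is that arithmetic as a plain Python sketch using the standard output-size formula. conv_out and pool_out are helpers written for this example.

```python
def conv_out(size, kernel, padding=0, stride=1):
    # Standard output-size formula for a convolution along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    # MaxPool2d(kernel) with its default stride (equal to the kernel size)
    return (size - kernel) // kernel + 1

size = 28
trace = [size]
for layer in ("conv1", "pool", "conv2", "pool"):
    if layer.startswith("conv"):
        size = conv_out(size, kernel=3, padding=1)   # 3x3 conv with padding=1 keeps the size
    else:
        size = pool_out(size, kernel=2)              # 2x2 pooling halves it
    trace.append(size)

flat_features = 64 * size * size   # 64 channels after conv2
print(trace, flat_features)        # [28, 28, 14, 14, 7] 3136
```

If you change the kernel size, padding, or input resolution, rerun this calculation and update the first nn.Linear accordingly, otherwise the flatten step will fail with a shape error.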

Next Steps

Comfortable with the workflow above? The natural progression is:

Try CIFAR-10: 10 classes of 32×32 color images, much harder than MNIST
Add data augmentation: random crops, flips, and color jitter via transforms.RandomCrop, transforms.RandomHorizontalFlip, and transforms.ColorJitter
Use pre-trained models: torchvision.models.resnet18(weights="DEFAULT") (the older pretrained=True argument is deprecated) gets you 89%+ on CIFAR-10 after fine-tuning, with minimal code (see our Transfer Learning guide)
Experiment with optimizers: try SGD with momentum vs Adam vs AdamW
Learn torch.utils.tensorboard: visualize training curves in real time

If the math here felt unfamiliar, the Deep Learning Basics quiz will show you which concepts to revisit. For the theoretical foundation of the architectures used here, the transformer architecture notes cover attention mechanisms, which are the next big concept after CNNs.

The Machine Learning course covers gradient descent, loss functions, and regularization in depth — all the concepts that make training loops go from "copy-paste" to "I understand what this is doing."

Frequently Asked Questions

You don't need to manually compute derivatives — PyTorch's autograd engine handles that automatically. But understanding what a gradient *is* conceptually (the slope of a function, the direction of steepest ascent) will make your debugging 10x faster. When your loss explodes or your model doesn't learn, knowing that the gradient tells each weight 'move this way to reduce error' helps you diagnose the issue. You can start without calculus, but reading a short intro on partial derivatives will pay off quickly.

Abdullah Al Arman Emon✓ Verified Writer

Software Testing Expert & Prompt Engineering

Ensures every release is bug-free through rigorous testing, and crafts high-precision prompts that power our AI-driven workflows. Abdullah Al Arman Emon leads QA and prompt engineering across AiTechWorlds.

Deep Learning

Building Your First Deep Learning Model with PyTorch: Practical Guide

⚡ Quick Answer

Learn to build deep learning models with PyTorch from scratch. Covers tensors, neural networks, training loops, and your first image classifier — hands-on for real beginners.

Abdullah Al Arman Emon June 5, 2026 10 min read

#pytorch #deep-learning #neural-networks #python #machine-learning #ai #beginners

📚Part of the Deep Learning guide — explore all Deep Learning articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Building Your First Deep Learning Model with PyTorch: Practical Guide

PyTorch is Python's dominant deep learning framework — it runs your code line by line like normal Python, instead of forcing you to build a separate computational graph first.

What Makes PyTorch Different

import torch

# This runs immediately — no 'session', no 'compile' step
x = torch.tensor([1.0, 2.0, 3.0])
y = x * 2 + 1
print(y)  # tensor([3., 5., 7.])

That immediacy is why PyTorch dominates research. And why it's easier to learn on.

Setup

# Install PyTorch (CPU version — works for this guide)
pip install torch torchvision matplotlib

# GPU version (if you have CUDA-compatible GPU)
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Verify your installation:

import torch
print(torch.__version__)        # e.g., 2.3.0
print(torch.cuda.is_available()) # True if GPU is ready

Tensors: The Foundation

Everything in PyTorch is a tensor — a multi-dimensional array that can live on CPU or GPU and tracks gradients.

import torch

# 0D tensor (scalar)
scalar = torch.tensor(42.0)

# 1D tensor (vector)
vector = torch.tensor([1.0, 2.0, 3.0])

# 2D tensor (matrix)
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# 3D tensor (batch of images would be 4D: batch × channels × height × width)
cube = torch.zeros(3, 4, 5)  # shape: (3, 4, 5)

print(matrix.shape)   # torch.Size([2, 2])
print(matrix.dtype)   # torch.float32
print(matrix.device)  # cpu

Tensor Operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Element-wise
print(a + b)
print(a * b)

# Matrix multiplication (the workhorse of neural networks)
print(torch.matmul(a, b))  # or a @ b

# Broadcasting — smaller tensors expand to match larger ones
bias = torch.tensor([10.0, 20.0])
print(a + bias)  # bias is added to each row automatically

Autograd: The Magic Behind Training

This is the part that actually matters:

# requires_grad=True tells PyTorch to track operations on this tensor
x = torch.tensor(3.0, requires_grad=True)

# Some computation
y = x ** 2 + 2 * x + 1  # y = (x+1)^2

# Compute gradients — dy/dx = 2x + 2 = 8 when x=3
y.backward()

print(x.grad)  # tensor(8.)  ← the gradient

When you train a neural network, PyTorch computes gradients of the loss with respect to every weight — automatically. You never write a derivative by hand.

Building a Neural Network

Neural networks in PyTorch are Python classes that inherit from nn.Module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)   # fully connected layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(0.2)  # randomly zero 20% of activations during training
    
    def forward(self, x):
        # Define the forward pass — how data flows through the network
        x = F.relu(self.fc1(x))    # ReLU activation: max(0, x)
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)            # no activation on final layer (softmax applied in loss)
        return x

# Instantiate the network
model = SimpleNet(input_size=784, hidden_size=256, output_size=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")  # ~270,000 for these sizes

What Is a Layer Actually Doing?

The Training Loop: Where Learning Happens

This is the section most tutorials rush. Take your time here.

import torch.optim as optim

# Loss function: measures how wrong the model is
# CrossEntropyLoss combines log-softmax + negative log-likelihood
criterion = nn.CrossEntropyLoss()

# Optimizer: decides how to update weights based on gradients
# Adam is almost always a better starting point than plain SGD
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # sets layers like Dropout to training mode
    total_loss = 0
    correct = 0
    
    for batch_idx, (data, targets) in enumerate(loader):
        data, targets = data.to(device), targets.to(device)
        
        # CRITICAL: clear gradients from previous batch
        # If you forget this, gradients accumulate and training breaks
        optimizer.zero_grad()
        
        # Forward pass: compute predictions
        outputs = model(data)
        
        # Compute loss
        loss = criterion(outputs, targets)
        
        # Backward pass: compute gradients via autograd
        loss.backward()
        
        # Update weights using the gradients
        optimizer.step()
        
        total_loss += loss.item()
        
        # Get predicted class (highest logit)
        predicted = outputs.argmax(dim=1)
        correct += predicted.eq(targets).sum().item()
    
    avg_loss = total_loss / len(loader)
    accuracy = 100.0 * correct / len(loader.dataset)
    return avg_loss, accuracy

The four-line core of every training loop:

optimizer.zero_grad() — clear old gradients
outputs = model(data) — forward pass
loss.backward() — compute gradients
optimizer.step() — update weights

Everything else is bookkeeping.

Full Project: MNIST Digit Classifier

Let's put it all together. MNIST is 60,000 training images of handwritten digits (0–9), each 28×28 pixels.

Training Pipeline Flow

Complete Code

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ── Config ────────────────────────────────────────────────
BATCH_SIZE = 64
EPOCHS = 10
LR = 0.001
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# ── Data ──────────────────────────────────────────────────
# Normalize to mean=0.1307, std=0.3081 (pre-computed for MNIST)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST(root="./data", train=True,
                                download=True, transform=transform)
test_dataset  = datasets.MNIST(root="./data", train=False,
                                download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader  = DataLoader(test_dataset,  batch_size=BATCH_SIZE,
                          shuffle=False, num_workers=2)

# ── Model ─────────────────────────────────────────────────
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1     = nn.Linear(784, 512)
        self.fc2     = nn.Linear(512, 256)
        self.fc3     = nn.Linear(256, 10)
        self.dropout = nn.Dropout(0.2)
        self.bn1     = nn.BatchNorm1d(512)  # normalize activations per batch
        self.bn2     = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = x.view(-1, 784)           # flatten: (batch, 1, 28, 28) → (batch, 784)
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout(x)
        return self.fc3(x)

model = MNISTNet().to(DEVICE)

# ── Training ──────────────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

# Learning rate scheduler: reduce LR by half every 3 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

def evaluate(model, loader):
    model.eval()  # disables dropout, uses running stats for batchnorm
    correct = 0
    with torch.no_grad():  # don't track gradients during evaluation
        for data, targets in loader:
            data, targets = data.to(DEVICE), targets.to(DEVICE)
            outputs = model(data)
            correct += outputs.argmax(dim=1).eq(targets).sum().item()
    return 100.0 * correct / len(loader.dataset)

for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss = 0
    
    for data, targets in train_loader:
        data, targets = data.to(DEVICE), targets.to(DEVICE)
        optimizer.zero_grad()
        loss = criterion(model(data), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    scheduler.step()  # update learning rate
    
    train_acc = evaluate(model, train_loader)
    test_acc  = evaluate(model, test_loader)
    avg_loss  = total_loss / len(train_loader)
    
    print(f"Epoch {epoch:2d} | Loss: {avg_loss:.4f} | "
          f"Train: {train_acc:.2f}% | Test: {test_acc:.2f}%")

# ── Save model ────────────────────────────────────────────
torch.save(model.state_dict(), "mnist_model.pt")
print("Model saved.")

Expected Results

Epoch	Train Loss	Train Acc	Test Acc
1	0.19	94.2%	97.1%
3	0.08	97.8%	97.9%
5	0.06	98.4%	98.2%
10	0.04	98.9%	98.4%

Common Bugs and Fixes

Loss is NaN from epoch 1: Usually means your learning rate is too high, or your input data has NaN or infinite values. Print data.isnan().any() before training.

Loss decreases then explodes: Gradient explosion. Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before optimizer.step().

RuntimeError: Expected all tensors to be on the same device: Your data is on CPU but your model is on GPU (or vice versa). Make sure both data.to(DEVICE) and model.to(DEVICE) are called.

Loading a Saved Model

# Load architecture
model = MNISTNet()

# Load saved weights
model.load_state_dict(torch.load("mnist_model.pt", map_location="cpu"))
model.eval()

# Make a prediction on a single image
import torchvision
sample_img, label = test_dataset[0]
with torch.no_grad():
    output = model(sample_img.unsqueeze(0))  # add batch dimension
    prediction = output.argmax().item()

print(f"True label: {label}, Predicted: {prediction}")

Upgrading to Convolutional Networks

The MLP above treats each pixel independently. Convolutional networks (CNNs) learn spatial patterns — edges, curves, shapes — which is why they dominate image tasks.

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv layers: (in_channels, out_channels, kernel_size)
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool  = nn.MaxPool2d(2)       # halve spatial dimensions
        self.fc1   = nn.Linear(64*7*7, 128)
        self.fc2   = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.25)
    
    def forward(self, x):
        # x shape: (batch, 1, 28, 28)
        x = F.relu(self.conv1(x))          # → (batch, 32, 28, 28)
        x = self.pool(x)                   # → (batch, 32, 14, 14)
        x = F.relu(self.conv2(x))          # → (batch, 64, 14, 14)
        x = self.pool(x)                   # → (batch, 64, 7, 7)
        x = x.view(-1, 64 * 7 * 7)        # flatten
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)

# This achieves ~99.3% on MNIST vs 98.4% for the MLP

The CNN beats the MLP by about 0.9 percentage points on MNIST. That is a small gap here, but on complex image datasets it becomes enormous: CNNs for CIFAR-10 hit 95%+, while fully connected networks struggle to get past 60%.
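The 64*7*7 input size of fc1 in ConvNet follows from the layer arithmetic: a 3×3 convolution with padding=1 and stride 1 keeps the spatial size, since (28 + 2·1 − 3)/1 + 1 = 28, and each MaxPool2d(2) halves it. A quick way to confirm this is to push a dummy tensor through the same layers and check the shapes. The sketch below rebuilds the layers standalone rather than reusing the class.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 28, 28)            # one fake grayscale MNIST image

conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2)

h1 = pool(conv1(x))                      # 28 → 28 (conv) → 14 (pool)
h2 = pool(conv2(h1))                     # 14 → 14 (conv) → 7 (pool)
flat = h2.flatten(1)                     # 64 * 7 * 7 = 3136 features per image

print(h1.shape, h2.shape, flat.shape)
```

The same trick helps when adapting the network to other input sizes: for 32×32 CIFAR-10 images the flattened size after two pools would be 64*8*8 instead.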

Next Steps

Comfortable with the workflow above? The natural progression is:

Try CIFAR-10: 10 classes of 32×32 color images, much harder than MNIST
Add data augmentation: random crops, flips, and color jitter via transforms.RandomCrop, transforms.RandomHorizontalFlip, and transforms.ColorJitter
Use pre-trained models: torchvision.models.resnet18(weights="DEFAULT") loads ImageNet weights (the older pretrained=True argument is deprecated); with the final layer replaced and fine-tuned, it gets you 89%+ on CIFAR-10 with minimal code (see our Transfer Learning guide)
Experiment with optimizers: try SGD with momentum vs Adam vs AdamW
Learn torch.utils.tensorboard: visualize training curves in real time
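For the optimizer experiment in the list above, one simple setup is to train the same model from the same seed under each optimizer and compare the final loss. The sketch below does this on a toy linear regression problem; the learning rates, step count, and weight decay are arbitrary assumptions rather than tuned recommendations, and results on a toy problem will not necessarily carry over to MNIST or CIFAR-10.

```python
import torch
import torch.nn as nn

def final_loss(make_optimizer, steps=200):
    """Train a tiny linear model on fixed synthetic data and return the last loss."""
    torch.manual_seed(0)                          # same init and data for every run
    model = nn.Linear(3, 1)
    x = torch.randn(64, 3)
    y = x @ torch.tensor([[1.0], [-2.0], [0.5]])  # known target weights
    optimizer = make_optimizer(model.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

results = {
    "SGD+momentum": final_loss(lambda p: torch.optim.SGD(p, lr=0.05, momentum=0.9)),
    "Adam": final_loss(lambda p: torch.optim.Adam(p, lr=0.05)),
    "AdamW": final_loss(lambda p: torch.optim.AdamW(p, lr=0.05, weight_decay=0.01)),
}

for name, loss in results.items():
    print(f"{name:>13}: final loss {loss:.6f}")
```

Passing the optimizer as a factory function keeps everything else identical between runs, so any difference in the final loss comes from the optimizer alone.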

Abdullah Al Arman Emon✓ Verified Writer

Software Testing Expert & Prompt Engineering
