Computer Vision Tutorial: Build an Image Classifier from Scratch
Computer vision tutorial for beginners — build a real image classifier using CNNs and PyTorch, understand how computers see images, and learn transfer learning for production results.
The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.
Understanding why it got that wrong — the features it was using, what the filters were detecting, where the training distribution didn't match my data — required actually understanding how the network worked, not just using it.
This tutorial builds a complete image classification system in PyTorch, from loading images to running predictions with a trained model, with enough explanation of the underlying mechanics that you can debug and improve your own models. We'll start with a CNN from scratch, look at what's happening inside it, and then use transfer learning to get production-quality results.
How Computers "See" Images
Before code, the mental model. A color image is a 3D array of numbers:
Image: 224 × 224 × 3
       (height × width × channels, where the channels are R, G, B)
Each pixel: [red_value, green_value, blue_value], each value between 0 and 255
So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers
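To check that arithmetic, here is a minimal sketch using NumPy (an assumption; the tutorial itself works with PyTorch tensors). The image is a randomly generated stand-in, not a real photo. It also shows the channels-first layout that PyTorch layers expect.

```python
import numpy as np

# Hypothetical stand-in for a decoded photo: random 8-bit RGB values
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3): height, width, channels
print(image.size)    # 150528 numbers in total
print(image[0, 0])   # the [red, green, blue] values of the top-left pixel

# PyTorch layers expect channels first, so (H, W, C) becomes (C, H, W)
channels_first = np.transpose(image, (2, 0, 1))
print(channels_first.shape)  # (3, 224, 224)
```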
A neural network sees this as a tensor of 150K numbers. The challenge: how do you build a network that recognizes "cat" from these 150K numbers, regardless of where in the image the cat is, what size it is, and how it's lit?
The answer: Convolutional Neural Networks.
How Convolutional Layers Work
A convolution slides a small filter across the image, computing a weighted sum at each position:
Input (5×5 image):    3×3 filter:    Output (3×3):
1 2 3 4 5             1 0 -1         [sum1 sum2 sum3]
1 2 3 4 5             1 0 -1         [sum4 sum5 sum6]
1 2 3 4 5             1 0 -1         [sum7 sum8 sum9]
1 2 3 4 5
1 2 3 4 5
Position (0,0): sum1 = 1×1 + 2×0 + 3×(-1) + 1×1 + 2×0 + 3×(-1) + 1×1 + 2×0 + 3×(-1) = -6
This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.
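The sliding-window arithmetic can be sketched in a few lines of NumPy. This is an illustrative implementation, not how PyTorch computes convolutions internally; the helper name `convolve2d_valid` and the second test image `edge_image` are made up for this example. Like deep learning frameworks, it applies the filter without flipping it (strictly speaking, a cross-correlation).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kh, col:col + kw]
            output[row, col] = np.sum(patch * kernel)
    return output

# The 5x5 input and 3x3 filter from the diagram above
image = np.tile(np.array([1, 2, 3, 4, 5]), (5, 1))
kernel = np.array([[1, 0, -1]] * 3)
result = convolve2d_valid(image, kernel)
print(result)  # every entry is -6: intensity rises at the same rate everywhere

# Hypothetical image with a sharp vertical edge: dark left, bright right
edge_image = np.array([[0, 0, 0, 9, 9]] * 5)
edge_result = convolve2d_valid(edge_image, kernel)
print(edge_result)  # each row is [0, -27, -27]: nonzero only where the patch spans the edge
```

The constant output for the first image reflects its smooth left-to-right gradient. The second image shows the filter responding only where a patch overlaps the edge.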
The magic: in a CNN, we don't choose the filters manually. We initialize them randomly and let gradient descent learn which filters are useful for the task. After training on millions of images, the filters that emerge detect meaningful visual features.
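To make "let gradient descent learn the filters" concrete, here is a toy sketch under stated assumptions: the training images are random noise, the target outputs come from the vertical edge filter shown earlier, and the learning rate and step count are arbitrary choices. A real CNN learns filters from a classification loss, not by matching a known filter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# The filter we pretend not to know: the vertical edge detector from above
target_filter = torch.tensor([[1.0, 0.0, -1.0]] * 3).view(1, 1, 3, 3)

# Hypothetical training data: random grayscale images and the outputs we want
images = torch.rand(64, 1, 8, 8)
targets = F.conv2d(images, target_filter)

# Start from a random filter and let gradient descent adjust it
learned_filter = torch.randn(1, 1, 3, 3, requires_grad=True)
optimizer = torch.optim.SGD([learned_filter], lr=0.1)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = F.mse_loss(F.conv2d(images, learned_filter), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(learned_filter.detach().view(3, 3))  # should move toward the target filter
```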
Building a CNN with PyTorch
Setup
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
Data Loading and Augmentation
# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),        # Flip with 50% probability
    transforms.RandomRotation(15),            # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness, contrast, saturation
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),                    # Convert to PyTorch tensor
    transforms.Normalize(                     # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation, only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
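To see what the `ToTensor` and `Normalize` steps do to individual pixel values, here is a small plain-Python sketch of the same arithmetic. The helper `normalize_pixel` is hypothetical and only mirrors the math; the real transforms operate on whole tensors.

```python
# ImageNet channel statistics used in the transforms above
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb_0_255):
    """Mirror ToTensor (scale 0-255 to 0-1) followed by Normalize ((x - mean) / std)."""
    scaled = [value / 255.0 for value in rgb_0_255]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])

print([round(v, 2) for v in white])  # roughly [2.25, 2.43, 2.64]
print([round(v, 2) for v in black])  # roughly [-2.12, -2.04, -1.8]
```

After normalization, inputs are roughly centered on zero with a spread of a few units per channel. Pretrained ImageNet models expect this range.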
Custom CNN Architecture
class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extraction layers
        # Shape comments are (channels, height, width) for a 3×224×224 input
        self.features = nn.Sequential(
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),     # (32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (32, 112, 112)
            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),    # (64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (64, 56, 56)
            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),   # (128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (128, 28, 28)
            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1),  # (256, 28, 28)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (256, 14, 14)
        )
        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (256, 1, 1), global average pooling
            nn.Flatten(),                  # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
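The shape comments in the architecture can be verified with the standard output-size formula for convolution and pooling layers, out = floor((in + 2·padding − kernel) / stride) + 1. The sketch below is plain Python; the helper `conv_output_size` is a hypothetical name introduced only for this check.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 224
trace = [(3, size, size)]  # (channels, height, width) of the input image

for out_channels in (32, 64, 128, 256):
    # 3x3 conv with padding=1 keeps the spatial size unchanged
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)
    # 2x2 max pool with stride 2 halves it
    size = conv_output_size(size, kernel_size=2, stride=2)
    trace.append((out_channels, size, size))

for shape in trace:
    print(shape)
# (3, 224, 224), (32, 112, 112), (64, 56, 56), (128, 28, 28), (256, 14, 14)
```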
Training Loop
def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)

        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)

        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')

        scheduler.step()
        print(f"Epoch {epoch+1}/{epochs} - "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")

    return history

history = train_model(model, train_loader, val_loader, epochs=20)
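The `StepLR` scheduler in the loop multiplies the learning rate by `gamma` every `step_size` epochs. As a sketch of the resulting schedule, the plain-Python function below (a hypothetical helper, not part of PyTorch) reproduces the arithmetic for the settings used above.

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate used during a given 0-indexed epoch under a StepLR-style decay."""
    return initial_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, epoch) for epoch in range(20)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"Epoch {epoch:2d}: lr = {lr:.0e}")
# Epochs 1-7 use 1e-03, epochs 8-14 use 1e-04, epochs 15-20 use 1e-05
```

With only 20 epochs the rate drops twice, so most of the learning happens in the first 7 epochs and the later ones mainly refine the weights.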
Transfer Learning: The Right Way to Do Computer Vision
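Training from scratch on a small dataset usually gives mediocre results, because the network has to learn basic visual features such as edges and textures from very few examples. Transfer learning starts from a model pretrained on ImageNet (about 1.2 million training images, 1,000 classes) and typically improves results substantially when data is limited.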
Using ResNet50 with Transfer Learning
from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load pretrained ResNet50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        # Freeze all pretrained layers (only the new classifier will train)
        for param in model.parameters():
            param.requires_grad = False
    # Replace the final layer with a head for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, 512) -> ReLU -> Dropout -> Linear(512, num_classes)
    # Newly created layers have requires_grad=True by default
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    return model.to(device)

# Phase 1: Train only the classifier head
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes),
    freeze_backbone=True
)
# train_model creates its own optimizer internally. Frozen parameters
# receive no gradients, so only the new fc head is updated here.
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5, lr=0.001)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True
# Pass the lower learning rate to train_model; an optimizer created out here
# would never be used by the training loop
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10, lr=0.0001)
Why two-phase training works:
- Phase 1: The pretrained backbone extracts features; you only train the new classification head
- Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
- Without Phase 1: the randomly initialized classifier produces large, noisy gradients that flow back into the backbone and can destroy the carefully learned pretrained features
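The effect of freezing and unfreezing is easy to confirm by counting trainable parameters. The sketch below uses a tiny hypothetical two-part network as a stand-in for ResNet50, so it runs without downloading weights; the layer sizes are arbitrary.

```python
import torch.nn as nn

# Hypothetical stand-ins: a small "backbone" and a new classification "head"
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
head = nn.Linear(8, 3)
toy_model = nn.Sequential(backbone, head)

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

total_params = sum(p.numel() for p in toy_model.parameters())

# Phase 1: freeze the backbone, so only the head is trainable
for param in backbone.parameters():
    param.requires_grad = False
phase1_trainable = count_trainable(toy_model)

# Phase 2: unfreeze everything for fine-tuning
for param in toy_model.parameters():
    param.requires_grad = True
phase2_trainable = count_trainable(toy_model)

print(f"Total: {total_params}")               # 163
print(f"Phase 1 trainable: {phase1_trainable}")  # 27 (head only)
print(f"Phase 2 trainable: {phase2_trainable}")  # 163 (everything)
```

Running the same count on the real model before each phase is a cheap way to catch a backbone that was left frozen, or unfrozen, by accident.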
Modern Approach: EfficientNet or ViT
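num_classes = len(train_dataset.classes)

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
# classifier[1] is the final Linear layer (1792 input features for B4)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Or use Hugging Face for Vision Transformer
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
from transformers import ViTForImageClassification, ViTImageProcessor
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)
# Note: Hugging Face models return an output object, so read the scores
# from outputs.logits rather than using the return value directly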
Visualizing What the CNN Learned
A crucial debugging tool: visualize the filters and activations:
def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()
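Activations can be captured in a similar way with a forward hook, which PyTorch calls each time a layer produces output. The sketch below is a minimal illustration on a tiny hypothetical network with a random input; the names `tiny_cnn`, `captured`, and `save_activation` are made up for this example. The same pattern should apply to a layer of the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small network standing in for the trained CNN
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
)

captured = {}

def save_activation(module, inputs, output):
    """Forward hook: store a detached copy of the layer's output."""
    captured['conv1'] = output.detach()

# Attach the hook to the first conv layer, run one forward pass, then detach it
handle = tiny_cnn[0].register_forward_hook(save_activation)
with torch.no_grad():
    tiny_cnn(torch.rand(1, 3, 32, 32))  # random stand-in for a real image batch
handle.remove()

print(captured['conv1'].shape)  # torch.Size([1, 8, 32, 32])
# Each of the 8 channels is a 32x32 feature map that could be shown
# with plt.imshow(captured['conv1'][0, i], cmap='gray')
```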
Making Predictions on New Images
from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    # Get top predictions (at most 5; topk fails if k exceeds the number of classes)
    k = min(5, len(class_names))
    top_probs, top_indices = probabilities.topk(k)
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl,
    'test_image.jpg',
    train_dataset.classes,
    val_transforms
)
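The softmax step is what turns the model's raw scores (logits) into probabilities that sum to 1. Here is a plain-Python sketch of that calculation, with made-up logits and class names purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical model output for three hypothetical classes
logits = [2.0, 1.0, 0.1]
class_names = ['cat', 'dog', 'bird']

probs = softmax(logits)
ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)

for name, prob in ranked:
    print(f"{name:6}: {prob:.1%}")
# cat comes first at roughly 66%, then dog at about 24%, then bird at about 10%
```

A high softmax probability means the model prefers that class over the others it knows. It does not guarantee the prediction is correct, as the black lab example at the start shows.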
Common Issues and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Overfitting | Train acc >> Val acc | More dropout, data augmentation, fewer layers |
| Underfitting | Both accuracies low | More layers/filters, longer training, unfreeze backbone |
| Slow training | Progress too slow | Use GPU, increase batch size |
| Class imbalance | High accuracy on majority class only | Per-class weights in the loss (the weight argument of nn.CrossEntropyLoss), oversampling |
| Bad input normalization | Training doesn't converge | Use ImageNet normalization stats for pretrained models |
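For the class imbalance row, one common option is to weight each class inversely to its frequency and pass the weights to the loss. The sketch below assumes a hypothetical two-class dataset with a 900/100 split; the inverse-frequency formula is one reasonable choice among several, not the only way to set weights.

```python
import math

import torch
import torch.nn as nn

# Hypothetical label counts: 900 images of class 0, 100 of class 1
class_counts = torch.tensor([900.0, 100.0])

# Inverse-frequency weights: rare classes get proportionally larger weights
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
print(class_weights)  # about 0.556 for the majority class, 5.0 for the minority

criterion = nn.CrossEntropyLoss(weight=class_weights)

# Quick check on dummy logits: one sample per class, model equally unsure
logits = torch.zeros(2, 2)
labels = torch.tensor([0, 1])
loss = criterion(logits, labels)
print(loss.item())  # ln(2), about 0.693, since the weighted average is normalized
```

In the training loop, this `criterion` would replace the unweighted `nn.CrossEntropyLoss()`, with the weight tensor moved to the same device as the model.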
Conclusion
Computer vision with deep learning has become remarkably accessible. The same techniques that power autonomous vehicles and medical diagnostics can be applied to custom classification tasks with a few hundred labeled images and a few hours of training.
Transfer learning is the right starting point for almost every new computer vision project in 2025 — it delivers production-quality results with limited data and compute. Build from scratch only when your domain is fundamentally different from natural images.
For the deep learning foundations, see our neural networks explained guide. For help choosing a framework, our TensorFlow vs PyTorch comparison covers when to pick each one.
Further Reading
- Neural Networks Explained: From Perceptron to Deep Learning
- NLP for Beginners: How Computers Learn to Understand Language
- Scikit-Learn Tutorial: Build Your First ML Model in 30 Minutes
- Overfitting in Machine Learning: How to Detect and Fix It
- Kaggle Competition Guide: How to Rank in the Top 10% Every Time
- Jupyter Notebook Guide: The Data Scientist's Favorite Tool
- Fine-Tuning LLMs: When to Do It and How to Do It Right
- FastAPI Tutorial: Building Your First REST API in 30 Minutes
AiTechWorlds Team