Computer Vision Tutorial: Build an Image Classifier from Scratch
Computer vision tutorial for beginners — build a real image classifier using CNNs and PyTorch, understand how computers see images, and learn transfer learning for production results.
The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.
Understanding why it got that wrong — the features it was using, what the filters were detecting, where the training distribution didn't match my data — required actually understanding how the network worked, not just using it.
This tutorial builds a complete image classification system in PyTorch, from loading images to a deployed model, with enough explanation of the underlying mechanics that you'll be able to debug and improve your own models. We'll start with a CNN from scratch, understand what's happening inside it, and then use transfer learning to get production-quality results.
How Computers "See" Images
Before code, the mental model. A color image is a 3D array of numbers:
Image: 224 (height) × 224 (width) × 3 (channels: R, G, B)
Each pixel: [red_value, green_value, blue_value], each value between 0 and 255
So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers
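As a quick check of that arithmetic, here is a minimal sketch that uses NumPy and a blank placeholder array instead of a real photo (both are assumptions made for illustration). It also shows the channels-first layout that PyTorch's `ToTensor` transform produces.

```python
import numpy as np

# Placeholder "image": height x width x channels, 8-bit values (0-255)
image_hwc = np.zeros((224, 224, 3), dtype=np.uint8)
total_values = image_hwc.size
print(total_values)      # 150528

# PyTorch works with channels first (C, H, W) and floats scaled to [0, 1]
image_chw = image_hwc.transpose(2, 0, 1).astype(np.float32) / 255.0
print(image_chw.shape)   # (3, 224, 224)
```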
A neural network sees this as a tensor of 150K numbers. The challenge: how do you build a network that recognizes "cat" from these 150K numbers, regardless of where in the image the cat is, what size it is, and how it's lit?
The answer: Convolutional Neural Networks.
How Convolutional Layers Work
A convolution slides a small filter across the image, computing a weighted sum at each position:
Input (5×5 image):
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
3×3 filter:
1 0 -1
1 0 -1
1 0 -1
Output (3×3):
[sum1 sum2 sum3]
[sum4 sum5 sum6]
[sum7 sum8 sum9]
Position (0,0): sum1 = 1×1 + 2×0 + 3×(-1) + 1×1 + 2×0 + 3×(-1) + 1×1 + 2×0 + 3×(-1) = -6
This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.
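To make the sliding-window arithmetic concrete, here is a small NumPy sketch of the operation described above, with no padding and stride 1. The helper name `conv2d_valid` and the second test image `step` are illustrative assumptions, not part of the tutorial's pipeline. `nn.Conv2d` computes the same kind of weighted sum, just with learned filters and many channels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# The 5x5 input from the example: every row is 1 2 3 4 5
ramp = np.tile(np.arange(1, 6), (5, 1))
# The 3x3 vertical edge filter from the example
kernel = np.array([[1, 0, -1]] * 3)

ramp_response = conv2d_valid(ramp, kernel)
print(ramp_response)   # every entry is -6: intensity rises steadily left to right

# A sharper case: flat dark region on the left, bright region on the right
step = np.zeros((5, 5))
step[:, 3:] = 1
step_response = conv2d_valid(step, kernel)
print(step_response)   # 0 over the flat region, -3 where the window straddles the edge
```

The ramp gives a constant response because the brightness changes by the same amount everywhere; the step image responds only where the window overlaps the edge.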
The magic: in a CNN, we don't choose the filters manually. We initialize them randomly and let gradient descent learn which filters are useful for the task. After training on millions of images, the filters that emerge detect meaningful visual features.
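A minimal sketch of what "learned filters" means in code: the weights of an `nn.Conv2d` layer start out random and are ordinary trainable parameters, so a single optimizer step already changes them. The input, target, layer sizes, and learning rate below are arbitrary values chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One conv layer: 1 input channel, 4 filters of size 3x3, randomly initialized
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
weights_before = conv.weight.detach().clone()

# Arbitrary stand-in data: 8 single-channel 16x16 "images" and a random target
x = torch.randn(8, 1, 16, 16)
target = torch.randn(8, 4, 14, 14)   # 16 - 3 + 1 = 14 without padding

optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
loss = nn.functional.mse_loss(conv(x), target)
optimizer.zero_grad()
loss.backward()      # gradients flow into the filter weights
optimizer.step()     # the filters move a small step downhill

filters_changed = not torch.equal(weights_before, conv.weight.detach())
print(tuple(conv.weight.shape), filters_changed)
```

Real training repeats this step many times on labeled images, which is how edge-like and texture-like filters emerge.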
Building a CNN with PyTorch
Setup
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
Data Loading and Augmentation
# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),        # Flip with 50% probability
    transforms.RandomRotation(15),            # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness, contrast, saturation
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),                    # Convert to PyTorch tensor
    transforms.Normalize(                     # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation, only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
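For reference, `transforms.Normalize` subtracts the per-channel mean and divides by the per-channel standard deviation. The sketch below reproduces that arithmetic by hand and adds a hypothetical `denormalize` helper (not part of torchvision) that is handy when you want to plot a normalized tensor. A random tensor stands in for a real image.

```python
import torch

# ImageNet statistics, reshaped so they broadcast over (C, H, W) tensors
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """Same arithmetic as transforms.Normalize: (x - mean) / std per channel."""
    return (img - IMAGENET_MEAN) / IMAGENET_STD

def denormalize(img):
    """Hypothetical helper: undo the normalization so the image can be displayed."""
    return (img * IMAGENET_STD + IMAGENET_MEAN).clamp(0.0, 1.0)

torch.manual_seed(0)
image = torch.rand(3, 224, 224)        # stand-in for a ToTensor() output in [0, 1]
normalized = normalize(image)
restored = denormalize(normalized)

print(normalized.mean().item())        # no longer centered around 0.5
print(torch.allclose(restored, image, atol=1e-5))
```

To display a restored image with matplotlib, move the channel axis last first, for example `restored.permute(1, 2, 0)`.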
Custom CNN Architecture
class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Shape comments are (channels, height, width) for a 3×224×224 input
        # Feature extraction layers
        self.features = nn.Sequential(
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),     # (32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (32, 112, 112)
            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),    # (64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (64, 56, 56)
            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),   # (128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (128, 28, 28)
            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1),  # (256, 28, 28)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (256, 14, 14)
        )
        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (256, 1, 1), global average pooling
            nn.Flatten(),                  # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
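One way to sanity-check an architecture like this is to push a dummy tensor through it and record the shape after each block. The sketch below rebuilds the four conv blocks with a hypothetical `conv_block` helper so that it runs on its own. PyTorch reports shapes as (batch, channels, height, width).

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> BatchNorm -> ReLU -> 2x2 max pool (halves height and width)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2, 2),
    )

blocks = nn.Sequential(
    conv_block(3, 32),
    conv_block(32, 64),
    conv_block(64, 128),
    conv_block(128, 256),
)

blocks.eval()                        # use BatchNorm running stats for the dummy pass
x = torch.zeros(1, 3, 224, 224)      # one blank 224x224 RGB image
shapes = []
with torch.no_grad():
    for block in blocks:
        x = block(x)
        shapes.append(tuple(x.shape))

for i, shape in enumerate(shapes, start=1):
    print(f"After block {i}: {shape}")
```

Each block doubles the channel count and halves the spatial size, so a 224×224 input ends up as a 14×14 grid with 256 channels before global average pooling.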
Training Loop
def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)

        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)

        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')

        scheduler.step()
        print(f"Epoch {epoch+1}/{epochs} - "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")

    return history

history = train_model(model, train_loader, val_loader, epochs=20)
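The `StepLR` scheduler in this loop multiplies the learning rate by `gamma` every `step_size` epochs. As a small illustration, the sketch below replays the same settings (`lr=0.001`, `step_size=7`, `gamma=0.1`, 20 epochs) on a single dummy parameter, which is an assumption made so the snippet runs without a model or data.

```python
import torch
from torch import optim

# A single dummy parameter stands in for the model's weights
dummy = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([dummy], lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

lrs = []
for epoch in range(20):
    lrs.append(optimizer.param_groups[0]['lr'])   # learning rate used this epoch
    optimizer.step()                              # no gradients here, so nothing changes
    scheduler.step()                              # advance the schedule once per epoch

for epoch, lr in enumerate(lrs, start=1):
    print(f"Epoch {epoch:2d}: lr = {lr:.0e}")
```

With these settings, epochs 1-7 run at 1e-3, epochs 8-14 at 1e-4, and epochs 15-20 at 1e-5.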
Transfer Learning: The Right Way to Do Computer Vision
Training from scratch on a small dataset usually gives mediocre results. Transfer learning starts from a model pretrained on ImageNet (about 1.2 million images, 1000 classes) and typically improves results substantially when labeled data is limited.
Using ResNet50 with Transfer Learning
from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load ResNet50 pretrained on ImageNet
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        # Freeze all pretrained layers (only the new head will train)
        for param in model.parameters():
            param.requires_grad = False
    # Replace the final layer with a head for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, 512) -> ReLU -> Dropout -> Linear(512, num_classes)
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    return model.to(device)

# Phase 1: Train only the classifier head.
# train_model builds its own Adam optimizer from the lr argument; frozen
# parameters receive no gradients, so only the new fc head is updated.
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes),
    freeze_backbone=True
)
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5, lr=0.001)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True
# Pass the lower learning rate through train_model so it is actually used
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10, lr=0.0001)
Why two-phase training works:
- Phase 1: The pretrained backbone extracts features; you only train the new classification head
- Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
- Without Phase 1: large gradients from the randomly initialized classifier can overwrite the carefully learned pretrained features
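The freeze and unfreeze steps come down to toggling `requires_grad`. The sketch below uses a tiny stand-in network (an assumption, so that nothing has to be downloaded) and a hypothetical `count_trainable` helper to show how many parameters each phase actually trains.

```python
import torch.nn as nn

def count_trainable(model):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Tiny stand-in for a pretrained backbone plus a new classification head
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # 3*8*3*3 weights + 8 biases = 224
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 5)                           # 8*5 weights + 5 biases = 45
model = nn.Sequential(backbone, head)

# Phase 1: freeze the backbone, train only the head
for param in backbone.parameters():
    param.requires_grad = False
phase1_trainable = count_trainable(model)

# Phase 2: unfreeze everything for fine-tuning
for param in model.parameters():
    param.requires_grad = True
phase2_trainable = count_trainable(model)

print(phase1_trainable, phase2_trainable)        # 45 269
```

Printing the same count for `model_tl` before each phase is a cheap way to confirm that the freeze actually took effect.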
Modern Approach: EfficientNet or ViT
num_classes = len(train_dataset.classes)

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
# Swap the final linear layer; in_features is 1792 for B4
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Or use Hugging Face for Vision Transformer
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
from transformers import ViTForImageClassification, ViTImageProcessor
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)
Visualizing What the CNN Learned
A crucial debugging tool is visualizing the filters and activations:
def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()
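Activations can be captured in a similar spirit with a forward hook. The sketch below is one possible approach, not the only one: `capture_activations` is a hypothetical helper, and the toy network and random input are stand-ins for a trained model and a real image batch.

```python
import torch
import torch.nn as nn

def capture_activations(model, layer, inputs):
    """Run `inputs` through `model` and return the output of `layer`."""
    captured = {}

    def hook(module, hook_inputs, output):
        captured['out'] = output.detach().cpu()

    handle = layer.register_forward_hook(hook)
    try:
        model.eval()
        with torch.no_grad():
            model(inputs)
    finally:
        handle.remove()      # always detach the hook, even if the forward pass fails
    return captured['out']

# Toy stand-in for a trained CNN
toy_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
activations = capture_activations(toy_model, toy_model[0], torch.randn(1, 3, 32, 32))
print(tuple(activations.shape))   # (1, 4, 32, 32): one feature map per filter
```

Each channel of the returned tensor can be passed to `plt.imshow` to see where in the image that filter responds.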
Making Predictions on New Images
from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    # Get top predictions (at most 5; topk fails if k exceeds the class count)
    top_probs, top_indices = probabilities.topk(min(5, len(class_names)))
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl,
    'test_image.jpg',
    train_dataset.classes,
    val_transforms
)
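The post-processing step can be tried without a trained model. In the sketch below the class names and logits are made-up values for illustration. Softmax turns raw scores into probabilities that sum to 1, and `topk` picks the highest ones. Asking `topk` for more entries than there are classes raises an error, so the sketch caps `k`.

```python
import torch

class_names = ['cat', 'dog', 'rabbit']       # hypothetical labels
logits = torch.tensor([[2.0, 1.0, 0.1]])     # hypothetical raw model output for one image

probabilities = torch.softmax(logits, dim=1)[0]
k = min(5, len(class_names))                 # never request more entries than classes
top_probs, top_indices = probabilities.topk(k)

for rank, (prob, idx) in enumerate(zip(top_probs, top_indices), start=1):
    print(f"{rank}. {class_names[idx.item()]:10}: {prob.item():.1%}")
```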
Common Issues and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Overfitting | Train acc >> Val acc | More dropout, data augmentation, fewer layers |
| Underfitting | Both accuracies low | More layers/filters, longer training, unfreeze backbone |
| Slow training | Progress too slow | Use GPU, increase batch size |
| Class imbalance | High accuracy on majority class only | Per-class `weight` in `nn.CrossEntropyLoss`, oversampling |
| Bad input normalization | Training doesn't converge | Use ImageNet normalization stats for pretrained models |
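For the class imbalance row, one common option is inverse-frequency class weights passed to the loss. The sketch below assumes a made-up two-class label list with a 90/10 split. With an `ImageFolder` dataset, the labels would come from its `targets` attribute instead.

```python
from collections import Counter

import torch
import torch.nn as nn

# Hypothetical labels: class 0 has 90 samples, class 1 has 10
labels = [0] * 90 + [1] * 10
counts = Counter(labels)
num_classes = len(counts)

# Inverse-frequency weights: rare classes get proportionally larger weights
class_weights = torch.tensor(
    [len(labels) / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float32,
)
print(class_weights)   # roughly [0.56, 5.00]

# Mistakes on the rare class now cost about 9x more than on the common class
criterion = nn.CrossEntropyLoss(weight=class_weights)
loss = criterion(torch.zeros(2, num_classes), torch.tensor([0, 1]))
print(loss.item())
```

If the model runs on a GPU, the weight tensor needs to be moved to the same device as the model outputs.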
Conclusion
Computer vision with deep learning has become remarkably accessible. The same techniques that power autonomous vehicles and medical diagnostics can be applied to custom classification tasks with a few hundred labeled images and a few hours of training.
Transfer learning is the right starting point for almost every new computer vision project in 2025 — it delivers production-quality results with limited data and compute. Build from scratch only when your domain is fundamentally different from natural images.
For the deep learning foundations, see our neural networks explained guide. For the PyTorch deep dive, our TensorFlow vs PyTorch comparison covers when to choose each framework.
Frequently Asked Questions
What is computer vision and what can it do?
Computer vision enables machines to interpret visual information. Current capabilities: image classification, object detection and localization, image segmentation, facial recognition, medical image analysis. Best models match human performance on many benchmarks. Most impactful production uses: medical imaging, manufacturing inspection, autonomous vehicles.
What is a CNN and how does it work?
A neural network for image data that uses convolutional layers: small learned filters applied across the entire image. Early layers detect edges; middle layers detect shapes and textures; final layers respond to object-level patterns. Weight sharing (the same filter applied across the whole image) keeps the parameter count low and lets a filter detect its feature wherever it appears in the image.
Should I build a CNN from scratch or use transfer learning?
Use transfer learning for almost every new project. With 500-1000 labeled examples, transfer learning from ResNet or EfficientNet often beats training from scratch with 10x the data. Build from scratch only for very unusual image formats or architecture research.
What are the best pretrained models for computer vision in 2025?
Classification: EfficientNet-V2 (strong accuracy for its compute cost), ResNet-50 (reliable workhorse). Detection: YOLOv8 (real-time), DETR (high accuracy). Segmentation: SAM (Meta). Zero-shot: CLIP (OpenAI). Most are available through torchvision, PyTorch Hub, or Hugging Face; YOLOv8 ships in the Ultralytics package.
How much labeled data do I need for image classification?
With transfer learning: 500-1000 images per class often achieves 85-90%+ accuracy. 1000-5000 per class is comfortable for production. Data augmentation effectively multiplies your dataset 5-10x. Without transfer learning: 10K-100K+ per class.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.