Computer Vision Tutorial: Build an Image Classifier from Scratch

Computer vision tutorial for beginners — build a real image classifier using CNNs and PyTorch, understand how computers see images, and learn transfer learning for production results.

AiTechWorlds Team
May 27, 2026 9 min read

The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.

Understanding why it got that wrong — the features it was using, what the filters were detecting, where the training distribution didn't match my data — required actually understanding how the network worked, not just using it.

This tutorial builds a complete image classification system in PyTorch, from loading images to running predictions on new ones, with enough explanation of the underlying mechanics that you'll be able to debug and improve your own models. We'll start with a CNN from scratch, look at what's happening inside it, and then use transfer learning to get production-quality results.


How Computers "See" Images

Before code, the mental model. A color image is a 3D array of numbers:

Image: 224 × 224 × 3
         ↑       ↑   ↑
       height  width channels (R, G, B)

Each pixel: [red_value, green_value, blue_value]
            values between 0-255

So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers

A neural network sees this as a tensor of 150K numbers. The challenge: how do you build a network that recognizes "cat" from these 150K numbers, regardless of where in the image the cat is, what size it is, and how it's lit?
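To make those numbers concrete, here is a minimal sketch using NumPy. The image is randomly generated stand-in data (an assumption; a real photo loaded with PIL would have the same layout). It also shows the channels-first ordering that PyTorch layers expect.

```python
import numpy as np

# Stand-in for a real photo: random 8-bit RGB values (hypothetical data).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3) -> height, width, channels
print(image.size)    # 150528 numbers in total
print(image[0, 0])   # the [R, G, B] values of the top-left pixel

# PyTorch layers expect channels first: (channels, height, width).
chw = np.transpose(image, (2, 0, 1))
print(chw.shape)     # (3, 224, 224)
```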

The answer: Convolutional Neural Networks.


How Convolutional Layers Work

A convolution slides a small filter across the image, computing a weighted sum at each position:

Input (5×5 image):    3×3 filter:      Output (3×3):
1 2 3 4 5            1 0 -1          [sum1  sum2  sum3]
1 2 3 4 5            1 0 -1          [sum4  sum5  sum6]
1 2 3 4 5            1 0 -1          [sum7  sum8  sum9]
1 2 3 4 5
1 2 3 4 5

Position (0,0): sum1 = 1×1 + 2×0 + 3×-1 + 1×1 + 2×0 + 3×-1 + ... 

This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.
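The sliding-window computation from the diagram can be sketched in a few lines of NumPy. The helper name `convolve2d_valid` is made up for this illustration. Like deep learning libraries, it applies the filter without flipping it (strictly speaking a cross-correlation).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.tile(np.arange(1, 6), (5, 1))   # every row is 1 2 3 4 5
kernel = np.array([[1, 0, -1]] * 3)        # vertical edge filter from the diagram

result = convolve2d_valid(image, kernel)
print(result)  # a 3x3 array in which every entry is -6
```

Every output is -6 because the intensity rises by the same amount across each three-column window: each row of the filter contributes 1 - 3 = -2, and there are three rows. An image with a sharp vertical edge would instead produce a large response only near that edge.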

The magic: in a CNN, we don't choose the filters manually. We initialize them randomly and let gradient descent learn which filters are useful for the task. After training on millions of images, the filters that emerge detect meaningful visual features.
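As a toy illustration of filters being learned rather than hand-picked, the sketch below starts from a randomly initialized 3×3 filter and uses gradient descent to recover the vertical edge detector from examples of its output alone. The random inputs, step count, and learning rate are arbitrary choices for the demo, not settings from this tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Filter we want the layer to discover (the vertical edge detector above).
target_kernel = torch.tensor([[1.0, 0.0, -1.0]] * 3).reshape(1, 1, 3, 3)

# Toy data: random single-channel 8x8 "images" and the target filter's response.
inputs = torch.randn(64, 1, 8, 8)
targets = F.conv2d(inputs, target_kernel)

conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # randomly initialized filter
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(conv(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
print(conv.weight.data.squeeze())  # should end up close to the target filter
```

A real CNN does the same thing at scale: the loss comes from classification errors rather than a known target filter, and thousands of filters are updated at once.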


Building a CNN with PyTorch

Setup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Data Loading and Augmentation

# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),         # Flip with 50% probability
    transforms.RandomRotation(15),             # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness
                          contrast=0.2,
                          saturation=0.2),
    transforms.ToTensor(),                     # Convert to PyTorch tensor
    transforms.Normalize(                      # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation — only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
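One practical consequence of the Normalize step: tensors coming out of the loader are no longer in the 0-1 range, so they look wrong if plotted directly. Below is a minimal sketch of the normalization arithmetic and its inverse, using a random tensor as a stand-in for a real image. The helper names `normalize` and `denormalize` are made up for this example.

```python
import torch

# ImageNet channel statistics used in the transforms above.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    """Same arithmetic as transforms.Normalize: (x - mean) / std per channel."""
    return (img - MEAN) / STD

def denormalize(img):
    """Invert the normalization so the tensor can be displayed again."""
    return (img * STD + MEAN).clamp(0.0, 1.0)

torch.manual_seed(0)
original = torch.rand(3, 224, 224)   # stand-in for a ToTensor() output in [0, 1]
restored = denormalize(normalize(original))

print(torch.allclose(original, restored, atol=1e-5))
```

To inspect an augmented training image with matplotlib, something like `plt.imshow(denormalize(batch[0].cpu()).permute(1, 2, 0))` should work, since imshow expects height × width × channels.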

Custom CNN Architecture

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Feature extraction layers
        # Shape comments use PyTorch's channels-first order: (channels, height, width)
        self.features = nn.Sequential(
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (32, 112, 112)
            
            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # (64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (64, 56, 56)
            
            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # (128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (128, 28, 28)
            
            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1), # (256, 28, 28)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (256, 14, 14)
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (256, 1, 1): global average pooling
            nn.Flatten(),                   # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
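The spatial sizes in the architecture follow from the standard output-size formula, floor((size + 2 × padding - kernel) / stride) + 1. Here is a small pure-Python sketch that traces the four blocks. The helper name `conv_output_size` is made up for this illustration.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 224
sizes = [size]
for block in range(4):
    # Conv2d with kernel 3 and padding 1 keeps the size unchanged
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)
    # MaxPool2d(2, 2) halves it
    size = conv_output_size(size, kernel_size=2, stride=2)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14]

# Parameters of the first conv layer: one 3x3x3 filter plus a bias per output channel.
first_conv_params = (3 * 3 * 3 + 1) * 32
print(first_conv_params)  # 896
```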

Training Loop

def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)
        
        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)
        
        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
        
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{epochs} — "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")
    
    return history

history = train_model(model, train_loader, val_loader, epochs=20)
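The returned history dictionary is most useful when plotted, because a widening gap between the training and validation curves is the usual sign of overfitting. Below is a minimal sketch. The function name `plot_history` and the example numbers are made up for illustration.

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot loss and accuracy curves from the dict returned by train_model."""
    epochs = range(1, len(history['train_loss']) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))

    ax_loss.plot(epochs, history['train_loss'], label='train')
    ax_loss.plot(epochs, history['val_loss'], label='val')
    ax_loss.set_xlabel('epoch')
    ax_loss.set_ylabel('loss')
    ax_loss.legend()

    ax_acc.plot(epochs, history['train_acc'], label='train')
    ax_acc.plot(epochs, history['val_acc'], label='val')
    ax_acc.set_xlabel('epoch')
    ax_acc.set_ylabel('accuracy')
    ax_acc.legend()

    fig.tight_layout()
    return fig

# Made-up numbers purely to exercise the function; pass the real history instead.
example_history = {
    'train_loss': [1.2, 0.8, 0.5],
    'val_loss': [1.3, 0.9, 0.8],
    'train_acc': [0.55, 0.72, 0.85],
    'val_acc': [0.50, 0.68, 0.71],
}
fig = plot_history(example_history)
```

Call `plt.show()` afterwards (or `fig.savefig(...)`) to view the curves.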

Transfer Learning: The Right Way to Do Computer Vision

Training from scratch with a small dataset will give mediocre results. Transfer learning, starting with a model pretrained on ImageNet (1.2 million images, 1000 classes), dramatically improves results with limited data.

Using ResNet50 with Transfer Learning

from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load pretrained ResNet50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    
    if freeze_backbone:
        # Freeze all layers (only train the final classifier)
        for param in model.parameters():
            param.requires_grad = False
    
    # Replace the final layer with one for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, num_classes)
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    
    return model.to(device)

# Phase 1: Train only the classifier head
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes), 
    freeze_backbone=True
)

# train_model creates its own Adam optimizer, so the learning rate is passed in.
# Frozen backbone parameters receive no gradients, so only the new fc head is updated.
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5, lr=0.001)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True

# Lower LR for fine-tuning (again passed to train_model, which builds the optimizer)
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10, lr=0.0001)

Why two-phase training works:

  • Phase 1: The pretrained backbone extracts features; you only train the new classification head
  • Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
  • Without Phase 1: large gradients from the randomly initialized classifier can destroy the carefully learned pretrained features
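The freeze/unfreeze mechanics can be checked on a tiny stand-in model without downloading any weights. `TinyNet` and `count_trainable` are hypothetical names for this sketch. The per-group learning rates in Phase 2 are an optional variation (a smaller rate for the backbone than for the head), not something the tutorial's `train_model` does.

```python
import torch.nn as nn
import torch.optim as optim

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained network: a backbone plus an fc head."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.fc(self.backbone(x))

def count_trainable(model):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = TinyNet()

# Phase 1: freeze the backbone, train only the head.
for param in model.backbone.parameters():
    param.requires_grad = False
phase1_trainable = count_trainable(model)
phase1_optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

# Phase 2: unfreeze everything; the backbone gets a smaller learning rate than the head.
for param in model.parameters():
    param.requires_grad = True
phase2_trainable = count_trainable(model)
phase2_optimizer = optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-4},
])

print(phase1_trainable, phase2_trainable)  # 51 195
```

Printing the trainable count before each phase is a cheap way to confirm that the freeze actually took effect.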

Modern Approach: EfficientNet or ViT

num_classes = len(train_dataset.classes)

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
in_features = model.classifier[1].in_features  # 1792 for EfficientNet-B4
model.classifier[1] = nn.Linear(in_features, num_classes)

# Or use Hugging Face for Vision Transformer
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
from transformers import ViTForImageClassification, ViTImageProcessor

model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)

Visualizing What the CNN Learned

A crucial debugging tool is visualizing the filters and activations. The function below plots the learned filters of the first conv layer:

def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()
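Activations (the feature maps a layer produces for a specific input) can be captured with a forward hook. The sketch below uses a small stand-in network so it runs on its own. The helper name `capture_activations` is made up, and with the tutorial's CustomCNN you would presumably pass `model.features[0]` as the layer.

```python
import torch
import torch.nn as nn

def capture_activations(model, layer, inputs):
    """Run inputs through model and return the output of `layer`."""
    captured = {}

    def hook(module, hook_inputs, output):
        captured['out'] = output.detach().cpu()

    handle = layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        model(inputs)
    handle.remove()   # remove the hook so repeated calls do not pile up
    return captured['out']

# Small stand-in network (hypothetical), used only to keep the example self-contained.
torch.manual_seed(0)
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(8, 2),
)
activations = capture_activations(toy, toy[0], torch.randn(1, 3, 32, 32))
print(activations.shape)  # torch.Size([1, 8, 32, 32])
```

Each of the 8 channels is one feature map. Plotting `activations[0, i]` with `imshow` shows which image regions filter `i` responds to.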

Making Predictions on New Images

from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    
    # Get top predictions (at most 5; fewer if the dataset has fewer classes)
    top_probs, top_indices = probabilities.topk(min(5, len(class_names)))
    
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl, 
    'test_image.jpg', 
    train_dataset.classes,
    val_transforms
)
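For intuition about what softmax and topk do to the model's raw outputs, here is a tiny standalone example. The scores and class names are invented for illustration.

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])   # hypothetical raw scores for 3 classes
class_names = ['cat', 'dog', 'bird']         # hypothetical labels

# Softmax turns raw scores into probabilities that sum to 1.
probabilities = torch.softmax(logits, dim=1)[0]

# topk returns the k largest probabilities and their class indices, best first.
top_probs, top_indices = probabilities.topk(2)

for prob, idx in zip(top_probs, top_indices):
    print(f"{class_names[idx.item()]}: {prob.item():.1%}")
```

Here 'cat' comes out at roughly 79% and 'dog' at roughly 18%. A high softmax probability only means the model prefers that class over the others it knows. It is not a calibrated measure of correctness, which is how a model can be confidently wrong about a black lab.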

Common Issues and Solutions

Problem                 | Symptom                                  | Solution
Overfitting             | Train acc >> Val acc                     | More dropout, data augmentation, fewer layers
Underfitting            | Both accuracies low                      | More layers/filters, longer training, unfreeze backbone
Slow training           | Progress too slow                        | Use GPU, increase batch size
Class imbalance         | High accuracy only on the majority class | weight argument of nn.CrossEntropyLoss, oversampling
Bad input normalization | Training doesn't converge                | Use ImageNet normalization stats for pretrained models
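For the class imbalance row, one common approach is inverse-frequency class weights passed to the loss. This is a minimal sketch with invented label counts. The weighting formula is one reasonable choice among several, not the only option.

```python
import torch
import torch.nn as nn

# Hypothetical label counts: class 0 is nine times more common than class 1.
labels = torch.tensor([0] * 90 + [1] * 10)

counts = torch.bincount(labels).float()            # samples per class
weights = counts.sum() / (len(counts) * counts)    # inverse-frequency weights

# Mistakes on the rare class now cost more than mistakes on the common one.
criterion = nn.CrossEntropyLoss(weight=weights)
print(weights)
```

With a real ImageFolder dataset the labels are available as `train_dataset.targets`, and the weight tensor needs to be moved to the same device as the model (`weights.to(device)`) before building the loss.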

Conclusion

Computer vision with deep learning has become remarkably accessible. The same techniques that power autonomous vehicles and medical diagnostics can be applied to custom classification tasks with a few hundred labeled images and a few hours of training.

Transfer learning is the right starting point for almost every new computer vision project in 2025 — it delivers production-quality results with limited data and compute. Build from scratch only when your domain is fundamentally different from natural images.

For the deep learning foundations, see our neural networks explained guide. For the PyTorch deep dive, our TensorFlow vs PyTorch comparison covers when to choose each framework.


Frequently Asked Questions

What is computer vision and what can it do?

Computer vision enables machines to interpret visual information. Current capabilities: image classification, object detection and localization, image segmentation, facial recognition, medical image analysis. Best models match human performance on many benchmarks. Most impactful production uses: medical imaging, manufacturing inspection, autonomous vehicles.

What is a CNN and how does it work?

A neural network for image data that uses convolutional layers — small learned filters applied across the entire image. Early layers detect edges; middle layers detect shapes/textures; final layers detect objects. Weight sharing (same filter across the whole image) makes CNNs efficient and position-invariant.
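The efficiency gain from weight sharing is easy to quantify with a little arithmetic. The sketch below compares a fully connected layer with a convolutional layer on a 224×224 RGB input. The choice of 32 output units/filters is just for illustration.

```python
height, width, channels = 224, 224, 3
inputs = height * width * channels            # 150,528 input values

# Fully connected layer with just 32 output units: one weight per input per unit.
dense_params = inputs * 32 + 32

# Conv layer with 32 filters of size 3x3 over 3 channels: the same weights are
# reused at every image position, so the count does not depend on image size.
conv_params = (3 * 3 * channels) * 32 + 32

print(f"{dense_params:,} vs {conv_params:,}")  # 4,816,928 vs 896
```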

Should I build a CNN from scratch or use transfer learning?

Use transfer learning for almost every new project. With 500-1000 labeled examples, transfer learning from ResNet or EfficientNet can often beat training from scratch on 10x the data. Build from scratch only for very unusual image formats or architecture research.

What are the best pretrained models for computer vision in 2025?

Classification: EfficientNet-V2 (best accuracy/compute), ResNet-50 (reliable workhorse). Detection: YOLOv8 (real-time), DETR (best accuracy). Segmentation: SAM (Meta). Zero-shot: CLIP (OpenAI). All available through PyTorch Hub or Hugging Face.

How much labeled data do I need for image classification?

With transfer learning: 500-1000 images per class often achieves 85-90%+ accuracy. 1000-5000 per class is comfortable for production. Data augmentation effectively multiplies your dataset 5-10x. Without transfer learning: 10K-100K+ per class.
