What is a Convolutional Neural Network (CNN) and how does it work?

A CNN is a neural network architecture designed for grid-structured data such as images. Its convolutional layers slide learned filters across the input to detect local patterns. Early layers detect simple features (edges, gradients). Middle layers combine those into shapes and textures. Final layers combine those into object parts and objects. The key innovation vs. fully connected networks is weight sharing: the same filter is applied across the entire image, which cuts the parameter count dramatically and lets the network recognize a pattern wherever it appears. This is why the same filter can detect a horizontal edge whether it's at the top or bottom of the image.

Should I build a CNN from scratch or use transfer learning?

Use transfer learning for almost every new project. Building a CNN from scratch requires large amounts of labeled data (typically 100K+) and significant compute. Transfer learning fine-tunes a pretrained model (trained on millions of images) on your specific task. With as few as 500-1000 labeled examples, transfer learning from ResNet, EfficientNet, or a Vision Transformer typically produces better results than training from scratch with 10x more data. Build from scratch only when: you have a unique image format (medical images, satellite imagery) that differs fundamentally from ImageNet, or you're doing research on new architectures. For production applications, transfer learning is almost always the right choice.

What are the best pretrained models for computer vision in 2025?

For image classification: EfficientNet-V2 offers the best accuracy/compute tradeoff for most tasks. ResNet-50 is the reliable workhorse that's well-understood and widely supported. Vision Transformer (ViT) achieves top performance when you have large datasets. For object detection: YOLOv8 is the state-of-the-art for real-time detection speed, while DETR (Detection Transformer) is the choice when accuracy matters most. For image segmentation: SAM (Segment Anything Model, Meta) is the most versatile. For feature extraction and zero-shot classification: CLIP (OpenAI) is remarkably capable. All of these are available through Hugging Face or their respective official repositories.

How much labeled data do I need for image classification?

With transfer learning: 500-1000 images per class is often sufficient for a fine-tuned model to achieve 85-90%+ accuracy. 1000-5000 per class is comfortable for most production use cases. Without transfer learning (training from scratch): you typically need 10K-100K+ images per class. The more different your target domain is from ImageNet (the typical pretraining dataset), the more fine-tuning data you'll need. Data augmentation can effectively multiply your dataset 5-10x. Active learning — carefully selecting the most informative images to label — can dramatically reduce labeling costs.


Computer Vision Tutorial: Build an Image Classifier from Scratch

⚡ Quick Answer

Computer vision tutorial for beginners — build a real image classifier using CNNs and PyTorch, understand how computers see images, and learn transfer learning for production results.

AiTechWorlds Team May 27, 2026 9 min read


The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.

Understanding why it got that wrong — the features it was using, what the filters were detecting, where the training distribution didn't match my data — required actually understanding how the network worked, not just using it.

This tutorial builds a complete image classification system in PyTorch, from loading images to a deployed model, with enough explanation of the underlying mechanics that you'll be able to debug and improve your own models. We'll start with a CNN from scratch, understand what's happening inside it, and then use transfer learning to get production-quality results.

How Computers "See" Images

Before code, the mental model. A color image is a 3D array of numbers:

Image: 224 × 224 × 3
         ↑       ↑   ↑
       height  width channels (R, G, B)

Each pixel: [red_value, green_value, blue_value]
            values between 0-255

So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers

A neural network sees this as a tensor of 150K numbers. The challenge: how do you build a network that recognizes "cat" from these 150K numbers, regardless of where in the image the cat is, what size it is, and how it's lit?

The answer: Convolutional Neural Networks.

How Convolutional Layers Work

A convolution slides a small filter across the image, computing a weighted sum at each position:

Input (5×5 image):    3×3 filter:      Output (3×3):
1 2 3 4 5            1 0 -1          [sum1  sum2  sum3]
1 2 3 4 5            1 0 -1          [sum4  sum5  sum6]
1 2 3 4 5            1 0 -1          [sum7  sum8  sum9]
1 2 3 4 5
1 2 3 4 5

Position (0,0): sum1 = 1×1 + 2×0 + 3×-1 + 1×1 + 2×0 + 3×-1 + ...

This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.
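To make the sliding-window arithmetic concrete, here is a minimal pure-Python sketch that applies the 3×3 filter above to the 5×5 input. The function name `slide_filter` is hypothetical, and it assumes no padding and a stride of 1, which is what the 5×5 to 3×3 diagram implies.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0
            # Multiply each pixel in the 3x3 window by the matching filter weight
            for i in range(k):
                for j in range(k):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output

image = [[1, 2, 3, 4, 5] for _ in range(5)]   # the 5x5 input from the diagram
kernel = [[1, 0, -1] for _ in range(3)]       # the 3x3 vertical edge filter
result = slide_filter(image, kernel)
for r in result:
    print(r)
```

Every one of the nine sums comes out to -6. The intensity in this input rises by the same amount from left to right everywhere, so the filter responds equally at every position. An input with one sharp vertical edge would give a strong response only near that edge.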

The magic: in a CNN, we don't choose the filters manually. We initialize them randomly and let gradient descent learn which filters are useful for the task. After training on millions of images, the filters that emerge detect meaningful visual features.
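A toy version of that learning process can be sketched in plain Python. This is an illustration under simplifying assumptions, not how a real CNN is trained: the filter has only three weights, and the training signal comes from a known target filter rather than from a classification loss. All names here are hypothetical.

```python
import random

random.seed(0)

target = [1.0, 0.0, -1.0]                                 # the edge filter we hope to recover
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]   # random initialization
lr = 0.1

for step in range(500):
    # A random 3-pixel patch and the response the target filter gives on it
    patch = [random.uniform(-1.0, 1.0) for _ in range(3)]
    wanted = sum(t * p for t, p in zip(target, patch))
    got = sum(w * p for w, p in zip(weights, patch))
    error = got - wanted
    # Gradient of (got - wanted)**2 with respect to each weight is 2 * error * pixel
    weights = [w - lr * 2 * error * p for w, p in zip(weights, patch)]

print([round(w, 3) for w in weights])
```

The weights start out random and end up very close to the edge filter, purely by following the gradient of the error. A CNN does the same thing at scale, with thousands of filters and a loss computed from the final class predictions.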

Building a CNN with PyTorch

Setup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Data Loading and Augmentation

# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),         # Flip with 50% probability
    transforms.RandomRotation(15),             # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness
                          contrast=0.2,
                          saturation=0.2),
    transforms.ToTensor(),                     # Convert to PyTorch tensor
    transforms.Normalize(                      # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation — only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Custom CNN Architecture

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Feature extraction layers
        # Shape comments give (channels, height, width) per image for a 224×224 input
        self.features = nn.Sequential(
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (32, 112, 112)

            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # (64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (64, 56, 56)

            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # (128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (128, 28, 28)

            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1), # (256, 28, 28)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (256, 14, 14)
        )

        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (256, 1, 1): global average pooling
            nn.Flatten(),                   # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
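It can help to check the architecture by hand before training. The sketch below traces the feature-map size through the four blocks and tallies the trainable parameters using the standard output-size formula for convolution and pooling. It is plain Python, the helper name `conv_out` is hypothetical, and it assumes a 224×224 input and the default of 10 classes.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output height/width of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
channels = 3
total_params = 0
trace = []

for out_channels in (32, 64, 128, 256):
    size = conv_out(size, kernel=3, padding=1)              # 3x3 conv with padding=1 keeps the size
    total_params += (channels * 3 * 3 + 1) * out_channels   # conv weights plus one bias per filter
    total_params += 2 * out_channels                        # BatchNorm2d scale and shift
    size = conv_out(size, kernel=2, stride=2)               # 2x2 max pool halves the size
    channels = out_channels
    trace.append((channels, size, size))

num_classes = 10  # the default in CustomCNN
total_params += (256 + 1) * 128          # Linear(256, 128)
total_params += (128 + 1) * num_classes  # Linear(128, num_classes)

print(trace)
print(f"{total_params:,}")
```

The trace ends at 256 channels of 14×14, and the tally is 423,562 parameters, which should match the count printed above for a 10-class dataset. Most of those parameters sit in the last convolution, which is typical: channel counts grow as spatial size shrinks.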

Training Loop

def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)
        
        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)
        
        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
        
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{epochs} — "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")
    
    return history

history = train_model(model, train_loader, val_loader, epochs=20)
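The `StepLR` scheduler in the loop above multiplies the learning rate by `gamma` every `step_size` epochs. A small plain-Python sketch of that rule shows what learning rate each epoch actually trains with. The function name `step_lr` is hypothetical, and it assumes the scheduler is stepped once at the end of every epoch, as in `train_model`.

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate in effect during a given 0-indexed epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, epoch) for epoch in range(20)]
for epoch in (0, 6, 7, 13, 14, 19):
    print(f"epoch {epoch + 1:2d}: lr = {schedule[epoch]:.0e}")
```

With the defaults used here, epochs 1 to 7 train at 0.001, epochs 8 to 14 at 0.0001, and epochs 15 to 20 at 0.00001. If validation accuracy stalls right after a drop, the step may be arriving too early for your dataset.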

Transfer Learning: The Right Way to Do Computer Vision

Training from scratch with a small dataset will give mediocre results. Transfer learning, starting with a model pretrained on ImageNet (1.2 million images, 1000 classes), dramatically improves results with limited data.

Using ResNet50 with Transfer Learning

from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load pretrained ResNet50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    
    if freeze_backbone:
        # Freeze all layers (only train the final classifier)
        for param in model.parameters():
            param.requires_grad = False
    
    # Replace the final layer with a small head for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, 512) → ReLU → Dropout → Linear(512, num_classes)
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    
    return model.to(device)

# Phase 1: Train only the classifier head
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes), 
    freeze_backbone=True
)

# train_model builds its own Adam optimizer, so the learning rate is passed in as an argument.
# The frozen backbone weights receive no gradients, so only the new fc head is updated.
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5, lr=0.001)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True

# Lower LR for fine-tuning
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10, lr=0.0001)

Why two-phase training works:

Phase 1: The pretrained backbone extracts features; you only train the new classification head
Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
Without Phase 1: large gradients from the randomly initialized classifier can destroy the carefully learned pretrained features
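A quick way to confirm that freezing did what we intended is to count trainable parameters before and after unfreezing. The sketch below uses a hypothetical tiny network as a stand-in for the backbone and head so it runs instantly; the helper `count_parameters` is an illustrative name, not part of PyTorch.

```python
import torch.nn as nn

def count_parameters(model):
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Hypothetical stand-in for a pretrained backbone plus a new head
backbone = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
head = nn.Linear(8, 2)
toy = nn.Sequential(backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten(), head)

for param in backbone.parameters():
    param.requires_grad = False     # Phase 1: backbone frozen
phase1 = count_parameters(toy)

for param in toy.parameters():
    param.requires_grad = True      # Phase 2: everything unfrozen
phase2 = count_parameters(toy)

print(phase1, phase2)
```

In Phase 1 only the head's 18 parameters are trainable out of 242; in Phase 2 all 242 are. Calling the same helper on `model_tl` before each phase is a cheap sanity check that the right layers are being trained.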

Modern Approach: EfficientNet or ViT

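# Number of target classes, taken from the dataset loaded earlier
num_classes = len(train_dataset.classes)

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
effnet_model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
effnet_model.classifier[1] = nn.Linear(1792, num_classes)  # B4's classifier takes 1792 features

# Or use Hugging Face for Vision Transformer
from transformers import ViTForImageClassification

vit_model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)
# Note: the Hugging Face model returns an output object with the class scores in `.logits`,
# so train_model above would need `outputs = model(inputs).logits` to work with it.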

Visualizing What the CNN Learned

A crucial debugging tool: visualize the filters the first convolutional layer has learned:

def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()
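Filters show what a layer looks for; activations show where it found it in a particular image. One common way to capture activations in PyTorch is a forward hook. The sketch below is a minimal version under stated assumptions: `capture_activations` is a hypothetical helper name, and the tiny network and random input exist only to show the shapes involved.

```python
import torch
import torch.nn as nn

def capture_activations(model, layer, inputs):
    """Run `inputs` through `model` and return the output of `layer`."""
    captured = {}

    def hook(module, hook_inputs, output):
        # Store a detached CPU copy so plotting never touches the autograd graph
        captured['value'] = output.detach().cpu()

    handle = layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        model(inputs)
    handle.remove()  # remove the hook so it doesn't fire on later forward passes
    return captured['value']

# Hypothetical tiny network and random input, used only for illustration
tiny = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
activations = capture_activations(tiny, tiny[0], torch.randn(1, 3, 16, 16))
print(activations.shape)
```

With the CustomCNN from this tutorial, passing `model.features[0]` as the layer and a transformed image batch would give one 224×224 map per filter, which can be drawn with `imshow` the same way as the filters above.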

Making Predictions on New Images

from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    
    # Get top predictions (at most 5; topk raises an error if k exceeds the number of classes)
    top_probs, top_indices = probabilities.topk(min(5, len(class_names)))
    
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl, 
    'test_image.jpg', 
    train_dataset.classes,
    val_transforms
)
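The two steps that turn raw model outputs into the ranked list above are softmax and top-k selection. Here is a plain-Python sketch of both, so the arithmetic is visible without any tensors. The logits and class names are made up for illustration, and `top_k` is a hypothetical helper, not the PyTorch method.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probabilities, class_names, k=5):
    """Return up to k (class_name, probability) pairs, most likely first."""
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits and class names for illustration
logits = [2.0, 0.5, -1.0]
names = ['cat', 'dog', 'bird']
probs = softmax(logits)
for name, prob in top_k(probs, names):
    print(f"{name:20}: {prob:.3%}")
```

Note that softmax always produces a confident-looking distribution, even for an image unlike anything in the training set. A high top probability is not proof the model is right, which is exactly how a black lab ends up labeled as a cat.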

Common Issues and Solutions

Problem	Symptom	Solution
Overfitting	Train acc >> Val acc	More dropout, data augmentation, fewer layers
Underfitting	Both accuracies low	More layers/filters, longer training, unfreeze backbone
Slow training	Progress too slow	Use GPU, increase batch size
Class imbalance	High accuracy on majority class	weight argument of nn.CrossEntropyLoss, oversampling
Bad input normalization	Training doesn't converge	Use ImageNet normalization stats for pretrained models
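For the class imbalance row, one common recipe is to weight each class by the inverse of its frequency. The sketch below computes such weights in plain Python. The helper name and the 90/10 label split are hypothetical, and it assumes every class appears at least once in the labels.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """One weight per class, proportional to 1 / (number of samples in that class)."""
    counts = Counter(labels)
    total = len(labels)
    # total / (num_classes * count) gives every class weight 1.0 when the data is balanced
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

# Hypothetical label list: 90 samples of class 0 and 10 of class 1
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)
```

In this tutorial's setup, the label list would come from `train_dataset.targets`, and the result would be passed to the loss as `nn.CrossEntropyLoss(weight=torch.tensor(weights).to(device))`. Mistakes on the rare class then cost more, which pushes the model away from simply predicting the majority class.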

Conclusion

Computer vision with deep learning has become remarkably accessible. The same techniques that power autonomous vehicles and medical diagnostics can be applied to custom classification tasks with a few hundred labeled images and a few hours of training.

Transfer learning is the right starting point for almost every new computer vision project in 2025 — it delivers production-quality results with limited data and compute. Build from scratch only when your domain is fundamentally different from natural images.

For the deep learning foundations, see our neural networks explained guide. For the PyTorch deep dive, our TensorFlow vs PyTorch comparison covers when to choose each framework.

Frequently Asked Questions

What is computer vision and what can it do?

Computer vision is the field of AI that enables machines to interpret and understand visual information from images and videos. Current capabilities include: image classification (is this a cat or a dog?), object detection (locate and label all objects in an image), image segmentation (identify which pixels belong to which objects), facial recognition, scene understanding, medical image analysis, autonomous vehicle perception, and visual quality inspection. In 2025, the best computer vision models (Vision Transformers, SAM, CLIP) can match or exceed human performance on many standardized benchmarks. The most impactful production applications are medical imaging, manufacturing inspection, and autonomous vehicles.


Machine Learning

Computer Vision Tutorial: Build an Image Classifier from Scratch

⚡ Quick Answer

Computer vision tutorial for beginners — build a real image classifier using CNNs and PyTorch, understand how computers see images, and learn transfer learning for production results.

AiTechWorlds Team May 27, 2026 9 min read

#computer-vision-tutorial #image-classification-python #cnn-pytorch #machine-learning

📚Part of the Machine Learning guide — explore all Machine Learning articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Computer Vision Tutorial: Build an Image Classifier from Scratch

The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.

How Computers "See" Images

Before code, the mental model. A color image is a 3D array of numbers:

Image: 224 × 224 × 3
         ↑       ↑   ↑
       height  width channels (R, G, B)

Each pixel: [red_value, green_value, blue_value]
            values between 0-255

So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers

The answer: Convolutional Neural Networks.

How Convolutional Layers Work

A convolution slides a small filter across the image, computing a weighted sum at each position:

Input (5×5 image):    3×3 filter:      Output (3×3):
1 2 3 4 5            1 0 -1          [sum1  sum2  sum3]
1 2 3 4 5            1 0 -1          [sum4  sum5  sum6]
1 2 3 4 5            1 0 -1          [sum7  sum8  sum9]
1 2 3 4 5
1 2 3 4 5

Position (0,0): sum1 = 1×1 + 2×0 + 3×-1 + 1×1 + 2×0 + 3×-1 + ...

This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.

Building a CNN with PyTorch

Setup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Data Loading and Augmentation

# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),         # Flip with 50% probability
    transforms.RandomRotation(15),             # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness
                          contrast=0.2,
                          saturation=0.2),
    transforms.ToTensor(),                     # Convert to PyTorch tensor
    transforms.Normalize(                      # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation — only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Custom CNN Architecture

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Feature extraction layers
        self.features = nn.Sequential(
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (224, 224, 32)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (112, 112, 32)
            
            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # (112, 112, 64)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (56, 56, 64)
            
            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # (56, 56, 128)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (28, 28, 128)
            
            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1), # (28, 28, 256)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (14, 14, 256)
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (1, 1, 256) — global average pooling
            nn.Flatten(),                   # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Training Loop

def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)
        
        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)
        
        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
        
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{epochs} — "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")
    
    return history

history = train_model(model, train_loader, val_loader, epochs=20)

Transfer Learning: The Right Way to Do Computer Vision

Using ResNet50 with Transfer Learning

from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load pretrained ResNet50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    
    if freeze_backbone:
        # Freeze all layers (only train the final classifier)
        for param in model.parameters():
            param.requires_grad = False
    
    # Replace the final layer with one for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, num_classes)
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    
    return model.to(device)

# Phase 1: Train only the classifier head
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes), 
    freeze_backbone=True
)

# Only optimize the new fc layer
optimizer = optim.Adam(model_tl.fc.parameters(), lr=0.001)
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True

optimizer = optim.Adam(model_tl.parameters(), lr=0.0001)  # Lower LR for fine-tuning
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10)

Why two-phase training works:

Phase 1: The pretrained backbone extracts features; you only train the new classification head
Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
Without Phase 1: random gradients from the new classifier can destroy the carefully learned pretrained features

Modern Approach: EfficientNet or ViT

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
model.classifier[1] = nn.Linear(1792, num_classes)

# Or use Hugging Face for Vision Transformer
from transformers import ViTForImageClassification, ViTFeatureExtractor

model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)

Visualizing What the CNN Learned

A crucial debugging tool: visualize the filters and activations:

def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()

Making Predictions on New Images

from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    
    # Get top predictions (at most 5; topk fails if k exceeds the number of classes)
    k = min(5, len(class_names))
    top_probs, top_indices = probabilities.topk(k)
    
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl, 
    'test_image.jpg', 
    train_dataset.classes,
    val_transforms
)

Common Issues and Solutions

Problem	Symptom	Solution
Overfitting	Train acc >> Val acc	More dropout, data augmentation, fewer layers
Underfitting	Both accuracies low	More layers/filters, longer training, unfreeze backbone
Slow training	Epochs take too long	Use GPU, increase batch size
Class imbalance	High overall accuracy, poor results on minority classes	Per-class weights in the loss (the weight argument of nn.CrossEntropyLoss), oversampling
Bad input normalization	Training doesn't converge	Use ImageNet normalization stats for pretrained models
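
For the class imbalance row, per-class loss weights can be sketched as follows. The label counts are invented for illustration, and inverse-frequency weighting is only one reasonable scheme among several.

```python
import torch
import torch.nn as nn

# Hypothetical label counts for a 3-class dataset (class 0 dominates).
class_counts = torch.tensor([900.0, 80.0, 20.0])

# Inverse-frequency weights: total / (num_classes * count_for_class).
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Rare classes now contribute more to the loss than common ones.
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Fake logits and labels, only to show the call.
logits = torch.zeros(4, 3)
labels = torch.tensor([0, 1, 2, 2])
loss = criterion(logits, labels)

print(class_weights.tolist(), loss.item())
```

If the model trains on a GPU, move `class_weights` to the same device before building the loss.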

Conclusion

For the deep learning foundations, see our neural networks explained guide. To decide between frameworks, see our TensorFlow vs PyTorch comparison, which covers when to choose each.

AiTechWorlds Team
