What is computer vision and what can it do?

Computer vision is the field of AI that enables machines to interpret and understand visual information from images and videos. Current capabilities include: image classification (is this a cat or a dog?), object detection (locate and label all objects in an image), image segmentation (identify which pixels belong to which objects), facial recognition, scene understanding, medical image analysis, autonomous vehicle perception, and visual quality inspection. In 2025, the best computer vision models (Vision Transformers, SAM, CLIP) can match or exceed human performance on many standardized benchmarks. The most impactful production applications are medical imaging, manufacturing inspection, and autonomous vehicles.

What is a Convolutional Neural Network (CNN) and how does it work?

A CNN is a neural network architecture designed for grid-structured data such as images. Its convolutional layers slide learned filters across the input to detect local patterns. Early layers detect simple features (edges, gradients). Middle layers combine those into shapes and textures. Final layers combine those into object parts and objects. The key innovation over fully connected networks is weight sharing: the same filter is applied at every position in the image, which needs far fewer parameters and lets the network recognize a pattern wherever it appears. This is why the same filter can detect a horizontal edge whether it's at the top or bottom of the image.
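
To put a rough number on the efficiency gain from weight sharing, here is a small back-of-the-envelope sketch. The sizes are illustrative assumptions (a 224×224 RGB input, 32 filters of size 3×3, and a fully connected layer with 32 output units for comparison), not figures from a specific model.

```python
# Illustrative sizes (assumptions chosen for the comparison)
height, width, channels = 224, 224, 3
num_filters = 32          # conv filters, or output units for the dense layer
kernel = 3                # 3x3 filter

# Convolution: one small filter per output channel, shared across all positions
conv_params = num_filters * (kernel * kernel * channels) + num_filters  # + biases

# Fully connected: every input value gets its own weight per output unit
dense_params = num_filters * (height * width * channels) + num_filters  # + biases

print(f"Conv layer parameters:  {conv_params:,}")
print(f"Dense layer parameters: {dense_params:,}")
print(f"Ratio: {dense_params / conv_params:,.0f}x")
```

The comparison is not perfectly like-for-like (the convolution produces 32 feature maps, the dense layer 32 single numbers), but it shows why sharing one small filter across the whole image is so much cheaper.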

Should I build a CNN from scratch or use transfer learning?

Use transfer learning for almost every new project. Building a CNN from scratch requires large amounts of labeled data (typically 100K+) and significant compute. Transfer learning fine-tunes a pretrained model (trained on millions of images) on your specific task. With as few as 500-1000 labeled examples, transfer learning from ResNet, EfficientNet, or a Vision Transformer typically produces better results than training from scratch with 10x more data. Build from scratch only when: you have a unique image format (medical images, satellite imagery) that differs fundamentally from ImageNet, or you're doing research on new architectures. For production applications, transfer learning is almost always the right choice.

What are the best pretrained models for computer vision in 2025?

For image classification: EfficientNet-V2 offers the best accuracy/compute tradeoff for most tasks. ResNet-50 is the reliable workhorse that's well-understood and widely supported. Vision Transformer (ViT) achieves top performance when you have large datasets. For object detection: YOLOv8 is a leading choice for real-time detection speed, and DETR (Detection Transformer) is the option to consider when accuracy matters more than speed. For image segmentation: SAM (Segment Anything Model, Meta) is the most versatile. For feature extraction and zero-shot classification: CLIP (OpenAI) is remarkably capable. All of these are available through Hugging Face or their respective official repositories.

How much labeled data do I need for image classification?

With transfer learning: 500-1000 images per class is often sufficient for a fine-tuned model to achieve 85-90%+ accuracy. 1000-5000 per class is comfortable for most production use cases. Without transfer learning (training from scratch): you typically need 10K-100K+ images per class. The more different your target domain is from ImageNet (the typical pretraining dataset), the more fine-tuning data you'll need. Data augmentation can effectively multiply your dataset 5-10x. Active learning — carefully selecting the most informative images to label — can dramatically reduce labeling costs.

Computer Vision Tutorial: Build an Image Classifier from Scratch

The moment computer vision clicked for me wasn't a research paper — it was debugging a model that confidently identified my black lab as a cat.

Understanding why it got that wrong — the features it was using, what the filters were detecting, where the training distribution didn't match my data — required actually understanding how the network worked, not just using it.

This tutorial builds a complete image classification system in PyTorch, from loading images to making predictions on new ones, with enough explanation of the underlying mechanics that you'll be able to debug and improve your own models. We'll start with a CNN from scratch, understand what's happening inside it, and then use transfer learning to get production-quality results.

How Computers "See" Images

Before code, the mental model. A color image is a 3D array of numbers:

Image: 224 × 224 × 3
         ↑       ↑   ↑
       height  width channels (R, G, B)

Each pixel: [red_value, green_value, blue_value]
            values between 0-255

So a 224×224 color image = 224 × 224 × 3 = 150,528 numbers
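
As a quick check of this mental model, the sketch below builds a synthetic image with NumPy (random pixel values standing in for a real photo, which is an assumption made only for illustration) and inspects its shape and size.

```python
import numpy as np

# A synthetic 224x224 RGB image: height x width x channels, values 0-255
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3)
print(image.size)    # total count of numbers in the array
print(image[0, 0])   # top-left pixel: [red, green, blue]

# PyTorch expects channels first (C, H, W), so the axes get reordered
chw = image.transpose(2, 0, 1)
print(chw.shape)
```

The reordering in the last step is what `transforms.ToTensor()` does for you, along with scaling the values to the range 0 to 1.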

A neural network sees this as a tensor of 150K numbers. The challenge: how do you build a network that recognizes "cat" from these 150K numbers, regardless of where in the image the cat is, what size it is, and how it's lit?

The answer: Convolutional Neural Networks.

How Convolutional Layers Work

A convolution slides a small filter across the image, computing a weighted sum at each position:

Input (5×5 image):    3×3 filter:      Output (3×3):
1 2 3 4 5            1 0 -1          [sum1  sum2  sum3]
1 2 3 4 5            1 0 -1          [sum4  sum5  sum6]
1 2 3 4 5            1 0 -1          [sum7  sum8  sum9]
1 2 3 4 5
1 2 3 4 5

Position (0,0): sum1 = 1×1 + 2×0 + 3×-1 + 1×1 + 2×0 + 3×-1 + ...

This particular filter is a vertical edge detector (detects where image intensity changes left-to-right). A different filter detects horizontal edges. Another detects diagonals.
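
The sliding-window computation can be sketched in plain Python. This is a minimal illustration, assuming a square input, stride 1, and no padding; the helper name `convolve2d` is made up for this example. Strictly speaking it computes a cross-correlation, which is what deep learning frameworks call convolution.

```python
# Same 5x5 input and 3x3 filter as in the diagram above
image = [[1, 2, 3, 4, 5] for _ in range(5)]
kernel = [[1, 0, -1] for _ in range(3)]

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Weighted sum of the 3x3 patch under the filter
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output

result = convolve2d(image, kernel)
for row in result:
    print(row)
```

Every output value is -6 here because the input brightens at a constant rate from left to right. A sharp edge would produce a much larger magnitude at the edge and zero over flat regions.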

The magic: in a CNN, we don't choose the filters manually. We initialize them randomly and let gradient descent learn which filters are useful for the task. After training on millions of images, the filters that emerge detect meaningful visual features.

Building a CNN with PyTorch

Setup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
import matplotlib.pyplot as plt
import numpy as np

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Data Loading and Augmentation

# Data transforms with augmentation for training
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # Random crop and resize
    transforms.RandomHorizontalFlip(),         # Flip with 50% probability
    transforms.RandomRotation(15),             # Rotate up to 15 degrees
    transforms.ColorJitter(brightness=0.2,    # Vary brightness
                          contrast=0.2,
                          saturation=0.2),
    transforms.ToTensor(),                     # Convert to PyTorch tensor
    transforms.Normalize(                      # Normalize to ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Validation transforms (no augmentation — only resize and normalize)
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

# Load dataset (directory structure: data/train/class1/, data/train/class2/, etc.)
train_dataset = datasets.ImageFolder('data/train', transform=train_transforms)
val_dataset = datasets.ImageFolder('data/val', transform=val_transforms)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

print(f"Classes: {train_dataset.classes}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
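
To see what the `ToTensor` and `Normalize` steps do to a single pixel, here is a small sketch of the arithmetic in plain Python. The helper `normalize_pixel` is a made-up name for illustration; it mimics scaling to the range 0 to 1 followed by per-channel `(value - mean) / std`.

```python
# ImageNet channel statistics used in the transforms above
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Mimic ToTensor (scale to [0, 1]) followed by Normalize."""
    scaled = [value / 255 for value in rgb]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

mid_gray = normalize_pixel([128, 128, 128])
black = normalize_pixel([0, 0, 0])
print([round(v, 3) for v in mid_gray])
print([round(v, 3) for v in black])
```

After normalization the values are centered near zero with roughly unit spread, which is the input range the pretrained models used later in this tutorial were trained on.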

Custom CNN Architecture

class CustomCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        
        # Feature extraction layers
        self.features = nn.Sequential(
            # Shape comments are per image in PyTorch's channels-first order: (C, H, W)
            # Block 1: 3 channels → 32 filters
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (32, 112, 112)

            # Block 2: 32 → 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # (64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (64, 56, 56)

            # Block 3: 64 → 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # (128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (128, 28, 28)

            # Block 4: 128 → 256 filters
            nn.Conv2d(128, 256, kernel_size=3, padding=1), # (256, 28, 28)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                             # (256, 14, 14)
        )

        # Classification head
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),  # (256, 1, 1): global average pooling
            nn.Flatten(),                   # 256 features
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize
model = CustomCNN(num_classes=len(train_dataset.classes)).to(device)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
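
The spatial sizes through the four blocks can be verified with the standard output-size formula, `(size + 2 * padding - kernel) // stride + 1`. The sketch below is plain Python, and the helper name `conv_output_size` is made up for this example; shapes are written channels-first, as PyTorch stores them.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
channels = [32, 64, 128, 256]
shapes = []
for out_channels in channels:
    size = conv_output_size(size, kernel=3, stride=1, padding=1)  # Conv2d keeps size
    size = conv_output_size(size, kernel=2, stride=2)             # MaxPool2d halves it
    shapes.append((out_channels, size, size))
    print(f"After block with {out_channels} filters: {shapes[-1]}")
```

With `padding=1`, a 3×3 convolution leaves height and width unchanged, so only the pooling layers shrink the feature maps: 224 → 112 → 56 → 28 → 14.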

Training Loop

def train_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    
    best_val_acc = 0.0
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            train_correct += predicted.eq(labels).sum().item()
            train_total += labels.size(0)
        
        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(labels).sum().item()
                val_total += labels.size(0)
        
        # Record metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        
        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
        
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{epochs} — "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | "
              f"Best: {best_val_acc:.4f}")
    
    return history

history = train_model(model, train_loader, val_loader, epochs=20)
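
The `StepLR` scheduler in the loop above multiplies the learning rate by `gamma` every `step_size` epochs. The sketch below reproduces that schedule in plain Python so you can see which rate each epoch uses; `step_lr` is a made-up helper for illustration, not a PyTorch function.

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate StepLR would use during the given (0-indexed) epoch."""
    return initial_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, epoch) for epoch in range(20)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"Epoch {epoch:2d}: lr = {lr:.0e}")
```

With the defaults used here, epochs 1 to 7 train at 1e-3, epochs 8 to 14 at 1e-4, and the remaining epochs at 1e-5.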

Transfer Learning: The Right Way to Do Computer Vision

Training from scratch with a small dataset will give mediocre results. Transfer learning, starting with a model pretrained on ImageNet (1.2 million images, 1000 classes), dramatically improves results with limited data.

Using ResNet50 with Transfer Learning

from torchvision import models

def create_transfer_model(num_classes, freeze_backbone=True):
    # Load pretrained ResNet50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    
    if freeze_backbone:
        # Freeze all layers (only train the final classifier)
        for param in model.parameters():
            param.requires_grad = False
    
    # Replace the final layer with a small head for our number of classes
    # Original: model.fc = Linear(2048, 1000)
    # New: Linear(2048, 512) → ReLU → Dropout → Linear(512, num_classes)
    model.fc = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes)
    )
    
    return model.to(device)

# Phase 1: Train only the classifier head
model_tl = create_transfer_model(
    num_classes=len(train_dataset.classes), 
    freeze_backbone=True
)

# train_model creates its own optimizer; frozen backbone parameters receive
# no gradients, so only the new fc layer is updated in this phase
history_phase1 = train_model(model_tl, train_loader, val_loader, epochs=5, lr=0.001)

# Phase 2: Fine-tune the whole network with a small learning rate
for param in model_tl.parameters():
    param.requires_grad = True

# train_model creates its own optimizer, so pass the lower fine-tuning rate via lr
history_phase2 = train_model(model_tl, train_loader, val_loader, epochs=10, lr=0.0001)

Why two-phase training works:

Phase 1: The pretrained backbone extracts features; you only train the new classification head
Phase 2: Fine-tune everything with a small learning rate to adapt pretrained features to your domain
Without Phase 1: large gradients from the randomly initialized classifier can destroy the carefully learned pretrained features
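
The effect of freezing and unfreezing can be checked by counting trainable parameters. The sketch below uses a tiny stand-in network (a hypothetical two-layer model, not ResNet50) so it runs instantly; `count_trainable` is a made-up helper for illustration.

```python
import torch.nn as nn

# Tiny stand-in for a pretrained network (hypothetical; not ResNet50)
backbone = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
head = nn.Linear(8, 2)

def count_trainable(*modules):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Phase 1: freeze the backbone so only the head is trained
for param in backbone.parameters():
    param.requires_grad = False
phase1_trainable = count_trainable(backbone, head)

# Phase 2: unfreeze everything for fine-tuning
for param in backbone.parameters():
    param.requires_grad = True
phase2_trainable = count_trainable(backbone, head)

print(f"Phase 1 trainable parameters: {phase1_trainable}")
print(f"Phase 2 trainable parameters: {phase2_trainable}")
```

Running the same count on your real model before each phase is a cheap way to confirm that the freeze actually took effect.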

Modern Approach: EfficientNet or ViT

num_classes = len(train_dataset.classes)

# EfficientNet-B4 (excellent accuracy/efficiency tradeoff)
model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.DEFAULT)
in_features = model.classifier[1].in_features  # 1792 for B4
model.classifier[1] = nn.Linear(in_features, num_classes)

# Or use Hugging Face for Vision Transformer
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor)
from transformers import ViTForImageClassification, ViTImageProcessor

model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=num_classes,
    ignore_mismatched_sizes=True
)

Visualizing What the CNN Learned

A useful debugging tool is to visualize the filters the network has learned:

def visualize_filters(model, layer_num=0):
    """Visualize the learned filters in the first conv layer"""
    # Get the first conv layer
    conv_layer = list(model.features.children())[layer_num]
    filters = conv_layer.weight.data.cpu()
    
    # Normalize for visualization
    filters = (filters - filters.min()) / (filters.max() - filters.min())
    
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for idx in range(32):
        ax = axes[idx // 8, idx % 8]
        # Show first channel of each filter
        ax.imshow(filters[idx, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters - Layer 1')
    plt.show()

Making Predictions on New Images

from PIL import Image

def predict_image(model, image_path, class_names, transform):
    model.eval()
    
    # Load and transform image
    image = Image.open(image_path).convert('RGB')
    input_tensor = transform(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)[0]
    
    # Get top predictions (at most 5, fewer if the model has fewer classes)
    top_probs, top_indices = probabilities.topk(min(5, len(class_names)))
    
    print("Predictions:")
    for i, (prob, idx) in enumerate(zip(top_probs, top_indices)):
        print(f"  {i+1}. {class_names[idx.item()]:20}: {prob.item():.3%}")

predict_image(
    model_tl, 
    'test_image.jpg', 
    train_dataset.classes,
    val_transforms
)
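
The softmax and top-k steps inside `predict_image` can be illustrated without a model. The sketch below is plain Python with made-up logits and class names, purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits and class names, for illustration only
class_names = ["cat", "dog", "rabbit"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
k = min(2, len(class_names))                 # never ask for more classes than exist
top = sorted(zip(probs, class_names), reverse=True)[:k]
for prob, name in top:
    print(f"{name:10}: {prob:.1%}")
```

Keep in mind that softmax probabilities measure relative confidence among the known classes. A model can still be confidently wrong on images unlike anything in its training data.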

Common Issues and Solutions

Problem	Symptom	Solution
Overfitting	Train acc >> Val acc	More dropout, data augmentation, fewer layers
Underfitting	Both accuracies low	More layers/filters, longer training, unfreeze backbone
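Slow training	Epochs take too long	Use GPU, increase batch size, raise DataLoader num_workers
Class imbalance	High overall accuracy, poor results on minority classes	Class weights in the loss (weight argument of nn.CrossEntropyLoss), oversampling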
Bad input normalization	Training doesn't converge	Use ImageNet normalization stats for pretrained models
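
For the class imbalance row, one common heuristic is inverse-frequency class weights. The sketch below computes them in plain Python from a made-up label list; the weighting formula is one reasonable choice among several, not the only option.

```python
from collections import Counter

# Hypothetical, imbalanced training labels (class indices)
labels = [0] * 900 + [1] * 80 + [2] * 20

counts = Counter(labels)
num_classes = len(counts)
total = len(labels)

# Inverse-frequency weights: rare classes count more in the loss
class_weights = [total / (num_classes * counts[c]) for c in range(num_classes)]
print([round(w, 2) for w in class_weights])

# In PyTorch these would be passed as a tensor, e.g.
# nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float))
```

With an `ImageFolder` dataset, the label list could be taken from `train_dataset.targets` instead of the hard-coded example above.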

Conclusion

Computer vision with deep learning has become remarkably accessible. The same techniques that power autonomous vehicles and medical diagnostics can be applied to custom classification tasks with a few hundred labeled images and a few hours of training.

Transfer learning is the right starting point for almost every new computer vision project in 2025 — it delivers production-quality results with limited data and compute. Build from scratch only when your domain is fundamentally different from natural images.

For the deep learning foundations, see our neural networks explained guide. For help choosing a framework, see our TensorFlow vs PyTorch comparison.