Convolutional Neural Networks (CNNs): How Image Recognition Works
CNNs learn to see by sharing weights across space. Here's the math behind convolution, pooling, and why ResNets can train 100+ layers without vanishing gradients.
Most people learn that CNNs use "sliding filters" and move on. But that description misses the key insight that makes CNNs work: they are a specific, principled way of hard-coding two prior beliefs about images into the architecture itself, and those priors turn out to be extraordinarily effective.
The priors are: (1) nearby pixels are more related than distant ones, and (2) the same visual pattern is meaningful regardless of where it appears. A CNN is a neural network that takes these assumptions seriously.
Why Fully-Connected Networks Fail for Images
A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. A single fully-connected hidden layer with 4,096 neurons would need 150,528 × 4,096 ≈ 617 million parameters, just for the first layer. AlexNet (Krizhevsky et al., 2012), which changed computer vision, has only 60 million total parameters.
Beyond the parameter count, fully-connected layers ignore spatial structure entirely. Every input pixel connects to every neuron with an independent weight. The network has no built-in way to notice that adjacent pixels form edges, or that an eye looks the same in the top-left of an image as in the bottom-right.
Convolutional layers solve both problems simultaneously.
The Convolution Operation
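A convolution slides a small weight matrix (the kernel or filter) across the input, computing a dot product at each position. If the input is a 2D feature map I and the kernel is K of size k×k, with m and n running from 0 to k−1:
(I * K)[i, j] = Σₘ Σₙ I[i+m, j+n] · K[m, n]
Strictly speaking this is cross-correlation, since the kernel is not flipped, but deep learning libraries call it convolution anyway.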
The output is called a feature map or activation map. Each value in the feature map answers the question: "how strongly does this kernel pattern appear at this location?"
Here is the key: the same kernel weights K are applied at every position. This is weight sharing. A filter that detects a horizontal edge uses the same 9 numbers (for a 3×3 kernel) whether it is looking at position (0,0) or position (100,100). Compared with learning a separate kernel for each of the H × W output positions, this cuts the parameter count from H × W × k × k to just k × k.
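To make the sliding dot product concrete, here is a minimal pure-Python sketch of the formula above with no padding and a stride of 1. The helper name `conv2d_valid`, the toy image, and the hand-written edge kernel are illustrative assumptions; a trained CNN learns its kernel values rather than having them written by hand.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D list (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the k x k patch at (i, j).
            total = 0
            for m in range(k):
                for n in range(k):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        out.append(row)
    return out


# Toy 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hypothetical vertical-edge kernel: responds where brightness rises left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
```

The same nine kernel values are reused at all six output positions. The response is large where the patch straddles the dark-to-bright edge and zero where the patch is uniformly bright.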
A single convolutional layer applies multiple filters, each learning to detect a different pattern. With 64 filters of size 3×3×3 applied to an RGB image:
Parameters = 64 × (3 × 3 × 3) + 64 (bias) = 1,792
Compare that to the millions a fully-connected layer would need.
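The two parameter counts quoted so far can be checked with a few lines of arithmetic. This is a sketch only; `dense_params` and `conv_params` are hypothetical helper names, and the dense count here includes one bias per neuron, which the 617 million figure above leaves out.

```python
def dense_params(n_inputs, n_units):
    """Fully-connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv_params(in_channels, out_channels, kernel_size):
    """Conv layer: each filter has kernel_size^2 * in_channels weights, plus one bias."""
    return out_channels * (kernel_size * kernel_size * in_channels) + out_channels


fc_count = dense_params(224 * 224 * 3, 4096)
conv_count = conv_params(in_channels=3, out_channels=64, kernel_size=3)

print(f"Fully-connected: {fc_count:,}")
print(f"Convolutional:   {conv_count:,}")
print(f"Ratio:           {fc_count / conv_count:,.0f}x")
```

Note that the convolutional count does not depend on the image size at all: the same 1,792 parameters serve a 32×32 input and a 224×224 input.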
Convolution in PyTorch
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters
    kernel_size=3,    # 3x3 kernels
    padding=1,        # 'same' padding: output has the same spatial size as the input
    stride=1,
)

# Apply the layer to a random batch to check the shapes
x = torch.randn(1, 3, 32, 32)  # batch_size=1, channels=3, 32x32
output = conv(x)
print(f"Input shape: {x.shape}")        # torch.Size([1, 3, 32, 32])
print(f"Output shape: {output.shape}")  # torch.Size([1, 64, 32, 32])

# Inspecting a learned filter:
# conv.weight has shape [64, 3, 3, 3]; after training, conv.weight[0] is the first filter
# Shape of conv.weight[0]: [3, 3, 3] (3 input channels, 3x3 spatial)
```
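The shapes printed above follow from the standard output-size rule for a convolution: out = floor((in + 2 × padding − kernel) / stride) + 1. A minimal sketch of that arithmetic, where `conv_output_size` is a hypothetical helper name and dilation is assumed to be 1:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (dilation = 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(32, kernel_size=3, stride=1, padding=1)     # padding keeps 32
valid = conv_output_size(32, kernel_size=3, stride=1, padding=0)    # no padding shrinks to 30
strided = conv_output_size(32, kernel_size=3, stride=2, padding=1)  # stride 2 halves to 16

print(same, valid, strided)
```

With a 3×3 kernel, padding of 1 is exactly what is needed to keep the spatial size unchanged at stride 1.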
Pooling: Downsampling with Purpose
After convolution, a pooling layer reduces spatial dimensions. Max pooling takes the maximum value in each local window:
```python
pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Halves spatial dimensions: [1, 64, 32, 32] -> [1, 64, 16, 16]
```
Why max pooling specifically? It provides a limited form of translation invariance: if a feature shifts slightly but stays within the same pooling window, the pooled output is unchanged. It also reduces computation, since each subsequent layer processes fewer spatial positions.
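A small pure-Python sketch shows both the invariance and its limit. The helper `max_pool_2x2` and the toy feature maps are assumptions for illustration; the function mirrors a 2×2 window with stride 2 on even-sized inputs.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]) - 1, 2)
        ]
        for i in range(0, len(fmap) - 1, 2)
    ]


# A strong activation (9) at position (0, 0).
a = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same activation shifted to (1, 1): still inside the top-left window.
b = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Shifted to (0, 2): it has crossed into the neighboring window.
c = [
    [0, 0, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
pooled_c = max_pool_2x2(c)
print(pooled_a, pooled_b, pooled_c)
```

The first two inputs pool to the same output, while the third does not, so the invariance only covers shifts that stay inside one window.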
Average pooling computes the mean instead of the maximum. It is less common in intermediate layers but standard in global average pooling (GAP), which averages over the entire spatial extent to collapse each feature map to a single value per channel. GAP replaced the large fully-connected layers at the end of many modern architectures. The pooling step itself has no parameters, so millions of weights shrink to a single small linear classifier, which also tends to improve generalization.
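As a rough illustration, global average pooling and its effect on the classifier size can be sketched in plain Python. The function name, the toy channels, and the 256-channel 4×4 feature map used for the count are assumptions chosen for the example.

```python
def global_average_pool(feature_maps):
    """Collapse each channel (a 2D list) to its mean value."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two hypothetical 2x2 channels.
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
pooled = global_average_pool(feature_maps)
print(pooled)

# Classifier size for a 256-channel, 4x4 feature map and 10 classes:
flatten_then_linear = (256 * 4 * 4) * 10 + 10  # flatten every position, then classify
gap_then_linear = 256 * 10 + 10                # pool to one value per channel first
print(flatten_then_linear, gap_then_linear)
```

The gap widens quickly with larger feature maps or wider hidden layers, since the pooled vector length depends only on the channel count.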
Building a CNN from Scratch
```python
class ConvNet(nn.Module):
    """
    A small CNN for CIFAR-10 (32x32 RGB, 10 classes).
    Architecture: 3 conv blocks + global average pooling + classifier
    """

    def __init__(self):
        super().__init__()
        # Block 1: 3 -> 64 channels, 32x32 -> 16x16
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 32x32 -> 16x16
        )
        # Block 2: 64 -> 128 channels, 16x16 -> 8x8
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 16x16 -> 8x8
        )
        # Block 3: 128 -> 256 channels, 8x8 -> 4x4
        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 8x8 -> 4x4
        )
        # Global average pooling: 4x4 -> 1x1
        self.gap = nn.AdaptiveAvgPool2d(1)
        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.gap(x)
        return self.classifier(x)


model = ConvNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ~1.1M parameters: lean but effective
```
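The parameter total can be reproduced by hand, which also shows where the weights sit. The sketch below assumes the layer sizes of the ConvNet above, with biases on every convolution and two learnable values per channel in each batch-norm layer; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases for a k x k convolution."""
    return c_out * (k * k * c_in) + c_out


def bn_params(channels):
    """Learnable scale (gamma) and shift (beta) per channel."""
    return 2 * channels


def block_params(c_in, c_out):
    """Two conv + batch-norm pairs; ReLU and pooling have no parameters."""
    return (
        conv_params(c_in, c_out) + bn_params(c_out)
        + conv_params(c_out, c_out) + bn_params(c_out)
    )


blocks = [block_params(3, 64), block_params(64, 128), block_params(128, 256)]
classifier = 256 * 10 + 10  # final linear layer after global average pooling
total = sum(blocks) + classifier

for index, count in enumerate(blocks, start=1):
    print(f"Block {index}: {count:,}")
print(f"Classifier: {classifier:,}")
print(f"Total: {total:,}")
```

More than three quarters of the parameters sit in the third block, where the channel counts are largest, while the classifier contributes only a few thousand.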
Architecture Evolution
Each architecture introduced an idea that became standard practice:
- AlexNet: ReLU activations (instead of tanh), dropout regularization, GPU training
- VGGNet: Deep stacks of 3×3 convolutions, favoring simplicity over exotic kernel shapes
- GoogLeNet: Inception modules, which process the input at multiple scales simultaneously
- ResNet: Skip connections, arguably the most influential CNN innovation
- EfficientNet: Compound scaling, which scales width, depth, and resolution together
Residual Networks: Solving the Depth Problem
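Before ResNets, deeper networks were paradoxically worse than shallower ones, not because of overfitting but because they were harder to optimize. He et al. (2016) reported that a 56-layer plain network reached higher training error than a 20-layer one on CIFAR-10, and called this the degradation problem. Vanishing gradients are the usual intuition for why depth hurts, although the authors observed the effect even with batch normalization in place.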
The residual block solves this elegantly:
```python
class ResidualBlock(nn.Module):
    """
    Basic ResNet residual block (He et al., 2016).
    F(x) + x: the block learns the residual, not the full transformation.
    """

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # skip connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))           # no activation yet
        out = out + identity                      # add skip connection
        out = F.relu(out)                         # activate after addition
        return out
```
The skip connection does two things. First, when F(x) is near zero (which can be encouraged by initializing the last batch-norm scale in the branch to zero), F(x) + x ≈ x, so the block starts close to an identity mapping and learns deviations from it. This is far easier to optimize than learning a full transformation from scratch. Second, during backprop the gradient flows through two paths: through the residual branch F(x) and directly through the skip connection. Even if the gradient through the residual branch vanishes, the skip connection keeps it flowing to earlier layers.
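The two gradient paths can be seen in a deliberately simplified scalar model. Suppose every layer has the same small local derivative F'(x) = d. A plain stack multiplies those derivatives together, while a residual stack multiplies 1 + d, because the derivative of x + F(x) is 1 + F'(x). The function name and the numbers below are illustrative assumptions, and the sketch ignores the ReLU applied after the addition.

```python
def chain_gradient(local_derivatives, residual):
    """Gradient of the output with respect to the input through a stack of scalar layers."""
    grad = 1.0
    for d in local_derivatives:
        # Plain layer: dy/dx = d.  Residual layer: dy/dx = 1 + d.
        grad *= (1.0 + d) if residual else d
    return grad


depth = 20
local_derivatives = [0.05] * depth  # hypothetical small derivative in every layer

plain = chain_gradient(local_derivatives, residual=False)
skip = chain_gradient(local_derivatives, residual=True)

print(f"Plain stack:    {plain:.3e}")
print(f"Residual stack: {skip:.3f}")
```

The plain product collapses toward zero after a few layers, whereas the residual product stays on the order of 1 because the identity path contributes a constant 1 at every layer.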
ImageNet Benchmark Results
ImageNet is the standard benchmark for image classification: 1.2 million training images, 1,000 classes. Top-1 accuracy is the fraction of images where the top prediction is correct.
| Model | Year | Top-1 Accuracy | Parameters | GFLOPs |
|---|---|---|---|---|
| AlexNet | 2012 | 56.5% | 60M | 0.7 |
| VGG-16 | 2014 | 71.6% | 138M | 15.3 |
| GoogLeNet | 2014 | 69.8% | 6.8M | 1.4 |
| ResNet-50 | 2016 | 76.1% | 25.6M | 4.1 |
| ResNet-152 | 2016 | 77.8% | 60.2M | 11.3 |
| EfficientNet-B4 | 2019 | 82.9% | 19M | 4.2 |
| EfficientNet-B7 | 2019 | 84.3% | 66M | 37.0 |
| ViT-L/16 | 2021 | 87.1% | 307M | 190.7 |
| ConvNeXt-L | 2022 | 87.5% | 198M | 34.4 |
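Note that EfficientNet-B4 achieves better accuracy than ResNet-152 with fewer parameters (19M vs. 60.2M) and less computation (4.2 vs. 11.3 GFLOPs). The FLOP count matters for deployment: a model that requires 37 GFLOPs per image is unlikely to run in real time on a typical mobile device. The ViT and ConvNeXt figures also rely on pre-training on larger datasets, so they are not a like-for-like comparison with the earlier rows.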
Receptive Field: How Much Does Each Neuron See?
A neuron in a deep layer "sees" a much larger region of the input than its immediate kernel would suggest. This is the receptive field.
For a stack of n layers with kernel size k and no pooling:
Receptive field = 1 + n × (k - 1)
With pooling (stride 2) every p layers:
Effective receptive field grows as ~2^(n/p)
This is why networks go deep: each layer compounds the previous, allowing final layers to integrate information from the entire image while early layers respond to fine-grained local patterns.
After 5 layers of 3×3 convolutions with 2×2 max pooling after every 2 of them, a neuron can "see" an area of 24×24 pixels, compared with 11×11 for the same 5 convolutions without pooling.
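Receptive fields for arbitrary stacks can be computed with the usual layer-by-layer recurrence: each layer adds (kernel − 1) × jump to the receptive field, where jump is the product of all earlier strides. The sketch below is one way to write that down; `receptive_field` is a hypothetical helper, and the second example assumes the ConvNet defined earlier in this article (two 3×3 convolutions followed by 2×2 pooling, three times).

```python
def receptive_field(layers):
    """Receptive field of one output unit; layers is a list of (kernel_size, stride)."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride
    return rf


conv = (3, 1)  # 3x3 convolution, stride 1
pool = (2, 2)  # 2x2 max pooling, stride 2

plain_stack = receptive_field([conv] * 5)
convnet = receptive_field([conv, conv, pool] * 3)

print(f"Five 3x3 convs, no pooling: {plain_stack}x{plain_stack}")
print(f"ConvNet before global pooling: {convnet}x{convnet}")
```

The first result matches the closed-form rule 1 + n × (k − 1) with n = 5 and k = 3. The second shows that each unit in the final 4×4 feature map of the ConvNet already covers more than the full 32×32 CIFAR-10 input.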
Data Augmentation: Making the Most of Limited Data
CNNs are hungry for data. When you have limited training images, augmentation creates effective diversity:
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # random crop with padding
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal flip
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2,
        saturation=0.2, hue=0.1,
    ),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet statistics
        std=[0.229, 0.224, 0.225],
    ),
])

# Validation: no augmentation, only normalization.
# Images stay at 32x32 to match the training crops above. An ImageNet-style
# 224x224 pipeline would instead pair Resize(256) + CenterCrop(224) here
# with RandomResizedCrop(224) in training.
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
Augmentation exploits known invariances: a dog is still a dog when flipped horizontally. Color jitter teaches the network that object identity does not depend on exact lighting. Random crops with padding teach robustness to small shifts in position, and small rotations do the same for camera tilt.
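To see what the crop and flip transforms do to the pixels, here is a stripped-down sketch on a single-channel image stored as a list of lists. It is not the torchvision implementation; the function names and the 4×4 toy image are assumptions made for illustration.

```python
import random


def pad_and_random_crop(image, size, padding, rng):
    """Zero-pad a 2D list on all sides, then cut a random size x size window."""
    padded_width = len(image[0]) + 2 * padding
    padded = (
        [[0] * padded_width for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [[0] * padded_width for _ in range(padding)]
    )
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, padded_width - size)
    return [row[left:left + size] for row in padded[top:top + size]]


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"

cropped = pad_and_random_crop(image, size=4, padding=1, rng=rng)
flipped = horizontal_flip(image)

for row in cropped:
    print(row)
```

The crop keeps the output size equal to the input size but shifts the content by up to `padding` pixels in each direction, which is why it encourages robustness to small translations.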
Using a Pre-Trained ResNet
Almost always, you should start with a pre-trained backbone rather than training from scratch. Training ImageNet-scale models from scratch requires hundreds of GPU-hours and millions of images.
```python
import torchvision.models as models

# Load pre-trained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze early layers: they have learned general features
for name, param in backbone.named_parameters():
    if 'layer4' not in name and 'fc' not in name:
        param.requires_grad = False

# Replace the classification head for your number of classes
num_classes = 5  # your task
backbone.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(backbone.fc.in_features, num_classes),
)

# Only the final layers are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
```
This is transfer learning in practice. For more detail on fine-tuning strategies, see the Transfer Learning Explained article.
Practical Training Tips
A few things that make a real difference:
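Learning rate warmup: start with a small learning rate (for example, 1/10th of the target) and increase it linearly to the target over the first few epochs. This prevents the optimizer from making large destructive updates while the weights are still close to their random initialization.
Cosine annealing: decay the learning rate following a cosine curve rather than in discrete steps. In practice it often generalizes better than step decay, which is sometimes attributed to finding flatter minima. PyTorch's OneCycleLR combines a warmup phase with cosine annealing in a single scheduler: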
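```python
# Assumes optimizer and train_loader are already defined
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=50,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # 30% of training for warmup
)
# Call scheduler.step() after every optimizer.step(), once per batch
```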
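The shape of a warmup-plus-cosine schedule is easy to write out by hand. The sketch below follows the description in the text (linear warmup from one tenth of the peak, then cosine decay to zero); it is a hand-rolled illustration with hypothetical names and step counts, and it does not reproduce the exact curve that OneCycleLR computes.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr, start_factor=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from start_factor * max_lr up to max_lr.
        progress = step / warmup_steps
        return max_lr * (start_factor + (1.0 - start_factor) * progress)
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps, warmup_steps, max_lr = 1000, 100, 0.01
schedule = [
    lr_at_step(step, total_steps, warmup_steps, max_lr)
    for step in range(total_steps + 1)
]

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {schedule[step]:.5f}")
```

The rate starts at a tenth of the peak, reaches the peak when warmup ends, passes through half the peak midway through the decay, and finishes at zero.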
Mixed precision training: use float16 for the forward and backward passes and float32 for parameter updates. This can cut activation memory by close to half and often speeds up training 2-3× on GPUs with tensor cores.
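```python
# Assumes model, criterion, optimizer, and train_loader are already defined.
# Newer PyTorch releases expose the same tools as torch.amp.autocast("cuda")
# and torch.amp.GradScaler("cuda").
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass in reduced precision
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()    # scale the loss so small gradients survive float16
    scaler.step(optimizer)           # unscale gradients, skip the step on inf/NaN
    scaler.update()                  # adjust the scale factor for the next iteration
```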
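The reason for the gradient scaler is that float16 cannot represent very small values. Python's standard `struct` module can round a number to half precision, which is enough to demonstrate the effect without a GPU. The gradient value and the scale factor below are illustrative assumptions.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8   # hypothetical small gradient value
scale = 65536.0    # 2**16, an illustrative loss scale

unscaled = to_fp16(tiny_grad)        # too small for float16: rounds to zero
scaled = to_fp16(tiny_grad * scale)  # large enough to be represented
recovered = scaled / scale           # unscale in higher precision

print(f"float16 without scaling: {unscaled}")
print(f"float16 with scaling:    {scaled}")
print(f"after unscaling:         {recovered:.3e}")
```

Multiplying the loss by a constant multiplies every gradient by the same constant, so the small values survive the float16 backward pass and can be divided back down before the float32 parameter update.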
From CNNs to Transformers
CNNs dominated computer vision from AlexNet (2012) through 2020. Then Vision Transformers (ViT, Dosovitskiy et al., 2021) showed that transformers could match or beat CNNs on ImageNet when trained on enough data.
The tension is unresolved: ConvNeXt (Liu et al., 2022) showed that modernizing CNNs with transformer-inspired design choices closes most of the gap. Both architectures remain important, and hybrid approaches are increasingly common.
For sequence problems (text, audio, time series), CNNs play a secondary role. That territory belongs to recurrent networks and transformers, covered in LSTM vs Transformer.
Test your CNN knowledge with the Deep Learning Quiz, and for more context on how these architectures fit together, the ML Algorithms Quiz covers the broader landscape.
The Machine Learning course includes hands-on CNN projects, and LLM concepts notes explain how visual features connect to language model representations.
AiTechWorlds Team