How do you choose the kernel size and number of filters?

Kernel size (3x3, 5x5, 7x7) determines the receptive field of each filter — how large a patch it looks at. The trend since VGGNet (Simonyan & Zisserman, 2014) is toward small 3x3 kernels: two 3x3 layers have the same receptive field as one 5x5 but fewer parameters and one more nonlinearity. Number of filters determines representational capacity. The standard practice is to double the number of filters after each max-pooling operation (e.g., 64 → 128 → 256 → 512), which compensates for the spatial resolution being halved. Start with these standard choices and adjust based on your validation performance.
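
The parameter claim above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper `conv_weights` and the width of 64 channels are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weight count of one conv layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 64  # illustrative width, same for input and output

two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

# Receptive field of n stacked stride-1 layers: 1 + n * (k - 1)
rf_two_3x3 = 1 + 2 * (3 - 1)
rf_one_5x5 = 1 + 1 * (5 - 1)

print(two_3x3, one_5x5)        # 73728 102400
print(rf_two_3x3, rf_one_5x5)  # 5 5
```

For any channel width C the ratio is 18C² to 25C², so the stacked 3x3 pair uses 28% fewer weights while covering the same 5x5 patch.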

What are skip connections and why do they matter?

Skip connections, introduced in ResNet (He et al., 2016), add the input of a block directly to its output: `output = F(x) + x`. This has two critical effects. First, it ensures gradient flow: even if the intermediate layers have small gradients, the skip connection provides a direct path from loss to input, preventing vanishing gradients. Second, it changes what the network learns: instead of learning to transform `x` into `output`, it only needs to learn the residual `F(x) = output - x`. Learning small corrections is much easier than learning a full transformation. This is why ResNets can train networks with 50, 101, even 152 layers that were impossible to train without skip connections.

How does transfer learning work with CNNs?

Convolutional networks trained on ImageNet learn a rich hierarchy of visual features — edges, textures, shapes — that transfer remarkably well to other visual tasks. The standard approach: take a pre-trained backbone (ResNet, EfficientNet), freeze the early convolutional layers (which have learned generic features), replace the final classification head with one matching your number of classes, and fine-tune the last few layers on your data. If you have very little data (fewer than 1000 images), freeze everything except the head. More data lets you fine-tune more layers. This is covered in depth in the Transfer Learning article.


Convolutional Neural Networks (CNNs): How Image Recognition Works

⚡ Quick Answer

CNNs learn to see by sharing weights across space. Here's the math behind convolution, pooling, and why ResNets can train 100+ layers without vanishing gradients.

Abdullah Al Arman Emon June 5, 2026 13 min read


A convolutional neural network (CNN) is a neural network that hard-codes two assumptions about images directly into its architecture: nearby pixels are related, and the same visual pattern means the same thing no matter where it sits in the frame. Most explanations stop at "CNNs use sliding filters" and move on — that skips the actual insight.

Think of a CNN as a detective who only needs to learn what a fingerprint looks like once, then can spot that same fingerprint anywhere on the page. A regular network would have to relearn it separately for every possible location.

Those two priors — locality and translation invariance — are why CNNs need far fewer parameters than a generic network and why they became the default architecture for vision.

Why Fully-Connected Networks Fail for Images

A fully-connected layer treats every pixel as unrelated to every other pixel, which throws away the one fact you know for certain about images: neighboring pixels belong together.

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. A single fully-connected hidden layer with 4,096 neurons would need 150,528 × 4,096 ≈ 617 million weights for the first layer alone. AlexNet (Krizhevsky et al., 2012), the network that reset the field of computer vision, has only about 60 million parameters in total, and it stays that small because its convolution and pooling layers shrink the image to a compact feature map before any fully-connected layer sees it.

Beyond the parameter count, fully-connected layers ignore spatial structure entirely. Every input pixel connects to every neuron with an independent weight. The network has no built-in way to notice that adjacent pixels form edges, or that an eye looks the same in the top-left of an image as in the bottom-right — it has to learn each case from scratch, separately, at a staggering parameter cost.

Convolutional layers solve both problems at once.

The Convolution Operation

A convolution slides a small weight matrix — the kernel, or filter — across the input and computes a dot product at every position, the way a stamp presses the same design onto every page of a notebook. If the input is a 2D feature map I and the kernel is K of size k×k:

(I * K)[i, j] = Σₘ Σₙ I[i+m, j+n] · K[m, n]

The output is called a feature map or activation map. Each value answers one question: how strongly does this kernel's pattern appear at this location?
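
To make the formula concrete, here is a small pure-Python sketch of the sliding dot product, with stride 1 and no padding. The function name `conv2d_valid` and the hand-written edge kernel are illustrative assumptions. A trained network would learn its kernel values rather than have them written by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x5 image: dark on the left (0), bright on the right (1)
image = [[0, 0, 0, 1, 1] for _ in range(4)]

# Hand-written vertical-edge detector (illustrative, not a learned filter)
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The output is 0 where the window covers a flat region and 3 where it straddles the dark-to-bright edge, which is the "how strongly does this pattern appear here" reading of a feature map.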

The key mechanism is weight sharing: the same kernel weights K are applied at every position. A filter that detects a horizontal edge uses the same nine numbers, for a 3×3 kernel, whether it looks at position (0,0) or position (100,100). This reduces parameters from H × W × k × k, which is what a layer with a separate k×k kernel at every position would need, down to just k × k per filter.

A single convolutional layer applies multiple filters, each learning to detect a different pattern — one for edges, one for curves, one for a particular texture. With 64 filters of size 3×3×3 applied to an RGB image:

Parameters = 64 × (3 × 3 × 3) + 64 (bias) = 1,792

Compare that to the millions a fully-connected layer would need.
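
The comparison can be reproduced directly. The sketch below reuses the figures from the text (a 224×224 RGB input, 4,096 hidden units, 64 filters of size 3×3); the variable names are illustrative. The two layers do not produce equivalent outputs, so this compares parameter budgets only.

```python
height, width, channels = 224, 224, 3
hidden_units = 4096

inputs = height * width * channels                  # 150,528 input values
fc_params = inputs * hidden_units + hidden_units    # weights + biases

num_filters, k = 64, 3
conv_params = num_filters * (k * k * channels) + num_filters  # weights + biases

print(f"{fc_params:,}")    # 616,566,784
print(f"{conv_params:,}")  # 1,792
```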

Convolution in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters
    kernel_size=3,    # 3x3 kernels
    padding=1,        # 'same' padding — output same spatial size as input
    stride=1
)

# Apply the layer to a random batch and check the shapes
x = torch.randn(1, 3, 32, 32)  # batch_size=1, channels=3, 32x32
output = conv(x)
print(f"Input shape:  {x.shape}")        # [1, 3, 32, 32]
print(f"Output shape: {output.shape}")   # [1, 64, 32, 32]

# Visualizing what a learned filter looks like
# After training, conv.weight[0] is the first filter
# Shape: [3, 3, 3] — 3 input channels, 3x3 spatial
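
The output shape above follows from the standard size formula for convolution and pooling layers, floor((size + 2·padding − kernel) / stride) + 1. A minimal sketch, assuming dilation 1; the helper name `conv_output_size` is hypothetical:

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32
print(conv_output_size(32, kernel_size=3, stride=1, padding=0))  # 30
print(conv_output_size(32, kernel_size=2, stride=2))             # 16
```

With a 3×3 kernel, padding of 1 keeps the spatial size unchanged, while no padding trims one pixel from each border. A 2×2 window with stride 2 halves the size.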

Pooling: Downsampling with Purpose

Pooling is a layer that shrinks a feature map's spatial size by summarizing each local neighborhood into a single value — the CNN equivalent of zooming out to see the bigger picture instead of every brick in the wall. Max pooling takes the maximum value in each local window:

pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Halves spatial dimensions: [1, 64, 32, 32] → [1, 64, 16, 16]

Max pooling wins over the alternatives for two reasons: it tolerates small shifts — if a feature moves slightly within the pooling window, the maximum still fires — and it cuts computation, since every later layer processes fewer spatial positions.
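
The shift tolerance is easy to see on a toy feature map. This is a pure-Python sketch with a hypothetical helper `max_pool_2x2`, assuming even side lengths:

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-lists feature map (even sides)."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]

# One strong activation (9) in the top-left 2x2 window
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

# Same activation shifted one pixel right, still inside the same window
b = [[0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

print(max_pool_2x2(a))                     # [[9, 0], [0, 0]]
print(max_pool_2x2(a) == max_pool_2x2(b))  # True
```

The tolerance only holds within a window: a shift that crosses a window boundary moves the activation to a different output cell.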

Global average pooling (GAP) averages across the entire spatial dimension, collapsing a feature map to a single value per channel. In modern architectures GAP replaced the large fully-connected layers that older networks placed before the classifier, removing millions of parameters while often improving generalization: fewer parameters can mean a better model, not a weaker one.
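
A short sketch of what GAP computes, plus the effect on the classifier's weight count. The helper `global_average_pool` is hypothetical, and the sizes (256 channels, 4×4 maps, 10 classes) are illustrative assumptions.

```python
def global_average_pool(feature_maps):
    """Collapse each channel (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]

# Two 2x2 channels
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
print(global_average_pool(feature_maps))  # [4.0, 1.0]

# Classifier weights for 256 channels of 4x4 maps feeding 10 classes
flatten_head = 256 * 4 * 4 * 10  # flatten, then linear: 40,960 weights
gap_head = 256 * 10              # GAP, then linear: 2,560 weights
print(flatten_head, gap_head)    # 40960 2560
```

GAP itself has no learnable parameters. The saving comes from the much smaller linear layer that follows it.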

Building a CNN from Scratch

class ConvNet(nn.Module):
    """
    A small CNN for CIFAR-10 (32x32 RGB, 10 classes).
    Architecture: 3 conv blocks + global average pooling + classifier
    """
    def __init__(self):
        super().__init__()
        
        # Block 1: 3 → 64 channels, 32x32 → 16x16
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 32x32 → 16x16
        )
        
        # Block 2: 64 → 128 channels, 16x16 → 8x8
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 16x16 → 8x8
        )
        
        # Block 3: 128 → 256 channels, 8x8 → 4x4
        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 8x8 → 4x4
        )
        
        # Global average pooling: 4x4 → 1x1
        self.gap = nn.AdaptiveAvgPool2d(1)
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )
    
    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.gap(x)
        return self.classifier(x)

model = ConvNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# 1,149,770 parameters (about 1.15M): lean but effective

Architecture Evolution

Each architecture introduced an idea that became standard practice:

AlexNet popularized ReLU: it swapped tanh for ReLU activations, added dropout regularization, and proved GPU training could scale.
VGGNet chose simplicity: deep stacks of plain 3×3 convolutions beat hand-tuned exotic kernel shapes.
GoogLeNet added multi-scale vision: Inception modules process the input at several receptive-field sizes simultaneously.
ResNet solved depth: skip connections made networks with 100+ layers trainable and remain one of the most influential CNN innovations, discussed next.
EfficientNet scaled systematically: compound scaling grows width, depth, and resolution together instead of tuning one at a time.

Residual Networks: Solving the Depth Problem

A residual network (ResNet) is a CNN that adds a shortcut path around each block, letting the block learn only the correction to its input rather than the whole transformation from scratch.

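Before ResNets, deeper networks were paradoxically worse than shallower ones, and the cause was optimization difficulty rather than overfitting. A 56-layer plain network reached higher training error than a 20-layer one (He et al., 2016), which rules out overfitting as the culprit. He et al. called this the degradation problem and argued that vanishing gradients alone are unlikely to explain it, since those networks already used batch normalization; the deeper stacks simply struggled to fit even an identity mapping.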

The residual block solves this elegantly:

class ResidualBlock(nn.Module):
    """
    Basic ResNet residual block (He et al., 2016).
    F(x) + x: the block learns the residual, not the full transformation.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
    
    def forward(self, x):
        identity = x                           # skip connection
        
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))        # no activation yet
        
        out = out + identity                   # add skip connection
        out = F.relu(out)                      # activate after addition
        
        return out

The skip connection does two things. First, whenever the residual branch F(x) outputs something close to zero, F(x) + x ≈ x, so the block behaves like an identity mapping and only has to learn deviations from it, which is far easier than learning a full transformation from scratch. (Some implementations zero-initialize the last BatchNorm scale in each block so that training starts in exactly this regime.) Second, during backprop, the gradient flows through two paths at once: through the residual branch F(x) and directly through the skip connection. Even if the residual branch's gradient vanishes, the skip connection keeps information flowing, like a highway bypass that stays open even when the local streets are jammed.
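
The gradient argument can be illustrated with a toy scalar model. This is a deliberately simplified sketch: each "layer" is just f(x) = slope · x, with no nonlinearity or normalization. The function name `chain_gradient` and the slope of 0.01 are illustrative assumptions.

```python
def chain_gradient(local_slope: float, depth: int, residual: bool) -> float:
    """d(output)/d(input) for `depth` stacked scalar layers.

    Each layer is f(x) = local_slope * x. A plain layer contributes a factor
    of local_slope; a residual layer y = f(x) + x contributes local_slope + 1.
    """
    factor = local_slope + 1.0 if residual else local_slope
    return factor ** depth

plain = chain_gradient(0.01, depth=50, residual=False)
skip = chain_gradient(0.01, depth=50, residual=True)
print(plain)  # vanishingly small, on the order of 1e-100
print(skip)   # roughly 1.64
```

The "+ 1" contributed by each skip connection is what keeps the product from collapsing toward zero. Real networks are far messier, but the same term appears in the Jacobian of every residual block.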

ImageNet Benchmark Results

ImageNet is the standard benchmark for image classification: 1.2 million training images across 1,000 classes. Top-1 accuracy is the fraction of images where the model's top prediction is correct.

Model	Year	Top-1 Accuracy	Parameters	GFLOPs
AlexNet	2012	56.5%	60M	0.7
VGG-16	2014	71.6%	138M	15.3
GoogLeNet	2014	69.8%	6.8M	1.4
ResNet-50	2016	76.1%	25.6M	4.1
ResNet-152	2016	77.8%	60.2M	11.3
EfficientNet-B4	2019	82.9%	19M	4.2
EfficientNet-B7	2019	84.3%	66M	37.0
ViT-L/16	2021	87.1%	307M	190.7
ConvNeXt-L	2022	87.5%	198M	34.4

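EfficientNet-B4 beats ResNet-152 on accuracy with about a third of the parameters, a sign that architecture design matters as much as raw scale. The FLOP count matters for deployment specifically: a model that needs 37 GFLOPs per image is a poor fit for real-time inference on a typical mobile device, no matter how accurate it is. One caveat when reading the table: the headline ViT-L/16 and ConvNeXt-L numbers depend on pretraining with datasets larger than ImageNet-1k, so they are not a like-for-like comparison with the rows above them.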

Receptive Field: How Much Does Each Neuron See?

The receptive field is the region of the original input that influences a single neuron's output — and in a deep network it is much larger than that neuron's own kernel size suggests.

Think of it like standing progressively farther back from a mosaic: each step back lets your eye take in more tiles at once, even though your vision hasn't physically changed.

For a stack of n layers with kernel size k and no pooling:

Receptive field = 1 + n × (k - 1)

With pooling (stride 2) every p layers:

Effective receptive field grows as ~2^(n/p)

This is why networks go deep: each layer compounds the previous one, letting final layers integrate information from the entire image while early layers stay focused on fine-grained local patterns.

After five 3×3 convolutions with a 2×2 max pool after every second one, a neuron "sees" a 24×24 patch of the input, far beyond the 3×3 patch its own weights cover. Without the pooling layers, the same five convolutions would cover only 11×11.
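
Receptive fields can be traced layer by layer with a simple recurrence: each layer adds (kernel − 1) × jump pixels, where jump is the spacing in input pixels between neighbouring units and is multiplied by each layer's stride. A minimal sketch, with a hypothetical helper `receptive_field`:

```python
def receptive_field(layers):
    """Receptive field of a unit in the last layer, in input pixels.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

conv, pool = (3, 1), (2, 2)

# Five 3x3 convolutions, no pooling: matches 1 + n * (k - 1)
print(receptive_field([conv] * 5))  # 11

# Five 3x3 convolutions with a 2x2 max pool after every second one
print(receptive_field([conv, conv, pool, conv, conv, pool, conv]))  # 24
```

Each pooling layer doubles the jump, so every convolution after it widens the receptive field twice as fast as before.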

Data Augmentation: Making the Most of Limited Data

Data augmentation is the practice of generating modified copies of your training images — flipped, cropped, recolored — so a limited dataset behaves like a larger, more diverse one. CNNs are hungry for data, and augmentation is the cheapest way to feed that hunger:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),         # random crop with padding
    transforms.RandomHorizontalFlip(p=0.5),        # horizontal flip
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2,
        saturation=0.2, hue=0.1
    ),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],               # ImageNet statistics
        std=[0.229, 0.224, 0.225]
    ),
])

# Validation: no augmentation, only tensor conversion and normalization.
# Inputs stay 32x32 so they match the training crops above.
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Each transform exploits a known invariance: a dog is still a dog when flipped horizontally, so horizontal flip is safe; lighting shouldn't change object identity, so color jitter teaches that; scale and position shouldn't matter either, so random crops teach that too.
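
To show what two of these transforms do to the pixels, here is a pure-Python sketch on a tiny 4×4 "image". The helpers `horizontal_flip` and `random_crop_with_padding` are hypothetical stand-ins written for illustration. They are not the torchvision implementations.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a list-of-lists image."""
    return [row[::-1] for row in image]

def random_crop_with_padding(image, size, padding, rng):
    """Zero-pad the image on all sides, then cut a random size x size window."""
    padded_width = len(image[0]) + 2 * padding
    padded = (
        [[0] * padded_width for _ in range(padding)]
        + [[0] * padding + row + [0] * padding for row in image]
        + [[0] * padded_width for _ in range(padding)]
    )
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, padded_width - size)
    return [row[left:left + size] for row in padded[top:top + size]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

flipped = horizontal_flip(image)
cropped = random_crop_with_padding(image, size=4, padding=1, rng=rng)
print(flipped[0])                     # [4, 3, 2, 1]
print(len(cropped), len(cropped[0]))  # 4 4
```

The crop keeps the original size but shifts the content by up to `padding` pixels in each direction, which is how the network is exposed to small translations.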

Using a Pre-Trained ResNet

Transfer learning means starting from a backbone already trained on ImageNet instead of training from scratch — almost always the right default, since training ImageNet-scale models from zero requires hundreds of GPU-hours and millions of images.

import torchvision.models as models

# Load pre-trained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze early layers — they have learned general features
for name, param in backbone.named_parameters():
    if 'layer4' not in name and 'fc' not in name:
        param.requires_grad = False

# Replace the classification head for your number of classes
num_classes = 5  # your task
backbone.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(backbone.fc.in_features, num_classes)
)

# Only the final layers are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

Freezing the early layers keeps the features that transfer well and only spends training budget on the layers specific to your task. For more detail on fine-tuning strategies, see the Transfer Learning Explained article.

Practical Training Tips

A few habits separate models that train cleanly from ones that stall:

Learning rate warmup avoids early destructive updates: start at a small learning rate, about a tenth of the target, and ramp up linearly over the first few epochs, since weights are still randomly initialized and large steps can wreck them.
Cosine annealing replaces the abrupt drops of a step schedule: decay the learning rate smoothly along a cosine curve, which in practice often generalizes better. PyTorch's OneCycleLR combines a warmup phase with cosine annealing in a single scheduler; the optimizer and train_loader below are assumed to exist already.

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=50,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,    # 30% of training for warmup
)
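
To see the shape of such a schedule without running a training loop, the sketch below hand-rolls linear warmup followed by cosine decay. It illustrates the two ideas and is not a reimplementation of OneCycleLR. The function `lr_at_step` and the step counts are illustrative assumptions.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, start_lr: float, final_lr: float = 0.0) -> float:
    """Linear warmup from start_lr to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

total, warmup = 1000, 300  # 30% warmup, mirroring pct_start=0.3
for step in (0, 150, 300, 650, 1000):
    print(step, round(lr_at_step(step, total, warmup, peak_lr=0.01, start_lr=0.001), 5))
```

The rate starts at a tenth of the peak, reaches the peak at the end of warmup, passes through half the peak midway through the decay, and ends at zero.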

Mixed precision training cuts memory and speeds up training: run the forward and backward pass in float16 and keep parameter updates in float32, which nearly halves memory use and often speeds up training 2-3x on modern GPUs.

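scaler = torch.cuda.amp.GradScaler()  # recent PyTorch also offers torch.amp.GradScaler("cuda")

# One training step; model, criterion, optimizer, images, and target are assumed to exist
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(images)   # forward pass runs in float16 where it is safe
    loss = criterion(output, target)

scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow
scaler.step(optimizer)         # unscales gradients, skips the step if they contain inf/NaN
scaler.update()                # adjusts the scale factor for the next step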
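
The gradient scaler exists because float16 cannot represent very small values. The sketch below uses Python's `struct` module, whose `"e"` format is IEEE 754 half precision, to show a tiny gradient underflowing to zero and then surviving once it is scaled. The helper `to_float16`, the gradient value, and the scale of 2**16 are illustrative assumptions.

```python
import struct

def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_gradient = 1e-8
scale = 65536.0  # 2**16, an illustrative loss scale

print(to_float16(tiny_gradient))  # 0.0, the value underflows in float16

scaled = to_float16(tiny_gradient * scale)
print(scaled > 0.0)               # True, the scaled value is representable
print(scaled / scale)             # close to 1e-8 after unscaling in higher precision
```

Multiplying the loss by the scale factor multiplies every gradient by the same factor, and dividing it back out in float32 before the optimizer step recovers the original magnitude.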

From CNNs to Transformers

CNNs dominated computer vision from AlexNet in 2012 through roughly 2020. Then Vision Transformers (ViT) (Dosovitskiy et al., 2021) showed that transformers could match or beat CNNs on ImageNet — but only when trained on enough data.

The competition is unresolved: ConvNeXt (Liu et al., 2022) showed that modernizing CNNs with transformer-inspired design choices closes most of the accuracy gap. Both architectures remain relevant, and hybrid designs combining both are increasingly common in production systems.

For sequence problems — text, audio, time series — CNNs play a secondary role. That territory belongs to recurrent networks and transformers, covered in LSTM vs Transformer.


Frequently Asked Questions

Three properties of convolutional layers give CNNs an enormous advantage over dense layers for images. Local connectivity: each neuron connects only to a small patch of the input, appropriate because nearby pixels are more related than distant ones. Weight sharing: the same filter is applied at every position, so a cat-ear detector learned in the top-left corner works in the bottom-right corner too. Spatial hierarchy: repeated pooling and convolution build representations from small local features (edges) up to large global ones (objects). A fully-connected network applied to a 224x224 image would need 150,000 input connections per neuron — CNNs cut that to a handful while achieving better accuracy.


Deep Learning

Convolutional Neural Networks (CNNs): How Image Recognition Works

⚡ Quick Answer

CNNs learn to see by sharing weights across space. Here's the math behind convolution, pooling, and why ResNets can train 100+ layers without vanishing gradients.

Abdullah Al Arman Emon June 5, 2026 13 min read

#cnn #convolutional-neural-networks #computer-vision #image-recognition #deep-learning #resnet

📚Part of the Deep Learning guide — explore all Deep Learning articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Convolutional Neural Networks (CNNs): How Image Recognition Works

Those two priors — locality and translation invariance — are why CNNs need far fewer parameters than a generic network and why they became the default architecture for vision.

Why Fully-Connected Networks Fail for Images

A fully-connected layer treats every pixel as unrelated to every other pixel, which throws away the one fact you know for certain about images: neighboring pixels belong together.

Convolutional layers solve both problems at once.

The Convolution Operation

(I * K)[i, j] = Σₘ Σₙ I[i+m, j+n] · K[m, n]

The output is called a feature map or activation map. Each value answers one question: how strongly does this kernel's pattern appear at this location?

Parameters = 64 × (3 × 3 × 3) + 64 (bias) = 1,792

Compare that to the millions a fully-connected layer would need.

Convolution in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters
    kernel_size=3,    # 3x3 kernels
    padding=1,        # 'same' padding — output same spatial size as input
    stride=1
)

# Manual convolution to see the math
x = torch.randn(1, 3, 32, 32)  # batch_size=1, channels=3, 32x32
output = conv(x)
print(f"Input shape:  {x.shape}")        # [1, 3, 32, 32]
print(f"Output shape: {output.shape}")   # [1, 64, 32, 32]

# Visualizing what a learned filter looks like
# After training, conv.weight[0] is the first filter
# Shape: [3, 3, 3] — 3 input channels, 3x3 spatial

Pooling: Downsampling with Purpose

pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Halves spatial dimensions: [1, 64, 32, 32] → [1, 64, 16, 16]

Building a CNN from Scratch

class ConvNet(nn.Module):
    """
    A small CNN for CIFAR-10 (32x32 RGB, 10 classes).
    Architecture: 3 conv blocks + global average pooling + classifier
    """
    def __init__(self):
        super().__init__()
        
        # Block 1: 3 → 64 channels, 32x32 → 16x16
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 32x32 → 16x16
        )
        
        # Block 2: 64 → 128 channels, 16x16 → 8x8
        self.block2 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 16x16 → 8x8
        )
        
        # Block 3: 128 → 256 channels, 8x8 → 4x4
        self.block3 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2)   # 8x8 → 4x4
        )
        
        # Global average pooling: 4x4 → 1x1
        self.gap = nn.AdaptiveAvgPool2d(1)
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )
    
    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.gap(x)
        return self.classifier(x)

model = ConvNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ~1.1M parameters — lean but effective

Architecture Evolution

Each architecture introduced an idea that became standard practice:

AlexNet popularized ReLU: it swapped tanh for ReLU activations, added dropout regularization, and proved GPU training could scale.
VGGNet chose simplicity: deep stacks of plain 3×3 convolutions beat hand-tuned exotic kernel shapes.
GoogLeNet added multi-scale vision: Inception modules process the input at several receptive-field sizes simultaneously.
ResNet solved depth: skip connections are the single most important CNN innovation, discussed next.
EfficientNet scaled systematically: compound scaling grows width, depth, and resolution together instead of tuning one at a time.

Residual Networks: Solving the Depth Problem

A residual network (ResNet) is a CNN that adds a shortcut path around each block, letting the block learn only the correction to its input rather than the whole transformation from scratch.

The residual block solves this elegantly:

class ResidualBlock(nn.Module):
    """
    Basic ResNet residual block (He et al., 2016).
    F(x) + x: the block learns the residual, not the full transformation.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
    
    def forward(self, x):
        identity = x                           # skip connection
        
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))        # no activation yet
        
        out = out + identity                   # add skip connection
        out = F.relu(out)                      # activate after addition
        
        return out

ImageNet Benchmark Results

Model	Year	Top-1 Accuracy	Parameters	GFLOPs
AlexNet	2012	56.5%	60M	0.7
VGG-16	2014	71.6%	138M	15.3
GoogLeNet	2014	69.8%	6.8M	1.4
ResNet-50	2016	76.1%	25.6M	4.1
ResNet-152	2016	77.8%	60.2M	11.3
EfficientNet-B4	2019	82.9%	19M	4.2
EfficientNet-B7	2019	84.3%	66M	37.0
ViT-L/16	2021	87.1%	307M	190.7
ConvNeXt-L	2022	87.5%	198M	34.4

Receptive Field: How Much Does Each Neuron See?

The receptive field is the region of the original input that influences a single neuron's output — and in a deep network it is much larger than that neuron's own kernel size suggests.

Think of it like standing progressively farther back from a mosaic: each step back lets your eye take in more tiles at once, even though your vision hasn't physically changed.

For a stack of n layers with kernel size k and no pooling:

Receptive field = 1 + n × (k - 1)

With pooling (stride 2) every p layers:

Effective receptive field grows as ~2^(n/p)

This is why networks go deep: each layer compounds the previous one, letting final layers integrate information from the entire image while early layers stay focused on fine-grained local patterns.

After five layers of 3×3 convolutions with 2×2 max pooling every two layers, a neuron can "see" an area of roughly 64×64 pixels — far beyond the 3×3 patch its own weights cover.

Data Augmentation: Making the Most of Limited Data

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),         # random crop with padding
    transforms.RandomHorizontalFlip(p=0.5),        # horizontal flip
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2,
        saturation=0.2, hue=0.1
    ),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],               # ImageNet statistics
        std=[0.229, 0.224, 0.225]
    ),
])

# Validation — no augmentation, only normalization
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Using a Pre-Trained ResNet

import torchvision.models as models

# Load pre-trained ResNet-50
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze early layers — they have learned general features
for name, param in backbone.named_parameters():
    if 'layer4' not in name and 'fc' not in name:
        param.requires_grad = False

# Replace the classification head for your number of classes
num_classes = 5  # your task
backbone.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(backbone.fc.in_features, num_classes)
)

# Only the final layers are trainable
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

Practical Training Tips

A few habits separate models that train cleanly from ones that stall:

Learning rate warmup avoids early destructive updates: start with a small learning rate — about a tenth of the target — for the first few epochs, then ramp up linearly, since weights are still randomly initialized and large steps can wreck them.
Cosine annealing finds flatter, more generalizable minima: decay the learning rate along a cosine curve instead of a step schedule.

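# One-cycle schedule: warm up to max_lr, then anneal with a cosine curve (the default strategy)
# Assumes `optimizer` and `train_loader` already exist
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=50,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,    # first 30% of all steps are warmup
)
# Call scheduler.step() after every optimizer.step(), once per batch rather than once per epoch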
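
To see the shape of a warmup-plus-cosine schedule without any framework, the following is a minimal sketch in plain Python. It is not a reimplementation of OneCycleLR, whose start and end values are governed by its own parameters; the function name `lr_at_step`, the `start_factor` of 0.1, and the decay to exactly zero are assumptions chosen to match the description above.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr=0.01, start_factor=0.1):
    """Linear warmup from start_factor * max_lr to max_lr, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp: start_factor * max_lr at step 0, reaching max_lr at warmup_steps
        progress = step / warmup_steps
        return max_lr * (start_factor + (1.0 - start_factor) * progress)
    # Cosine decay: max_lr at the end of warmup, 0 at the final step
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 100
warmup_steps = 30  # 30% of the run, mirroring pct_start=0.3
for s in (0, 15, 30, 65, 99):
    print(s, round(lr_at_step(s, total_steps, warmup_steps), 5))
```

The learning rate rises steadily during warmup, peaks at `max_lr` when warmup ends, and then falls slowly at first, faster in the middle, and slowly again near the end.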

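Mixed precision training cuts memory and speeds up training: run most of the forward and backward pass in float16 while keeping the weights and parameter updates in float32. Activations take roughly half the memory, and training is often 2-3x faster on GPUs with Tensor Cores. Because small float16 gradients can underflow to zero, the loss is scaled up before the backward pass and the gradients are unscaled before the optimizer step; GradScaler handles this.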

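scaler = torch.cuda.amp.GradScaler()   # newer PyTorch releases prefer torch.amp.GradScaler("cuda")

# Inside the training loop, for each batch (inputs, targets):
optimizer.zero_grad()

with torch.cuda.amp.autocast():        # forward pass runs in float16 where it is safe
    output = model(inputs)
    loss = criterion(output, targets)

scaler.scale(loss).backward()          # scale the loss so small gradients do not underflow
scaler.step(optimizer)                 # unscales gradients; skips the step if they contain inf/NaN
scaler.update()                        # adjusts the scale factor for the next iteration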
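
The reason a gradient scaler is needed at all can be shown with the standard library alone. The sketch below rounds values to IEEE half precision using `struct`'s `'e'` format. The gradient value and the scale factor are illustrative assumptions, and `to_float16` is a hypothetical helper, not part of PyTorch.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8   # hypothetical gradient, smaller than float16 can represent
scale = 65536.0    # illustrative power-of-two loss scale (an assumption)

# Stored directly in float16, the gradient is flushed to zero
print(to_float16(tiny_grad))

# Scaling first keeps it representable; dividing afterwards in full precision recovers it
scaled = to_float16(tiny_grad * scale)
recovered = scaled / scale
print(scaled, recovered)
```

Scaling the loss multiplies every gradient by the same factor, so dividing by that factor before the optimizer step restores the original magnitudes with only a small rounding error.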

From CNNs to Transformers

For sequence problems — text, audio, time series — CNNs play a secondary role. That territory belongs to recurrent networks and transformers, covered in LSTM vs Transformer.

Test your CNN knowledge with the Deep Learning Quiz, and for more context on how these architectures fit together, the ML Algorithms Quiz covers the broader landscape.

The Machine Learning course includes hands-on CNN projects, and LLM concepts notes explain how visual features connect to language model representations.

Abdullah Al Arman Emon✓ Verified Writer

Software Testing Expert & Prompt Engineering
