Convolutional Networks (CNN)

Convolutional Neural Networks

Difficulty: Advanced

Reading Time: 25 min

Track: Deep Learning

Neural networks for grid-like data that reuse small filters across space to learn local, translation-aware features.

Prerequisites

Neural Networks & Deep Learning

TL;DR

A CNN slides small, learned kernels across an input grid instead of connecting every pixel to every neuron, so each output unit only looks at a local receptive field.
Weight sharing means the same kernel is reused at every spatial position, which slashes parameter count and gives the network translation equivariance.
Output spatial size follows $O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$, where $W$ is the input size, $K$ the kernel size, $P$ the padding, and $S$ the stride; padding and stride are the knobs you tune to control how much the map shrinks.
Stacking convolutions grows the effective receptive field layer by layer, letting early layers see edges and later layers see whole objects.
CNNs build in a strong spatial inductive bias, giving up some flexibility in exchange for data efficiency on grid-like data (images, audio, video) compared to architectures with fewer built-in assumptions, like Vision Transformers.
Pooling or strided convolutions downsample feature maps, compounding the receptive-field growth while keeping compute manageable.
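
The output-size formula above can be checked with a few lines of Python. This is a minimal sketch for one spatial axis; the helper name `conv_output_size` and the sample sizes are illustrative choices, not part of any library.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length along one axis: floor((W - K + 2P) / S) + 1."""
    if w + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (w - k + 2 * p) // s + 1


print(conv_output_size(32, 3, p=1, s=1))  # 32: "same" padding keeps the size
print(conv_output_size(32, 3, p=0, s=1))  # 30: no padding trims the border
print(conv_output_size(32, 3, p=1, s=2))  # 16: stride 2 halves the map
print(conv_output_size(32, 2, p=0, s=2))  # 16: a 2x2, stride-2 pooling window
```

The same formula applies to pooling windows, which is why the last call describes a 2x2 pooling step.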

Learning Objectives

Explain the mathematical convolution operation as applied to grid-like data
Describe the concepts of local receptive fields and weight sharing
Calculate dimensions of output activation maps given filter size, stride, and padding
Compare convolutional layers, pooling layers, and fully connected layers

Intuition

How to think conceptually about this topic

Think of a kernel as a learned template. A vertical-edge kernel gives a large positive response when the left side of a small image patch is bright and the right side is dark. Sliding that kernel across an image produces a feature map: high values mark positions where that pattern appears.
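
To make the template idea concrete, here is a small pure-Python sketch that slides a hand-picked vertical-edge kernel over a toy image. The kernel values, the image, and the helper `cross_correlate2d` are assumptions chosen for illustration; in a trained CNN the kernel values would be learned. Note that deep learning libraries typically compute this sliding sum without flipping the kernel (cross-correlation), and the sketch does the same.

```python
def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 sliding sum of elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hand-picked template: positive weights on the left, negative on the right.
vertical_edge = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# 4x6 toy image: bright (1) on the left half, dark (0) on the right half.
image = [[1, 1, 1, 0, 0, 0] for _ in range(4)]

feature_map = cross_correlate2d(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 3, 3, 0]
# [0, 3, 3, 0]
```

The response is zero over the flat regions and peaks where the window straddles the bright-to-dark boundary.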

Training learns the kernel values rather than hand-coding them. A first layer might learn edges and color contrasts; later layers combine those maps into corners, repeated textures, eyes, wheels, or other task-relevant evidence. Pooling or strided convolution then trades exact location for a more compact representation.

In Depth

Detailed explanations and context

Convolutional Neural Networks (CNNs) are built around a simple constraint: nearby pixels matter together, and the same visual pattern can appear in many positions. Instead of connecting every input pixel to every hidden unit, a convolutional layer learns a small kernel and applies it across the image.
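
A rough parameter count shows what this constraint buys. The sketch below compares a fully connected layer with a convolutional layer that both map a 32x32 RGB input to 16 feature maps of the same resolution; the sizes and helper names are illustrative assumptions.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """One kernel per output channel (spanning all input channels) plus a bias each."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


h = w = 32
dense = dense_params(3 * h * w, 16 * h * w)
conv = conv2d_params(3, 16, 3)
print(dense)  # 50348032
print(conv)   # 448
```

The convolutional count does not depend on the image size at all, because the same 3x3 kernels are reused at every position.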

This gives the model three useful biases. Local receptive fields make early layers sensitive to edges and textures. Weight sharing keeps the parameter count small. Stacking layers expands the effective receptive field, so later layers can combine local evidence into object parts and class-level features.
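
The growth of the receptive field with depth can be estimated with a short recurrence: each layer adds `(kernel_size - 1)` steps, measured in units of the current spacing between neighboring outputs, and each stride multiplies that spacing. The sketch below computes this theoretical receptive field along one axis; the layer lists are illustrative, and the region that influences an output strongly in practice is often smaller.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride) pairs, input side first."""
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three 3x3 convolutions at stride 1.
print(receptive_field([(3, 1)] * 3))  # 7

# 3x3 convolutions interleaved with 2x2, stride-2 pooling.
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18
```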

CNNs are not magic image recognizers. They are a way of saying: "look for the same learned pattern everywhere, then compose the detected patterns into higher-level evidence."

How It Compares

Fully Connected Network vs CNN vs Vision Transformer (ViT)

Dimension	Fully Connected Network	CNN	Vision Transformer (ViT)
Parameter count (on images)	Very high: scales with input size $\times$ output size	Low: kernels are shared across all spatial positions	Moderate-to-high: depends on patch size, embedding dimension, and depth; weights are shared across patches, but without convolution's local sliding-window structure
Inductive bias	None — must learn every spatial relationship from scratch	Strong: locality + translation equivariance built in	Weak/minimal — mostly learns spatial relationships via attention and positional embeddings
Data efficiency	Poor on raw pixels — needs huge data to learn spatial structure	Good — strong bias lets it learn from comparatively modest datasets	Poor at small scale — typically needs large-scale pretraining to match or beat CNNs
Long-range dependencies	Possible in principle, but parameter-inefficient	Limited per-layer; requires depth or large kernels to grow receptive field	Strong — self-attention connects any two patches in one layer
Typical use case	Tabular data, or as the final classifier head	Image/audio/video tasks, especially with limited data or compute	Large-scale vision tasks with abundant data/pretraining and a need for global context

Takeaway: CNNs sit between unstructured fully connected networks and ViTs: they trade ViT’s flexibility for a strong spatial prior that makes them far more data- and parameter-efficient on grid-like data, especially when large-scale pretraining is not available.

When to Use It

Reach for this when

Your data has a grid-like spatial or temporal structure (images, spectrograms, video frames, sensor grids) where nearby values are correlated.
You need a data-efficient model and cannot rely on massive pretraining corpora the way Vision Transformers typically do.
You want translation-aware features, e.g. detecting an object or pattern regardless of where it appears in the frame.
Inference needs to run on constrained hardware (mobile/embedded) where a parameter-efficient architecture matters.

Avoid it when

The input has no meaningful spatial locality (e.g. shuffled tabular features) — convolution’s locality assumption buys you nothing.
You need to model long-range global dependencies across the whole input in a single layer — attention-based architectures do this more directly.
You have very large labeled datasets and compute and the absolute best accuracy matters more than data efficiency — a well-pretrained Vision Transformer may outperform CNNs.
Spatial position itself is semantically meaningful and should not be treated as interchangeable (e.g. fixed-layout forms) — translation equivariance becomes a liability rather than a benefit.

Rules of thumb

Start with “same” padding ( $P = \lfloor K/2 \rfloor$ for odd $K$ ) and stride 1 in early layers to preserve resolution while learning features.
Prefer stacks of small kernels (3x3) over a single large kernel: two stacked 3x3 layers cover the same 5x5 window as one 5x5 layer with fewer weights and an extra nonlinearity in between.
Use strided convolutions or pooling deliberately: each downsampling step multiplies how far later kernels reach in input pixels, so the receptive field grows much faster than with depth alone.
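
The small-kernel rule can be sanity-checked by counting weights. The sketch below compares one 5x5 layer with two stacked 3x3 layers at the same channel width; the width of 64 is an arbitrary illustrative choice, and biases are ignored.

```python
def conv_weights(channels: int, kernel_size: int, layers: int = 1) -> int:
    """Weights (no biases) for `layers` convs with `channels` in and out."""
    return layers * channels * channels * kernel_size * kernel_size


c = 64  # illustrative channel width
single_5x5 = conv_weights(c, 5)
stacked_3x3 = conv_weights(c, 3, layers=2)
print(single_5x5)   # 102400
print(stacked_3x3)  # 73728
```

Both options see a 5x5 input window at stride 1 (1 + 2 + 2 = 5 for the stacked pair), but the stacked version uses 18/25 of the weights.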

Implementation

Reference code implementation

Python

model_fitting.py

import torch
import torch.nn as nn
import torch.nn.functional as F

# A simple Convolutional Neural Network in PyTorch
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Input has 3 channels (RGB), outputs 16 channels, filter is 3x3
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        # Pooling window is 2x2, cuts width/height in half
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Final fully connected classifier
        self.fc = nn.Linear(16 * 16 * 16, 10) # Assumes 32x32 input size

    def forward(self, x):
        # 1. Slide filters + apply non-linear activation
        x = F.relu(self.conv1(x))
        # 2. Downsample
        x = self.pool(x)
        # 3. Flatten representation for classifier
        x = torch.flatten(x, 1)
        # 4. Compute output class scores
        return self.fc(x)

# Create model and run fake RGB image batch through it
model = SimpleCNN()
fake_images = torch.randn(4, 3, 32, 32) # Batch of 4 images
logits = model(fake_images)
print(f"Output shape: {logits.shape}") # Expect [4, 10]
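
The hard-coded `16 * 16 * 16` in the linear layer follows from the layer settings above. The following pure-Python sketch traces the spatial size through the convolution and pooling steps; the helper name `flattened_features` is hypothetical, and the numbers simply restate the hyperparameters used in `SimpleCNN`.

```python
def flattened_features(height: int, width: int) -> int:
    """Features reaching the linear layer in SimpleCNN for a given input size."""
    # conv1: 3x3 kernel, padding 1, stride 1 -> spatial size unchanged
    height = (height - 3 + 2 * 1) // 1 + 1
    width = (width - 3 + 2 * 1) // 1 + 1
    # pool: 2x2 window, stride 2 -> spatial size halved
    height = (height - 2) // 2 + 1
    width = (width - 2) // 2 + 1
    return 16 * height * width  # conv1 produces 16 channels


print(flattened_features(32, 32))  # 4096, i.e. 16 * 16 * 16
print(flattened_features(64, 64))  # 16384
```

A 64x64 input would produce 16384 features, which no longer matches the 4096 inputs the linear layer expects. That is the reason for the "Assumes 32x32 input size" comment in the model.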

Strengths & Advantages

Parameter efficiency: Weight sharing drastically reduces parameter count compared to fully connected layers.
Translation-aware features: The same learned detector is reused across the image, so a pattern can be recognized in multiple positions.
Hierarchical feature learning: Automatically builds features from simple edges to complex shapes.
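
The translation-aware behavior can be seen in a tiny 1D sketch: shifting the input shifts the response by the same amount, away from the borders. The signal, the kernel, and the helper `correlate1d` are illustrative assumptions.

```python
def correlate1d(signal, kernel):
    """Valid, stride-1 sliding dot product."""
    k = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, -1]  # responds to a change between neighboring samples
signal = [0, 0, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 0, 0, 0]  # the same pattern, two steps later

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)
print(out)          # [0, -1, 1, 0, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, -1, 1, 0, 0]
```

The detector's response moves with the pattern rather than disappearing, which is equivariance. Invariance to position only appears once pooling or a later layer discards the location.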

Limitations & Drawbacks

Requires useful locality: Works best when nearby values have meaning, as in images, spectrograms, and some time-series data.
Compute hungry: Convolutions are highly parallelizable but computationally expensive, so training large models on large images is usually impractical without GPU acceleration.
Adversarial vulnerability: Tiny, imperceptible changes in pixels can flip a model's prediction.

Real-World Case Studies

AlexNet and the 2012 ImageNet breakthrough

Computer vision / image classification

Scenario

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 required classifying images into 1,000 categories across roughly 1.2 million training images. Prior winning approaches relied on hand-engineered features (e.g. SIFT) combined with classical classifiers, and progress had plateaued.

Approach

Krizhevsky, Sutskever, and Hinton trained “AlexNet,” an 8-layer CNN (5 convolutional layers plus 3 fully connected layers) with ReLU activations, dropout regularization, and data augmentation. The network was trained on GPUs and used an 11x11, stride-4 first convolution followed by smaller kernels and max pooling to progressively build hierarchical features from raw pixels.

Outcome

AlexNet achieved a top-5 test error of 15.3%, compared to 26.2% for the second-place entry that used traditional hand-engineered features — a dramatic margin that is widely credited with triggering the deep learning boom in computer vision over the following decade.

Source: ImageNet Classification with Deep Convolutional Neural Networks — Krizhevsky, A., Sutskever, I. and Hinton, G. E.

Common Misconceptions

Misconception: CNNs are only useful for 2D images.

Correction: CNNs can be applied to 1D data (such as audio signals or text sequences) and 3D data (such as videos or medical MRI scans) by adjusting the kernel dimensions.

Misconception: Pooling layers are always required in a CNN.

Correction: Many modern CNN architectures replace pooling layers with strided convolutions, so the downsampling itself is learned rather than fixed.
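
The two downsampling options can be compared side by side in 1D. In this sketch both halve the length of the signal; the averaging kernel `[0.5, 0.5]` is a hand-picked stand-in for weights that a strided convolution would learn during training, and the helper names are hypothetical.

```python
def max_pool1d(x, window=2, stride=2):
    """Keep the largest value in each window."""
    return [max(x[i:i + window]) for i in range(0, len(x) - window + 1, stride)]


def strided_conv1d(x, kernel, stride=2):
    """Sliding dot product that skips `stride` positions between outputs."""
    k = len(kernel)
    return [
        sum(x[i + a] * kernel[a] for a in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


x = [1, 3, 2, 0, 5, 4, 1, 1]
pooled = max_pool1d(x)
strided = strided_conv1d(x, [0.5, 0.5])
print(pooled)   # [3, 2, 5, 1]
print(strided)  # [2.0, 1.0, 4.5, 1.0]
```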

References & Further Reading

Gradient-Based Learning Applied to Document Recognition. LeCun, Y. et al.
ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky, A., Sutskever, I. and Hinton, G. E.

Related Topics

Neural Networks & Deep Learning

Computer Vision Foundations

Image Segmentation

Vision Transformers (ViT)