Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are built around a simple constraint: nearby pixels matter together, and the same visual pattern can appear in many positions. Instead of connecting every input pixel to every hidden unit, a convolutional layer learns a small kernel and applies it across the image.
This gives the model three useful biases. Local receptive fields make early layers sensitive to edges and textures. Weight sharing keeps the parameter count small. Stacking layers expands the effective receptive field, so later layers can combine local evidence into object parts and class-level features.
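How quickly stacking expands the receptive field can be worked out with a standard recurrence. The sketch below is only an illustration: the function name `receptive_field` and the example layer lists are hypothetical, and it assumes undilated layers described by (kernel size, stride) pairs.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of one unit after the given layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    Assumes no dilation.
    """
    rf = 1    # a single input pixel covers only itself
    jump = 1  # spacing, in input pixels, between adjacent units at this depth
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3 convolutions with stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# 3x3 convolution, 2x2 stride-2 pooling, then another 3x3 convolution
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Each extra 3x3 layer adds two pixels of context at stride 1, and any earlier downsampling multiplies what later layers add.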
CNNs are not magic image recognizers. They are a way of saying: "look for the same learned pattern everywhere, then compose the detected patterns into higher-level evidence."
Intuition
How to think about this algorithm
Think of a kernel as a learned template. A vertical-edge kernel gives a large positive response when the left side of a small image patch is bright and the right side is dark. Sliding that kernel across an image produces a feature map: high values mark positions where that pattern appears.
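The template idea can be tried on a toy image. This is a minimal sketch in plain Python, assuming a hand-written 3x3 vertical-edge kernel and a 4x6 image that is bright on the left and dark on the right; `KERNEL`, `IMAGE`, and `feature_map` are illustrative names, and a trained network would learn its kernel values instead.

```python
# 3x3 vertical-edge template: positive weights on the left, negative on the right.
KERNEL = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# Toy 4x6 grayscale image: bright (1) in the left half, dark (0) in the right half.
IMAGE = [[1, 1, 1, 0, 0, 0] for _ in range(4)]


def feature_map(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(k) for n in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


for row in feature_map(IMAGE, KERNEL):
    print(row)  # [0, 3, 3, 0] for both rows
```

The response is zero over the flat regions and peaks at the two window positions that straddle the bright-to-dark boundary.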
Training learns the kernel values rather than hand-coding them. A first layer might learn edges and color contrasts; later layers combine those maps into corners, repeated textures, eyes, wheels, or other task-relevant evidence. Pooling or strided convolution then trades exact location for a more compact representation.
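The trade between exact location and compactness can be seen with 2x2 max pooling. The sketch below is a hedged illustration in plain Python; `max_pool_2x2` and `single_peak` are hypothetical helpers, and the feature maps are made up for the example.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a list-of-lists map (even sizes assumed)."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]


def single_peak(row, col, size=4):
    """A size x size map that is zero except for a 1 at (row, col)."""
    return [[1 if (i, j) == (row, col) else 0 for j in range(size)] for i in range(size)]


a = max_pool_2x2(single_peak(2, 2))
b = max_pool_2x2(single_peak(3, 3))
print(a)       # [[0, 0], [0, 1]]
print(a == b)  # True: both peaks fall inside the same pooling window
```

The 4x4 map shrinks to 2x2, and the pooled output no longer records where inside the window the peak sat. A shift that crosses a window boundary, such as a peak at (1, 1), still changes the pooled output, so the tolerance is only local.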
The Logic
Mathematical core for convolutional neural networks
1. Cross-correlation used in CNNs
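Most deep-learning libraries implement cross-correlation, often still called convolution. For input image $X$ and a $k_h \times k_w$ kernel $K$ (single channel, stride 1, no padding), the output activation at location $(i, j)$ is:
$$Y_{i,j} = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X_{i+m,\,j+n}\, K_{m,n}$$
With $C_{out}$ learned kernels, the layer produces $C_{out}$ feature maps.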
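The difference between cross-correlation and textbook convolution is a kernel flip. The following sketch, in plain Python with the hypothetical helpers `cross_correlate` and `convolve`, computes both on a small made-up input; it ignores bias, padding, stride, and multiple channels.

```python
def cross_correlate(x, k):
    """Y[i][j] = sum over m, n of X[i+m][j+n] * K[m][n] (no padding, stride 1)."""
    kh, kw = len(k), len(k[0])
    return [
        [
            sum(x[i + m][j + n] * k[m][n] for m in range(kh) for n in range(kw))
            for j in range(len(x[0]) - kw + 1)
        ]
        for i in range(len(x) - kh + 1)
    ]


def convolve(x, k):
    """True convolution: cross-correlation with the kernel flipped on both axes."""
    flipped = [row[::-1] for row in k[::-1]]
    return cross_correlate(x, flipped)


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]

print(cross_correlate(X, K))  # [[-4, -4], [-4, -4]]
print(convolve(X, K))         # [[4, 4], [4, 4]]
```

Because the kernel values are learned, the flip does not change what a layer can represent, which is why libraries can skip it.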
2. Output Spatial Dimensions
For one spatial dimension, input size $n$, kernel size $k$, padding $p$, dilation $d$, and stride $s$ produce:
$$n_{out} = \left\lfloor \frac{n + 2p - d\,(k - 1) - 1}{s} \right\rfloor + 1$$
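The output-size rule is easy to check by hand. The helper below is a small sketch using the formula given in the PyTorch `Conv2d` documentation, floor((n + 2p - d(k - 1) - 1) / s) + 1; the name `conv_output_size` and the example sizes are illustrative.

```python
def conv_output_size(n, k, p=0, d=1, s=1):
    """Output length for input n, kernel k, padding p, dilation d, stride s."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1


print(conv_output_size(32, 3, p=1))  # 32: a 3x3 kernel with padding 1 keeps the size
print(conv_output_size(32, 2, s=2))  # 16: a 2x2 window with stride 2 halves it
print(conv_output_size(32, 3, d=2))  # 28: dilation 2 makes the kernel span 5 pixels
print(conv_output_size(7, 3, s=2))   # 3
```

The first two calls match the layers in the code example later in this article: the padded 3x3 convolution keeps a 32x32 input at 32x32, and the 2x2 pooling reduces it to 16x16.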
3. Parameters and why sharing matters
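A dense layer from an $H \times W \times C_{in}$ image to $M$ hidden units needs $H \cdot W \cdot C_{in} \cdot M$ weights. A $k \times k$ convolution with $C_{out}$ output channels needs only $k^2 C_{in} C_{out}$ weights plus $C_{out}$ biases:
$$(k^2\, C_{in} + 1)\, C_{out}$$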
That parameter sharing is the main reason CNNs train efficiently on images.
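To put numbers on the comparison, here is a short sketch. The sizes are assumptions chosen for illustration (a 32x32 RGB input, 16 hidden units, and 16 output channels with a 3x3 kernel), and `dense_params` and `conv_params` are hypothetical helper names; both counts include one bias per output unit or channel.

```python
def dense_params(height, width, channels, hidden_units):
    """Fully connected layer: one weight per (input value, hidden unit) plus biases."""
    return height * width * channels * hidden_units + hidden_units


def conv_params(kernel_size, in_channels, out_channels):
    """Convolution: one k x k x C_in kernel plus one bias per output channel."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


print(dense_params(32, 32, 3, 16))  # 49168
print(conv_params(3, 3, 16))        # 448
```

The convolution's count does not depend on the image height or width, so the gap widens as the input grows.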
Code Example
convolutional_neural_networks.py · pytorch example
import torch
import torch.nn as nn
import torch.nn.functional as F


# A simple convolutional neural network in PyTorch
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Input has 3 channels (RGB), output has 16 channels, filter is 3x3.
        # padding=1 keeps the spatial size unchanged (32x32 stays 32x32).
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        # Pooling window is 2x2 with stride 2, which halves width and height
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Final fully connected classifier: 16 channels x 16 x 16 after pooling.
        # Assumes a 32x32 input; other sizes need a different in_features.
        self.fc = nn.Linear(16 * 16 * 16, 10)

    def forward(self, x):
        # 1. Slide filters and apply a non-linear activation
        x = F.relu(self.conv1(x))
        # 2. Downsample
        x = self.pool(x)
        # 3. Flatten everything except the batch dimension for the classifier
        x = torch.flatten(x, 1)
        # 4. Compute output class scores
        return self.fc(x)


# Create the model and run a fake RGB image batch through it
model = SimpleCNN()
fake_images = torch.randn(4, 3, 32, 32)  # Batch of 4 images
logits = model(fake_images)
print(f"Output shape: {logits.shape}")  # Expect torch.Size([4, 10])

Strengths
Extremely efficient parameters: Weight sharing drastically reduces parameter count compared to fully connected layers.
Translation-aware features: The same learned detector is reused across the image, so a pattern can be recognized in multiple positions.
Hierarchical feature learning: Automatically builds features from simple edges to complex shapes.
Limitations
Requires useful locality: Works best when nearby values have meaning, as in images, spectrograms, and some time-series data.
Compute-heavy: Convolutions parallelize well but are computationally expensive, so training at practical scale usually relies on GPUs or similar accelerators.
Adversarial vulnerability: Small, often imperceptible pixel perturbations can change the predicted class.
Key Assumptions
Scope conditions and interpretation notes
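1. The input data has local spatial relationships (neighboring pixels are related).
2. The same features are useful at every grid position. Convolution itself is translation equivariant (shifting the input shifts the feature map); pooling adds approximate invariance on top.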
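The translation assumption can be checked on a toy signal. This is a minimal sketch in plain Python, assuming a 1D signal and wrap-around (circular) padding so the equality is exact; with zero padding it would hold only away from the borders. The helper names `circular_cross_correlate` and `shift` are hypothetical.

```python
def circular_cross_correlate(signal, kernel):
    """1D cross-correlation with wrap-around (circular) padding."""
    n = len(signal)
    return [
        sum(signal[(i + m) % n] * kernel[m] for m in range(len(kernel)))
        for i in range(n)
    ]


def shift(values, steps):
    """Cyclically shift a list to the right by `steps` positions."""
    steps %= len(values)
    return values[-steps:] + values[:-steps]


signal = [0, 0, 1, 3, 1, 0, 0, 0]
kernel = [1, 0, -1]

shift_then_filter = circular_cross_correlate(shift(signal, 2), kernel)
filter_then_shift = shift(circular_cross_correlate(signal, kernel), 2)

# Shifting the input shifts the feature map by the same amount
print(shift_then_filter == filter_then_shift)  # True
# A global max over positions gives the same value for both inputs
print(max(shift_then_filter) == max(circular_cross_correlate(signal, kernel)))  # True
```

The first check shows that the filter's response moves with the pattern; the second shows that discarding position with a global maximum yields a value that does not depend on where the pattern was.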
References
Books and papers for deeper study
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998) 'Gradient-based learning applied to document recognition', Proceedings of the IEEE, 86(11), pp. 2278-2324.