Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are built around a simple constraint: nearby pixels matter together, and the same visual pattern can appear in many positions. Instead of connecting every input pixel to every hidden unit, a convolutional layer learns a small kernel and applies it across the image.

This gives the model three useful biases. Local receptive fields make early layers sensitive to edges and textures. Weight sharing keeps the parameter count small. Stacking layers expands the effective receptive field, so later layers can combine local evidence into object parts and class-level features.
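The growth of the receptive field with depth can be made concrete with a small calculation. The sketch below is illustrative only: `receptive_field` is a hypothetical helper, it assumes undilated layers described by a (kernel size, stride) pair, and it computes the theoretical receptive field, which is an upper bound on the region that actually influences a unit.

```python
def receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Assumption: no dilation.
    """
    rf, jump = 1, 1  # a single input pixel; neighbouring units are 1 input pixel apart
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each extra kernel tap reaches `jump` pixels further
        jump *= stride                  # striding spreads subsequent taps further apart
        sizes.append(rf)
    return sizes


# Three stacked 3x3 convolutions, stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
# 3x3 convolution, 2x2 pooling with stride 2, then another 3x3 convolution
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # [3, 4, 8]
```

Downsampling layers make the receptive field grow faster, which is one reason pooling or strided convolution is interleaved with ordinary convolutions.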

CNNs are not magic image recognizers. They are a way of saying: "look for the same learned pattern everywhere, then compose the detected patterns into higher-level evidence."

Best use

Extremely efficient parameters: Weight sharing drastically reduces parameter count compared to fully connected layers.

Watch out for

Requires useful locality: Works best when nearby values have meaning, as in images, spectrograms, and some time-series data.

Intuition

How to think about this algorithm

Think of a kernel as a learned template. A vertical-edge kernel gives a large positive response when the left side of a small image patch is bright and the right side is dark. Sliding that kernel across an image produces a feature map: high values mark positions where that pattern appears.
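To make the template picture concrete, here is a minimal NumPy sketch with a hand-written vertical-edge kernel. The helper name `cross_correlate2d`, the toy image, and the kernel values are assumptions chosen for illustration; in a trained network the kernel values would be learned.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Hand-written vertical-edge kernel: positive weights on the left, negative on the right.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

# Toy 4x6 image: bright (1) in the left three columns, dark (0) in the right three.
image = np.array([[1, 1, 1, 0, 0, 0]] * 4, dtype=float)

feature_map = cross_correlate2d(image, vertical_edge)
print(feature_map)
# [[0. 3. 3. 0.]
#  [0. 3. 3. 0.]]
```

The response is zero over uniform regions and peaks where the window straddles the bright-to-dark boundary.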

Training learns the kernel values rather than hand-coding them. A first layer might learn edges and color contrasts; later layers combine those maps into corners, repeated textures, eyes, wheels, or other task-relevant evidence. Pooling or strided convolution then trades exact location for a more compact representation.
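The trade between exact location and compactness can be seen in a few lines. This is a sketch assuming non-overlapping 2x2 max pooling on a single feature map; `block_max_pool` is a hypothetical helper, not a library function.

```python
import numpy as np


def block_max_pool(feature_map, size=2):
    """Non-overlapping max pooling.

    Assumption: height and width are divisible by `size`.
    """
    h, w = feature_map.shape
    # Split the map into (size x size) blocks, then take the max inside each block.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Two 4x4 maps with one strong response each, one pixel apart diagonally.
a = np.zeros((4, 4))
a[0, 0] = 5.0
b = np.zeros((4, 4))
b[1, 1] = 5.0

print(block_max_pool(a))
# [[5. 0.]
#  [0. 0.]]
print(np.array_equal(block_max_pool(a), block_max_pool(b)))  # True
```

Both inputs pool to the same 2x2 output: the layer keeps the fact that the pattern was found in the top-left region and discards where exactly inside that region it occurred.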

Interactive Diagram

Spatial Filter Convolutions

[Interactive diagram not reproduced. A kernel window slides across the input grid with a configurable stride; each cell of the output feature map is the sum of elementwise products between the current input window and the kernel.]

Example output cell, where a uniform window meets a kernel with rows (1, 0, -1) and the products cancel:

(1·1) + (1·0) + (1·(-1)) + (1·1) + (1·0) + (1·(-1)) + (1·1) + (1·0) + (1·(-1)) = 0

Key Insight: Convolutional layers slide parameterized kernels over grids to scan for localized features. The output keeps the input's spatial layout while recording how strongly each feature responds at each position.

The Logic

Mathematical core for convolutional neural networks

1. Cross-correlation used in CNNs

Most deep-learning libraries implement cross-correlation, often still called convolution. For input image $X$ and kernel $K$, the output activation at location $(i,j)$ is:

$$Y_{i,j} = b + \sum_{c=1}^{C_{in}}\sum_{u=0}^{k_h-1}\sum_{v=0}^{k_w-1} K_{c,u,v}\,X_{c,i+u,j+v}$$

Here $b$ is the bias, $C_{in}$ is the number of input channels, and $k_h \times k_w$ is the kernel size. With $C_{out}$ learned kernels, the layer produces $C_{out}$ feature maps.
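The cross-correlation sum above translates almost line for line into code. The following is a deliberately slow NumPy sketch, assuming stride 1 and no padding; `conv_layer` and the array shapes are illustrative choices, and real libraries use much faster implementations.

```python
import numpy as np


def conv_layer(x, kernels, bias):
    """Direct implementation of the cross-correlation sum (stride 1, no padding).

    x:       (C_in, H, W) input
    kernels: (C_out, C_in, k_h, k_w), one kernel per output channel
    bias:    (C_out,)
    returns: (C_out, H - k_h + 1, W - k_w + 1)
    """
    c_out, c_in, k_h, k_w = kernels.shape
    _, h, w = x.shape
    out = np.empty((c_out, h - k_h + 1, w - k_w + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[:, i:i + k_h, j:j + k_w]  # shape (C_in, k_h, k_w)
                # Sum over channels c and kernel offsets (u, v), then add the bias.
                out[o, i, j] = bias[o] + np.sum(kernels[o] * patch)
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))            # RGB-like input, values are random
kernels = rng.normal(size=(16, 3, 3, 3))  # 16 kernels, each spanning all 3 input channels
bias = np.zeros(16)

y = conv_layer(x, kernels, bias)
print(y.shape)  # (16, 6, 6)
```

Each of the 16 kernels spans all input channels and yields one feature map.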

2. Output Spatial Dimensions

For one spatial dimension, input size $W$, kernel size $F$, padding $P$, dilation $D$, and stride $S$ produce:

$$O = \left\lfloor \frac{W + 2P - D(F-1) - 1}{S} \right\rfloor + 1$$
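The output-size formula is easy to check numerically. This small helper is a sketch for one spatial dimension; the function name and argument names are illustrative.

```python
def conv_output_size(w, f, p=0, d=1, s=1):
    """Output length along one dimension: floor((W + 2P - D(F-1) - 1) / S) + 1."""
    return (w + 2 * p - d * (f - 1) - 1) // s + 1


print(conv_output_size(32, 3, p=1))  # 32: a 3x3 kernel with padding 1 preserves the size
print(conv_output_size(32, 2, s=2))  # 16: a 2x2 window with stride 2 halves it
print(conv_output_size(32, 3, d=2))  # 28: dilation 2 makes a 3-tap kernel span 5 pixels
print(conv_output_size(7, 3, s=2))   # 3
```

The first two cases correspond to a padded 3x3 convolution and a 2x2 stride-2 pooling window on a 32-pixel side.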

3. Parameters and why sharing matters

A dense layer from a $32\times32\times3$ image to $64$ hidden units needs $32\cdot32\cdot3\cdot64=196{,}608$ weights. A $3\times3$ convolution over the same three input channels with $64$ output channels needs only (biases excluded in both counts):

$$3\cdot3\cdot3\cdot64 = 1{,}728$$

That parameter sharing is the main reason CNNs train efficiently on images.
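The arithmetic can be reproduced with two small functions. They are illustrative helpers that count weights only (biases excluded), matching the figures above.

```python
def dense_weights(height, width, channels, hidden_units):
    """Weights in a fully connected layer over the flattened image (biases excluded)."""
    return height * width * channels * hidden_units


def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Weights in a 2D convolution (biases excluded); independent of image size."""
    return kernel_h * kernel_w * in_channels * out_channels


dense = dense_weights(32, 32, 3, 64)
conv = conv_weights(3, 3, 3, 64)
print(dense)                     # 196608
print(conv)                      # 1728
print(round(dense / conv, 1))    # 113.8

# A larger image inflates the dense layer but leaves the convolution unchanged.
print(dense_weights(224, 224, 3, 64))  # 9633792
```

The convolution's weight count depends only on kernel size and channel counts, so it stays at 1,728 whatever the image resolution.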

Code Example

convolutional_neural_networks.py · pytorch example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A simple Convolutional Neural Network in PyTorch
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Input has 3 channels (RGB), outputs 16 channels, filter is 3x3
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        # Pooling window is 2x2, cuts width/height in half
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Final fully connected classifier
        self.fc = nn.Linear(16 * 16 * 16, 10)  # Assumes 32x32 input size

    def forward(self, x):
        # 1. Slide filters + apply non-linear activation
        x = F.relu(self.conv1(x))
        # 2. Downsample
        x = self.pool(x)
        # 3. Flatten representation for classifier
        x = torch.flatten(x, 1)
        # 4. Compute output class scores
        return self.fc(x)

# Create model and run fake RGB image batch through it
model = SimpleCNN()
fake_images = torch.randn(4, 3, 32, 32)  # Batch of 4 images
logits = model(fake_images)
print(f"Output shape: {logits.shape}")  # Expect [4, 10]
```

Strengths

  • Extremely efficient parameters: Weight sharing drastically reduces parameter count compared to fully connected layers.

  • Translation-aware features: The same learned detector is reused across the image, so a pattern can be recognized in multiple positions.

  • Hierarchical feature learning: Automatically builds features from simple edges to complex shapes.
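The translation-aware behaviour listed above can be checked directly: moving a pattern in the input moves its response in the feature map by the same amount. The sketch below uses NumPy with stride 1 and no padding; the kernel, the pattern, and the helper names are arbitrary illustrative choices, and the equality holds only away from the image borders.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum elementwise products."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def peak_position(feature_map):
    """Return the (row, col) of the largest response as plain ints."""
    row, col = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return int(row), int(col)


kernel = np.array([[1.0, -1.0], [-1.0, 1.0]])   # arbitrary 2x2 detector
pattern = np.array([[2.0, 0.0], [0.0, 2.0]])    # the pattern it responds to

image = np.zeros((6, 8))
image[1:3, 1:3] = pattern

shifted = np.zeros((6, 8))
shifted[1:3, 4:6] = pattern                     # same pattern, 3 columns to the right

fm = cross_correlate2d(image, kernel)
fm_shifted = cross_correlate2d(shifted, kernel)

peak = peak_position(fm)
peak_shifted = peak_position(fm_shifted)
print(peak, peak_shifted)                             # (1, 1) (1, 4)
print(np.array_equal(fm[:, :-3], fm_shifted[:, 3:]))  # True
```

Strictly, this property is translation equivariance: the response moves with the pattern. Invariance to position only emerges approximately after pooling or global aggregation.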

Limitations

  • Requires useful locality: Works best when nearby values have meaning, as in images, spectrograms, and some time-series data.

  • GPU dependent: Convolutions parallelize well but are computationally expensive, so training large models usually relies on GPU acceleration in practice.

  • Adversarial vulnerability: Small, carefully chosen pixel perturbations that are imperceptible to people can change a model's prediction.

Key Assumptions

Scope conditions and interpretation notes

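  • 1. The input has local spatial structure: neighboring values (such as adjacent pixels) are related.

  • 2. The same features are useful at every position in the grid, so one detector can be shared across locations. Convolution itself is translation-equivariant; pooling adds approximate invariance.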

References

Books and papers for deeper study

  • LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998) 'Gradient-based learning applied to document recognition', Proceedings of the IEEE, 86(11), pp. 2278-2324.