Deep Learning Architectures

Neural Networks & Deep Learning

Difficulty: Advanced

Reading Time: 30 min

Track: Deep Learning

Layered differentiable models that learn feature representations by optimizing weights with gradient descent.

Prerequisites

Logistic Regression

Deep Learning: Module 1 of 19

TL;DR

A feedforward neural network (MLP) stacks layers of $z = Wx + b$ followed by a non-linear activation $f(z)$, letting it model curved decision boundaries that a single linear layer cannot.
Non-linearity is the whole point: stacking purely linear layers collapses back into one linear layer, so the activation function is what gives the network its expressive power.
The Universal Approximation Theorem says a single hidden layer with enough units and a suitable non-linear activation can approximate any continuous function on a closed, bounded domain to any desired accuracy, but "enough" can mean exponentially many units.
Depth lets the network compose features hierarchically (edges to shapes to objects), often representing the same function with far fewer total parameters than a single very wide layer.
Training uses backpropagation (the chain rule applied layer by layer) plus an optimizer like SGD or Adam to adjust every weight and bias to reduce the loss.
XOR is the classic proof that you need at least one hidden layer: it is not linearly separable, so no single-layer perceptron can solve it.
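
The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses small, arbitrary weights (`W1`, `b1`, `W2`, `b2` are illustrative values, not from any trained model) and shows that applying two linear layers in sequence gives the same result as the single layer $(W_2 W_1)x + (W_2 b_1 + b_2)$, while inserting a ReLU between them breaks that equivalence.

```python
# Two stacked linear layers with no activation between them.
# All weights are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, 0.0, -1.0]  # 2 -> 3
W2, b2 = [[1.0, -1.0, 2.0]], [0.25]                                # 3 -> 1
x = [2.0, -3.0]

# Layer by layer: W2 (W1 x + b1) + b2
hidden = add(matvec(W1, x), b1)
stacked = add(matvec(W2, hidden), b2)

# Collapsed into one layer: (W2 W1) x + (W2 b1 + b2)
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

# Same two layers, but with a ReLU applied to the hidden values
with_relu = add(matvec(W2, [max(0.0, h) for h in hidden]), b2)

print("stacked:  ", stacked)    # [-2.25]
print("collapsed:", collapsed)  # [-2.25]
print("with ReLU:", with_relu)  # [1.25]
```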

Learning Objectives

Explain the forward pass through a Multi-Layer Perceptron (MLP)
Describe how non-linear activations (ReLU, Sigmoid) enable networks to model non-linear functions
Contrast forward propagation and backpropagation steps
Explain the role of gradient descent in optimization of weights and biases

Intuition

How to think conceptually about this topic

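Imagine a model trying to identify whether a photo contains a car. Picture it as an assembly line of workers, where each group of workers is one layer of the network.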

The workers at the very start of the line (the first layer) only look at tiny, zoomed-in pixels. They can only recognize basic things like straight edges or simple curves. They pass their findings to the next group of workers.

The second layer of workers looks at those edges and curves and combines them. They might say, "Ah, these curves make a circle!" or "These edges make a rectangle!" They pass this up the chain.

The deeper layers combine those shapes into task-level evidence: wheels, windows, chassis-like structure, and background context. The output layer converts that evidence into class scores or probabilities.

In Depth

Detailed explanations and context

Neural networks are layered differentiable functions. Each layer applies a linear transformation followed by a non-linear activation, and training adjusts the weights so the final output minimizes a loss function.
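
As a rough illustration of "a linear transformation followed by a non-linear activation", here is a framework-free forward pass through a tiny 3 -> 2 -> 1 network. The weights and the input vector are hypothetical values chosen only to make the arithmetic easy to follow; a real network would learn them during training.

```python
import math

def dense(W, b, x):
    """Affine step z = Wx + b for one layer (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(z):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, v) for v in z]

def sigmoid(z):
    """Element-wise sigmoid: maps each value into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

# Hypothetical weights for a 3 -> 2 -> 1 network.
W1, b1 = [[0.5, -1.0, 0.0], [1.0, 1.0, -0.5]], [0.0, 0.1]
W2, b2 = [[1.0, -2.0]], [0.0]

def forward(x):
    h = relu(dense(W1, b1, x))        # hidden layer: affine step, then ReLU
    return sigmoid(dense(W2, b2, h))  # output layer: affine step, then sigmoid

prob = forward([1.0, 0.5, 2.0])[0]
print(f"predicted probability: {prob:.4f}")
```

Each call to `dense` is one $z = Wx + b$ step; the activation applied to its result is what keeps the two layers from collapsing into a single linear map.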

Deep learning means using many such layers. Depth lets the model build representations hierarchically: early layers learn simple signals, middle layers combine them, and later layers specialize them for the target task.

Where is it used?

Neural networks are used in language models, recommender systems, image recognition, speech recognition, translation, scientific modeling, and medical imaging. They are strongest when you have enough data, compute, and evaluation discipline to justify their flexibility.

How It Compares

Logistic Regression vs Single Hidden Layer MLP vs Deep MLP

Dimension	Logistic Regression	Single Hidden Layer MLP	Deep MLP
Representational capacity	Linear decision boundary only	Can approximate any continuous function given enough hidden units (universal approximation)	Can represent many structured functions with far fewer total parameters via hierarchical composition
Risk of overfitting	Low — few parameters, strong bias	Moderate to high if hidden layer is very wide relative to data size	High — many parameters; needs regularization, dropout, and enough data
Training difficulty	Easy — convex loss, fast convergence	Moderate — non-convex but generally well-behaved with modern optimizers	Harder — vanishing/exploding gradients, more hyperparameters, longer training
Typical use case	Simple, interpretable binary classification on linearly separable or near-linear data	Small-to-medium tabular problems with mild non-linearity	Images, text, audio, and other high-dimensional, highly non-linear problems with abundant data

Takeaway: Start with logistic regression as a cheap, interpretable baseline; reach for a single hidden layer when the data shows clear non-linear structure but stays modest in size; go deep only when the problem has rich hierarchical structure (vision, language, audio) and you have enough data and compute to justify the added overfitting risk and training difficulty.
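
To put rough numbers on the three columns of the comparison, the sketch below counts weights and biases for fully connected models. The layer widths (100 input features, a 512-unit hidden layer, three 64-unit hidden layers) are assumptions picked for illustration. Parameter count is only one ingredient of capacity and overfitting risk, so treat this as arithmetic, not a benchmark.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_features = 100  # hypothetical input width

logistic = mlp_param_count([n_features, 1])              # no hidden layer
shallow = mlp_param_count([n_features, 512, 1])          # one wide hidden layer
deep = mlp_param_count([n_features, 64, 64, 64, 1])      # three narrow hidden layers

print("logistic regression:", logistic)  # 101
print("single hidden layer:", shallow)   # 52225
print("deep MLP:           ", deep)      # 14849
```

With these particular widths the deeper network has fewer parameters than the wide shallow one, which is the kind of trade-off the table describes.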

When to Use It

Reach for this when

The relationship between inputs and outputs is non-linear and cannot be captured by simple feature engineering on a linear or logistic model.
You have a large labeled dataset (thousands to millions of examples) relative to the number of parameters you plan to use.
The input has rich structure that benefits from hierarchical features — images, audio, text, or other high-dimensional signals.
You have access to enough compute (GPUs/TPUs) and time to iterate on architecture and hyperparameters.

Avoid it when

You have a small dataset (hundreds of rows) — a deep network will likely overfit; prefer linear/logistic regression or tree ensembles.
You need a fully interpretable model for regulatory or stakeholder reasons — favor simpler, transparent models or add explicit explainability tooling.
Latency or memory budgets are extremely tight (e.g. embedded devices) and a much smaller model would meet accuracy requirements.
The relationship is genuinely close to linear — a neural network adds complexity and tuning burden without meaningfully improving accuracy.

Rules of thumb

Start with the simplest model that could plausibly work (logistic regression or a shallow MLP) before reaching for depth.
Watch the gap between training and validation loss — a widening gap signals overfitting; add dropout, weight decay, or more data.
Prefer increasing depth over width when the problem has hierarchical structure; prefer width when the function is simple but needs fine resolution.
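
The second rule of thumb, watching the gap between training and validation loss, can be sketched as below. The loss curves are made-up numbers, and early stopping is not one of the remedies listed above; it is used here only as one simple way to act on a widening gap. The `patience` parameter is an assumption of this sketch.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_val, waited = 0, float("inf"), 0
    for epoch, val in enumerate(val_losses):
        if val < best_val:
            best_epoch, best_val, waited = epoch, val, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val = [0.92, 0.65, 0.52, 0.50, 0.53, 0.58, 0.64]

gaps = [round(v - t, 2) for t, v in zip(train, val)]
best = early_stop_epoch(val)

print("train/val gap per epoch:", gaps)
print("best epoch by validation loss:", best)  # 3
```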

Implementation

Reference code implementation

Python

model_fitting.py

import torch
import torch.nn as nn
import torch.optim as optim

# Build a simple neural network with PyTorch
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 32),  # First hidden layer: 10 input features -> 32 units
            nn.ReLU(),          # Non-linear activation
            nn.Linear(32, 16),  # Second hidden layer: 32 -> 16 units
            nn.ReLU(),          # Non-linear activation
            nn.Linear(16, 1),   # Output layer: 16 -> 1 score
            nn.Sigmoid()        # Squash the score into a probability in (0, 1)
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNetwork()
criterion = nn.BCELoss()  # Binary cross-entropy: how we measure mistakes
optimizer = optim.Adam(model.parameters(), lr=0.01)  # How we update weights

# Fake data: 8 examples, 10 features each, with random 0/1 labels
x_batch = torch.randn(8, 10)
y_batch = torch.randint(0, 2, (8, 1)).float()

# One training step
optimizer.zero_grad()                   # Clear previous gradients
predictions = model(x_batch)            # 1. Forward pass
loss = criterion(predictions, y_batch)  # 2. Compute loss
loss.backward()                         # 3. Backpropagate gradients
optimizer.step()                        # 4. Update weights

print(f"Current Error (Loss): {loss.item():.4f}")
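
To see what the four numbered steps in the training step above do without a framework, here is a minimal sketch that repeats them by hand. It simplifies the setup: a single sigmoid unit instead of the three-layer network, one made-up training example, and plain gradient descent instead of Adam. The learning rate and step count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One made-up example with two features and a positive label.
x, y = [1.0, -2.0], 1.0
w, b = [0.0, 0.0], 0.0
lr = 0.1

losses = []
for step in range(20):
    # 1. Forward pass
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    # 2. Compute loss
    losses.append(bce(p, y))
    # 3. Backward pass: for sigmoid + cross-entropy the chain rule gives dL/dz = p - y
    dz = p - y
    grad_w = [dz * xi for xi in x]
    grad_b = dz
    # 4. Update weights (plain gradient descent, not Adam)
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In a multi-layer network, step 3 applies the same chain rule once per layer, passing the gradient backward from the output to the first layer; `loss.backward()` automates exactly that.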

Strengths & Advantages

They can learn useful feature representations directly from data instead of relying only on hand-engineered features.
They scale well with data and compute, especially on GPUs and other accelerator hardware.
They support many architectures: dense networks, CNNs, RNNs, transformers, autoencoders, and graph neural networks.

Limitations & Drawbacks

They can be difficult to interpret, especially when many layers and millions of parameters interact.
They often need more data and tuning than simpler models. On small tabular datasets, tree ensembles are frequently stronger baselines.
Training and serving large networks can be expensive and requires careful monitoring for overfitting, drift, and failure modes.

Real-World Case Studies

The XOR problem and the first "AI winter" critique of perceptrons

History of AI / foundations of deep learning

Scenario

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, proving that a single-layer perceptron cannot learn the XOR function because its four labeled points are not linearly separable. At the time, single-layer perceptrons were the dominant trainable neural model, and this result was widely read as a fundamental limitation of neural networks.

Approach

The eventual resolution was architectural: adding a hidden layer of just 2 units lets the network first map the XOR inputs into a new 2D space where the classes become linearly separable, after which a final linear output layer can solve the problem. Backpropagation (popularized in the mid-1980s by Rumelhart, Hinton, and Williams) provided a practical algorithm for training such multi-layer networks end-to-end.

Outcome

A minimal 2-2-1 MLP (2 inputs, 2 hidden ReLU/sigmoid units, 1 output) solves XOR with 100% accuracy on the 4 truth-table examples, something no single-layer perceptron can do for any choice of weights. This single example reframed the field’s understanding: the limitation was not "neural networks," but specifically linear, single-layer networks — motivating decades of subsequent work on deeper architectures and ultimately today’s deep learning systems.
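
One concrete 2-2-1 solution can be written down by hand. The weights below are chosen manually for readability, not learned, and they are only one of many weight settings that work; a trained network may well arrive at different values. The sketch uses ReLU hidden units and a linear output.

```python
def relu(v):
    return max(0.0, v)

def xor_net(x1, x2):
    # Hidden layer: two ReLU units that re-map the inputs
    h1 = relu(x1 + x2)        # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    # Linear output layer on the hidden values
    return h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(x1, x2) for x1, x2 in inputs]

for (x1, x2), out in zip(inputs, outputs):
    print(x1, x2, "->", out)
```

The hidden layer maps the four inputs to the points (0, 0), (1, 0), (1, 0), and (2, 1) in $(h_1, h_2)$ space, where a single line separates the two classes.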

Source: Neural Networks and Deep Learning, Ch. 1 ("Perceptrons" and the XOR limitation) — Nielsen, M. A.

Common Misconceptions

Misconception: More layers always lead to better performance without any downside.

Correction: Increasing depth increases model capacity, which can lead to severe overfitting if data is insufficient. It also increases training time and can cause vanishing/exploding gradient problems.

Misconception: Neural networks simulate the human brain exactly.

Correction: Neural networks are inspired by biological networks, but their mathematical training (backpropagation) and structures differ significantly from biological brains.
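
The vanishing-gradient point in the first correction above can be made concrete with a small calculation. The derivative of the sigmoid is at most 0.25, and backpropagation multiplies one such factor per sigmoid layer. The sketch below deliberately ignores the weight matrices, which also enter the product and can shrink or grow it, so it only shows the activation's contribution in the best case.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid at z: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # the sigmoid's slope peaks at z = 0

# Best-case product of activation derivatives across `depth` sigmoid layers
factors = {depth: max_grad ** depth for depth in (1, 5, 10, 20)}

for depth, factor in factors.items():
    print(f"{depth:>2} layers: activation factor <= {factor:.3e}")
```

Activations such as ReLU, whose derivative is 1 for positive inputs, avoid this particular shrinkage, which is one reason they are common in deeper networks.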

References & Further Reading

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning (textbook).
Nielsen, M. A. Neural Networks and Deep Learning (textbook).

Related Topics

Logistic Regression

Convolutional Neural Networks

Transformers