Deep Learning Architectures

Neural Networks & Deep Learning

Neural networks are layered differentiable functions. Each layer applies a linear transformation followed by a non-linear activation, and training adjusts the weights so the final output minimizes a loss function.

Deep learning means using many such layers. Depth lets the model build representations hierarchically: early layers learn simple signals, middle layers combine them, and later layers specialize them for the target task.
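To see why the non-linear activation between layers matters, here is a minimal sketch using scalar "layers". The weights are hypothetical values chosen only for illustration. It shows that two stacked linear layers with no activation are equivalent to a single linear layer, while inserting tanh breaks that equivalence.

```python
import math

def linear(w, b, x):
    """Scalar linear layer: w * x + b."""
    return w * x + b

# Hypothetical weights, chosen only for illustration.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    # Two layers with NO activation in between.
    return linear(w2, b2, linear(w1, b1, x))

def collapsed_layer(x):
    # The same function written as ONE linear layer:
    # w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
    return linear(w2 * w1, w2 * b1 + b2, x)

def two_layers_with_tanh(x):
    # A non-linearity between the layers prevents the collapse.
    return linear(w2, b2, math.tanh(linear(w1, b1, x)))

xs = [-1.0, 0.0, 0.5, 2.0]
stacked = [two_linear_layers(x) for x in xs]
collapsed = [collapsed_layer(x) for x in xs]
nonlinear = [two_layers_with_tanh(x) for x in xs]
```

Here `stacked` and `collapsed` agree at every input, so depth alone adds nothing without the activation; `nonlinear` differs, which is what lets deeper models represent more than a straight line.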

Where is it used?

Neural networks are used in language models, recommender systems, image recognition, speech recognition, translation, scientific modeling, and medical imaging. They are strongest when you have enough data, compute, and evaluation discipline to justify their flexibility.

Best use

They can learn useful feature representations directly from data instead of relying only on hand-engineered features.

Watch out for

They can be difficult to interpret, especially when many layers and millions of parameters interact.

Intuition

How to think about this algorithm

Imagine a model trying to identify whether a photo contains a car. Picture the model as an assembly line, where each group of workers along the line is one layer.

The workers at the very start of the line (the first layer) only look at tiny, zoomed-in pixels. They can only recognize basic things like straight edges or simple curves. They pass their findings to the next group of workers.

The second layer of workers looks at those edges and curves and combines them. They might say, "Ah, these curves make a circle!" or "These edges make a rectangle!" They pass this up the chain.

The deeper layers combine those shapes into task-level evidence: wheels, windows, chassis-like structure, and background context. The output layer converts that evidence into class scores or probabilities.
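One common way an output layer turns class scores into probabilities is the softmax function. The sketch below is an illustration only: the class names and score values are hypothetical, and the text above does not say which output function a given model uses.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum is a standard numerical-stability step;
    # it does not change the resulting probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the classes ["car", "truck", "no vehicle"].
class_scores = [2.0, 0.5, -1.0]
probabilities = softmax(class_scores)
predicted_index = probabilities.index(max(probabilities))
```

The class with the highest score also receives the highest probability, so `predicted_index` points at "car" in this made-up example.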

Interactive Diagram: Multi-Layer Perceptron Activations

[Interactive diagram omitted. It shows a small multi-layer perceptron with adjustable inputs x1 and x2 and adjustable weights. Connection thickness represents weight magnitude, and a legend distinguishes positive weights, negative weights, and the output activation.]

Example readout: inputs x1 = 0.50, x2 = -0.20; weights w11 (Input1 → Hidden1) = 0.80, w21 (Input2 → Hidden1) = -0.40, wOut1 (Hidden1 → Output) = 0.90; z_out = w^T · h + b = 0.635; output prediction = 0.6536.

Key Insight: Neural networks use cascaded matrices of weights and biases to transform input vectors. Non-linear activation functions (like tanh/sigmoid) allow boundaries to bend.
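The numbers in the diagram readout can be partly checked by hand. The sketch below assumes the output activation is a sigmoid, which is consistent with the displayed values but is not stated explicitly. It also computes only the weighted input sum into Hidden1, because the hidden biases and the remaining weights are not shown.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the diagram readout.
x1, x2 = 0.50, -0.20
w11, w21 = 0.80, -0.40
z_out = 0.635

# Weighted input sum into Hidden1 (any bias term is not shown, so it is omitted).
hidden1_weighted_sum = w11 * x1 + w21 * x2  # 0.40 + 0.08 = 0.48

# Assumption: the output unit applies a sigmoid to z_out.
prediction = sigmoid(z_out)
print(f"{prediction:.4f}")  # 0.6536
```

The result matches the displayed output prediction of 0.6536, which supports the sigmoid assumption for this particular readout.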

The Logic

Mathematical core for neural networks & deep learning

1. The Forward Pass

A basic neural network layer takes your input data ($x$), multiplies it by a set of weights ($W$), adds a bias ($b$), and then passes the result through an "activation function" ($f$) to introduce non-linearity (so it can learn curves, not just straight lines):

$$z^{(1)} = W^{(1)}x + b^{(1)}$$

$$a^{(1)} = f(z^{(1)})$$

$$\hat{y} = W^{(2)}a^{(1)} + b^{(2)}$$
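The three forward-pass equations can be sketched in plain Python as follows. The layer sizes, parameter values, and the choice of tanh for the activation are assumptions made for illustration, and the helper names `matvec`, `add`, and `forward` are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def forward(x, W1, b1, W2, b2, f=math.tanh):
    z1 = add(matvec(W1, x), b1)      # z^(1) = W^(1) x + b^(1)
    a1 = [f(z) for z in z1]          # a^(1) = f(z^(1))
    y_hat = add(matvec(W2, a1), b2)  # y_hat = W^(2) a^(1) + b^(2)
    return z1, a1, y_hat

# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.8, -0.4], [0.1, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.9, -0.2, 0.4]]
b2 = [0.05]

z1, a1, y_hat = forward([0.5, -0.2], W1, b1, W2, b2)
```

Each line of `forward` corresponds to one equation, so the shapes are easy to follow: `z1` and `a1` have one entry per hidden unit, and `y_hat` has one entry per output.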

2. Backpropagation (How it learns)

When the network makes a prediction, the loss $\mathcal{L}$ measures the error. Backpropagation applies the chain rule to compute gradients of that loss with respect to every trainable weight:

$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}}$$
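As a sanity check on the chain rule, the sketch below works through a scalar version of the network and compares the analytic gradient with a finite-difference estimate. The squared-error loss, the tanh activation, and all parameter values are assumptions chosen for illustration.

```python
import math

def loss(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    a1 = math.tanh(z1)
    y_hat = w2 * a1 + b2
    return 0.5 * (y_hat - y) ** 2  # assumed loss: half squared error

def grad_w1(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    a1 = math.tanh(z1)
    y_hat = w2 * a1 + b2
    dL_da1 = (y_hat - y) * w2    # dL/da1, passing through the output layer
    da1_dz1 = 1.0 - a1 ** 2      # derivative of tanh
    dz1_dw1 = x                  # derivative of w1 * x + b1 w.r.t. w1
    return dL_da1 * da1_dz1 * dz1_dw1

# Hypothetical parameter values and a single training example.
params = dict(w1=0.8, b1=0.1, w2=0.9, b2=-0.2, x=0.5, y=1.0)
analytic = grad_w1(**params)

# Central finite difference as an independent check.
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss(**plus) - loss(**minus)) / (2 * eps)
```

The three factors in `grad_w1` mirror the three factors in the equation; `analytic` and `numeric` should agree to several decimal places.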

An optimizer such as stochastic gradient descent or Adam then updates the weights in a direction that tends to reduce the loss on future batches.
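The update rule itself is simple. The sketch below uses plain full-batch gradient descent (not stochastic gradient descent or Adam) on a one-parameter model, with hypothetical data generated from y = 2x, to show the loss shrinking as the weight moves against the gradient.

```python
def mse_and_grad(w, xs, ys):
    """Mean squared error of the model y = w * x, and its gradient w.r.t. w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

# Hypothetical data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial weight
learning_rate = 0.05  # assumed step size
losses = []
for step in range(50):
    loss, grad = mse_and_grad(w, xs, ys)
    losses.append(loss)
    w = w - learning_rate * grad  # step against the gradient
```

After the loop, `w` is close to 2 and the recorded losses decrease. Stochastic variants apply the same update using gradients from mini-batches, and Adam additionally rescales each step using running gradient statistics.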

Code Example

neural_networks_&_deep_learning.py · PyTorch example (Python)

import torch
import torch.nn as nn
import torch.optim as optim

# Build a simple Neural Network with PyTorch
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 32),  # Input layer
            nn.ReLU(),          # Activation function
            nn.Linear(32, 16),  # Hidden layer
            nn.ReLU(),          # Activation function
            nn.Linear(16, 1),   # Output layer
            nn.Sigmoid()        # Squish output between 0 and 1
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNetwork()
criterion = nn.BCELoss()  # How we measure mistakes
optimizer = optim.Adam(model.parameters(), lr=0.01)  # How we update weights

# Fake data: 8 examples, 10 features each
x_batch = torch.randn(8, 10)
y_batch = torch.empty(8, 1).random_(2)

# One training step
optimizer.zero_grad()                   # Clear previous gradients
predictions = model(x_batch)            # 1. Forward pass
loss = criterion(predictions, y_batch)  # 2. Compute loss
loss.backward()                         # 3. Backpropagate gradients
optimizer.step()                        # 4. Update weights

print(f"Current Error (Loss): {loss.item():.4f}")

Strengths

  • They can learn useful feature representations directly from data instead of relying only on hand-engineered features.

  • They scale well with data and compute, especially on GPUs and other accelerator hardware.

  • They support many architectures: dense networks, CNNs, RNNs, transformers, autoencoders, and graph neural networks.

Limitations

  • They can be difficult to interpret, especially when many layers and millions of parameters interact.

  • They often need more data and tuning than simpler models. On small tabular datasets, tree ensembles are frequently stronger baselines.

  • Training and serving large networks can be expensive and requires careful monitoring for overfitting, drift, and failure modes.
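One simple form of monitoring for overfitting is early stopping on a validation loss. The sketch below is a hypothetical helper, not part of any particular library: it scans a list of per-epoch validation losses and reports the epoch at which training would stop once the loss has failed to improve for `patience` consecutive epochs. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: they improve, then start rising.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 6
```

In practice one would also keep the weights from the best epoch (epoch 3 here) rather than the weights at the stopping point.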

Key Assumptions

Scope conditions and interpretation notes

  1. A sufficiently large, representative training dataset is available to avoid overfitting.

  2. Optimization via gradient descent converges to a satisfactory local minimum.
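The second assumption matters because gradient descent only finds a nearby minimum, which depends on where it starts. The sketch below illustrates this on a hypothetical one-dimensional non-convex function, f(w) = w^4 - 3w^2 + w, chosen purely for illustration; it is not a neural network loss.

```python
def f(w):
    """A non-convex function with two minima (illustrative only)."""
    return w ** 4 - 3 * w ** 2 + w

def f_grad(w):
    """Derivative of f."""
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w = w - learning_rate * f_grad(w)
    return w

# Same algorithm, two different starting points.
w_left = gradient_descent(-2.0)
w_right = gradient_descent(2.0)
```

Both runs end at a point where the gradient is essentially zero, but `w_left` settles in the deeper minimum (near w = -1.30) while `w_right` settles in a shallower one (near w = 1.13). Whether the shallower solution is "satisfactory" is exactly what the assumption asks you to check.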

References

Books and papers for deeper study

  • Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. Cambridge, MA: MIT Press.