Neural Networks & Deep Learning
Neural networks are layered differentiable functions. Each layer applies a linear transformation followed by a non-linear activation, and training adjusts the weights so the final output minimizes a loss function.
Deep learning means using many such layers. Depth lets the model build representations hierarchically: early layers learn simple signals, middle layers combine them, and later layers specialize them for the target task.
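One way to see why each layer needs a non-linear activation is to stack two purely linear layers and compare the result with a single one. The sketch below uses plain Python instead of PyTorch so the arithmetic stays visible. The weights `W1`, `W2`, the input `x`, and the helper names are hypothetical, chosen only for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise ReLU: negative entries become 0."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input, chosen to keep the numbers small.
W1 = [[1.0, -1.0],
      [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
x = [2.0, 5.0]

# Two linear layers with no activation in between...
stacked_linear = matvec(W2, matvec(W1, x))
# ...give the same result as ONE linear layer with weights W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the composition is no longer linear.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked_linear, collapsed, with_relu)  # [0.0] [0.0] [3.0]
```

Without the activation, extra depth adds nothing, because the stack collapses into a single linear map. With the ReLU, this tiny network computes the absolute difference of its two inputs, which no single linear layer can represent.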
Where is it used?
They are used in language models, recommender systems, image recognition, speech recognition, translation, scientific modeling, and medical imaging. They are strongest when you have enough data, compute, and evaluation discipline to justify their flexibility.
Intuition
How to think about this algorithm
Imagine a model trying to identify whether a photo contains a car. Picture the network as an assembly line of workers, where each group of workers is one layer.
The workers at the very start of the line (the first layer) only look at tiny patches of pixels. They can only recognize basic things like straight edges or simple curves. They pass their findings to the next group of workers.
The second layer of workers looks at those edges and curves and combines them. They might say, "Ah, these curves make a circle!" or "These edges make a rectangle!" They pass this up the chain.
The deeper layers combine those shapes into task-level evidence: wheels, windows, chassis-like structure, and background context. The output layer converts that evidence into class scores or probabilities.
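The last step, turning class scores into probabilities, can be sketched with a softmax function. Softmax is an assumption here: it is one common choice for multi-class outputs, and the text above does not name a specific function. The scores are made-up numbers.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes, e.g. car / truck / background.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)

for label, p in zip(["car", "truck", "background"], probs):
    print(f"{label}: {p:.3f}")
```

The class with the highest score receives the highest probability, and the ordering of the classes is preserved.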
The Logic
Mathematical core for neural networks & deep learning
1. The Forward Pass
A basic neural network layer takes the input data ($x$), multiplies it by a set of weights ($W$), adds a bias ($b$), and then passes the result through an activation function ($\sigma$) to introduce non-linearity (so it can learn curves, not just straight lines):
$$a = \sigma(Wx + b)$$
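The layer computation described above (weights times input, plus bias, through an activation) can be sketched in a few lines of plain Python. The function name `dense_forward`, the choice of a sigmoid activation, and all numbers are illustrative assumptions, not part of the original example.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(x, W, b, activation=sigmoid):
    """Compute activation(W x + b) for one fully connected layer.

    x: list of n inputs
    W: list of m rows, each holding n weights
    b: list of m biases
    """
    outputs = []
    for row, bias in zip(W, b):
        z = sum(w * xi for w, xi in zip(row, x)) + bias  # pre-activation
        outputs.append(activation(z))
    return outputs

# Hypothetical values, picked so both pre-activations are exactly 0.
x = [1.0, 2.0]
W = [[0.5, -0.25],  # weights of output unit 0
     [0.0, 1.0]]    # weights of output unit 1
b = [0.0, -2.0]

a = dense_forward(x, W, b)
print(a)  # [0.5, 0.5]
```

In PyTorch the same computation is what `nn.Linear` followed by an activation module performs, applied to a whole batch at once.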
2. Backpropagation (How it learns)
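When the network makes a prediction, the loss measures the error. Backpropagation applies the chain rule to compute gradients of that loss with respect to every trainable weight. For a single layer with pre-activation $z = Wx + b$ and output $a = \sigma(z)$:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$$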
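To make the chain rule concrete, here is a minimal sketch for a single sigmoid neuron with a squared-error loss. Both the loss and the activation are assumptions chosen because their derivatives are short. The hand-derived gradient is compared against a finite-difference estimate as an independent check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x, y):
    """Squared error of one sigmoid neuron: L = (a - y)^2."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def grad_w_chain_rule(w, b, x, y):
    """dL/dw = dL/da * da/dz * dz/dw, with each factor computed separately."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)  # derivative of (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)  # derivative of the sigmoid
    dz_dw = x              # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

def grad_w_numeric(w, b, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    return (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)

# Hypothetical parameter and data values.
w, b, x, y = 0.3, -0.1, 2.0, 1.0
analytic = grad_w_chain_rule(w, b, x, y)
numeric = grad_w_numeric(w, b, x, y)
print(f"chain rule: {analytic:.6f}  finite difference: {numeric:.6f}")
```

The two values should agree to several decimal places. In a framework such as PyTorch, `loss.backward()` performs the same bookkeeping automatically for every parameter.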
An optimizer such as stochastic gradient descent or Adam then updates the weights in a direction that tends to reduce the loss on future batches.
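The update itself can be sketched as plain gradient descent, where each parameter moves a small step against its gradient. This is a simplified illustration on a toy loss with a known minimum, not a neural network, and it is not how Adam works: Adam additionally rescales each step using running averages of past gradients. The names `sgd_step`, `loss`, and `gradients` are hypothetical.

```python
def sgd_step(params, grads, lr):
    """Plain gradient descent update: p <- p - lr * dL/dp for each parameter."""
    return [p - lr * g for p, g in zip(params, grads)]

def loss(params):
    """Toy loss with its minimum at (3, -1)."""
    w0, w1 = params
    return (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2

def gradients(params):
    """Analytic gradient of the toy loss."""
    w0, w1 = params
    return [2.0 * (w0 - 3.0), 2.0 * (w1 + 1.0)]

params = [0.0, 0.0]
history = [loss(params)]
for _ in range(50):
    params = sgd_step(params, gradients(params), lr=0.1)
    history.append(loss(params))

print(f"loss before: {history[0]:.4f}  loss after: {history[-1]:.2e}")
print(f"parameters: {params}")
```

With this learning rate the loss shrinks at every step and the parameters approach (3, -1). A learning rate that is too large would overshoot instead, which is one reason the step size needs tuning.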
Code Example
neural_networks_&_deep_learning.py · pytorch example
import torch
import torch.nn as nn
import torch.optim as optim

# Build a simple Neural Network with PyTorch
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 32),  # Input layer
            nn.ReLU(),          # Activation function
            nn.Linear(32, 16),  # Hidden layer
            nn.ReLU(),          # Activation function
            nn.Linear(16, 1),   # Output layer
            nn.Sigmoid()        # Squish output between 0 and 1
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNetwork()
criterion = nn.BCELoss()  # How we measure mistakes
optimizer = optim.Adam(model.parameters(), lr=0.01)  # How we update weights

# Fake data: 8 examples, 10 features each
x_batch = torch.randn(8, 10)
y_batch = torch.empty(8, 1).random_(2)

# One training step
optimizer.zero_grad()                   # Clear previous gradients
predictions = model(x_batch)            # 1. Forward pass
loss = criterion(predictions, y_batch)  # 2. Compute loss
loss.backward()                         # 3. Backpropagate gradients
optimizer.step()                        # 4. Update weights

print(f"Current Error (Loss): {loss.item():.4f}")
Strengths
They can learn useful feature representations directly from data instead of relying only on hand-engineered features.
They scale well with data and compute, especially on GPUs and other accelerator hardware.
They support many architectures: dense networks, CNNs, RNNs, transformers, autoencoders, and graph neural networks.
Limitations
They can be difficult to interpret, especially when many layers and millions of parameters interact.
They often need more data and tuning than simpler models. On small tabular datasets, tree ensembles are frequently stronger baselines.
Training and serving large networks can be expensive and require careful monitoring for overfitting, drift, and failure modes.
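One simple way to monitor for overfitting is to track the loss on held-out validation data and stop once it no longer improves. The sketch below assumes an early-stopping rule with a `patience` parameter. The rule, the function name, and the loss values are illustrative assumptions, not something the text above prescribes.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return None

# Made-up validation losses: improving until epoch 3, then rising.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.65, 0.70]
stop_at = early_stopping_epoch(val_losses, patience=3)
print(stop_at)  # 6
```

A rising validation loss while the training loss keeps falling is the usual sign that the model has started to memorize the training set.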
Key Assumptions
Scope conditions and interpretation notes
1. A sufficiently large, representative training dataset is available to avoid overfitting.
2. Optimization via gradient descent converges to a satisfactory local minimum.
References
Books and papers for deeper study
Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. Cambridge, MA: MIT Press.