Transformers

Difficulty: Expert

Reading Time: 30 min

Track: Deep Learning

Sequence models built around attention, letting each token form a context-dependent weighted mixture of other token representations.

Prerequisites

Neural Networks & Deep Learning, Natural Language Processing

TL;DR

A transformer replaces recurrence with self-attention: each token builds a new representation as a weighted mixture of every token's value vector.
Scores come from query–key dot products, divided by $\sqrt{d_k}$ and normalized by softmax: $\text{softmax}(QK^T/\sqrt{d_k})V$.
Multi-head attention runs several attention maps in parallel so different heads capture different relationships (pronouns, syntax, position).
All tokens are processed simultaneously, making transformers highly parallelizable — but they need positional encodings to recover word order.
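The cost is quadratic in sequence length ($O(N^2)$), which motivates sparse and linear attention variants that approximate the full matrix, and exact IO-aware kernels such as FlashAttention that avoid materializing it in memory.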

Learning Objectives

Explain the limitations of recurrent neural networks (RNNs) and how attention addresses them
Formulate the Scaled Dot-Product Attention equation and define Query, Key, and Value vectors
Describe Multi-Head Attention and its utility in attending to information at different positions
Explain the role of positional encodings in preserving token sequence order

Intuition

How to think conceptually about this topic

Imagine reading the sentence: "The animal did not cross the street because it was tired."

The representation for "it" should borrow heavily from "animal." If the sentence ended with "because it was too wide," the useful source would shift toward "street." Attention makes that borrowing explicit: each token scores itself against every token in the sequence, normalizes the scores into weights that sum to 1, and takes the weighted average of the value vectors.
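The score, normalize, average recipe above can be traced by hand on a toy case. The sketch below uses plain Python; the three scores and the 2-dimensional value vectors are made-up numbers chosen only for illustration, not output from a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the query token "it" against three tokens.
tokens = ["animal", "street", "tired"]
scores = [3.0, 1.0, 0.5]

# Hypothetical 2-d value vectors, one per token.
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = softmax(scores)

# New representation for "it": the weighted average of the value vectors.
mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]

for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
print("mixed representation:", [round(m, 3) for m in mixed])
```

Because softmax is exponential, the two-point score gap between "animal" and "street" turns into a much larger gap in weight, so the mixed vector sits close to the value vector for "animal".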

Multi-head attention repeats this process in parallel. One head can track pronouns, another can track syntactic roles, and another can track nearby phrase structure. The model learns these heads from data rather than being given grammar rules.
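The parallel heads described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `attention`, `random_matrix`, `multi_head_attention`) are hypothetical, and the projection matrices are random stand-ins for weights that a real model would learn.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must split evenly across heads"
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head gets its own (here random) query/key/value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        out, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
        head_weights.append(weights)
    # Concatenate the heads along the feature axis, then mix them with W_o.
    concat = [[val for head in head_outputs for val in head[i]]
              for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), head_weights

rng = random.Random(0)
x = random_matrix(3, 8, rng)  # 3 tokens, d_model = 8
output, head_weights = multi_head_attention(x, num_heads=2, rng=rng)
print("output shape:", len(output), "x", len(output[0]))
print("attention maps:", len(head_weights))
```

Each head works in a smaller `d_head`-dimensional subspace, so two heads of width 4 cost roughly the same as one head of width 8 while producing two independent attention maps.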

In Depth

Detailed explanations, contexts, and details

Transformers replace recurrence with attention. Instead of updating a hidden state one token at a time, a Transformer layer lets each token compare itself with the other tokens and build a new representation from the most relevant ones.

For encoders, tokens can attend bidirectionally across the whole input. For decoder language models, causal masking prevents a token from attending to future tokens. In both cases, self-attention creates direct paths between distant positions, while feed-forward layers and residual connections refine the representation.
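Causal masking is usually implemented by setting the scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. The following is a small plain-Python sketch of that idea; the function name `causal_softmax` and the all-zero score matrix are illustrative assumptions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, which softmax maps to weight 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical scores, each token spreads its attention evenly
# over itself and the tokens before it.
uniform_scores = [[0.0] * 4 for _ in range(4)]
causal_weights = causal_softmax(uniform_scores)
for row in causal_weights:
    print([round(w, 2) for w in row])
```

An encoder simply skips the mask, which leaves every row free to put weight on any position.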

The key tradeoff is compute: full attention forms an $N\times N$ matrix for a sequence of length $N$. That gives every token a direct path to every other token, but compute and memory both grow quadratically with context length.
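A quick back-of-the-envelope calculation makes the quadratic growth concrete. The sketch below counts only the bytes needed to store one $N\times N$ attention matrix for a single head and a single sequence, assuming 16-bit values; the function name and the chosen sequence lengths are illustrative.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes to store one N x N attention matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_element

for n in (1_024, 8_192, 65_536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"N={n:>6}: {mib:>8,.0f} MiB")
```

Doubling the sequence length quadruples the storage: 2 MiB at 1,024 tokens becomes 128 MiB at 8,192 and 8,192 MiB at 65,536, before multiplying by the number of heads, layers, and sequences in the batch.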

How It Compares

Sequence-modeling architectures

Dimension	RNN / LSTM	1D CNN	Transformer
Parallel across sequence	No — sequential per timestep	Yes	Yes
Path length between distant tokens	$O(N)$	$O(N/k)$ with contiguous kernels, $O(\log_k N)$ with dilation	$O(1)$ (direct attention)
Compute per layer	$O(N d^2)$	$O(N k d^2)$	$O(N^2 d)$
Very long context	Struggles (vanishing gradients)	Limited receptive field	Strong, but quadratic cost
Order awareness	Built-in via recurrence	Built-in via locality	Needs positional encodings

Takeaway: Transformers trade a quadratic compute cost for $O(1)$ path length and full parallelism, which is why they overtook RNNs once data and compute became abundant.
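The "Compute per layer" row of the table can be turned into rough numbers. The functions below are a sketch that keeps only the dominant term from each table cell and drops all constant factors, so the results are relative orders of magnitude rather than FLOP counts; the function names and the choice of $d = 1024$ are assumptions for illustration.

```python
def recurrent_cost(n, d):
    """Dominant per-layer term for an RNN/LSTM: O(N d^2)."""
    return n * d * d

def conv_cost(n, d, k):
    """Dominant per-layer term for a 1D CNN with kernel width k: O(N k d^2)."""
    return n * k * d * d

def attention_cost(n, d):
    """Dominant per-layer term for self-attention: O(N^2 d)."""
    return n * n * d

d = 1024
for n in (256, 1024, 8192):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)
    print(f"N={n:>5}, d={d}: attention / recurrent = {ratio:.2f}")
```

The ratio simplifies to $N/d$, so under these simplified terms self-attention is the cheaper layer while the sequence is shorter than the model width and the more expensive one once $N$ exceeds $d$.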

When to Use It

Reach for this when

You have enough data and compute to pretrain or fine-tune a large model and need to capture long-range dependencies.
The task benefits from modeling global context — translation, summarization, code, or protein structure.
You want one architecture that transfers across modalities (text, vision, audio).

Avoid it when

Sequences are extremely long and you are memory/latency constrained without access to efficient-attention variants — the $O(N^2)$ cost dominates.
You have little data and no pretrained model — a transformer's weak inductive bias makes it data-hungry; a CNN, RNN, or classical model may win.
The problem is simple or low-dimensional, where a lighter model is faster and just as accurate.

Rules of thumb

Prefer fine-tuning a pretrained transformer over training one from scratch whenever you can.
For long documents, reach for sparse / linear / flash attention before naively growing the context window.
Always inject positional information (sinusoidal, learned, or rotary/RoPE).
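Of the options named in the last rule of thumb, the fixed sinusoidal scheme from the original Transformer paper is the simplest to write down: even feature indices use a sine, odd indices a cosine, with wavelengths that grow geometrically across the feature dimension. The plain-Python sketch below follows that formula; the function name and the small table size are illustrative choices.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of fixed sinusoidal position vectors."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even feature index
            angle = pos / (base ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

The table is added element-wise to the token embeddings before the first layer, giving each position a distinct signature that attention can pick up on.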

Implementation

Reference code implementation

Python

model_fitting.py

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # make the random demo reproducible

# A demonstration of scaled dot-product self-attention
def self_attention(x, d_k=8):
    # x has shape [seq_len, d_model]; in the demo below that is [3, 8]
    d_model = x.shape[-1]

    # Projection matrices, randomly initialized here (learned in a real model)
    W_q = torch.randn(d_model, d_k)
    W_k = torch.randn(d_model, d_k)
    W_v = torch.randn(d_model, d_model)

    # 1. Project into Query, Key, Value spaces
    Q = torch.matmul(x, W_q)  # Shape: [seq_len, d_k]
    K = torch.matmul(x, W_k)  # Shape: [seq_len, d_k]
    V = torch.matmul(x, W_v)  # Shape: [seq_len, d_model]

    # 2. Compute the similarity matrix [seq_len, seq_len], divided by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # 3. Softmax over the key axis, so each row of weights sums to 1
    attention_weights = F.softmax(scores, dim=-1)

    # 4. Weighted average of the Value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Fake sequence of 3 tokens (e.g., "AI is cool"), model dim is 8
x_sequence = torch.randn(3, 8)
output_vectors, weights = self_attention(x_sequence)

print("Attention weight matrix:")
print(weights.tolist())  # Row i shows how much token i attends to each token
print("Output representation shape:", output_vectors.shape)
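The division by the square root of `d_k` in step 2 has a simple statistical motivation: if query and key components are independent with unit variance, their dot product has variance close to `d_k`, which would push softmax toward near one-hot outputs as `d_k` grows. The dependency-free sketch below checks this empirically; the function name, sample count, and `d_k = 64` are illustrative assumptions.

```python
import math
import random

def dot_product_variance(d_k, scale, samples=2000, seed=0):
    """Empirical variance of scale * (q . k) for random unit-variance q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return sum((d - mean) ** 2 for d in dots) / samples

d_k = 64
raw_var = dot_product_variance(d_k, scale=1.0)
scaled_var = dot_product_variance(d_k, scale=1.0 / math.sqrt(d_k))
print(f"unscaled variance ~ {raw_var:.1f} (theory: {d_k})")
print(f"scaled variance   ~ {scaled_var:.2f} (theory: 1)")
```

Keeping the scores at roughly unit variance keeps softmax in a range where its gradients stay useful, regardless of the head dimension.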

Strengths & Advantages

Highly parallelizable: All positions in a sequence are processed at once during training, so the work maps well onto GPUs and other accelerators.
No long-range decay: Directly links tokens over thousands of words without information decay.
Extremely versatile: Scales well beyond text to vision (Vision Transformers), audio, and biological structures (AlphaFold).

Limitations & Drawbacks

Quadratic compute complexity: Calculating the attention matrix scales at $\mathcal{O}(N^2)$ with sequence length $N$ , limiting context size.
No built-in order: Because self-attention treats its input as an unordered set of tokens, the model needs explicit positional encodings to know word order.
Massive dataset requirement: Lacks the locality assumptions (inductive biases) of CNNs and the sequential bias of RNNs, so training from scratch typically requires very large datasets.
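The "no built-in order" limitation can be checked directly: without positional information, reordering the input tokens only reorders the outputs in the same way. The plain-Python sketch below uses identity projections (queries, keys, and values all equal to the input) to keep the demonstration short; that simplification, the function name, and the toy vectors are assumptions for illustration.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d_k = len(x[0])
    outputs = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, x))
                        for i in range(d_k)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])  # same tokens, reversed order

# Undoing the reversal recovers the original outputs (up to float rounding).
same = all(math.isclose(a, b, abs_tol=1e-9)
           for row_f, row_b in zip(forward, backward[::-1])
           for a, b in zip(row_f, row_b))
print("outputs identical after un-reversing:", same)
```

Each token receives the same representation whichever position it occupies, which is why order has to be injected through positional encodings.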

Real-World Case Studies

Attention Is All You Need — translation without recurrence

Natural language processing

Scenario

In 2017, the strongest machine-translation systems were deep LSTMs/GRUs with attention bolted on. They were slow to train because recurrence forbids parallelizing across sequence positions — each timestep waits for the previous one.

Approach

The Transformer removed recurrence entirely, relying only on multi-head self-attention plus position-wise feed-forward layers and residual connections. This let the whole sequence be processed in parallel during training.

Outcome

On the WMT 2014 English→German benchmark the Transformer reached 28.4 BLEU, a new state of the art, while training in a small fraction of the time and cost of the recurrent competitors. The architecture went on to become the foundation of BERT, GPT, and essentially every modern large language model.

Source: Attention Is All You Need — Vaswani, A. et al.

Common Misconceptions

Misconception: Transformers process tokens sequentially like LSTMs.

Correction: During training and encoding, a Transformer processes all tokens in a sequence in parallel, which makes it much faster to train on GPUs than an LSTM. Only autoregressive generation is sequential, because each new token depends on the tokens already produced.

Misconception: Attention weights always represent exact causal relationships.

Correction: Attention weights describe how a token's context is mixed and often correlate with meaningful relationships, but they are not causal explanations or proof of the model's reasoning.

References & Further Reading

Attention Is All You Need (paper). Vaswani, A. et al.
The Illustrated Transformer (blog post). Alammar, J.

Related Topics

Neural Networks & Deep Learning

Natural Language Processing

Large Language Models

Vision Transformers (ViT)