Transformers
Transformers replace recurrence with attention. Instead of updating a hidden state one token at a time, a Transformer layer lets each token compare itself with the other tokens and build a new representation from the most relevant ones.
For encoders, tokens can attend bidirectionally across the whole input. For decoder language models, causal masking prevents a token from attending to future tokens. In both cases, self-attention creates direct paths between distant positions, while feed-forward layers and residual connections refine the representation.
The key tradeoff is compute: full attention forms an $n \times n$ matrix of scores for a sequence of length $n$. That gives powerful long-range interactions, but the cost grows quadratically with context length.
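To make the quadratic growth concrete, the sketch below counts the entries of one full attention matrix and the memory needed to hold it. The helper name `attention_matrix_cost` and the figure of 2 bytes per score (16-bit floats) are assumptions for illustration, and the count covers a single head in a single layer only.

```python
def attention_matrix_cost(seq_len, bytes_per_score=2):
    """Return (number of scores, bytes) for one seq_len x seq_len attention matrix."""
    entries = seq_len * seq_len
    return entries, entries * bytes_per_score


for n in (1_024, 2_048, 4_096):
    entries, num_bytes = attention_matrix_cost(n)
    print(f"n={n}: {entries:,} scores, {num_bytes / 2**20:.0f} MiB")

# Doubling n from 1,024 to 2,048 quadruples the count:
# 1,048,576 scores become 4,194,304 scores.
```

Each doubling of the context length multiplies the matrix size by four, which is the quadratic cost described above.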
Intuition
How to think about this algorithm
Imagine reading the sentence: "The animal did not cross the street because it was tired."
The representation for "it" should borrow heavily from "animal." If the sentence ended with "because it was too wide," the useful source would shift toward "street." Attention makes that borrowing explicit: each token scores itself against the tokens in the sequence, normalizes those scores into weights, and takes a weighted average of the corresponding value vectors.
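The borrowing step can be sketched with toy numbers. The scores and the 2-dimensional value vectors below are invented for illustration (a trained model would compute them), and only three candidate tokens are shown.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of the query token "it" against three tokens.
tokens = ["animal", "street", "tired"]
scores = [2.0, 0.5, 1.0]

# Hypothetical value vectors, one per token.
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights = softmax(scores)

# New representation for "it": the weighted average of the value vectors.
new_it = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
print("new representation for 'it':", new_it)
```

Because "animal" has the highest score, it receives the largest weight, and the new vector for "it" lies closest to the value vector of "animal".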
Multi-head attention repeats this process in parallel. One head can track pronouns, another can track syntactic roles, and another can track nearby phrase structure. The model learns these heads from data rather than being given grammar rules.
The Logic
Mathematical core for transformers
1. Query, key, and value projections
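Given token representations $X \in \mathbb{R}^{n \times d_{\text{model}}}$, a layer projects them into queries, keys, and values with learned weight matrices $W_Q$, $W_K$, and $W_V$:
$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$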
2. Scaled dot-product attention
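Compatibility scores come from dot products between queries and keys. Dividing by $\sqrt{d_k}$, where $d_k$ is the key dimension, keeps the softmax from saturating when vector dimensions are large:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$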
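A small experiment can illustrate why the scaling matters. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so raw scores spread out as the dimension grows and the softmax concentrates on one key. The sketch below assumes Gaussian random vectors as stand-ins for real queries and keys; the helper `max_weight` and its parameters are hypothetical.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight(d_k, scale, n_keys=8, seed=0):
    """Largest attention weight one random query assigns over n_keys random keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d_k)]
    keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return max(softmax(scores))


for d_k in (4, 64, 512):
    print(d_k, max_weight(d_k, scale=False), max_weight(d_k, scale=True))
```

For the same query and keys, the unscaled softmax is never flatter than the scaled one, and the gap typically widens as `d_k` grows. A weight near 1 means the softmax has saturated, which leaves very small gradients for the other keys.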
For decoder-only generation, a mask sets scores for future positions to $-\infty$ before the softmax, so those positions receive zero weight.
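A minimal sketch of causal masking on a small score matrix, written with plain Python lists so each step is visible. The function name and the example scores are assumptions for illustration; row `i` holds the scores of query position `i` against every key position.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Set scores for future positions (column j > row i) to -inf, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]


scores = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
    [0.5, 0.5, 0.5],
]
for row in causal_attention_weights(scores):
    print(row)
```

The first row becomes `[1.0, 0.0, 0.0]`: the first token can attend only to itself. Every entry above the diagonal is exactly zero, so no position receives information from later positions.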
3. Multi-Head Attention
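To model several relationships at once, attention runs in $h$ learned subspaces. Each head $i$ applies its own projections to the token representations $X$, and the head outputs are concatenated and mixed by an output matrix $W_O$:
$$\text{head}_i = \text{Attention}\!\left(XW_Q^{(i)},\, XW_K^{(i)},\, XW_V^{(i)}\right), \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O$$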
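The multi-head computation can be sketched in plain Python. This is an illustrative sketch, not an optimized implementation: the projection matrices are random stand-ins for learned parameters, all helper names are hypothetical, and each head is assumed to use dimension `d_model // num_heads`.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections (random here, learned in practice).
        w_q = rand_matrix(d_model, d_head, rng)
        w_k = rand_matrix(d_model, d_head, rng)
        w_v = rand_matrix(d_model, d_head, rng)
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate head outputs along the feature axis: [seq_len, d_model].
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    w_o = rand_matrix(d_model, d_model, rng)  # output projection
    return matmul(concat, w_o)


rng = random.Random(0)
x = rand_matrix(3, 8, rng)  # 3 tokens, d_model = 8
out = multi_head_attention(x, num_heads=2, rng=rng)
print(len(out), len(out[0]))  # 3 8
```

The output has the same shape as the input, which is what lets attention layers be stacked with residual connections.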
Code Example
transformers.py · pytorch example
```python
import torch
import torch.nn.functional as F


# A demonstration of scaled dot-product self-attention
def self_attention(x, d_k=8):
    # x has shape [seq_len, d_model], e.g. [3, 8]
    d_model = x.shape[-1]

    # Projection matrices (random here; learned in a real model)
    W_q = torch.randn(d_model, d_k)
    W_k = torch.randn(d_model, d_k)
    W_v = torch.randn(d_model, d_model)

    # 1. Project into query, key, and value spaces
    Q = torch.matmul(x, W_q)  # Shape: [seq_len, d_k]
    K = torch.matmul(x, W_k)  # Shape: [seq_len, d_k]
    V = torch.matmul(x, W_v)  # Shape: [seq_len, d_model]

    # 2. Compute the scaled similarity matrix [seq_len, seq_len]
    scores = torch.matmul(Q, K.transpose(0, 1)) / (d_k ** 0.5)

    # 3. Normalize each row into attention weights
    attention_weights = F.softmax(scores, dim=-1)

    # 4. Take the weighted average of the value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights


# Fake sequence of 3 tokens (e.g., "AI is cool"), model dim is 8
x_sequence = torch.randn(3, 8)
output_vectors, weights = self_attention(x_sequence)

print("Attention weight matrix:")
print(weights.tolist())  # Row i shows how much token i attends to each token
print("Output representation shape:", output_vectors.shape)
```
Strengths
Highly parallelizable: Processes all positions of a sequence at once during training, which maps well onto GPUs and other parallel hardware.
No long-range decay: Directly links any two tokens in the context window, so information does not have to survive many recurrent steps.
Extremely versatile: Scales well beyond text to vision (Vision Transformers), audio, and biological structures (AlphaFold).
Limitations
Quadratic compute complexity: Calculating the attention matrix scales as $O(n^2)$ with sequence length $n$, limiting context size.
No built-in order: Because attention processes all tokens at once, the model needs explicit positional encodings (fixed or learned) to know word order.
Massive dataset requirement: Lacks the spatial inductive biases of CNNs, such as locality, so training from scratch typically requires very large datasets.
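As one illustration of the positional encodings mentioned under "No built-in order", the sketch below builds fixed sinusoidal encodings in the style of Vaswani et al. (2017). The function name is hypothetical, and learned position embeddings are another common choice. In that scheme the resulting vectors are added to the token embeddings before the first layer.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position vectors."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Even index i uses sin, the following odd index uses cos,
            # both with wavelength controlled by 10000^(i / d_model).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = sinusoidal_positional_encoding(seq_len=4, d_model=4)
for pos, vec in enumerate(pe):
    print(pos, [round(v, 3) for v in vec])
```

Each position gets a distinct vector, so two otherwise identical tokens at different positions produce different inputs to attention.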
Key Assumptions
Scope conditions and interpretation notes
1. Contextual relationships can be represented as scaled dot-product attention weights.
2. Sequence token order is encoded via positional encodings.
References
Books and papers for deeper study
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017) 'Attention is all you need', in Advances in Neural Information Processing Systems (NeurIPS), pp. 5998-6008.