Transformers

Transformers replace recurrence with attention. Instead of updating a hidden state one token at a time, a Transformer layer lets each token compare itself with the other tokens and build a new representation from the most relevant ones.

For encoders, tokens can attend bidirectionally across the whole input. For decoder language models, causal masking prevents a token from attending to future tokens. In both cases, self-attention creates direct paths between distant positions, while feed-forward layers and residual connections refine the representation.

The key tradeoff is compute: full attention forms an $N \times N$ matrix for a sequence of length $N$. That gives powerful long-range interactions, but the cost grows quadratically with context length.
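To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It counts the entries of one attention score matrix and estimates its memory for a single head in a single layer. The function name `attention_matrix_cost` is hypothetical, and the 2 bytes per entry (16-bit floats) is an assumption made for illustration.

```python
def attention_matrix_cost(seq_len, bytes_per_entry=2):
    """Return (entries, bytes) for one seq_len x seq_len attention matrix.

    bytes_per_entry=2 is an assumption (16-bit floats); real systems vary.
    """
    entries = seq_len * seq_len
    return entries, entries * bytes_per_entry


for n in (1_000, 2_000, 4_000):
    entries, size = attention_matrix_cost(n)
    print(f"N={n:>5}: {entries:>12,} entries, {size / 1e6:8.1f} MB")
```

Each doubling of the sequence length quadruples the number of entries, which is the quadratic cost described above.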

Best use

Highly parallelizable: Processes sequences all at once, allowing training across thousands of GPUs.

Watch out for

Quadratic compute complexity: Calculating the attention matrix scales as $\mathcal{O}(N^2)$ with sequence length $N$, limiting context size.

Intuition

How to think about this algorithm

Imagine reading the sentence: "The animal did not cross the street because it was tired."

The representation for "it" should borrow heavily from "animal." If the sentence ended with "because it was too wide," the useful source would shift toward "street." Attention makes that borrowing explicit: each token creates scores against other tokens, normalizes them into weights, and averages their value vectors.
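The score, normalize, and average recipe can be sketched in a few lines of plain Python. The scores and two-dimensional value vectors below are made-up numbers chosen for illustration, not outputs of a trained model, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(weights, vectors):
    """Average the vectors, weighting each one by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Hypothetical scores for the query "it" against three other tokens.
tokens = ["animal", "street", "tired"]
scores = [2.0, 0.2, 1.5]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy value vectors

weights = softmax(scores)
new_it = weighted_average(weights, values)

for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.2f}")
print("new representation for 'it':", [round(x, 2) for x in new_it])
```

Because "animal" has the highest score, it receives the largest weight and contributes most to the new vector for "it".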

Multi-head attention repeats this process in parallel. One head can track pronouns, another can track syntactic roles, and another can track nearby phrase structure. The model learns these heads from data rather than being given grammar rules.

Interactive Diagram: Transformers Attention Maps

[Interactive diagram: attention weights from a selected query token to the other tokens in the sentence. With the query "it" selected, the displayed weights are animal 48%, tired 28%, street 8%.]

Key Insight: Self-attention allows tokens to build contextual representations by computing vector similarity with all other tokens, regardless of distance.

The Logic

Mathematical core for transformers

1. Query, key, and value projections

Given token representations $X \in \mathbb{R}^{N \times d_{\text{model}}}$, a layer projects them into queries, keys, and values:

$$Q = X W_Q \quad \text{(Query)}$$

$$K = X W_K \quad \text{(Key)}$$

$$V = X W_V \quad \text{(Value)}$$
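As a shape check, the three projections can be sketched with NumPy. The sizes (N = 4, d_model = 8, d_k = d_v = 4) and the random weight matrices are illustrative assumptions; in a trained model the projection weights are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d_model, d_k, d_v = 4, 8, 4, 4  # illustrative sizes, not from any real model

X = rng.normal(size=(N, d_model))      # one row per token
W_Q = rng.normal(size=(d_model, d_k))  # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q  # (N, d_k): one query vector per token
K = X @ W_K  # (N, d_k): one key vector per token
V = X @ W_V  # (N, d_v): one value vector per token

print(Q.shape, K.shape, V.shape)
```

Queries and keys must share the dimension d_k so their dot products are defined; the value dimension can differ.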

2. Scaled dot-product attention

Compatibility scores come from dot products between queries and keys. Dividing by $\sqrt{d_k}$ keeps the softmax from saturating when vector dimensions are large:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V$$

For decoder-only generation, a mask sets scores for future positions to $-\infty$ before the softmax.
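The formula and the causal mask can be sketched together in NumPy. This is a minimal illustration for self-attention, where queries and keys come from the same sequence. The function name, the `causal` flag, and the random inputs are assumptions made for this example.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (N, N) compatibility scores

    if causal:
        # Position i may only attend to positions j <= i.
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; subtracting the row max keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = scaled_dot_product_attention(Q, K, V, causal=True)
print(np.round(w, 2))  # entries above the diagonal are zero
```

Setting masked scores to negative infinity makes their softmax weight exactly zero, so each row still sums to 1 over the allowed positions.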

3. Multi-Head Attention

To model several relationships at once, attention runs in $h$ learned subspaces:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
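A minimal NumPy sketch of these two equations for self-attention, where the queries, keys, and values are all derived from the same input X. The per-head size d_head = d_model / h is a common choice but an assumption here, and the random matrices stand in for learned weights.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention without masking."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists holding one (d_model, d_head) matrix per head."""
    heads = [
        attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    # Concatenate along the feature axis, then mix heads with W_O.
    return np.concatenate(heads, axis=-1) @ W_O


rng = np.random.default_rng(0)
N, d_model, h = 5, 8, 2
d_head = d_model // h  # assumption: split d_model evenly across heads

X = rng.normal(size=(N, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 8)
```

Each head computes its own attention pattern over the same tokens, and the output projection combines them back into one vector per token.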

Code Example

transformers.py · pytorch example

import torch
import torch.nn.functional as F


# A demonstration of scaled dot-product self-attention
def self_attention(x, d_k=8):
    # x has shape [seq_len, d_model], e.g. [3, 8]
    d_model = x.shape[-1]

    # Linear projection matrices (random here; learned in a real model)
    W_q = torch.randn(d_model, d_k)
    W_k = torch.randn(d_model, d_k)
    W_v = torch.randn(d_model, d_model)

    # 1. Project into query, key, and value spaces
    Q = x @ W_q  # Shape: [seq_len, d_k]
    K = x @ W_k  # Shape: [seq_len, d_k]
    V = x @ W_v  # Shape: [seq_len, d_model]

    # 2. Compute the scaled similarity matrix, shape [seq_len, seq_len]
    scores = Q @ K.transpose(0, 1) / (d_k ** 0.5)

    # 3. Softmax over keys so each row of weights sums to 1
    attention_weights = F.softmax(scores, dim=-1)

    # 4. Weighted sum of the value vectors
    output = attention_weights @ V
    return output, attention_weights


# Fake sequence of 3 tokens (e.g., "AI is cool"), model dim is 8
torch.manual_seed(0)  # make the random example reproducible
x_sequence = torch.randn(3, 8)
output_vectors, weights = self_attention(x_sequence)

print("Attention weight matrix:")
print(weights.tolist())  # Row i shows how much token i attends to each token
print("Output representation shape:", output_vectors.shape)

Strengths

  • Highly parallelizable: Processes sequences all at once, allowing training across thousands of GPUs.

  • No long-range decay: Any two tokens in the context are connected by a single attention step, so information does not have to survive a long chain of recurrent updates.

  • Extremely versatile: Extends well beyond text to vision (Vision Transformers), audio, and protein structure prediction (AlphaFold).

Limitations

  • Quadratic compute complexity: Calculating the attention matrix scales as $\mathcal{O}(N^2)$ with sequence length $N$, limiting context size.

  • No built-in order: Self-attention on its own treats the input as an unordered set of tokens, so positional encodings (fixed or learned) must be added for the model to know word order.

  • Massive dataset requirement: Lacks the spatial inductive biases of CNNs (such as locality), so training from scratch typically requires very large datasets.
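The "no built-in order" limitation can be checked directly. The NumPy sketch below runs unmasked single-head self-attention, with no positional information, on a sequence and on a shuffled copy of it. The sizes, random weights, and permutation are illustrative assumptions.

```python
import numpy as np


def self_attention(X, W_Q, W_K, W_V):
    """Single-head, unmasked self-attention with no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))  # 4 tokens, d_model = 6 (illustrative)
W_Q, W_K, W_V = (rng.normal(size=(6, 6)) for _ in range(3))

perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector wherever it sits in the sequence.
print(np.allclose(out[perm], out_shuffled))
```

Shuffling the input only shuffles the output rows in the same way, so without positional encodings the layer cannot tell one word order from another.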

Key Assumptions

Scope conditions and interpretation notes

  • 1. Contextual relationships can be represented as scaled dot-product attention weights.

  • 2. Token order is supplied to the model through positional encodings.
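The second assumption does not say which encoding is used. One common choice is the fixed sinusoidal scheme from Vaswani et al. (2017), which is added to the token embeddings before the first layer. The sketch below is a minimal plain-Python version with a hypothetical function name; learned position embeddings are an equally valid alternative.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal encodings.

    Even columns use sine and odd columns use cosine, with wavelengths
    that grow geometrically across the feature dimension.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((i // 2) * 2 / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Every position receives a distinct vector, so adding it to the token embedding lets otherwise order-blind attention distinguish positions.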

References

Books and papers for deeper study

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017) 'Attention is all you need', in Advances in Neural Information Processing Systems (NeurIPS), pp. 5998-6008.