Large Language Models

Difficulty: Expert

Reading Time: 30 min

Track: Deep Learning

Decoder-only Transformer language models trained to predict the next token and then adapted to follow instructions.

Prerequisites

Transformers

Deep Learning, Module 12 of 19

TL;DR

An LLM is an autoregressive model that predicts the next token given all previous tokens: $p(x_{1:n}) = \prod_{t=1}^{n} p(x_t \mid x_{<t})$.
Training minimizes the cross-entropy (negative log-likelihood) of the next token; the exponentiated loss is perplexity, the model’s effective branching factor.
Performance improves predictably with scale (more parameters, data, and compute), following empirical power-law scaling laws such as $L(N) \propto N^{-\alpha}$.
Emergent abilities (in-context learning, multi-step reasoning, tool use) appear as scale grows, without being explicitly trained for.
The dominant recipe is pretraining on a huge unlabeled corpus, then fine-tuning (instruction tuning + RLHF/DPO) to align behavior with human intent.
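
To make the power-law bullet concrete, the sketch below computes how much the loss would shrink when the parameter count grows by a given factor, assuming $L(N) \propto N^{-\alpha}$ holds exactly. The exponent `alpha = 0.1` is an illustrative assumption rather than a measured value, and `loss_ratio` is a hypothetical helper name.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Return L(k * N) / L(N) for a power law L(N) proportional to N^(-alpha)."""
    # The proportionality constant cancels in the ratio, leaving k^(-alpha).
    return scale_factor ** (-alpha)

alpha = 0.1  # illustrative exponent (assumption), not a fitted value
for k in (2, 10, 100):
    print(f"{k:>4}x parameters -> loss multiplied by {loss_ratio(k, alpha):.3f}")
```

Under this assumed exponent, a 10x increase in parameters cuts the loss by only about 21%, which is why meaningful gains tend to require order-of-magnitude increases in scale.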

Learning Objectives

Explain the autoregressive next-token prediction task
Describe the stages of LLM development: Pretraining, Instruction Tuning, and Preference Optimization (RLHF/DPO)
Explain decoding strategies such as greedy search, top-k, top-p, and temperature sampling
Explain hallucination and context limits, and how retrieval-augmented generation architectures address them

Intuition

How to think conceptually about this topic

An LLM is best understood as a conditional probability model over tokens. If the prompt is "The capital of France is", a well-trained model assigns high probability to "Paris" because that continuation fits its learned representation of language and world facts.
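
The idea of a conditional distribution over next tokens can be sketched with a toy example. The candidate tokens and their logits below are made-up values chosen for illustration, not the output of any real model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for continuations of "The capital of France is"
candidates = ["Paris", "Lyon", "London", "pizza"]
logits = [6.0, 2.0, 1.0, -3.0]

probs = dict(zip(candidates, softmax(logits)))
for token, p in probs.items():
    print(f"p({token!r} | prompt) = {p:.4f}")

# Negative log-likelihood incurred if the true next token is "Paris"
nll = -math.log(probs["Paris"])
print(f"NLL of 'Paris': {nll:.4f}")
```

The more probability the model places on the correct continuation, the smaller this negative log-likelihood becomes, which is exactly what training pushes toward.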

Generation is iterative. The model samples or selects one token, appends it to the context, and repeats. Temperature, top-p, and other decoding settings decide how sharply or broadly the model samples from the next-token distribution.
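
The decoding settings mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names (`filter_top_k_top_p`, `sample_next`) are hypothetical, the logits are toy values, and treating `temperature == 0` as greedy decoding is a common convention rather than something this module defines.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens and/or the smallest prefix whose
    cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick the next token index; temperature == 0 is treated as greedy."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    kept = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(kept)
    weights = [kept[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -2.0]  # toy scores over a 5-token vocabulary
print("greedy:", sample_next(logits, temperature=0))
print("top-k=2:", sample_next(logits, temperature=0.8, top_k=2))
print("top-p=0.9:", sample_next(logits, temperature=0.8, top_p=0.9))
```

Greedy decoding always returns the same index, while top-k and top-p restrict sampling to the most likely tokens before drawing at random.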

This is why LLM behavior can feel both powerful and brittle. The same mechanism can produce useful code, summarize a paper, or confidently continue a false premise if the prompt and context make that continuation likely.

In Depth

Detailed explanations and context

Large Language Models (LLMs) are usually decoder-only Transformers trained with an autoregressive objective: predict the next token from the previous tokens. A token can be a word, part of a word, punctuation mark, or byte-like unit depending on the tokenizer.
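
The autoregressive objective can be written out explicitly. The block below restates the factorization from the TL;DR, the average per-token loss, and perplexity; the symbol $\theta$ for the model parameters and the abbreviation $\mathrm{PPL}$ are notation introduced here for illustration.

$$
p_\theta(x_{1:n}) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t})
$$

$$
L(\theta) = -\frac{1}{n} \sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t}), \qquad \mathrm{PPL} = \exp\big(L(\theta)\big)
$$

A model that assigns uniform probability over $k$ tokens at every step has $L = \log k$ and therefore perplexity $k$, which is why perplexity reads as an effective branching factor.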

Pretraining teaches broad statistical structure from text and code. Instruction tuning then trains the model on examples of useful task-following behavior. Preference optimization methods, including RLHF-style pipelines, can further steer responses toward human preferences such as helpfulness, honesty, and harmlessness.
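
Preference optimization can be illustrated with the DPO loss for a single preference pair. DPO is named in this module but not spelled out, so the sketch follows the commonly published form of the loss; `beta = 0.1` and all log-probabilities are illustrative assumptions, and `dpo_loss` is a hypothetical helper name.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: no preference learned yet, loss = log 2
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raised the chosen response and lowered the rejected one: lower loss
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the preferred response relative to the reference model and lowers the rejected one, with no separate reward model in the loop.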

The central tension is that the training signal rewards plausible next-token distributions, not guaranteed truth. Good systems combine the model with retrieval, tools, evaluation, constraints, and careful prompting when factual accuracy matters.
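
Combining the model with retrieval can be sketched as follows. This is a deliberately crude illustration: word overlap stands in for a real retriever, the three-sentence corpus is hypothetical, and `retrieve` and `build_grounded_prompt` are made-up helper names. The resulting prompt would then be passed to whatever model is in use.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real tokenizer or embedder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query and return the top k."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_grounded_prompt(query, documents, k=1):
    """Place retrieved passages ahead of the question so the model can condition on them."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical mini-corpus for illustration
corpus = [
    "Paris is the capital of France.",
    "The Transformer architecture relies on self-attention.",
    "Perplexity is the exponentiated cross-entropy loss.",
]
prompt = build_grounded_prompt("What is the capital of France?", corpus)
print(prompt)
```

Because the supporting passage sits in the context, the most plausible continuation is also the grounded one, which narrows the gap between plausibility and truth.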

How It Compares

Three stages of building an aligned LLM

Dimension	Pretraining	Supervised Fine-tuning	RLHF
Objective	Next-token cross-entropy (self-supervised NLL)	Next-token cross-entropy on curated prompt-response pairs	Maximize a learned reward (human preference) with a KL penalty to stay near the SFT policy
Data type	Massive unlabeled web text and code (trillions of tokens)	Smaller labeled demonstrations of desired instruction-following	Human preference comparisons (ranked pairs of responses) used to train a reward model
What it changes	Builds broad world knowledge and a general next-token predictor	Teaches the model the format and behavior of following instructions	Shifts the response distribution toward human-preferred (helpful, honest, harmless) outputs
Relative cost	Highest — dominates total compute (huge data + long training runs)	Low — thousands to millions of examples, short training	Moderate — needs human labeling plus reward-model and RL training loops

Takeaway: Pretraining creates raw capability at enormous cost; SFT cheaply teaches instruction-following; RLHF then aligns the model’s output distribution with human preferences. Each stage builds on the previous one rather than replacing it.
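
The RLHF objective in the table, a learned reward with a KL penalty toward the SFT policy, is often written in the following form. The notation is an assumption introduced here: $\pi$ is the policy being trained, $\pi_{\mathrm{SFT}}$ the supervised fine-tuned model, $r_\phi$ the learned reward model, $\mathcal{D}$ the prompt distribution, and $\beta$ the penalty strength.

$$
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r_\phi(x, y) \big] - \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big) \Big]
$$

A larger $\beta$ keeps the policy closer to the SFT model, while a smaller $\beta$ lets it chase reward more aggressively at the risk of exploiting flaws in the reward model.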

When to Use It

Reach for this when

The task is open-ended language work — drafting, summarizing, rewriting, translating, or answering questions — where flexible natural-language input and output are valuable.
You need few-shot or zero-shot adaptation to a new task without collecting a labeled dataset or training a model.
The problem benefits from broad world knowledge or code synthesis, and approximate, reviewable answers are acceptable.
You can pair the model with retrieval, tools, or verification to ground its outputs when accuracy matters.

Avoid it when

You need guaranteed factual correctness or deterministic outputs without a verification/grounding layer — LLMs optimize plausibility, not truth.
The task is a narrow, well-defined prediction problem (e.g. tabular classification) where a small, cheap, auditable model would be more accurate and far less costly.
Latency, cost, or privacy constraints rule out large-model inference, and a smaller specialized model suffices.
Decisions are high-stakes and unsupervised (medical, legal, financial) where hallucinations or bias could cause real harm without human review.

Rules of thumb

Use low temperature (or greedy decoding) for factual/extractive tasks and higher temperature only for creative generation.
Ground factual queries with retrieval (RAG) instead of relying on parametric memory, especially for recent or niche information.
Measure quality with task-specific evals, not vibes — perplexity alone does not capture instruction-following or safety.
Right-size the model: try the smallest model that passes your evals before reaching for the largest.

Implementation

Reference code implementation

Python

model_fitting.py

import torch
import torch.nn.functional as F

# Vocabulary lookup map (vocabulary size is 5)
vocab = {0: "the", 1: "dog", 2: "barked", 3: "loudly", 4: "."}

def random_logits(token_indices):
    # Stand-in for a real model: ignores the context and returns random logits
    return torch.randn(len(vocab)) * 2.0

# Simulate an autoregressive next-token generation loop
def generate_tokens(logits_fn, prompt_indices, temperature=0.7, max_new_tokens=4):
    generated = list(prompt_indices)

    for _ in range(max_new_tokens):
        # 1. Get logits for the next token given everything generated so far
        logits = logits_fn(generated)

        # 2. Apply temperature scaling
        scaled_logits = logits / temperature

        # 3. Softmax to turn logits into probabilities
        probs = F.softmax(scaled_logits, dim=-1)

        # 4. Sample the next token index from the probability distribution
        next_token = torch.multinomial(probs, num_samples=1).item()

        generated.append(next_token)

    return generated

prompt = [0, 1]  # "the dog"

# Generate 3 additional tokens
output_indices = generate_tokens(random_logits, prompt, temperature=0.8, max_new_tokens=3)
generated_words = [vocab[idx] for idx in output_indices]

print("Prompt tokens:", [vocab[idx] for idx in prompt])
print("Full generated sequence:", " ".join(generated_words))

Strengths & Advantages

Unified interface: Natural language functions as a general-purpose programming language/API.
Few-shot adaptation: Can solve entirely new tasks simply by showing a few examples in the prompt.
Broad emergent skills: Capable of software engineering, logical reasoning, and document synthesis.

Limitations & Drawbacks

Hallucination-prone: The model optimizes for plausible continuations, so factual claims need grounding and verification.
High training and inference cost: State-of-the-art models require substantial data, accelerators, energy, and serving infrastructure.
Safety and alignment risks: Tends to reflect biases, toxic statements, or misinformation present in its training data unless heavily aligned.

Real-World Case Studies

GPT-3: scale unlocks few-shot in-context learning

Natural language processing

Scenario

Before GPT-3, adapting a language model to a new task typically required fine-tuning on a task-specific labeled dataset. Brown et al. (2020) asked whether simply scaling up an autoregressive Transformer would let a single frozen model perform new tasks from only a natural-language description and a handful of in-context examples — no gradient updates.

Approach

They pretrained a 175-billion-parameter decoder-only Transformer on roughly 300 billion tokens of filtered web text, books, and Wikipedia, using the standard next-token cross-entropy objective. They then evaluated the same frozen weights across dozens of benchmarks in zero-shot, one-shot, and few-shot settings, supplying task examples purely in the prompt context.

Outcome

GPT-3 (175B parameters, about 100x larger than its 1.5B-parameter predecessor GPT-2) achieved strong few-shot performance on many tasks, in some cases approaching fine-tuned baselines, and demonstrated that in-context learning strengthens systematically with scale. On the LAMBADA word-prediction benchmark, few-shot accuracy reached roughly 86%. The work established few-shot prompting as a practical paradigm and is a canonical demonstration of an emergent capability appearing with scale.

Source: Language Models are Few-Shot Learners — Brown, T. B. et al.
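
Supplying task examples purely in the prompt can be sketched as simple string assembly. The `Input:`/`Output:` template and the translation pairs below are illustrative assumptions, not the exact format used in the paper, and `build_few_shot_prompt` is a hypothetical helper name.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Format a task description, k solved examples, and one unsolved query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical translation demonstrations
examples = [("cheese", "fromage"), ("dog", "chien")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "cat")
print(prompt)
```

Passing an empty `examples` list yields a zero-shot prompt and a single pair yields a one-shot prompt; in every case the model weights stay frozen and only the context changes.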

Common Misconceptions

Misconception: LLMs search a database of facts to answer questions.

Correction: LLMs do not search databases or copy stored text; they generate text token-by-token based on statistical associations learned during training. They do not have access to real-time information unless combined with external retrieval tools.

Misconception: Higher temperature sampling increases the factual accuracy of the model.

Correction: Higher temperature flattens the next-token distribution (raising its entropy), making outputs more random and varied but also increasing the probability of hallucinations and factually incorrect statements.
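
The effect of temperature on randomness can be checked numerically. The sketch below applies a temperature-scaled softmax to toy logits (made-up values) and reports the Shannon entropy of the resulting distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [3.0, 1.0, 0.0, -1.0]  # toy logits (assumption)
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f} nats")
```

As the temperature rises, the top token's probability falls and the entropy climbs toward its maximum of $\log 4$ for this four-token example, so less likely tokens are sampled more often.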

References & Further Reading

Language Models are Few-Shot Learners
By Brown, T. B. et al.
Introducing LLaMA: A foundational, 65-billion-parameter large language model
By Touvron, H. et al.

Related Topics

Transformers

Natural Language Processing