Large Language Models

Large Language Models (LLMs) are usually decoder-only Transformers trained with an autoregressive objective: predict the next token from the previous tokens. A token can be a word, part of a word, punctuation mark, or byte-like unit depending on the tokenizer.

Pretraining teaches broad statistical structure from text and code. Instruction tuning then trains the model on examples of useful task-following behavior. Preference optimization methods, including RLHF-style pipelines, can further steer responses toward human preferences such as helpfulness, honesty, and harmlessness.

The central tension is that the training signal rewards plausible next-token distributions, not guaranteed truth. Good systems combine the model with retrieval, tools, evaluation, constraints, and careful prompting when factual accuracy matters.

Best use

Unified interface: Natural language functions as a general-purpose programming language/API.

Watch out for

Hallucination prone: Optimizes plausible continuation, so factual claims need grounding and verification.

Intuition

How to think about this algorithm

An LLM is best understood as a conditional probability model over tokens. If the prompt is "The capital of France is", a well-trained model assigns high probability to "Paris" because that continuation fits its learned representation of language and world facts.

Generation is iterative. The model samples or selects one token, appends it to the context, and repeats. Temperature, top-p, and other decoding settings decide how sharply or broadly the model samples from the next-token distribution.
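As a minimal sketch of one of the decoding settings mentioned above, top-p (nucleus) sampling can be written as "keep the smallest set of most likely tokens whose cumulative probability reaches a threshold, renormalize, then sample". The distribution, the function names `top_p_filter` and `sample`, and the threshold value are illustrative assumptions, not taken from any particular library.

```python
import random


def top_p_filter(probs, top_p=0.85):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs, rng=random):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution for "The capital of France is"
next_token_probs = {"Paris": 0.80, "Lyon": 0.10, "a": 0.06, "Berlin": 0.04}

nucleus = top_p_filter(next_token_probs, top_p=0.85)
print(nucleus)          # only "Paris" and "Lyon" survive the cutoff
print(sample(nucleus))  # one sampled token from the truncated distribution
```

A smaller `top_p` trims more of the low-probability tail, which narrows sampling in a different way than lowering the temperature does.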

This is why LLM behavior can feel both powerful and brittle. The same mechanism can produce useful code, summarize a paper, or confidently continue a false premise if the prompt and context make that continuation likely.

Interactive Diagram

Autoregressive Decoding Probability

[Interactive figure: next-token candidate probabilities (top candidate and alternative tokens) for a context prompt, with a temperature slider and a readout of the distribution's Shannon entropy.] Raising the temperature flattens the candidate distribution and makes outputs more random; lowering it sharpens the distribution and makes outputs closer to deterministic.

Key Insight: LLMs output a vector of raw logit scores, one per vocabulary token. Applying softmax with temperature controls the randomness of token selection.

The Logic

Mathematical core for large language models

1. Autoregressive sequence modeling

An LLM factorizes the probability of a token sequence $x_{1:T}$ into left-to-right conditional probabilities:

$$P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$$
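To make the factorization concrete, here is a small worked example. The three conditional probabilities are made-up numbers for the sequence "the dog barked", not outputs of a real model. The sequence probability is their product, and the same value is obtained by summing log probabilities, which is what implementations usually do to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities P(x_t | x_<t) for "the dog barked"
step_probs = [
    ("the", 0.20),     # P("the")
    ("dog", 0.05),     # P("dog" | "the")
    ("barked", 0.30),  # P("barked" | "the dog")
]

# Chain rule: multiply the per-step conditionals
sequence_prob = math.prod(p for _, p in step_probs)

# Equivalent computation in log space
log_prob = sum(math.log(p) for _, p in step_probs)

print(sequence_prob)       # approximately 0.003
print(math.exp(log_prob))  # same value, up to floating-point rounding
```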

2. Next-token cross-entropy loss

For parameters $\theta$ and vocabulary distribution $p_\theta$, pretraining minimizes negative log likelihood:

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})$$
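The loss only needs the probability the model assigned to the token that actually came next at each position. The sketch below computes it for two hypothetical models on a three-token sequence; the probability values and the helper name `next_token_loss` are illustrative assumptions.

```python
import math


def next_token_loss(correct_token_probs):
    """Average negative log-likelihood over T positions, where each entry
    is p_theta(x_t | x_<t) for the token that actually occurred."""
    T = len(correct_token_probs)
    return -sum(math.log(p) for p in correct_token_probs) / T


# Hypothetical probabilities assigned to the correct next token
confident_model = [0.90, 0.80, 0.95]
uncertain_model = [0.20, 0.10, 0.30]

print(next_token_loss(confident_model))  # lower loss
print(next_token_loss(uncertain_model))  # higher loss
```

The loss is zero only when the model puts probability 1 on every correct token, and it grows as probability mass moves to other tokens.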

3. Temperature scaling

The network outputs logits $z_i$ over vocabulary items. Temperature $\tau > 0$ rescales logits before softmax:

$$P(x_i) = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}$$

Lower $\tau$ concentrates probability on the highest-logit tokens. Higher $\tau$ flattens the distribution and increases sampling diversity.
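The effect of temperature can be checked numerically. This sketch applies the softmax formula above to a fixed set of made-up logits and reports the Shannon entropy of the result (in nats, an assumption, since the text does not fix a unit). The function names are illustrative.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits divided by temperature tau (must be > 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for four candidate tokens
logits = [2.0, 1.0, 0.5, -1.0]

for tau in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As `tau` increases, the printed probabilities move toward uniform and the entropy rises; as it decreases, nearly all mass lands on the highest-logit token.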

Code Example

large_language_models.py · pytorch example

Python
import torch
import torch.nn.functional as F

# Toy vocabulary: token index -> token string
vocab = {0: "the", 1: "dog", 2: "barked", 3: "loudly", 4: "."}


def generate_tokens(prompt_indices, temperature=0.7, max_new_tokens=4):
    """Simulate an autoregressive next-token generation loop."""
    generated = list(prompt_indices)

    for _ in range(max_new_tokens):
        # 1. Get logits for the current step. A real model would compute
        #    these from `generated`; here they are random stand-ins.
        logits = torch.randn(len(vocab)) * 2.0

        # 2. Apply temperature scaling
        scaled_logits = logits / temperature

        # 3. Softmax to turn logits into probabilities
        probs = F.softmax(scaled_logits, dim=-1)

        # 4. Sample the next token index from the probability distribution
        next_token = torch.multinomial(probs, num_samples=1).item()

        # 5. Append it so the next step sees the longer context
        generated.append(next_token)

    return generated


prompt = [0, 1]  # "the dog"

# Generate 4 additional tokens
output_indices = generate_tokens(prompt, temperature=0.8, max_new_tokens=4)
generated_words = [vocab[idx] for idx in output_indices]

print("Prompt tokens:", [vocab[idx] for idx in prompt])
print("Full generated sequence:", " ".join(generated_words))

Strengths

  • Unified interface: Natural language functions as a general-purpose programming language/API.

  • Few-shot adaptation: Can often handle a new task from a few examples placed in the prompt, with no weight updates.

  • Broad emergent skills: Capable of software engineering, logical reasoning, and document synthesis.
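As a rough illustration of the few-shot point above, a prompt can be assembled from a handful of input/output pairs followed by the new input. The `Input:`/`Output:` layout and the helper name `build_few_shot_prompt` are assumptions made for this sketch; real systems use many different prompt formats.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) example pairs, then the new query, as one prompt.

    The model is expected to continue the text after the final "Output:".
    """
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical translation task shown through two examples
examples = [("cheese", "fromage"), ("dog", "chien")]
few_shot_prompt = build_few_shot_prompt(
    examples, "house", instruction="Translate English to French."
)
print(few_shot_prompt)
```

The task is specified entirely by the text of the prompt, so changing the examples changes the task without retraining.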

Limitations

  • Hallucination prone: Optimizes plausible continuation, so factual claims need grounding and verification.

  • High training and inference cost: State-of-the-art models require substantial data, accelerators, energy, and serving infrastructure.

  • Safety and alignment risks: Tends to reflect biases, toxic statements, or misinformation present in its training data unless heavily aligned.

Key Assumptions

Scope conditions and interpretation notes

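  • Next-token prediction over a long context is assumed to yield coherent multi-step text, including reasoning-like sequences; the training objective itself does not guarantee this.

  • Temperature and other decoding settings are assumed to be tuned to the task: temperature only rescales logits, so very low values can lead to repetitive output and very high values reduce coherence.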

References

Books and papers for deeper study

  • Brown, T. B. et al. (2020) 'Language models are few-shot learners', in Advances in Neural Information Processing Systems (NeurIPS).