Large Language Models
Large Language Models (LLMs) are usually decoder-only Transformers trained with an autoregressive objective: predict the next token from the previous tokens. A token can be a word, part of a word, punctuation mark, or byte-like unit depending on the tokenizer.
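To make the granularity point concrete, here is a minimal sketch that splits the same text at three levels. The function names are hypothetical and none of these is the tokenizer of any particular model; production tokenizers typically learn a subword vocabulary that sits between the word and byte extremes.

```python
import re


def word_tokens(text):
    # Words and punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    # One token per character, including spaces
    return list(text)


def byte_tokens(text):
    # UTF-8 byte values: any string maps to integers in the range 0..255
    return list(text.encode("utf-8"))


text = "Tokenizers vary!"
print(word_tokens(text))       # coarse: few tokens, large vocabulary
print(len(char_tokens(text)))  # finer: more tokens, small vocabulary
print(byte_tokens("é"))        # one character can span several bytes
```

Coarser units give shorter sequences but need a larger vocabulary; finer units do the reverse.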
Pretraining teaches broad statistical structure from text and code. Instruction tuning then trains the model on examples of useful task-following behavior. Preference optimization methods, including RLHF-style pipelines, can further steer responses toward human preferences such as helpfulness, honesty, and harmlessness.
The central tension is that the training signal rewards plausible next-token distributions, not guaranteed truth. Good systems combine the model with retrieval, tools, evaluation, constraints, and careful prompting when factual accuracy matters.
Intuition
How to think about this algorithm
An LLM is best understood as a conditional probability model over tokens. If the prompt is "The capital of France is", a well-trained model assigns high probability to "Paris" because that continuation fits its learned representation of language and world facts.
Generation is iterative. The model samples or selects one token, appends it to the context, and repeats. Temperature, top-p, and other decoding settings decide how sharply or broadly the model samples from the next-token distribution.
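As an illustration of one decoding setting, the sketch below applies top-p (nucleus) filtering to a toy next-token distribution and then samples from what remains. The probabilities and helper names are made up for the example; this is one straightforward way to write the filter, not the implementation of any specific library.

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample(probs, rng):
    # Draw one token in proportion to its probability
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution after the prompt "The capital of France is"
next_token_probs = {"Paris": 0.70, "Lyon": 0.15, "a": 0.10, "banana": 0.05}

nucleus = top_p_filter(next_token_probs, top_p=0.8)
print(nucleus)  # only the most likely tokens survive the filter

rng = random.Random(0)
print(sample(nucleus, rng))
```

A smaller `top_p` cuts off the low-probability tail more aggressively, which trades diversity for fewer unlikely continuations.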
This is why LLM behavior can feel both powerful and brittle. The same mechanism can produce useful code, summarize a paper, or confidently continue a false premise if the prompt and context make that continuation likely.
[Interactive figure: Autoregressive decoding probability. Raising the sampling temperature flattens the candidate distribution (more random outputs); lowering it sharpens the distribution (closer to deterministic outputs).]
The Logic
Mathematical core for large language models
1. Autoregressive sequence modeling
An LLM factorizes the probability of a token sequence $x_1, \dots, x_n$ into left-to-right conditional probabilities:
$$p(x_1, \dots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_1, \dots, x_{t-1})$$
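The factorization can be checked numerically on a toy model. The sketch below assumes, purely for brevity, that each token depends only on the previous one (a bigram table with invented probabilities); a real LLM conditions on the whole prefix, but the log-probabilities are accumulated the same way.

```python
import math

# Hypothetical conditional probabilities p(next | previous).
# "<s>" is an assumed start-of-sequence marker.
cond = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"barked": 0.8, "slept": 0.2},
}


def sequence_log_prob(tokens):
    """Sum the log of each conditional factor, left to right."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(cond[prev][tok])
        prev = tok
    return log_p


lp = sequence_log_prob(["the", "dog", "barked"])
# Exponentiating recovers the product of the factors: 0.6 * 0.5 * 0.8
print(math.exp(lp))
```

Working in log space avoids numerical underflow when the sequence is long and the product of many small probabilities becomes tiny.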
2. Next-token cross-entropy loss
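For parameters $\theta$ and vocabulary distribution $p_\theta(\cdot \mid x_{<t})$, where $x_{<t}$ denotes the tokens before position $t$, pretraining minimizes negative log likelihood:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})$$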
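A small worked example of this loss, written with the standard library only. The logits and token ids are hypothetical, and the helper names are not from any framework. The function returns the summed loss; training code commonly averages over positions instead (for example, `torch.nn.functional.cross_entropy` averages by default).

```python
import math


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def next_token_nll(logits_per_step, token_ids):
    """Summed negative log likelihood, where logits_per_step[t]
    scores the token that follows token_ids[t]."""
    targets = token_ids[1:]  # each position predicts the next token
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total


token_ids = [0, 1, 2]                  # hypothetical 3-token sequence
uniform_logits = [[0.0] * 5, [0.0] * 5]  # vocabulary of 5, no preference

loss = next_token_nll(uniform_logits, token_ids)
# With uniform predictions, each of the 2 targets costs log(5)
print(loss, 2 * math.log(5))
```

A model that has learned nothing pays `log(vocab_size)` per token; training pushes the loss below that baseline by raising the probability of the observed next tokens.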
3. Temperature scaling
The network outputs logits $z_i$ over vocabulary items. Temperature $T > 0$ rescales logits before softmax:
$$p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$
Lower $T$ concentrates probability on the highest-logit tokens. Higher $T$ flattens the distribution and increases sampling diversity.
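The effect is easy to see on a handful of numbers. This sketch applies a temperature-scaled softmax to three hypothetical logits at three temperatures; the function name is illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The top token's share shrinks as the temperature rises, while the ranking of the tokens stays the same: temperature changes how peaked the distribution is, not which token is most likely.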
Code Example
large_language_models.py · pytorch example
```python
import torch
import torch.nn.functional as F

# Toy vocabulary: token index -> token string
vocab = {0: "the", 1: "dog", 2: "barked", 3: "loudly", 4: "."}


def generate_tokens(prompt_indices, temperature=0.7, max_new_tokens=4):
    """Simulate an autoregressive next-token generation loop."""
    generated = list(prompt_indices)

    for _ in range(max_new_tokens):
        # 1. Get logits for the current step. A real model would compute
        #    these from `generated`; random logits stand in for it here.
        logits = torch.randn(len(vocab)) * 2.0

        # 2. Apply temperature scaling
        scaled_logits = logits / temperature

        # 3. Softmax turns the scaled logits into probabilities
        probs = F.softmax(scaled_logits, dim=-1)

        # 4. Sample the next token index from the probability distribution
        next_token = torch.multinomial(probs, num_samples=1).item()

        # 5. Append it so the next step sees the longer context
        generated.append(next_token)

    return generated


prompt = [0, 1]  # "the dog"

# Generate 4 additional tokens
output_indices = generate_tokens(prompt, temperature=0.8, max_new_tokens=4)
generated_words = [vocab[idx] for idx in output_indices]

print("Prompt tokens:", [vocab[idx] for idx in prompt])
print("Full generated sequence:", " ".join(generated_words))
```
Strengths
Unified interface: Natural language functions as a general-purpose programming language/API.
Few-shot adaptation: Can often handle new tasks from a few examples placed in the prompt, without any weight updates.
Broad emergent skills: Capable of software engineering, logical reasoning, and document synthesis.
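For the few-shot adaptation point, a prompt is usually just demonstrations followed by the new query. The sketch below builds one; the `Input:`/`Output:` labels and the function name are arbitrary conventions chosen for this example, not a format any model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Concatenate demonstrations, then leave the final output
    blank for the model to complete."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [("cheese", "fromage"), ("dog", "chien")]
prompt = build_few_shot_prompt(
    examples, "cat", instruction="Translate English to French."
)
print(prompt)
```

Because generation is next-token prediction, ending the prompt at `Output:` makes the pattern-completing continuation the answer to the new query.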
Limitations
Hallucination prone: Optimizes plausible continuation, so factual claims need grounding and verification.
High training and inference cost: State-of-the-art models require substantial data, accelerators, energy, and serving infrastructure.
Safety and alignment risks: Can reproduce biases, toxic statements, or misinformation present in the training data unless this is mitigated through alignment work.
Key Assumptions
Scope conditions and interpretation notes
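1. Autoregressive next-token prediction is assumed to produce coherent multi-step continuations that resemble reasoning trajectories; the training objective itself does not guarantee this.
2. Temperature is assumed to be set appropriately for the task. It rescales token logits and can reduce repetitive loops by increasing diversity, but it does not prevent them on its own.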
References
Books and papers for deeper study
Brown, T. B. et al. (2020) 'Language models are few-shot learners', in Advances in Neural Information Processing Systems (NeurIPS).