Natural Language Processing

Difficulty: Advanced

Reading Time: 25 min

Track: Deep Learning

Techniques to translate human speech and text into vector mathematics, enabling machines to read, translate, and synthesize language.

Prerequisites

Neural Networks & Deep Learning

Deep Learning, Module 6 of 19


TL;DR

Classical NLP represents text as counts: Bag-of-Words records how often each term appears in a document, and TF-IDF additionally discounts terms that are common across the corpus.
TF-IDF is $\text{tf-idf}(t,d) = tf(t,d) \times \log\frac{N}{df(t)}$ — frequent-in-this-document, rare-across-the-corpus terms score highest.
N-gram language models use the chain rule plus a Markov assumption to approximate $P(w_1,\ldots,w_n)$ as a product of short, local conditional probabilities.
Laplace (add-one) smoothing prevents zero probabilities for n-grams unseen during training by adding one to every n-gram count, as if each n-gram had been observed one extra time.
Bag-of-words and n-gram counts ignore long-range meaning — they capture surface statistics, not semantics, which is why dense word embeddings were developed.
These statistical methods (Naive Bayes spam filters, n-gram language models) predate deep learning and still serve as fast, interpretable baselines.
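
As a rough worked example of the TF-IDF formula above, the sketch below scores the terms of one document in a three-document toy corpus. The corpus, the whitespace tokenizer, the raw-count term frequency, and the natural logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three tiny documents, already lowercased.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

# df(t): number of documents that contain term t at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Raw-count tf(t, d) times log(N / df(t)), as in the formula above."""
    tf = tokens.count(term)
    return tf * math.log(N / df[term])

# Score every distinct term of the first document, highest first.
scores = {term: tf_idf(term, tokenized[0]) for term in set(tokenized[0])}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {score:.3f}")
```

In this toy corpus "the" occurs twice in the first document yet scores zero, because it appears in every document and $\log\frac{N}{df(t)} = \log 1 = 0$; "mat" occurs in only one document and scores highest. Library implementations often use a smoothed IDF variant, so their numbers can differ from this raw formula.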

Learning Objectives

Explain the concept of word embeddings and dense vector representations
Calculate Term Frequency-Inverse Document Frequency (TF-IDF) weights for a vocabulary
Describe the mechanics of n-gram models and recurrent neural networks (RNNs)
Distinguish between lexical, syntactic, and semantic levels of language processing

Intuition

How to think conceptually about this topic

Imagine words as points in a giant multi-dimensional "meaning space". In this space, the spatial distance between points represents how related they are.

If you take the vector for "King", subtract the vector for "Man", and add the vector for "Woman", you land close to the vector for "Queen"; in well-trained embeddings, "Queen" is typically the nearest remaining word. Much of modern NLP rests on translating human writing into these coordinate-based maps.
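
The arithmetic can be sketched with hand-made vectors. Everything in the block below is a toy assumption: the two axes (roughly "royalty" and "gender"), the six words, and their coordinates were chosen by hand and do not come from a trained model.

```python
import math

# Hypothetical 2-D vectors chosen by hand: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":   (0.90, 0.80),
    "queen":  (0.85, -0.75),
    "man":    (0.10, 0.80),
    "woman":  (0.10, -0.80),
    "prince": (0.70, 0.70),
    "apple":  (-0.60, 0.10),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

The input words are excluded from the search, as is common when evaluating analogies, because the result of the arithmetic often stays closest to one of them.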


In Depth

Detailed explanations, contexts, and details

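Natural Language Processing (NLP) is the field of computer science and machine learning concerned with enabling computers to process, understand, and generate human language. Because computers operate only on numbers, NLP methods first map text to numerical representations: sparse count vectors in classical approaches, dense continuous vectors in neural ones.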

Traditional statistical NLP relies on word frequencies and counts (like TF-IDF). Modern deep learning NLP embeds words, phrases, or full documents into dense vectors (Word Embeddings). In this geometric vector space, words with similar meanings are mapped close together, allowing models to capture subtle semantic nuances, context, and syntax.

How It Compares

Bag-of-Words / TF-IDF vs N-gram Language Models vs Word Embeddings

Dimension	Bag-of-Words / TF-IDF	N-gram Language Models	Word Embeddings (e.g. Word2Vec)
Captures word order / local context	No — a document is an unordered multiset of word counts	Yes, but only within a fixed short window (e.g. the previous 1-2 words)	No inherent order capture in the static vectors themselves, but training context windows shape the vectors
Representation dimensionality	One dimension per vocabulary term — typically tens of thousands, very sparse	Conceptually a table of $V^k$ possible n-grams for vocabulary size $V$ — also huge and sparse	Fixed, small dense vectors (commonly 50-300 dimensions)
Captures semantic similarity	No — "good" and "great" are entirely unrelated dimensions	No — similarity is not represented, only sequential likelihood	Yes — similar words land close together in vector space (e.g. king - man + woman ≈ queen)
Data / compute needed	Very low — simple counting	Low to moderate — counting plus smoothing	Higher — requires training over large text corpora
Typical use case	Document classification, search/retrieval ranking, spam filtering	Speech recognition rescoring, autocomplete, early machine translation	Semantic search, clustering, as input features to neural networks

Takeaway: Bag-of-words/TF-IDF and n-gram models are simple, fast, and interpretable count-based statistics, but neither represents meaning; they capture only presence or local sequence likelihood. Word embeddings trade interpretability and simplicity for the ability to place semantically related words near each other, which is why they became the foundation for modern neural NLP.
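
The word-order row of the comparison can be illustrated in a few lines. This is a minimal sketch assuming whitespace tokenization and two made-up sentences; the helper names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Unordered multiset of word counts."""
    return Counter(text.lower().split())

def bigrams(text):
    """Counts of adjacent word pairs, which preserve local order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man"
b = "man bites dog"

print(bag_of_words(a) == bag_of_words(b))  # True: same words, order discarded
print(bigrams(a) == bigrams(b))            # False: adjacent pairs differ
```

The two sentences mean opposite things, yet bag-of-words cannot tell them apart, while even a window of two words can.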

When to Use It

Reach for this when

You need a fast, interpretable baseline for text classification or search ranking and have limited training data (TF-IDF + a linear model or Naive Bayes).
You are building speech recognition rescoring or autocomplete features where local word-sequence likelihood matters more than deep semantics (n-gram models).
Your corpus is small, your compute budget is tight, or you must explain exactly why a document scored the way it did to a non-technical stakeholder.

Avoid it when

The task depends on semantic similarity between words with no lexical overlap (e.g. matching "automobile" to "car") — bag-of-words and n-grams will miss this entirely.
You need to model long-range dependencies spanning many words — fixed-window n-grams and order-blind bag-of-words both fail here.
You have abundant text data and compute — modern contextual embeddings (e.g. Transformer-based models) will outperform these classical statistics on almost every semantic task.
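
The lexical-overlap failure mode can be made concrete with a small sketch. The query and documents below are invented, and `cosine_bow` is a hypothetical helper that compares raw word-count vectors.

```python
import math
from collections import Counter

def cosine_bow(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "used automobile"
doc_synonym = "secondhand car"   # same meaning, no shared words
doc_overlap = "used bicycle"     # different meaning, one shared word

print(cosine_bow(query, doc_synonym))
print(cosine_bow(query, doc_overlap))
```

The synonym document scores exactly zero because it shares no terms with the query, while the unrelated document scores about 0.5 on the strength of one shared word.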

Rules of thumb

Always apply smoothing (Laplace, or better, more advanced techniques like Kneser-Ney) before deploying any n-gram language model — zero counts will otherwise zero out entire sequence probabilities.
For TF-IDF, strip or downweight stopwords first; otherwise ubiquitous function words can still dominate raw term-frequency counts before IDF has a chance to discount them.
Treat bag-of-words/TF-IDF and n-gram counts as a cheap, strong baseline to beat — if a fancier embedding-based model cannot clearly outperform them, the added complexity may not be worth it.
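
To see why the smoothing rule matters, the following sketch compares unsmoothed and add-one bigram estimates on a three-sentence toy corpus. The corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions for illustration; production systems would more likely use Kneser-Ney, as noted above.

```python
from collections import Counter

# Toy training text (hypothetical); <s> and </s> mark sentence boundaries.
sentences = [
    "<s> i like nlp </s>",
    "<s> i like cats </s>",
    "<s> cats like nlp </s>",
]
tokens = [s.split() for s in sentences]

# Words the model can predict: everything except the start marker.
predictable = {w for sent in tokens for w in sent} - {"<s>"}
V = len(predictable)

unigram = Counter(w for sent in tokens for w in sent)
bigram = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def p_mle(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return bigram[(prev, word)] / unigram[prev]

def p_laplace(prev, word):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

def sentence_prob(sentence, p):
    """Product of bigram probabilities over one boundary-marked sentence."""
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p(prev, word)
    return prob

# Every word is known, but the bigram ("<s>", "nlp") never occurred in training.
unseen = "<s> nlp like cats </s>"
print(sentence_prob(unseen, p_mle))      # 0.0: one unseen bigram zeroes the product
print(sentence_prob(unseen, p_laplace))  # small but non-zero
```

Adding `V` to the denominator keeps each conditional distribution summing to one after every count has been incremented.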

Implementation

Reference code implementation

Python

model_fitting.py

import torch
import torch.nn as nn

# Map each of the 5 vocabulary words to a row index in the embedding table
vocab = {"hello": 0, "world": 1, "machine": 2, "learning": 3, "ai": 4}
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

# Overwrite the random initial weights with hand-picked values for the demo
with torch.no_grad():
    embedding_layer.weight.copy_(torch.tensor([
        [0.1, 2.0, -0.5],   # hello
        [0.8, -1.0, 0.4],   # world
        [-1.2, 0.3, 1.5],   # machine
        [-0.9, 0.2, 1.4],   # learning
        [-1.5, 0.5, 2.0],   # ai
    ]))

# Look up the vectors for "machine" and "ai" by their vocabulary indices
input_indices = torch.tensor([vocab["machine"], vocab["ai"]])
vectors = embedding_layer(input_indices)

print("Vector representation of 'machine':")
print(vectors[0].tolist())
print("Vector representation of 'ai':")
print(vectors[1].tolist())
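
A lookup alone does not show why the geometry is useful, so here is a short follow-up sketch that measures cosine similarity between rows of the same table. It uses plain Python rather than PyTorch to stay dependency-free, and it copies the hand-picked demo weights from the listing; in a real model these values would be learned during training.

```python
import math

# The same hand-picked demo weights as in the listing above (not trained values).
embeddings = {
    "hello":    [0.1, 2.0, -0.5],
    "world":    [0.8, -1.0, 0.4],
    "machine":  [-1.2, 0.3, 1.5],
    "learning": [-0.9, 0.2, 1.4],
    "ai":       [-1.5, 0.5, 2.0],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(f"machine vs ai:    {cosine(embeddings['machine'], embeddings['ai']):.3f}")
print(f"machine vs hello: {cosine(embeddings['machine'], embeddings['hello']):.3f}")
```

With these demo values, "machine" and "ai" point in nearly the same direction, while "machine" and "hello" have a slightly negative similarity.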

Strengths & Advantages

Unstructured data extraction: Converts millions of paragraphs, logs, and emails into structured databases.
Captures semantic similarity: Embedding geometry places synonyms and related terms close together, and contextual models go further by handling context-dependent shifts in meaning.
Powers global interaction: Enables real-time language translations and responsive chatbots.

Limitations & Drawbacks

Context ambiguity: Sarcasm, slang, and cultural double-meanings frequently trip up representations.
Vocabulary sparsity: Rare terms and misspellings may be missing from the vocabulary entirely or receive poorly trained, inaccurate vectors.
Computation bottlenecks: Sequential processing in recurrent architectures limits model training speed.

Real-World Case Studies

Early statistical machine translation with n-gram language models

Machine Translation

Scenario

Before neural machine translation became dominant, systems like the influential IBM translation models and later phrase-based statistical MT systems (e.g. Moses) needed a way to judge whether a candidate translated sentence was fluent, natural-sounding output in the target language, separate from whether it was a faithful translation of the source.

Approach

A target-language n-gram model (commonly trigram or higher, trained on large monolingual corpora) was combined with a translation model in a log-linear framework. The translation model proposed candidate phrase reorderings and substitutions, while the n-gram language model scored each candidate sentence using $P(w_1,\ldots,w_n) \approx \prod_i P(w_i \mid w_{i-2}, w_{i-1})$ , with smoothing (e.g. Kneser-Ney) handling n-grams unseen in training. The decoder searched for the candidate maximizing the combined translation-model and language-model score.
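
The scoring step can be sketched as follows. This is a deliberately simplified illustration, not a reconstruction of any real system: the monolingual corpus, the translation-model scores, and the feature weights are invented, and add-one smoothing stands in for Kneser-Ney to keep the code short.

```python
import math
from collections import Counter

# Hypothetical monolingual target-language data for the language model.
corpus = [
    "the house is small",
    "the house is old",
    "the small house is old",
]

def pad(sentence):
    """Add two start markers and one end marker for trigram contexts."""
    return ["<s>", "<s>"] + sentence.split() + ["</s>"]

trigrams, contexts, vocab = Counter(), Counter(), set()
for sentence in corpus:
    words = pad(sentence)
    vocab.update(words[2:])  # predictable tokens: real words plus </s>
    for i in range(2, len(words)):
        trigrams[tuple(words[i - 2:i + 1])] += 1
        contexts[tuple(words[i - 2:i])] += 1

def lm_logprob(sentence):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) with add-one smoothing."""
    words = pad(sentence)
    total = 0.0
    for i in range(2, len(words)):
        tri, ctx = tuple(words[i - 2:i + 1]), tuple(words[i - 2:i])
        total += math.log((trigrams[tri] + 1) / (contexts[ctx] + len(vocab)))
    return total

# Hypothetical translation-model log-scores for two candidates with the same words.
candidates = {
    "the house is small": -2.1,
    "house the small is": -2.0,
}
W_TM, W_LM = 1.0, 1.0  # illustrative feature weights; real systems tune these

def combined(sentence):
    """Log-linear score: weighted sum of translation and language model scores."""
    return W_TM * candidates[sentence] + W_LM * lm_logprob(sentence)

best = max(candidates, key=combined)
print(best)
```

In this toy setup the translation model alone slightly prefers the scrambled candidate, and the language model term reverses that choice because the scrambled word order contains only unseen trigrams.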

Outcome

Adding a well-tuned n-gram language model component was one of the largest single contributors to translation quality in these systems, often improving BLEU scores by several points over translation-model-only baselines, simply by rejecting grammatically broken or unnatural word orderings that a pure phrase-substitution model would otherwise output. This pattern — pairing a content model with a separate fluency model — predates and foreshadows the encoder-decoder architectures used in later neural MT.

Source: Speech and Language Processing (Ch. on N-gram Language Models and Machine Translation), Jurafsky, D. and Martin, J. H.

Common Misconceptions

Misconception: TF-IDF is a deep learning technique.

Correction: TF-IDF is a classical, statistics-based frequency calculation that does not use neural networks or learn word embeddings.

Misconception: Word embeddings like Word2Vec capture the exact contextual meaning of a word in a specific sentence.

Correction: Word2Vec assigns a single, static vector to each word regardless of context. Modern models (like Transformers) produce dynamic, context-dependent embeddings.

References & Further Reading

Speech and Language Processing (textbook)
By Jurafsky, D. and Martin, J. H.
Foundations of Statistical Natural Language Processing (textbook)
By Manning, C. D. and Schütze, H.
