Natural Language Processing
Natural Language Processing (NLP) is the field, at the intersection of computer science and linguistics, concerned with getting computers to process, interpret, and generate human language. Because computers operate on numbers, NLP systems first map text to numerical representations, most often vectors.
Traditional statistical NLP relies on word frequencies and counts (such as TF-IDF). Modern deep learning NLP embeds words, phrases, or full documents into dense vectors (embeddings). In this vector space, words with similar meanings tend to lie close together, which lets models capture semantic similarity along with some context and syntax.
Intuition
How to think about this algorithm
Imagine words as points in a giant multi-dimensional "meaning space". In this space, the spatial distance between points represents how related they are.
In the classic example, if you take the vector for "King", subtract the vector for "Man", and add the vector for "Woman", the result lands close to the vector for "Queen". Much of modern NLP rests on translating human writing into these coordinate-based maps.
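The analogy can be sketched with a handful of hand-picked vectors. Everything below is an illustrative assumption: the two-dimensional `toy_vectors` (one axis loosely standing for "royalty", the other for "gender") and the helper names `cosine` and `analogy` are hypothetical, and real embeddings are learned from data and have hundreds of dimensions, so the match is approximate rather than exact.

```python
import math

# Hypothetical 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Real embeddings are learned, not hand-written like this.
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

Excluding the three input words from the candidates is a common convention in analogy lookups, since the nearest neighbour of the computed point is otherwise often one of the inputs.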
[Interactive plot: Semantic Embedding Vector Spaces]
The Logic
Mathematical core for natural language processing
1. Cosine Similarity
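To measure semantic similarity between two text vectors ($\mathbf{a}$ and $\mathbf{b}$), we calculate the cosine of the angle $\theta$ between them. This focuses on direction rather than magnitude:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

The value is bounded between -1 (opposite direction) and +1 (identical direction).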
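As a minimal sketch of this measure in plain Python, the function below divides the dot product by the product of the two vector lengths. The function name `cosine_similarity` and the zero-vector check are choices made here for illustration; the sample vectors are the ones used for "machine", "ai", and "world" in this page's PyTorch example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


machine = [-1.2, 0.3, 1.5]
ai = [-1.5, 0.5, 2.0]
world = [0.8, -1.0, 0.4]

sim_close = cosine_similarity(machine, ai)      # nearly parallel vectors
sim_far = cosine_similarity(machine, world)     # pointing in different directions
sim_scaled = cosine_similarity(machine, [2 * x for x in machine])

print(round(sim_close, 3), round(sim_far, 3), round(sim_scaled, 3))
```

Scaling a vector leaves its score against the original at 1 (up to floating-point error), which is the sense in which the measure ignores magnitude.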
2. Term Frequency-Inverse Document Frequency (TF-IDF)
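A classic metric representing how important a word $t$ is to a specific document $d$ within a corpus $D$:

$$\operatorname{tf-idf}(t, d, D) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D), \qquad \operatorname{idf}(t, D) = \log \frac{N}{|\{d' \in D : t \in d'\}|}$$

Where $\operatorname{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents in $D$, and $\operatorname{idf}(t, D)$ scales down words that appear too frequently across all documents (like "the", "and").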
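The following sketch computes TF-IDF by hand for a three-sentence corpus. It assumes one common variant (raw term counts for tf, and log(N / document frequency) for idf) and whitespace tokenization; the function name `tf_idf` and the sample sentences are illustrative, and libraries typically apply additional smoothing or normalization.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document.

    Assumes tf = raw count and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = tf_idf(corpus)

for term in ("the", "cat", "mat"):
    print(term, round(scores[0][term], 3))
```

In the first document, "the" appears in every document and scores 0, "cat" appears in two documents and gets a small score, and "mat" appears only here and scores highest.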
Code Example
natural_language_processing.py · pytorch example
```python
import torch
import torch.nn as nn

# Create a lookup vocabulary of 5 words, mapping each to a 3-dimensional vector
vocab = {"hello": 0, "world": 1, "machine": 2, "learning": 3, "ai": 4}
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

# Initialize weights manually to demonstrate
with torch.no_grad():
    embedding_layer.weight.copy_(torch.tensor([
        [0.1, 2.0, -0.5],   # hello
        [0.8, -1.0, 0.4],   # world
        [-1.2, 0.3, 1.5],   # machine
        [-0.9, 0.2, 1.4],   # learning
        [-1.5, 0.5, 2.0],   # ai
    ]))

# Look up coordinates for "machine" and "ai" (indices 2 and 4)
input_indices = torch.tensor([vocab["machine"], vocab["ai"]])
vectors = embedding_layer(input_indices)

print("Vector representation of 'machine':")
print(vectors[0].tolist())
print("Vector representation of 'ai':")
print(vectors[1].tolist())
```
Strengths
Unstructured data extraction: Converts millions of paragraphs, logs, and emails into structured databases.
Captures semantic intent: Represents synonyms, context shifts, and, to a degree, metaphors through the geometry of embedding vectors.
Powers global interaction: Enables real-time language translation and responsive chatbots.
Limitations
Context ambiguity: Sarcasm, slang, and cultural double-meanings frequently trip up representations.
Vocabulary sparsity: Rare terms or spelling mistakes can fall outside the vocabulary or receive poorly trained, inaccurate vectors.
Computation bottlenecks: Sequential processing in recurrent architectures limits parallelism and therefore model training speed.
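The vocabulary sparsity issue can be illustrated with a short sketch. One common workaround, assumed here rather than taken from this page, is to reserve an index for unknown words so that a misspelling still maps to a valid embedding row. The `<unk>` token, the `vocab` contents, and the `encode` helper are all hypothetical names.

```python
# Hypothetical vocabulary with a reserved index for unknown words.
UNK = "<unk>"
vocab = {UNK: 0, "hello": 1, "world": 2, "machine": 3, "learning": 4}


def encode(tokens, vocab):
    """Map tokens to indices, falling back to the <unk> index for unseen tokens."""
    unk_index = vocab[UNK]
    return [vocab.get(token, unk_index) for token in tokens]


# "machne" is a misspelling that is not in the vocabulary.
indices = encode(["hello", "machne", "learning"], vocab)
print(indices)  # [1, 0, 4]
```

The lookup no longer fails, but every unknown word shares the same vector, so the misspelled "machne" loses any connection to "machine". That loss is the inaccuracy described above.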
Key Assumptions
Scope conditions and interpretation notes
1. Text can be tokenized into discrete lexical units.
2. Semantic similarity correlates with proximity in a continuous vector space.
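The first assumption, that text can be split into discrete units, can be made concrete with a small regex tokenizer. This is a minimal sketch under simple assumptions (lowercasing, with runs of word characters and single punctuation marks as tokens); `TOKEN_PATTERN` and `tokenize` are illustrative names, and production systems generally use more elaborate schemes.

```python
import re

# A token is either a run of word characters or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Hello, world! NLP isn't easy.")
print(tokens)
# ['hello', ',', 'world', '!', 'nlp', 'isn', "'", 't', 'easy', '.']
```

The contraction "isn't" is split into three tokens, which shows that where the unit boundaries fall is a design decision rather than a property of the text itself.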
References
Books and papers for deeper study
Jurafsky, D. and Martin, J. H. (2023) Speech and Language Processing. 3rd edn. Draft available online.