Natural Language Processing

Natural Language Processing (NLP) is the field of computer science and machine learning concerned with enabling computers to process, understand, and generate human language. Because computers operate only on numbers, NLP systems first map text to numerical representations, from simple word counts to continuous vector spaces.

Traditional statistical NLP relies on word frequencies and counts (such as TF-IDF). Modern deep learning NLP embeds words, phrases, or full documents into dense vectors (embeddings). In this vector space, words with similar meanings tend to lie close together, which helps models capture semantic similarity, context, and syntactic regularities.

Best use

Unstructured data extraction: Converts millions of paragraphs, logs, and emails into structured databases.

Watch out for

Context ambiguity: Sarcasm, slang, and cultural double-meanings frequently trip up representations.

Intuition

How to think about this topic

Imagine words as points in a giant multi-dimensional "meaning space". In this space, the spatial distance between points represents how related they are.

If you take the vector for "King", subtract the vector for "Man", and add the vector for "Woman", you land close to the vector for "Queen"; in well-trained embeddings, "Queen" is typically the nearest remaining word. Much of NLP consists of translating human writing into these coordinate-based maps.
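The analogy arithmetic can be sketched with a few hand-picked 2-D vectors. The coordinates below are invented for illustration (one axis loosely standing for "royalty", the other for "gender"); real embeddings are learned from data and usually have hundreds of dimensions. The names `toy_vectors` and `analogy` are hypothetical.

```python
import math

# Hand-picked toy coordinates (an assumption, not learned embeddings):
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    # Euclidean distance is used here; cosine similarity is another common choice.
    return min(candidates, key=lambda word: math.dist(target, vectors[word]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates matters: without that step, the nearest neighbor of the shifted vector is often one of the inputs themselves.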

Interactive Diagram

Semantic Embedding Vector Spaces

[Interactive plot showing the active embedding, its neighbor embeddings, and the analogy vector shift. Example state: selected word "king", cosine similarity target 0.85.]

Key Insight: Word embedding models represent language tokens as coordinates in a continuous space, where vector differences encode linguistic analogies.

The Logic

Mathematical core for natural language processing

1. Cosine Similarity

To measure semantic similarity between two text vectors ($u$ and $v$), we calculate the cosine of the angle between them. This focuses on direction rather than magnitude:

$$\text{Cosine Similarity}(u, v) = \frac{u \cdot v}{\|u\| \|v\|} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2} \sqrt{\sum_{i} v_i^2}}$$

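The value is bounded between -1 (opposite direction) and +1 (identical direction). A value of 0 means the vectors are orthogonal, which is usually read as unrelated; note that opposite direction does not necessarily mean opposite meaning.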
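The formula and its bounds can be checked with a minimal pure-Python sketch. The function name `cosine_similarity` and the example vectors are illustrative assumptions; in practice a numerical library would normally be used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(u, [2.0, 4.0, 6.0])   # u scaled by 2
opposite = cosine_similarity(u, [-1.0, -2.0, -3.0])      # u negated
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors

print(round(same_direction, 6), round(opposite, 6), orthogonal)
```

Scaling `u` by 2 leaves the similarity at the upper bound, which is the sense in which the measure ignores magnitude.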

2. Term Frequency-Inverse Document Frequency (TF-IDF)

A classic metric representing how important a word $t$ is to a specific document $d$ within a corpus $D$:

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

$$\text{IDF}(t, D) = \log \left( \frac{|D|}{1 + |\{d \in D : t \in d\}|} \right)$$

Here $\text{TF}(t, d)$ is the number of times $t$ occurs in $d$, and $\text{IDF}(t, D)$ scales down words that appear in most documents (like "the", "and").
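A small worked example of this definition, using raw counts for TF and the smoothed IDF given above. The corpus and the function names `tf`, `idf`, and `tf_idf` are made up for illustration; libraries often use slightly different smoothing variants, so their scores may not match these.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, already tokenized into lowercase words.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "stock", "market", "fell"],
    ["the", "market", "rallied"],
]

def tf(term, doc):
    """Raw count of term in one document."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(|D| / (1 + number of documents containing term))."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + doc_freq))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", corpus[0], corpus)  # frequent in every document
score_mat = tf_idf("mat", corpus[0], corpus)  # appears in one document only

print(f"the: {score_the:.3f}, mat: {score_mat:.3f}")
```

With this particular smoothing, a term that occurs in every document receives a slightly negative weight, because the denominator $1 + |D|$ exceeds $|D|$. The rarer term "mat" scores higher even though "the" occurs twice in the document.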

Code Example

natural_language_processing.py · pytorch example

import torch
import torch.nn as nn

# Create a lookup vocabulary of 5 words, mapping each to a 3-dimensional vector
vocab = {"hello": 0, "world": 1, "machine": 2, "learning": 3, "ai": 4}
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

# Initialize weights manually to demonstrate (normally these are learned)
with torch.no_grad():
    embedding_layer.weight.copy_(torch.tensor([
        [0.1, 2.0, -0.5],  # hello
        [0.8, -1.0, 0.4],  # world
        [-1.2, 0.3, 1.5],  # machine
        [-0.9, 0.2, 1.4],  # learning
        [-1.5, 0.5, 2.0],  # ai
    ]))

# Look up the vectors for "machine" and "ai" (indices 2 and 4)
input_indices = torch.tensor([vocab["machine"], vocab["ai"]])
vectors = embedding_layer(input_indices)

print("Vector representation of 'machine':")
print(vectors[0].tolist())
print("Vector representation of 'ai':")
print(vectors[1].tolist())

Strengths

  • Unstructured data extraction: Converts millions of paragraphs, logs, and emails into structured databases.

  • Captures semantic intent: Represents synonyms, context shifts, and some figurative usage through the geometry of word embedding vectors.

  • Powers global interaction: Enables real-time language translations and responsive chatbots.

Limitations

  • Context ambiguity: Sarcasm, slang, and cultural double-meanings frequently trip up representations.

  • Vocabulary sparsity: Rare terms or spelling mistakes may fall outside the vocabulary, leaving them with missing or poorly trained vectors.

  • Computation bottlenecks: Sequential processing in recurrent architectures limits model training speed.
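The vocabulary sparsity limitation can be illustrated with a fixed lookup table. One common mitigation, sketched here under stated assumptions, reserves a single index for unknown words; the vocabulary, `UNK_TOKEN`, and `encode` are hypothetical names chosen for this example.

```python
# Hypothetical vocabulary; index 0 is reserved for out-of-vocabulary words.
UNK_TOKEN = "<unk>"
vocab = {UNK_TOKEN: 0, "hello": 1, "world": 2, "machine": 3, "learning": 4}

def encode(tokens, vocab):
    """Map tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    unk_index = vocab[UNK_TOKEN]
    return [vocab.get(token, unk_index) for token in tokens]

# "machne" is a misspelling and "zeitgeist" is a rare word; neither is in vocab.
indices = encode(["hello", "machne", "learning", "zeitgeist"], vocab)
print(indices)
```

Both unknown tokens collapse to the same index and therefore share one vector, so the model cannot tell them apart. Subword tokenization is another widely used way to reduce this problem.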

Key Assumptions

Scope conditions and interpretation notes

  1. Text can be tokenized into discrete lexical units.

  2. Semantic similarity correlates with proximity in a continuous vector space.
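The first assumption can be made concrete with a deliberately simple tokenizer. This is a sketch using a regular expression, not the method of any particular NLP library, and the function name `tokenize` is an assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches runs of letters, digits, or underscores;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("NLP isn't magic, it's math!")
print(tokens)
```

The contractions are split into three tokens each ("isn", "'", "t"), which shows that where a "lexical unit" begins and ends is itself a design decision.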

References

Books and papers for deeper study

  • Jurafsky, D. and Martin, J. H. (2023) Speech and Language Processing. 3rd edn. Draft available online.