Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a core concept in statistics and machine learning. It answers a simple question: "Given the data we just saw, which rules would have been most likely to produce it?" It's the mathematical engine behind how many algorithms learn from data, from simple trend lines to massive neural networks.

Where is it used?

MLE is incredibly useful when you have a good guess about the general "shape" of your data (like a bell curve), but you don't know the exact details (like where the center of the curve is). It's used everywhere: figuring out the error rate of a factory machine, calculating the true conversion rate of a new website button in an A/B test, or predicting how long a lightbulb will last before burning out.

Best use

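Under standard regularity conditions, it's provably consistent and asymptotically efficient: with more and more data the estimate converges to the true value, and in the limit no other well-behaved estimator is more precise.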

Watch out for

If you only have a tiny amount of data, MLE can make extreme, overconfident guesses (overfitting). Three heads in three flips gives an estimate of 100% heads.

Intuition

How to think about this algorithm

Imagine you find a weird, weighted coin on the ground. You flip it 10 times, and it lands on heads 7 times. What's your best guess for the true probability of this coin landing on heads?

MLE says your best guess is exactly 70% (or 0.7). Why? Because a true probability of 0.7 makes the data you actually saw (7 heads out of 10) more likely than a probability of 0.5, or 0.9, or anything else would. MLE simply finds the parameter values that make your real-world observations as probable as possible.
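The coin example can be checked numerically. The sketch below is an illustration rather than anything from the original page: it evaluates the binomial likelihood of 7 heads in 10 flips on a grid of candidate values for p and picks the candidate with the highest likelihood. The grid resolution and the variable names are assumptions chosen for readability.

```python
import numpy as np
from math import comb

heads, flips = 7, 10

# Candidate values for the coin's probability of heads (illustrative grid, step 0.01).
candidates = np.linspace(0.01, 0.99, 99)

# Binomial likelihood of seeing exactly 7 heads in 10 flips for each candidate.
likelihood = comb(flips, heads) * candidates**heads * (1 - candidates) ** (flips - heads)

best_p = candidates[np.argmax(likelihood)]
print(f"Most likely p on the grid: {best_p:.2f}")

# Compare the three values mentioned in the text.
for p in (0.5, 0.7, 0.9):
    lik = comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)
    print(f"p = {p}: likelihood = {lik:.4f}")
```

A grid search like this only works for one or two parameters, but it makes the idea concrete: the likelihood is a function of the parameter, and the MLE is wherever that function peaks.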

Interactive Diagram

[Interactive diagram: a Gaussian model with adjustable mean (μ) and standard deviation (σ), drawn over observation points placed by the user. It shows the likelihood of each point, the total log-likelihood ln L(μ, σ), and the optimal μ and σ for the placed points.]

Key Insight: MLE searches for parameter values that maximize the joint probability density of the observed data. Graphically, this pulls the curve to cover the points.

The Logic

Mathematical core for maximum likelihood estimation

1. The Likelihood Function

Let's say we have a bunch of data points $X = \{x_1, x_2, \dots, x_n\}$. We assume these data points come from some probability distribution (like a bell curve) that is controlled by some unknown settings, which we call parameters $\theta$.

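The "Likelihood" of those parameters, given the data we saw, is calculated by multiplying together the probability (or, for continuous data, the probability density) of each individual data point. Multiplying is valid because we assume the data points are independent: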

$\mathcal{L}(\theta | X) = \prod_{i=1}^{n} P(x_i | \theta)$
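As a minimal sketch of this formula, assume a Gaussian model, so that $\theta = (\mu, \sigma)$ and $P(x_i | \theta)$ is the Gaussian density. The five data values are the ones used in this page's code example. The helper name `likelihood` and the two candidate parameter settings are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

data = np.array([2.3, 1.9, 2.5, 2.8, 1.7])

def likelihood(mu, sigma):
    """Product of the Gaussian density at each data point (assumes independence)."""
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# Two hypothetical parameter settings: one near the data, one far from it.
close_fit = likelihood(mu=2.2, sigma=0.4)
poor_fit = likelihood(mu=0.0, sigma=1.0)

print(f"L(mu=2.2, sigma=0.4) = {close_fit:.3e}")
print(f"L(mu=0.0, sigma=1.0) = {poor_fit:.3e}")
```

The setting whose bell curve sits over the data gets the larger likelihood. MLE is the search for the setting that makes this number as large as possible.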

2. Log-Likelihood

Multiplying a bunch of tiny probabilities together (like $0.01 \times 0.05 \times 0.02$) quickly results in numbers so small that computers round them down to zero (floating-point underflow). Products are also awkward to differentiate.

To fix this, we take the logarithm of the whole thing. The log of a product is the sum of the logs, so our multiplication problem becomes an addition problem, which is much easier for computers to handle. And because the logarithm is strictly increasing, the parameters that maximize the log-likelihood are exactly the ones that maximize the likelihood:

$\log \mathcal{L}(\theta | X) = \sum_{i=1}^{n} \log P(x_i | \theta)$
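The underflow problem is easy to reproduce. The sketch below uses 2,000 made-up per-point probabilities between 0.01 and 0.05. The count, the range, and the random seed are arbitrary assumptions. It compares the direct product with the sum of logs.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2,000 hypothetical per-point probabilities, each between 0.01 and 0.05.
probs = rng.uniform(0.01, 0.05, size=2000)

# Direct product: far too small for a 64-bit float, so it underflows to 0.0.
likelihood = np.prod(probs)

# Sum of logs: the same information as an ordinary negative number.
log_likelihood = np.sum(np.log(probs))

print(likelihood)
print(log_likelihood)
```

Once the product has collapsed to zero, every parameter setting looks equally bad and there is nothing left to maximize. The log-likelihood stays finite and can still be compared across settings.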

3. Finding the Maximum

The Maximum Likelihood Estimate (MLE), written as $\hat{\theta}_{\text{MLE}}$, is simply the parameter value that makes the log-likelihood as big as possible:

$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \log \mathcal{L}(\theta | X)$

To find this peak, we use calculus. We take the derivative of the log-likelihood with respect to $\theta$, set it to zero, and solve for $\theta$:

$\frac{\partial}{\partial \theta} \sum_{i=1}^{n} \log P(x_i | \theta) = 0$

Example: Flipping a Coin

For a coin flip with probability $p$ of getting heads, write $x_i = 1$ if flip $i$ lands heads and $x_i = 0$ if it lands tails. The probability of a single flip is then $P(x_i | p) = p^{x_i}(1-p)^{1-x_i}$. The log-likelihood for $n$ flips is:

$\log \mathcal{L}(p) = \sum_{i=1}^n \left[ x_i \log p + (1-x_i) \log(1-p) \right]$

If you do the calculus (take the derivative and set it to zero), you find that the best guess for $p$ is just the fraction of flips that landed heads: $\hat{p} = \frac{1}{n} \sum_{i=1}^n x_i$. For 7 heads in 10 flips, that's $\hat{p} = 0.7$.
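The calculus itself is short. As a sketch, write $k = \sum_{i=1}^n x_i$ for the number of heads (the symbol $k$ is introduced here only for convenience) and differentiate the log-likelihood above with respect to $p$:

```latex
\begin{aligned}
\frac{d}{dp} \log \mathcal{L}(p)
  &= \sum_{i=1}^n \left[ \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right]
   = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \\
\Rightarrow\quad k(1 - p) &= (n - k)\,p \\
\Rightarrow\quad \hat{p} &= \frac{k}{n} = \frac{1}{n} \sum_{i=1}^n x_i
\end{aligned}
```

The second derivative, $-k/p^2 - (n-k)/(1-p)^2$, is negative for every $0 < p < 1$, so this stationary point is a maximum rather than a minimum.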

Code Example

maximum_likelihood_estimation.py · reference implementation

```python
import numpy as np
from scipy.optimize import minimize

data = np.array([2.3, 1.9, 2.5, 2.8, 1.7])

def neg_log_likelihood(params):
    """Negative Gaussian log-likelihood of `data` for params = (mu, sigma)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # sigma must be positive
    n = len(data)
    # Minimizing negative log-likelihood is the same as maximizing log-likelihood
    ll = -(n / 2) * np.log(2 * np.pi) - n * np.log(sigma) - np.sum((data - mu) ** 2) / (2 * sigma**2)
    return -ll

# Bound sigma away from zero so the optimizer never steps into invalid values
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="L-BFGS-B",
                  bounds=[(None, None), (1e-6, None)])
mu_est, sigma_est = result.x

print(f"MLE Derived Mean: {mu_est:.2f}")
print(f"MLE Derived Std Dev: {sigma_est:.2f}")
```
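For a Gaussian model, the numerical search above can be cross-checked against a closed-form answer. The MLE of the mean is the sample mean, and the MLE of the standard deviation is the root of the average squared deviation (dividing by $n$, not $n - 1$). The sketch below computes both on the same data. The variable names are illustrative.

```python
import numpy as np

data = np.array([2.3, 1.9, 2.5, 2.8, 1.7])

# Closed-form Gaussian MLE: the sample mean...
mu_closed = data.mean()

# ...and the standard deviation that divides by n (ddof=0).
sigma_mle = data.std(ddof=0)

# For comparison: the usual sample standard deviation divides by n - 1.
sigma_unbiased = data.std(ddof=1)

print(f"Closed-form mean:        {mu_closed:.4f}")
print(f"MLE std dev (ddof=0):    {sigma_mle:.4f}")
print(f"Sample std dev (ddof=1): {sigma_unbiased:.4f}")
```

The optimizer's estimates should match the `ddof=0` values up to numerical tolerance. The MLE of the standard deviation is always a little smaller than the `ddof=1` version, a gap that matters most in small samples.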

Strengths

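Under standard regularity conditions, it's provably consistent and asymptotically efficient: with more and more data the estimate converges to the true value, and in the limit no other well-behaved estimator is more precise.
It's the foundation for how we measure errors in machine learning (for example, minimizing Mean Squared Error is exactly MLE under the assumption of Gaussian noise).
It's invariant under reparameterization: if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$. For example, the MLE of the variance is the square of the MLE of the standard deviation.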
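The link between Mean Squared Error and MLE can be checked on a toy problem. The sketch below assumes a straight-line model with Gaussian noise of known, fixed standard deviation. The synthetic data, the seed, and the noise level are all made up for illustration. It fits the line twice, once by minimizing the Gaussian negative log-likelihood and once by ordinary least squares.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical noisy line: y = 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.5, size=x.size)

def neg_log_likelihood(params, sigma=1.5):
    """Gaussian negative log-likelihood of y for a line, with fixed noise sigma."""
    slope, intercept = params
    residuals = y - (slope * x + intercept)
    n = y.size
    return (n / 2) * np.log(2 * np.pi * sigma**2) + np.sum(residuals**2) / (2 * sigma**2)

# Fit 1: maximum likelihood (minimize the negative log-likelihood).
mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# Fit 2: ordinary least squares, which minimizes the mean squared error.
ols = np.polyfit(x, y, deg=1)  # returns [slope, intercept]

print("MLE slope, intercept:", mle)
print("OLS slope, intercept:", ols)
```

The two fits agree up to optimizer tolerance. The only part of the Gaussian log-likelihood that depends on the slope and intercept is the sum of squared residuals, so maximizing one is the same as minimizing the other.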

Limitations

If you only have a tiny amount of data, MLE can make extreme, overconfident guesses (overfitting). Three heads in three flips gives an estimate of 100% heads.
It relies completely on you guessing the right 'shape' for your data. If you assume the data follows a bell curve, but it actually follows something else, MLE will confidently fit the wrong model.
For really complex AI models, solving for the exact maximum with calculus is usually impossible, so the estimate has to be found with iterative numerical optimization, which may settle on a local maximum instead of the global one.
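The small-sample problem shows up clearly in simulation. The sketch below assumes a coin with a "true" heads probability of 0.7 and repeats the same experiment many times at two sample sizes, measuring how much the MLE (the fraction of heads) varies from one experiment to the next. The true value, the seed, and the repeat count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.7  # hypothetical "true" coin bias

def mle_spread(n_flips, n_repeats=2000):
    """Std dev of the MLE (fraction of heads) across repeated experiments."""
    heads = rng.binomial(n_flips, true_p, size=n_repeats)
    estimates = heads / n_flips
    return estimates.std()

small = mle_spread(5)
large = mle_spread(5000)

# Extreme case: three flips, all heads, gives an estimate of exactly 1.0,
# which claims that tails can never happen.
three_heads = np.array([1, 1, 1]).mean()

print(f"Spread of the estimate with 5 flips:    {small:.3f}")
print(f"Spread of the estimate with 5000 flips: {large:.3f}")
print(f"Estimate after three heads in a row:    {three_heads}")
```

With only a handful of flips, individual estimates land far from 0.7 and can hit 0 or 1 outright. With thousands of flips they cluster tightly around the true value.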

Key Assumptions

Scope conditions and interpretation notes

1. The samples are independent and identically distributed, or the dependence structure is modeled explicitly.
2. The chosen probability family is a reasonable approximation to the true process.
3. The likelihood surface can be optimized reliably enough for the application.

References

Books and papers for deeper study

Bishop, C. M. (2006) Pattern Recognition and Machine Learning. New York: Springer.
Fisher, R. A. (1922) 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London. Series A, 222(594-604), pp. 309-368.