Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a core concept in statistics and machine learning. It answers a very simple question: "Given the data we just saw, what are the most likely rules that created it?" It's the mathematical engine behind how many algorithms learn from data, from simple trend lines to massive neural networks.
Where is it used?
MLE is incredibly useful when you have a good guess about the general "shape" of your data (like a bell curve), but you don't know the exact details (like where the center of the curve is). It's used everywhere: figuring out the error rate of a factory machine, calculating the true conversion rate of a new website button in an A/B test, or predicting how long a lightbulb will last before burning out.
Intuition
How to think about this algorithm
Imagine you find a weird, weighted coin on the ground. You flip it 10 times, and it lands on heads 7 times. What's your best guess for the true probability of this coin landing on heads?
MLE says your best guess is exactly 70% (or 0.7). Why? Because if the true probability was 0.7, that makes the data you actually saw (7 heads out of 10) more likely to happen than if the probability was 0.5, or 0.9, or anything else. MLE simply finds the exact numbers that make your real-world observations the most mathematically probable outcome.
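The coin intuition above can be checked directly. The sketch below assumes the 10 flips are independent and scores candidate probabilities with the binomial formula. The function name `coin_likelihood` and the 0.01-step grid are illustrative choices, not part of the original text.

```python
from math import comb

def coin_likelihood(p, heads=7, flips=10):
    """Probability of seeing `heads` heads in `flips` flips if P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Candidate values 0.00, 0.01, ..., 1.00 (a coarse grid, chosen for illustration).
candidates = [i / 100 for i in range(101)]
best_p = max(candidates, key=coin_likelihood)

for p in (0.5, 0.7, 0.9):
    print(f"p = {p:.1f} -> likelihood = {coin_likelihood(p):.4f}")
print(f"Best candidate on the grid: {best_p:.2f}")
```

A grid search like this is only practical for one or two parameters, but it makes the idea concrete: the candidate that assigns the highest probability to the observed data wins.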
The Logic
Mathematical core for maximum likelihood estimation
1. The Likelihood Function
Let's say we have a bunch of data points $x_1, x_2, \dots, x_n$. We assume these data points come from some probability distribution (like a bell curve) that is controlled by some unknown settings, which we call parameters $\theta$.
The "Likelihood" of those parameters, given the data we saw, is calculated by multiplying the probability of seeing each individual data point:
$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$
2. Log-Likelihood
Multiplying a bunch of tiny probabilities together quickly results in numbers so small that computers round them down to zero (a problem called numerical underflow). Plus, multiplying is annoying to do calculus on.
To fix this, we take the logarithm of the whole thing. In math, the log of a product becomes the sum of the logs. This turns our multiplication problem into an addition problem, which is much easier for computers to handle:
$$\ell(\theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
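To see why the logarithm matters in practice, here is a small sketch of the underflow problem. The 2,000 data points and the per-point probability of 0.01 are made-up numbers chosen only to trigger the effect.

```python
import math

# 2,000 hypothetical data points, each with probability 0.01 under the model.
probs = [0.01] * 2000

likelihood = math.prod(probs)                     # product of tiny numbers
log_likelihood = sum(math.log(p) for p in probs)  # sum of their logs

print(likelihood)      # the product has collapsed to zero
print(log_likelihood)  # the sum of logs is still an ordinary number
```

The true product is about 10 to the power of -4000, far below what a 64-bit float can represent, while the sum of logs is a perfectly manageable negative number.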
3. Finding the Maximum
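The Maximum Likelihood Estimate (MLE), written as $\hat{\theta}$, is simply the parameter value that makes the log-likelihood $\ell(\theta)$ as big as possible:
$$\hat{\theta} = \arg\max_{\theta} \, \ell(\theta)$$
To find this peak, we use calculus. We take the derivative of the log-likelihood equation, set it to zero, and solve for $\theta$:
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$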
Example: Flipping a Coin
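For a coin flip with probability $p$ of getting heads, the math for a single flip is $P(x \mid p) = p^{x}(1-p)^{1-x}$, where $x = 1$ means heads and $x = 0$ means tails. The log-likelihood for a bunch of flips $x_1, \dots, x_n$ is:
$$\ell(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log(1 - p) \right]$$
Writing $k = \sum_{i=1}^{n} x_i$ for the number of heads, the derivative is $\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p}$. Setting it to zero and solving gives $p = k/n$, so the math proves that your best guess for $p$ is just the fraction of flips that came up heads: $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$.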
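The "derivative equals zero" step can be sanity-checked numerically. Differentiating the coin-flip log-likelihood with respect to the heads probability gives heads/p minus tails/(1 - p). The sketch below evaluates that slope at a few points, using the 7-heads-in-10-flips example from the Intuition section. The function name `score` is an illustrative choice.

```python
def score(p, heads, flips):
    """Derivative of the coin-flip log-likelihood with respect to p."""
    tails = flips - heads
    return heads / p - tails / (1 - p)

heads, flips = 7, 10
p_hat = heads / flips  # closed-form answer: the fraction of heads

print(score(0.5, heads, flips))    # positive: the log-likelihood is still rising
print(score(p_hat, heads, flips))  # approximately zero: this is the peak
print(score(0.9, heads, flips))    # negative: the log-likelihood is falling
```

The slope changes sign from positive to negative as p passes the fraction of heads, which is what you'd expect at a maximum rather than a minimum.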
Code Example
maximum_likelihood_estimation.py · reference implementation
import numpy as np
from scipy.optimize import minimize

data = np.array([2.3, 1.9, 2.5, 2.8, 1.7])

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    n = len(data)
    ll = -(n / 2) * np.log(2 * np.pi) - n * np.log(sigma) - np.sum((data - mu) ** 2) / (2 * sigma ** 2)
    # Minimizing negative log-likelihood is the same as maximizing log-likelihood
    return -ll

result = minimize(neg_log_likelihood, [0.0, 1.0])
mu_est, sigma_est = result.x

print(f"MLE Derived Mean: {mu_est:.2f}")
print(f"MLE Derived Std Dev: {sigma_est:.2f}")

Strengths
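Given a correctly chosen model and some standard regularity conditions, it's mathematically proven to converge to the true parameter values as you get more and more data (consistency), and in that limit its variance is as small as any well-behaved estimator can achieve (asymptotic efficiency).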
It's the foundation for many of the loss functions used in machine learning (for example, minimizing Mean Squared Error is the same as doing MLE under the assumption of Gaussian noise with constant variance).
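It's invariant under reparameterization: if you transform a parameter, the best estimate transforms with it. For example, the MLE of the variance is simply the square of the MLE of the standard deviation.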
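The Mean Squared Error point in the list above can be illustrated with a small sketch. It assumes Gaussian noise with a fixed standard deviation (set to 1.0 here as an arbitrary choice) and reuses the five-point sample from the reference implementation. The helper names and the candidate grid are illustrative.

```python
import math

data = [2.3, 1.9, 2.5, 2.8, 1.7]  # same sample as the reference implementation

def mse(mu):
    """Mean squared error of predicting every point with the constant mu."""
    return sum((x - mu) ** 2 for x in data) / len(data)

def gaussian_nll(mu, sigma=1.0):
    """Negative Gaussian log-likelihood with sigma held fixed (an assumption)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (n / 2) * math.log(2 * math.pi) + n * math.log(sigma) + squared_error / (2 * sigma ** 2)

# Candidate means 0.00, 0.01, ..., 5.00.
candidates = [i / 100 for i in range(501)]
best_by_mse = min(candidates, key=mse)
best_by_nll = min(candidates, key=gaussian_nll)

print(best_by_mse, best_by_nll)
```

Both criteria pick the same candidate because, once sigma is fixed, the negative log-likelihood is just the squared error scaled and shifted by constants.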
Limitations
If you only have a tiny amount of data, MLE can make terrible, overconfident guesses (overfitting).
It completely relies on you guessing the right 'shape' for your data. If you assume the data is a bell curve, but it's actually something else, MLE will give you the wrong answer.
For really complex AI models, finding the exact maximum point using calculus is incredibly difficult or impossible.
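The small-data limitation in the list above is easy to reproduce with the coin example. The three-flip sample below is made up for illustration, and `coin_mle` is a hypothetical helper name.

```python
def coin_mle(flips):
    """MLE of P(heads) for a list of 0/1 outcomes (1 = heads)."""
    return sum(flips) / len(flips)

tiny_sample = [1, 1, 1]  # three flips, all heads (made-up data)
p_hat = coin_mle(tiny_sample)

print(p_hat)      # 1.0
print(1 - p_hat)  # 0.0
```

After only three flips, the estimate claims the coin can never land on tails. Nothing in the likelihood itself pushes back against such an extreme conclusion, which is why small samples call for caution.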
Key Assumptions
Scope conditions and interpretation notes
1. The samples are independent and identically distributed, or the dependence structure is modeled explicitly.
2. The chosen probability family is a reasonable approximation to the true process.
3. The likelihood surface can be optimized reliably enough for the application.
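One way to make the third assumption easier to satisfy for the Gaussian example is to optimize over log(sigma) instead of sigma, so the optimizer can never propose a non-positive standard deviation. This is a sketch of one common approach, not the method the reference implementation uses. It assumes SciPy's default optimizer settings and compares the result with the closed-form Gaussian MLE (sample mean and the divide-by-n standard deviation).

```python
import numpy as np
from scipy.optimize import minimize

data = np.array([2.3, 1.9, 2.5, 2.8, 1.7])

def neg_log_likelihood(params, x):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # always positive, so no special-casing is needed
    n = len(x)
    return (n / 2) * np.log(2 * np.pi) + n * log_sigma + np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Start from mu = 0 and log(sigma) = 0, i.e. sigma = 1.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_est, sigma_est = result.x[0], np.exp(result.x[1])

# Closed-form Gaussian MLE for comparison.
mu_closed = data.mean()
sigma_closed = data.std()  # ddof=0 by default: divides by n, not n - 1

print(f"optimizer reported success: {result.success}")
print(f"mean:    numeric {mu_est:.4f} vs closed form {mu_closed:.4f}")
print(f"std dev: numeric {sigma_est:.4f} vs closed form {sigma_closed:.4f}")
```

Checking `result.success` and comparing against a known closed form, when one exists, is a cheap way to catch an optimizer that stopped somewhere other than the peak.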