Ensemble Learning

Ensemble Learning (RFs & GBMs)

Difficulty: Advanced

Reading Time: 30 min

Track: Practitioner

The strategy of combining hundreds of 'okay' models into a single model that is more accurate than any of them alone.

Prerequisites

Decision Trees

ML Practitioner, Module 10 of 17


TL;DR

An ensemble combines many base learners so the crowd outperforms any single model — the two workhorses are bagging (parallel) and boosting (sequential).
Bagging (e.g. Random Forest) trains decorrelated trees on bootstrap samples and averages them; this mainly cuts variance without raising bias.
Boosting (e.g. Gradient Boosting / XGBoost) fits learners sequentially to the residuals/gradients of the current model; this mainly cuts bias.
Averaging $M$ models with per-model variance $\sigma^2$ and pairwise correlation $\rho$ gives variance $\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2$, so decorrelation ($\rho \to 0$) is what makes the forest powerful.
Boosting can reach lower error than bagging on tabular data but overfits if you add too many trees or use too large a learning rate — control it with shrinkage and early stopping.
For structured/tabular problems, gradient-boosted trees are the go-to default; random forests are the low-tuning, high-robustness baseline.
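The variance formula in the TL;DR can be checked numerically. The sketch below is illustrative only: it assumes the $M$ model outputs are jointly Gaussian with equal variance and equal pairwise correlation, and the values of `M`, `rho`, and `sigma2` are arbitrary choices, not taken from the lesson.

```python
import numpy as np


def averaged_variance(rho, sigma2, M):
    """Variance of the mean of M estimators with equal variance and pairwise correlation."""
    return rho * sigma2 + (1.0 - rho) / M * sigma2


rng = np.random.default_rng(0)
M, rho, sigma2 = 25, 0.3, 4.0  # illustrative values (assumptions)

# Covariance matrix: sigma2 on the diagonal, rho * sigma2 everywhere else.
cov = sigma2 * (rho * np.ones((M, M)) + (1.0 - rho) * np.eye(M))

# Each row is one draw of all M "model outputs"; average across the models.
samples = rng.multivariate_normal(np.zeros(M), cov, size=200_000)
empirical = samples.mean(axis=1).var()
theoretical = averaged_variance(rho, sigma2, M)

print(f"empirical={empirical:.3f}  theoretical={theoretical:.3f}")
```

The two numbers should agree closely. Setting `rho` to 0 in the same script leaves only the $\sigma^2 / M$ term, which is the best case that decorrelation aims for.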

Learning Objectives

Distinguish between bagging and boosting ensemble methodologies
Explain how Random Forests reduce correlation between base learners via feature bagging
Formulate the gradient boosting update step and define pseudo-residuals
Evaluate the trade-off between ensemble size, model diversity, and overfitting

Intuition

How to think conceptually about this topic

Bagging (Random Forests): Imagine you want to guess the exact number of jellybeans in a jar. If you ask one person, they might be way off. But if you ask 1,000 random people and average their guesses, the final answer is usually incredibly close to the truth. The errors of the individuals cancel each other out.

Boosting (Gradient Boosting): Imagine you are taking a difficult math test. You take it once and get a 60%. A tutor looks at your test, ignores the questions you got right, and makes you study only the specific types of questions you got wrong. You take the test again and get an 80%. A second tutor looks at your new mistakes and focuses only on those. After enough rounds, you are scoring close to 100%.


In Depth

Detailed explanations, contexts, and details

Ensemble learning is based on a very simple idea: the wisdom of the crowd. Instead of trying to build one massive, incredibly complex algorithm that gets everything right, you build hundreds of simple, slightly flawed algorithms and have them vote on the final answer.

There are two main ways to do this:

Bagging (like Random Forests): You build hundreds of independent Decision Trees, which can be trained in parallel, and give each tree its own bootstrap sample of the data (rows drawn at random with replacement). At prediction time the trees vote and the majority wins for classification; for regression their outputs are averaged.
Boosting (like XGBoost or Gradient Boosting): You build trees one at a time. The first tree makes a prediction and inevitably gets some things wrong. The second tree is built specifically to fix the mistakes of the first. The third tree fixes the mistakes that remain after the first two are combined, and so on: each new tree targets the remaining errors of the whole ensemble so far.
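The "fix the remaining mistakes" loop can be sketched in a few lines for the squared-error case, where the mistakes are simply the residuals. This is a minimal illustration, not how XGBoost or scikit-learn implement boosting internally; the synthetic sine data, the tree depth, the learning rate, and the helper name `predict` are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1  # shrinkage applied to every tree's contribution
n_rounds = 100

# Start from a constant prediction; the mean minimizes squared error.
baseline = y.mean()
prediction = np.full_like(y, baseline)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction  # errors of the ensemble built so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)  # the new tree learns to predict those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def predict(X_new):
    """Sum the baseline and every tree's shrunken correction."""
    out = np.full(X_new.shape[0], baseline)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Note that the quantity being driven down here is the training error, which is why a held-out set is needed to decide when to stop adding trees.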

Where is it used?

If you have structured data in a spreadsheet (rows and columns), a tree ensemble is usually one of the strongest tools for the job. Ensembles have won a large share of tabular machine learning competitions on platforms like Kaggle, and they are used heavily in algorithmic stock trading, credit risk scoring, and predicting customer behavior.

How It Compares

Bagging vs Boosting vs Stacking

Dimension	Bagging (Random Forest)	Boosting (Gradient Boosting / XGBoost)	Stacking
Primarily reduces	Variance	Bias (and some variance)	Both — corrects systematic errors of base models
Training scheme	Parallel — trees are independent	Sequential — each learner fixes the last	Two-level — base models in parallel, a meta-learner on top
Base learners	Deep, low-bias trees (decorrelated)	Shallow, high-bias weak learners	Heterogeneous models (trees, linear, kNN, ...)
Overfitting risk	Low — more trees rarely hurts	High — too many trees / large $\nu$ overfit	Moderate — meta-learner can overfit without CV
Tuning effort	Low — strong defaults	High — learning rate, depth, n_estimators, early stopping	High — design base models and meta-learner
Interpretability	Low (but cheap importances / OOB)	Low (importances + SHAP common)	Lowest — stacked layers obscure reasoning

Takeaway: Reach for a Random Forest when you want a robust, low-tuning baseline; use gradient boosting to squeeze out the best tabular accuracy at the cost of careful tuning; use stacking to combine genuinely diverse models when the last bit of performance matters.
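Stacking is the least familiar of the three, so a small sketch may help. It assumes scikit-learn's `StackingClassifier`, a synthetic dataset, and an arbitrary pairing of base models (a random forest and kNN) with a logistic-regression meta-learner; none of these choices come from the comparison above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"stacking test accuracy: {stack_accuracy:.2f}")
```

The `cv=5` argument is the design choice that matters: training the meta-learner on out-of-fold predictions is what guards against the overfitting risk listed in the table.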

When to Use It

Reach for this when

You have structured / tabular data (rows and columns) and want strong off-the-shelf accuracy.
A single decision tree overfits (high variance) — bagging/Random Forest will stabilize it.
You need to squeeze out maximum accuracy on a tabular benchmark and can afford to tune — gradient boosting usually wins.
Features are a mix of types with missing values and outliers, which tree ensembles handle gracefully.
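One way to check whether a single tree is overfitting on a given dataset is to compare it with a forest under cross-validation. The sketch below assumes a synthetic, noisy classification problem generated with scikit-learn; the sample size, noise level, and number of trees are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an unpruned tree tends to memorize.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

On noisy data like this, the forest would typically score higher on average than the single tree, though the exact gap depends on the dataset.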

Avoid it when

You need a transparent, fully interpretable model for regulators or stakeholders — a single tree or linear model is easier to explain.
The signal is genuinely linear and low-dimensional — a regularized linear model is faster and just as accurate.
You are working with unstructured data (images, audio, raw text) where deep learning dominates.
You have a very tight latency or memory budget at inference — hundreds of trees can be too heavy.

Rules of thumb

Start with a Random Forest as a tough baseline before investing time tuning a boosting model.
For boosting, use a small learning rate ($\nu \approx 0.01$ to $0.1$) with early stopping on a validation set rather than guessing n_estimators.
Diversity is the whole game: decorrelated base learners help, near-identical ones do not.
Use out-of-bag (OOB) error for a free validation estimate on bagged ensembles.
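The out-of-bag estimate from the last rule of thumb is available in scikit-learn through the `oob_score=True` option. The sketch below uses a synthetic dataset and an arbitrary forest size, and holds out a test set only to show that the two estimates usually land close to each other.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree is scored on the training rows left out of its bootstrap sample.
forest = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Because every tree sees only a bootstrap sample, the OOB estimate costs no extra training and needs no separate validation split.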

Implementation

Reference code implementation

Python

model_fitting.py

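import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the first two of five features.
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold out a test set so accuracy is not measured on the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random Forest: 50 bagged trees whose votes are combined.
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest Accuracy: {rf.score(X_test, y_test):.2f}")

# Gradient Boosting: 50 shallow trees added sequentially with shrinkage 0.1.
gbm = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42
)
gbm.fit(X_train, y_train)
print(f"Gradient Boosting Accuracy: {gbm.score(X_test, y_test):.2f}")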
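The Random Forest above relies on feature bagging: each split considers only a random subset of features, which is meant to lower the correlation between trees. The sketch below tries to measure that effect. It assumes a synthetic regression dataset, and the helper `mean_tree_correlation` is a hypothetical name introduced for this example, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def mean_tree_correlation(forest, X):
    """Average pairwise correlation between the predictions of individual trees."""
    per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
    corr = np.corrcoef(per_tree)
    off_diagonal = corr[~np.eye(len(per_tree), dtype=bool)]
    return off_diagonal.mean()


X, y = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
# max_features=1.0 uses every feature at each split (plain bagging);
# "sqrt" samples a random subset of features at each split.
for label, max_features in [("all features", 1.0), ("sqrt", "sqrt")]:
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_features, random_state=0
    )
    forest.fit(X_train, y_train)
    results[label] = mean_tree_correlation(forest, X_test)
    print(f"{label:>12}: mean tree correlation = {results[label]:.3f}")
```

One would typically expect the `"sqrt"` forest to show a lower average correlation between trees, which is the $\rho$ in the variance formula from the TL;DR.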

Strengths & Advantages

Tree ensembles are widely regarded as one of the strongest approaches for standard tabular data (like Excel spreadsheets or SQL tables).
Random Forests require very little tuning. You can usually run them with default settings and get a strong result.
Tree splits depend only on the ordering of feature values, so ensembles are robust to outliers and feature scaling, and they cope well with a mix of numeric and categorical columns. Many implementations, XGBoost among them, also handle missing values natively.

Limitations & Drawbacks

They are 'black boxes'. Because the final answer is a combination of hundreds of trees, it is very difficult to explain exactly why the model made a specific prediction.
Boosting models (like XGBoost) are sensitive to their settings. With too high a learning rate or too many trees, they overfit and start to memorize the training data.
They are large and slow to train compared to simple models like Linear Regression.
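The overfitting risk in boosting is usually handled with a small learning rate plus early stopping. The sketch below uses scikit-learn's `GradientBoostingClassifier` with its built-in validation split; the dataset is synthetic and the specific values (500 trees at most, patience of 10 rounds) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

gbm = GradientBoostingClassifier(
    n_estimators=500,         # upper bound, not a target
    learning_rate=0.05,       # shrinkage
    max_depth=3,
    validation_fraction=0.2,  # internal hold-out used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gbm.fit(X, y)

# n_estimators_ is the number of trees actually kept after early stopping.
print(f"trees requested: {gbm.n_estimators}, trees used: {gbm.n_estimators_}")
```

Setting a generous `n_estimators` and letting the validation loss decide the stopping point avoids having to guess the right number of trees up front.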

Real-World Case Studies

Gradient boosting dominating tabular machine-learning competitions

Competitive ML / tabular prediction

Scenario

Teams competing on structured tabular datasets (click prediction, ranking, risk scoring) need the highest possible accuracy. The question is which model family consistently delivers winning results on this kind of data.

Approach

Chen and Guestrin introduced XGBoost, a scalable, regularized gradient-boosted tree system with sparsity-aware split finding, cache-aware data access, and out-of-core computation, and benchmarked it across competition and production workloads.

Outcome

XGBoost became the dominant tool for tabular problems: among the 29 winning solutions published on Kaggle in 2015, 17 used XGBoost (8 of them as the sole learner), and it powered the top entries of KDD Cup 2015. It also ran roughly an order of magnitude faster than existing implementations on a single machine and scaled to billions of examples. The lesson: well-tuned gradient boosting, rather than a single tree or a linear model, is the strong default on structured data.

Source: XGBoost: A Scalable Tree Boosting System — Chen, T. and Guestrin, C.

Common Misconceptions

Misconception: Random Forests and Gradient Boosting always give identical predictions.

Correction: Random Forests average predictions of independent trees (reducing variance), whereas Gradient Boosting adds predictions of sequential correction trees (reducing bias). Their decision boundaries and error profiles are different.
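The difference between averaging and adding can be written out explicitly. The notation below is an assumption made for this sketch: $T_m$ is the $m$-th tree in a forest, $F_m$ is the boosted model after $m$ rounds, $h_m$ is the tree fitted in round $m$, $L$ is the loss, $r_{im}$ is the pseudo-residual of example $i$, and $\nu$ is the learning rate used elsewhere in this lesson.

```latex
% Random Forest: average of M independently grown trees
\hat{f}_{\mathrm{RF}}(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x)

% Gradient boosting: pseudo-residuals are the negative gradient of the loss
% evaluated at the current model F_{m-1}
r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}

% A tree h_m is fitted to the pairs (x_i, r_{im}) and added with shrinkage \nu
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)

% Special case: squared error L = \tfrac{1}{2}(y - F)^2 gives ordinary residuals
r_{im} = y_i - F_{m-1}(x_i)
```

In the forest every tree has the same weight and could be trained in any order, while in boosting each $h_m$ depends on all the trees that came before it.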

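Misconception: An ensemble model with 1,000 trees is always much better than one with 100 trees.

Correction: For bagged ensembles, the part of the variance that averaging can remove shrinks as $1/M$, so the gains flatten out after a certain number of trees (often on the order of 100), while the correlated part $\rho\sigma^2$ never averages away. Beyond that point, extra trees add training and inference time for negligible accuracy gains. In boosting, adding too many trees can even hurt by overfitting.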
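Plugging numbers into the variance formula from the TL;DR makes the diminishing returns concrete. The values $\rho = 0.3$ and $\sigma^2 = 1$ below are illustrative assumptions, not measurements from a real forest.

```python
def averaged_variance(rho, sigma2, M):
    """Variance of the mean of M estimators with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / M * sigma2


rho, sigma2 = 0.3, 1.0  # illustrative values (assumptions)

for M in (1, 10, 100, 1000):
    print(f"M={M:>4}: variance = {averaged_variance(rho, sigma2, M):.4f}")

gain_10_to_100 = averaged_variance(rho, sigma2, 10) - averaged_variance(rho, sigma2, 100)
gain_100_to_1000 = averaged_variance(rho, sigma2, 100) - averaged_variance(rho, sigma2, 1000)
print(f"gain from 10 -> 100 trees:   {gain_10_to_100:.4f}")
print(f"gain from 100 -> 1000 trees: {gain_100_to_1000:.4f}")
```

Under these assumptions the variance falls from 0.37 at 10 trees to 0.307 at 100 trees, but only to 0.3007 at 1,000 trees, because the floor $\rho\sigma^2 = 0.3$ cannot be averaged away.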

References & Further Reading

Greedy Function Approximation: A Gradient Boosting Machine (paper)
By Friedman, J. H.
Ensemble Methods: Foundations and Algorithms (textbook)
By Zhou, Z.-H.
