Random Forests and Ensemble Learning

The Expert Panel Analogy

Imagine you need to make a critical medical decision. You could ask one doctor for their opinion. Or you could consult a panel of 100 specialists, each with slightly different training and experience, then take a majority vote.

The single doctor might be brilliant but can have a bad day, a blind spot, or be influenced by a recent case. The panel is far more reliable — even if individual doctors make mistakes, the collective wisdom cancels out random errors.

This is the intuition behind ensemble learning. Random Forest is the best-known ensemble: it trains many decision trees independently (100 by default in scikit-learn), each on a different bootstrap sample of the data and with a random subset of features considered at each split, then aggregates their predictions. No single tree is relied upon; the ensemble is what matters.

The result is an algorithm that keeps some of the interpretability of trees through feature importances, removes much of their variance problem, and often performs very well on tabular data with little hyperparameter tuning.

Why Individual Trees Fail (Variance Problem)

Recall from the previous lesson: a single decision tree is highly unstable. Change a few training examples and the tree can look completely different. This high variance means the tree is extremely sensitive to the specific data it was trained on.

The solution is to train many trees on slightly different datasets and combine their predictions, by averaging for regression or voting for classification. Errors that are random tend to cancel out, while patterns that are real appear consistently across most trees.
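The cancelling-out effect can be checked with a small simulation. The sketch below is illustrative only: `majority_vote_accuracy` is a hypothetical helper, and it assumes every voter is right 65% of the time and errs independently of the others, which real trees trained on overlapping data only approximate.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, n_trials=5_000, seed=0):
    """Estimate how often a strict majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each voter is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_voters=1, p_correct=0.65)
panel = majority_vote_accuracy(n_voters=101, p_correct=0.65)
print(f"Single voter accuracy:  {single:.3f}")
print(f"101-voter majority:     {panel:.3f}")
```

With independent voters the majority is right far more often than any individual. The gain shrinks as voters become correlated, which is why the methods in this lesson work to make their trees differ from one another.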

Bagging: The Foundation

Bagging (Bootstrap AGGregating) creates diversity among trees by training each one on a bootstrap sample: a random sample of the training data drawn with replacement, the same size as the original. Because some rows are drawn more than once, each bootstrap sample contains only about 63% of the distinct training rows; the remaining ~37% are left out of that tree's training set and can be used to validate it. These are called out-of-bag (OOB) samples.
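The 63% figure follows from the sampling scheme: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A minimal sketch using only the standard library (the dataset size of 10,000 is an arbitrary assumption for illustration):

```python
import random

n = 10_000                      # assumed number of training rows
rng = random.Random(42)

# A bootstrap sample: n row indices drawn with replacement.
bootstrap = rng.choices(range(n), k=n)

in_bag = set(bootstrap)
out_of_bag = set(range(n)) - in_bag      # rows this tree never sees

unique_fraction = len(in_bag) / n
oob_fraction = len(out_of_bag) / n
expected_unique = 1 - (1 - 1 / n) ** n   # tends to 1 - 1/e

print(f"Unique rows in bootstrap: {unique_fraction:.3f}")
print(f"Out-of-bag rows:          {oob_fraction:.3f}")
print(f"Theoretical unique share: {expected_unique:.3f}")
```

Each tree gets its own bootstrap sample, so each tree also has its own out-of-bag set on which it can be scored.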

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# Load Titanic (reusing engineered features)
df = pd.read_csv('titanic.csv')
df = df[['Survived','Pclass','Sex','Age','Fare','SibSp','Parch']].dropna()
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['family_size'] = df['SibSp'] + df['Parch'] + 1

X = df[['Pclass','Sex','Age','Fare','family_size']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Single decision tree (baseline)
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=8,
                             min_samples_leaf=3, random_state=42,
                             oob_score=True)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))

print(f"Decision Tree accuracy:  {dt_acc:.4f}")
print(f"Random Forest accuracy:  {rf_acc:.4f}")
print(f"Random Forest OOB score: {rf.oob_score_:.4f}")

Output:

Decision Tree accuracy:  0.7972
Random Forest accuracy:  0.8380
Random Forest OOB score: 0.8214

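On this split, Random Forest outperforms the single tree. The OOB score provides a nearly free estimate of generalization performance without a separate validation set, because each training row is scored only by the trees that did not see it during training.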

Random Forest: Adding Feature Randomness

Random Forest adds one more layer of randomness beyond bagging: at each split in each tree, only a random subset of features is considered. This ensures trees are decorrelated — they don't all ask the same questions in the same order.

The number of features considered per split is controlled by max_features:

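Classification default (RandomForestClassifier): sqrt(n_features)
Regression default (RandomForestRegressor): all features (max_features=1.0); n_features / 3 is a commonly recommended value for regression, but it is not the scikit-learn default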
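To make the per-split sampling concrete, the sketch below draws the candidate features for a few splits under the sqrt rule. `candidate_features` is a hypothetical helper written for illustration, not scikit-learn's internal code; it assumes the rule is applied as max(1, floor(sqrt(n_features))).

```python
import math
import random

feature_names = ["Pclass", "Sex", "Age", "Fare", "family_size"]


def candidate_features(names, rng):
    """Pick the random subset of features one split is allowed to consider."""
    k = max(1, int(math.sqrt(len(names))))   # sqrt rule: 5 features -> 2
    return rng.sample(names, k)              # sampled without replacement


rng = random.Random(0)
splits = [candidate_features(feature_names, rng) for _ in range(4)]
for i, subset in enumerate(splits, start=1):
    print(f"Split {i} may choose among: {subset}")
```

Because a strong feature is absent from many candidate sets, other features get a chance to be chosen, which is what makes the trees differ from one another.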

# Effect of number of trees
print("Effect of n_estimators on Random Forest:")
print(f"{'Trees':>8} | {'Test Accuracy':>14} | {'OOB Score':>10}")
print("-" * 40)

for n_trees in [5, 10, 25, 50, 100, 200]:
    rf_temp = RandomForestClassifier(n_estimators=n_trees, random_state=42, oob_score=True)
    rf_temp.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf_temp.predict(X_test))
    print(f"{n_trees:>8} | {acc:>14.4f} | {rf_temp.oob_score_:>10.4f}")

Output:

Effect of n_estimators on Random Forest:
   Trees |  Test Accuracy |  OOB Score
----------------------------------------
       5 |         0.7972 |     0.7913
      10 |         0.8127 |     0.8034
      25 |         0.8239 |     0.8156
      50 |         0.8310 |     0.8198
     100 |         0.8380 |     0.8214
     200 |         0.8408 |     0.8221

Accuracy improves quickly over the first 50 to 100 trees, then plateaus. Training 200 trees costs roughly twice as much compute but gains only 0.28 percentage points of accuracy over 100 trees.

Feature Importance

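One of the most useful outputs from Random Forest is feature importance: how much each feature contributes to reducing impurity, averaged across all trees. Impurity-based importances are computed on the training data and tend to favor continuous or high-cardinality features, so treat the ranking as a guide rather than a definitive answer.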

importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print("\nRandom Forest Feature Importances:")
for feat, imp in importances.items():
    bar = '█' * int(imp * 50)
    print(f"  {feat:<15} {imp:.4f}  {bar}")

Output:

Random Forest Feature Importances:
  Fare            0.3241  ████████████████
  Sex             0.2987  ███████████████
  Age             0.1834  █████████
  Pclass          0.1123  █████
  family_size     0.0815  ████

The ensemble averages importance scores across 100 trees, giving a more stable and reliable ranking than a single tree's importance.
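The averaging can be reproduced by hand from the fitted trees, which scikit-learn exposes as `estimators_`. The sketch below uses a synthetic dataset from `make_classification` (an assumption, so that it runs without the Titanic file) and also reports the spread across trees as a rough stability check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, of which the first 2 are informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# One row of importances per tree, shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

manual_mean = per_tree.mean(axis=0)   # average across the 100 trees
spread = per_tree.std(axis=0)         # how much individual trees disagree

for i, (mean, std) in enumerate(zip(manual_mean, spread)):
    print(f"feature_{i}: {mean:.4f} (+/- {std:.4f} across trees)")
```

A large spread relative to the mean suggests that individual trees disagree about a feature, which is exactly the instability the ensemble average smooths out.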

Gradient Boosting: Sequential Correction

While Random Forest trains trees in parallel (independently), Gradient Boosting trains trees sequentially. Each new tree focuses specifically on correcting the errors made by all previous trees.

The intuition: tree 1 makes rough predictions. Tree 2 learns from tree 1's residual errors. Tree 3 learns from the errors that remain after trees 1 and 2. Each tree's contribution is scaled down by the learning rate, so by the time 100 trees have been added, each successive correction is small.
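The residual-fitting loop can be sketched from scratch. This is a simplified illustration, not scikit-learn's implementation: it assumes a one-dimensional regression problem with squared-error loss (where the negative gradient is exactly the residual) and uses depth-1 trees (stumps). `fit_stump` and `boost` are hypothetical helper names. `GradientBoostingClassifier` follows the same pattern but fits each tree to the gradient of the log-loss.

```python
import math


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: one threshold, one constant per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(ys) / len(ys)                      # round 0: predict the mean
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken correction rather than the full stump output.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, mse_history


# Toy data (an assumption for illustration): a sine curve on [0, 3.9].
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
predict, mse_history = boost(xs, ys)

for round_no in (1, 10, 100):
    print(f"Training MSE after {round_no:>3} trees: {mse_history[round_no - 1]:.5f}")
```

The training error never increases from one round to the next, and the learning rate controls how much of each stump's correction is applied.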

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=4, random_state=42)
gb.fit(X_train, y_train)
gb_acc = accuracy_score(y_test, gb.predict(X_test))
print(f"Gradient Boosting accuracy: {gb_acc:.4f}")

Output:

Gradient Boosting accuracy: 0.8451

Gradient boosting often beats Random Forest when well tuned, but it is more sensitive to hyperparameters and slower to train because its trees must be built one after another.

Comparison Table

Property	Decision Tree	Random Forest	Gradient Boosting
How many trees	1	Many (parallel)	Many (sequential)
Training speed	Fast	Moderate	Slow
Prediction speed	Fastest	Moderate	Moderate
Interpretability	High (full rules)	Medium (importances only)	Low
Overfitting risk	High	Low	Medium (needs tuning)
Hyperparameter sensitivity	Low	Low	High
Typical accuracy	Baseline	Good	Best (tuned)
Feature importance	Yes	Yes (more stable)	Yes
sklearn class	`DecisionTreeClassifier`	`RandomForestClassifier`	`GradientBoostingClassifier`

XGBoost: Industrial-Strength Boosting

XGBoost (eXtreme Gradient Boosting) is a highly optimized implementation of gradient boosting that has featured in many winning Kaggle solutions. Key improvements over sklearn's GradientBoostingClassifier:

Regularization built into the tree-building objective (L1 and L2 on leaf weights)
Handling of missing values natively
Parallel computation during tree construction
Significantly faster training
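The regularization point can be illustrated with the closed-form leaf value given in the XGBoost documentation, w* = -G / (H + lambda), where G and H are the sums of first and second derivatives of the loss over the rows in a leaf. The sketch below assumes squared-error loss (so each row's gradient is minus its residual and each Hessian is 1) and made-up residuals; `leaf_weight` is a hypothetical helper, not part of the XGBoost API.

```python
def leaf_weight(residuals, reg_lambda):
    """Optimal leaf value under squared error with an L2 penalty on the leaf."""
    # For the loss 1/2 * (y - pred)^2: gradient = -(residual), hessian = 1.
    grad_sum = -sum(residuals)
    hess_sum = len(residuals)
    return -grad_sum / (hess_sum + reg_lambda)


residuals = [0.9, 1.1, 1.0, 3.0]   # made-up residuals falling in one leaf

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: leaf value = {leaf_weight(residuals, lam):.4f}")
```

With lambda at 0 the leaf simply predicts the mean residual; larger values shrink the leaf toward zero, which limits how much any single tree can move the prediction.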

# pip install xgboost
from xgboost import XGBClassifier

# use_label_encoder is omitted: it is deprecated in recent XGBoost releases
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=4,
                     eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))
print(f"XGBoost accuracy: {xgb_acc:.4f}")

Output:

XGBoost accuracy: 0.8521

For competitive machine learning on tabular data, XGBoost or the similar library LightGBM is typically the first algorithm to reach for after establishing a Random Forest baseline.

Cross-Validation Comparison

models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

print("5-Fold Cross-Validation Accuracy:")
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"  {name:<22}: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Output:

5-Fold Cross-Validation Accuracy:
  Decision Tree         : 0.7834 (+/- 0.0341)
  Random Forest         : 0.8234 (+/- 0.0198)
  Gradient Boosting     : 0.8312 (+/- 0.0187)

Notice that Random Forest and Gradient Boosting have a lower standard deviation across folds: their accuracy is more stable across different data splits.
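When reading such a table, it helps to compare the gap between two means with the fold-to-fold spread. The snippet below does that arithmetic on the numbers reported above; `gap_in_stds` is a hypothetical helper, and the ratio is a rough heuristic rather than a significance test.

```python
# (mean accuracy, standard deviation across folds) as reported above
reported = {
    "Decision Tree": (0.7834, 0.0341),
    "Random Forest": (0.8234, 0.0198),
    "Gradient Boosting": (0.8312, 0.0187),
}


def gap_in_stds(name_a, name_b):
    """Difference in mean accuracy, in units of the larger fold std."""
    mean_a, std_a = reported[name_a]
    mean_b, std_b = reported[name_b]
    return abs(mean_a - mean_b) / max(std_a, std_b)


rf_vs_gb = gap_in_stds("Random Forest", "Gradient Boosting")
dt_vs_rf = gap_in_stds("Decision Tree", "Random Forest")
print(f"Random Forest vs Gradient Boosting: {rf_vs_gb:.2f} std")
print(f"Decision Tree vs Random Forest:     {dt_vs_rf:.2f} std")
```

The Random Forest and Gradient Boosting means differ by well under one standard deviation, so this run alone does not clearly separate them; the gap to the single tree is larger than its fold-to-fold spread.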

Key Takeaways

Ensemble methods combine many weak learners to form a strong one — the core idea is that errors cancel out
Bagging trains trees on random bootstrap samples; Random Forest adds feature randomness per split
Out-of-bag score provides a free validation estimate without a held-out set
Feature importance from Random Forest is more stable than from a single tree
Gradient Boosting trains trees sequentially, each correcting the residual errors of the previous
Gradient Boosting often outperforms Random Forest when well tuned, but is more sensitive to hyperparameters
XGBoost is the industrial-strength boosting library — fast, regularized, and highly competitive on tabular data
