Random Forests & Ensemble Methods
Random Forests: Power Through Diversity
A single decision tree is like asking one person for their opinion — useful, but limited by that person's blind spots. A Random Forest is like asking 500 people from different backgrounds. Their collective wisdom is far more reliable than any individual.
This intuition explains why Random Forests are one of the most consistently effective ML algorithms across diverse real-world problems.
The Ensemble Idea
Single Decision Tree:
- High variance (overfit to training data)
- Sensitive to small data changes
- Accuracy plateaus early (say ~85% on a given task; an illustrative figure, not a benchmark)
100 Diverse Decision Trees (Random Forest):
- Individual errors cancel out
- More stable predictions
- Higher accuracy on the same task (say ~93%+; again illustrative, the real gain depends on the data)
The key word is diverse. If all trees made the same mistakes, averaging them would help nothing. Random Forests create diversity through two mechanisms:
1. Bootstrap sampling — each tree trains on a different random sample of the data (with replacement)
2. Feature randomness — at each split, each tree considers only a random subset of features
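Why diversity matters can be sketched with a small simulation. This is an illustration under simplifying assumptions, not a model of real trees: each "tree" makes a unit-variance error, and any two trees' errors have correlation `rho`. Under those assumptions the variance of the averaged error is `rho + (1 - rho) / n_trees`, so identical trees (`rho = 1`) gain nothing from averaging while uncorrelated trees (`rho = 0`) gain the most. The function name and parameters are hypothetical.

```python
import numpy as np

def variance_of_averaged_errors(n_trees, rho, n_trials=20_000, seed=0):
    """Simulate n_trees unit-variance errors with pairwise correlation rho
    and return the empirical variance of their average."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))         # mistake common to every tree
    private = rng.normal(size=(n_trials, n_trees))  # mistake unique to each tree
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    return errors.mean(axis=1).var()

for rho in (1.0, 0.5, 0.0):
    simulated = variance_of_averaged_errors(100, rho)
    expected = rho + (1.0 - rho) / 100
    print(f"rho={rho:.1f}: simulated={simulated:.3f}, formula={expected:.3f}")
```

Bootstrap sampling and feature randomness are both ways of pushing the correlation between trees down.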
How Random Forests Work
Training:
For each of n_estimators trees:
1. Draw bootstrap sample (random rows with replacement, ~63% unique)
2. For each split, randomly select sqrt(n_features) candidate features
3. Find best split among those candidates
4. Grow tree fully (or to max_depth)
Prediction (Classification):
Each tree votes → majority wins
Prediction (Regression):
Each tree predicts a value → average all predictions
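The training and voting steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements it internally: the variable names are made up for the example, and the hard majority vote follows the description above, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xm, ym = load_breast_cancer(return_X_y=True)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    Xm, ym, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_manual_trees = 25  # odd, so a binary majority vote cannot tie
manual_trees = []
for i in range(n_manual_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    rows = rng.integers(0, len(Xm_train), size=len(Xm_train))
    # Steps 2-4: sqrt(n_features) candidates per split, tree grown fully
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(Xm_train[rows], ym_train[rows])
    manual_trees.append(tree)

# Prediction: every tree votes, the majority class wins
votes = np.array([tree.predict(Xm_test) for tree in manual_trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
manual_accuracy = (majority == ym_test).mean()
print(f"Hand-rolled forest accuracy: {manual_accuracy:.3f}")
```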
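The out-of-bag (OOB) samples (the ~37% of rows a given tree did not draw in its bootstrap sample) act as a built-in validation set: each row is scored only by the trees that never saw it, which gives a generalization estimate without a separate validation split. A held-out test set is still worth keeping for the final evaluation, especially after hyperparameter tuning.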
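The ~63% / ~37% split follows from sampling with replacement: the chance that a given row is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368. A quick check, with an arbitrary sample size chosen for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size for the demonstration

# One bootstrap sample: n_samples row indices drawn with replacement
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)

unique_fraction = np.unique(bootstrap_rows).size / n_samples
oob_fraction = 1.0 - unique_fraction

print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Out-of-bag rows:                     {oob_fraction:.3f}")
print(f"Theoretical OOB fraction (1/e):      {np.exp(-1):.3f}")
```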
Building a Random Forest
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import numpy as np
import matplotlib.pyplot as plt
# Classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_features='sqrt',    # Features per split: 'sqrt' for classification
    max_depth=None,         # Let trees grow fully
    min_samples_leaf=1,
    bootstrap=True,         # Use bootstrap sampling
    oob_score=True,         # Calculate out-of-bag score
    n_jobs=-1,              # Use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
print(f"Test Accuracy: {rf.score(X_test, y_test):.3f}")
print(f"OOB Score: {rf.oob_score_:.3f}") # Out-of-bag estimate
print(classification_report(y_test, rf.predict(X_test), target_names=cancer.target_names))
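To see how the forest compares with a single tree on this dataset, a cross-validated comparison is a reasonable sanity check. This is a sketch with default settings and illustrative variable names. The size of the gap varies with the dataset and the random seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

Xc, yc = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(single_tree, Xc, yc, cv=5)
forest_scores = cross_val_score(forest, Xc, yc, cv=5)

print(f"Single tree:   {tree_scores.mean():.3f} ± {tree_scores.std():.3f}")
print(f"Random forest: {forest_scores.mean():.3f} ± {forest_scores.std():.3f}")
```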
Feature Importance
Random Forests provide robust feature importance estimates by averaging importance across all trees.
import pandas as pd
# Feature importances (mean decrease in impurity)
importances = pd.Series(
    rf.feature_importances_,
    index=cancer.feature_names
).sort_values(ascending=False)
print("Top 10 Most Important Features:")
print(importances.head(10))
# Plot
plt.figure(figsize=(10, 8))
importances.head(15).plot(kind='barh')
plt.title('Random Forest Feature Importances')
plt.xlabel('Mean Decrease in Impurity')
plt.tight_layout()
plt.show()
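The averaging across trees can be inspected directly through the fitted trees in `estimators_`. The sketch below refits a forest under illustrative names so it runs on its own. It checks that the per-tree mean matches `feature_importances_`, and uses the standard deviation across trees as a rough indication of how stable each score is.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
imp_forest = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
imp_forest.fit(data.data, data.target)

# One row of impurity-based importances per tree
per_tree_importances = np.array(
    [tree.feature_importances_ for tree in imp_forest.estimators_]
)
mean_importance = per_tree_importances.mean(axis=0)
std_importance = per_tree_importances.std(axis=0)

print("Mean over trees matches feature_importances_:",
      np.allclose(mean_importance, imp_forest.feature_importances_))

# Top five features with their spread across trees
for i in np.argsort(mean_importance)[::-1][:5]:
    print(f"{data.feature_names[i]:<25} {mean_importance[i]:.3f} ± {std_importance[i]:.3f}")
```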
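Permutation importance is often more reliable than the default impurity-based importance. Impurity-based scores are computed on the training data and tend to favor features with many distinct values (continuous or high-cardinality categorical features), whereas permutation importance can be measured on held-out data: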
from sklearn.inspection import permutation_importance
perm_imp = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
perm_importances = pd.Series(
    perm_imp.importances_mean,
    index=cancer.feature_names
).sort_values(ascending=False)
print("Permutation Importances:")
print(perm_importances.head(10))
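One way to probe the difference between the two measures is to append a column of pure random noise and see what each one assigns to it. This is a sketch with hypothetical variable names. The noise column carries no signal by construction, so any impurity-based importance it receives comes from the trees fitting noise in the training data, while its permutation importance on held-out data should stay near zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
rng = np.random.default_rng(42)

# Append one continuous feature that is unrelated to the target
noise = rng.normal(size=(data.data.shape[0], 1))
X_noisy = np.hstack([data.data, noise])

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noisy, data.target, test_size=0.2, random_state=42
)
noisy_forest = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
noisy_forest.fit(Xn_train, yn_train)

impurity_noise = noisy_forest.feature_importances_[-1]
perm = permutation_importance(
    noisy_forest, Xn_test, yn_test, n_repeats=10, random_state=42, n_jobs=-1
)
perm_noise = perm.importances_mean[-1]

print(f"Impurity importance of noise column:    {impurity_noise:.4f}")
print(f"Permutation importance of noise column: {perm_noise:.4f}")
```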
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 5, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'bootstrap': [True, False]
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50,  # Try 50 random combinations
    cv=5,
    scoring='f1',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print("Best params:", random_search.best_params_)
print("Best CV F1:", random_search.best_score_)
Key hyperparameters and their effects:
| Parameter | Effect | Tuning Strategy |
|---|---|---|
| n_estimators | More = better (diminishing returns) | 100 is usually sufficient |
| max_depth | Controls overfitting | None (full) often works; tune if overfitting |
| max_features | Key diversity lever | 'sqrt' for classification, 1/3 for regression |
| min_samples_leaf | Smoother predictions | Increase for noisy data |
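For a single parameter such as `max_features`, a quick sweep scored with the out-of-bag estimate can be cheaper than a full cross-validated search. The sketch below uses illustrative names and an arbitrary set of candidate values. On a small dataset the differences are often within noise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

Xs, ys = load_breast_cancer(return_X_y=True)

oob_by_max_features = {}
for max_features in ("sqrt", "log2", 0.5, 1.0):  # 1.0 = consider every feature
    sweep_model = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        n_jobs=-1,
        random_state=42,
    )
    sweep_model.fit(Xs, ys)
    oob_by_max_features[str(max_features)] = sweep_model.oob_score_

for setting, score in oob_by_max_features.items():
    print(f"max_features={setting:<5} OOB accuracy={score:.3f}")
```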
Random Forests for Regression
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,  # 1/3 of features per split (recommended for regression)
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")
Prediction Uncertainty with Random Forests
One underused feature: the spread of the individual trees' predictions gives a rough, uncalibrated measure of how uncertain the forest is about each prediction.
# Get predictions from each tree
tree_predictions = np.array([tree.predict(X_test) for tree in rf_reg.estimators_])
# Mean prediction
mean_pred = tree_predictions.mean(axis=0)
# Uncertainty (standard deviation across trees)
uncertainty = tree_predictions.std(axis=0)
# High uncertainty = predictions to scrutinize
high_uncertainty_idx = np.argsort(uncertainty)[-5:]
print("Most uncertain predictions:")
for idx in high_uncertainty_idx:
    print(f" Prediction: {mean_pred[idx]:.2f} ± {uncertainty[idx]:.2f}, True: {y_test[idx]:.2f}")
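Two properties of the per-tree approach are worth checking before relying on it. The sketch below uses the bundled diabetes dataset and hypothetical variable names so it runs without a download. It confirms that the mean over trees reproduces `predict`, and it measures how often the true value falls within two standard deviations of that mean. The coverage figure is only a diagnostic: the spread across trees is a heuristic, not a calibrated interval.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Xd, yd = load_diabetes(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.2, random_state=42
)

diab_forest = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
diab_forest.fit(Xd_train, yd_train)

# Shape: (n_trees, n_test_samples)
diab_tree_preds = np.array([tree.predict(Xd_test) for tree in diab_forest.estimators_])
diab_mean = diab_tree_preds.mean(axis=0)
diab_spread = diab_tree_preds.std(axis=0)

# The forest's prediction is the average of its trees
mean_matches_predict = np.allclose(diab_mean, diab_forest.predict(Xd_test))

# Fraction of test targets inside mean ± 2 standard deviations
coverage = np.mean(np.abs(yd_test - diab_mean) <= 2 * diab_spread)

print("Mean over trees equals predict():", mean_matches_predict)
print(f"Share of true values within ±2 std: {coverage:.2f}")
```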
How Many Trees Do You Need?
# Error vs number of trees
# Reload the classification split: X_train/X_test currently hold the housing regression data
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tree_counts = range(10, 300, 10)
test_errors = []
oob_errors = []
for n_trees in tree_counts:
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    test_errors.append(1 - rf.score(X_test, y_test))
    oob_errors.append(1 - rf.oob_score_)
plt.figure(figsize=(10, 6))
plt.plot(tree_counts, test_errors, label='Test Error')
plt.plot(tree_counts, oob_errors, label='OOB Error', linestyle='--')
plt.xlabel('Number of Trees')
plt.ylabel('Error')
plt.title('Error vs Number of Trees')
plt.legend()
plt.show()
# Error typically stops improving significantly around 100-200 trees
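Refitting a fresh forest for every tree count repeats a lot of work. One alternative, sketched here with illustrative names, is scikit-learn's `warm_start=True`, which keeps the trees already grown and only adds new ones on each call to `fit`. The OOB score is recomputed each time trees are added.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

Xw, yw = load_breast_cancer(return_X_y=True)

growing_forest = RandomForestClassifier(
    warm_start=True,   # reuse existing trees across fit() calls
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)

warm_oob_errors = {}
for n_trees in range(50, 301, 50):
    growing_forest.set_params(n_estimators=n_trees)
    growing_forest.fit(Xw, yw)  # grows only the trees that are missing
    warm_oob_errors[n_trees] = 1 - growing_forest.oob_score_

for n_trees, err in warm_oob_errors.items():
    print(f"{n_trees:>3} trees: OOB error = {err:.4f}")
```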
When to Use Random Forests
Strong choice when:
- You need high accuracy with minimal tuning
- Interpretability (feature importance) matters
- You want a rough uncertainty estimate (the spread of predictions across trees)
- Data has missing values (some RF implementations handle them)
- You have a moderate-sized tabular dataset
Consider alternatives when:
- Maximum accuracy on structured data (try XGBoost/LightGBM — often better)
- Prediction speed matters (a single tree is much faster)
- Data is very high-dimensional (linear models may be better)
- You need well-calibrated probabilities (RF probabilities can be poorly calibrated)
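If calibrated probabilities matter but a forest is otherwise a good fit, one option is to wrap it in scikit-learn's `CalibratedClassifierCV`. The sketch below uses illustrative names and isotonic calibration (an arbitrary choice for the example), and compares Brier scores, where lower is better. Whether calibration helps depends on the dataset, so the comparison is worth running rather than assuming.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

Xk, yk = load_breast_cancer(return_X_y=True)
Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    Xk, yk, test_size=0.2, random_state=42
)

# Uncalibrated forest
raw_forest = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
raw_forest.fit(Xk_train, yk_train)

# Same model wrapped in cross-validated isotonic calibration
calibrated_forest = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    method="isotonic",
    cv=5,
)
calibrated_forest.fit(Xk_train, yk_train)

raw_brier = brier_score_loss(yk_test, raw_forest.predict_proba(Xk_test)[:, 1])
cal_brier = brier_score_loss(yk_test, calibrated_forest.predict_proba(Xk_test)[:, 1])

print(f"Brier score, raw forest:        {raw_brier:.4f}")
print(f"Brier score, calibrated forest: {cal_brier:.4f}")
```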
Random Forests are an excellent default starting point for any classification or regression problem on tabular data — they're hard to break and often competitive with much more complex models.
Next lesson: Support Vector Machines — finding the optimal boundary between classes.