Random Forests & Ensemble Methods | Machine Learning Fundamentals | AiTechWorlds

Random Forests: Power Through Diversity

A single decision tree is like asking one person for their opinion — useful, but limited by that person's blind spots. A Random Forest is like asking 500 people from different backgrounds. Their collective wisdom is far more reliable than any individual.

This intuition explains why Random Forests are one of the most consistently effective ML algorithms across diverse real-world problems.

The Ensemble Idea

Single Decision Tree:
  - High variance (overfit to training data)
  - Sensitive to small data changes
  - Lower accuracy ceiling (for example ~85%; illustrative, varies by dataset)

100 Diverse Decision Trees (Random Forest):
  - Individual errors cancel out
  - More stable predictions
  - Higher accuracy ceiling (for example ~93%+; illustrative, varies by dataset)

The key word is diverse. If all trees made the same mistakes, averaging them would help nothing. Random Forests create diversity through two mechanisms:

1. Bootstrap sampling — each tree trains on a different random sample of the data (with replacement)

2. Feature randomness — at each split, each tree considers only a random subset of features
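
Why diversity matters can be checked with a small simulation. The sketch below is not a Random Forest; it only assumes each "tree" makes a Gaussian error with variance sigma², and that any two errors share a pairwise correlation rho (both names are illustrative). Under that assumption, the variance of the averaged error is rho·sigma² + (1 − rho)·sigma²/n, so averaging helps a lot when rho is small and barely at all when rho is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_trials, sigma = 100, 20_000, 1.0

def variance_of_average(rho):
    """Empirical variance of the mean of n_trees errors with pairwise correlation rho."""
    # A shared component creates the correlation; the rest is independent noise.
    shared = rng.normal(size=(n_trials, 1))
    independent = rng.normal(size=(n_trials, n_trees))
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * independent)
    return errors.mean(axis=1).var()

results = {}
for rho in (0.0, 0.3, 0.9):
    empirical = variance_of_average(rho)
    theoretical = rho * sigma**2 + (1 - rho) * sigma**2 / n_trees
    results[rho] = (empirical, theoretical)
    print(f"rho={rho:.1f}  empirical={empirical:.3f}  theoretical={theoretical:.3f}")
```

Bootstrap sampling and feature randomness are both ways of pushing the correlation between trees down, which is what lets the averaging pay off.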

How Random Forests Work

Training:
  For each of n_estimators trees:
    1. Draw bootstrap sample (random rows with replacement, ~63% unique)
    2. For each split, randomly select sqrt(n_features) candidate features
    3. Find best split among those candidates
    4. Grow tree fully (or to max_depth)

Prediction (Classification):
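  Each tree votes → majority wins (scikit-learn averages the trees' predicted class probabilities, a soft vote, rather than counting hard votes)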

Prediction (Regression):
  Each tree predicts a value → average all predictions
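
The training and prediction steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest: the 51-tree count (odd, to avoid tied votes), the variable names, and the hard majority vote on 0/1 labels are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 51  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-4: max_features="sqrt" restricts the candidate features at each
    # split; the default max_depth=None lets the tree grow fully
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Classification: each tree votes, majority wins
votes = np.array([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)      # labels are 0/1 here

manual_accuracy = (majority == y_test).mean()
single_tree_accuracy = (trees[0].predict(X_test) == y_test).mean()
print(f"Hand-rolled forest accuracy: {manual_accuracy:.3f}")
print(f"First tree alone:            {single_tree_accuracy:.3f}")
```

For regression the only change would be to use a regression tree and replace the vote with `votes.mean(axis=0)`.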

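The out-of-bag (OOB) samples (the ~37% of rows that a given tree's bootstrap sample leaves out) serve as a built-in validation set: each row is scored only by the trees that never saw it, so you get a generalization estimate without holding out a separate validation split.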
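
The ~63% / ~37% figures follow from the sampling itself: the chance that a given row is never picked in n draws with replacement is (1 − 1/n)^n, which tends to 1/e ≈ 0.368. A quick numerical check, using an arbitrary dataset size of 10,000 rows chosen for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_draws = 10_000, 200

unique_fractions = []
for _ in range(n_draws):
    # One bootstrap sample: n_samples row indices drawn with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    unique_fractions.append(np.unique(idx).size / n_samples)

in_bag = np.mean(unique_fractions)                    # rows seen by a tree
theoretical = 1 - (1 - 1 / n_samples) ** n_samples    # tends to 1 - 1/e ≈ 0.632

print(f"Average in-bag fraction: {in_bag:.3f}")
print(f"Theoretical value:       {theoretical:.3f}")
print(f"Out-of-bag fraction:     {1 - in_bag:.3f}")
```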

Building a Random Forest

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import numpy as np
import matplotlib.pyplot as plt

# Classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,         # Number of trees
    max_features='sqrt',      # Features per split: 'sqrt' for classification
    max_depth=None,           # Let trees grow fully
    min_samples_leaf=1,
    bootstrap=True,           # Use bootstrap sampling
    oob_score=True,           # Calculate out-of-bag score
    n_jobs=-1,                # Use all CPU cores
    random_state=42
)

rf.fit(X_train, y_train)

print(f"Test Accuracy: {rf.score(X_test, y_test):.3f}")
print(f"OOB Score: {rf.oob_score_:.3f}")  # Out-of-bag estimate
print(classification_report(y_test, rf.predict(X_test), target_names=cancer.target_names))
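
To see how the single-tree versus forest comparison plays out on this dataset, a cross-validated check is more trustworthy than one train/test split. The sketch below assumes 5-fold cross-validation and default tree settings; the gap it reports is specific to this dataset and should not be read as a general accuracy figure.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name:>13}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Comparing the standard deviations across folds is also informative: a lower spread for the forest reflects the stability claim made earlier.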

Feature Importance

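Random Forests provide feature importance estimates by averaging each feature's impurity reduction across all trees. These values are computed from the training data, so treat them as a first look rather than a final ranking.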

import pandas as pd

# Feature importances (mean decrease in impurity)
importances = pd.Series(
    rf.feature_importances_,
    index=cancer.feature_names
).sort_values(ascending=False)

print("Top 10 Most Important Features:")
print(importances.head(10))

# Plot
plt.figure(figsize=(10, 8))
importances.head(15).plot(kind='barh')
plt.title('Random Forest Feature Importances')
plt.xlabel('Mean Decrease in Impurity')
plt.tight_layout()
plt.show()

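Permutation importance is often more reliable than the default impurity-based importance, which is computed on the training data and tends to favor features with many unique values (continuous or high-cardinality categorical features). Permutation importance instead measures how much the score on held-out data drops when one feature's values are shuffled: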

from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

perm_importances = pd.Series(
    perm_imp.importances_mean,
    index=cancer.feature_names
).sort_values(ascending=False)

print("Permutation Importances:")
print(perm_importances.head(10))
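
One way to see the difference between the two measures is to append a column of pure noise and compare what each assigns to it. This is a sketch under assumptions: the `random_noise` column is a hypothetical feature added only for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
rng = np.random.default_rng(42)

# Append a feature that carries no information about the target
noise = rng.normal(size=(data.data.shape[0], 1))
X = np.hstack([data.data, noise])
feature_names = list(data.feature_names) + ["random_noise"]

X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.2, random_state=42
)
rf_noise = RandomForestClassifier(n_estimators=100, random_state=42)
rf_noise.fit(X_train, y_train)

# Impurity-based importance (training data) for the noise column
mdi_noise = rf_noise.feature_importances_[-1]

# Permutation importance (held-out data) for the noise column
perm = permutation_importance(rf_noise, X_test, y_test, n_repeats=10, random_state=42)
perm_noise = perm.importances_mean[-1]

print(f"Impurity-based importance of {feature_names[-1]}: {mdi_noise:.4f}")
print(f"Permutation importance of {feature_names[-1]}:    {perm_noise:.4f}")
```

Impurity-based importance can only be zero or positive, so a noise column that gets used in any split receives some credit. Permutation importance on held-out data is free to come out at or below zero.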

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 5, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'bootstrap': [True, False]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50,          # Try 50 random combinations
    cv=5,
    scoring='f1',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
print("Best params:", random_search.best_params_)
print("Best CV F1:", random_search.best_score_)
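
The best cross-validation score is still an estimate made on the training data, so the tuned model should be checked once on the held-out test set. The sketch below repeats a smaller search (5 candidates, 3 folds, a reduced parameter space chosen only to keep it quick) and then scores `best_estimator_`, which scikit-learn refits on the full training set by default.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 150),
        "max_features": ["sqrt", "log2", 0.5],
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=5,          # deliberately small for a quick demonstration
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

# best_estimator_ has been refit on all of X_train (refit=True is the default)
best_rf = search.best_estimator_
test_f1 = f1_score(y_test, best_rf.predict(X_test))

print("Best params:", search.best_params_)
print(f"CV F1:   {search.best_score_:.3f}")
print(f"Test F1: {test_f1:.3f}")
```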

Key hyperparameters and their effects:

Parameter	Effect	Tuning Strategy
`n_estimators`	More = better (diminishing returns)	100 is usually sufficient
`max_depth`	Controls overfitting	None (full) often works; tune if overfitting
`max_features`	Key diversity lever	'sqrt' for classification, 1/3 for regression
`min_samples_leaf`	Smoother predictions	Increase for noisy data

Random Forests for Regression

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1/3,    # 1/3 of features per split (recommended for regression)
    n_jobs=-1,
    random_state=42
)
rf_reg.fit(X_train, y_train)

y_pred = rf_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}")
print(f"R²: {r2:.3f}")

Prediction Uncertainty with Random Forests

One underused feature: the spread of the individual trees' predictions gives a rough measure of uncertainty. Treat it as a heuristic rather than a calibrated interval.

# Get predictions from each tree
tree_predictions = np.array([tree.predict(X_test) for tree in rf_reg.estimators_])

# Mean prediction
mean_pred = tree_predictions.mean(axis=0)

# Uncertainty (standard deviation across trees)
uncertainty = tree_predictions.std(axis=0)

# High uncertainty = predictions to scrutinize
high_uncertainty_idx = np.argsort(uncertainty)[-5:]
print("Most uncertain predictions:")
for idx in high_uncertainty_idx:
    print(f"  Prediction: {mean_pred[idx]:.2f} ± {uncertainty[idx]:.2f}, True: {y_test[idx]:.2f}")
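
Whether the tree spread is a useful signal can be checked by comparing it with the actual errors. The sketch below uses a synthetic dataset (`make_friedman1`) instead of the housing data so it runs without a download, and uses Spearman rank correlation as one reasonable way to compare spread and absolute error; both are choices made for this illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

# Per-tree predictions: shape (n_trees, n_test)
tree_preds = np.array([t.predict(X_test) for t in forest.estimators_])
mean_pred = tree_preds.mean(axis=0)   # matches forest.predict(X_test)
spread = tree_preds.std(axis=0)       # disagreement between trees

# Do points with a larger spread also tend to have larger errors?
abs_error = np.abs(y_test - mean_pred)
rho, _ = spearmanr(spread, abs_error)
print(f"Spearman correlation between spread and |error|: {rho:.3f}")
```

A clearly positive correlation suggests the spread is worth using to flag predictions for review; a value near zero suggests it is not informative on that dataset.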

How Many Trees Do You Need?

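# Error vs number of trees
# Reload the classification data: X_train/y_train currently hold the housing
# regression split, and a classifier cannot be fit on a continuous target.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

test_errors = []
oob_errors = []

# With very few trees, some rows may never be out-of-bag and scikit-learn may warn.
for n_trees in range(10, 300, 10):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    test_errors.append(1 - rf.score(X_test, y_test))
    oob_errors.append(1 - rf.oob_score_)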

plt.figure(figsize=(10, 6))
plt.plot(range(10, 300, 10), test_errors, label='Test Error')
plt.plot(range(10, 300, 10), oob_errors, label='OOB Error', linestyle='--')
plt.xlabel('Number of Trees')
plt.ylabel('Error')
plt.title('Error vs Number of Trees')
plt.legend()
plt.show()
# Error typically stops improving significantly around 100-200 trees
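
Refitting a fresh forest for every tree count repeats a lot of work. A cheaper variant, sketched below, uses scikit-learn's `warm_start=True` so that each call to `fit` keeps the existing trees and only trains the additional ones. The step size of 25 and the starting point of 25 trees are choices made for this sketch; starting higher than 10 also makes it unlikely that a row has no out-of-bag prediction.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# warm_start=True: later fit() calls add trees instead of starting over
forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

tree_counts = list(range(25, 301, 25))
oob_errors, test_errors = [], []
for n_trees in tree_counts:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_train, y_train)          # trains only the newly requested trees
    oob_errors.append(1 - forest.oob_score_)
    test_errors.append(1 - forest.score(X_test, y_test))

for n_trees, oob, test in zip(tree_counts, oob_errors, test_errors):
    print(f"{n_trees:>3} trees  OOB error={oob:.3f}  test error={test:.3f}")
```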

When to Use Random Forests

Strong choice when:

You need high accuracy with minimal tuning
Interpretability (feature importance) matters
You want a rough uncertainty signal (the spread across trees) at no extra cost
Data has missing values (some RF implementations handle them)
You have a moderate-sized tabular dataset

Consider alternatives when:

Maximum accuracy on structured data (try XGBoost/LightGBM — often better)
Prediction speed matters (a single tree is much faster)
Data is very high-dimensional (linear models may be better)
You need well-calibrated probabilities (RF probabilities can be poorly calibrated)
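
If calibrated probabilities matter, one option is to wrap the forest in scikit-learn's `CalibratedClassifierCV` and compare Brier scores (lower is better) before and after. This is a sketch under assumptions: isotonic calibration with 5 folds is only one possible setup, and calibration does not always improve the score, particularly on small datasets.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Uncalibrated forest
raw_rf = RandomForestClassifier(n_estimators=100, random_state=42)
raw_rf.fit(X_train, y_train)

# Same forest wrapped in cross-validated isotonic calibration
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method="isotonic",
    cv=5,
)
calibrated_rf.fit(X_train, y_train)

# Brier score: mean squared difference between predicted probability and outcome
brier_raw = brier_score_loss(y_test, raw_rf.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, calibrated_rf.predict_proba(X_test)[:, 1])

print(f"Brier score, raw forest:        {brier_raw:.4f}")
print(f"Brier score, calibrated forest: {brier_cal:.4f}")
```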

Random Forests are an excellent default starting point for any classification or regression problem on tabular data — they're hard to break and often competitive with much more complex models.

Next lesson: Support Vector Machines — finding the optimal boundary between classes.