AiTechWorlds
Imagine you need to make a critical medical decision. You could ask one doctor for their opinion. Or you could consult a panel of 100 specialists, each with slightly different training and experience, then take a majority vote.
The single doctor might be brilliant but can have a bad day, a blind spot, or be influenced by a recent case. The panel is far more reliable — even if individual doctors make mistakes, the collective wisdom cancels out random errors.
This is the intuition behind ensemble learning. Random Forest is one of the best-known ensemble methods: it trains many decision trees independently (100 by default in scikit-learn), each on a slightly different random sample of the data with a slightly different random subset of features, then aggregates their predictions. No single tree is relied upon; the ensemble is what matters.
The result is an algorithm that keeps some of the interpretability of trees through feature importances, removes much of their variance problem, and often performs strongly on tabular data with little hyperparameter tuning.
Recall from the previous lesson: a single decision tree is highly unstable. Change a few training examples and the tree can look completely different. This high variance means the tree is extremely sensitive to the specific data it was trained on.
The solution is to train many trees on slightly different datasets and average their predictions. Errors that are random and roughly independent tend to cancel out. Patterns that are real appear consistently across most trees.
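The cancelling-out effect can be checked numerically. The sketch below is an idealization, not a Random Forest: each "model" is simulated as the true value plus independent noise, an assumption real trees only partly satisfy. All names (`true_value`, `single`, `ensemble`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n_trials = 10_000

# One "model" per trial: the true value plus independent unit-variance noise.
single = true_value + rng.normal(0.0, 1.0, size=n_trials)

# An "ensemble" per trial: the average of 100 such independent models.
ensemble = true_value + rng.normal(0.0, 1.0, size=(n_trials, 100)).mean(axis=1)

single_std = single.std()
ensemble_std = ensemble.std()
print(f"Std of a single model:      {single_std:.3f}")
print(f"Std of a 100-model average: {ensemble_std:.3f}")
```

With fully independent errors the spread shrinks by roughly a factor of sqrt(100) = 10. Trees trained on overlapping data are correlated, so the real reduction is smaller, which is why Random Forest adds further randomness to decorrelate them.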
Bagging (Bootstrap AGGregating) creates diversity among trees by training each one on a bootstrap sample: a random sample of the training data drawn with replacement, the same size as the original. Each bootstrap sample contains approximately 63% of the unique training rows; the remaining ~37% are left out and can be used for validation (these are called out-of-bag samples).
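The 63% / 37% split follows from sampling with replacement: the chance that a given row is never picked in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A minimal simulation, assuming a hypothetical dataset of 10,000 rows identified only by index:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
in_bag = np.unique(bootstrap_idx)
unique_fraction = in_bag.size / n_samples

# Out-of-bag rows are the ones that were never drawn.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False
oob_fraction = oob_mask.mean()

print(f"In-bag unique fraction: {unique_fraction:.3f}")
print(f"Out-of-bag fraction:    {oob_fraction:.3f}")
print(f"Theory, 1 - 1/e:        {1 - np.exp(-1):.3f}")
```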
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
# Load Titanic (reusing engineered features)
df = pd.read_csv('titanic.csv')
df = df[['Survived','Pclass','Sex','Age','Fare','SibSp','Parch']].dropna()
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['family_size'] = df['SibSp'] + df['Parch'] + 1
X = df[['Pclass','Sex','Age','Fare','family_size']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Single decision tree (baseline)
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))
# Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=8,
                            min_samples_leaf=3, random_state=42,
                            oob_score=True)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Decision Tree accuracy: {dt_acc:.4f}")
print(f"Random Forest accuracy: {rf_acc:.4f}")
print(f"Random Forest OOB score: {rf.oob_score_:.4f}")
Output:
Decision Tree accuracy: 0.7972
Random Forest accuracy: 0.8380
Random Forest OOB score: 0.8214
On this split, Random Forest beats the single tree by about 4 percentage points (0.8380 vs. 0.7972). The OOB score provides a free estimate of generalization performance without a separate validation set.
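To see where an out-of-bag estimate comes from, bagging can be written by hand. This is a simplified sketch, not scikit-learn's implementation: it uses synthetic stand-in data from `make_classification` (an assumption, not the Titanic set) and hard majority votes, whereas scikit-learn's forest averages class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: binary labels 0/1).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, random_state=0)
rng = np.random.default_rng(0)
n, n_trees = len(X_demo), 50

votes = np.zeros((n, 2))  # per-row OOB vote counts for class 0 and class 1
for t in range(n_trees):
    idx = rng.integers(0, n, size=n)        # bootstrap sample of row indices
    oob = np.setdiff1d(np.arange(n), idx)   # rows this tree never saw
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X_demo[idx], y_demo[idx])
    preds = tree.predict(X_demo[oob])
    votes[oob, preds] += 1                  # each tree votes only on its OOB rows

has_vote = votes.sum(axis=1) > 0            # rows that were OOB at least once
oob_pred = votes.argmax(axis=1)
manual_oob = (oob_pred[has_vote] == y_demo[has_vote]).mean()
print(f"Manual OOB accuracy: {manual_oob:.3f}")
```

Every row is scored only by trees that did not train on it, which is why the estimate behaves like a validation score.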
Random Forest adds one more layer of randomness beyond bagging: at each split in each tree, only a random subset of features is considered. This ensures trees are decorrelated — they don't all ask the same questions in the same order.
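A small sketch of the per-split feature draw, assuming the square-root rule (keep about sqrt(n_features) candidates) and reusing the five feature names from the Titanic example. The loop only illustrates the sampling step; it does not build a tree.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["Pclass", "Sex", "Age", "Fare", "family_size"]
n_features = len(feature_names)

# Square-root rule: with 5 features, each split considers 2 candidates.
k = max(1, int(np.sqrt(n_features)))

# Every split draws a fresh subset, without replacement.
for split in range(3):
    candidates = rng.choice(n_features, size=k, replace=False)
    print(f"split {split}: candidates = {[feature_names[i] for i in candidates]}")
```

Because a strong feature such as Sex is unavailable at many splits, different trees are forced to find different structure in the data.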
The number of features considered per split is controlled by max_features:
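sqrt(n_features): the usual choice for classification, and the default in RandomForestClassifier
n_features / 3: a common rule of thumb for regression
# Effect of number of trees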
print("Effect of n_estimators on Random Forest:")
print(f"{'Trees':>8} | {'Test Accuracy':>14} | {'OOB Score':>10}")
print("-" * 40)
for n_trees in [5, 10, 25, 50, 100, 200]:
    rf_temp = RandomForestClassifier(n_estimators=n_trees, random_state=42, oob_score=True)
    rf_temp.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf_temp.predict(X_test))
    print(f"{n_trees:>8} | {acc:>14.4f} | {rf_temp.oob_score_:>10.4f}")
Output:
Effect of n_estimators on Random Forest:
   Trees |  Test Accuracy |  OOB Score
----------------------------------------
       5 |         0.7972 |     0.7913
      10 |         0.8127 |     0.8034
      25 |         0.8239 |     0.8156
      50 |         0.8310 |     0.8198
     100 |         0.8380 |     0.8214
     200 |         0.8408 |     0.8221
Accuracy improves rapidly up to ~100 trees, then plateaus. Training 200 trees costs twice as much compute but gains only 0.28 percentage points of accuracy over 100 trees (0.8408 vs. 0.8380).
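Refitting a new forest for every size repeats work. One way to trace the same curve more cheaply is scikit-learn's `warm_start=True`, which keeps the trees already built and trains only the additional ones. The sketch below uses synthetic data from `make_classification` as a stand-in, so its scores will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic set).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, warm_start=True,
                                oob_score=True, random_state=0)
oob_by_size = {}
for n_trees in [25, 50, 100, 200]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_demo, y_demo)  # with warm_start, only the new trees are trained
    oob_by_size[n_trees] = forest.oob_score_
    print(f"{n_trees:>4} trees | OOB score {forest.oob_score_:.4f}")
```

Watching where the OOB score stops improving gives a reasonable stopping point without touching the test set.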
One of the most useful outputs from Random Forest is feature importance: how much each feature contributes to reducing impurity across all trees.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print("\nRandom Forest Feature Importances:")
for feat, imp in importances.items():
    bar = '█' * int(imp * 50)
    print(f" {feat:<15} {imp:.4f} {bar}")
Output:
Random Forest Feature Importances:
Fare 0.3241 ████████████████
Sex 0.2987 ███████████████
Age 0.1834 █████████
Pclass 0.1123 █████
family_size 0.0815 ████
The ensemble averages importance scores across 100 trees, giving a more stable and reliable ranking than a single tree's importance.
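Impurity-based importances are computed on the training data and can favor features with many distinct values, such as continuous columns like Fare and Age. A common cross-check is permutation importance on held-out data: shuffle one column at a time and measure how much the score drops. A minimal sketch, assuming synthetic data from `make_classification` rather than the Titanic set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 informative features and 3 pure-noise features.
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on the held-out split and record the accuracy drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

If the two rankings disagree sharply, the impurity-based numbers deserve a second look before they drive any conclusions.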
While Random Forest trains trees in parallel (independently), Gradient Boosting trains trees sequentially. Each new tree focuses specifically on correcting the errors made by all previous trees.
The intuition: tree 1 makes rough predictions. Tree 2 learns from tree 1's residual errors. Tree 3 learns from the errors that remain. Each successive correction tends to be smaller, so by tree 100 the model is making only fine adjustments.
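The residual-fitting loop can be written out in a few lines. This sketch uses regression with squared error, where "the errors" are literally the residuals; a classifier such as GradientBoostingClassifier follows the same pattern but fits gradients of a classification loss instead. The data (a noisy sine curve) and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # stage 0: predict the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(100):
    residuals = y_demo - prediction                   # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                       # the new tree targets the residuals
    prediction += learning_rate * tree.predict(X_demo)  # take a small corrective step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE after   0 trees: {mse_history[0]:.4f}")
print(f"Training MSE after  10 trees: {mse_history[10]:.4f}")
print(f"Training MSE after 100 trees: {mse_history[100]:.4f}")
```

The learning rate scales each correction down, which is why boosting needs many small trees rather than a few large ones.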
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=4, random_state=42)
gb.fit(X_train, y_train)
gb_acc = accuracy_score(y_test, gb.predict(X_test))
print(f"Gradient Boosting accuracy: {gb_acc:.4f}")
Output:
Gradient Boosting accuracy: 0.8451
Gradient boosting often beats Random Forest when it is well tuned, but it is more sensitive to hyperparameters and slower to train.
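One place that sensitivity shows up is the number of trees: too few underfits, too many can overfit. scikit-learn's `staged_predict` yields predictions after each boosting stage, so a single fit is enough to inspect every ensemble size. A sketch on synthetic stand-in data, with a held-out validation split (both are assumptions, not part of the Titanic example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

gb_demo = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                     max_depth=3, random_state=0)
gb_demo.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., 200 trees, all from the same fitted model.
staged_acc = [accuracy_score(y_val, pred) for pred in gb_demo.staged_predict(X_val)]
best_n = int(np.argmax(staged_acc)) + 1
print(f"Best validation accuracy {max(staged_acc):.4f} at {best_n} trees")
```

The chosen `best_n` should come from a validation split like this one, not from the final test set, or the reported accuracy will be optimistic.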
| Property | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| How many trees | 1 | Many (parallel) | Many (sequential) |
| Training speed | Fast | Moderate | Slow |
| Prediction speed | Fastest | Moderate | Moderate |
| Interpretability | High (full rules) | Medium (importances only) | Low |
| Overfitting risk | High | Low | Medium (needs tuning) |
| Hyperparameter sensitivity | Low | Low | High |
| Typical accuracy | Baseline | Good | Best (tuned) |
| Feature importance | Yes | Yes (more stable) | Yes |
| sklearn class | DecisionTreeClassifier | RandomForestClassifier | GradientBoostingClassifier |
XGBoost (eXtreme Gradient Boosting) is a highly optimized implementation of gradient boosting that has won hundreds of Kaggle competitions. Key improvements over sklearn's GradientBoostingClassifier:
# pip install xgboost
from xgboost import XGBClassifier
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=4,
                    eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))
print(f"XGBoost accuracy: {xgb_acc:.4f}")
Output:
XGBoost accuracy: 0.8521
For competitive machine learning on tabular data, XGBoost or LightGBM (a separate gradient boosting library with a similar interface) is typically the first algorithm to reach for after establishing a Random Forest baseline.
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
print("5-Fold Cross-Validation Accuracy:")
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f" {name:<22}: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
Output:
5-Fold Cross-Validation Accuracy:
Decision Tree : 0.7834 (+/- 0.0341)
Random Forest : 0.8234 (+/- 0.0198)
Gradient Boosting : 0.8312 (+/- 0.0187)
Notice that Random Forest and Gradient Boosting also have a lower standard deviation across folds: their scores are more stable from one data split to the next.