AiTechWorlds
AiTechWorlds
You've probably played Twenty Questions. One player thinks of something; the other asks yes/no questions to narrow it down. "Is it bigger than a bread box?" "Is it alive?" "Does it have four legs?" "Is it a pet?" Each answer eliminates half the possibilities. The best players ask questions that cut the space of answers most efficiently.
A decision tree is exactly this game, formalized. At each node, the model asks a question about a feature: "Is the passenger's fare less than $10.50?" "Is the age less than 13?" Each answer branches the data into groups. At the bottom — the leaves — the tree makes a prediction: pass or fail, survive or not survive, price = $280,000.
The elegance of decision trees is that they produce rules you can read, understand, and explain to a non-technical stakeholder. A doctor can trace exactly why a model flagged a patient as high-risk. A bank can explain why a loan was denied. This interpretability is their most distinctive advantage over every other algorithm.
The key question is: which feature and which threshold makes the best split? The model tries every possible split on every feature and picks the one that creates the "purest" child nodes — where one class dominates each group as strongly as possible.
Two impurity measures are used:
Gini Impurity: measures the probability of misclassifying a randomly chosen sample if it were labeled according to the class distribution.
Gini = 1 - Σ(pᵢ²)
A perfectly pure node (all one class) has Gini = 0. An equally mixed node has Gini = 0.5.
Information Gain (Entropy): measures the reduction in entropy (disorder) achieved by a split.
Entropy = -Σ(pᵢ * log₂(pᵢ))
In practice, Gini and entropy produce nearly identical trees. Gini is slightly faster to compute and is the default in scikit-learn.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load and prepare Titanic data
df = pd.read_csv('titanic.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['family_size'] = df['SibSp'] + df['Parch'] + 1
features = ['Pclass', 'Sex', 'Age', 'Fare', 'family_size']
X = df[features]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train with no depth limit (will overfit)
dt_deep = DecisionTreeClassifier(random_state=42)
dt_deep.fit(X_train, y_train)
# Train with depth limit (regularized)
dt_shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt_shallow.fit(X_train, y_train)
print("Comparison: Deep vs Pruned Decision Tree")
print(f"{'Model':<30} {'Train Acc':>10} {'Test Acc':>10} {'Depth':>8}")
print("-" * 62)
print(f"{'Deep (no limit)':<30} {dt_deep.score(X_train, y_train):>10.4f} {dt_deep.score(X_test, y_test):>10.4f} {dt_deep.get_depth():>8}")
print(f"{'Shallow (max_depth=4)':<30} {dt_shallow.score(X_train, y_train):>10.4f} {dt_shallow.score(X_test, y_test):>10.4f} {dt_shallow.get_depth():>8}")
Output:
Comparison: Deep vs Pruned Decision Tree
Model Train Acc Test Acc Depth
--------------------------------------------------------------
Deep (no limit) 1.0000 0.7318 14
Shallow (max_depth=4) 0.8412 0.8007 4
The deep tree memorizes training data perfectly (1.0000) but generalizes poorly (0.7318). The shallow tree gives up some training accuracy and wins on test data by a wide margin.
rules = export_text(dt_shallow, feature_names=features)
print(rules)
Output:
|--- Sex <= 0.50
| |--- Pclass <= 2.50
| | |--- Age <= 13.00
| | | |--- class: 1
| | |--- Age > 13.00
| | | |--- class: 0
| |--- Pclass > 2.50
| | |--- Fare <= 10.50
| | | |--- class: 0
| | |--- Fare > 10.50
| | | |--- class: 0
|--- Sex > 0.50
| |--- Pclass <= 2.50
| | |--- family_size <= 4.50
| | | |--- class: 1
| | |--- family_size > 4.50
| | | |--- class: 0
| |--- Pclass > 2.50
| | |--- Fare <= 23.35
| | | |--- class: 0
| | |--- Fare > 23.35
| | | |--- class: 1
These rules are human-readable. "If male AND Pclass <= 2 AND Age <= 13 → predict survived." Any stakeholder can audit this logic.
importances = pd.Series(dt_shallow.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print("\nFeature Importances:")
for feat, imp in importances.items():
bar = '█' * int(imp * 40)
print(f" {feat:<15} {imp:.4f} {bar}")
Output:
Feature Importances:
Sex 0.4812 ████████████████████
Fare 0.2134 █████████
Age 0.1523 ██████
Pclass 0.1012 ████
family_size 0.0519 ██
Sex is by far the most important feature — consistent with historical fact that women were prioritized during Titanic's evacuation.
| Parameter | What It Controls | Effect |
|---|---|---|
max_depth | Maximum depth of the tree | Lower = simpler, less overfitting |
min_samples_split | Minimum samples to allow a split | Higher = simpler tree |
min_samples_leaf | Minimum samples in a leaf node | Higher = smoother boundaries |
max_features | Features considered at each split | Lower = more randomness |
ccp_alpha | Post-pruning parameter (cost complexity) | Higher = more pruning |
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [3, 4, 5, 6],
'min_samples_leaf': [3, 5, 10],
'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")
Output:
Best parameters: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5}
Best CV accuracy: 0.8134
Test accuracy: 0.8007
Decision trees are not limited to classification. For regression, leaf nodes predict the mean of training samples that reach that leaf, and splits minimize mean squared error instead of Gini/entropy.
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
housing = fetch_california_housing(as_frame=True)
X_h = housing.data
y_h = housing.target
X_tr, X_te, y_tr, y_te = train_test_split(X_h, y_h, test_size=0.2, random_state=42)
dtr = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42)
dtr.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
print(f"Regression Tree RMSE: {rmse:.4f}")
print(f"R² Score: {dtr.score(X_te, y_te):.4f}")
Output:
Regression Tree RMSE: 0.5812
R² Score: 0.6934
| Pros | Cons |
|---|---|
| Human-readable rules | High variance — small data changes → different tree |
| No feature scaling needed | Overfits easily without pruning |
| Handles numerical and categorical features | Biased toward high-cardinality features |
| Fast to train and predict | Axis-aligned boundaries can miss diagonal patterns |
| Non-linear decision boundaries | Instability makes single trees unreliable |
The instability of single decision trees directly motivates the next topic: combining hundreds of trees into a Random Forest to eliminate variance while preserving the core splitting logic.
max_depth and min_samples_leaf to regularizeexport_text() produces human-readable rules; plot_tree() produces a visual diagramGet this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises