You've probably played Twenty Questions. One player thinks of something; the other asks yes/no questions to narrow it down. "Is it bigger than a bread box?" "Is it alive?" "Does it have four legs?" "Is it a pet?" Each answer eliminates half the possibilities. The best players ask questions that cut the space of answers most efficiently.

A decision tree is exactly this game, formalized. At each node, the model asks a question about a feature: "Is the passenger's fare less than $10.50?" "Is the age less than 13?" Each answer branches the data into groups. At the bottom — the leaves — the tree makes a prediction: pass or fail, survive or not survive, price = $280,000.

The elegance of decision trees is that they produce rules you can read, understand, and explain to a non-technical stakeholder. A doctor can trace exactly why a model flagged a patient as high-risk. A bank can explain why a loan was denied. This interpretability is their most distinctive advantage over every other algorithm.

How a Tree Decides Where to Split

The key question is: which feature and which threshold makes the best split? The model tries every possible split on every feature and picks the one that creates the "purest" child nodes — where one class dominates each group as strongly as possible.

Two impurity measures are used:

Gini Impurity: measures the probability of misclassifying a randomly chosen sample if it were labeled according to the class distribution.

Gini = 1 - Σ(pᵢ²)

A perfectly pure node (all one class) has Gini = 0. An equally mixed node has Gini = 0.5.

Information Gain (Entropy): measures the reduction in entropy (disorder) achieved by a split.

Entropy = -Σ(pᵢ * log₂(pᵢ))

In practice, Gini and entropy produce nearly identical trees. Gini is slightly faster to compute and is the default in scikit-learn.

Building a Decision Tree on Titanic Data

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load and prepare Titanic data
df = pd.read_csv('titanic.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['family_size'] = df['SibSp'] + df['Parch'] + 1

features = ['Pclass', 'Sex', 'Age', 'Fare', 'family_size']
X = df[features]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train with no depth limit (will overfit)
dt_deep = DecisionTreeClassifier(random_state=42)
dt_deep.fit(X_train, y_train)

# Train with depth limit (regularized)
dt_shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt_shallow.fit(X_train, y_train)

print("Comparison: Deep vs Pruned Decision Tree")
print(f"{'Model':<30} {'Train Acc':>10} {'Test Acc':>10} {'Depth':>8}")
print("-" * 62)
print(f"{'Deep (no limit)':<30} {dt_deep.score(X_train, y_train):>10.4f} {dt_deep.score(X_test, y_test):>10.4f} {dt_deep.get_depth():>8}")
print(f"{'Shallow (max_depth=4)':<30} {dt_shallow.score(X_train, y_train):>10.4f} {dt_shallow.score(X_test, y_test):>10.4f} {dt_shallow.get_depth():>8}")

Output:

Comparison: Deep vs Pruned Decision Tree
Model                          Train Acc   Test Acc    Depth
--------------------------------------------------------------
Deep (no limit)                   1.0000     0.7318       14
Shallow (max_depth=4)             0.8412     0.8007        4

The deep tree memorizes training data perfectly (1.0000) but generalizes poorly (0.7318). The shallow tree gives up some training accuracy and wins on test data by a wide margin.

Reading the Tree Rules

rules = export_text(dt_shallow, feature_names=features)
print(rules)

Output:

|--- Sex <= 0.50
|   |--- Pclass <= 2.50
|   |   |--- Age <= 13.00
|   |   |   |--- class: 1
|   |   |--- Age > 13.00
|   |   |   |--- class: 0
|   |--- Pclass > 2.50
|   |   |--- Fare <= 10.50
|   |   |   |--- class: 0
|   |   |--- Fare > 10.50
|   |   |   |--- class: 0
|--- Sex > 0.50
|   |--- Pclass <= 2.50
|   |   |--- family_size <= 4.50
|   |   |   |--- class: 1
|   |   |--- family_size > 4.50
|   |   |   |--- class: 0
|   |--- Pclass > 2.50
|   |   |--- Fare <= 23.35
|   |   |   |--- class: 0
|   |   |--- Fare > 23.35
|   |   |   |--- class: 1

These rules are human-readable. "If male AND Pclass <= 2 AND Age <= 13 → predict survived." Any stakeholder can audit this logic.

Feature Importance

importances = pd.Series(dt_shallow.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print("\nFeature Importances:")
for feat, imp in importances.items():
    bar = '█' * int(imp * 40)
    print(f"  {feat:<15} {imp:.4f}  {bar}")

Output:

Feature Importances:
  Sex             0.4812  ████████████████████
  Fare            0.2134  █████████
  Age             0.1523  ██████
  Pclass          0.1012  ████
  family_size     0.0519  ██

Sex is by far the most important feature — consistent with historical fact that women were prioritized during Titanic's evacuation.

Pre-Pruning Parameters

Parameter	What It Controls	Effect
`max_depth`	Maximum depth of the tree	Lower = simpler, less overfitting
`min_samples_split`	Minimum samples to allow a split	Higher = simpler tree
`min_samples_leaf`	Minimum samples in a leaf node	Higher = smoother boundaries
`max_features`	Features considered at each split	Lower = more randomness
`ccp_alpha`	Post-pruning parameter (cost complexity)	Higher = more pruning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [3, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                            param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

Output:

Best parameters: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5}
Best CV accuracy: 0.8134
Test accuracy: 0.8007

Decision Tree for Regression

Decision trees are not limited to classification. For regression, leaf nodes predict the mean of training samples that reach that leaf, and splits minimize mean squared error instead of Gini/entropy.

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing(as_frame=True)
X_h = housing.data
y_h = housing.target

X_tr, X_te, y_tr, y_te = train_test_split(X_h, y_h, test_size=0.2, random_state=42)

dtr = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42)
dtr.fit(X_tr, y_tr)

rmse = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
print(f"Regression Tree RMSE: {rmse:.4f}")
print(f"R² Score: {dtr.score(X_te, y_te):.4f}")

Output:

Regression Tree RMSE: 0.5812
R² Score: 0.6934

Pros and Cons

Pros	Cons
Human-readable rules	High variance — small data changes → different tree
No feature scaling needed	Overfits easily without pruning
Handles numerical and categorical features	Biased toward high-cardinality features
Fast to train and predict	Axis-aligned boundaries can miss diagonal patterns
Non-linear decision boundaries	Instability makes single trees unreliable

The instability of single decision trees directly motivates the next topic: combining hundreds of trees into a Random Forest to eliminate variance while preserving the core splitting logic.

Key Takeaways

Decision trees split data by asking questions about feature values, choosing splits that maximize node purity
Gini impurity and information gain are two equivalent measures of split quality; Gini is the default
Unpruned trees memorize training data perfectly but generalize poorly — use max_depth and min_samples_leaf to regularize
export_text() produces human-readable rules; plot_tree() produces a visual diagram
Feature importance scores show which features the tree uses most across all splits
Decision trees work for both classification (Gini/entropy criterion) and regression (MSE criterion)

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

32 minLesson 9 of 19

Course Contents(19 lessons)

▾

Chapter 1: ML Foundations

What Is Machine Learning? Types and Real Applications22 min

The ML Workflow: Data → Model → Prediction25 min

Chapter 2: Data Preprocessing

Data Preprocessing: Cleaning, Imputation, Encoding35 min

Feature Engineering and Feature Selection32 min

Chapter 3: Supervised Learning — Regression

Linear Regression: Predicting Continuous Values35 min

Polynomial Regression and Overfitting30 min

Regularization: Ridge, Lasso, and ElasticNet28 min

Chapter 4: Supervised Learning — Classification

Logistic Regression: Binary and Multi-Class32 min

Decision Trees: How Machines Make Decisions32 min

Random Forests and Ensemble Methods30 min

Support Vector Machines (SVM)32 min

Chapter 5: Unsupervised Learning

K-Means Clustering: Grouping Without Labels30 min

PCA: Reducing Dimensions Without Losing Information32 min

Chapter 6: Model Evaluation

Train/Test Split, Cross-Validation, and Bias-Variance32 min

Evaluation Metrics: Confusion Matrix, F1, ROC-AUC30 min

Chapter 7: Neural Networks

Neural Networks: Neurons, Layers, Activation Functions35 min

Backpropagation and Gradient Descent Explained38 min

Introduction to Deep Learning and CNNs35 min

Chapter 8: Final Project

Final Project: End-to-End ML Pipeline50 min

Chapter 4: Supervised Learning — Classification

Decision Trees: How Machines Make Decisions

Decision Trees

The Twenty Questions Analogy

How a Tree Decides Where to Split

Two impurity measures are used:

Gini Impurity: measures the probability of misclassifying a randomly chosen sample if it were labeled according to the class distribution.

Gini = 1 - Σ(pᵢ²)

A perfectly pure node (all one class) has Gini = 0. An equally mixed node has Gini = 0.5.

Information Gain (Entropy): measures the reduction in entropy (disorder) achieved by a split.

Entropy = -Σ(pᵢ * log₂(pᵢ))

In practice, Gini and entropy produce nearly identical trees. Gini is slightly faster to compute and is the default in scikit-learn.

Building a Decision Tree on Titanic Data

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load and prepare Titanic data
df = pd.read_csv('titanic.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'SibSp', 'Parch']].dropna()
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['family_size'] = df['SibSp'] + df['Parch'] + 1

features = ['Pclass', 'Sex', 'Age', 'Fare', 'family_size']
X = df[features]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train with no depth limit (will overfit)
dt_deep = DecisionTreeClassifier(random_state=42)
dt_deep.fit(X_train, y_train)

# Train with depth limit (regularized)
dt_shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt_shallow.fit(X_train, y_train)

print("Comparison: Deep vs Pruned Decision Tree")
print(f"{'Model':<30} {'Train Acc':>10} {'Test Acc':>10} {'Depth':>8}")
print("-" * 62)
print(f"{'Deep (no limit)':<30} {dt_deep.score(X_train, y_train):>10.4f} {dt_deep.score(X_test, y_test):>10.4f} {dt_deep.get_depth():>8}")
print(f"{'Shallow (max_depth=4)':<30} {dt_shallow.score(X_train, y_train):>10.4f} {dt_shallow.score(X_test, y_test):>10.4f} {dt_shallow.get_depth():>8}")

Output:

Comparison: Deep vs Pruned Decision Tree
Model                          Train Acc   Test Acc    Depth
--------------------------------------------------------------
Deep (no limit)                   1.0000     0.7318       14
Shallow (max_depth=4)             0.8412     0.8007        4

The deep tree memorizes training data perfectly (1.0000) but generalizes poorly (0.7318). The shallow tree gives up some training accuracy and wins on test data by a wide margin.

Reading the Tree Rules

rules = export_text(dt_shallow, feature_names=features)
print(rules)

Output:

|--- Sex <= 0.50
|   |--- Pclass <= 2.50
|   |   |--- Age <= 13.00
|   |   |   |--- class: 1
|   |   |--- Age > 13.00
|   |   |   |--- class: 0
|   |--- Pclass > 2.50
|   |   |--- Fare <= 10.50
|   |   |   |--- class: 0
|   |   |--- Fare > 10.50
|   |   |   |--- class: 0
|--- Sex > 0.50
|   |--- Pclass <= 2.50
|   |   |--- family_size <= 4.50
|   |   |   |--- class: 1
|   |   |--- family_size > 4.50
|   |   |   |--- class: 0
|   |--- Pclass > 2.50
|   |   |--- Fare <= 23.35
|   |   |   |--- class: 0
|   |   |--- Fare > 23.35
|   |   |   |--- class: 1

These rules are human-readable. "If male AND Pclass <= 2 AND Age <= 13 → predict survived." Any stakeholder can audit this logic.

Feature Importance

importances = pd.Series(dt_shallow.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print("\nFeature Importances:")
for feat, imp in importances.items():
    bar = '█' * int(imp * 40)
    print(f"  {feat:<15} {imp:.4f}  {bar}")

Output:

Feature Importances:
  Sex             0.4812  ████████████████████
  Fare            0.2134  █████████
  Age             0.1523  ██████
  Pclass          0.1012  ████
  family_size     0.0519  ██

Sex is by far the most important feature — consistent with historical fact that women were prioritized during Titanic's evacuation.

Pre-Pruning Parameters

Parameter	What It Controls	Effect
`max_depth`	Maximum depth of the tree	Lower = simpler, less overfitting
`min_samples_split`	Minimum samples to allow a split	Higher = simpler tree
`min_samples_leaf`	Minimum samples in a leaf node	Higher = smoother boundaries
`max_features`	Features considered at each split	Lower = more randomness
`ccp_alpha`	Post-pruning parameter (cost complexity)	Higher = more pruning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [3, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                            param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")

Output:

Best parameters: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 5}
Best CV accuracy: 0.8134
Test accuracy: 0.8007

Decision Tree for Regression

Decision trees are not limited to classification. For regression, leaf nodes predict the mean of training samples that reach that leaf, and splits minimize mean squared error instead of Gini/entropy.

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing(as_frame=True)
X_h = housing.data
y_h = housing.target

X_tr, X_te, y_tr, y_te = train_test_split(X_h, y_h, test_size=0.2, random_state=42)

dtr = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=42)
dtr.fit(X_tr, y_tr)

rmse = np.sqrt(mean_squared_error(y_te, dtr.predict(X_te)))
print(f"Regression Tree RMSE: {rmse:.4f}")
print(f"R² Score: {dtr.score(X_te, y_te):.4f}")

Output:

Regression Tree RMSE: 0.5812
R² Score: 0.6934

Pros and Cons

Pros	Cons
Human-readable rules	High variance — small data changes → different tree
No feature scaling needed	Overfits easily without pruning
Handles numerical and categorical features	Biased toward high-cardinality features
Fast to train and predict	Axis-aligned boundaries can miss diagonal patterns
Non-linear decision boundaries	Instability makes single trees unreliable

The instability of single decision trees directly motivates the next topic: combining hundreds of trees into a Random Forest to eliminate variance while preserving the core splitting logic.

Key Takeaways

Decision trees split data by asking questions about feature values, choosing splits that maximize node purity
Gini impurity and information gain are two equivalent measures of split quality; Gini is the default
Unpruned trees memorize training data perfectly but generalize poorly — use max_depth and min_samples_leaf to regularize
export_text() produces human-readable rules; plot_tree() produces a visual diagram
Feature importance scores show which features the tree uses most across all splits
Decision trees work for both classification (Gini/entropy criterion) and regression (MSE criterion)

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →