Data Preprocessing and Cleaning

The Chef Analogy

Imagine a world-class chef trying to cook a gourmet meal. The recipe is perfect. The technique is flawless. But the ingredients? Half the vegetables are rotten, the salt is measured in random amounts, and some spice jars are simply unlabeled. The result, no matter how skilled the chef, will be inedible.

Machine learning models face exactly the same problem. Your algorithm might be state-of-the-art, your computing power unlimited — but if the data going in is dirty, inconsistent, or poorly prepared, the predictions coming out will be worthless. This principle is so fundamental in ML that it has its own name: garbage in, garbage out.

Data preprocessing is the act of turning raw, messy real-world data into clean, structured input that a model can actually learn from. Industry surveys commonly report that data scientists spend most of their time, often quoted as 60–80%, on this step. Understanding it well is the difference between a model that works and one that merely runs.

Why Data Is Almost Always Dirty

Real-world data is collected from sensors, forms, databases, and user inputs — each with its own inconsistencies. Sensors fail and record null values. Users skip optional fields. Prices are entered in dollars in one row and cents in another. Ages are sometimes negative. Duplicate entries appear after system migrations.
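A quick audit makes these problems visible before any cleaning starts. The sketch below uses a small made-up table (the column names and values are assumptions for illustration, not part of the housing dataset used later) and counts duplicate rows, negative ages, and missing ages.

```python
import pandas as pd

# Hypothetical records showing the problems described above
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 27, 27, -5, None],                  # one negative, one missing
    "price": [19.99, 1999.0, 1999.0, 24.50, 18.75],
})

# Rows that repeat an earlier row exactly
n_duplicates = int(records.duplicated().sum())

# Values that are impossible for the field
n_negative_ages = int((records["age"] < 0).sum())

# Values that were never recorded
n_missing_ages = int(records["age"].isna().sum())

print(f"Duplicate rows: {n_duplicates}")
print(f"Negative ages:  {n_negative_ages}")
print(f"Missing ages:   {n_missing_ages}")
```

Checks like these only report problems. Deciding whether to drop, correct, or impute each one is the subject of the steps that follow.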

No clean dataset arrives ready to train. Preprocessing is not optional — it is foundational.

Step 1 — Handling Missing Values

In pandas, missing values usually appear as NaN (Not a Number); None and NaT are treated as missing too. The first job is to find them.

import pandas as pd
import numpy as np

# Load a sample housing dataset
df = pd.read_csv('housing.csv')

# Detect missing values
print(df.isnull().sum())
print(f"\nTotal missing: {df.isnull().sum().sum()}")
print(f"Missing percentage:\n{(df.isnull().sum() / len(df) * 100).round(2)}")

Output:

LotArea        0
YearBuilt      0
GarageArea    81
BedroomAbvGr   8
SalePrice      0

Total missing: 89
Missing percentage:
GarageArea    5.55
BedroomAbvGr  0.55

Once you know where the gaps are, you choose a strategy:

Strategy	When to Use	Code
Drop rows	Very few rows affected (<1%), row is unrecoverable	`df.dropna()`
Drop column	>40% of column is missing	`df.drop(columns=['col'])`
Mean imputation	Numerical, roughly symmetric distribution	`df['col'].fillna(df['col'].mean())`
Median imputation	Numerical, skewed distribution or outliers present	`df['col'].fillna(df['col'].median())`
Mode imputation	Categorical columns	`df['col'].fillna(df['col'].mode()[0])`
Predictive imputation	Critical column, enough related features	Train a regression model on the rows where the column is present

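from sklearn.impute import SimpleImputer

# Median imputation for GarageArea (skewed distribution)
imputer = SimpleImputer(strategy='median')
# fit_transform returns a 2D array of shape (n_rows, 1); ravel() flattens it to one column
df['GarageArea'] = imputer.fit_transform(df[['GarageArea']]).ravel()

# Mode imputation for categorical; assign the result back instead of using
# inplace=True on a column selection, which newer pandas versions warn about
df['Neighborhood'] = df['Neighborhood'].fillna(df['Neighborhood'].mode()[0])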
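The last row of the strategy table, predictive imputation, can be sketched as follows. This is a minimal illustration under stated assumptions: the five rows are made-up values, and a single-feature LinearRegression stands in for whatever model you would actually train. The model is fit on rows where GarageArea is known and then predicts the missing entry.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (illustrative values): GarageArea is missing in the last row
homes = pd.DataFrame({
    "LotArea":    [6000.0, 8000.0, 10000.0, 12000.0, 11000.0],
    "GarageArea": [300.0, 400.0, 500.0, 600.0, np.nan],
})

known = homes[homes["GarageArea"].notna()]
missing = homes[homes["GarageArea"].isna()]

# Learn GarageArea from LotArea using only the complete rows
model = LinearRegression().fit(known[["LotArea"]], known["GarageArea"])

# Fill each gap with the model's prediction for that row
homes.loc[missing.index, "GarageArea"] = model.predict(missing[["LotArea"]])

print(homes)
```

Median imputation would have filled the gap with 450 regardless of lot size. The regression uses the related feature and lands near 550 for this larger lot.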

Step 2 — Handling Outliers

An outlier is a data point that sits far from the rest. A house listed at $50,000,000 in a neighborhood where prices average $300,000 is either a data entry error or a genuine luxury anomaly. Either way, it distorts what the model learns.

IQR Method (Interquartile Range):

Q1 = df['SalePrice'].quantile(0.25)
Q3 = df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['SalePrice'] < lower_bound) | (df['SalePrice'] > upper_bound)]
print(f"Outliers found: {len(outliers)}")

# Remove outliers
df_clean = df[(df['SalePrice'] >= lower_bound) & (df['SalePrice'] <= upper_bound)]

Z-Score Method:

from scipy import stats

z_scores = np.abs(stats.zscore(df['SalePrice']))
df_clean = df[z_scores < 3]  # Keep rows within 3 standard deviations
print(f"Rows removed: {len(df) - len(df_clean)}")

Output:

Outliers found: 23
Rows removed: 19

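The IQR method is more robust to extreme values because quartiles barely move when a few points are extreme, whereas the mean and standard deviation behind the Z-score are themselves pulled toward the outliers. Z-score works well when the distribution is approximately normal.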
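A small experiment illustrates the difference in robustness. The eight values below are made up for illustration and contain one obvious error. The sketch applies both rules with NumPy only and counts how many points each one flags.

```python
import numpy as np

# Hypothetical sample: seven ordinary values and one clear error
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 1000.0])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = np.abs((values - values.mean()) / values.std())
z_flags = z_scores > 3

print(f"Flagged by IQR:     {int(iqr_flags.sum())}")
print(f"Flagged by Z-score: {int(z_flags.sum())}")
print(f"Largest Z-score:    {z_scores.max():.2f}")
```

The extreme value inflates the standard deviation so much that its own Z-score stays below 3, so the Z-score rule flags nothing, while the IQR fences still catch it. On small samples in particular, the 3-sigma threshold can be unreachable.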

Step 3 — Encoding Categorical Variables

Models need numbers. Categories like "Red", "Blue", "Green" or "High", "Medium", "Low" must be converted.

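Label Encoding assigns an integer to each category. Use this for ordinal data, where order matters, and define the order explicitly: scikit-learn's LabelEncoder is intended for target labels and numbers categories alphabetically, which would not preserve the ranking below.

# Ordinal: Quality has a meaningful order, so map it explicitly
# (values missing from the mapping become NaN)
quality_order = {'Low': 0, 'Medium': 1, 'High': 2, 'Very High': 3}
df['QualityEncoded'] = df['Quality'].map(quality_order)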

One-Hot Encoding creates a binary column for each category. Use this for nominal data — where no order exists.

# Nominal: Neighborhood names have no inherent order
df = pd.get_dummies(df, columns=['Neighborhood'], drop_first=True)
# Adds (number_of_unique_neighborhoods - 1) dummy columns and drops the original column
print(df.shape)

Output:

(1441, 47)

Never apply label encoding to nominal data. If you encode "Cat=0, Dog=1, Fish=2", the model treats the codes as magnitudes and incorrectly learns that Fish is twice as large as Dog.
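For the ordinal case, the order has to come from you rather than from the encoder. The sketch below, using the quality levels from the example above, compares scikit-learn's LabelEncoder, which sorts categories alphabetically, with OrdinalEncoder given an explicit category list. Using OrdinalEncoder here is a suggested alternative to the dictionary mapping, not something the lesson's code relies on.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

quality = np.array(["Low", "Medium", "High", "Very High"])

# LabelEncoder numbers categories in sorted (alphabetical) order:
# High=0, Low=1, Medium=2, Very High=3
alphabetical = LabelEncoder().fit_transform(quality)

# OrdinalEncoder accepts the intended ranking explicitly
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High", "Very High"]])
ranked = encoder.fit_transform(quality.reshape(-1, 1)).ravel()

print("LabelEncoder:  ", alphabetical.tolist())
print("OrdinalEncoder:", ranked.tolist())
```

With alphabetical codes, "High" ends up below "Low", which silently reverses part of the ranking the feature is supposed to carry.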

Step 4 — Feature Scaling

Different features live on different scales. Square footage might range from 500 to 5000, while the number of bedrooms ranges from 1 to 6. Without scaling, distance-based and gradient-based models can let square footage dominate simply because its numbers are roughly a thousand times larger, not because it matters more.
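The effect is easy to see with Euclidean distance, which nearest-neighbor and clustering methods rely on. The three houses below are made-up values for illustration. The sketch compares distances before and after rescaling each column with the ranges quoted above.

```python
import numpy as np

# Columns: square footage, bedrooms (hypothetical houses)
houses = np.array([
    [1500.0, 2.0],   # A
    [1520.0, 6.0],   # B: nearly the same size as A, four more bedrooms
    [2500.0, 2.0],   # C: same bedrooms as A, 1000 sq ft larger
])

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

raw_ab = distance(houses[0], houses[1])
raw_ac = distance(houses[0], houses[2])

# Rescale each column to [0, 1] using the ranges from the text
mins = np.array([500.0, 1.0])
maxs = np.array([5000.0, 6.0])
scaled = (houses - mins) / (maxs - mins)

scaled_ab = distance(scaled[0], scaled[1])
scaled_ac = distance(scaled[0], scaled[2])

print(f"Raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"Scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw numbers, B looks like A's near twin because bedrooms barely register. After scaling, the four-bedroom gap counts for more than the difference in floor area, and C becomes the closer neighbor.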

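StandardScaler transforms each feature to have mean=0 and standard deviation=1. It does not require normally distributed data; it suits algorithms that are sensitive to feature scale or variance, such as regularized linear regression, SVM, and PCA.

MinMaxScaler compresses values into [0, 1]. It is a common choice for neural networks and when you need bounded values, but a single extreme value squeezes the rest of the column into a narrow band.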

from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler_std = StandardScaler()
scaler_mm = MinMaxScaler()

# Before scaling
print("Before scaling:")
print(df[['LotArea', 'GarageArea']].describe())

df_std = scaler_std.fit_transform(df[['LotArea', 'GarageArea']])
df_mm = scaler_mm.fit_transform(df[['LotArea', 'GarageArea']])

print("\nAfter StandardScaler (first 3 rows):")
print(df_std[:3])

print("\nAfter MinMaxScaler (first 3 rows):")
print(df_mm[:3])

Output:

Before scaling:
         LotArea  GarageArea
mean    10516.83      472.98
std      9981.26      213.80
min      1300.00        0.00
max    215245.00     1418.00

After StandardScaler (first 3 rows):
[[-0.21  0.38]
 [-0.09 -0.06]
 [-0.09  0.63]]

After MinMaxScaler (first 3 rows):
[[0.046  0.335]
 [0.053  0.315]
 [0.053  0.360]]
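Both transforms are simple enough to check by hand: standardization computes (x - mean) / std, and min-max scaling computes (x - min) / (max - min). The sketch below applies the two formulas with NumPy to four made-up square-footage values.

```python
import numpy as np

# Illustrative square-footage values
sqft = np.array([500.0, 2000.0, 3500.0, 5000.0])

# Standardization: subtract the mean, divide by the standard deviation.
# np.std uses the population formula (ddof=0), as StandardScaler does.
standardized = (sqft - sqft.mean()) / sqft.std()

# Min-max scaling: shift by the minimum, divide by the range
min_max = (sqft - sqft.min()) / (sqft.max() - sqft.min())

print("Standardized:", standardized.round(3))
print("Min-max:     ", min_max.round(3))
```

The standardized column has mean 0 and standard deviation 1, and the min-max column runs from exactly 0 to exactly 1 with the two middle values at one third and two thirds.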

Step 5 — Building a Pipeline

In production, preprocessing steps must be applied consistently to both training and test data. Scikit-learn's Pipeline chains steps together, preventing data leakage and ensuring reproducibility.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

numerical_features = ['LotArea', 'GarageArea', 'YearBuilt']
categorical_features = ['Neighborhood', 'HouseStyle']

# Start from the raw data: the pipeline does its own imputation and encoding
df = pd.read_csv('housing.csv')
X = df[numerical_features + categorical_features]
y = df['SalePrice']

# Hold out a test set (the 80/20 split and the seed are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define transformers
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, numerical_features),
    ('cat', cat_transformer, categorical_features)
])

# Full pipeline with model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Fit and predict: preprocessing happens automatically
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(f"R² Score: {pipeline.score(X_test, y_test):.4f}")

Output:

R² Score: 0.7823

The pipeline ensures your scaler is only fit on training data, then applied to test data — preventing information leakage that inflates performance metrics.
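One way to convince yourself of this is to inspect what the fitted preprocessor has stored. The sketch below is a reduced version of the preprocessor above, run on a few made-up rows (the data values and the "East" neighborhood are assumptions for illustration). It checks that the scaler's mean comes from the training rows alone and shows how a category never seen during fitting is encoded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test rows
train = pd.DataFrame({
    "LotArea": [8000.0, 9500.0, np.nan, 12000.0],
    "Neighborhood": ["North", "South", "North", "South"],
})
test = pd.DataFrame({
    "LotArea": [10000.0],
    "Neighborhood": ["East"],   # not present in the training data
})

num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to print
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, ["LotArea"]),
        ("cat", cat_transformer, ["Neighborhood"]),
    ],
    sparse_threshold=0,
)

# Statistics (median, mean, std, categories) are learned from train only
preprocessor.fit(train)
scaler = preprocessor.named_transformers_["num"].named_steps["scaler"]
train_mean = float(scaler.mean_[0])

# The test row is transformed with those stored statistics
test_row = preprocessor.transform(test)

print(f"Mean stored by the scaler: {train_mean}")
print(f"Transformed test row: {test_row}")
```

The stored mean is 9750, the average of the training values after the missing entry is filled with the training median of 9500; the test value of 10000 plays no part in it. The unseen neighborhood becomes a row of zeros in the one-hot columns rather than raising an error, which is what handle_unknown='ignore' is for.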

Key Takeaways

Always check for missing values first with isnull().sum()
Use median imputation for skewed numerical columns, mode for categorical
Remove outliers with IQR or Z-score before training
Label encode ordinal features, one-hot encode nominal features
Scale features when using distance-based or gradient-based algorithms
Wrap everything in a Pipeline to avoid data leakage and ensure reproducibility

Preprocessing is unglamorous but non-negotiable. The best model architecture cannot compensate for bad input data.




