AiTechWorlds
Imagine a world-class chef trying to cook a gourmet meal. The recipe is perfect. The technique is flawless. But the ingredients? Half the vegetables are rotten, the salt is measured in random amounts, and some spice jars are simply unlabeled. The result, no matter how skilled the chef, will be inedible.
Machine learning models face exactly the same problem. Your algorithm might be state-of-the-art, your computing power unlimited — but if the data going in is dirty, inconsistent, or poorly prepared, the predictions coming out will be worthless. This principle is so fundamental in ML that it has its own name: garbage in, garbage out.
Data preprocessing is the act of turning raw, messy real-world data into clean, structured input that a model can actually learn from. Industry surveys commonly put the share of a data scientist's time spent on this step at 60–80%. Understanding it well is the difference between a model that works and one that merely runs.
Real-world data is collected from sensors, forms, databases, and user inputs — each with its own inconsistencies. Sensors fail and record null values. Users skip optional fields. Prices are entered in dollars in one row and cents in another. Ages are sometimes negative. Duplicate entries appear after system migrations.
No clean dataset arrives ready to train. Preprocessing is not optional — it is foundational.
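As a minimal sketch of what checking for these problems can look like, the snippet below builds a small, entirely hypothetical table (the column names and values are assumptions, not part of the housing dataset) and looks for duplicate rows and impossible ages.

```python
import pandas as pd

# Hypothetical toy records illustrating the problems described above
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, -5, 41],
    "price": [19.99, 5.00, 5.00, 12.50, 7.50],
})

# Exact duplicate rows, e.g. left over from a system migration
n_duplicates = int(raw.duplicated().sum())

# Values that are impossible for the field
invalid_age = raw["age"] < 0

# Drop the duplicates first, then the rows with an impossible age
cleaned = raw.drop_duplicates()
cleaned = cleaned[cleaned["age"] >= 0].reset_index(drop=True)

print(f"Duplicate rows: {n_duplicates}")
print(f"Rows with negative age: {int(invalid_age.sum())}")
print(cleaned)
```

Whether an invalid row should be dropped or corrected depends on the dataset; dropping is simply the shortest option to show here.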
Missing values are recorded as NaN (Not a Number) in pandas. The first job is to find them.
import pandas as pd
import numpy as np
# Load a sample housing dataset
df = pd.read_csv('housing.csv')
# Detect missing values
print(df.isnull().sum())
print(f"\nTotal missing: {df.isnull().sum().sum()}")
print(f"Missing percentage:\n{(df.isnull().sum() / len(df) * 100).round(2)}")
Output:
LotArea 0
YearBuilt 0
GarageArea 81
BedroomAbvGr 8
SalePrice 0
Total missing: 89
Missing percentage:
GarageArea 5.55
BedroomAbvGr 0.55
Once you know where the gaps are, you choose a strategy:
| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | Very few rows missing (<1%), row is unrecoverable | df.dropna() |
| Drop column | >40% of column is missing | df.drop(columns=['col']) |
| Mean imputation | Numerical, roughly symmetric distribution | df.fillna(df.mean(numeric_only=True)) |
| Median imputation | Numerical, skewed distribution or outliers present | df.fillna(df.median(numeric_only=True)) |
| Mode imputation | Categorical columns | df.fillna(df.mode().iloc[0]) |
| Predictive imputation | Critical column, enough related features | Train a regression model |
from sklearn.impute import SimpleImputer
# Median imputation for GarageArea (skewed distribution)
imputer = SimpleImputer(strategy='median')
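# fit_transform returns a 2D array, so flatten it before assigning to one column
df['GarageArea'] = imputer.fit_transform(df[['GarageArea']]).ravel()
# Mode imputation for categorical (assign back instead of using inplace on a column)
df['Neighborhood'] = df['Neighborhood'].fillna(df['Neighborhood'].mode()[0])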
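The predictive-imputation row of the table can be sketched roughly as follows. This is an illustration on synthetic data, not the housing dataset: the linear relationship between LotArea and GarageArea, the noise level, and the number of knocked-out values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: GarageArea loosely follows LotArea (an assumption for illustration)
rng = np.random.default_rng(0)
lot_area = rng.uniform(5000, 15000, size=200)
garage_area = 0.04 * lot_area + rng.normal(0, 20, size=200)
toy = pd.DataFrame({"LotArea": lot_area, "GarageArea": garage_area})

# Knock out the first 20 GarageArea values to simulate gaps
toy.loc[toy.index[:20], "GarageArea"] = np.nan

known = toy[toy["GarageArea"].notna()]
missing = toy[toy["GarageArea"].isna()]

# Fit on rows where the column is known, then predict the missing rows
model = LinearRegression().fit(known[["LotArea"]], known["GarageArea"])
toy.loc[missing.index, "GarageArea"] = model.predict(missing[["LotArea"]])

print(f"Remaining missing values: {toy['GarageArea'].isna().sum()}")
```

In a real project the regression should be fit on training rows only, for the same leakage reasons that apply to any other fitted preprocessing step.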
An outlier is a data point that sits far from the rest. A house listed at $50,000,000 in a neighborhood where prices average $300,000 is either a data entry error or a genuine luxury anomaly. Either way, it distorts what the model learns.
IQR Method (Interquartile Range):
Q1 = df['SalePrice'].quantile(0.25)
Q3 = df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['SalePrice'] < lower_bound) | (df['SalePrice'] > upper_bound)]
print(f"Outliers found: {len(outliers)}")
# Remove outliers
df_clean = df[(df['SalePrice'] >= lower_bound) & (df['SalePrice'] <= upper_bound)]
Z-Score Method:
from scipy import stats
z_scores = np.abs(stats.zscore(df['SalePrice']))
df_clean = df[z_scores < 3] # Keep rows within 3 standard deviations
print(f"Rows removed: {len(df) - len(df_clean)}")
Output:
Outliers found: 23
Rows removed: 19
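The IQR method is more robust to extreme values because quartiles barely shift when a few points are extreme, whereas the mean and standard deviation behind the z-score are pulled toward them. Z-score works well when the distribution is approximately normal.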
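The difference between the two methods shows up clearly on a very small sample. The sketch below uses made-up numbers with one obviously extreme value and applies both rules with the same thresholds as above (3 standard deviations, 1.5 × IQR).

```python
import numpy as np

# Made-up sample: eight ordinary values and one extreme one
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 1000], dtype=float)

# Z-score: the extreme value inflates the mean and std it is judged against
z_scores = np.abs((values - values.mean()) / values.std())
z_flags = z_scores > 3

# IQR: the quartiles are unaffected by how extreme the largest value is
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(f"Flagged by z-score: {int(z_flags.sum())}")
print(f"Flagged by IQR: {int(iqr_flags.sum())}")
```

On this sample the z-score rule flags nothing, because with only nine points a single value cannot sit more than about 2.83 standard deviations from the mean, while the IQR rule flags the 1000.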
Models need numbers. Categories like "Red", "Blue", "Green" or "High", "Medium", "Low" must be converted.
Label Encoding assigns an integer to each category. Use this for ordinal data — where order matters.
# Ordinal: Quality has a meaningful order, so map it explicitly.
# (sklearn's LabelEncoder sorts categories alphabetically and would not preserve this order.)
quality_order = {'Low': 0, 'Medium': 1, 'High': 2, 'Very High': 3}
df['QualityEncoded'] = df['Quality'].map(quality_order)
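An alternative to the hand-written mapping, which fits more easily into a scikit-learn pipeline, is OrdinalEncoder with the category order passed in explicitly. The sketch below uses a small made-up Quality column; the category names follow the mapping above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample column for illustration
toy = pd.DataFrame({"Quality": ["Low", "High", "Medium", "Very High", "Low"]})

# Pass the categories in their intended order; the default would sort them alphabetically
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High", "Very High"]])
toy["QualityEncoded"] = encoder.fit_transform(toy[["Quality"]]).ravel().astype(int)

print(toy)
```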
One-Hot Encoding creates a binary column for each category. Use this for nominal data — where no order exists.
# Nominal: Neighborhood names have no inherent order
df = pd.get_dummies(df, columns=['Neighborhood'], drop_first=True)
print(df.shape) # Shape grows by (number_of_unique_neighborhoods - 1)
Output:
(1441, 47)
Avoid label encoding for nominal data. If you encode "Cat=0, Dog=1, Fish=2", a linear or distance-based model treats the integers as magnitudes and incorrectly learns that Fish is twice as large as Dog.
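For the pet example, one-hot encoding gives each category its own column, so no ordering or magnitude is implied. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up nominal column for illustration
pets = pd.DataFrame({"Pet": ["Cat", "Dog", "Fish", "Dog"]})

# One binary column per category; dtype=int prints 0/1 instead of True/False
encoded = pd.get_dummies(pets, columns=["Pet"], dtype=int)

print(encoded)
```

Each row has exactly one 1, and no column is numerically "larger" than another.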
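Different features live on different scales. Square footage might range from 500 to 5000, while the number of bedrooms ranges from 1 to 6. Without scaling, distance-based and gradient-based models can let square footage dominate simply because its numbers are roughly 1000 times larger; tree-based models are largely unaffected.
StandardScaler transforms each feature to have mean=0 and standard deviation=1. It is a sensible default for algorithms that are sensitive to feature scale, such as regularized linear regression, SVM, and PCA. It does not require the data to be normally distributed.
MinMaxScaler compresses values into [0, 1]. It suits neural networks and cases where inputs must be bounded, but a single extreme value can squeeze the remaining values into a narrow band.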
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler_std = StandardScaler()
scaler_mm = MinMaxScaler()
# Before scaling
print("Before scaling:")
print(df[['LotArea', 'GarageArea']].describe())
df_std = scaler_std.fit_transform(df[['LotArea', 'GarageArea']])
df_mm = scaler_mm.fit_transform(df[['LotArea', 'GarageArea']])
print("\nAfter StandardScaler (first 3 rows):")
print(df_std[:3])
print("\nAfter MinMaxScaler (first 3 rows):")
print(df_mm[:3])
Output:
Before scaling:
LotArea GarageArea
mean 10516.83 472.98
std 9981.26 213.80
min 1300.00 0.00
max 215245.00 1418.00
After StandardScaler (first 3 rows):
[[-0.21 0.38]
[-0.09 -0.06]
[-0.09 0.63]]
After MinMaxScaler (first 3 rows):
[[0.046 0.335]
[0.053 0.315]
[0.053 0.360]]
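Both scalers reduce to simple column-wise formulas: (x - mean) / std for StandardScaler and (x - min) / (max - min) for MinMaxScaler. The sketch below checks this on a tiny made-up array (the values are assumptions chosen to echo the square-footage and bedroom ranges mentioned earlier).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: column 0 resembles square footage, column 1 bedrooms
X = np.array([[500.0, 1.0], [1500.0, 3.0], [5000.0, 6.0]])

# StandardScaler uses the population standard deviation, which is NumPy's default
manual_std = (X - X.mean(axis=0)) / X.std(axis=0)
manual_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

std_matches = np.allclose(manual_std, StandardScaler().fit_transform(X))
mm_matches = np.allclose(manual_mm, MinMaxScaler().fit_transform(X))

print(f"StandardScaler matches manual formula: {std_matches}")
print(f"MinMaxScaler matches manual formula: {mm_matches}")
```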
In production, preprocessing steps must be applied consistently to both training and test data. Scikit-learn's Pipeline chains steps together, preventing data leakage and ensuring reproducibility.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
numerical_features = ['LotArea', 'GarageArea', 'YearBuilt']
categorical_features = ['Neighborhood', 'HouseStyle']
# Define transformers
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine into ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, numerical_features),
    ('cat', cat_transformer, categorical_features)
])
# Full pipeline with model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])
# Split the raw (not yet encoded or scaled) data before fitting anything.
# test_size and random_state are illustrative choices.
X = df[numerical_features + categorical_features]
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and predict: preprocessing happens automatically
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(f"R² Score: {pipeline.score(X_test, y_test):.4f}")
Output:
R² Score: 0.7823
The pipeline ensures your scaler is only fit on training data, then applied to test data — preventing information leakage that inflates performance metrics.
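To make the leakage point concrete, the sketch below compares a scaler fit on all rows with one fit on training rows only. The data is synthetic and the split parameters are arbitrary assumptions; the point is only that the two scalers learn different statistics, and that the leaky one has seen the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data for illustration
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=15.0, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: statistics computed from every row, including the test set
leaky_scaler = StandardScaler().fit(X)

# Correct: statistics computed from training rows only, then reused on the test set
clean_scaler = StandardScaler().fit(X_train)
X_test_scaled = clean_scaler.transform(X_test)

print(f"Mean seen by leaky scaler: {leaky_scaler.mean_[0]:.3f}")
print(f"Mean seen by clean scaler: {clean_scaler.mean_[0]:.3f}")
```

Wrapping the scaler in a Pipeline gives the second behavior automatically, including inside cross-validation, where each fold refits the preprocessing on its own training portion.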
Detect missing values with isnull().sum().
Use Pipeline to avoid data leakage and ensure reproducibility.
Preprocessing is unglamorous but non-negotiable. The best model architecture cannot compensate for bad input data.