Handling Missing Values: The Data Quality Problem Every ML Engineer Faces

Real-world data is messy. In virtually every dataset you'll work with outside of textbooks, some values will be missing. How you handle them can be the difference between a model that works and one that silently fails or produces biased predictions.

Why Data Goes Missing

Understanding why data is missing matters — the reason affects how you should handle it.

Missing Completely at Random (MCAR):
  - The missingness has nothing to do with any variable
  - Example: A sensor randomly fails due to hardware glitch
  - Safe to drop rows: this won't bias results, though it does shrink your sample

Missing at Random (MAR):
  - Missingness depends on other observed variables
  - Example: Men are less likely to report weight → weight missing more for males
  - Can impute using other features

Missing Not at Random (MNAR):
  - Missingness depends on the missing value itself
  - Example: High-income people skip income questions
  - Hardest case — imputation can introduce bias

Getting this wrong — especially confusing MNAR for MCAR — is a common source of model bias that slips through without obvious error messages.
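
To make the difference concrete, here is a minimal simulation sketch. The synthetic lognormal "income" column, the 30% MCAR rate, and the rank-based MNAR rule are all assumptions chosen for illustration, not taken from a real dataset. It compares the mean of the observed values under each mechanism with the true mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, right-skewed "income" values (illustrative assumption)
income = rng.lognormal(mean=10.5, sigma=0.5, size=100_000)
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(income.size) < 0.30
mcar_mean = income[~mcar_missing].mean()

# MNAR: the chance of being missing rises with the income value itself
ranks = income.argsort().argsort() / income.size  # 0 = lowest, close to 1 = highest
mnar_missing = rng.random(income.size) < 0.6 * ranks
mnar_mean = income[~mnar_missing].mean()

# Relative error of the "drop missing rows" estimate under each mechanism
mcar_bias = (mcar_mean - true_mean) / true_mean
mnar_bias = (mnar_mean - true_mean) / true_mean

print(f"MCAR relative bias: {mcar_bias:+.2%}")
print(f"MNAR relative bias: {mnar_bias:+.2%}")
```

Under these assumptions the MCAR estimate should land very close to the true mean, while the MNAR estimate is pulled noticeably downward because high earners are under-represented among the observed rows.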

Detecting Missing Values

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('dataset.csv')

# Basic missing value summary
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # As percentages

# More detailed summary
missing_info = pd.DataFrame({
    'column': df.columns,
    'missing_count': df.isnull().sum().values,
    'missing_pct': (df.isnull().sum() / len(df) * 100).values,
    'dtype': df.dtypes.values
})
missing_info = missing_info[missing_info['missing_count'] > 0].sort_values('missing_pct', ascending=False)
print(missing_info)

# Visualize missing patterns
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Value Heatmap')
plt.show()

The heatmap reveals patterns — if certain rows have many missing columns simultaneously, that's a pattern worth investigating.
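
The same question can be asked numerically. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical) to count missing values per row and to correlate the missingness indicators of each pair of columns. A correlation near 1 means two columns tend to be missing together.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' and 'income' are missing in the same rows
demo = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan, 31, 58],
    'income': [50_000, np.nan, 62_000, np.nan, 48_000, 75_000],
    'city': ['A', 'B', None, 'C', 'A', 'B'],
})

is_missing = demo.isnull()

# How many columns are missing in each row?
missing_per_row = is_missing.sum(axis=1)

# Correlation between missingness indicators of each column pair
co_missing = is_missing.astype(int).corr()

print(missing_per_row.value_counts().sort_index())
print(co_missing.round(2))
```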

Strategy 1: Dropping Missing Values

The simplest approach, but rarely the best.

# Drop rows with ANY missing values
df_no_missing = df.dropna()

# Drop rows missing values in specific columns
df_key_cols = df.dropna(subset=['age', 'income'])

# Drop columns with >50% missing
# (thresh = minimum number of non-null values a column needs to be kept)
threshold = 0.5
df_fewer_cols = df.dropna(axis=1, thresh=int(len(df) * threshold))

# How many rows does dropping cost?
print(f"Original rows: {len(df)}")
print(f"After dropna: {len(df_no_missing)}")
print(f"Data lost: {(1 - len(df_no_missing)/len(df)) * 100:.1f}%")

When to drop:

  - MCAR data where less than 5% of rows are affected
  - The missing feature is not predictive
  - You have enough data that losing rows won't matter

When NOT to drop:

  - You'd lose >10% of your data
  - Missingness is MAR or MNAR (dropping creates bias)
  - The missing column is an important feature
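
One quick check before dropping is whether the missing rate differs across groups of another observed variable, which would point toward MAR rather than MCAR. The sketch below uses a hypothetical toy table echoing the weight-by-sex example above. It is a rough diagnostic under that assumption, not a formal test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: weight is missing more often for one group
demo = pd.DataFrame({
    'sex': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'weight': [np.nan, 82.0, np.nan, np.nan, 61.0, 58.5, np.nan, 70.2],
})

# Share of missing 'weight' values within each group
missing_rate_by_sex = demo['weight'].isnull().groupby(demo['sex']).mean()
print(missing_rate_by_sex)
```

A large gap between groups suggests that dropping rows would under-represent one group, so imputation using the other features is usually the safer route.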

Strategy 2: Simple Imputation

Replace missing values with a summary statistic.

from sklearn.impute import SimpleImputer

# Mean imputation (numerical; best for roughly symmetric distributions without outliers)
# fit_transform returns a 2D array, so .ravel() flattens it to a single column
mean_imputer = SimpleImputer(strategy='mean')
df['age_imputed'] = mean_imputer.fit_transform(df[['age']]).ravel()

# Median imputation (numerical; robust to outliers and skew)
median_imputer = SimpleImputer(strategy='median')
df['income_imputed'] = median_imputer.fit_transform(df[['income']]).ravel()

# Most frequent (categorical, or numerical with a clear mode)
mode_imputer = SimpleImputer(strategy='most_frequent')
df['category_imputed'] = mode_imputer.fit_transform(df[['category']]).ravel()

# Constant fill: useful when "unknown" is meaningful
const_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df['status_imputed'] = const_imputer.fit_transform(df[['status']]).ravel()

Critical: Fit on training data only, then transform both train and test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # fit + transform
X_test_imputed = imputer.transform(X_test)          # transform only (use training stats)

This is one of the most common data leakage mistakes — fitting the imputer on the full dataset leaks test information into training.
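
When a dataset mixes numerical and categorical columns, one way to keep the fit-on-train rule while using a different strategy per column type is scikit-learn's ColumnTransformer. The following is a minimal sketch. The toy DataFrame and its column names are hypothetical, and the median/"Unknown" choices are just one reasonable combination.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with gaps in both numerical and categorical columns
demo = pd.DataFrame({
    'age': [25, np.nan, 40, 31, 58, np.nan, 47, 36],
    'income': [50_000, 61_000, np.nan, 48_000, 75_000, 52_000, np.nan, 66_000],
    'status': ['single', 'married', np.nan, 'single',
               'married', 'single', np.nan, 'married'],
})

train, test = train_test_split(demo, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    # Median for the numerical columns
    ('num', SimpleImputer(strategy='median'), ['age', 'income']),
    # Explicit "Unknown" label for the categorical column
    ('cat', SimpleImputer(strategy='constant', fill_value='Unknown'), ['status']),
])

# Statistics are learned from the training rows only
train_filled = preprocess.fit_transform(train)
# The test rows are filled with those training statistics
test_filled = preprocess.transform(test)

print(train_filled)
print(test_filled)
```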

Strategy 3: KNN Imputation

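Uses the k nearest neighbors to estimate missing values. This is often more accurate than simple imputation when features are related to each other, because each fill is based on similar rows rather than a single global statistic.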

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')
X_imputed = knn_imputer.fit_transform(X)

# KNN imputation is sensitive to scale, so standardize first
# (note: the pipeline output stays in standardized units)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=5))
])

X_imputed = pipeline.fit_transform(X)

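KNN imputation is much slower on large datasets: each row with a missing value is compared against the other rows, so the cost grows roughly as O(n²) in the number of rows. Use it on smaller datasets or after dimensionality reduction.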
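
If you scale before KNN imputation and want the result back in the original units, one option is to undo the scaling afterwards with the scaler's inverse_transform. The sketch below uses a tiny made-up array (two features on very different scales) and is meant as an illustration of the idea rather than a recommended pipeline.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second feature is missing in the middle row
X_demo = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, np.nan],
    [4.0, 800.0],
    [5.0, 1000.0],
])

# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Impute in the scaled space, where both features contribute comparably
X_filled_scaled = KNNImputer(n_neighbors=2).fit_transform(X_scaled)

# Map the completed data back to the original units
X_filled = scaler.inverse_transform(X_filled_scaled)

print(X_filled)
```

Here the two nearest rows to the incomplete one are its immediate neighbors (first feature 2.0 and 4.0), so the filled value is the average of 400 and 800.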

Strategy 4: Iterative Imputation (MICE)

Models each feature with missing values as a function of other features — the most sophisticated approach.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Default: uses BayesianRidge as estimator
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = mice_imputer.fit_transform(X)

# Use RandomForest for non-linear relationships
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42
)
X_imputed = rf_imputer.fit_transform(X)

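MICE (Multiple Imputation by Chained Equations) iterates through features, using all other features to predict each missing one. It is expensive but often produces higher-quality imputations than single-statistic fills. Note that scikit-learn's IterativeImputer is inspired by MICE but returns a single imputed dataset by default rather than multiple imputations.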
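
A rough way to approximate the "multiple" part of MICE with scikit-learn is to run IterativeImputer several times with sample_posterior=True and different seeds, then look at how much the results vary. The sketch below does this on synthetic two-feature data (an assumption for illustration). The number of repeats and the pooled statistic are arbitrary choices, and this is not a full multiple-imputation analysis.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: x2 depends linearly on x1, with about 20% of x2 removed
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=200)
X_demo = np.column_stack([x1, x2])
X_demo[rng.random(200) < 0.2, 1] = np.nan

# Draw several plausible completed datasets instead of a single one
imputations = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10,
                               random_state=seed)
    imputations.append(imputer.fit_transform(X_demo))

stacked = np.stack(imputations)  # shape: (n_imputations, n_rows, n_features)

# Mean of x2 in each completed dataset, and the spread across datasets
x2_means = stacked[:, :, 1].mean(axis=1)
between_var = x2_means.var(ddof=1)

print("x2 mean per imputation:", x2_means.round(3))
print("between-imputation variance:", between_var)
```

The spread across runs gives a sense of how uncertain the imputed values are, which a single deterministic fill hides.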

Strategy 5: Flag and Fill

Instead of pretending imputed values are real, create an indicator for whether a value was missing.

def flag_and_fill(df, columns, fill_strategy='median'):
    df_copy = df.copy()
    
    for col in columns:
        if df[col].isnull().any():
            # Create missingness indicator
            df_copy[f'{col}_was_missing'] = df[col].isnull().astype(int)
            
            # Fill with strategy
            if fill_strategy == 'median':
                fill_val = df[col].median()
            elif fill_strategy == 'mean':
                fill_val = df[col].mean()
            else:
                fill_val = fill_strategy
            
            df_copy[col] = df[col].fillna(fill_val)
    
    return df_copy

# Now the model can learn "when this was missing, what happens?"
df_processed = flag_and_fill(df, ['age', 'income', 'credit_score'])

This is especially valuable when missingness itself is predictive — for example, if credit score is missing more often for high-risk applicants.
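
scikit-learn can produce the same kind of indicator through the add_indicator parameter of SimpleImputer, which also makes it easy to learn the fill value from the training split only. A minimal sketch with made-up arrays (the values are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: columns are [age, income]
X_train_demo = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [40.0, 58_000.0],
    [31.0, 71_000.0],
])
X_test_demo = np.array([
    [np.nan, 45_000.0],
    [52.0, 80_000.0],
])

# add_indicator=True appends one 0/1 column per feature
# that had missing values during fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
train_out = imputer.fit_transform(X_train_demo)
test_out = imputer.transform(X_test_demo)

print(train_out)
print(test_out)
```

Only the age column had gaps at fit time, so the output has three columns: filled age, income, and an "age was missing" flag. The missing test age is filled with the training median.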

Handling Categorical Missing Values

# Option 1: Most frequent
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Option 2: Add explicit "Unknown" category
df['category'] = df['category'].fillna('Unknown')

# Option 3: Treat as a separate category for tree models
# Tree-based models (XGBoost, LightGBM) can handle NaN natively
import lightgbm as lgb
# LightGBM handles missing values internally — no imputation needed

Time Series Missing Values

Time-ordered data has specific patterns — interpolation preserves temporal structure.

# Forward fill: last known value carries forward
# (fillna(method=...) is deprecated in recent pandas; use .ffill() / .bfill())
df['price'] = df['price'].ffill()

# Backward fill: next known value carries backward
# (this uses future data, so avoid it for forecasting features)
df['price'] = df['price'].bfill()

# Linear interpolation between known points
df['price'] = df['price'].interpolate(method='linear')

# Time-weighted interpolation (requires a DatetimeIndex)
df['price'] = df['price'].interpolate(method='time')
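
The difference between linear and time-weighted interpolation only shows up when timestamps are unevenly spaced. A small sketch with a made-up three-point price series (the dates and prices are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical series: one day, then a three-day gap
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
price = pd.Series([100.0, np.nan, 120.0], index=idx)

# 'linear' ignores the index and treats points as equally spaced
by_position = price.interpolate(method='linear')

# 'time' weights by the actual distance between timestamps
by_time = price.interpolate(method='time')

# Forward fill, carrying a value across at most one consecutive gap
carried = price.ffill(limit=1)

print(by_position.iloc[1])  # 110.0, halfway between the neighbors
print(by_time.iloc[1])      # 105.0, one quarter of the way in time
print(carried.iloc[1])      # 100.0, the last known value
```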

Comparing Imputation Strategies

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # must come before IterativeImputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def evaluate_imputation(X, y, imputer, name):
    # The imputer sits inside the pipeline, so it is re-fit on each training fold
    pipeline = Pipeline([
        ('imputer', imputer),
        ('model', LinearRegression())
    ])

    scores = cross_val_score(pipeline, X, y, cv=5,
                              scoring='neg_mean_squared_error')
    # Convert each fold's negative MSE to RMSE, then report mean and spread
    fold_rmse = np.sqrt(-scores)
    print(f"{name}: RMSE = {fold_rmse.mean():.4f} (±{fold_rmse.std():.4f})")

evaluate_imputation(X, y, SimpleImputer(strategy='mean'), "Mean")
evaluate_imputation(X, y, SimpleImputer(strategy='median'), "Median")
evaluate_imputation(X, y, KNNImputer(n_neighbors=5), "KNN")
evaluate_imputation(X, y, IterativeImputer(max_iter=10, random_state=42), "MICE")

Always evaluate imputation strategies against your actual model performance — the "best" imputation method depends on your data structure.

The Decision Framework

Missing values found
        │
        ├── <5% missing, random → Drop rows
        │
        ├── >40% missing in column → Drop the column
        │
        ├── Categorical feature → Mode fill or "Unknown" category
        │
        ├── Numerical, roughly symmetric → Mean imputation
        │
        ├── Numerical, skewed or with outliers → Median imputation
        │
        ├── Complex relationships between features → KNN or MICE
        │
        ├── Time series data → Interpolation or forward fill
        │
        └── Missingness might be predictive → Flag + Fill
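
Part of this framework can be sketched as a small helper that looks at one column and suggests a starting point. The function name suggest_strategy and the skew_cutoff value are hypothetical choices for illustration. The 5% and 40% thresholds come from the tree above. The helper cannot tell whether missingness is random, predictive, or time-dependent, so those branches still need your judgment.

```python
import numpy as np
import pandas as pd

def suggest_strategy(col: pd.Series,
                     drop_rows_below: float = 0.05,
                     drop_col_above: float = 0.40,
                     skew_cutoff: float = 1.0) -> str:
    """Suggest a first imputation strategy for one column (heuristic sketch)."""
    frac_missing = col.isnull().mean()
    if frac_missing == 0:
        return "nothing to do"
    if frac_missing > drop_col_above:
        return "drop the column"
    if frac_missing < drop_rows_below:
        return "drop rows (only if missingness looks random)"
    if not pd.api.types.is_numeric_dtype(col):
        return 'mode fill or "Unknown" category'
    # skew_cutoff is an assumed rule of thumb, not a value from the framework
    if abs(col.skew()) > skew_cutoff:
        return "median imputation"
    return "mean imputation"

# Hypothetical example columns
ages = pd.Series([20, 30, np.nan, 40, 50, 30, 40, 35, 25, 45])
spend = pd.Series([1, 1, 1, 1, 2, 2, 3, 1000, np.nan, 1])
plan = pd.Series(['a', None, 'b', 'a', 'a', 'b', 'a', 'b', 'a', 'b'])
sparse = pd.Series([1.0, np.nan, np.nan, np.nan])

for name, col in [('ages', ages), ('spend', spend),
                  ('plan', plan), ('sparse', sparse)]:
    print(f"{name}: {suggest_strategy(col)}")
```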

Next lesson: Train/Validation/Test Splits — the right way to evaluate models without leaking information.