Exploratory Data Analysis (EDA)

EDA is what separates ML practitioners who build models that actually work from those who just run code. Before you train a single model, you must understand your data deeply — its shape, distributions, relationships, and problems.

Skip EDA and you'll build models on flawed assumptions. Do EDA well and you'll know exactly what you're working with before writing a single training line.

The EDA Questions You Must Answer

Before any model, answer these 7 questions about your dataset:

What does each column represent? What are the units?
What's the range and distribution of each numeric feature?
Which columns have missing values, and how many?
Are there obvious outliers that could distort the model?
Which features correlate with the target variable?
Which features correlate with each other (multicollinearity)?
Are there data quality issues (inconsistent formatting, typos, impossible values)?
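The last question is the easiest to skip, so here is a minimal sketch of two quick checks. The frame, its column names, and its values are made up for illustration; on a real dataset you would swap in your own columns and validity rules.

```python
import pandas as pd

# Tiny made-up frame standing in for a real dataset (all values are illustrative)
sample = pd.DataFrame({
    "city": ["Austin", "austin ", "AUSTIN", "Dallas"],
    "bedrooms": [3, -1, 2, 4],
    "size": [1500, 1800, 0, 2200],
})

# Inconsistent formatting: normalize the text, then compare unique counts
raw_unique = sample["city"].nunique()
clean_unique = sample["city"].str.strip().str.lower().nunique()
print(f"city: {raw_unique} raw labels, {clean_unique} after normalizing")

# Impossible values: bedroom counts and sizes should be strictly positive
impossible = sample[(sample["bedrooms"] <= 0) | (sample["size"] <= 0)]
print(f"Rows with impossible values: {len(impossible)}")
```

A large drop between the raw and normalized label counts usually means the same category was typed several ways.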

Step 1: Load and First Look

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('housing.csv')

# Basic shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# First 5 rows
print(df.head())

# Column types and non-null counts (info() prints directly and returns None)
df.info()

# Statistical summary for numeric columns
print(df.describe())

describe() output tells you a lot at a glance:

         size      bedrooms      price
count  1000.0      1000.0       1000.0
mean   1842.3         3.2     342500.0
std     612.8         0.9     125000.0
min     650.0         1.0      95000.0
25%    1350.0         2.0     245000.0
50%    1790.0         3.0     330000.0
75%    2350.0         4.0     425000.0
max    5200.0         7.0     950000.0

Red flags: large gap between 75th percentile and max (possible outlier), min = 0 for something that shouldn't be zero, or mean far from median (skewed distribution).
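These red flags can also be checked programmatically. The sketch below is one possible way to do it: the function name and both thresholds (`skew_ratio`, `tail_ratio`) are arbitrary choices for illustration, not standard values, and the demo frame is invented.

```python
import numpy as np
import pandas as pd

def flag_summary_red_flags(df, skew_ratio=0.25, tail_ratio=3.0):
    """Flag numeric columns whose summary stats hint at skew or extreme maxima.

    skew_ratio and tail_ratio are illustrative thresholds, not standards.
    """
    stats = df.select_dtypes(include=[np.number]).describe().T
    iqr = stats["75%"] - stats["25%"]
    flags = pd.DataFrame({
        # Mean far from median, measured in standard deviations
        "mean_far_from_median":
            (stats["mean"] - stats["50%"]).abs() > skew_ratio * stats["std"],
        # Max far above the 75th percentile, measured in IQRs
        "max_far_above_q3": (stats["max"] - stats["75%"]) > tail_ratio * iqr,
        # A minimum of exactly zero deserves a manual look
        "min_is_zero": stats["min"] == 0,
    })
    # Keep only columns that trip at least one flag
    return flags[flags.any(axis=1)]

# Made-up data: 'size' contains one extreme value, 'bedrooms' does not
demo = pd.DataFrame({
    "size": [1000, 1100, 1200, 1300, 1400, 9000],
    "bedrooms": [2, 3, 3, 3, 4, 4],
})
flagged = flag_summary_red_flags(demo)
print(flagged)
```

A flagged column is only a prompt to look closer; whether the extreme value is an error or a genuine mansion still needs a human decision.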

Step 2: Missing Values Analysis

# Count and percentage of missing values
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': df.isnull().sum() / len(df) * 100
})
missing = missing[missing['count'] > 0].sort_values('percent', ascending=False)
print(missing)

# Visualize missing values
# missingno is a third-party package: pip install missingno
import missingno as msno
msno.matrix(df)
plt.show()

Decision rules for missing values:

< 5% missing: safe to drop rows OR fill with median/mode
5–20% missing: fill with median (numeric) or mode (categorical), or use imputation
> 40% missing: consider dropping the column entirely
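As a rough sketch, the rules can be turned into a helper that labels each column. The function name and the demo data are hypothetical, the last rule is read as "more than 40%", and the 20–40% band, which the rules leave open, is marked as a judgment call rather than assigned a strategy.

```python
import numpy as np
import pandas as pd

def suggest_missing_strategy(df):
    """Map each column's missing percentage to the rules of thumb above."""
    pct = df.isnull().mean() * 100

    def rule(p):
        if p == 0:
            return "complete"
        if p < 5:
            return "drop rows or fill with median/mode"
        if p <= 20:
            return "fill or impute"
        if p > 40:
            return "consider dropping column"
        # The rules above do not cover this band
        return "no rule given; judge case by case"

    return pd.DataFrame({"percent": pct, "suggestion": pct.map(rule)})

def with_missing(n_missing, n=100):
    """Build a made-up numeric column with n_missing NaN values."""
    values = np.arange(n, dtype=float)
    values[:n_missing] = np.nan
    return values

demo = pd.DataFrame({
    "a": with_missing(2),
    "b": with_missing(10),
    "c": with_missing(30),
    "d": with_missing(50),
    "e": with_missing(0),
})
report = suggest_missing_strategy(demo)
print(report)
```

Treat the output as a starting point: a column that is 50% missing can still be worth keeping if the missingness itself carries signal.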

Step 3: Distribution Analysis

# Histograms for all numeric columns
df.select_dtypes(include=[np.number]).hist(
    bins=30, figsize=(15, 10), color='steelblue', edgecolor='white'
)
plt.tight_layout()
plt.show()

# Check for skewness
skewness = df.select_dtypes(include=[np.number]).skew()
print("Highly skewed columns (|skew| > 1):")
print(skewness[abs(skewness) > 1])

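Why skewness matters: Linear models and distance-based algorithms (KNN, SVM) tend to behave better when features are roughly symmetric, because a long tail lets a few extreme values dominate coefficients and distances. They do not strictly require normally distributed features, but highly skewed features often benefit from a log transformation.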

# Log transform a skewed feature
df['price_log'] = np.log1p(df['price'])  # log1p handles zeros safely

# Compare distribution before/after
# (Series.hist has no title argument, so set titles on the axes)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
df['price'].hist(bins=50, ax=ax1)
ax1.set_title('Price (original)')
df['price_log'].hist(bins=50, ax=ax2)
ax2.set_title('Price (log-transformed)')
plt.show()

Step 4: Outlier Detection

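# Box plots to visualize outliers
# One panel per feature: on a shared axis, 'size' would flatten the others
df[['size', 'bedrooms', 'bathrooms']].plot(
    kind='box', subplots=True, layout=(1, 3), figsize=(12, 6), sharey=False
)
plt.suptitle('Box plots for numeric features')
plt.show()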

# IQR method to identify outliers
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    print(f"{column}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
    return outliers

detect_outliers_iqr(df, 'price')
detect_outliers_iqr(df, 'size')
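Detecting outliers is only half the job; you also have to decide what to do with them. One common option is capping values at chosen percentiles (sometimes called winsorizing). The sketch below assumes the 1st and 99th percentiles as cutoffs and uses invented prices; the helper name is hypothetical.

```python
import numpy as np
import pandas as pd

def cap_at_percentiles(series, lower=0.01, upper=0.99):
    """Clip values outside the given quantiles instead of dropping rows."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)

# Made-up prices: 99 ordinary values plus one extreme value
prices = pd.Series(np.append(np.linspace(100_000, 500_000, 99), 5_000_000))
capped = cap_at_percentiles(prices)

print(f"Max before: {prices.max():,.0f}")
print(f"Max after:  {capped.max():,.0f}")
```

Capping keeps every row, which matters on small datasets, but it also distorts the tail, so drop or cap only after checking whether the extreme values are errors or real observations.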

Step 5: Correlation Analysis

# Correlation with target variable
correlations = df.select_dtypes(include=[np.number]).corr()['price'].sort_values(ascending=False)
print("Correlations with price:")
print(correlations)

# Full correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    df.select_dtypes(include=[np.number]).corr(),
    annot=True, fmt='.2f', cmap='coolwarm',
    center=0, vmin=-1, vmax=1
)
plt.title('Feature Correlation Matrix')
plt.show()

What to look for:

Features with high correlation to target → likely useful predictors
Features with high correlation to each other → multicollinearity (can cause issues in linear models)
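Features with near-zero correlation to target → candidates for dropping, but check a scatter plot first: Pearson correlation only measures linear relationships, so a non-linear predictor can still score near zero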
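A heatmap gets hard to read once there are more than a dozen features. As a sketch, the multicollinearity check can be reduced to a list of feature pairs above a cutoff. The 0.8 threshold, the function name, and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the upper triangle so each pair appears once (no self-pairs)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: total_rooms is built from size, age is independent
rng = np.random.default_rng(0)
size = rng.normal(1800, 600, 200)
demo = pd.DataFrame({
    "size": size,
    "total_rooms": size / 300 + rng.normal(0, 0.3, 200),
    "age": rng.normal(30, 10, 200),
})
pairs = correlated_pairs(demo)
print(pairs)
```

For each pair returned, a common approach is to keep whichever feature correlates more strongly with the target and drop the other.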

Step 6: Target Variable Analysis

# Distribution of target
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df['price'].hist(bins=50)
plt.title('Price Distribution')
plt.xlabel('Price ($)')

plt.subplot(1, 2, 2)
sns.boxplot(y=df['price'])
plt.title('Price Box Plot')
plt.tight_layout()
plt.show()

print(f"Skewness: {df['price'].skew():.3f}")
print(f"Mean: ${df['price'].mean():,.0f}")
print(f"Median: ${df['price'].median():,.0f}")

Step 7: Feature-Target Relationships

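# Scatter plots for top correlated features
# Rank by absolute correlation so strong negative predictors are included;
# drop the target itself and its log transform if it was added earlier
top_features = (
    correlations.drop(['price', 'price_log'], errors='ignore')
    .abs().sort_values(ascending=False).index[:4]
)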

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for i, feature in enumerate(top_features):
    ax = axes[i // 2, i % 2]
    ax.scatter(df[feature], df['price'], alpha=0.3, color='steelblue')
    ax.set_xlabel(feature)
    ax.set_ylabel('price')
    # Add a linear trend line, using only rows where both values are present
    valid = df[[feature, 'price']].dropna()
    z = np.polyfit(valid[feature], valid['price'], 1)
    xs = np.sort(valid[feature].to_numpy())
    ax.plot(xs, np.poly1d(z)(xs), "r--")

plt.tight_layout()
plt.show()

EDA Summary Template

After completing EDA, write yourself a brief summary:

Dataset: 1,000 rows × 8 columns
Target: price (right-skewed, will log-transform)
Missing values: 42 (4.2%) in 'garage_type' — will fill with 'None'
Outliers: 12 extreme prices (>$800K) — will cap at 99th percentile
Strong predictors: size (r=0.78), location_score (r=0.71), bedrooms (r=0.54)
Multicollinearity: size & total_rooms (r=0.89) — will drop total_rooms
Key insight: all properties with price < $150K are in rural areas — location is critical

This summary becomes your feature engineering plan.

Next lesson: Handling Missing Values — strategies for every type of missing data.