Exploratory Data Analysis (EDA)
EDA is what separates ML practitioners who build models that actually work from those who just run code. Before you train a single model, you must understand your data deeply — its shape, distributions, relationships, and problems.
Skip EDA and you'll build models on flawed assumptions. Do EDA well and you'll know exactly what you're working with before writing a single training line.
The EDA Questions You Must Answer
Before any model, answer these 7 questions about your dataset:
- What does each column represent? What are the units?
- What's the range and distribution of each numeric feature?
- Which columns have missing values, and how many?
- Are there obvious outliers that could distort the model?
- Which features correlate with the target variable?
- Which features correlate with each other (multicollinearity)?
- Are there data quality issues (inconsistent formatting, typos, impossible values)?
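The last question is the hardest to answer by eye, so here is a minimal sketch of how it could be checked in code. The toy DataFrame, the `quality_report` helper, and the plausible ranges are all illustrative assumptions, not part of the housing dataset used in this lesson.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "city": ["Austin", "austin ", "AUSTIN", "Dallas"],
    "size": [1200, -50, 1800, 2400],   # -50 is an impossible value
    "bedrooms": [2, 3, 3, 40],         # 40 is implausible
})

def quality_report(df, text_columns, valid_ranges):
    """Count inconsistent labels and out-of-range values per column."""
    report = {}
    # Text columns: labels that differ only by case or surrounding whitespace
    for col in text_columns:
        raw = df[col].dropna()
        normalized = raw.str.strip().str.lower()
        report[f"{col}_label_variants"] = int(raw.nunique() - normalized.nunique())
    # Numeric columns: values outside a caller-supplied plausible range
    for col, (low, high) in valid_ranges.items():
        values = df[col].dropna()
        report[f"{col}_out_of_range"] = int((~values.between(low, high)).sum())
    return report

report = quality_report(
    toy,
    text_columns=["city"],
    valid_ranges={"size": (100, 20000), "bedrooms": (0, 15)},
)
print(report)
```

The ranges have to come from domain knowledge (what is a plausible house size?), which is exactly why the first question about units and meaning comes first.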
Step 1: Load and First Look
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('housing.csv')
# Basic shape
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
# First 5 rows
print(df.head())
# Column types and non-null counts
print(df.info())
# Statistical summary for numeric columns
print(df.describe())
describe() output tells you a lot at a glance:
         size  bedrooms     price
count  1000.0    1000.0    1000.0
mean   1842.3       3.2  342500.0
std     612.8       0.9  125000.0
min     650.0       1.0   95000.0
25%    1350.0       2.0  245000.0
50%    1790.0       3.0  330000.0
75%    2350.0       4.0  425000.0
max    5200.0       7.0  950000.0
Red flags: large gap between 75th percentile and max (possible outlier), min = 0 for something that shouldn't be zero, or mean far from median (skewed distribution).
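These red flags can also be checked programmatically. The sketch below is one possible way to do it; the `describe_red_flags` helper, its flag names, and both thresholds are assumptions chosen for illustration, so tune them to your own data.

```python
import pandas as pd

def describe_red_flags(df, must_be_positive=(), tail_ratio=2.0, gap_ratio=0.25):
    """Map each numeric column to a list of red-flag codes.

    tail_ratio and gap_ratio are arbitrary illustrative thresholds.
    """
    flags = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        found = []
        # Large gap between the 75th percentile and the max
        if iqr > 0 and (s.max() - q3) > tail_ratio * iqr:
            found.append("long_upper_tail")
        # Zero (or negative) minimum in a column that should be positive
        if col in must_be_positive and s.min() <= 0:
            found.append("non_positive_minimum")
        # Mean far from median, measured in standard deviations
        if s.std() > 0 and abs(s.mean() - median) > gap_ratio * s.std():
            found.append("mean_far_from_median")
        if found:
            flags[col] = found
    return flags

# Hypothetical toy data with one planted problem per column
toy = pd.DataFrame({
    "size": [900, 1100, 1300, 1500, 1700, 9000],
    "lot_area": [0, 4000, 5000, 6000, 7000, 8000],
})
flags = describe_red_flags(toy, must_be_positive=["lot_area"])
print(flags)
```

A flag is only a prompt to look closer: a long upper tail may be a genuine mansion rather than a data entry error.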
Step 2: Missing Values Analysis
# Count and percentage of missing values
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': df.isnull().sum() / len(df) * 100
})
missing = missing[missing['count'] > 0].sort_values('percent', ascending=False)
print(missing)
# Visualize missing values
import missingno as msno  # third-party package: pip install missingno
msno.matrix(df)
plt.show()
Decision rules for missing values:
- < 5% missing: safe to drop rows OR fill with median/mode
- 5–20% missing: fill with median (numeric) or mode (categorical), or use imputation
- > 40% missing: consider dropping the column entirely
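As a rough sketch, the rules above can be turned into a helper that suggests an action per column. The function name and action codes are hypothetical, and since the rules do not say what to do between 20% and 40%, that band is labeled a judgment call rather than guessed.

```python
import numpy as np
import pandas as pd

def suggest_missing_strategy(df):
    """Map each column with missing values to an action code."""
    percent_missing = df.isnull().mean() * 100
    suggestions = {}
    for col, pct in percent_missing.items():
        if pct == 0:
            continue  # nothing to do for complete columns
        if pct < 5:
            suggestions[col] = "drop_rows_or_fill"
        elif pct <= 20:
            suggestions[col] = "fill_or_impute"
        elif pct > 40:
            suggestions[col] = "consider_dropping_column"
        else:
            # 20-40% is not covered by the rules above
            suggestions[col] = "judgment_call"
    return suggestions

# Hypothetical toy data: 100 rows with 0%, 2%, 10%, 30% and 60% missing
idx = np.arange(100)
toy = pd.DataFrame({
    "complete": np.ones(100),
    "few": np.where(idx < 2, np.nan, 1.0),
    "some": np.where(idx < 10, np.nan, 1.0),
    "many": np.where(idx < 30, np.nan, 1.0),
    "most": np.where(idx < 60, np.nan, 1.0),
})
suggestions = suggest_missing_strategy(toy)
print(suggestions)
```

Treat the output as a starting point: a column that is 45% missing but strongly predictive may still be worth keeping.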
Step 3: Distribution Analysis
# Histograms for all numeric columns
df.select_dtypes(include=[np.number]).hist(
    bins=30, figsize=(15, 10), color='steelblue', edgecolor='white'
)
plt.tight_layout()
plt.show()
# Check for skewness
skewness = df.select_dtypes(include=[np.number]).skew()
print("Highly skewed columns (|skew| > 1):")
print(skewness[abs(skewness) > 1])
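Why skewness matters: Linear models and distance-based algorithms (KNN, SVM) tend to perform better when features are roughly symmetric, because a long tail lets a few extreme values dominate the fitted coefficients or the distance calculations. Highly skewed, non-negative features are often good candidates for a log transformation.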
# Log transform a skewed feature
df['price_log'] = np.log1p(df['price']) # log1p handles zeros safely
# Compare distribution before/after
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Series.hist() has no title argument, so set titles on the axes
df['price'].hist(bins=50, ax=ax1)
ax1.set_title('Price (original)')
df['price_log'].hist(bins=50, ax=ax2)
ax2.set_title('Price (log-transformed)')
plt.show()
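Besides comparing the histograms, you can confirm the effect numerically by comparing skewness before and after the transform. This is a small sketch on synthetic log-normal data (an assumption standing in for real prices), so the exact numbers will differ on your dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; parameters are arbitrary assumptions
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.8, size=1000))

skew_before = prices.skew()
skew_after = np.log1p(prices).skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

If the skewness after the transform is still well above 1 in absolute value, a log transform alone is probably not enough for that feature.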
Step 4: Outlier Detection
# Box plots to visualize outliers
plt.figure(figsize=(12, 6))
df[['size', 'bedrooms', 'bathrooms']].boxplot()
plt.title('Box plots for numeric features')
plt.show()
# IQR method to identify outliers
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    print(f"{column}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
    return outliers
detect_outliers_iqr(df, 'price')
detect_outliers_iqr(df, 'size')
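Detecting outliers is only half the job; you also need to decide what to do with them. One common option is capping values at a high percentile instead of deleting rows. The sketch below assumes a hypothetical `cap_at_percentile` helper and toy data; whether capping is appropriate depends on whether the extreme values are errors or genuine observations.

```python
import pandas as pd

def cap_at_percentile(series, upper_pct=0.99):
    """Return a copy with values above the given percentile capped to it."""
    upper = series.quantile(upper_pct)
    return series.clip(upper=upper)

# Hypothetical toy data: 100 ordinary values plus one extreme value
prices = pd.Series(list(range(100, 200)) + [10000])
capped = cap_at_percentile(prices)

print(f"Max before: {prices.max()}, max after: {capped.max()}")
print(f"Values changed: {(capped != prices).sum()}")
```

`clip` returns a new Series, so the original column stays available for comparison.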
Step 5: Correlation Analysis
# Correlation with target variable
correlations = df.select_dtypes(include=[np.number]).corr()['price'].sort_values(ascending=False)
print("Correlations with price:")
print(correlations)
# Full correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    df.select_dtypes(include=[np.number]).corr(),
    annot=True, fmt='.2f', cmap='coolwarm',
    center=0, vmin=-1, vmax=1
)
plt.title('Feature Correlation Matrix')
plt.show()
What to look for:
- Features with high correlation to target → likely useful predictors
- Features with high correlation to each other → multicollinearity (can cause issues in linear models)
- Features with near-zero correlation to target → candidates for dropping, but check a scatter plot first, because Pearson correlation only measures linear relationships
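On a large heatmap, multicollinear pairs are easy to miss. A small helper can list them explicitly. This is a sketch under assumptions: the `correlated_pairs` name, the 0.8 threshold, and the synthetic data are all illustrative choices.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, target, threshold=0.8):
    """List feature pairs (excluding the target) with |r| above threshold."""
    corr = df.drop(columns=[target]).select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Synthetic data: total_rooms is built from size, age is independent
rng = np.random.default_rng(0)
size = rng.normal(1800, 600, 200)
toy = pd.DataFrame({
    "size": size,
    "total_rooms": size / 300 + rng.normal(0, 0.3, 200),
    "age": rng.uniform(0, 50, 200),
    "price": size * 150 + rng.normal(0, 20000, 200),
})
pairs = correlated_pairs(toy, target="price")
print(pairs)
```

For each reported pair, a reasonable default is to keep the feature that correlates more strongly with the target and drop the other.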
Step 6: Target Variable Analysis
# Distribution of target
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df['price'].hist(bins=50)
plt.title('Price Distribution')
plt.xlabel('Price ($)')
plt.subplot(1, 2, 2)
sns.boxplot(y=df['price'])
plt.title('Price Box Plot')
plt.tight_layout()
plt.show()
print(f"Skewness: {df['price'].skew():.3f}")
print(f"Mean: ${df['price'].mean():,.0f}")
print(f"Median: ${df['price'].median():,.0f}")
Step 7: Feature-Target Relationships
# Scatter plots for top correlated features
# Skip 'price' itself and the 'price_log' copy created in Step 3
top_features = correlations.drop(['price', 'price_log'], errors='ignore').index[:4]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for i, feature in enumerate(top_features):
    ax = axes[i // 2, i % 2]
    ax.scatter(df[feature], df['price'], alpha=0.3, color='steelblue')
    ax.set_xlabel(feature)
    ax.set_ylabel('price')
    # Add trend line
    z = np.polyfit(df[feature].dropna(), df.loc[df[feature].notna(), 'price'], 1)
    p = np.poly1d(z)
    ax.plot(sorted(df[feature].dropna()), p(sorted(df[feature].dropna())), "r--")
plt.tight_layout()
plt.show()
EDA Summary Template
After completing EDA, write yourself a brief summary:
Dataset: 1,000 rows × 8 columns
Target: price (right-skewed, will log-transform)
Missing values: 42 (4.2%) in 'garage_type' — will fill with 'None'
Outliers: 12 extreme prices (>$800K) — will cap at 99th percentile
Strong predictors: size (r=0.78), location_score (r=0.71), bedrooms (r=0.54)
Multicollinearity: size & total_rooms (r=0.89) — will drop total_rooms
Key insight: all properties with price < $150K are in rural areas — location is critical
This summary becomes your feature engineering plan.
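Parts of this summary can be drafted automatically and then edited by hand. The sketch below fills in only the first three lines of the template; the `build_eda_summary` helper and the toy data are hypothetical, and the decisions themselves (cap, drop, transform) still have to come from you.

```python
import pandas as pd

def build_eda_summary(df, target):
    """Draft the dataset, target and missing-value lines of the summary."""
    lines = [f"Dataset: {len(df):,} rows × {df.shape[1]} columns"]

    # Same |skew| > 1 rule of thumb used in Step 3
    skew = df[target].skew()
    if skew > 1:
        shape = "right-skewed"
    elif skew < -1:
        shape = "left-skewed"
    else:
        shape = "roughly symmetric"
    lines.append(f"Target: {target} ({shape}, skew={skew:.2f})")

    # One line per column that has missing values
    missing = df.isnull().sum()
    for col, count in missing[missing > 0].items():
        pct = count / len(df) * 100
        lines.append(f"Missing values: {count} ({pct:.1f}%) in '{col}'")
    return lines

# Hypothetical toy data
toy = pd.DataFrame({
    "price": [100, 200, 300, 400, 5000],
    "size": [1, 2, 3, 4, 5],
    "garage_type": ["attached", None, "detached", "attached", "carport"],
})
summary = build_eda_summary(toy, target="price")
print("\n".join(summary))
```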
Next lesson: Handling Missing Values — strategies for every type of missing data.