Feature Engineering & Scaling

The quality of your features determines the ceiling of your model's performance. Even the best algorithm can't save you from bad features — but great features make even simple algorithms shine. This is where data science becomes an art.

What Is Feature Engineering?

Feature engineering is transforming raw data into representations that better capture the underlying patterns for your model. It includes:

Creating new features from existing ones
Transforming features to better distributions
Encoding categorical variables for ML algorithms
Scaling features to comparable ranges

Creating New Features

import pandas as pd
import numpy as np

df = pd.read_csv('housing.csv')

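# Ratio features
# Caution: price_per_sqft is derived from price. If price is your prediction
# target, use this column for exploration only, never as a model input.
df['price_per_sqft'] = df['price'] / df['size']
# Treat 0 bedrooms (studios) as missing to avoid dividing by zero
df['size_per_bedroom'] = df['size'] / df['bedrooms'].replace(0, np.nan)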

# Age features
df['house_age'] = 2026 - df['year_built']
df['years_since_renovation'] = 2026 - df['last_renovation'].fillna(df['year_built'])

# Binned features (convert continuous to categorical)
df['size_category'] = pd.cut(
    df['size'],
    bins=[0, 1200, 1800, 2500, 99999],
    labels=['Small', 'Medium', 'Large', 'XL']
)

# Boolean features
df['has_pool'] = (df['pool_area'] > 0).astype(int)
df['renovated_recently'] = (df['years_since_renovation'] < 10).astype(int)

Feature creation intuition: Think like a domain expert. What would a real estate agent consider? That's often what the model needs too.
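
Ratio features break quietly when a denominator is zero, for example a studio listed with zero bedrooms. The sketch below uses a small made-up DataFrame (the values and the safe_ratio helper are illustrative, not part of the housing dataset) to show one way of guarding the division.

```python
import numpy as np
import pandas as pd

# Toy data standing in for housing.csv; the values are invented.
toy = pd.DataFrame({
    "size": [800, 1500, 2400],
    "bedrooms": [0, 3, 4],  # 0 bedrooms = studio
})

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

toy["size_per_bedroom"] = safe_ratio(toy["size"], toy["bedrooms"])
print(toy)
```

The NaN left behind for the studio can then be handled by whatever imputation strategy you use for other missing values.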

Encoding Categorical Variables

ML algorithms need numbers. Here's how to encode categories:

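Ordinal Encoding (for ordinal categories)

# Use when order matters: Small < Medium < Large < XL
# An explicit mapping keeps the order under your control. sklearn's LabelEncoder
# is intended for target labels and assigns codes alphabetically
# (Large=0, Medium=1, Small=2, XL=3), which would scramble the order.
size_order = {'Small': 0, 'Medium': 1, 'Large': 2, 'XL': 3}
# pd.cut returns a categorical column; cast to object so map() yields numbers
df['size_encoded'] = df['size_category'].astype(object).map(size_order)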
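
If you prefer an encoder that plugs into scikit-learn pipelines, OrdinalEncoder accepts the category order explicitly. The following is a minimal sketch on a made-up column; the sample values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented sample rows, deliberately out of order
sizes = pd.DataFrame({"size_category": ["Medium", "Small", "XL", "Large"]})

# Pass the order explicitly; without it, categories are sorted alphabetically.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large", "XL"]])
codes = encoder.fit_transform(sizes[["size_category"]])

print(codes.ravel())  # [1. 0. 3. 2.]
```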

One-Hot Encoding (for nominal categories)

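# Use when categories have no natural order
# pandas get_dummies (assigned to a new frame, because get_dummies removes the
# original 'city' and 'property_type' columns that later examples still use)
df_encoded = pd.get_dummies(df, columns=['city', 'property_type'], drop_first=True)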

# sklearn OneHotEncoder (better for pipelines)
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')
city_encoded = enc.fit_transform(df[['city']])

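The dummy variable trap: If you keep every dummy column, the columns of each category sum to 1, which is perfectly collinear with the model's intercept. Use drop_first=True (or drop='first' in OneHotEncoder) for unregularized linear models such as ordinary least squares. Tree-based and regularized models do not need it.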
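
To see the collinearity concretely, the sketch below builds both encodings for a tiny made-up city column (the city names are illustrative), adds an intercept column, and compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({"city": ["Austin", "Boston", "Austin", "Denver"]})

full = pd.get_dummies(cities, columns=["city"], dtype=int)
reduced = pd.get_dummies(cities, columns=["city"], drop_first=True, dtype=int)

# Prepend an intercept column of ones, as a linear model would
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

# 4 columns but rank 3: one column is redundant
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
# 3 columns and rank 3: no redundancy
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

When the rank is lower than the column count, ordinary least squares has no unique solution, which is the practical meaning of the trap.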

Target Encoding (for high-cardinality categories)

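# Useful when a category has hundreds of unique values
# Compute the means on training rows only (train_df is assumed to be the
# training split of df); means taken over all rows leak the target.
city_avg_price = train_df.groupby('city')['price'].mean()
# Cities unseen in training fall back to the overall training mean
df['city_target_encoded'] = df['city'].map(city_avg_price).fillna(train_df['price'].mean())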
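
Plain category means are noisy for rare categories, so a common refinement is to shrink each mean toward the global mean. The sketch below shows one such smoothing formula on made-up data; the smoothing value of 2.0 and the column values are assumptions chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "A", "B"],
    "price": [100.0, 200.0, 300.0, 600.0],
})
test = pd.DataFrame({"city": ["A", "B", "C"]})  # "C" never appears in train

smoothing = 2.0  # hypothetical strength; larger values pull harder toward the global mean
global_mean = train["price"].mean()

city_stats = train.groupby("city")["price"].agg(["mean", "count"])
smoothed = (
    city_stats["count"] * city_stats["mean"] + smoothing * global_mean
) / (city_stats["count"] + smoothing)

# Unseen categories fall back to the global training mean
test["city_target_encoded"] = test["city"].map(smoothed).fillna(global_mean)
print(test)
```

Note that encoding the training rows with means that include their own target still leaks a little. Out-of-fold encoding addresses that, and scikit-learn 1.3 and later ship a TargetEncoder that applies cross-fitting in fit_transform.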

Feature Scaling

Many ML algorithms are sensitive to the scale of features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, distance-based and gradient-based algorithms let the large-valued feature dominate, regardless of how informative it is.

Algorithms that NEED scaling: KNN, SVM, neural networks, PCA, regularized linear and logistic regression, and other models trained with gradient descent

Algorithms that DON'T need scaling: Decision trees, random forests, gradient boosting (XGBoost, LightGBM)
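
The effect on a distance-based method is easy to see on a few made-up houses. In this sketch (all values are invented), the nearest neighbor of the first house changes once the columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [bedrooms, price]; the first row is the query house
X = np.array([
    [2.0, 300_000.0],  # query
    [5.0, 301_000.0],  # very different bedrooms, almost the same price
    [2.0, 310_000.0],  # same bedrooms, somewhat higher price
    [3.0, 400_000.0],  # a pricier house that widens the price range
])

def distances_from_first(data: np.ndarray) -> np.ndarray:
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_from_first(X)
scaled = distances_from_first(StandardScaler().fit_transform(X))

print("nearest on raw data:   ", np.argmin(raw) + 1)     # row 1
print("nearest on scaled data:", np.argmin(scaled) + 1)  # row 2
```

On raw data the price column swamps the bedroom count, so the five-bedroom house looks closest. After scaling, both columns contribute comparably.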

Min-Max Normalization (0 to 1)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# IMPORTANT: fit only on training data, then transform test data
X_test_scaled = scaler.transform(X_test)  # Do NOT fit again

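Use when: you need values bounded in a fixed range, for example inputs to a neural network. Min-max scaling is sensitive to outliers, because a single extreme value sets the minimum or maximum.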
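
One consequence of fitting on training data only is that test values can land outside the 0 to 1 range. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])  # 5 and 40 lie outside the training range

scaler = MinMaxScaler().fit(X_train)  # learns min=10 and max=30 from training data
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.ravel())  # [-0.25  0.75  1.5 ]
```

This is expected behavior, not a bug. If your model requires strictly bounded inputs, MinMaxScaler has a clip parameter that can be set to True.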

Standard Scaling (mean=0, std=1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Use when: features follow a roughly normal distribution. Most common choice for traditional ML.

Robust Scaling (uses median and IQR)

from sklearn.preprocessing import RobustScaler

# Less sensitive to outliers than StandardScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Reuse the training median and IQR

Use when: data has significant outliers you can't or won't remove.
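
To make the difference concrete, this sketch scales a made-up column containing one extreme value with both scalers and compares what happens to the ordinary points.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one extreme outlier (invented data)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(X_train)      # (x - median) / IQR
standard = StandardScaler().fit_transform(X_train)  # (x - mean) / std

print("robust:  ", robust.ravel())
print("standard:", standard.ravel().round(4))
```

With RobustScaler the four ordinary values stay evenly spread between -1 and 0.5. With StandardScaler the outlier inflates the mean and standard deviation, so the same four values are squeezed into a range narrower than 0.01.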

The Data Leakage Trap

Critical rule: Always fit your scaler on training data only, then transform both train and test.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# WRONG: leaks test data statistics into training
scaler.fit(X)  # Fits on all data including test
X_scaled = scaler.transform(X)

# RIGHT: split first, then fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler.fit(X_train)           # Fit on training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply same transformation

Data leakage gives you artificially inflated validation scores and models that underperform in production.
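
The same rule applies inside cross-validation: scaling the full dataset before calling cross_val_score leaks each validation fold's statistics into training. One way to avoid this, sketched below on synthetic data (the data, the Ridge model, and its alpha value are illustrative assumptions), is to pass a pipeline so the scaler is refit on the training part of every fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 100.0, 10_000.0]
y = X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.1, size=200)

# cross_val_score clones and refits the whole pipeline per fold,
# so the scaler never sees the validation rows of that fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5)

print("R^2 per fold:", scores.round(3))
```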

Handling Skewed Features

# Log transformation — for right-skewed data (most common)
df['price_log'] = np.log1p(df['price'])      # log(1+x), safe for zeros

# Square root — gentler transformation
df['rooms_sqrt'] = np.sqrt(df['total_rooms'])

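# Box-Cox: estimates the power parameter lambda by maximum likelihood
# Input must be strictly positive, hence the +1 shift for columns that contain zeros
from scipy import stats
df['size_boxcox'], _ = stats.boxcox(df['size'] + 1)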

# Check before and after
print(f"Original skew: {df['price'].skew():.3f}")
print(f"Log skew: {df['price_log'].skew():.3f}")
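
Like a scaler, a fitted transformation has parameters (here the Box-Cox lambda) that should be learned from training data and reused on test data. scikit-learn's PowerTransformer follows the usual fit/transform pattern. The sketch below uses synthetic lognormal "prices"; the distribution parameters are assumptions chosen only to produce right-skewed data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data standing in for a price column
rng = np.random.default_rng(42)
prices_train = rng.lognormal(mean=12.0, sigma=0.8, size=(500, 1))
prices_test = rng.lognormal(mean=12.0, sigma=0.8, size=(100, 1))

# Box-Cox requires strictly positive input; the result is standardized by default
pt = PowerTransformer(method="box-cox")
train_t = pt.fit_transform(prices_train)  # lambda estimated on training data
test_t = pt.transform(prices_test)        # same lambda reused on test data

print(f"Skew before: {skew(prices_train.ravel()):.3f}")
print(f"Skew after:  {skew(train_t.ravel()):.3f}")
print("Estimated lambda:", pt.lambdas_)
```

For columns that contain zeros or negative values, method="yeo-johnson" is the variant that accepts them.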

Feature Selection

More features is not always better — irrelevant features add noise.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Assumes X_train is a DataFrame, so column names are available
feature_names = X_train.columns

# Method 1: Feature importance from tree models
# (impurity-based importances tend to favor high-cardinality features)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importances = pd.Series(model.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).plot(kind='bar')
plt.title('Feature Importances')
plt.show()

# Method 2: Select top K features by statistical test
selector = SelectKBest(f_regression, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print("Top 10 features:", selected_features)

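# Method 3: Drop low-variance features
# The threshold is an absolute variance in each feature's own units, not a percentage
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  # Drop features with variance below 0.01
X_filtered = selector.fit_transform(X_train)  # Fit on training data only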
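
Because the variance threshold is absolute, its effect depends on each feature's units. The sketch below runs VarianceThreshold on a tiny made-up matrix to show which columns survive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Invented data: a constant column, a nearly constant column, and a varying one
X_train = np.array([
    [0.0, 1.00, 10.0],
    [0.0, 1.10, 20.0],
    [0.0, 0.90, 30.0],
    [0.0, 1.00, 40.0],
])

selector = VarianceThreshold(threshold=0.01)
X_kept = selector.fit_transform(X_train)

print(selector.variances_)     # [0.    0.005 125.  ]
print(selector.get_support())  # [False False  True]
```

The middle column is dropped only because its values are small. Expressed in units ten times larger, its variance would be 0.5 and it would be kept, so apply this filter to features on comparable scales. Standardizing first is not an option, since that sets every variance to 1.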

The Complete Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# price_per_sqft is left out: it is computed from price, which is assumed to be the target
numeric_features = ['size', 'bedrooms', 'house_age', 'size_per_bedroom']
categorical_features = ['property_type', 'neighborhood_grade']

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])

# Combine with your model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LinearRegression()),
])

full_pipeline.fit(X_train, y_train)
score = full_pipeline.score(X_test, y_test)  # R^2 on the held-out test set

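With a Pipeline, the imputer, scaler, and encoder are fit only on the data passed to fit(). That prevents preprocessing leakage as long as you fit on the training split or let cross-validation do the splitting. It also makes deployment cleaner, because a single object holds every step.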

Next lesson: Train, Validation & Test Splits — setting up your evaluation framework correctly from the start.