Feature Engineering and Selection

The Real Estate Agent Analogy

A seasoned real estate agent doesn't just quote a home's raw price; they talk about price per square foot. That single calculated number instantly makes two homes with completely different sizes comparable. Nobody told the agent to invent that metric. They derived it from domain knowledge because it is genuinely more meaningful than either raw feature alone.

That act of transformation — taking existing data and crafting more informative representations — is feature engineering. And knowing which of those crafted features actually matter, and discarding the ones that don't, is feature selection. Together, these two skills are often what separates a mediocre model from an excellent one.

Raw data is rarely in the exact form that best exposes underlying patterns to a machine learning algorithm. Your job as a practitioner is to reshape it until it is.

Why Feature Engineering Matters

Consider predicting survival on the Titanic. The raw dataset gives you SibSp (siblings/spouses aboard) and Parch (parents/children aboard). Neither column alone is very powerful. But add them together into family_size = SibSp + Parch + 1, and suddenly you have a feature that captures something real: traveling alone versus traveling with family, which had a significant effect on survival odds.

Many algorithms will not find that combination on their own, especially from limited data. You provide the insight; the algorithm learns from it.

Feature Engineering Techniques

Polynomial and Interaction Features

When relationships between features and the target are non-linear, adding polynomial terms can help linear models capture the curve.

import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'area': [800, 1200, 1500, 2000, 2500],
                   'rooms': [2, 3, 3, 4, 5]})

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)

feature_names = poly.get_feature_names_out(['area', 'rooms'])
df_poly = pd.DataFrame(poly_features, columns=feature_names)
print(df_poly.head())

Output:

     area  rooms     area^2  area rooms  rooms^2
0   800.0    2.0   640000.0      1600.0      4.0
1  1200.0    3.0  1440000.0      3600.0      9.0
2  1500.0    3.0  2250000.0      4500.0      9.0
3  2000.0    4.0  4000000.0      8000.0     16.0
4  2500.0    5.0  6250000.0     12500.0     25.0
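To make the benefit for linear models concrete, here is a minimal sketch on synthetic data (the quadratic relationship, noise level, and variable names are assumptions for illustration, not part of the lesson's dataset). It compares a plain linear regression with the same model fed degree-2 polynomial features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): y depends on x quadratically.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = 2.0 * x[:, 0] ** 2 + 0.5 * x[:, 0] + rng.normal(scale=0.5, size=200)

# A plain linear model can only fit a straight line through the curve.
linear = LinearRegression().fit(x, y)

# The same linear model on [x, x^2] can represent the parabola.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly_model.score(x, y)
print(f"R^2 with raw x:     {r2_linear:.3f}")
print(f"R^2 with x and x^2: {r2_poly:.3f}")
```

The model itself is unchanged; only its inputs differ. Higher degrees add many more columns, so the degree is usually kept small.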

Log Transforms

Highly skewed distributions (like income, house prices, or population counts) often benefit from a log transformation. It compresses large values, spreads out small ones, and often makes the distribution closer to normal.

# Assumes df has a non-negative 'price' column
df['price_log'] = np.log1p(df['price'])  # log1p(x) = log(1 + x), so zeros are handled safely

print(f"Original skewness: {df['price'].skew():.2f}")
print(f"Log-transformed skewness: {df['price_log'].skew():.2f}")

Output:

Original skewness: 3.47
Log-transformed skewness: 0.41
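A self-contained version of the same idea might look like the sketch below. The prices are drawn from a log-normal distribution purely as an assumption, to obtain right-skewed data; they are not the dataset behind the numbers above. It also shows `np.expm1`, the inverse of `np.log1p`, which is useful when predictions made on the log scale have to be reported in the original units.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (an assumption; not the lesson's dataset).
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.8, size=1000), name="price")

price_log = np.log1p(prices)  # log(1 + x), defined at x = 0
skew_before = prices.skew()
skew_after = price_log.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")

# expm1 inverts log1p, mapping log-scale values back to the original units.
recovered = np.expm1(price_log)
print("Round trip OK:", np.allclose(recovered, prices))
```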

Binning (Discretization)

Continuous values can be grouped into meaningful categories, for example age into "child", "teen", "adult", and "senior". This can help when the relationship is step-like rather than linear.

df['age_group'] = pd.cut(df['Age'],
                          bins=[0, 12, 18, 60, 100],
                          labels=['child', 'teen', 'adult', 'senior'])
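The boundary behavior of `pd.cut` is easy to get wrong, so here is a small sketch on a hand-made list of ages (the values are assumptions chosen to sit on the bin edges). It also shows `pd.qcut`, a quantile-based alternative that is not used elsewhere in this lesson.

```python
import pandas as pd

ages = pd.Series([0.5, 12, 13, 18, 19, 60, 61, 100], name="Age")

# Fixed, domain-driven bins. Intervals are right-closed by default,
# so 12 falls in (0, 12] -> 'child' and 13 falls in (12, 18] -> 'teen'.
age_group = pd.cut(ages,
                   bins=[0, 12, 18, 60, 100],
                   labels=['child', 'teen', 'adult', 'senior'])
print(pd.DataFrame({"Age": ages, "age_group": age_group}))

# Quantile bins: each bin gets roughly the same number of rows.
age_quartile = pd.qcut(ages, q=4, labels=['q1', 'q2', 'q3', 'q4'])
print(age_quartile.value_counts().sort_index())
```

Because the first interval is open on the left, an age of exactly 0 would become NaN with these bins; passing `include_lowest=True` is one way to include it.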

Titanic Feature Engineering Example

import pandas as pd

df = pd.read_csv('titanic.csv')

# Engineer: family_size
df['family_size'] = df['SibSp'] + df['Parch'] + 1

# Engineer: is_alone
df['is_alone'] = (df['family_size'] == 1).astype(int)

# Engineer: cabin_known (was a cabin assigned?)
df['cabin_known'] = df['Cabin'].notna().astype(int)

# Engineer: title from Name
df['title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['title'] = df['title'].replace(['Lady','Countess','Capt','Col','Don',
                                    'Dr','Major','Rev','Sir','Jonkheer','Dona'], 'Rare')
df['title'] = df['title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

print(df[['Name', 'family_size', 'is_alone', 'cabin_known', 'title']].head(5))

Output:

                                    Name  family_size  is_alone  cabin_known  title
0                Braund, Mr. Owen Harris            2         0            0     Mr
1  Cumings, Mrs. John Bradley (Florence)            2         0            1    Mrs
2                 Heikkinen, Miss. Laina            1         1            0   Miss
3           Futrelle, Mrs. Jacques Heath            2         0            1    Mrs
4               Allen, Mr. William Henry            1         1            0     Mr

Feature Selection Techniques

Having too many features is itself a problem, known as the curse of dimensionality. With high-dimensional data, the feature space becomes sparse, distances between points become less informative, and models overfit to noise. Feature selection keeps the features that carry signal and discards the rest.

Correlation Matrix

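A correlation matrix answers two questions. Correlation with the target hints at which features are relevant. Correlation between two features reveals redundancy: when a pair is highly correlated, keep one and drop the other. The snippet below switches to a house-price dataset whose target column is SalePrice.

# df is assumed to be a house-price DataFrame with a numeric SalePrice column
numerical_cols = df.select_dtypes(include='number').columns
corr_matrix = df[numerical_cols].corr()

# Correlation of each numeric feature with the target, strongest first
print(corr_matrix['SalePrice'].sort_values(ascending=False).head(6))

Output:

SalePrice      1.000
OverallQual    0.791
GrLivArea      0.709
GarageCars     0.640
GarageArea     0.623
TotalBsmtSF    0.614

GarageCars and GarageArea both measure garage capacity, so they are natural candidates for a redundancy check.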
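The "keep one, drop the other" step can be sketched as a small helper. The function name `drop_correlated`, the 0.95 threshold, and the synthetic columns are all assumptions for illustration; the approach is greedy and simply drops the later column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

# Synthetic features (an assumption for illustration): garage_area is almost
# a rescaled copy of garage_cars, while lot_size is independent of both.
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=300).astype(float)
X = pd.DataFrame({
    "garage_cars": garage_cars,
    "garage_area": garage_cars * 250 + rng.normal(scale=20, size=300),
    "lot_size": rng.normal(loc=9000, scale=2000, size=300),
})

def drop_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

X_reduced, dropped = drop_correlated(X, threshold=0.95)
print("Dropped:", dropped)
print("Kept:", list(X_reduced.columns))
```

Which member of a pair to keep is a judgment call; a reasonable default is to keep the one that is easier to interpret or more strongly correlated with the target.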

Feature Importance from Trees

Tree-based models like Random Forest measure how much each feature reduces impurity across all splits. This gives a natural importance ranking.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(10))

Output:

title           0.2341
fare            0.1987
age             0.1654
family_size     0.1102
cabin_known     0.0876
pclass          0.0712
is_alone        0.0534
sex             0.0421
embarked        0.0198
sibsp           0.0175
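Impurity-based importances are computed on the training data and can favor features with many distinct values, so it is common to cross-check them. One option is scikit-learn's `permutation_importance`, evaluated on held-out data. The sketch below uses synthetic data (the `signal` and `noise_id` columns are assumptions for illustration) and prints both rankings side by side.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 'signal' drives the label,
# 'noise_id' is a unique-per-row number with no relation to it.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_id": rng.permutation(n).astype(float),
})
y = (X["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance, derived from the training data.
impurity = pd.Series(rf.feature_importances_, index=X.columns)

# Permutation importance: drop in test accuracy when a column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

If a feature scores well on impurity but close to zero on permutation importance, that is a hint that the forest may be memorizing it rather than learning from it.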

SelectKBest and RFE

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# SelectKBest: keep top K features by statistical test
selector = SelectKBest(score_func=f_classif, k=6)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("SelectKBest features:", list(selected_features))

# RFE: recursively remove weakest features
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=6)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_]
print("RFE features:", list(rfe_features))

Output:

SelectKBest features: ['pclass', 'fare', 'title', 'family_size', 'cabin_known', 'age']
RFE features: ['pclass', 'sex', 'fare', 'title', 'cabin_known', 'family_size']
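When selection is fitted on the full dataset before evaluation, information from the validation rows can leak into the choice of features. A common way to avoid this is to put the selector inside a scikit-learn pipeline, as in the sketch below. The data comes from `make_classification` and the choice of `k=6` mirrors the snippet above; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 20 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Selection lives inside the pipeline, so in every fold it is fitted
# on the training portion only and never sees the validation rows.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=6),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")

# Refit on all data to inspect which columns were kept.
model.fit(X, y)
kept = model.named_steps["selectkbest"].get_support(indices=True)
print("Kept feature indices:", kept)
```

The same pattern works with `RFE` in place of `SelectKBest`, at a higher computational cost because the estimator is refitted at each elimination step.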

Reference Table

Technique	When to Use	Python Method	Example
Polynomial features	Non-linear relationships	`PolynomialFeatures(degree=2)`	`area^2`, `area*rooms`
Log transform	Skewed distributions	`np.log1p(col)`	Income, price
Interaction features	Two features combine meaningfully	`df['a'] * df['b']`	`area * rooms`
Binning	Step-like relationships	`pd.cut()`	Age groups
Correlation filter	Remove redundant features	`.corr()` matrix	Drop `GarageCars` if corr >0.95
Tree importance	General ranking of features	`.feature_importances_`	Any tree model
SelectKBest	Fast univariate filtering	`SelectKBest(f_classif, k=K)`	Top K statistical features
RFE	Wrapper method, model-specific	`RFE(estimator, n_features_to_select=K)`	Recursive elimination

The Curse of Dimensionality

As the number of features grows, the volume of the feature space grows exponentially. With 2 features and 100 samples, the space is reasonably populated. With 100 features and 100 samples, each sample is essentially alone in its own neighborhood. Distance-based algorithms like K-Nearest Neighbors degrade sharply, because the nearest and farthest neighbors end up almost equally far away. Tree models are more robust but can still overfit on noise dimensions.
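The effect on distances can be observed directly. The sketch below assumes points drawn uniformly from the unit hypercube, which is an illustrative choice rather than a property of any real dataset. For each dimensionality it compares the distance from a query point to its nearest and farthest of 100 samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
ratios = {}

for n_features in (2, 10, 100, 1000):
    # Points drawn uniformly from the unit hypercube (an assumption).
    X = rng.uniform(size=(n_samples, n_features))
    query = rng.uniform(size=n_features)

    dists = np.linalg.norm(X - query, axis=1)
    # A ratio near 1 means the nearest and farthest neighbours
    # are almost equally far away.
    ratios[n_features] = dists.min() / dists.max()
    print(f"{n_features:>5} features: nearest/farthest = {ratios[n_features]:.2f}")
```

As the ratio approaches 1, "nearest" stops being a meaningful notion, which is why neighbor-based methods suffer first.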

A common rule of thumb is that a model needs roughly 5–10 samples per feature to generalize. With 500 samples, that means aiming for at most 50–100 features.

Key Takeaways

Feature engineering creates new, more informative representations from existing data using domain knowledge
Log transforms reduce skew; polynomial features capture curves; interaction terms combine signals
In the Titanic example, the engineered title, family_size, and cabin_known features ranked above several raw columns in the Random Forest importances
Use correlation matrices to identify and drop redundant features
Use tree importance or SelectKBest to rank features and keep only the most predictive
The curse of dimensionality penalizes high-dimensional data — select aggressively when samples are limited

