AiTechWorlds
A seasoned real estate agent doesn't just quote a home's raw price — they talk about price per square foot. That single calculated number instantly makes two homes with completely different sizes comparable. Nobody told the agent to invent that metric. They derived it from domain knowledge because it is genuinely more meaningful than either raw feature alone.
That act of transformation — taking existing data and crafting more informative representations — is feature engineering. And knowing which of those crafted features actually matter, and discarding the ones that don't, is feature selection. Together, these two skills are often what separates a mediocre model from an excellent one.
Raw data is rarely in the exact form that best exposes underlying patterns to a machine learning algorithm. Your job as a practitioner is to reshape it until it is.
Consider predicting survival on the Titanic. The raw dataset gives you SibSp (siblings/spouses aboard) and Parch (parents/children aboard). Neither column alone is very powerful. But add them together into family_size = SibSp + Parch + 1, and suddenly you have a feature that captures something real: traveling alone versus traveling with family, which had a significant effect on survival odds.
Most algorithms will not reliably discover that combination on their own. You provide the insight; the algorithm learns from it.
When relationships between features and the target are non-linear, adding polynomial terms can help linear models capture the curve.
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'area': [800, 1200, 1500, 2000, 2500],
                   'rooms': [2, 3, 3, 4, 5]})
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)
feature_names = poly.get_feature_names_out(['area', 'rooms'])
df_poly = pd.DataFrame(poly_features, columns=feature_names)
print(df_poly.head())
Output:
     area  rooms     area^2  area rooms  rooms^2
0   800.0    2.0   640000.0      1600.0      4.0
1  1200.0    3.0  1440000.0      3600.0      9.0
2  1500.0    3.0  2250000.0      4500.0      9.0
3  2000.0    4.0  4000000.0      8000.0     16.0
4  2500.0    5.0  6250000.0     12500.0     25.0
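To see why the extra terms matter, here is a minimal sketch on synthetic data. The quadratic relationship, the noise level, and every variable name below are assumptions made for illustration, not values from a real housing dataset. It compares a plain linear fit against the same linear model given degree-2 terms.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, curved relationship: y depends on x squared plus some noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Plain linear model: can only draw a straight line through the curve
linear = LinearRegression().fit(x, y)

# Same linear model, but fed x and x^2 as inputs
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly_model.score(x, y)
print(f"R^2 without polynomial terms: {r2_linear:.3f}")
print(f"R^2 with polynomial terms:    {r2_poly:.3f}")
```

The model is still linear in its coefficients; only the inputs changed. Scores here are measured on the training data, so treat them as a demonstration of fit rather than of generalization.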
Highly skewed distributions (such as income, house prices, or population counts) often benefit from a log transformation. It compresses large values much more than small ones, pulls in the long right tail, and often brings the distribution closer to normal.
# Assumes df has a non-negative numeric 'price' column
df['price_log'] = np.log1p(df['price'])  # log1p(x) = log(1 + x), so zeros map to 0 instead of -inf
print(f"Original skewness: {df['price'].skew():.2f}")
print(f"Log-transformed skewness: {df['price_log'].skew():.2f}")
Output:
Original skewness: 3.47
Log-transformed skewness: 0.41
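The snippet above depends on a price column that is not shown, so here is a self-contained sketch of the same idea. The data is synthetic (drawn from a log-normal distribution, an assumption chosen because it is right-skewed), and the frame name prices is hypothetical. It also shows np.expm1, the inverse of np.log1p, which is useful for turning predictions made on the log scale back into original units.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; not real market data
rng = np.random.default_rng(42)
prices = pd.DataFrame({"price": rng.lognormal(mean=12, sigma=0.8, size=1000)})

# Forward transform
prices["price_log"] = np.log1p(prices["price"])

skew_before = prices["price"].skew()
skew_after = prices["price_log"].skew()
print(f"Original skewness:        {skew_before:.2f}")
print(f"Log-transformed skewness: {skew_after:.2f}")

# Inverse transform: expm1 undoes log1p
recovered = np.expm1(prices["price_log"])
print("Round trip recovers the original values:",
      np.allclose(recovered, prices["price"]))
```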
Continuous values can be grouped into meaningful categories, for example age into "child", "teen", "adult", and "senior". This can help when the relationship with the target is step-like rather than linear.
df['age_group'] = pd.cut(df['Age'],
                         bins=[0, 12, 18, 60, 100],
                         labels=['child', 'teen', 'adult', 'senior'])
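One detail of pd.cut is worth checking before relying on the bins: by default each interval is open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin, and missing ages stay missing. The sketch below uses a small hand-made series of ages (illustrative values, not taken from the Titanic file) to show where boundary values land.

```python
import numpy as np
import pandas as pd

# Hand-picked ages that sit on or near the bin edges, plus one missing value
ages = pd.Series([0, 5, 12, 13, 18, 35, 60, 61, np.nan], name="Age")
edges = [0, 12, 18, 60, 100]
names = ["child", "teen", "adult", "senior"]

# Default: bins are (0, 12], (12, 18], (18, 60], (60, 100]
default_bins = pd.cut(ages, bins=edges, labels=names)

# include_lowest=True makes the first bin [0, 12], so age 0 is kept
inclusive_bins = pd.cut(ages, bins=edges, labels=names, include_lowest=True)

print(pd.DataFrame({"Age": ages,
                    "default": default_bins,
                    "include_lowest": inclusive_bins}))
```

Rows that come out as NaN need an explicit decision (impute the age first, or add a separate "unknown" category) before the feature goes into a model.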
import pandas as pd
df = pd.read_csv('titanic.csv')
# Engineer: family_size
df['family_size'] = df['SibSp'] + df['Parch'] + 1
# Engineer: is_alone
df['is_alone'] = (df['family_size'] == 1).astype(int)
# Engineer: cabin_known (was a cabin assigned?)
df['cabin_known'] = df['Cabin'].notna().astype(int)
# Engineer: title from Name
df['title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['title'] = df['title'].replace(['Lady','Countess','Capt','Col','Don',
                                   'Dr','Major','Rev','Sir','Jonkheer','Dona'], 'Rare')
df['title'] = df['title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
print(df[['Name', 'family_size', 'is_alone', 'cabin_known', 'title']].head(5))
Output:
Name family_size is_alone cabin_known title
0 Braund, Mr. Owen Harris 2 0 0 Mr
1 Cumings, Mrs. John Bradley (Florence) 2 0 1 Mrs
2 Heikkinen, Miss. Laina 1 1 0 Miss
3 Futrelle, Mrs. Jacques Heath 2 0 1 Mrs
4 Allen, Mr. William Henry 1 1 0 Mr
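The same steps can be wrapped in a function so that training and test data go through identical transformations. This is a sketch, not the course's own code: add_titanic_features is a hypothetical helper name, and the small frame below mixes two names from the output above with two invented rows ("Example, ...") added only to exercise the title mapping.

```python
import pandas as pd

RARE_TITLES = ["Lady", "Countess", "Capt", "Col", "Don",
               "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"]
TITLE_SYNONYMS = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}


def add_titanic_features(frame):
    """Return a copy of frame with the engineered columns added."""
    out = frame.copy()
    out["family_size"] = out["SibSp"] + out["Parch"] + 1
    out["is_alone"] = (out["family_size"] == 1).astype(int)
    out["cabin_known"] = out["Cabin"].notna().astype(int)
    # The title is the word that ends with a period, e.g. "Mr." or "Miss."
    title = out["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    out["title"] = title.replace(RARE_TITLES, "Rare").replace(TITLE_SYNONYMS)
    return out


# Tiny illustrative frame; the last two rows are invented
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Example, Mlle. Anne", "Example, Dr. Sam"],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 0, 1],
    "Cabin": [None, None, "C85", None],
})

features = add_titanic_features(sample)
print(features[["Name", "family_size", "is_alone", "cabin_known", "title"]])
```

Working on a copy leaves the input frame untouched, which makes the function safe to call repeatedly while experimenting.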
Having too many features is itself a problem — the curse of dimensionality. With high-dimensional data, the feature space becomes sparse, distances become meaningless, and models overfit to noise. Feature selection finds the signal.
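Correlation helps in two ways. Features that correlate strongly with the target are candidates to keep. Features that correlate strongly with each other carry redundant information: keep one, drop the other.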
import seaborn as sns
import matplotlib.pyplot as plt
# numerical_cols: list of numeric column names in the house-price data, including 'SalePrice'
corr_matrix = df[numerical_cols].corr()
# Correlation of each feature with the target, strongest first
print(corr_matrix['SalePrice'].sort_values(ascending=False).head(6))
Output:
SalePrice 1.000
OverallQual 0.791
GrLivArea 0.709
GarageCars 0.640
GarageArea 0.623
TotalBsmtSF 0.614
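Ranking features against the target does not by itself remove redundant pairs. One common way to do that is to scan the upper triangle of the absolute correlation matrix and drop the later column of any pair above a threshold. The sketch below assumes a 0.95 threshold and uses synthetic columns that borrow the names GarageCars, GarageArea, and GrLivArea purely for illustration; drop_correlated is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the part above the diagonal so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Synthetic data: GarageArea is built to track GarageCars almost exactly
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=200).astype(float)
demo = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 10, size=200),
    "GrLivArea": rng.normal(1500, 400, size=200),
})

reduced, dropped = drop_correlated(demo, threshold=0.95)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

Which member of a pair to keep is a judgment call; this sketch simply keeps whichever column appears first.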
Tree-based models like Random Forest measure how much each feature reduces impurity across all splits. This gives a natural importance ranking.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(10))
Output:
title 0.2341
fare 0.1987
age 0.1654
family_size 0.1102
cabin_known 0.0876
pclass 0.0712
is_alone 0.0534
sex 0.0421
embarked 0.0198
sibsp 0.0175
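Impurity-based importances are computed on the training data and tend to favor features with many distinct values, so it can be worth cross-checking them. The sketch below is one way to do that, using synthetic data where the informative columns are known in advance (the signal_ and noise_ column names are invented for this example). It compares feature_importances_ with scikit-learn's permutation_importance measured on held-out data, a different technique from the one shown above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the first 3 columns are informative, the rest are noise
X_arr, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3,
    n_redundant=0, n_repeated=0, shuffle=False, random_state=0,
)
cols = [f"signal_{i}" for i in range(3)] + [f"noise_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Impurity-based ranking (from training data)
impurity_rank = pd.Series(forest.feature_importances_, index=cols)
impurity_rank = impurity_rank.sort_values(ascending=False)

# Permutation ranking: drop in held-out accuracy when a column is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_rank = pd.Series(perm.importances_mean, index=cols)
perm_rank = perm_rank.sort_values(ascending=False)

print(pd.DataFrame({"impurity": impurity_rank, "permutation": perm_rank}))
```

When the two rankings disagree sharply on a feature, that feature deserves a closer look before it is kept or dropped.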
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
# SelectKBest: keep top K features by statistical test
selector = SelectKBest(score_func=f_classif, k=6)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("SelectKBest features:", list(selected_features))
# RFE: recursively remove weakest features
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=6)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_]
print("RFE features:", list(rfe_features))
Output:
SelectKBest features: ['pclass', 'fare', 'title', 'family_size', 'cabin_known', 'age']
RFE features: ['pclass', 'sex', 'fare', 'title', 'cabin_known', 'family_size']
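The snippet above fits the selectors on all of X and y. If the selected columns are then scored with cross-validation, the selection step has already seen the validation folds, which can make scores look better than they are. A common remedy is to put the selector inside a Pipeline so that it is refit on each training fold. The sketch below assumes synthetic data from make_classification and k=6; the step names are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y: 20 features, only a few informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=4,
    n_redundant=2, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=6)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Selection is re-run inside every fold, using that fold's training rows only
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit once on all data to inspect which columns the selector keeps
pipe.fit(X_demo, y_demo)
kept = int(pipe.named_steps["select"].get_support().sum())
print("Features kept:", kept)
```

The same pattern works with RFE in place of SelectKBest, at the cost of more model fits per fold.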
| Technique | When to Use | Python Method | Example |
|---|---|---|---|
| Polynomial features | Non-linear relationships | PolynomialFeatures(degree=2) | area^2, area*rooms |
| Log transform | Skewed distributions | np.log1p(col) | Income, price |
| Combined features | Two features combine meaningfully | df['a'] * df['b'] or df['a'] + df['b'] | family_size = SibSp + Parch + 1 |
| Binning | Step-like relationships | pd.cut() | Age groups |
| Correlation filter | Remove redundant features | df.corr() | Drop one of GarageCars / GarageArea if their pairwise correlation exceeds a threshold (e.g. 0.95) |
| Tree importance | General ranking of features | .feature_importances_ | Any tree model |
| SelectKBest | Fast univariate filtering | SelectKBest(f_classif, k=K) | Top K statistical features |
| RFE | Wrapper method, model-specific | RFE(estimator, n_features_to_select=K) | Recursive elimination |
As the number of features grows, the volume of the feature space grows exponentially. With 2 features and 100 samples, the space is reasonably populated. With 100 features and 100 samples, each sample is essentially alone in its own neighborhood. Distance-based algorithms like K-Nearest Neighbors degrade sharply, because the nearest and farthest neighbors end up at nearly the same distance. Even tree models begin to overfit on noise dimensions.
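The effect on distances can be checked numerically. The sketch below draws 100 uniformly random points (an assumption; real data is rarely uniform) and measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance. The helper name relative_contrast is made up for this example.

```python
import numpy as np


def relative_contrast(n_features, n_samples=100, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_samples, n_features))
    query = rng.uniform(size=n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrast_by_dim = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, contrast in contrast_by_dim.items():
    print(f"{d:>5} features: relative contrast = {contrast:.2f}")
```

As the contrast shrinks toward zero, "nearest" stops carrying much information, which is why neighbor-based methods suffer first.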
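A common rule of thumb is to have roughly 5–10 samples per feature for a model to generalize. With 500 samples, that suggests at most about 50–100 features. Treat this as a rough heuristic rather than a guarantee: the right number depends on the model and on how much signal the features carry.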
cabin_known and family_size features often outperform raw columns in practice.