Feature Engineering Guide: Turn Raw Data into Powerful ML Inputs
Feature engineering guide for machine learning — practical techniques to create, transform, and select features that improve model accuracy, with Python code examples for every method.
I've seen teams spend weeks tuning neural network architectures and hyperparameters for a 1-2% improvement. Then a thoughtful domain expert suggests three new features, and accuracy jumps 8%.
Feature engineering is consistently underestimated by people new to ML. Algorithms, even complex deep learning ones, can only learn patterns that are present in the features you give them. If a key pattern requires an interaction between two variables, a ratio of three values, or a temporal aggregation, many models will struggle to discover it unless you present it explicitly.
This guide covers the major feature engineering techniques with Python code, when to apply each, and how to think about feature creation from a domain perspective.
Numerical Features
Scaling and Normalization
Many algorithms (SVM, neural networks, logistic regression, KNN) are sensitive to feature magnitude. Features on different scales cause some features to dominate:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Generate example data
data = pd.DataFrame({
    'age': [25, 45, 35, 55, 28],
    'salary': [50000, 120000, 75000, 180000, 45000],
    'years_experience': [2, 20, 10, 30, 3]
})
# StandardScaler: mean=0, std=1; best for most algorithms
standard = StandardScaler()
data_standard = pd.DataFrame(
    standard.fit_transform(data),
    columns=data.columns
)
# MinMaxScaler: scales to [0, 1]; good for neural networks
minmax = MinMaxScaler()
data_minmax = pd.DataFrame(
    minmax.fit_transform(data),
    columns=data.columns
)
# RobustScaler: uses median and IQR; best when outliers are present
robust = RobustScaler()
data_robust = pd.DataFrame(
    robust.fit_transform(data),
    columns=data.columns
)
print("StandardScaler:\n", data_standard.round(3))
When to use each:
- StandardScaler: default choice, works well when features are roughly normally distributed
- MinMaxScaler: when you need values in a specific range (e.g., neural network inputs)
- RobustScaler: when your data has significant outliers
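To make the outlier point concrete, here is a minimal sketch on made-up salary values with one extreme entry. It compares how much spread the five typical values keep after each scaler; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Illustrative salaries (made up): five typical values and one extreme outlier
salaries = np.array([[40_000], [45_000], [50_000], [55_000], [60_000], [2_000_000]])

standard_scaled = StandardScaler().fit_transform(salaries).ravel()
robust_scaled = RobustScaler().fit_transform(salaries).ravel()

# How far apart are the five typical values after scaling?
standard_spread = standard_scaled[:5].max() - standard_scaled[:5].min()
robust_spread = robust_scaled[:5].max() - robust_scaled[:5].min()

print(f"Spread of typical values, StandardScaler: {standard_spread:.3f}")
print(f"Spread of typical values, RobustScaler:   {robust_spread:.3f}")
```

With StandardScaler, the outlier inflates the standard deviation, so the typical values end up squeezed into a very narrow band. RobustScaler scales by the interquartile range, which the single outlier barely affects.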
Transformations for Skewed Features
Right-skewed distributions (like income, price, count) often benefit from transformation:
import matplotlib.pyplot as plt
from scipy import stats
# Salary is right-skewed: most people earn moderate amounts, few earn very high
salary = np.array([30000, 35000, 40000, 45000, 50000, 60000, 75000, 100000, 250000, 500000])
# Log transformation (compress large values)
salary_log = np.log1p(salary) # log(x+1) to handle zeros
# Box-Cox transformation (finds optimal transformation)
salary_boxcox, lambda_val = stats.boxcox(salary)
print(f"Optimal Box-Cox lambda: {lambda_val:.3f}")
# Visualize before/after
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(salary, bins=20)
axes[0].set_title('Original (right-skewed)')
axes[1].hist(salary_log, bins=20)
axes[1].set_title('Log transformed')
axes[2].hist(salary_boxcox, bins=20)
axes[2].set_title('Box-Cox transformed')
plt.tight_layout()
plt.show()
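If you prefer a number to a histogram, a rough check is to compare sample skewness before and after each transformation. The sketch below reuses the salary values from above and uses scipy.stats.skew. Note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
from scipy import stats

salary = np.array([30000, 35000, 40000, 45000, 50000,
                   60000, 75000, 100000, 250000, 500000])

salary_log = np.log1p(salary)
salary_boxcox, lambda_val = stats.boxcox(salary)  # inputs must be > 0

# Skewness near 0 means a roughly symmetric distribution
skew_raw = stats.skew(salary)
skew_log = stats.skew(salary_log)
skew_boxcox = stats.skew(salary_boxcox)

print(f"Skewness, original: {skew_raw:.2f}")
print(f"Skewness, log1p:    {skew_log:.2f}")
print(f"Skewness, Box-Cox:  {skew_boxcox:.2f}")
```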
Binning Continuous Variables
Converting continuous to categorical can capture non-linear relationships:
df = pd.DataFrame({'age': [22, 25, 31, 42, 48, 55, 63, 70]})
# Custom bin edges (domain-defined age bands; pass bins=4 for equal-width bins)
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 30, 45, 60, 100],
                         labels=['Young Adult', 'Middle', 'Senior', 'Elder'])
# Quantile-based bins (equal-size groups)
df['age_quantile'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(df[['age', 'age_group', 'age_quantile']])
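The difference between the binning strategies shows up in how many rows land in each bin. As a small sketch on the same ages: pd.cut with an integer splits the value range into equal-width intervals, while pd.qcut picks edges so each bin holds about the same number of rows.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 42, 48, 55, 63, 70], name="age")

# Equal-width: the range 22..70 is split into 4 intervals of the same width
equal_width = pd.cut(ages, bins=4)
# Quantile-based: edges are chosen so each bin gets roughly the same count
quantile_based = pd.qcut(ages, q=4)

equal_width_counts = equal_width.value_counts().sort_index().tolist()
quantile_counts = quantile_based.value_counts().sort_index().tolist()

print("Rows per equal-width bin:", equal_width_counts)
print("Rows per quantile bin:   ", quantile_counts)
```

Equal-width bins can end up nearly empty when the data is skewed; quantile bins avoid that at the cost of less interpretable edges.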
Categorical Features
One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame({
'city': ['New York', 'London', 'Tokyo', 'New York', 'Paris', 'London'],
'category': ['A', 'B', 'A', 'C', 'B', 'A']
})
# pandas get_dummies (simple, good for exploration)
dummies = pd.get_dummies(df, columns=['city', 'category'], drop_first=True)
# scikit-learn OneHotEncoder (better for pipelines)
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['city', 'category']])
print("Encoded shape:", encoded.shape)
print("Feature names:", encoder.get_feature_names_out())
Ordinal Encoding
For categories with natural ordering:
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
df['education_encoded'] = encoder.fit_transform(df[['education']])
# High School=0, Bachelor=1, Master=2, PhD=3
print(df)
Target Encoding (for High-Cardinality)
For features with hundreds of categories, one-hot creates too many columns. Target encoding replaces each category with the mean target value for that category:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
def target_encode_cv(train_df, val_df, column, target, n_splits=5):
    """Cross-validated target encoding (prevents data leakage)"""
    # Train set: encode using cross-validation to avoid leakage
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    train_encoded = np.zeros(len(train_df))
    for train_idx, val_idx in kf.split(train_df):
        mean_map = train_df.iloc[train_idx].groupby(column)[target].mean()
        train_encoded[val_idx] = train_df.iloc[val_idx][column].map(mean_map)
    # Fill unmapped with global mean
    global_mean = train_df[target].mean()
    train_encoded = np.where(np.isnan(train_encoded), global_mean, train_encoded)
    # Validation/test set: encode using all training data
    mean_map_full = train_df.groupby(column)[target].mean()
    val_encoded = val_df[column].map(mean_map_full).fillna(global_mean)
    return train_encoded, val_encoded.values
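# Example usage
train_df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'NYC', 'SF', 'LA'],
    'target': [1, 0, 1, 1, 0, 0, 1, 1]
})
# Illustrative validation rows; 'Boston' is unseen, so it falls back to the global mean
val_df = pd.DataFrame({'city': ['NYC', 'SF', 'Boston']})
train_encoded, val_encoded = target_encode_cv(train_df, val_df, 'city', 'target')
print("Train encoding:", train_encoded.round(3))
print("Validation encoding:", val_encoded.round(3))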
Datetime Features
Datetime columns are gold mines for features:
import numpy as np  # needed for the sin/cos cyclical encoding below
import pandas as pd
df = pd.DataFrame({
'transaction_time': pd.to_datetime([
'2024-01-15 09:23:00', '2024-07-04 18:45:00',
'2024-12-24 14:30:00', '2024-03-17 02:15:00'
])
})
# Extract all useful time components
df['hour'] = df['transaction_time'].dt.hour
df['day_of_week'] = df['transaction_time'].dt.dayofweek # 0=Monday, 6=Sunday
df['day_of_month'] = df['transaction_time'].dt.day
df['month'] = df['transaction_time'].dt.month
df['quarter'] = df['transaction_time'].dt.quarter
df['year'] = df['transaction_time'].dt.year
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_business_hours'] = ((df['hour'] >= 9) & (df['hour'] < 17)).astype(int)
# Cyclical encoding (hour 23 is close to hour 0 — linear encoding misses this)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
print(df[['transaction_time', 'hour', 'is_weekend', 'is_business_hours', 'hour_sin', 'hour_cos']])
Why cyclical encoding matters: Hours 23 and 0 are only 1 hour apart, but linear encoding gives them values 23 and 0 — making them appear far apart. Sine/cosine encoding correctly represents the circular nature.
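A quick way to check this claim is to measure distances between encoded hours. In the sketch below, encode_hour and cyclical_distance are helper names introduced only for illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

dist_23_0 = cyclical_distance(23, 0)   # neighbors across midnight
dist_0_1 = cyclical_distance(0, 1)     # ordinary neighbors
dist_0_12 = cyclical_distance(0, 12)   # opposite sides of the clock

print(f"Linear gap 23 vs 0: {abs(23 - 0)}")
print(f"Cyclical distance 23 vs 0: {dist_23_0:.3f}")
print(f"Cyclical distance 0 vs 1:  {dist_0_1:.3f}")
print(f"Cyclical distance 0 vs 12: {dist_0_12:.3f}")
```

In the encoded space, 23:00 and 00:00 are exactly as close as 00:00 and 01:00, and midnight versus noon is the largest possible distance.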
Interaction Features
Combining features can capture relationships that individual features miss:
df = pd.DataFrame({
'age': [25, 45, 35, 55, 28],
'income': [50000, 120000, 75000, 180000, 45000],
'credit_score': [650, 780, 720, 800, 600]
})
# Ratio features (often more informative than raw values)
df['income_per_age'] = df['income'] / df['age']
df['credit_score_normalized'] = df['credit_score'] / 850 # Normalize to max possible
# Pairwise interaction terms (interaction_only=True keeps products like age*income
# and omits squared terms; set interaction_only=False to include them)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'income', 'credit_score']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out())
print("Polynomial feature names:", poly.get_feature_names_out())
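To see why an explicit interaction term can matter, here is a small sketch on synthetic data where the target is, by construction, the product of two inputs plus noise. A linear model sees almost no signal in the raw columns and recovers it once the product is supplied. The data and variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends only on the product x1 * x2
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=500)
x2 = rng.uniform(-1, 1, size=500)
y = x1 * x2 + rng.normal(0, 0.05, size=500)

X_raw = np.column_stack([x1, x2])
X_with_product = np.column_stack([x1, x2, x1 * x2])

# 5-fold cross-validated R^2 for each feature set
r2_raw = cross_val_score(LinearRegression(), X_raw, y, cv=5, scoring='r2').mean()
r2_with_product = cross_val_score(
    LinearRegression(), X_with_product, y, cv=5, scoring='r2'
).mean()

print(f"R^2 with raw features only:   {r2_raw:.3f}")
print(f"R^2 with the product feature: {r2_with_product:.3f}")
```

Tree ensembles and neural networks can approximate such interactions on their own, but usually need more data and capacity than a model that is handed the feature directly.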
Aggregation Features
For datasets with one-to-many relationships (e.g., customers and transactions):
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'amount': [50, 120, 75, 200, 45, 30, 80, 150, 25],
    'category': ['food', 'retail', 'food', 'retail', 'food', 'food', 'retail', 'retail', 'food'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                            '2024-01-03', '2024-01-08', '2024-01-02',
                            '2024-01-04', '2024-01-09', '2024-01-12'])
})
# Aggregate to customer level
customer_features = transactions.groupby('customer_id').agg(
    total_transactions=('amount', 'count'),
    total_spend=('amount', 'sum'),
    avg_spend=('amount', 'mean'),
    max_spend=('amount', 'max'),
    min_spend=('amount', 'min'),
    spend_std=('amount', 'std'),
    unique_categories=('category', 'nunique'),
    days_active=('date', lambda x: (x.max() - x.min()).days)
).reset_index()
# Category-specific features
category_pivot = transactions.pivot_table(
    values='amount', index='customer_id', columns='category', aggfunc='sum', fill_value=0
)
category_pivot.columns = [f'spend_{c}' for c in category_pivot.columns]
customer_features = customer_features.merge(category_pivot, on='customer_id')
print(customer_features)
Handling Missing Values
from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd
import numpy as np
df = pd.DataFrame({
'age': [25, np.nan, 35, 45, np.nan],
'income': [50000, 80000, np.nan, 120000, 45000],
'education': ['Bachelor', 'Master', np.nan, 'PhD', 'Bachelor']
})
# Strategy 1: Simple imputation
# Numerical — use median (robust to outliers)
num_imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])
# Categorical — use most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['education']] = cat_imputer.fit_transform(df[['education']])
# Strategy 2: Add missingness indicator (when missing is informative)
df_with_indicator = pd.DataFrame({
'age': [25, np.nan, 35, 45, np.nan],
'income': [50000, 80000, np.nan, 120000, 45000]
})
# First, create indicator columns
df_with_indicator['age_missing'] = df_with_indicator['age'].isnull().astype(int)
df_with_indicator['income_missing'] = df_with_indicator['income'].isnull().astype(int)
# Then impute
df_with_indicator[['age', 'income']] = SimpleImputer(strategy='median').fit_transform(
df_with_indicator[['age', 'income']]
)
# Strategy 3: KNN imputation (uses similar rows to impute)
knn_imputer = KNNImputer(n_neighbors=3)
df_knn = pd.DataFrame({
'age': [25, np.nan, 35, 45, np.nan],
'income': [50000, 80000, np.nan, 120000, 45000]
})
df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df_knn), columns=df_knn.columns)
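The examples above fit and transform the same frame for brevity. In practice the imputer should learn its statistics from the training rows only and then be applied to new rows. A minimal sketch, using made-up train and test frames and SimpleImputer's add_indicator option to append the missingness flags in one step:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data for illustration
train = pd.DataFrame({
    'age': [25, np.nan, 35, 45],
    'income': [50000, 80000, np.nan, 120000],
})
test = pd.DataFrame({
    'age': [np.nan, 60],
    'income': [70000, np.nan],
})

# add_indicator=True appends one 0/1 column per feature that had missing values in fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
train_imputed = imputer.fit_transform(train)  # medians learned from train only
test_imputed = imputer.transform(test)        # same medians reused on test

print("Medians learned from train:", imputer.statistics_)
print(test_imputed)  # columns: age, income, age_missing flag, income_missing flag
```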
Feature Selection
After creating features, select the most informative ones:
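import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Assumes X_train (2-D feature array), y_train (class labels), and
# feature_names (list of column names, at least 10 of them) are already defined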
# Method 1: Feature importance from tree models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
feature_importance = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Top 10 features by importance:")
print(feature_importance.head(10))
# Method 2: Statistical selection (SelectKBest)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print("Selected features:", selected_features)
# Method 3: Remove highly correlated features
def remove_correlated_features(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [col for col in upper_triangle.columns if any(upper_triangle[col] > threshold)]
    print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")
    return df.drop(columns=to_drop)
X_train_uncorrelated = remove_correlated_features(pd.DataFrame(X_train, columns=feature_names))
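The snippets above assume existing training data. For a runnable version, the sketch below generates a synthetic classification problem with make_classification and compares cross-validated accuracy with and without SelectKBest. The selector sits inside the pipeline so it is refit on each training fold rather than seeing the held-out rows. The dataset sizes and k=8 are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which 5 are informative and 2 redundant
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=42
)

all_features = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
selected_features = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=8),  # refit inside every CV fold
    LogisticRegression(max_iter=1000),
)

score_all = cross_val_score(all_features, X, y, cv=5).mean()
score_selected = cross_val_score(selected_features, X, y, cv=5).mean()

print(f"CV accuracy, all 20 features: {score_all:.3f}")
print(f"CV accuracy, top 8 features:  {score_selected:.3f}")
```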
The Feature Engineering Workflow
1. Explore raw data
- Distributions, missing values, outliers
- Correlation with target
- Domain-specific patterns
2. Basic preprocessing
- Handle missing values
- Encode categoricals
- Scale numerics
3. Create new features
- Datetime decomposition
- Interaction terms
- Domain-specific aggregations
- Ratio features
4. Select features
- Remove highly correlated
- Feature importance ranking
- Validate: do added features improve CV score?
5. Iterate
- Test each feature's contribution
- Remove features that don't improve validation score
Conclusion
Feature engineering is where domain knowledge and ML skill intersect. The best features come from understanding the problem deeply: why do customers churn, what makes a transaction fraudulent, what time patterns drive sales. No automated feature generation tool replaces this understanding.
The practical rule: always test features on validation data before including them in production. A feature that looks informative can still add noise rather than signal — cross-validation scores tell you the truth.
For the modeling skills that use these features, see our scikit-learn tutorial and machine learning beginners guide.
Frequently Asked Questions
What is feature engineering in machine learning?
Using domain knowledge to create, transform, and select input variables that help ML models learn patterns. Raw data rarely comes in algorithm-ready form — feature engineering bridges that gap. Good feature engineering can improve accuracy more than switching algorithms.
Is feature engineering still important with deep learning?
For tabular/structured data: yes, very much. For images, text, audio: deep learning learns features automatically. Even for tabular deep learning, thoughtful features typically improve results. Domain-knowledge features often outperform automated feature generation.
How do I handle missing values in machine learning?
It depends on why values are missing. If they are missing completely at random (MCAR), median imputation works. If they are missing not at random (MNAR), add a binary missingness indicator before imputing. For categoricals, add an 'Unknown' category. For time series, forward-fill. Always use training set statistics for imputation; fitting on test data is data leakage.
What is feature selection and why does it matter?
Removing irrelevant, redundant, or noise features before training. Reduces overfitting, speeds training, improves interpretability. Methods: feature importance from tree models, SelectKBest statistical tests, correlation-based removal. Always verify removed features don't decrease validation performance.
What is one-hot encoding and when should I use it?
Converts a categorical variable with N categories into N binary columns (N-1 if one level is dropped, as with drop_first=True). Use it for nominal categories with no order and manageable cardinality (roughly under 20-50 values). For high-cardinality features, prefer target encoding. For tree models, ordinal encoding often works as well.
AiTechWorlds Team