
Feature Engineering Guide: Turn Raw Data into Powerful ML Inputs

Feature engineering guide for machine learning — practical techniques to create, transform, and select features that improve model accuracy, with Python code examples for every method.

AiTechWorlds Team
May 27, 2026 9 min read

I've seen teams spend weeks tuning neural network architectures and hyperparameters for a 1-2% improvement. Then a thoughtful domain expert suggested three new features, and accuracy jumped 8%.

Feature engineering is consistently underestimated by people new to ML. Algorithms, even complex deep learning ones, can only learn patterns that are recoverable from the features you give them. If a key pattern depends on an interaction between two variables, a ratio of three values, or a temporal aggregation, many models will struggle to discover it unless you present it explicitly.

This guide covers the major feature engineering techniques with Python code, when to apply each, and how to think about feature creation from a domain perspective.


Numerical Features

Scaling and Normalization

Many algorithms (SVM, neural networks, logistic regression, KNN) are sensitive to feature magnitude. When features sit on very different scales, the large-magnitude ones dominate distance and gradient computations:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Generate example data
data = pd.DataFrame({
    'age': [25, 45, 35, 55, 28],
    'salary': [50000, 120000, 75000, 180000, 45000],
    'years_experience': [2, 20, 10, 30, 3]
})

# StandardScaler: mean=0, std=1 — best for most algorithms
standard = StandardScaler()
data_standard = pd.DataFrame(
    standard.fit_transform(data),
    columns=data.columns
)

# MinMaxScaler: scales to [0, 1] — good for neural networks
minmax = MinMaxScaler()
data_minmax = pd.DataFrame(
    minmax.fit_transform(data),
    columns=data.columns
)

# RobustScaler: uses median and IQR — best when outliers are present
robust = RobustScaler()
data_robust = pd.DataFrame(
    robust.fit_transform(data),
    columns=data.columns
)

print("StandardScaler:\n", data_standard.round(3))

When to use each:

  • StandardScaler: default choice, works well when features are roughly normally distributed
  • MinMaxScaler: when you need values in a specific range (e.g., neural network inputs)
  • RobustScaler: when your data has significant outliers
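
To make the outlier point concrete, here is a minimal sketch that scales the same column three ways and measures how much spread the ordinary values keep. The salary figures are made up for illustration and are not from any real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Five ordinary salaries plus one extreme outlier (illustrative values)
salaries = np.array([[40_000], [45_000], [50_000], [55_000], [60_000], [1_000_000]])

scaled = {
    "standard": StandardScaler().fit_transform(salaries).ravel(),
    "minmax": MinMaxScaler().fit_transform(salaries).ravel(),
    "robust": RobustScaler().fit_transform(salaries).ravel(),
}

# Range covered by the five ordinary values after scaling:
# a tiny range means the outlier has squashed them together
inlier_spread = {name: float(v[:5].max() - v[:5].min()) for name, v in scaled.items()}
for name, spread in inlier_spread.items():
    print(f"{name:>8}: inlier spread = {spread:.3f}")
```

On this toy column, RobustScaler keeps the ordinary salaries well separated, while MinMaxScaler compresses them into a narrow band near zero because the outlier defines the top of the range.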

Transformations for Skewed Features

Right-skewed distributions (like income, price, count) often benefit from transformation:

import matplotlib.pyplot as plt
from scipy import stats

# Salary is right-skewed: most people earn moderate amounts, few earn very high
salary = np.array([30000, 35000, 40000, 45000, 50000, 60000, 75000, 100000, 250000, 500000])

# Log transformation (compress large values)
salary_log = np.log1p(salary)  # log(x+1) to handle zeros

# Box-Cox transformation (estimates the power parameter lambda; input must be strictly positive)
salary_boxcox, lambda_val = stats.boxcox(salary)
print(f"Optimal Box-Cox lambda: {lambda_val:.3f}")

# Visualize before/after
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(salary, bins=20)
axes[0].set_title('Original (right-skewed)')
axes[1].hist(salary_log, bins=20)
axes[1].set_title('Log transformed')
axes[2].hist(salary_boxcox, bins=20)
axes[2].set_title('Box-Cox transformed')
plt.tight_layout()
plt.show()
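
If you prefer a number to a histogram, skewness can be measured directly. The sketch below reuses the salary array from the example above and also shows that a log1p transform can be undone with expm1, which matters when the transformed column is a prediction target. Treat it as an illustration rather than a rule for when to transform.

```python
import numpy as np
from scipy import stats

salary = np.array([30000, 35000, 40000, 45000, 50000, 60000, 75000, 100000, 250000, 500000])
salary_log = np.log1p(salary)

# Sample skewness: 0 for a symmetric distribution, positive for a right tail
skew_before = stats.skew(salary)
skew_after = stats.skew(salary_log)
print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# expm1 is the exact inverse of log1p, so values can be mapped back
salary_restored = np.expm1(salary_log)
```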

Binning Continuous Variables

Converting a continuous variable into categories can help simple models capture non-linear relationships:

df = pd.DataFrame({'age': [22, 25, 31, 42, 48, 55, 63, 70]})

# Custom bin edges chosen from domain knowledge (these bins are not equal-width)
df['age_group'] = pd.cut(df['age'],
                          bins=[0, 30, 45, 60, 100],
                          labels=['Young Adult', 'Middle', 'Senior', 'Elder'])

# Quantile-based bins (equal-size groups)
df['age_quantile'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(df[['age', 'age_group', 'age_quantile']])
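
One practical detail: quantile edges should be learned on training data and then reused for new rows, otherwise the same age can land in different bins at training and prediction time. A minimal sketch, assuming a hypothetical train/new split and opening the outer edges so out-of-range values still get a bin:

```python
import numpy as np
import pandas as pd

train_age = pd.Series([22, 25, 31, 42, 48, 55, 63, 70])
new_age = pd.Series([18, 33, 50, 90])  # hypothetical unseen rows

# retbins=True returns the bin edges that qcut computed on the training data
_, edges = pd.qcut(train_age, q=4, retbins=True)

# Open the outer edges so values outside the training range are not NaN
edges[0], edges[-1] = -np.inf, np.inf

new_bins = pd.cut(new_age, bins=edges, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(pd.DataFrame({'age': new_age, 'age_quantile': new_bins}))
```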

Categorical Features

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Tokyo', 'New York', 'Paris', 'London'],
    'category': ['A', 'B', 'A', 'C', 'B', 'A']
})

# pandas get_dummies (simple, good for exploration)
dummies = pd.get_dummies(df, columns=['city', 'category'], drop_first=True)

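# scikit-learn OneHotEncoder (better for pipelines; sparse_output needs scikit-learn 1.2+)
# Note: with drop='first' and handle_unknown='ignore', an unseen category encodes as
# all zeros, the same row as the dropped first category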
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['city', 'category']])
print("Encoded shape:", encoded.shape)
print("Feature names:", encoder.get_feature_names_out())

Ordinal Encoding

For categories with natural ordering:

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})

encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
df['education_encoded'] = encoder.fit_transform(df[['education']])

# High School=0, Bachelor=1, Master=2, PhD=3
print(df)

Target Encoding (for High-Cardinality)

For features with hundreds of categories, one-hot creates too many columns. Target encoding replaces each category with the mean target value for that category:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(train_df, val_df, column, target, n_splits=5):
    """Cross-validated target encoding (prevents data leakage)"""
    # Train set: encode using cross-validation to avoid leakage
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    train_encoded = np.zeros(len(train_df))
    
    for train_idx, val_idx in kf.split(train_df):
        mean_map = train_df.iloc[train_idx].groupby(column)[target].mean()
        train_encoded[val_idx] = train_df.iloc[val_idx][column].map(mean_map)
    
    # Fill unmapped with global mean
    global_mean = train_df[target].mean()
    train_encoded = np.where(np.isnan(train_encoded), global_mean, train_encoded)
    
    # Validation/test set: encode using all training data
    mean_map_full = train_df.groupby(column)[target].mean()
    val_encoded = val_df[column].map(mean_map_full).fillna(global_mean)
    
    return train_encoded, val_encoded.values

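# Example usage
train_df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'NYC', 'SF', 'LA'],
    'target': [1, 0, 1, 1, 0, 0, 1, 1]
})
# Hypothetical validation rows; 'Boston' never appears in training
val_df = pd.DataFrame({'city': ['NYC', 'SF', 'Boston']})

# n_splits=4 leaves two rows per fold on this tiny example
train_encoded, val_encoded = target_encode_cv(train_df, val_df, 'city', 'target', n_splits=4)
print("Train encoding:", train_encoded.round(3))
print("Validation encoding:", val_encoded.round(3))  # 'Boston' falls back to the global mean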
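
A common refinement, not part of the function above, is to shrink each category mean toward the global mean so that rare categories do not get extreme encodings. The sketch below is one simple way to do that. The helper name smoothed_target_means and the smoothing weight m are assumptions chosen for illustration, and the result would still need the cross-validated treatment shown earlier to avoid leakage.

```python
import pandas as pd

def smoothed_target_means(df, column, target, m=2.0):
    """Category means shrunk toward the global mean.

    m acts like a count of 'pseudo-rows' at the global mean:
    encoding = (n * category_mean + m * global_mean) / (n + m)
    """
    global_mean = df[target].mean()
    per_category = df.groupby(column)[target].agg(['mean', 'count'])
    return (per_category['count'] * per_category['mean'] + m * global_mean) / (
        per_category['count'] + m
    )

train_df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'SF', 'LA', 'NYC', 'SF', 'LA'],
    'target': [1, 0, 1, 1, 0, 0, 1, 1]
})

encoding = smoothed_target_means(train_df, 'city', 'target', m=2.0)
print(encoding)
```

Here SF has a raw mean of 1.0 from only two rows, and smoothing pulls it back toward the global mean of 0.625.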

Datetime Features

Datetime columns are gold mines for features:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'transaction_time': pd.to_datetime([
        '2024-01-15 09:23:00', '2024-07-04 18:45:00',
        '2024-12-24 14:30:00', '2024-03-17 02:15:00'
    ])
})

# Extract all useful time components
df['hour'] = df['transaction_time'].dt.hour
df['day_of_week'] = df['transaction_time'].dt.dayofweek    # 0=Monday, 6=Sunday
df['day_of_month'] = df['transaction_time'].dt.day
df['month'] = df['transaction_time'].dt.month
df['quarter'] = df['transaction_time'].dt.quarter
df['year'] = df['transaction_time'].dt.year
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_business_hours'] = ((df['hour'] >= 9) & (df['hour'] < 17)).astype(int)

# Cyclical encoding (hour 23 is close to hour 0 — linear encoding misses this)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print(df[['transaction_time', 'hour', 'is_weekend', 'is_business_hours', 'hour_sin', 'hour_cos']])

Why cyclical encoding matters: Hours 23 and 0 are only 1 hour apart, but linear encoding gives them values 23 and 0 — making them appear far apart. Sine/cosine encoding correctly represents the circular nature.
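
A quick way to check this claim is to measure distances in the encoded space. The helper names below (encode_hour, hour_distance) are made up for this sketch.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def hour_distance(a, b):
    """Euclidean distance between two hours after sine/cosine encoding."""
    return float(np.linalg.norm(encode_hour(a) - encode_hour(b)))

print(f"23 vs 0 : {hour_distance(23, 0):.3f}")   # neighbours across midnight
print(f"0 vs 1  : {hour_distance(0, 1):.3f}")    # neighbours within the day
print(f"12 vs 0 : {hour_distance(12, 0):.3f}")   # opposite sides of the clock
```

Hours 23 and 0 end up exactly as close as hours 0 and 1, while noon and midnight sit at the maximum distance. Both the sine and the cosine column are needed, because either one alone maps two different hours to the same value.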


Interaction Features

Combining features can capture relationships that individual features miss:

df = pd.DataFrame({
    'age': [25, 45, 35, 55, 28],
    'income': [50000, 120000, 75000, 180000, 45000],
    'credit_score': [650, 780, 720, 800, 600]
})

# Ratio features (often more informative than raw values)
df['income_per_age'] = df['income'] / df['age']
df['credit_score_normalized'] = df['credit_score'] / 850  # Normalize to max possible

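# Pairwise interaction terms (interaction_only=True yields products such as
# age * income but no squared terms; set it to False for full degree-2 polynomials)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)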
poly_features = poly.fit_transform(df[['age', 'income', 'credit_score']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out())
print("Polynomial feature names:", poly.get_feature_names_out())
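
To see why an interaction can matter, consider a synthetic case where the label depends only on whether two features share the same sign. The data is randomly generated for illustration, and the exact scores will vary with the seed. This is a sketch of the idea, not a benchmark.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Label is 1 when both features have the same sign: a pure interaction effect
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Same data with the product added as a third column
X_with_interaction = np.column_stack([X, X[:, 0] * X[:, 1]])

acc_without = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
acc_with = cross_val_score(LogisticRegression(), X_with_interaction, y, cv=5).mean()

print(f"Accuracy without interaction: {acc_without:.2f}")
print(f"Accuracy with interaction:    {acc_with:.2f}")
```

A linear model cannot separate this pattern from the raw columns, but it becomes easy once the product is supplied as a feature.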

Aggregation Features

For datasets with one-to-many relationships (e.g., customers and transactions), aggregate the child rows into one row of features per entity:

transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'amount': [50, 120, 75, 200, 45, 30, 80, 150, 25],
    'category': ['food', 'retail', 'food', 'retail', 'food', 'food', 'retail', 'retail', 'food'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                             '2024-01-03', '2024-01-08', '2024-01-02',
                             '2024-01-04', '2024-01-09', '2024-01-12'])
})

# Aggregate to customer level
customer_features = transactions.groupby('customer_id').agg(
    total_transactions=('amount', 'count'),
    total_spend=('amount', 'sum'),
    avg_spend=('amount', 'mean'),
    max_spend=('amount', 'max'),
    min_spend=('amount', 'min'),
    spend_std=('amount', 'std'),
    unique_categories=('category', 'nunique'),
    days_active=('date', lambda x: (x.max() - x.min()).days)
).reset_index()

# Category-specific features
category_pivot = transactions.pivot_table(
    values='amount', index='customer_id', columns='category', aggfunc='sum', fill_value=0
)
category_pivot.columns = [f'spend_{c}' for c in category_pivot.columns]

customer_features = customer_features.merge(category_pivot, on='customer_id')
print(customer_features)
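
Aggregations are often computed relative to a cutoff date, so that features only use information available at prediction time. The sketch below adds recency and a trailing 7-day spend to the same toy transactions. The snapshot date and the column names days_since_last and spend_last_7d are assumptions for illustration.

```python
import pandas as pd

transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'amount': [50, 120, 75, 200, 45, 30, 80, 150, 25],
    'date': pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                            '2024-01-03', '2024-01-08', '2024-01-02',
                            '2024-01-04', '2024-01-09', '2024-01-12'])
})

snapshot = pd.Timestamp('2024-01-15')  # hypothetical "as of" date

# Only use transactions that happened before the snapshot
history = transactions[transactions['date'] < snapshot]

# Days since each customer's most recent transaction
last_seen = history.groupby('customer_id')['date'].max()
recency = (snapshot - last_seen).dt.days.rename('days_since_last')

# Spend in the 7 days before the snapshot; customers with none get 0
recent = history[history['date'] > snapshot - pd.Timedelta(days=7)]
spend_7d = (recent.groupby('customer_id')['amount'].sum()
            .reindex(recency.index, fill_value=0)
            .rename('spend_last_7d'))

features = pd.concat([recency, spend_7d], axis=1).reset_index()
print(features)
```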

Handling Missing Values

from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 80000, np.nan, 120000, 45000],
    'education': ['Bachelor', 'Master', np.nan, 'PhD', 'Bachelor']
})

# Strategy 1: Simple imputation
# Numerical — use median (robust to outliers)
num_imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])

# Categorical — use most frequent
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['education']] = cat_imputer.fit_transform(df[['education']])

# Strategy 2: Add missingness indicator (when missing is informative)
df_with_indicator = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 80000, np.nan, 120000, 45000]
})

# First, create indicator columns
df_with_indicator['age_missing'] = df_with_indicator['age'].isnull().astype(int)
df_with_indicator['income_missing'] = df_with_indicator['income'].isnull().astype(int)
# Then impute
df_with_indicator[['age', 'income']] = SimpleImputer(strategy='median').fit_transform(
    df_with_indicator[['age', 'income']]
)

# Strategy 3: KNN imputation (uses similar rows to impute)
knn_imputer = KNNImputer(n_neighbors=3)
df_knn = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'income': [50000, 80000, np.nan, 120000, 45000]
})
df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df_knn), columns=df_knn.columns)
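
The snippets above call fit_transform on a single frame for brevity. With a real train/test split, the imputer should learn its statistics from the training rows only and then apply them to the test rows. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    'age': [25, np.nan, 35, 45],
    'income': [50000, 80000, np.nan, 120000]
})
test = pd.DataFrame({
    'age': [np.nan, 60],
    'income': [45000, np.nan]
})

imputer = SimpleImputer(strategy='median')
imputer.fit(train)                      # medians come from training data only

test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)
print("Learned medians:", imputer.statistics_)
print(test_imputed)
```

The missing test age is filled with 35, the training median, even though the test set itself has different values.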

Feature Selection

After creating features, select the most informative ones:

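import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the snippet runs end to end (an assumption for
# illustration); swap in your own X_train, y_train, and feature_names
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)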

# Method 1: Feature importance from tree models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

feature_importance = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Top 10 features by importance:")
print(feature_importance.head(10))

# Method 2: Statistical selection (SelectKBest)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print("Selected features:", selected_features)

# Method 3: Remove highly correlated features
def remove_correlated_features(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [col for col in upper_triangle.columns if any(upper_triangle[col] > threshold)]
    print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")
    return df.drop(columns=to_drop)

X_train_uncorrelated = remove_correlated_features(pd.DataFrame(X_train, columns=feature_names))
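
Recursive feature elimination (RFE) is another option alongside the three methods above: it repeatedly fits a model and discards the weakest feature until the requested number remain. The sketch below runs on synthetic data, and the choice of logistic regression as the estimator and of three features to keep are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask; ranking_ is 1 for kept features, higher for dropped ones
selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print("Selected by RFE:", selected)
print("Ranking:", dict(zip(feature_names, rfe.ranking_)))
```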

The Feature Engineering Workflow

1. Explore raw data
   - distributions, missing values, outliers
   - correlation with target
   - domain-specific patterns

2. Basic preprocessing
   - Handle missing values
   - Encode categoricals
   - Scale numerics

3. Create new features
   - Datetime decomposition
   - Interaction terms
   - Domain-specific aggregations
   - Ratio features

4. Select features
   - Remove highly correlated
   - Feature importance ranking
   - Validate: do added features improve CV score?

5. Iterate
   - Test each feature's contribution
   - Remove features that don't improve validation score
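
Steps 2 and 4 of this workflow can be sketched as a single scikit-learn pipeline, which keeps imputation, scaling, and encoding inside each cross-validation fold. The dataset here is randomly generated and the column names are hypothetical, so read it as a template rather than a result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with a numeric column containing missing values
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'age': rng.integers(20, 70, size=n).astype(float),
    'income': rng.normal(70000, 20000, size=n),
    'city': rng.choice(['NYC', 'LA', 'SF'], size=n),
})
df.loc[rng.choice(n, size=20, replace=False), 'age'] = np.nan
y = (df['income'] + rng.normal(0, 10000, size=n) > 70000).astype(int)

# Step 2: preprocessing, fitted inside each CV fold to avoid leakage
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
preprocess = ColumnTransformer([
    ('num', numeric, ['age', 'income']),
    ('cat', categorical, ['city']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Step 4: validate with cross-validation
scores = cross_val_score(model, df, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

To test a new feature, add the column to df, include it in the relevant column list, and compare the cross-validation score against this baseline.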

Conclusion

Feature engineering is where domain knowledge and ML skill intersect. The best features come from understanding the problem deeply: why do customers churn, what makes a transaction fraudulent, what time patterns drive sales. No automated feature generation tool replaces this understanding.

The practical rule: always test features on validation data before including them in production. A feature that looks informative can still add noise rather than signal — cross-validation scores tell you the truth.

For the modeling skills that use these features, see our scikit-learn tutorial and machine learning beginners guide.


Frequently Asked Questions

What is feature engineering in machine learning?

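Feature engineering is the process of using domain knowledge to create, transform, and select input variables (features) that help ML models learn patterns more effectively. Raw data rarely arrives in a form that algorithms can use directly, and feature engineering bridges that gap. From a datetime column, for example, you might extract hour of day, day of week, and a holiday flag, which capture temporal patterns better than a raw timestamp. From text, you might extract length, sentiment score, and keyword presence. Good feature engineering can improve accuracy more than switching from a simple algorithm to a complex one.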

Is feature engineering still important with deep learning?

For tabular/structured data: yes, very much. For images, text, audio: deep learning learns features automatically. Even for tabular deep learning, thoughtful features typically improve results. Domain-knowledge features often outperform automated feature generation.

How do I handle missing values in machine learning?

It depends on why values are missing. If they are missing completely at random (MCAR), median imputation works. If the missingness itself carries information (missing not at random, MNAR), add a binary missingness indicator before imputing. For categoricals, add an 'Unknown' category. For time series, forward-fill. Always compute imputation statistics on the training set only, because fitting on test data is data leakage.

What is feature selection and why does it matter?

Removing irrelevant, redundant, or noise features before training. Reduces overfitting, speeds training, improves interpretability. Methods: feature importance from tree models, SelectKBest statistical tests, correlation-based removal. Always verify removed features don't decrease validation performance.

What is one-hot encoding and when should I use it?

Converts a categorical variable with N categories into N binary columns (or N-1 if one is dropped to avoid collinearity). Use it for nominal categories with no order and manageable cardinality (roughly under 20-50 values). For high-cardinality features, prefer target encoding. For tree models, ordinal encoding often works as well.
