The ML Workflow: End to End

Building a machine learning model is not a single act — it is a repeatable process. Most failed ML projects don't fail because of bad algorithms. They fail because practitioners skip steps, rush past data problems, or deploy a model without understanding what it actually learned.

This lesson walks you through the complete, professional ML workflow from raw problem to production model, using house price prediction as a concrete example throughout.

The 7-Step ML Workflow

Problem Definition → Data Collection → EDA → Feature Engineering
→ Model Selection → Training & Evaluation → Deployment

Step 1: Problem Definition

Before touching a line of code, you must answer three questions clearly:

What exactly are we predicting? (target variable)
Is this classification or regression? (type of output)
How will we measure success? (metric)

House price example:

Target: sale price in dollars (continuous number → regression)
Success metric: RMSE (Root Mean Squared Error) below $25,000
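To make the success metric concrete, here is a minimal sketch of how RMSE is computed and checked against the $25,000 target. The `rmse` helper and the three prices are made up for illustration and are not taken from the Ames data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative sale prices and predictions (not real data)
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

score = rmse(actual, predicted)
meets_target = score < 25_000  # success threshold from the problem definition
print(f"RMSE: ${score:,.0f}, meets target: {meets_target}")
```

Because the errors are squared before averaging, the single $20,000 miss weighs more heavily than the two $10,000 misses.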

Common pitfall: Jumping straight to modeling without a clear metric. You'll train endlessly without knowing if you're making progress.

Step 2: Data Collection

Garbage in, garbage out. The quality of your data determines the ceiling of your model's performance — no algorithm overcomes fundamentally bad data.

Sources for ML projects:

Public datasets: Kaggle, UCI ML Repository, government open data
APIs: Twitter, weather services, financial data providers
Web scraping (with legal/ethical care)
Internal databases and logs

House price example: We use the Ames Housing dataset — 1,460 houses, 79 features, sale prices from 2006–2010.

import pandas as pd

df = pd.read_csv('ames_housing.csv')
print(df.shape)          # (1460, 81) — 79 features + Id + SalePrice
print(df['SalePrice'].describe())
# count    1460.000000
# mean   180921.195890
# std     79442.502883
# min     34900.000000
# max    755000.000000

Common pitfall: Using features that will not actually be available at prediction time in production (data leakage). If a column is only filled in after the sale, the model cannot rely on it when it matters.
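One way to guard against leakage is to keep an explicit list of the columns known at prediction time and select only those. The sketch below uses a toy frame; `DaysOnMarket` and `ClosingCosts` are hypothetical columns invented for this example, standing in for anything recorded only after a sale.

```python
import pandas as pd

# Toy rows. 'DaysOnMarket' and 'ClosingCosts' are hypothetical columns that
# would only be known after the sale, so they must not be used as features.
toy = pd.DataFrame({
    'GrLivArea': [1500, 2100, 1200],
    'OverallQual': [7, 8, 5],
    'DaysOnMarket': [12, 45, 30],
    'ClosingCosts': [4200, 6100, 3000],
    'SalePrice': [200_000, 320_000, 150_000],
})

known_at_prediction_time = ['GrLivArea', 'OverallQual']
target = 'SalePrice'

# Everything that is neither an allowed feature nor the target is treated as leaky
leaky = [c for c in toy.columns if c not in known_at_prediction_time + [target]]
X_safe = toy[known_at_prediction_time]
y_toy = toy[target]
print("Dropped as potential leakage:", leaky)
```

An allow-list is stricter than a block-list: a newly added column stays out of the model until someone confirms it is available in production.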

Step 3: Exploratory Data Analysis (EDA)

EDA is detective work. You are trying to understand the data before the data misleads you.

Key questions to answer:

How many rows and columns? Any missing values?
What does the target variable look like? Skewed? Outliers?
Which features correlate with the target?
Are there any obvious data quality issues?

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Check missing values
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0].head(10))
# PoolQC         1453
# MiscFeature    1406
# Alley          1369
# ...

# Distribution of target variable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df['SalePrice'].hist(bins=50)
plt.title('SalePrice: Raw')

plt.subplot(1, 2, 2)
np.log1p(df['SalePrice']).hist(bins=50)
plt.title('SalePrice: Log Transformed')
plt.show()
# The log transform reduces the right skew, so the target is closer to a
# normal distribution, which tends to suit linear models better
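The EDA questions above also ask which features correlate with the target. A minimal sketch of that check follows, using synthetic data that borrows a few Ames column names; the values and the relationship between them are invented for illustration, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the housing data (not the real Ames values)
living_area = rng.uniform(800, 3000, n)
quality = rng.integers(1, 11, n)
month_sold = rng.integers(1, 13, n)  # unrelated to price by construction
price = 50 * living_area + 15_000 * quality + rng.normal(0, 20_000, n)

toy = pd.DataFrame({
    'GrLivArea': living_area,
    'OverallQual': quality,
    'MoSold': month_sold,
    'SalePrice': price,
})

# Pearson correlation of each numeric feature with the target,
# ordered by absolute strength
corr_with_target = (
    toy.corr(numeric_only=True)['SalePrice']
    .drop('SalePrice')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear relationships, so a low value is a prompt to plot the feature, not proof that it is useless.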

Common pitfall: Skipping EDA entirely and discovering problems after training. EDA takes an hour. Discovering data issues after a 12-hour training run costs far more.

Step 4: Feature Engineering

Raw data is rarely in the right form for an algorithm. Feature engineering is the process of transforming raw data into features that represent the underlying problem more clearly.

This step separates good ML practitioners from great ones.

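import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# 1. Handle missing values
# Assign the result back: calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions.
# In a real project, compute the median on the training split only.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)

# 2. Create new features
df['HouseAge'] = df['YrSold'] - df['YearBuilt']
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
df['HasPool'] = (df['PoolArea'] > 0).astype(int)

# 3. Encode categorical variables
# LabelEncoder is intended for target labels; OrdinalEncoder is the
# equivalent for features. The integer codes imply an order, which suits
# tree models better than linear ones.
df['MSZoning_enc'] = OrdinalEncoder().fit_transform(df[['MSZoning']]).ravel()

# 4. Log-transform skewed numerical features
# (done after step 2 so TotalSF is built from the raw square footage)
skewed_cols = ['LotArea', 'GrLivArea', '1stFlrSF']
df[skewed_cols] = np.log1p(df[skewed_cols])

# 5. Log-transform target
y = np.log1p(df['SalePrice'])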

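Common pitfall: Letting information from the target or the test set leak into your features. Any statistic a transform relies on (imputation medians, category encodings, scaling parameters) must be computed on the training data only and then applied unchanged to the test data.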
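The training-only rule is easier to follow when the transforms live inside a scikit-learn Pipeline, since calling fit on the pipeline fits every step on the training rows alone. The sketch below is one possible arrangement, using a tiny synthetic frame that reuses two Ames column names; the values are made up, and Ridge is just a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small synthetic frame reusing two Ames column names (values are made up)
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0, np.nan, 50.0, 90.0, 65.0],
    'MSZoning': ['RL', 'RM', 'RL', 'FV', 'RL', 'RM', 'RL', 'RL'],
    'SalePrice': [180_000, 140_000, 220_000, 250_000,
                  175_000, 130_000, 240_000, 190_000],
})
X_toy = toy[['LotFrontage', 'MSZoning']]
y_toy = np.log1p(toy['SalePrice'])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['LotFrontage']),
    # ignore categories that appear in test data but not in training data
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['MSZoning']),
])
pipe = Pipeline([('prep', preprocess), ('model', Ridge(alpha=1.0))])

pipe.fit(X_tr, y_tr)             # imputer and encoder are fit on training rows only
test_preds = pipe.predict(X_te)  # test rows are transformed, never fit

# The median stored by the imputer comes from the training split
learned_median = pipe.named_steps['prep'].named_transformers_['num'].statistics_[0]
print(learned_median, X_tr['LotFrontage'].median())
```

The same pipeline object can be passed to `cross_val_score`, which refits the preprocessing inside each fold.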

Step 5: Model Selection

Start simple. A simple model that works is better than a complex model that barely improves on it.

Hierarchy for regression problems:

Linear Regression (baseline)
Ridge / Lasso (adds regularization)
Random Forest (handles non-linearity)
Gradient Boosting (XGBoost / LightGBM — often best)

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

feature_cols = ['TotalSF', 'OverallQual', 'GrLivArea', 'GarageCars',
                'HouseAge', 'TotalBsmtSF', 'FullBath', 'HasPool']

X = df[feature_cols].fillna(0)

models = {
    'Ridge': Ridge(alpha=10),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error')
    print(f"{name}: RMSE = {-scores.mean():.4f} (+/- {scores.std():.4f})")
# RMSE is on the log scale here (y = log1p(SalePrice)), not in dollars
# Ridge:            RMSE = 0.1423 (+/- 0.0089)
# RandomForest:     RMSE = 0.1381 (+/- 0.0101)
# GradientBoosting: RMSE = 0.1298 (+/- 0.0093)

Common pitfall: Evaluating models on the training set. Always use cross-validation or a held-out test set. A near-perfect score on training data usually means the model has memorized the data, not learned a pattern that generalizes.
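The gap between training and cross-validated scores can be seen in a small experiment. This sketch uses synthetic data and an unconstrained decision tree, chosen here only because it memorizes easily; none of it comes from the house price example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 3))
# Target depends on the first feature only, plus substantial noise
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 3.0, size=300)

# No depth limit, so the tree is free to memorize every training row
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_demo, y_demo)

train_r2 = tree.score(X_demo, y_demo)  # scored on the rows it was fit on
cv_r2 = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='r2').mean()
print(f"Train R^2: {train_r2:.3f}, cross-validated R^2: {cv_r2:.3f}")
```

The training score looks perfect because the tree has fit the noise as well as the signal; the cross-validated score is the one that reflects performance on unseen rows.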

Step 6: Training and Evaluation

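Once you've selected your best model, do a final evaluation on data the model has never seen. Note that the cross-validation in Step 5 ran on all rows, so in a real project you would make this split first and run model selection on the training portion only.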

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Predict and inverse-transform from log scale
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_actual = np.expm1(y_test)

# squared=False is deprecated since scikit-learn 1.4, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
print(f"Test RMSE: ${rmse:,.0f}")
# Test RMSE: $21,847

# Check feature importances
importances = pd.Series(
    model.feature_importances_, index=feature_cols
).sort_values(ascending=False)
print(importances)
# TotalSF        0.382
# OverallQual    0.298
# GrLivArea      0.143
# ...

Common pitfall: Evaluating the model on the test set repeatedly and tuning based on those results. The test set is for final, one-time evaluation only. For tuning, use cross-validation on training data.
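A minimal sketch of that separation, assuming synthetic data from `make_regression` and a deliberately tiny parameter grid: `GridSearchCV` tunes with cross-validation on the training split, and the test split is scored exactly once at the end. The grid values are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Arbitrary small grid, just to show the mechanics
param_grid = {'n_estimators': [50, 100], 'max_depth': [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_tr, y_tr)  # tuning only ever sees the training split

# One final evaluation on the untouched test split
test_rmse = float(np.sqrt(mean_squared_error(y_te, search.predict(X_te))))
print(search.best_params_, round(test_rmse, 2))
```

By default `GridSearchCV` refits the best configuration on the full training split, so `search.predict` uses that refitted model.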

Step 7: Deployment

A model that stays in a Jupyter notebook has zero business value. Deployment means making your model available to users or systems.

Minimal deployment with Flask:

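# --- In your training script: save the fitted model ---
import pickle

with open('house_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# --- In app.py: load the model and serve predictions ---
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Only unpickle files you created yourself: pickle can run arbitrary code on load
with open('house_price_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Order must match feature_cols from training. GrLivArea was log1p-transformed
    # in Step 4, so the same transform is applied to the raw input here.
    features = np.array([[
        data['TotalSF'],
        data['OverallQual'],
        np.log1p(data['GrLivArea']),
        data['GarageCars'],
        data['HouseAge'],
        data['TotalBsmtSF'],
        data['FullBath'],
        data['HasPool']
    ]])
    log_price = loaded_model.predict(features)[0]
    price = np.expm1(log_price)  # undo the log1p applied to the target
    return jsonify({'predicted_price': round(float(price), 2)})

if __name__ == '__main__':
    app.run(debug=True)  # debug mode is for local development only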

Test it: curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"TotalSF": 2000, "OverallQual": 7, "GrLivArea": 1500, "GarageCars": 2, "HouseAge": 15, "TotalBsmtSF": 800, "FullBath": 2, "HasPool": 0}'

Common pitfall: Deploying a model and never updating it. Real-world data drifts over time. Monitor your model's predictions against actuals and retrain on a schedule.
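Monitoring can start very simply. The sketch below compares RMSE on a recent batch of sales against the test RMSE from Step 6 and flags the model when the error grows past a tolerance. The `needs_retraining` helper, the 1.25 tolerance, and the recent prices are all hypothetical choices for illustration.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def needs_retraining(actual, predicted, baseline_rmse, tolerance=1.25):
    """Return True when recent RMSE exceeds baseline_rmse * tolerance."""
    return rmse(actual, predicted) > baseline_rmse * tolerance

baseline_rmse = 21_847  # test RMSE reported in Step 6

# Hypothetical recent sales and the prices the model predicted for them
recent_actual = [250_000, 310_000, 190_000, 420_000]
recent_predicted = [215_000, 270_000, 170_000, 360_000]

recent_rmse = rmse(recent_actual, recent_predicted)
retrain = needs_retraining(recent_actual, recent_predicted, baseline_rmse)
print(f"Recent RMSE: ${recent_rmse:,.0f}, retrain: {retrain}")
```

In this made-up batch every prediction is too low, which is the kind of pattern a rising market would produce and a scheduled retrain would correct.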

Tools Summary by Step

Step	Primary Tools
Problem Definition	Notebooks, whiteboards, stakeholder discussions
Data Collection	pandas, SQL, requests, BeautifulSoup, Kaggle CLI
EDA	pandas, matplotlib, seaborn, ydata-profiling (formerly pandas-profiling)
Feature Engineering	pandas, scikit-learn (Pipeline, ColumnTransformer)
Model Selection	scikit-learn, XGBoost, LightGBM
Training & Evaluation	scikit-learn metrics, cross_val_score, MLflow
Deployment	Flask, FastAPI, Docker, AWS SageMaker, Render

Key Takeaway

The workflow is not linear — you will loop back. EDA reveals a feature engineering idea. Training results send you back to collect more data. That's normal. What matters is following the process deliberately rather than jumping straight to modeling.

Next lesson: We'll set up the complete Python ML environment so you have all these tools ready to run.