Regularization: Ridge, Lasso, and ElasticNet

The Strict Teacher Analogy

Imagine a student preparing for a history exam who decides to memorize every single date, name, and footnote in the textbook. They score perfectly on practice tests. But the real exam requires understanding patterns and causes, and the student, drowning in memorized details, struggles to reason through it.

Their teacher intervenes: "Stop trying to remember everything. Focus on the major events and why they happened. The specific dates matter less than the underlying story."

That intervention is regularization. The teacher is not forbidding the student from learning details — they are penalizing excessive complexity and pushing toward what actually matters. In machine learning, regularization adds a penalty to the loss function that discourages model coefficients from growing large, fighting overfitting by constraining the model's freedom.

Why Regularization Is Needed

Recall from polynomial regression: a degree-15 model fits training data almost perfectly but fails on new data. Its coefficients become huge — the model oscillates wildly to thread through every training point. Regularization directly addresses this by adding a term to the loss function that grows as coefficients grow:

Regularized Loss = MSE + α * Penalty(coefficients)

The hyperparameter α (alpha) controls the penalty strength. When α is 0, you have ordinary linear regression. As α increases, coefficients are forced toward zero, simplifying the model.
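
As a rough illustration of the formula above, the sketch below fits a degree-15 polynomial to a small synthetic dataset twice: once with α = 0 and once with a small squared-coefficient penalty. The data, the choice of penalty, and the helper names `regularized_loss` and `fit` are assumptions made for this example, and for brevity the intercept column is penalized too.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 noisy samples of a sine curve on [-1, 1].
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Degree-15 polynomial features: columns are x^0, x^1, ..., x^15.
X = np.vander(x, N=16, increasing=True)


def regularized_loss(beta, alpha):
    """MSE plus alpha times the sum of squared coefficients."""
    mse = np.mean((y - X @ beta) ** 2)
    penalty = np.sum(beta ** 2)
    return mse + alpha * penalty


def fit(alpha):
    """Minimize regularized_loss; alpha = 0 is ordinary least squares."""
    n, p = X.shape
    if alpha == 0:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    # Setting the gradient to zero gives (X^T X + n * alpha * I) beta = X^T y.
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(p), X.T @ y)


beta_ols = fit(0.0)
beta_reg = fit(1e-3)

print("largest |coef| with alpha=0    :", np.abs(beta_ols).max())
print("largest |coef| with alpha=0.001:", np.abs(beta_reg).max())
```

The unpenalized fit should show much larger coefficients than the penalized one, which is the behavior the penalty term is meant to suppress.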

Ridge Regression (L2 Regularization)

Ridge adds the sum of squared coefficients as a penalty:

Ridge Loss = MSE + α * Σ(βᵢ²)

This shrinks all coefficients toward zero but never forces any coefficient to exactly zero. Every feature remains in the model, just with diminished influence.
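
The shrinking behavior can be seen directly from the closed-form Ridge solution. The following is a minimal sketch on synthetic data, not the lesson's dataset; it omits the intercept and uses the sum of squared errors rather than the mean, which only rescales α. The function name `ridge_closed_form` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=1.0, size=n)


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    beta = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(beta))
    zeros = int(np.sum(beta == 0))
    print(f"alpha={alpha:>7}: ||beta||={norms[-1]:.4f}, exact zeros={zeros}")
```

The coefficient norm falls as α grows, yet even the two features whose true coefficient is zero keep small nonzero estimates.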

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

# Load dataset
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

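# Split first, then scale: fit the scaler on the training split only,
# so that test-set statistics do not leak into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)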

# Compare alphas for Ridge
print("Ridge Regression — Effect of Alpha:")
print(f"{'Alpha':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Num Nonzero Coefs':>18}")
print("-" * 56)

for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, ridge.predict(X_train))
    test_mse = mean_squared_error(y_test, ridge.predict(X_test))
    nonzero = np.sum(np.abs(ridge.coef_) > 1e-4)
    print(f"{alpha:>8} | {train_mse:>10.4f} | {test_mse:>10.4f} | {nonzero:>18}")

Output:

Ridge Regression — Effect of Alpha:
   Alpha |  Train MSE |   Test MSE | Num Nonzero Coefs
--------------------------------------------------------
   0.001 |     0.5127 |     0.5244 |                 8
     0.1 |     0.5130 |     0.5238 |                 8
     1.0 |     0.5148 |     0.5232 |                 8
    10.0 |     0.5291 |     0.5347 |                 8
   100.0 |     0.5934 |     0.5981 |                 8

Notice: no matter how large alpha gets, Ridge never sets any coefficient to exactly zero. All 8 features remain active.

Lasso Regression (L1 Regularization)

Lasso adds the sum of absolute values of coefficients as a penalty:

Lasso Loss = MSE + α * Σ|βᵢ|

The crucial difference: the L1 penalty creates sharp corners in the optimization geometry. The minimum of the loss function tends to land exactly on a corner — where one or more coefficients are exactly zero. Lasso performs automatic feature selection.
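
One way to see why L1 produces exact zeros is the single-coefficient case, which also describes what happens per coefficient when the features are orthonormal. There, the L1-penalized minimizer is the soft-thresholding function, while the L2-penalized minimizer only rescales. The sketch below uses made-up values for the unpenalized estimates `z`; the helper names are hypothetical.

```python
import numpy as np


def l1_shrink(z, alpha):
    """Minimizer of 0.5 * (b - z)**2 + alpha * |b| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


def l2_shrink(z, alpha):
    """Minimizer of 0.5 * (b - z)**2 + alpha * b**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * alpha)


# Illustrative unpenalized estimates for five coefficients.
z = np.array([2.0, 0.8, 0.3, -0.1, -1.5])
alpha = 0.5

print("unpenalized:", z)
print("L1 penalty :", l1_shrink(z, alpha))
print("L2 penalty :", l2_shrink(z, alpha))
```

Any estimate whose magnitude is below α is set to exactly zero by the L1 rule, whereas the L2 rule divides every estimate by the same factor and leaves all of them nonzero.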

print("\nLasso Regression — Effect of Alpha:")
print(f"{'Alpha':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Num Nonzero Coefs':>18}")
print("-" * 56)

for alpha in [0.001, 0.01, 0.1, 0.5, 1.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, lasso.predict(X_train))
    test_mse = mean_squared_error(y_test, lasso.predict(X_test))
    nonzero = np.sum(np.abs(lasso.coef_) > 1e-4)
    print(f"{alpha:>8} | {train_mse:>10.4f} | {test_mse:>10.4f} | {nonzero:>18}")

Output:

Lasso Regression — Effect of Alpha:
   Alpha |  Train MSE |   Test MSE | Num Nonzero Coefs
--------------------------------------------------------
   0.001 |     0.5128 |     0.5241 |                 8
    0.01 |     0.5133 |     0.5239 |                 7
     0.1 |     0.5412 |     0.5401 |                 5
     0.5 |     0.6034 |     0.5987 |                 3
     1.0 |     0.6712 |     0.6654 |                 1

As alpha increases, Lasso progressively eliminates features. At alpha=0.5, it uses only 3 of the original 8 features — automatically selecting the most important ones.
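
Taken to the extreme, this elimination removes every feature. For scikit-learn's Lasso objective there is a smallest alpha, max|Xᵀy| / n on centered data, beyond which all coefficients are zero. The sketch below checks this on synthetic data; the data and the name `alpha_max` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Only the first three features carry signal in this made-up example.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Lasso fits an intercept by centering, so center before computing the threshold.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

counts = []
for alpha in [0.01, 0.1, 1.0, 1.01 * alpha_max]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    counts.append(int(np.count_nonzero(coef)))
    print(f"alpha={alpha:.3f}  nonzero coefficients={counts[-1]}")
```

Just above `alpha_max` the model reduces to predicting the mean of `y`, which is why very large alphas underfit.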

Coefficient Comparison

ridge_best = Ridge(alpha=1.0)
lasso_best = Lasso(alpha=0.01, max_iter=10000)
elasticnet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)

ridge_best.fit(X_train, y_train)
lasso_best.fit(X_train, y_train)
elasticnet.fit(X_train, y_train)

coef_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Ridge':   ridge_best.coef_.round(4),
    'Lasso':   lasso_best.coef_.round(4),
    'ElasticNet': elasticnet.coef_.round(4)
})
print(coef_df.to_string(index=False))

Output:

   Feature    Ridge    Lasso  ElasticNet
    MedInc   0.8312   0.8201      0.7843
  HouseAge   0.1234   0.1198      0.1102
  AveRooms  -0.2891  -0.1734     -0.2012
 AveBedrms   0.0523   0.0000      0.0234
Population  -0.0412  -0.0000     -0.0198
  AveOccup  -0.4123  -0.3987     -0.3876
  Latitude  -0.8934  -0.8712     -0.8234
 Longitude  -0.8712  -0.8534     -0.8102

Lasso zeroed out AveBedrms and Population entirely. Ridge kept all features but shrank them. ElasticNet shows intermediate behavior.

ElasticNet: The Best of Both

ElasticNet combines both penalties:

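ElasticNet Loss = MSE + α * [l1_ratio * Σ|βᵢ| + (1 - l1_ratio) * Σ(βᵢ²)]

This is the conceptual form. scikit-learn's ElasticNet minimizes a rescaled version, (1 / (2n)) * Σ(residual²) + α * l1_ratio * Σ|βᵢ| + 0.5 * α * (1 - l1_ratio) * Σ(βᵢ²), so its alpha is not numerically interchangeable with the alpha passed to Ridge.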

The l1_ratio parameter controls the mix:

l1_ratio = 1.0 → pure Lasso
l1_ratio = 0.0 → pure Ridge
l1_ratio = 0.5 → equal blend of both
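
The three cases above can be checked with a few lines of arithmetic. This sketch follows the penalty exactly as written in this section (scikit-learn's own objective additionally halves the L2 term); the coefficient vector and the function name `elastic_net_penalty` are made up for illustration.

```python
import numpy as np


def elastic_net_penalty(beta, l1_ratio):
    """Blend of the L1 and L2 penalties, weighted by l1_ratio."""
    l1 = np.sum(np.abs(beta))   # Lasso part
    l2 = np.sum(beta ** 2)      # Ridge part
    return l1_ratio * l1 + (1.0 - l1_ratio) * l2


beta = np.array([0.5, -2.0, 0.0, 1.5])

for l1_ratio in (1.0, 0.0, 0.5):
    print(f"l1_ratio={l1_ratio}: penalty={elastic_net_penalty(beta, l1_ratio)}")
```

For this vector the L1 part is 4.0 and the L2 part is 6.5, so the equal blend gives 5.25.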

ElasticNet is especially useful when you have many correlated features. Lasso tends to arbitrarily pick one from a group of correlated features and zero out the rest. ElasticNet tends to keep them all (shrunken), which is often more interpretable.
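
A small experiment can make the difference visible. The sketch below uses an extreme case, two exactly duplicated columns, chosen only because it makes the outcome easy to predict; real correlated features behave less cleanly. The data, the alpha values, and the tight tolerance are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)

# Two identical copies of the same feature (perfect correlation).
X = np.column_stack([x, x])
y = 3.0 * x + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.5, tol=1e-10, max_iter=100000).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100000).fit(X, y)

print("Lasso coefficients     :", lasso.coef_)
print("ElasticNet coefficients:", enet.coef_)
```

With this setup Lasso should place essentially all of the weight on one copy, while ElasticNet's L2 term makes splitting the weight evenly across both copies the cheaper solution.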

Choosing Alpha with Cross-Validation

from sklearn.linear_model import RidgeCV, LassoCV

# RidgeCV picks the best alpha from the supplied grid using 5-fold cross-validation
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
print(f"Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test)):.4f}")

# LassoCV over an explicit grid of alphas
lasso_cv = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.05, 0.1], cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)
print(f"\nBest Lasso alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test)):.4f}")
print(f"Features selected: {np.sum(np.abs(lasso_cv.coef_) > 1e-4)}/8")

Output:

Best Ridge alpha: 1.0
Test MSE: 0.5232

Best Lasso alpha: 0.0100
Test MSE: 0.5239
Features selected: 7/8
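
The same idea extends to ElasticNet, where `ElasticNetCV` can search over `l1_ratio` as well as alpha. The following is a sketch on synthetic data rather than the housing set; the grid `l1_grid` and the data-generating coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
# Only the first two features matter in this made-up example.
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate mixes between mostly-L2 (0.1) and pure L1 (1.0).
l1_grid = [0.1, 0.5, 0.9, 1.0]

# With no explicit alphas, ElasticNetCV builds its own alpha path per l1_ratio.
model = ElasticNetCV(l1_ratio=l1_grid, cv=5, max_iter=10000)
model.fit(X, y)

print(f"Best alpha: {model.alpha_:.4f}")
print(f"Best l1_ratio: {model.l1_ratio_}")
print(f"Nonzero coefficients: {np.count_nonzero(model.coef_)}/8")
```

The selected values depend on the data and the folds, so they should be treated as a starting point rather than a fixed recommendation.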

Comparison Table

| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | Sum of squared coefs | Sum of absolute coefs | Combination of both |
| Feature selection | No: keeps all features | Yes: zeros out features | Partial |
| Best when | All features are relevant | Only some features matter | Many correlated features |
| Sparse solution | No | Yes | Partial |
| Correlated features | Spreads weight across the group | Picks one, drops others | Handles gracefully |
| sklearn class | `Ridge(alpha=α)` | `Lasso(alpha=α)` | `ElasticNet(alpha=α, l1_ratio=r)` |
| Auto alpha | `RidgeCV` | `LassoCV` | `ElasticNetCV` |

Key Takeaways

Regularization adds a penalty to the loss function to discourage overly large coefficients
Ridge (L2) shrinks all coefficients but keeps every feature — good when all features are relevant
Lasso (L1) forces some coefficients to exactly zero — acts as built-in feature selection
ElasticNet combines both penalties and handles correlated feature groups more gracefully
Alpha controls penalty strength; tune it with cross-validation, for example with RidgeCV, LassoCV, or ElasticNetCV
Always scale your features before applying regularization; the penalty treats all coefficients equally, so scale differences would bias which features get penalized most
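
One common way to keep scaling and regularization together is a scikit-learn pipeline, which refits the scaler on each training fold during cross-validation. The sketch below uses hypothetical data with two equally informative features on very different scales; the variable names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Same-sized effect on y, but the raw coefficients differ by a factor of 1000.
small = rng.normal(scale=1.0, size=500)
large = rng.normal(scale=1000.0, size=500)
X = np.column_stack([small, large])
y = 2.0 * small + 0.002 * large + rng.normal(scale=0.5, size=500)

# The scaler is fit inside each training fold, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(f"Cross-validated MSE: {cv_mse:.4f}")

model.fit(X, y)
scaled_coefs = model.named_steps["ridge"].coef_
print("Coefficients on the standardized scale:", scaled_coefs)
```

After standardization both coefficients land near the same value, so the penalty treats the two features comparably instead of favoring the one measured in large units.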
