AiTechWorlds
Imagine a student preparing for a history exam who decides to memorize every single date, name, and footnote in the textbook. They score perfectly on practice tests. But the real exam requires understanding patterns and causes — and the student, drowning in memorized details, struggles to reason through it.
Their teacher intervenes: "Stop trying to remember everything. Focus on the major events and why they happened. The specific dates matter less than the underlying story."
That intervention is regularization. The teacher is not forbidding the student from learning details — they are penalizing excessive complexity and pushing toward what actually matters. In machine learning, regularization adds a penalty to the loss function that discourages model coefficients from growing large, fighting overfitting by constraining the model's freedom.
Recall from polynomial regression: a degree-15 model fits training data almost perfectly but fails on new data. Its coefficients become huge — the model oscillates wildly to thread through every training point. Regularization directly addresses this by adding a term to the loss function that grows as coefficients grow:
Regularized Loss = MSE + α * Penalty(coefficients)
The hyperparameter α (alpha) controls the penalty strength. When α is 0, you have ordinary linear regression. As α increases, coefficients are forced toward zero, simplifying the model.
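To make the formula concrete, here is a minimal sketch in NumPy. The function names `regularized_loss` and `sum_of_squares`, the toy data, and the choice of a sum-of-squares penalty are illustrative assumptions, not part of any library.

```python
import numpy as np

def regularized_loss(X, y, beta, alpha, penalty):
    """Return MSE + alpha * penalty(beta) for a linear model without intercept."""
    residuals = y - X @ beta
    mse = np.mean(residuals ** 2)
    return mse + alpha * penalty(beta)

def sum_of_squares(beta):
    # One possible penalty (assumed here): grows quadratically with coefficient size
    return float(np.sum(beta ** 2))

# Tiny hand-made dataset that beta fits exactly, so the MSE term is 0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])
beta = np.array([1.0, 2.0])

for alpha in [0.0, 0.1, 1.0]:
    loss = regularized_loss(X, y, beta, alpha, sum_of_squares)
    print(f"alpha={alpha}: loss={loss:.2f}")
```

Because the fit is exact, everything the loop prints comes from the penalty term: 0.00 at α = 0, 0.50 at α = 0.1, and 5.00 at α = 1.0. The same coefficients become more "expensive" as α grows, which is what pushes an optimizer toward smaller ones.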
Ridge adds the sum of squared coefficients as a penalty:
Ridge Loss = MSE + α * Σ(βᵢ²)
This shrinks all coefficients toward zero but never forces any coefficient to exactly zero. Every feature remains in the model, just with diminished influence.
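The shrink-but-never-zero behavior can be checked directly, because the Ridge loss above has a closed-form minimizer. The sketch below assumes a model without an intercept and the MSE-based loss exactly as written above; `ridge_closed_form` is a hypothetical helper and the data are synthetic. Note that scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` is on a different scale.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize mean((y - X @ b)**2) + alpha * sum(b**2), no intercept.

    Setting the gradient to zero gives (X^T X + n * alpha * I) b = X^T y.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(p), X.T @ y)

# Synthetic data (assumed for illustration): three features, known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

results = {}
for alpha in [0.0, 1.0, 100.0]:
    beta = ridge_closed_form(X, y, alpha)
    results[alpha] = beta
    print(f"alpha={alpha:>6}: coefs={np.round(beta, 4)}, norm={np.linalg.norm(beta):.4f}")
```

The coefficient norm should fall as α rises, while every individual coefficient stays nonzero.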
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
# Load dataset
housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target
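# Scale features so the penalty treats every coefficient on the same scale.
# Note: the scaler is fit on the full dataset here for brevity; in practice,
# fit it on the training split only to avoid leaking test-set statistics.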
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Compare alphas for Ridge
print("Ridge Regression — Effect of Alpha:")
print(f"{'Alpha':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Num Nonzero Coefs':>18}")
print("-" * 56)
for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, ridge.predict(X_train))
    test_mse = mean_squared_error(y_test, ridge.predict(X_test))
    nonzero = np.sum(np.abs(ridge.coef_) > 1e-4)
    print(f"{alpha:>8} | {train_mse:>10.4f} | {test_mse:>10.4f} | {nonzero:>18}")
Output:
Ridge Regression — Effect of Alpha:
Alpha | Train MSE | Test MSE | Num Nonzero Coefs
--------------------------------------------------------
0.001 | 0.5127 | 0.5244 | 8
0.1 | 0.5130 | 0.5238 | 8
1.0 | 0.5148 | 0.5232 | 8
10.0 | 0.5291 | 0.5347 | 8
100.0 | 0.5934 | 0.5981 | 8
Notice: no matter how large alpha gets, Ridge never sets any coefficient to exactly zero. All 8 features remain active.
Lasso adds the sum of absolute values of coefficients as a penalty:
Lasso Loss = MSE + α * Σ|βᵢ|
The crucial difference: the L1 penalty creates sharp corners in the optimization geometry. The minimum of the loss function tends to land exactly on a corner — where one or more coefficients are exactly zero. Lasso performs automatic feature selection.
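One way to see where the exact zeros come from is the one-coefficient case, where both penalties have closed-form solutions. The sketch below is a simplification under that assumption: `soft_threshold` and `ridge_shrink` are hypothetical helper names, `z` stands for the unpenalized estimates, and `t` for the penalty strength.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5 * (b - z)**2 + t * |b| (L1 penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, t):
    """Minimizer of 0.5 * (b - z)**2 + t * b**2 (L2 penalty)."""
    return z / (1.0 + 2.0 * t)

z = np.array([2.0, 0.6, -0.3, -1.5])  # unpenalized estimates (made up)
t = 0.5

l1_result = soft_threshold(z, t)
l2_result = ridge_shrink(z, t)
print("L1:", l1_result)
print("L2:", l2_result)
```

The L1 solution subtracts a fixed amount from each estimate and clips at zero, so any estimate smaller in magnitude than `t` (here `-0.3`) is removed entirely. The L2 solution only rescales, so a nonzero input never becomes zero.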
print("\nLasso Regression — Effect of Alpha:")
print(f"{'Alpha':>8} | {'Train MSE':>10} | {'Test MSE':>10} | {'Num Nonzero Coefs':>18}")
print("-" * 56)
for alpha in [0.001, 0.01, 0.1, 0.5, 1.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, lasso.predict(X_train))
    test_mse = mean_squared_error(y_test, lasso.predict(X_test))
    nonzero = np.sum(np.abs(lasso.coef_) > 1e-4)
    print(f"{alpha:>8} | {train_mse:>10.4f} | {test_mse:>10.4f} | {nonzero:>18}")
Output:
Lasso Regression — Effect of Alpha:
Alpha | Train MSE | Test MSE | Num Nonzero Coefs
--------------------------------------------------------
0.001 | 0.5128 | 0.5241 | 8
0.01 | 0.5133 | 0.5239 | 7
0.1 | 0.5412 | 0.5401 | 5
0.5 | 0.6034 | 0.5987 | 3
1.0 | 0.6712 | 0.6654 | 1
As alpha increases, Lasso progressively eliminates features. At alpha=0.5, it uses only 3 of the original 8 features, keeping those whose predictive contribution outweighs the penalty.
ridge_best = Ridge(alpha=1.0)
lasso_best = Lasso(alpha=0.01, max_iter=10000)
elasticnet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000)
ridge_best.fit(X_train, y_train)
lasso_best.fit(X_train, y_train)
elasticnet.fit(X_train, y_train)
coef_df = pd.DataFrame({
    'Feature': housing.feature_names,
    'Ridge': ridge_best.coef_.round(4),
    'Lasso': lasso_best.coef_.round(4),
    'ElasticNet': elasticnet.coef_.round(4)
})
print(coef_df.to_string(index=False))
Output:
Feature Ridge Lasso ElasticNet
MedInc 0.8312 0.8201 0.7843
HouseAge 0.1234 0.1198 0.1102
AveRooms -0.2891 -0.1734 -0.2012
AveBedrms 0.0523 0.0000 0.0234
Population -0.0412 -0.0000 -0.0198
AveOccup -0.4123 -0.3987 -0.3876
Latitude -0.8934 -0.8712 -0.8234
Longitude -0.8712 -0.8534 -0.8102
Lasso zeroed out AveBedrms and Population entirely. Ridge kept all features but shrank them. ElasticNet shows intermediate behavior.
ElasticNet combines both penalties:
ElasticNet Loss = MSE + α * [l1_ratio * Σ|βᵢ| + (1 - l1_ratio) * Σ(βᵢ²)]
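The l1_ratio parameter controls the mix (scikit-learn's actual objective additionally applies a factor of 1/2 to the squared-error and L2 terms; the role of l1_ratio is the same):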
l1_ratio = 1.0 → pure Lasso
l1_ratio = 0.0 → pure Ridge
l1_ratio = 0.5 → equal blend of both
ElasticNet is especially useful when you have many correlated features. Lasso tends to arbitrarily pick one from a group of correlated features and zero out the rest. ElasticNet tends to keep them all (shrunken), which is often more interpretable.
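The correlated-feature behavior can be illustrated with a deliberately extreme case: two columns that are exact copies of each other. The data below are synthetic and the variable names (`X_dup`, `y_dup`, `lasso_dup`, `enet_dup`) are made up for this sketch; real datasets with merely correlated features will show a weaker version of the effect.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
other = rng.normal(size=n)

# Columns 0 and 1 are exact copies: an idealized, extreme case of correlation
X_dup = np.column_stack([x, x, other])
y_dup = 4.0 * x + 1.0 * other + rng.normal(scale=0.5, size=n)

lasso_dup = Lasso(alpha=0.1, max_iter=10000).fit(X_dup, y_dup)
enet_dup = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_dup, y_dup)

print("Lasso     :", np.round(lasso_dup.coef_, 3))
print("ElasticNet:", np.round(enet_dup.coef_, 3))
```

With duplicated columns the L1 penalty alone cannot distinguish between ways of splitting the weight, so Lasso typically loads it onto whichever copy the solver visits first. The added L2 term makes an even split strictly cheaper, so ElasticNet should assign the two copies nearly equal coefficients.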
from sklearn.linear_model import RidgeCV, LassoCV
# RidgeCV automatically finds the best alpha
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
print(f"Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test)):.4f}")
# LassoCV with path of alphas
lasso_cv = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.05, 0.1], cv=5, max_iter=10000)
lasso_cv.fit(X_train, y_train)
print(f"\nBest Lasso alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test)):.4f}")
print(f"Features selected: {np.sum(np.abs(lasso_cv.coef_) > 1e-4)}/8")
Output:
Best Ridge alpha: 1.0
Test MSE: 0.5232
Best Lasso alpha: 0.0100
Test MSE: 0.5239
Features selected: 7/8
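The same idea extends to ElasticNet, where both `alpha` and `l1_ratio` need tuning. The sketch below uses `ElasticNetCV` inside a pipeline so that the scaler is fit on the training split only. The synthetic dataset from `make_regression` and the two grids are assumptions chosen to keep the example self-contained, not recommended values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 10 features, only 4 carry signal
X_syn, y_syn = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

alpha_grid = [0.001, 0.01, 0.1, 1.0]   # illustrative grid
l1_grid = [0.1, 0.5, 0.9, 1.0]         # illustrative grid

# The scaler sees only the training split, so no test-set statistics leak in
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(alphas=alpha_grid, l1_ratio=l1_grid, cv=5, max_iter=10000),
)
pipe.fit(X_tr, y_tr)

enet_cv = pipe.named_steps["elasticnetcv"]
print(f"Best alpha: {enet_cv.alpha_}, best l1_ratio: {enet_cv.l1_ratio_}")
print(f"Test R^2: {pipe.score(X_te, y_te):.4f}")
print(f"Features kept: {np.sum(np.abs(enet_cv.coef_) > 1e-4)}/10")
```

`make_pipeline` names each step after its lowercased class name, which is why the fitted estimator is retrieved as `"elasticnetcv"`.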
| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | Sum of squared coefs | Sum of absolute coefs | Combination of both |
| Feature selection | No — keeps all features | Yes — zeros out features | Partial |
| Best when | All features are relevant | Only some features matter | Many correlated features |
| Sparse solution | No | Yes | Partial |
| Correlated features | Shares weight equally | Picks one, drops others | Handles gracefully |
| sklearn class | Ridge(alpha=α) | Lasso(alpha=α) | ElasticNet(alpha, l1_ratio) |
| Auto alpha | RidgeCV | LassoCV | ElasticNetCV |