AiTechWorlds
Picture two students preparing for an exam. The first student understands the concepts deeply and can solve any variation of a problem. The second student memorizes every single practice question and its exact answer.
On the practice exam, both score 100%. But on the real exam — which contains new, slightly different questions — the first student scores 92%. The second student, who never learned to generalize, scores 40%.
The memorizing student overfit the practice data. They learned the noise in the questions, not the underlying patterns. Machine learning models do exactly the same thing when given too much flexibility.
Polynomial regression is a perfect lens for understanding this fundamental challenge, because you can directly control how flexible the model is by adjusting the polynomial degree.
Linear regression fits a straight line: y = b0 + b1*x
Polynomial regression fits a curve by adding powered terms: y = b0 + b1*x + b2*x² + b3*x³ + ...
The degree controls the curve's flexibility. Degree 1 is a straight line. Degree 2 is a parabola. Degree 15 is a wildly twisting curve that can thread through almost any set of points.
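Even with the powered terms, the model is still linear in its coefficients: the input is expanded into powers of x first, and ordinary linear regression is fit on those columns. A minimal sketch of that expansion follows; the three input values are hypothetical and chosen only to keep the powers easy to read.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three hypothetical inputs, chosen only so the powers are easy to check by eye.
x = np.array([[1.0], [2.0], [3.0]])

# degree=3 expands each x into the columns [x, x^2, x^3].
# include_bias=False leaves out the constant column, because
# LinearRegression fits the intercept b0 on its own.
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

print(x_poly)
```

Each row now holds x, x² and x³ for one input, so fitting a straight-line model on these columns is the same as fitting a cubic in x.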
```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate noisy sine wave data
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 50)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

degrees = [1, 5, 15]
results = []
for degree in degrees:
    # Expand x into powers up to `degree`, then fit ordinary least squares
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results.append((degree, train_mse, test_mse))
    print(f"Degree {degree:2d} | Train MSE: {train_mse:.4f} | Test MSE: {test_mse:.4f}")
```
Output:
```
Degree  1 | Train MSE: 0.3712 | Test MSE: 0.3891
Degree  5 | Train MSE: 0.0731 | Test MSE: 0.1044
Degree 15 | Train MSE: 0.0089 | Test MSE: 4.7823
```
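Look at the pattern: degree 1 has high error on both sets, degree 5 has low error on both, and degree 15 has the lowest training error but by far the highest test error. The table below summarizes the three regimes: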
| | Underfitting (High Bias) | Good Fit | Overfitting (High Variance) |
|---|---|---|---|
| Training error | High | Low | Very Low |
| Test error | High | Low | Very High |
| Model complexity | Too simple | Appropriate | Too complex |
| Problem | Misses real patterns | — | Memorizes noise |
| Fix | Increase complexity | — | Reduce complexity, regularize |
Bias is the error from incorrect assumptions — using a straight line to fit a curve. The model is systematically wrong.
Variance is the error from being too sensitive to small fluctuations in training data — a degree-15 polynomial that would look completely different if you swapped 5 training points.
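The variance half of that definition can be measured directly. The sketch below is an illustration under stated assumptions, not part of the original walkthrough: it keeps the x values fixed, redraws only the noise on each trial, refits the model, and records how much the predictions move. The helper name `prediction_spread`, the 30 trials, and the 35 training points are all hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Fixed training inputs and a fixed grid at which predictions are compared.
X_fixed = np.sort(rng.uniform(0, 10, 35)).reshape(-1, 1)
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)

def prediction_spread(degree, n_trials=30):
    """Average std. dev. of predictions across models fit to re-noised data."""
    preds = []
    for _ in range(n_trials):
        # Same x values and same sine signal, but freshly drawn noise.
        y = np.sin(X_fixed.ravel()) + rng.normal(0, 0.3, len(X_fixed))
        model = Pipeline([
            ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
            ('linear', LinearRegression()),
        ])
        model.fit(X_fixed, y)
        preds.append(model.predict(x_grid))
    # Std. dev. across trials at each grid point, averaged over the grid.
    return np.std(preds, axis=0).mean()

spread = {degree: prediction_spread(degree) for degree in (1, 5, 15)}
for degree, value in spread.items():
    print(f"Degree {degree:2d} | prediction spread: {value:.3f}")
```

A small spread means the fitted curve barely changes when the noise changes (low variance); a large spread means the model is chasing the noise.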
```python
from sklearn.model_selection import KFold, cross_val_score

degrees = range(1, 16)
train_errors = []
test_errors = []

# X is sorted, so shuffle before splitting: unshuffled folds would be
# contiguous x-ranges and would force every model to extrapolate
cv = KFold(n_splits=5, shuffle=True, random_state=42)

for degree in degrees:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))

    # Cross-validation for a better test error estimate
    # (scores are negated MSE, so flip the sign back)
    cv_scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    test_errors.append(-cv_scores.mean())

best_degree = degrees[np.argmin(test_errors)]
print(f"Best degree by cross-validation: {best_degree}")
print("\nDegree | Train MSE | CV MSE")
for d, tr, te in zip(degrees, train_errors, test_errors):
    marker = " <-- best" if d == best_degree else ""
    print(f" {d:2d} | {tr:.4f} | {te:.4f}{marker}")
```
Output (selected degrees shown):
```
Best degree by cross-validation: 5

Degree | Train MSE | CV MSE
  1 | 0.3712 | 0.3847
  2 | 0.2103 | 0.2198
  3 | 0.1445 | 0.1512
  4 | 0.0934 | 0.1103
  5 | 0.0731 | 0.0944 <-- best
  6 | 0.0612 | 0.1087
  7 | 0.0498 | 0.1341
  8 | 0.0374 | 0.1876
 10 | 0.0201 | 0.9234
 15 | 0.0089 | 4.7823
```
As degree increases, training error falls steadily. Cross-validated error falls at first, then rises dramatically. Its minimum identifies the best complexity: degree 5 in this case.
This U-shaped validation error curve is the signature of the bias-variance tradeoff. Most ML algorithms exhibit this behavior in some form.
```python
# Dense grid over the input range, used to inspect each fitted curve
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for degree, label in [(1, 'Degree 1 — Underfitting'),
                      (5, 'Degree 5 — Good Fit'),
                      (15, 'Degree 15 — Overfitting')]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    y_plot = model.predict(X_plot)
    print(f"{label}")
    print(f" Predictions range: [{y_plot.min():.2f}, {y_plot.max():.2f}]")
    print()
```
Output:
```
Degree 1 — Underfitting
 Predictions range: [-0.62, 0.58]

Degree 5 — Good Fit
 Predictions range: [-1.12, 1.08]

Degree 15 — Overfitting
 Predictions range: [-8.34, 12.71]
```
The degree-15 model oscillates wildly between data points, producing predictions that are physically implausible. The degree-1 model stays flat where the true sine wave is curved. The degree-5 model tracks the true shape without dramatic oscillations.
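The comparison table lists "regularize" as one fix for overfitting. As a hedged illustration of what that could look like here, the sketch below keeps the degree at 15 but swaps `LinearRegression` for `Ridge`, which penalizes large coefficients. The `StandardScaler` step, `alpha=1.0`, and the helper name `degree15_model` are assumptions made for this example, not settings from the walkthrough above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same data-generating recipe as the earlier example.
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 50)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def degree15_model(regressor):
    """Degree-15 polynomial pipeline; features are scaled so the penalty
    treats every power of x on a comparable footing (an assumption here)."""
    return Pipeline([
        ('poly', PolynomialFeatures(degree=15, include_bias=False)),
        ('scale', StandardScaler()),
        ('reg', regressor),
    ])

plain = degree15_model(LinearRegression()).fit(X_train, y_train)
ridge = degree15_model(Ridge(alpha=1.0)).fit(X_train, y_train)  # alpha chosen arbitrarily

for name, model in [('plain', plain), ('ridge', ridge)]:
    coef_norm = np.linalg.norm(model.named_steps['reg'].coef_)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:5s} | coefficient norm: {coef_norm:.3g} | Test MSE: {test_mse:.4f}")
```

The penalty shrinks the coefficients, which limits how sharply the curve can swing between data points. In practice `alpha` would be tuned with cross-validation rather than fixed by hand.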
When training a model, always plot the learning curve: training error and validation error as a function of model complexity (or training set size).
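What you see tells you what to do. If training and validation error are both high, the model is underfitting: increase complexity. If training error is low but validation error is much higher, the model is overfitting: reduce complexity or regularize.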
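For the training-set-size version of this plot, scikit-learn provides `learning_curve`, which refits the model on growing subsets and scores each one with cross-validation. The sketch below is a minimal example under assumed settings: 200 freshly generated points, a degree-5 pipeline, and five hand-picked training sizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data; x is left unsorted so plain 5-fold splits are already mixed.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 200)

model = Pipeline([
    ('poly', PolynomialFeatures(degree=5, include_bias=False)),
    ('linear', LinearRegression()),
])

# With cv=5 each training fold holds 160 samples, so sizes go up to 160.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=[32, 64, 96, 128, 160],
    cv=5,
    scoring='neg_mean_squared_error',
)

# Scores are negated MSE with one column per fold: flip the sign, average folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mse, val_mse):
    print(f"n = {n:3d} | Train MSE: {tr:.4f} | Validation MSE: {va:.4f}")
```

A persistent gap between the two columns points to overfitting; two curves that meet at a high error point to underfitting. The companion function `validation_curve` produces the same pair of curves against a hyperparameter such as the polynomial degree.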
For iterative algorithms (like neural networks or gradient boosting), there is a practical technique called early stopping: monitor validation error during training and stop when it starts increasing, even if training error is still decreasing.
```python
from sklearn.ensemble import GradientBoostingRegressor

# With validation set monitoring
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, random_state=42,
                                  validation_fraction=0.1,  # held out from the training data
                                  n_iter_no_change=10,      # early stopping patience
                                  tol=0.001)
model.fit(X_train, y_train)
print(f"Trees used (with early stopping): {model.n_estimators_}")
print(f"Test MSE: {mean_squared_error(y_test, model.predict(X_test)):.4f}")
```
Output:
```
Trees used (with early stopping): 187
Test MSE: 0.0821
```
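Without early stopping, all 500 trees would be used and the model would be at greater risk of overfitting. With it, training halts once the validation score has failed to improve by more than tol for n_iter_no_change consecutive iterations, so the number of trees is chosen automatically.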
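To see what early stopping is reacting to, the validation error can be tracked by hand. The sketch below is an illustration under assumed settings (300 freshly generated points, a 30% validation split): it trains all 500 trees and then uses `staged_predict` to replay the predictions after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy sine data, split into a training set and a validation set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 300).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 300)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the full ensemble, with no early stopping.
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, random_state=42)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 500 trees.
train_mse = [mean_squared_error(y_train, pred) for pred in model.staged_predict(X_train)]
val_mse = [mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)]

best_n_trees = int(np.argmin(val_mse)) + 1  # stages are counted from 1
print(f"Validation MSE is lowest after {best_n_trees} trees: {val_mse[best_n_trees - 1]:.4f}")
print(f"Validation MSE after all 500 trees: {val_mse[-1]:.4f}")
print(f"Training MSE after all 500 trees:   {train_mse[-1]:.4f}")
```

Training error keeps falling as trees are added, while validation error bottoms out earlier. Early stopping aims to halt near that minimum without training the full ensemble first.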