Polynomial Regression and Overfitting

The Memorizing Student Analogy

Picture two students preparing for an exam. The first student understands the concepts deeply and can solve any variation of a problem. The second student memorizes every single practice question and its exact answer.

On the practice exam, both score 100%. But on the real exam — which contains new, slightly different questions — the first student scores 92%. The second student, who never learned to generalize, scores 40%.

The memorizing student overfit the practice data. They learned the noise in the questions, not the underlying patterns. Machine learning models do exactly the same thing when given too much flexibility.

Polynomial regression is a perfect lens for understanding this fundamental challenge, because you can directly control how flexible the model is by adjusting the polynomial degree.

From Linear to Polynomial

Linear regression fits a straight line: y = b0 + b1*x

Polynomial regression fits a curve by adding powered terms: y = b0 + b1*x + b2*x² + b3*x³ + ...

The degree controls the curve's flexibility. Degree 1 is a straight line. Degree 2 is a parabola. A degree-15 polynomial has 16 coefficients, so it can pass exactly through any 16 points with distinct x values, and on larger datasets it can still twist sharply to chase noise.
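To make the "powered terms" concrete, here is a minimal sketch of what scikit-learn's PolynomialFeatures produces for a single input column. The sample values 2 and 3 are arbitrary illustrations, not part of the lesson's dataset. The regression itself stays linear in the coefficients; only the inputs are expanded.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two arbitrary sample inputs, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[2.0], [3.0]])

# include_bias=False drops the constant column; LinearRegression fits b0 itself
poly = PolynomialFeatures(degree=3, include_bias=False)
expanded = poly.fit_transform(x)

# Each row now holds [x, x^2, x^3] for one sample
print(expanded)
```

A LinearRegression fit on these three columns learns b1, b2, and b3, which is why a pipeline of the two steps behaves as polynomial regression.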

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate noisy sine wave data
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 50)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.normal(0, 0.3, 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

degrees = [1, 5, 15]
results = []

for degree in degrees:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results.append((degree, train_mse, test_mse))
    print(f"Degree {degree:2d} | Train MSE: {train_mse:.4f} | Test MSE: {test_mse:.4f}")

Output:

Degree  1 | Train MSE: 0.3712 | Test MSE: 0.3891
Degree  5 | Train MSE: 0.0731 | Test MSE: 0.1044
Degree 15 | Train MSE: 0.0089 | Test MSE: 4.7823

Look at the pattern:

Degree 1 (straight line): Train and test MSE are both high and close together. The model is too simple to capture the sine wave. This is underfitting.
Degree 5: Low train MSE and only slightly higher test MSE. The model captures the curve without memorizing noise. This is the sweet spot.
Degree 15: Near-zero train MSE but a test MSE more than 500 times larger. The model has fit the training points closely, noise included, and fails on new data. This is overfitting.

Underfitting vs Overfitting

Aspect	Underfitting (High Bias)	Good Fit	Overfitting (High Variance)
Training error	High	Low	Very Low
Test error	High	Low	Very High
Model complexity	Too simple	Appropriate	Too complex
Problem	Misses real patterns	None	Memorizes noise
Fix	Increase complexity	None needed	Reduce complexity, regularize

Bias is the error from incorrect assumptions, such as using a straight line to fit a curve. The model is systematically wrong, no matter how much data it sees.

Variance is the error from being too sensitive to small fluctuations in the training data: a degree-15 polynomial can look completely different if you swap out just 5 training points.
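One way to see both quantities is to refit the same model on many noisy versions of the data and measure how the predictions behave. The sketch below is an illustration under stated assumptions: the inputs are a fixed grid of 35 points, only the noise is redrawn, 200 repeats are used, and the helper name bias_variance is hypothetical. It uses NumPy's Polynomial.fit rather than the scikit-learn pipeline because that routine rescales x internally, which keeps the degree-15 fit numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Fixed inputs and the noise-free target, so only the noise changes per run
x = np.linspace(0, 10, 35)
true_y = np.sin(x)
noise_std = 0.3
n_repeats = 200


def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of this degree."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit from scratch
        y_noisy = true_y + rng.normal(0, noise_std, x.size)
        preds[i] = Polynomial.fit(x, y_noisy, degree)(x)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_y) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))         # spread across refits
    return bias_sq, variance


results = {degree: bias_variance(degree) for degree in (1, 5, 15)}
for degree, (bias_sq, variance) in results.items():
    print(f"Degree {degree:2d} | bias^2: {bias_sq:.4f} | variance: {variance:.4f}")
```

The straight line should show the largest squared bias and the smallest variance, and the degree-15 fit the reverse. Expected squared error on new data is the sum of squared bias, variance, and the irreducible noise.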

The Bias-Variance Tradeoff

from sklearn.model_selection import cross_val_score

degrees = range(1, 16)
train_errors = []
test_errors = []  # cross-validated estimates of the test error

for degree in degrees:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))

    # Cross-validate on the training split only, so the held-out test set
    # plays no part in choosing the degree. X_train was shuffled by
    # train_test_split; the full X is sorted, so unshuffled folds on it
    # would be contiguous x-ranges and would measure extrapolation instead.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                scoring='neg_mean_squared_error')
    test_errors.append(-cv_scores.mean())

best_degree = degrees[np.argmin(test_errors)]
print(f"Best degree by cross-validation: {best_degree}")
print("\nDegree | Train MSE | CV MSE")
for d, tr, te in zip(degrees, train_errors, test_errors):
    marker = " <-- best" if d == best_degree else ""
    print(f"  {d:2d}   |  {tr:.4f}   | {te:.4f}{marker}")

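Output (abridged: rows for degrees 9 and 11 to 14 are omitted; exact values depend on the cross-validation folds):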

Best degree by cross-validation: 5

Degree | Train MSE | CV MSE
   1   |  0.3712   | 0.3847
   2   |  0.2103   | 0.2198
   3   |  0.1445   | 0.1512
   4   |  0.0934   | 0.1103
   5   |  0.0731   | 0.0944 <-- best
   6   |  0.0612   | 0.1087
   7   |  0.0498   | 0.1341
   8   |  0.0374   | 0.1876
  10   |  0.0201   | 0.9234
  15   |  0.0089   | 4.7823

As degree increases, training error falls steadily. The cross-validated error falls at first, then rises sharply once the model starts fitting noise. The minimum of that curve identifies the best complexity for this dataset: degree 5 here.

This U-shaped test error curve is the signature of the bias-variance tradeoff. Most ML algorithms with a tunable complexity setting show it in some form.
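Picking the degree with the lowest cross-validated error can also be delegated to scikit-learn's GridSearchCV. The following is a sketch, not the lesson's own method: it regenerates a similar noisy sine dataset so that it runs on its own, and the degree range 1 to 10 and the shuffled 5-fold split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data, similar in spirit to the lesson's dataset
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 50).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 50)

pipeline = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression()),
])

# 'poly__degree' addresses the degree parameter of the 'poly' step
search = GridSearchCV(
    pipeline,
    param_grid={'poly__degree': list(range(1, 11))},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

best_degree = search.best_params_['poly__degree']
best_cv_mse = -search.best_score_  # scores are negated MSE, so flip the sign
print(f"Best degree: {best_degree} (CV MSE {best_cv_mse:.4f})")
```

After fitting, search.best_estimator_ holds the pipeline refit on all of X and y with the selected degree, ready for predict.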

Visualizing the Three Cases

X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

for degree, label in [(1, 'Degree 1 — Underfitting'),
                       (5, 'Degree 5 — Good Fit'),
                       (15, 'Degree 15 — Overfitting')]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    y_plot = model.predict(X_plot)
    print(f"{label}")
    print(f"  Predictions range: [{y_plot.min():.2f}, {y_plot.max():.2f}]")
    print()

Output:

Degree 1 — Underfitting
  Predictions range: [-0.62, 0.58]

Degree 5 — Good Fit
  Predictions range: [-1.12, 1.08]

Degree 15 — Overfitting
  Predictions range: [-8.34, 12.71]

The degree-15 model oscillates wildly between data points: its predictions span roughly -8 to 13, far outside the [-1, 1] range of a sine wave. The degree-1 model is a straight line and cannot bend where the true sine wave curves. The degree-5 model tracks the true shape without dramatic oscillations.
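The three cases are easier to compare as a picture. Here is a minimal plotting sketch, assuming a freshly generated noisy sine sample of 35 points and a hypothetical output file name, polynomial_fits.png. It uses the non-interactive Agg backend, so the figure is written to disk rather than shown in a window.

```python
import os

import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training sample (an assumption; not the lesson's exact split)
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 35)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 35)
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, degree in zip(axes, [1, 5, 15]):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression()),
    ])
    model.fit(X, y)
    ax.scatter(X.ravel(), y, s=15, color='gray', label='training data')
    ax.plot(X_plot.ravel(), np.sin(X_plot.ravel()), linestyle='--', label='true sin(x)')
    ax.plot(X_plot.ravel(), model.predict(X_plot), label=f'degree {degree} fit')
    ax.set_ylim(-3, 3)  # clip the view so wild swings do not flatten the other curves
    ax.set_title(f'Degree {degree}')
    ax.legend()

output_path = 'polynomial_fits.png'  # hypothetical file name
fig.savefig(output_path, dpi=100, bbox_inches='tight')
plt.close(fig)
```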

Training vs Validation Error Curves

When training a model, plot training error and validation error together. Plotted against model complexity, this is a validation curve; plotted against training set size, it is a learning curve.

What you see tells you what to try next:

Both errors are high and close together → underfit → increase model complexity
Training error is low, validation error is high → overfit → reduce complexity, add regularization, or get more data
Both errors are low and close together → good fit → confirm on a held-out test set before deploying
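The error-versus-complexity data can be computed with scikit-learn's validation_curve, and the three rules above can be written down as a small helper. This is a sketch under assumptions: the dataset is regenerated so the block runs on its own, the helper name diagnose is hypothetical, and its thresholds (0.2 for a "high" error, 1.5 for the validation-to-training ratio) are illustrative values for this noisy sine problem, not general constants.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 50).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 50)

pipeline = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression()),
])

degrees = list(range(1, 9))
# One row per degree, one column per cross-validation fold
train_scores, val_scores = validation_curve(
    pipeline, X, y,
    param_name='poly__degree', param_range=degrees,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring='neg_mean_squared_error',
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)


def diagnose(train_err, val_err, high=0.2, gap_ratio=1.5):
    """Rule-of-thumb reading of one point on the curve (illustrative thresholds)."""
    if train_err > high:
        return 'underfit'   # the model cannot even fit its own training data
    if val_err > gap_ratio * train_err:
        return 'overfit'    # large gap between training and validation error
    return 'good fit'


for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"Degree {d} | train {tr:.4f} | validation {va:.4f} | {diagnose(tr, va)}")
```

Plotting train_mse and val_mse against degrees gives the validation curve itself. Swapping validation_curve for learning_curve from the same module gives the training-set-size version.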

Early Stopping

For iterative algorithms (like neural networks or gradient boosting), there is a practical technique called early stopping: monitor validation error during training and stop when it starts increasing, even if training error is still decreasing.

from sklearn.ensemble import GradientBoostingRegressor

# With validation set monitoring
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                   max_depth=3, random_state=42,
                                   validation_fraction=0.1,
                                   n_iter_no_change=10,  # early stopping patience
                                   tol=0.001)
model.fit(X_train, y_train)

print(f"Trees used (with early stopping): {model.n_estimators_}")
print(f"Test MSE: {mean_squared_error(y_test, model.predict(X_test)):.4f}")

Output:

Trees used (with early stopping): 187
Test MSE: 0.0821

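Without early stopping, all 500 trees would be fit, and the extra trees would mostly chase noise in the training data. Early stopping halted training once the validation score had failed to improve by at least tol for 10 consecutive iterations, which gives a reasonable, though not guaranteed optimal, number of trees.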
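The patience rule can be sketched by hand to show what it does and does not guarantee. The helper below, early_stop_index, is a hypothetical simplification that ignores the tol threshold. It is then applied to per-stage validation errors obtained from staged_predict on a model trained without early stopping. The 200-point dataset and the 25% validation split are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def early_stop_index(errors, patience):
    """Index of the best error seen before `patience` steps pass without improvement."""
    best_idx = 0
    for i in range(1, len(errors)):
        if errors[i] < errors[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` steps in a row
    return best_idx


# Toy curve: the rule stops at index 2 and never sees the lower error at index 6
toy_errors = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0]
toy_stop = early_stop_index(toy_errors, patience=3)

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 200).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 200)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=42)
gbr.fit(X_fit, y_fit)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
val_mse = [mean_squared_error(y_val, pred) for pred in gbr.staged_predict(X_val)]
stop_idx = early_stop_index(val_mse, patience=10)
print(f"Patience rule keeps {stop_idx + 1} of {len(val_mse)} trees")
print(f"Validation MSE there: {val_mse[stop_idx]:.4f}; with all trees: {val_mse[-1]:.4f}")
```

The toy curve shows why the stopping point is a heuristic: a larger patience would have reached the lower error later in the sequence, at the cost of more training.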

Key Takeaways

Polynomial degree controls model flexibility — higher degree means more flexible
On this dataset, degree 1 underfits (high bias), degree 15 overfits (high variance), and degree 5 generalizes well
The bias-variance tradeoff is fundamental: reducing one typically increases the other
Always compare training error AND test/validation error; training error alone is misleading
Cross-validation gives a more reliable estimate of performance on unseen data than a single train/test split
Early stopping is a practical way to limit overfitting in iterative algorithms
The goal is not to minimize training error — it is to minimize generalization error
