A doctor reviews a patient's test results and says: "Based on the cell size, texture, and growth rate, there is a 73% probability this tumor is malignant." The doctor is not saying "definitely cancer" or "definitely not cancer." They are expressing a calibrated probability, accounting for uncertainty.

This is the essence of logistic regression. Despite its name, it is a classification algorithm, not a regression one. It predicts the probability that an input belongs to a particular class, then turns that probability into a class label by comparing it to a threshold. The output always lies strictly between 0 and 1, which is what lets you read it as a probability.

Logistic regression is one of the most widely used algorithms in production systems. In medicine, finance, marketing, and spam filtering, practitioners often prefer it over black-box models because the output is a probability, the coefficients are interpretable, and it trains quickly, often in milliseconds on small datasets.

From Linear to Logistic: The Sigmoid Function

Linear regression can output any real number, from -1000 to +1000 and beyond. That is a problem when you need an output between 0 and 1. The fix is to pass the linear output through the sigmoid function:

σ(z) = 1 / (1 + e^(-z))

where z = b0 + b1*x1 + b2*x2 + ... is the same linear combination of features used in linear regression.

The sigmoid function has a characteristic S-shape. It maps any real number to the range (0, 1). Feed it large positive numbers and it approaches 1. Feed it large negative numbers and it approaches 0. The midpoint, at z=0, outputs exactly 0.5.

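import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt

def sigmoid(z):
    """Squash any real number (or array) into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# 300 evenly spaced inputs and their sigmoid values: the points that trace the S-curve
z = np.linspace(-10, 10, 300)
probs = sigmoid(z)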

# Show key points
print("Sigmoid function key values:")
print(f"  z = -10  →  σ(z) = {sigmoid(-10):.6f}")
print(f"  z =  -2  →  σ(z) = {sigmoid(-2):.4f}")
print(f"  z =   0  →  σ(z) = {sigmoid(0):.4f}")
print(f"  z =   2  →  σ(z) = {sigmoid(2):.4f}")
print(f"  z =  10  →  σ(z) = {sigmoid(10):.6f}")

Output:

Sigmoid function key values:
  z = -10  →  σ(z) = 0.000045
  z =  -2  →  σ(z) = 0.1192
  z =   0  →  σ(z) = 0.5000
  z =   2  →  σ(z) = 0.8808
  z =  10  →  σ(z) = 0.999955
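Running the sigmoid in reverse recovers z from a probability: log(p / (1 - p)) = z. This quantity is the log-odds, which is why the linear part of the model is usually described as predicting log-odds. The following is a minimal sketch of that round trip; the helper name `logit` and the sample values are illustrative and not part of the lesson's code.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability -> log-odds (illustrative helper)."""
    return np.log(p / (1 - p))

z_values = np.array([-2.0, 0.0, 2.0])   # example log-odds
p_values = sigmoid(z_values)            # log-odds -> probability
recovered = logit(p_values)             # probability -> log-odds again

for z_val, p_val, r_val in zip(z_values, p_values, recovered):
    print(f"z = {z_val:+.1f}  ->  p = {p_val:.4f}  ->  logit(p) = {r_val:+.4f}")
```

A probability of 0.5 corresponds to log-odds of 0, so the sign of z alone tells you which side of an even chance the prediction falls on.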

Binary Classification: Pass/Fail Prediction

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                              accuracy_score, roc_auc_score)
from sklearn.preprocessing import StandardScaler

# Dataset: predict if student passes exam
np.random.seed(42)
n = 200
hours_studied = np.random.normal(5, 2, n).clip(0, 12)
sleep_hours = np.random.normal(7, 1.5, n).clip(3, 10)
prev_gpa = np.random.normal(3.0, 0.5, n).clip(1.0, 4.0)

# Probability of passing increases with study time, sleep, and GPA
log_odds = -8 + 0.8*hours_studied + 0.4*sleep_hours + 1.5*prev_gpa
passed = (sigmoid(log_odds) > np.random.uniform(0, 1, n)).astype(int)

df = pd.DataFrame({
    'hours_studied': hours_studied,
    'sleep_hours':   sleep_hours,
    'prev_gpa':      prev_gpa,
    'passed':        passed
})

print(f"Dataset shape: {df.shape}")
print(f"Pass rate: {df['passed'].mean():.1%}")
print(df.head())

Output:

Dataset shape: (200, 4)
Pass rate: 62.0%
   hours_studied  sleep_hours  prev_gpa  passed
0           5.88         7.15      3.12       1
1           6.24         8.42      3.45       1
2           3.71         5.89      2.34       0
3           7.12         7.23      3.78       1
4           2.03         6.01      2.87       0

X = df[['hours_studied', 'sleep_hours', 'prev_gpa']]
y = df['passed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Train logistic regression
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train_sc, y_train)

# Predictions
y_pred = lr.predict(X_test_sc)
y_proba = lr.predict_proba(X_test_sc)[:, 1]

print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.4f}")

Output:

Accuracy:  0.8200
ROC-AUC:   0.8831
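The fit step above works by minimizing cross-entropy (log loss); with scikit-learn's default settings an L2 penalty on the coefficients is added as well. As a rough sketch of what the loss measures, the snippet below computes binary cross-entropy by hand on made-up labels and probabilities and compares it with `sklearn.metrics.log_loss`. The function name `binary_cross_entropy` and the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, p_pred):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Made-up labels and predicted probabilities of class 1, for illustration only
y_toy = np.array([1, 0, 1, 1, 0])
p_confident = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # mostly right and fairly sure
p_hesitant = np.array([0.6, 0.4, 0.6, 0.6, 0.4])    # right, but barely

loss_confident = binary_cross_entropy(y_toy, p_confident)
loss_hesitant = binary_cross_entropy(y_toy, p_hesitant)

print(f"Confident predictions: {loss_confident:.4f}")
print(f"Hesitant predictions:  {loss_hesitant:.4f}")
print(f"sklearn log_loss:      {log_loss(y_toy, p_confident):.4f}")
```

Both sets of predictions would score 100% accuracy at a 0.5 threshold, but the loss rewards the model that assigns higher probability to the correct class.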

Probability Output and Decision Boundary

# Show probability outputs for test samples
results = pd.DataFrame({
    'hours_studied': X_test['hours_studied'].values[:8],
    'prev_gpa':      X_test['prev_gpa'].values[:8],
    'prob_pass':     y_proba[:8].round(3),
    'predicted':     y_pred[:8],
    'actual':        y_test.values[:8]
})
print(results.to_string(index=False))

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))

Output:

 hours_studied  prev_gpa  prob_pass  predicted  actual
          7.12      3.45      0.923          1       1
          3.21      2.10      0.134          0       0
          5.89      3.78      0.872          1       1
          4.01      2.67      0.421          0       0
          8.34      3.91      0.971          1       1
          2.34      1.89      0.072          0       0
          6.12      3.34      0.811          1       1
          4.78      3.12      0.612          1       0

--- Classification Report ---
              precision    recall  f1-score   support

        Fail       0.79      0.79      0.79        19
        Pass       0.84      0.84      0.84        31

    accuracy                           0.82        50
   macro avg       0.81      0.81      0.81        50
weighted avg       0.82      0.82      0.82        50

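The decision boundary is at probability = 0.5 by default. A student with probability 0.612 is predicted to pass; one with 0.421 is predicted to fail. You can shift this threshold depending on the use case: lowering it catches more of the students who really pass (higher recall for Pass) at the cost of more false positives (lower precision), and raising it does the opposite.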
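One way to apply a custom threshold is to compare the output of `predict_proba` against it yourself instead of calling `predict`. The sketch below does this on a synthetic dataset from `make_classification` (not the student data above); the thresholds 0.3, 0.5, and 0.7 and all variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem, used only to illustrate thresholding
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_pos = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

threshold_results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba_pos >= threshold).astype(int)   # custom decision rule
    threshold_results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    precision, recall = threshold_results[threshold]
    print(f"threshold = {threshold:.1f}  precision = {precision:.3f}  recall = {recall:.3f}")
```

Recall can only stay the same or rise as the threshold drops, because every sample flagged at a higher threshold is still flagged at a lower one. Precision usually moves the other way, but that is not guaranteed.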

Coefficients and Interpretability

coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_[0].round(4),
    'Odds_Ratio': np.exp(lr.coef_[0]).round(4)
})
print(coef_df.to_string(index=False))

Output:

       Feature  Coefficient  Odds_Ratio
 hours_studied       0.9812      2.6674
   sleep_hours       0.4123      1.5103
      prev_gpa       1.3245      3.7604

The odds ratio for prev_gpa is 3.76. Because the model was trained on standardized features, this means that a one-standard-deviation increase in GPA multiplies the odds of passing by 3.76, with the other features held fixed. This interpretability is why logistic regression is trusted in regulated industries.
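If you would rather report odds ratios per original unit (per hour studied, per GPA point) than per standard deviation, you can divide each coefficient by the scale that `StandardScaler` learned for that feature. The following is a sketch on synthetic two-feature data, not the student dataset above; the data-generating numbers and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on different scales (illustrative values)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[5.0, 3.0], scale=[2.0, 0.5], size=(300, 2))
true_log_odds = -7.0 + 0.8 * X_raw[:, 0] + 1.0 * X_raw[:, 1]
y_raw = (rng.uniform(size=300) < 1 / (1 + np.exp(-true_log_odds))).astype(int)

scaler_demo = StandardScaler().fit(X_raw)
X_std = scaler_demo.transform(X_raw)
model = LogisticRegression(max_iter=1000).fit(X_std, y_raw)

coef_std = model.coef_[0]                        # log-odds change per standard deviation
coef_per_unit = coef_std / scaler_demo.scale_    # log-odds change per original unit
intercept_per_unit = model.intercept_[0] - np.sum(
    coef_std * scaler_demo.mean_ / scaler_demo.scale_
)

print("Odds ratio per std dev:", np.exp(coef_std).round(4))
print("Odds ratio per unit:   ", np.exp(coef_per_unit).round(4))
```

The rescaled coefficients describe exactly the same model: plugging the raw features into `coef_per_unit` and `intercept_per_unit` reproduces the log-odds the fitted model computes from the standardized features.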

Multi-Class Logistic Regression

For problems with more than two classes, logistic regression extends via:

One-vs-Rest (OvR): train K binary classifiers, one per class. At prediction time, take the class with the highest probability.
Softmax (Multinomial): a single model that outputs K probabilities summing to 1.
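The softmax step itself is short enough to write out in NumPy. This is a minimal sketch with made-up class scores; the function and variable names are illustrative, and it also checks that with two classes softmax reduces to the sigmoid used earlier.

```python
import numpy as np

def softmax(scores):
    """Turn a vector (or rows) of K real-valued scores into K probabilities."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

class_scores = np.array([2.0, 1.0, -1.0])   # made-up linear scores for 3 classes
class_probs = softmax(class_scores)
print("Probabilities:", class_probs.round(4))
print("Sum:", class_probs.sum())

# With two classes and scores [z, 0], softmax gives the sigmoid of z
z_val = 1.5
two_class = softmax([z_val, 0.0])
sigmoid_val = 1 / (1 + np.exp(-z_val))
print(f"softmax: {two_class[0]:.4f}  sigmoid: {sigmoid_val:.4f}")
```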

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X_iris = iris.data
y_iris = iris.target

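# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model by
# default when there are 3+ classes. The explicit multi_class argument is
# deprecated in recent releases (1.5+), so it is omitted here.
lr_multi = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)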
lr_multi.fit(X_iris, y_iris)

sample = X_iris.iloc[[0, 50, 100]]
probs = lr_multi.predict_proba(sample)
print("Class probabilities (setosa / versicolor / virginica):")
for i, row in enumerate(probs):
    print(f"  Sample {i}: {row.round(4)}")

print(f"\nOverall accuracy: {lr_multi.score(X_iris, y_iris):.4f}")

Output:

Class probabilities (setosa / versicolor / virginica):
  Sample 0: [0.9794 0.0205 0.0001]
  Sample 1: [0.0026 0.7843 0.2131]
  Sample 2: [0.0000 0.0193 0.9807]

Overall accuracy: 0.9733
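For comparison, the One-vs-Rest strategy can be tried on the same dataset by wrapping a binary logistic regression in scikit-learn's `OneVsRestClassifier`. This is a sketch rather than a tuned model; the variable names are illustrative, and the accuracy is measured on the training data, as in the example above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_ovr, y_ovr = load_iris(return_X_y=True)

# One binary "this class vs. everything else" model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_ovr, y_ovr)

print(f"Binary classifiers trained: {len(ovr.estimators_)}")
print(f"OvR training accuracy: {ovr.score(X_ovr, y_ovr):.4f}")
```

Each of the three fitted estimators is an ordinary binary logistic regression, so its coefficients can be inspected one class at a time.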

When to Use Logistic Regression

Logistic regression is the right choice when:

The output is a binary category or multi-class label
You need probability estimates, not just hard predictions
The relationship between features and log-odds is roughly linear
Interpretability is important (auditable model, regulated domain)
You need a fast, reliable baseline before trying complex models

Avoid it when decision boundaries are highly non-linear, when features have complex interactions that are hard to engineer manually, or when you have image/text data without feature extraction.
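To see the non-linear case concretely, the sketch below fits logistic regression to two concentric rings from `make_circles`, where no straight line separates the classes, and then again after adding squared and interaction features. The dataset parameters and variable names are arbitrary choices for illustration, and accuracy is measured on the training data to keep the example short.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Inner ring = one class, outer ring = the other
X_circ, y_circ = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# Plain logistic regression: the boundary must be a straight line
linear_model = LogisticRegression(max_iter=1000).fit(X_circ, y_circ)

# Same algorithm on engineered features x1, x2, x1^2, x1*x2, x2^2
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
).fit(X_circ, y_circ)

linear_acc = linear_model.score(X_circ, y_circ)
poly_acc = poly_model.score(X_circ, y_circ)
print(f"Raw features:        {linear_acc:.3f}")
print(f"Degree-2 features:   {poly_acc:.3f}")
```

The model is still linear in its inputs; the improvement comes entirely from feature engineering, which is the manual work the paragraph above warns about.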

Key Takeaways

Logistic regression passes a linear combination through the sigmoid function to produce a probability in (0, 1)
The decision threshold is 0.5 by default but can be adjusted based on business requirements
Coefficients represent the change in log-odds per unit change in a feature; exponentiated, they become odds ratios
Cross-entropy loss is minimized during training (not MSE, which is used for regression)
Multi-class problems use either One-vs-Rest or Softmax (multinomial) extension
ROC-AUC is more informative than accuracy when classes are imbalanced; accuracy alone can be misleading
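The last point is easy to check with a deliberately useless model. In this sketch the labels are made up (95 negatives, 5 positives) and the "model" gives every sample the same score, so it always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up imbalanced labels: 95 negatives, 5 positives
y_imb = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input: same score for everyone, always predicts 0
always_negative_pred = np.zeros_like(y_imb)
constant_score = np.full(len(y_imb), 0.05)

acc = accuracy_score(y_imb, always_negative_pred)
auc = roc_auc_score(y_imb, constant_score)

print(f"Accuracy: {acc:.2f}")   # 0.95, looks impressive
print(f"ROC-AUC:  {auc:.2f}")   # 0.50, no better than random ranking
```

Accuracy rewards the model for the class balance alone, while ROC-AUC asks whether positives are ranked above negatives, which a constant score cannot do.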
