AiTechWorlds
A doctor reviews a patient's test results and says: "Based on the cell size, texture, and growth rate, there is a 73% probability this tumor is malignant." The doctor is not saying "definitely cancer" or "definitely not cancer." They are expressing a calibrated probability, accounting for uncertainty.
This is the essence of logistic regression. Despite its name, it is a classification algorithm — not a regression one. It predicts the probability that an input belongs to a particular class, then makes a binary decision based on a threshold. The output is always between 0 and 1, which is what makes it interpretable as a probability.
Logistic regression is one of the most widely used algorithms in production systems. In medicine, finance, marketing, and spam filtering, practitioners often prefer it over black-box models because the output is a probability, the coefficients are interpretable, and training is fast (often milliseconds on small tabular datasets).
Linear regression can output any real number, from -1000 to +1000 and beyond. That is unusable when you need an output between 0 and 1. The fix is to pass the linear output through the sigmoid function:
σ(z) = 1 / (1 + e^(-z))
Where z = b0 + b1*x1 + b2*x2 + ... is the same linear combination used in linear regression.
The sigmoid function has a characteristic S-shape. It maps any real number to the range (0, 1). Feed it large positive numbers and it approaches 1. Feed it large negative numbers and it approaches 0. The midpoint, at z=0, outputs exactly 0.5.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
z = np.linspace(-10, 10, 300)
probs = sigmoid(z)
# Show key points
print("Sigmoid function key values:")
print(f" z = -10 → σ(z) = {sigmoid(-10):.6f}")
print(f" z = -2 → σ(z) = {sigmoid(-2):.4f}")
print(f" z = 0 → σ(z) = {sigmoid(0):.4f}")
print(f" z = 2 → σ(z) = {sigmoid(2):.4f}")
print(f" z = 10 → σ(z) = {sigmoid(10):.6f}")
Output:
Sigmoid function key values:
z = -10 → σ(z) = 0.000045
z = -2 → σ(z) = 0.1192
z = 0 → σ(z) = 0.5000
z = 2 → σ(z) = 0.8808
z = 10 → σ(z) = 0.999955
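The sigmoid also has a simple inverse, the log-odds (logit) function, which is why z is often called the log-odds of the positive class. The sketch below is an illustrative addition: the helper name `logit` is not part of the original code, and the z values are arbitrary. It checks numerically that converting z to a probability and back recovers z.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: convert a probability back to log-odds."""
    return np.log(p / (1 - p))

# Arbitrary illustrative log-odds values
z_values = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
p_values = sigmoid(z_values)
recovered = logit(p_values)

for z, p, r in zip(z_values, p_values, recovered):
    print(f"z = {z:+.1f} -> p = {p:.4f} -> logit(p) = {r:+.4f}")
```

Reading the model this way explains the later use of odds ratios: a coefficient is a change in log-odds, so exponentiating it gives a multiplicative change in odds.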
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_auc_score)
from sklearn.preprocessing import StandardScaler
# Dataset: predict if student passes exam
np.random.seed(42)
n = 200
hours_studied = np.random.normal(5, 2, n).clip(0, 12)
sleep_hours = np.random.normal(7, 1.5, n).clip(3, 10)
prev_gpa = np.random.normal(3.0, 0.5, n).clip(1.0, 4.0)
# Probability of passing increases with study time, sleep, and GPA
log_odds = -8 + 0.8*hours_studied + 0.4*sleep_hours + 1.5*prev_gpa
passed = (sigmoid(log_odds) > np.random.uniform(0, 1, n)).astype(int)
df = pd.DataFrame({
    'hours_studied': hours_studied,
    'sleep_hours': sleep_hours,
    'prev_gpa': prev_gpa,
    'passed': passed
})
print(f"Dataset shape: {df.shape}")
print(f"Pass rate: {df['passed'].mean():.1%}")
print(df.head())
Output:
Dataset shape: (200, 4)
Pass rate: 62.0%
   hours_studied  sleep_hours  prev_gpa  passed
0           5.88         7.15      3.12       1
1           6.24         8.42      3.45       1
2           3.71         5.89      2.34       0
3           7.12         7.23      3.78       1
4           2.03         6.01      2.87       0
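To make the data-generating formula concrete, the following sketch works through the arithmetic for two students. The coefficients are the ones used to build `log_odds` in the dataset code; the two students and the helper name `pass_probability` are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def pass_probability(hours_studied, sleep_hours, prev_gpa):
    """Apply the formula used to generate the synthetic dataset."""
    log_odds = -8 + 0.8 * hours_studied + 0.4 * sleep_hours + 1.5 * prev_gpa
    return log_odds, sigmoid(log_odds)

# Hypothetical student A: 2 hours of study, 6 hours of sleep, GPA 2.5
# z = -8 + 1.6 + 2.4 + 3.75 = -0.25
z_low, p_low = pass_probability(2, 6, 2.5)

# Hypothetical student B: 6 hours of study, 8 hours of sleep, GPA 3.5
# z = -8 + 4.8 + 3.2 + 5.25 = 5.25
z_high, p_high = pass_probability(6, 8, 3.5)

print(f"Student A: z = {z_low:+.2f}, P(pass) = {p_low:.4f}")
print(f"Student B: z = {z_high:+.2f}, P(pass) = {p_high:.4f}")
```

A negative z lands below 0.5 and a positive z above it, which is exactly the sigmoid behaviour described earlier.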
X = df[['hours_studied', 'sleep_hours', 'prev_gpa']]
y = df['passed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# Train logistic regression
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train_sc, y_train)
# Predictions
y_pred = lr.predict(X_test_sc)
y_proba = lr.predict_proba(X_test_sc)[:, 1]
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
Output:
Accuracy: 0.8200
ROC-AUC: 0.8831
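It can help to confirm that `predict_proba` is nothing more than the sigmoid applied to the learned linear combination. The sketch below is an illustrative check, not part of the original walkthrough. It assumes a small synthetic dataset from `make_classification` instead of the student data, so the names `X_demo`, `y_demo`, and `model` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# z = b0 + b1*x1 + b2*x2 + b3*x3 for every row
z = X_demo @ model.coef_[0] + model.intercept_[0]

# Pass z through the sigmoid by hand
manual_proba = 1 / (1 + np.exp(-z))
sklearn_proba = model.predict_proba(X_demo)[:, 1]

print("Max absolute difference:", np.abs(manual_proba - sklearn_proba).max())
```

The difference should be at floating-point noise level, which shows that the fitted model is fully described by its intercept and coefficients.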
# Show probability outputs for test samples
results = pd.DataFrame({
    'hours_studied': X_test['hours_studied'].values[:8],
    'prev_gpa': X_test['prev_gpa'].values[:8],
    'prob_pass': y_proba[:8].round(3),
    'predicted': y_pred[:8],
    'actual': y_test.values[:8]
})
print(results.to_string(index=False))
print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))
Output:
 hours_studied  prev_gpa  prob_pass  predicted  actual
          7.12      3.45      0.923          1       1
          3.21      2.10      0.134          0       0
          5.89      3.78      0.872          1       1
          4.01      2.67      0.421          0       0
          8.34      3.91      0.971          1       1
          2.34      1.89      0.072          0       0
          6.12      3.34      0.811          1       1
          4.78      3.12      0.612          1       0
--- Classification Report ---
              precision    recall  f1-score   support
        Fail       0.79      0.79      0.79        19
        Pass       0.84      0.84      0.84        31
    accuracy                           0.82        50
   macro avg       0.81      0.81      0.81        50
weighted avg       0.82      0.82      0.82        50
The decision boundary is at probability = 0.5 by default. A student with probability 0.612 is predicted to pass; one with 0.421 is predicted to fail. You can shift this threshold depending on the use case: lowering it catches more true positives (higher recall), usually at the cost of more false positives (lower precision).
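A minimal sketch of threshold shifting follows. It assumes a synthetic dataset from `make_classification` rather than the student data, and the thresholds 0.5 and 0.3 are arbitrary illustrative choices. The idea is to apply your own cut-off to the output of `predict_proba` instead of calling `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3):
    # Predict positive whenever the probability reaches the chosen threshold
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  "
          f"recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or increase the number of predicted positives, so recall never goes down. Whether the precision trade-off is acceptable depends on the relative cost of false positives and false negatives in your application.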
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr.coef_[0].round(4),
    'Odds_Ratio': np.exp(lr.coef_[0]).round(4)
})
print(coef_df.to_string(index=False))
Output:
      Feature  Coefficient  Odds_Ratio
hours_studied       0.9812      2.6674
  sleep_hours       0.4123      1.5103
     prev_gpa       1.3245      3.7604
The odds ratio for prev_gpa is 3.76. Because the features were standardized before training, this means a one-standard-deviation increase in GPA multiplies the odds of passing by 3.76, holding the other features fixed. This interpretability is why logistic regression is trusted in regulated industries.
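If you would rather report odds ratios per original unit (for example per GPA point) instead of per standard deviation, the standardized coefficients can be converted back. The sketch below is one way to do it, under the assumption that the model was trained on `StandardScaler` output. The synthetic data and the names `coef_raw` and `intercept_raw` are illustrative, not from the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, rescaled so that features have different units (assumption)
X_raw, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=1
)
X_raw = X_raw * np.array([2.0, 1.5, 0.5]) + np.array([5.0, 7.0, 3.0])

scaler = StandardScaler().fit(X_raw)
X_sc = scaler.transform(X_raw)
model = LogisticRegression(max_iter=1000).fit(X_sc, y_demo)

coef_scaled = model.coef_[0]

# z = b0 + sum(b_j * (x_j - mean_j) / std_j), so the per-unit slope is b_j / std_j
coef_raw = coef_scaled / scaler.scale_
intercept_raw = model.intercept_[0] - np.sum(coef_scaled * scaler.mean_ / scaler.scale_)

odds_ratio_per_sd = np.exp(coef_scaled)   # effect of +1 standard deviation
odds_ratio_per_unit = np.exp(coef_raw)    # effect of +1 original unit

# Sanity check: the raw-unit parameters reproduce the model's log-odds
z_from_raw = X_raw @ coef_raw + intercept_raw

print("Odds ratio per SD:  ", odds_ratio_per_sd.round(4))
print("Odds ratio per unit:", odds_ratio_per_unit.round(4))
```

Both forms describe the same model. Per-SD odds ratios are easier to compare across features, while per-unit odds ratios are easier to explain to a domain expert.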
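For problems with more than two classes, logistic regression extends via one-vs-rest (one binary classifier per class) or via a multinomial (softmax) formulation that models all classes jointly. The following example uses the multinomial approach on the iris dataset: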
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris(as_frame=True)
X_iris = iris.data
y_iris = iris.target
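# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model for
# multiclass targets by default; the explicit multi_class argument is
# deprecated in recent releases, so it is omitted here.
lr_multi = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)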
lr_multi.fit(X_iris, y_iris)
sample = X_iris.iloc[[0, 50, 100]]
probs = lr_multi.predict_proba(sample)
print("Class probabilities (setosa / versicolor / virginica):")
for i, row in enumerate(probs):
    print(f"  Sample {i}: {row.round(4)}")
print(f"\nOverall accuracy: {lr_multi.score(X_iris, y_iris):.4f}")
Output:
Class probabilities (setosa / versicolor / virginica):
Sample 0: [0.9794 0.0205 0.0001]
Sample 1: [0.0026 0.7843 0.2131]
Sample 2: [0.0000 0.0193 0.9807]
Overall accuracy: 0.9733
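In the multinomial case the sigmoid is replaced by its multiclass generalization, the softmax, which turns one linear score per class into probabilities that sum to 1. The sketch below is an illustrative check rather than part of the original example. It assumes a recent scikit-learn release in which the lbfgs solver fits a multinomial model for multiclass targets, and it computes the softmax by hand from `decision_function`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_iris, y_iris)

# One linear score per class for the first three samples
scores = model.decision_function(X_iris[:3])

# Softmax, with the row maximum subtracted for numerical stability
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
softmax_probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

sklearn_probs = model.predict_proba(X_iris[:3])
print("Max absolute difference:", np.abs(softmax_probs - sklearn_probs).max())
print("Row sums:", softmax_probs.sum(axis=1))
```

With only two classes the softmax reduces to the sigmoid, so binary logistic regression is a special case of the same idea.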
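Logistic regression is the right choice when you need probability outputs, interpretable coefficients, and fast training, and when the classes can be separated reasonably well by a linear boundary in the features you have.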
Avoid it when decision boundaries are highly non-linear, when features have complex interactions that are hard to engineer manually, or when you have image/text data without feature extraction.
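The point about non-linear boundaries and interactions can be seen on a small XOR-style problem. The sketch below uses synthetic data invented for illustration: the label depends only on whether two features have the same sign. A plain logistic regression has no linear boundary to find, while the same model with a hand-engineered product feature separates the classes well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# XOR-style data: class 1 when both features share the same sign
X_xor = rng.uniform(-1, 1, size=(2000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

# Plain logistic regression on the raw features
plain = LogisticRegression(max_iter=1000).fit(X_xor, y_xor)
acc_plain = plain.score(X_xor, y_xor)

# Same model, plus a manually engineered interaction feature x1 * x2
X_inter = np.column_stack([X_xor, X_xor[:, 0] * X_xor[:, 1]])
with_interaction = LogisticRegression(max_iter=1000).fit(X_inter, y_xor)
acc_inter = with_interaction.score(X_inter, y_xor)

print(f"Raw features:          accuracy = {acc_plain:.3f}")
print(f"With x1*x2 interaction: accuracy = {acc_inter:.3f}")
```

Here the right interaction is obvious because the data was constructed around it. On real data with many features, finding such terms by hand is often impractical, which is when tree-based or neural models become the more sensible choice.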