Logistic Regression & Classification
Logistic regression is the most important classification algorithm you'll learn, not because it's the most powerful, but because it lays the conceptual foundation for neural networks, probability outputs, and model interpretation.
Despite the name, logistic regression is a classification algorithm, not a regression algorithm.
What Logistic Regression Does
Linear regression predicts a continuous number. Logistic regression predicts a probability between 0 and 1, then classifies based on a threshold (default: 0.5).
The core idea: take the linear output z = wx + b and pass it through the sigmoid function, which squashes any real number into the open interval (0, 1):
sigmoid(z) = 1 / (1 + e^(-z))
Output:
- Close to 0 → class 0 (negative)
- Close to 1 → class 1 (positive)
- 0.5 → decision boundary
import numpy as np
import matplotlib.pyplot as plt
# The sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
plt.plot(x, sigmoid(x))
plt.axhline(y=0.5, color='r', linestyle='--', label='Decision boundary')
plt.xlabel('Linear combination of features')
plt.ylabel('Probability')
plt.title('Sigmoid Function')
plt.legend()
plt.show()
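To make the pipeline concrete, here is a minimal sketch of the three steps for a handful of inputs: compute the linear combination, apply the sigmoid, then threshold at 0.5. The weights `w`, the bias `b`, and the input rows are hypothetical values chosen only for illustration; a trained model would learn its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
w = np.array([0.8, -0.5])
b = -0.2

# Three example inputs with two features each (also hypothetical)
X_demo = np.array([
    [3.0, 1.0],
    [0.5, 1.0],
    [-2.0, 1.5],
])

z = X_demo @ w + b               # linear combination, one value per row
p = sigmoid(z)                   # probability of class 1
labels = (p >= 0.5).astype(int)  # apply the default 0.5 threshold

for zi, pi, li in zip(z, p, labels):
    print(f"z={zi:+.2f}  p={pi:.3f}  class={li}")
```

A positive z gives a probability above 0.5 and a negative z gives one below it, which is why z = 0 is the decision boundary.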
Implementation with Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import numpy as np
# Example: Email spam classification
# Features: [word_count, exclamation_marks, contains_unsubscribe, uppercase_ratio]
np.random.seed(42)
n_samples = 1000
X = np.column_stack([
    np.random.randint(50, 500, n_samples),  # word count
    np.random.randint(0, 20, n_samples),    # exclamation marks
    np.random.randint(0, 2, n_samples),     # unsubscribe link
    np.random.uniform(0, 0.5, n_samples),   # uppercase ratio
])
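# Create labels: spam if there are many exclamation marks, OR an unsubscribe
# link together with a high uppercase ratio. & binds tighter than |, so the
# inner parentheses make the existing grouping explicit.
y = ((X[:, 1] > 10) | ((X[:, 2] == 1) & (X[:, 3] > 0.3))).astype(int)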
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1] # Probability of spam
print(classification_report(y_test, y_pred, target_names=['Not Spam', 'Spam']))
Understanding the Classification Report
A report has this shape (the numbers here are illustrative; your output will differ):
              precision    recall  f1-score   support

    Not Spam       0.92      0.95      0.93       155
        Spam       0.88      0.82      0.85        45

    accuracy                           0.91       200
- Precision: Of all emails we called spam, 88% actually were spam.
- Recall: Of all actual spam emails, we caught 82% of them.
- F1-score: Harmonic mean of precision and recall (useful when classes are imbalanced).
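The three metrics are simple ratios of confusion-matrix counts. The sketch below computes them by hand from hypothetical counts for the Spam class, picked for illustration so that the results land near the Spam row discussed above; they are not taken from a real run.

```python
# Hypothetical counts for the positive class (Spam), for illustration only
tp, fp, fn = 37, 5, 8

precision = tp / (tp + fp)  # of everything flagged as spam, the share that was spam
recall = tp / (tp + fn)     # of all real spam, the share that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.88 recall=0.82 f1=0.85
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well by excelling at only one of them.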
The Confusion Matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Spam', 'Spam'],
            yticklabels=['Not Spam', 'Spam'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.show()
# The four cells:
# True Negative (TN): Correctly classified as not spam
# False Positive (FP): Wrongly called spam (sent real email to junk)
# False Negative (FN): Missed spam (spam got through)
# True Positive (TP): Correctly caught spam
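If you want the four cells as plain numbers rather than a heatmap, one common idiom is to flatten the 2x2 matrix with `ravel()`. This is a small sketch on hypothetical label arrays (`y_true` and `y_hat` are made up for the example); for binary labels 0/1, scikit-learn orders the cells as TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small hypothetical label arrays (0 = not spam, 1 = spam)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true labels, columns are predicted labels
cm_demo = confusion_matrix(y_true, y_hat)

# Flatten row by row: [TN, FP, FN, TP]
tn, fp, fn, tp = cm_demo.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```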
The Precision-Recall Tradeoff
Adjusting the classification threshold shifts the tradeoff between precision and recall:
# Lower threshold → catch more spam (higher recall, lower precision)
# Higher threshold → only flag obvious spam (higher precision, lower recall)
thresholds = [0.3, 0.5, 0.7]
for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    cm = confusion_matrix(y_test, y_pred_t)
    tp = cm[1, 1]
    fp = cm[0, 1]
    fn = cm[1, 0]
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    print(f"Threshold {t}: Precision={precision:.2f}, Recall={recall:.2f}")
Which matters more depends on your problem:
- Medical diagnosis: maximize recall (don't miss diseases)
- Spam filter: balance precision and recall (annoying either way)
- Content moderation: usually maximize precision (avoid false positives)
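Instead of trying a few thresholds by hand, you can sweep all of them with scikit-learn's `precision_recall_curve` and pick one that meets a requirement. The sketch below assumes a hypothetical requirement of at least 0.90 recall and uses synthetic data from `make_classification` as a stand-in for the spam features, so the exact numbers mean nothing beyond the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the spam features above)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# One precision/recall pair per candidate threshold, plus a final
# (precision=1, recall=0) point that has no matching threshold
precisions, recalls, thresholds = precision_recall_curve(y_te, proba)

# Hypothetical requirement: keep recall at or above 0.90
target_recall = 0.90
meets_target = recalls[:-1] >= target_recall

# Thresholds are sorted in increasing order, so the last qualifying index
# is the strictest threshold that still meets the recall target
idx = np.where(meets_target)[0][-1]
print(f"threshold={thresholds[idx]:.3f} "
      f"precision={precisions[idx]:.3f} recall={recalls[idx]:.3f}")
```

Choosing the threshold on the test set, as done here for brevity, leaks information; in practice a separate validation split is the safer place to tune it.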
Feature Coefficients — Interpreting the Model
import pandas as pd

feature_names = ['word_count', 'exclamation_marks', 'unsubscribe', 'uppercase_ratio']
coefs = pd.Series(model.coef_[0], index=feature_names).sort_values()
coefs.plot(kind='barh', color=['red' if c < 0 else 'blue' for c in coefs])
plt.title('Feature Coefficients (positive = pushes toward Spam)')
plt.axvline(x=0, color='black', linestyle='-')
plt.show()
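A positive coefficient pushes the prediction toward class 1 (spam); a negative coefficient pushes it toward class 0 (not spam). Because the features were standardized, each coefficient is the change in log-odds per one standard deviation of that feature, so the magnitudes can be compared across features.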
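Coefficients live on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: the factor by which the odds of class 1 are multiplied when a feature increases by one unit (one standard deviation, if the feature was standardized). The sketch below uses synthetic data and placeholder feature names (`f0`, `f1`, `f2`), all of which are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three informative features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_std = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000).fit(X_std, y_demo)

# exp(coefficient) = multiplicative change in the odds of class 1
# for a one-standard-deviation increase in that feature
odds_ratios = np.exp(clf.coef_[0])

for name, coef, ratio in zip(["f0", "f1", "f2"], clf.coef_[0], odds_ratios):
    print(f"{name}: coef={coef:+.3f}  odds ratio={ratio:.3f}")

# The model's raw score is exactly the linear log-odds
log_odds = X_std @ clf.coef_[0] + clf.intercept_[0]
```

An odds ratio above 1 corresponds to a positive coefficient, and one below 1 to a negative coefficient.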
Multi-Class Classification
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
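# With the lbfgs solver, LogisticRegression fits a multinomial (softmax) model
# for multi-class targets. The multi_class argument is deprecated in recent
# scikit-learn releases, so it is omitted here.
model = LogisticRegression(solver='lbfgs', max_iter=1000)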
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
print(classification_report(y_test, model.predict(X_test), target_names=iris.target_names))
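In the multi-class case, `predict_proba` returns one column per class, and each row sums to 1. The predicted class is the column with the highest probability. The sketch below fits on the full iris data for brevity, which is a simplification for the example and not an evaluation setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# One row per sample, one column per class
proba = clf.predict_proba(iris.data[:5])
print(proba.round(3))
print(proba.sum(axis=1))  # each row sums to 1

# Picking the most probable class reproduces predict()
pred = clf.classes_[proba.argmax(axis=1)]
print(iris.target_names[pred])
```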
When to Use Logistic Regression
Use it when:
- You need probability estimates (not just class predictions)
- Interpretability matters — you can explain why the model made a prediction
- The classes can be separated reasonably well by a linear decision boundary
- You want a fast baseline before trying more complex models
Consider alternatives when:
- The relationship between features and target is highly non-linear
- Accuracy is more important than interpretability
- You have high-dimensional sparse data (consider Naive Bayes or SVM)
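One way to use logistic regression as a baseline is to cross-validate it next to a trivial majority-class predictor, so that any later, more complex model has two reference points to beat. This is a sketch under assumptions: the breast cancer dataset bundled with scikit-learn stands in for your own data, and 5-fold accuracy is an arbitrary choice of metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption; substitute your own X and y)
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Trivial reference: always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent")

# Baseline: scaling and logistic regression in one pipeline, so the
# scaler is refit inside each cross-validation fold
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

dummy_scores = cross_val_score(dummy, X_bc, y_bc, cv=5)
baseline_scores = cross_val_score(baseline, X_bc, y_bc, cv=5)

print(f"Majority class:      {dummy_scores.mean():.3f}")
print(f"Logistic regression: {baseline_scores.mean():.3f}")
```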
Next lesson: Decision Trees — a powerful non-linear classifier that makes interpretable decisions.