AiTechWorlds
Every activation and loss function with formulas, PyTorch code, and a decision framework for choosing the right one.
Without activation functions, a neural network is just a stack of linear transforms, which collapses into a single linear map and can only learn linear relationships. Activation functions introduce non-linearity, which lets a sufficiently large network approximate a wide range of complex functions.
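To make the "stacked linear transforms" point concrete, here is a minimal sketch using scalar layers in plain Python. The helper names (`linear`, `stacked`, `collapsed`, `stacked_relu`) and the weights are illustrative assumptions, not part of the original note.

```python
def linear(w, b):
    """Return the affine map x -> w * x + b."""
    return lambda x: w * x + b


def relu(x):
    return max(0.0, x)


# Two hypothetical scalar "layers"
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)


def stacked(x):
    # Two linear layers with no activation in between
    return layer2(layer1(x))


# The same function written as ONE linear layer:
# w = w2 * w1, b = w2 * b1 + b2
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)


def stacked_relu(x):
    # Same two layers, but with ReLU in between
    return layer2(relu(layer1(x)))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_gap = max(abs(stacked(x) - collapsed(x)) for x in xs)
relu_gap = max(abs(stacked_relu(x) - collapsed(x)) for x in xs)

print("max |stacked - collapsed|      =", linear_gap)
print("max |stacked_relu - collapsed| =", relu_gap)
```

The first gap is zero because the two layers are exactly one linear map. The second is not, because the ReLU puts a kink in the function that no single linear layer can reproduce.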
| Function | Range | Zero-Centered | Vanishing Gradient | Use Case |
|---|---|---|---|---|
| Sigmoid | (0, 1) | No | Yes (severe) | Binary output layer |
| Tanh | (-1, 1) | Yes | Yes (less severe) | RNNs, older networks |
| ReLU | [0, ∞) | No | No (but dying) | Default hidden layers |
| Leaky ReLU | (-∞, ∞) | No | No | ReLU alternative |
| ELU | (-1, ∞) | Approximately | Minimal | Deep networks |
| GELU | ≈(-0.17, ∞) | Approximately | Minimal | Transformers (BERT, GPT) |
| SiLU / Swish | ≈(-0.28, ∞) | Approximately | Minimal | EfficientNet, LLaMA FFN |
| Softmax | (0, 1) each, sums to 1 | No | N/A | Multiclass output layer |
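The "Vanishing Gradient" column can be checked numerically. The sketch below is a plain-Python illustration, not a benchmark: it evaluates the derivatives of sigmoid, tanh, and ReLU at a few points. The function names and sample points are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; peaks at 1.0 when x = 0
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0


for x in (0.0, 2.0, 5.0, 10.0):
    print(
        f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  "
        f"tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}"
    )

# Upper bound on the gradient factor contributed by 10 chained sigmoids
max_chain = 0.25 ** 10
print("best-case factor through 10 sigmoids:", max_chain)
```

Even in the best case, each sigmoid scales the gradient by at most 0.25, so the signal shrinks geometrically with depth. Tanh peaks at a slope of 1 but still saturates for large inputs, while ReLU keeps a slope of 1 for every positive input.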
Sigmoid: σ(x) = 1 / (1 + e^(-x))
Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
ReLU: f(x) = max(0, x)
Leaky ReLU: f(x) = max(αx, x) where α = 0.01
ELU: f(x) = x if x > 0, else α(e^x - 1)
GELU: f(x) = x · Φ(x) where Φ is the CDF of the standard normal distribution
SiLU/Swish: f(x) = x · σ(x)
Softmax: f(xᵢ) = e^xᵢ / Σⱼ e^xⱼ

import torch
import torch.nn as nn
# Apply in layers
layer = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),           # or nn.GELU(), nn.SiLU()
    nn.Linear(64, 10),
    nn.Softmax(dim=-1),  # for multiclass inference; drop it when training with nn.CrossEntropyLoss, which expects raw logits
)
# Functional API
import torch.nn.functional as F
x = torch.randn(4, 128)      # example batch of hidden features
logits = torch.randn(4, 10)  # example batch of class scores
h = F.relu(x)
h = F.gelu(x)
probs = F.softmax(logits, dim=-1)

ReLU neurons that receive consistently negative input output zero permanently, so their weights never update.
ReLU(x) = 0 for all x ≤ 0
→ gradient = 0 → weights stop learning
Fix: Leaky ReLU (α=0.01), ELU, or GELU all pass small gradients for negative inputs.
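The dying-ReLU effect can be observed directly with autograd. This is a minimal sketch, assuming a handful of hypothetical negative pre-activations. It compares the gradient that flows back through `F.relu` with the gradient through `F.leaky_relu`.

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-activations of a "dead" unit: all negative
x_relu = torch.tensor([-3.0, -1.0, -0.5], requires_grad=True)
F.relu(x_relu).sum().backward()

# Same inputs through Leaky ReLU with the default slope of 0.01
x_leaky = torch.tensor([-3.0, -1.0, -0.5], requires_grad=True)
F.leaky_relu(x_leaky, negative_slope=0.01).sum().backward()

relu_grads = x_relu.grad.tolist()
leaky_grads = x_leaky.grad.tolist()

print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```

With ReLU every gradient is zero, so an optimizer step would leave the upstream weights unchanged. Leaky ReLU passes a gradient equal to its negative slope, which gives the unit a chance to recover.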
| Loss Function | Task | Equation (simplified) | Sensitive to Outliers |
|---|---|---|---|
| MSE | Regression | mean((y - ŷ)²) | Yes (squares errors) |
| MAE | Regression | `mean(\|y - ŷ\|)` | No |
| Huber Loss | Regression | MSE near 0, MAE far from 0 | Robust |
| Binary Cross-Entropy | Binary classification | -[y·log(p) + (1-y)·log(1-p)] | No |
| Categorical Cross-Entropy | Multiclass | -Σ yᵢ · log(pᵢ) | No |
| Sparse Categorical CE | Multiclass (int labels) | Same, accepts integer labels | No |
| KL Divergence | Distribution matching | Σ p(x)·log(p(x)/q(x)) | No |
| Contrastive Loss | Similarity learning | Minimize distance for similar pairs | No |
| Triplet Loss | Embeddings | max(d(a,p) - d(a,n) + margin, 0) | No |
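The "Sensitive to Outliers" column for the regression losses can be illustrated with a small plain-Python sketch. The data are made up for the example. The Huber form used here (quadratic with a 0.5 factor inside `delta`, linear outside) is the common textbook definition and is an assumption, since the table only describes it informally.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2              # quadratic (MSE-like) near zero
        else:
            total += delta * (err - 0.5 * delta)  # linear (MAE-like) for large errors
    return total / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]     # every error is 0.5
outlier = [1.5, 2.5, 3.5, 14.0]  # last prediction is off by 10

for name, fn in (("MSE", mse), ("MAE", mae), ("Huber", huber)):
    print(f"{name:5s} clean={fn(y_true, clean):.4f}  with outlier={fn(y_true, outlier):.4f}")
```

On this toy data a single bad prediction raises MSE from 0.25 to about 25.19, while MAE goes from 0.5 to about 2.88 and Huber from 0.125 to about 2.47. The squared term is what lets one outlier dominate the MSE.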
import torch.nn as nn
# Mean Squared Error
criterion_mse = nn.MSELoss()
# Mean Absolute Error
criterion_mae = nn.L1Loss()
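# Huber (SmoothL1 with beta=1.0 matches Huber loss with delta=1.0)
criterion_huber = nn.SmoothL1Loss(beta=1.0)
# predictions: model outputs; targets: float tensor of the same shape
loss = criterion_mse(predictions, targets)
loss.backward()

# Binary classification: BCELoss expects probabilities (sigmoid already applied)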
criterion = nn.BCELoss()
# or combined (more numerically stable):
criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally, so pass raw logits
# Multiclass: pass raw logits, not softmax output
criterion = nn.CrossEntropyLoss()  # applies log_softmax internally
# Usage: CrossEntropyLoss expects raw logits (not softmax) + integer labels
loss = criterion(logits, targets)  # targets: torch.long class indices
# Class-weighted loss for imbalanced data
weights = torch.tensor([1.0, 3.0, 2.0])  # weight minority classes higher
criterion = nn.CrossEntropyLoss(weight=weights)

Regression?
├─ Normal noise → MSE
├─ Outliers present → MAE or Huber
└─ Distribution prediction → NLL / Gaussian NLL
Classification?
├─ Binary output → BCEWithLogitsLoss
├─ Multiclass (mutually exclusive) → CrossEntropyLoss
├─ Multilabel (multiple can be true) → BCEWithLogitsLoss per label
└─ Extreme class imbalance → Focal Loss
Embeddings / Similarity?
├─ Two classes (similar/different) → Contrastive Loss
├─ Ranked similarity → Triplet Loss
└─ Dense retrieval → InfoNCE / NT-Xent

Standard cross-entropy gives equal weight to easy and hard examples. Focal loss down-weights easy examples:
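class FocalLoss(nn.Module):
    """Binary focal loss on raw logits; targets are floats in {0, 1}."""

    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma  # focusing strength; gamma=0 reduces to scaled BCE
        self.alpha = alpha  # applied here as a uniform scale, not a per-class weight

    def forward(self, logits, targets):
        # Per-element BCE equals -log(p_t), where p_t is the probability of the true class
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p_t = torch.exp(-ce)
        # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples
        loss = self.alpha * (1 - p_t) ** self.gamma * ce
        return loss.mean()

| Task | Last Activation | Loss Function |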
|---|---|---|
| Binary classification | Sigmoid (applied inside the loss) | BCEWithLogitsLoss |
| Multiclass (1 of N) | Softmax (applied inside the loss) | CrossEntropyLoss |
| Multilabel (M of N) | Sigmoid each (applied inside the loss) | BCEWithLogitsLoss |
| Regression | None (linear) | MSE / MAE / Huber |
| Token generation | Softmax over vocab (applied inside the loss) | CrossEntropyLoss |
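Because `CrossEntropyLoss` and `BCEWithLogitsLoss` apply the activation themselves, the model should hand them raw logits. The sketch below checks that equivalence on random tensors. The shapes, seed, and targets are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multiclass: 4 samples, 3 classes, integer class labels
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

ce_builtin = nn.CrossEntropyLoss()(logits, targets)
# Same quantity computed explicitly: log_softmax followed by negative log-likelihood
ce_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# Binary / multilabel: 4 samples, 3 independent labels, float targets
bin_logits = torch.randn(4, 3)
bin_targets = torch.tensor(
    [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
)

bce_builtin = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
# Same quantity with the sigmoid applied by hand before BCELoss
bce_manual = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print("cross-entropy:", ce_builtin.item(), ce_manual.item())
print("binary CE:    ", bce_builtin.item(), bce_manual.item())
```

Each pair should agree up to floating-point error. Adding an explicit `nn.Softmax` or `nn.Sigmoid` layer before these losses would apply the activation twice and distort the training signal.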