AiTechWorlds

AI Learning

Activation & Loss Functions Reference

Every activation and loss function with formulas, PyTorch code, and a decision framework for choosing the right one.

#neural-networks #pytorch #activation-functions #loss-functions #deep-learning

Activation & Loss Functions: Complete Reference

Why Activation Functions Matter

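Without activation functions, a neural network is just a stack of linear transforms, and any stack of linear transforms collapses into a single linear transform, so the network can only learn linear relationships. Activation functions introduce non-linearity, which lets a sufficiently large network approximate a broad class of continuous functions (the universal approximation theorem).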
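The collapse of stacked linear layers can be checked numerically. This is a minimal sketch assuming PyTorch; the layer sizes and variable names are illustrative, not taken from the note.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between (sizes are arbitrary)
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 3)

# The equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(4, 3)
with torch.no_grad():
    merged.weight.copy_(f2.weight @ f1.weight)
    merged.bias.copy_(f2.weight @ f1.bias + f2.bias)

x = torch.randn(5, 4)
stacked_out = f2(f1(x))
merged_out = merged(x)
same_without_activation = torch.allclose(stacked_out, merged_out, atol=1e-5)

# With a ReLU in between, the merged linear layer no longer reproduces the output
relu_out = f2(torch.relu(f1(x)))
same_with_activation = torch.allclose(relu_out, merged_out, atol=1e-5)

print(same_without_activation, same_with_activation)
```

The first comparison should hold up to floating-point error; the second should fail, because the ReLU zeroes some hidden units and the composition is no longer a single matrix product.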

Activation Functions

Quick Comparison Table

Function	Range	Zero-Centered	Vanishing Gradient	Use Case
Sigmoid	(0, 1)	No	Yes (severe)	Binary output layer
Tanh	(-1, 1)	Yes	Yes (less severe)	RNNs, older networks
ReLU	[0, ∞)	No	No (but dying)	Default hidden layers
Leaky ReLU	(-∞, ∞)	No	No	ReLU alternative
ELU	(-α, ∞); (-1, ∞) for α = 1	Approximately	Minimal	Deep networks
GELU	≈[-0.17, ∞)	Approximately	Minimal	Transformers (BERT, GPT)
SiLU / Swish	≈[-0.28, ∞)	Approximately	Minimal	EfficientNet, LLaMA FFN
Softmax	(0, 1) each, sums to 1	No	N/A	Multiclass output layer

Formulas

text

Sigmoid:    σ(x) = 1 / (1 + e^(-x))
Tanh:       tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
ReLU:       f(x) = max(0, x)
Leaky ReLU: f(x) = max(αx, x)  where 0 < α < 1 (PyTorch default: 0.01)
ELU:        f(x) = x if x > 0, else α(e^x - 1)  where α > 0 (PyTorch default: 1.0)
GELU:       f(x) = x · Φ(x)   where Φ is the CDF of the standard normal distribution
SiLU/Swish: f(x) = x · σ(x)
Softmax:    f(xᵢ) = e^xᵢ / Σe^xⱼ
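To make the formulas concrete, the sketch below writes each one out by hand and compares it with the corresponding PyTorch built-in. The input range and the `checks` dictionary are illustrative choices; α is set to the PyTorch defaults (0.01 for Leaky ReLU, 1.0 for ELU).

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=13)

# Each line mirrors one formula from the list above
sigmoid = 1 / (1 + torch.exp(-x))
tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
relu = torch.clamp(x, min=0)
leaky_relu = torch.maximum(0.01 * x, x)
elu = torch.where(x > 0, x, 1.0 * (torch.exp(x) - 1))
gelu = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))   # Φ(x) written via the error function
silu = x * sigmoid
softmax = torch.exp(x) / torch.exp(x).sum()

# Compare each hand-written version with the library implementation
checks = {
    "sigmoid": torch.allclose(sigmoid, torch.sigmoid(x), atol=1e-5),
    "tanh": torch.allclose(tanh, torch.tanh(x), atol=1e-5),
    "relu": torch.allclose(relu, F.relu(x)),
    "leaky_relu": torch.allclose(leaky_relu, F.leaky_relu(x, negative_slope=0.01), atol=1e-6),
    "elu": torch.allclose(elu, F.elu(x, alpha=1.0), atol=1e-5),
    "gelu": torch.allclose(gelu, F.gelu(x), atol=1e-5),
    "silu": torch.allclose(silu, F.silu(x), atol=1e-5),
    "softmax": torch.allclose(softmax, F.softmax(x, dim=0), atol=1e-5),
}
print(checks)
```

In practice, prefer the built-ins: `F.softmax` in particular subtracts the maximum before exponentiating, which the naive formula here does not.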

PyTorch Usage

python

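import torch
import torch.nn as nn
import torch.nn.functional as F

# Apply in layers
layer = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),              # or nn.GELU(), nn.SiLU()
    nn.Linear(64, 10),      # outputs raw logits; no Softmax here when training with nn.CrossEntropyLoss
)

# Functional API
x = torch.randn(32, 128)            # dummy batch: 32 samples, 128 features
logits = layer(x)                   # shape (32, 10)
h = F.relu(x)
h = F.gelu(x)
probs = F.softmax(logits, dim=-1)   # probabilities, for inference or reporting only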

The Dying ReLU Problem

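A ReLU neuron whose pre-activation is negative for every training input outputs zero on all of them. Its gradient is then zero as well, so its incoming weights stop updating and the neuron may never recover.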

text

ReLU(x) = 0 for all x ≤ 0
→ gradient = 0 → weights stop learning

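Fix: Leaky ReLU (α=0.01) keeps a constant small slope for negative inputs; ELU, GELU, and SiLU also keep a non-zero gradient there, although it decays toward zero as inputs become more negative.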
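The difference in gradients can be seen directly with autograd. This is a small sketch; the input value -2.0 is an arbitrary stand-in for the pre-activation of a "dead" unit.

```python
import torch
import torch.nn.functional as F

# A negative pre-activation, as in a "dead" unit
z = torch.tensor([-2.0], requires_grad=True)

F.relu(z).sum().backward()
relu_grad = z.grad.clone()          # gradient through ReLU at a negative input

z.grad.zero_()
F.leaky_relu(z, negative_slope=0.01).sum().backward()
leaky_grad = z.grad.clone()         # gradient through Leaky ReLU at the same input

print(relu_grad.item(), leaky_grad.item())
```

ReLU passes no gradient back, while Leaky ReLU passes its negative slope, which is what lets the weights keep moving.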

Loss Functions

Quick Comparison Table

Loss Function	Task	Equation (simplified)	Sensitive to Outliers
MSE	Regression	`mean((y - ŷ)²)`	Yes (squares errors)
MAE	Regression	`mean(|y - ŷ|)`	No
Huber Loss	Regression	MSE near 0, MAE far from 0	Robust
Binary Cross-Entropy	Binary classification	`-[y·log(p) + (1-y)·log(1-p)]`	No
Categorical Cross-Entropy	Multiclass	`-Σ yᵢ · log(pᵢ)`	No
Sparse Categorical CE	Multiclass (int labels)	Same, accepts integer labels	No
KL Divergence	Distribution matching	`Σ p(x)·log(p(x)/q(x))`	No
Contrastive Loss	Similarity learning	Minimize distance for similar pairs	No
Triplet Loss	Embeddings	`max(d(a,p) - d(a,n) + margin, 0)`	No
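The "Sensitive to Outliers" column can be illustrated with a toy example. The numbers below are made up for the demonstration: four predictions that are each off by 0.1, then the same predictions with one large error.

```python
import torch
import torch.nn as nn

targets = torch.tensor([1.0, 2.0, 3.0, 4.0])
clean = torch.tensor([1.1, 1.9, 3.1, 3.9])       # every error is 0.1
outlier = torch.tensor([1.1, 1.9, 3.1, 13.9])    # last error is 9.9

losses = {"MSE": nn.MSELoss(), "MAE": nn.L1Loss(), "Huber": nn.HuberLoss(delta=1.0)}

# Store (loss on clean data, loss with one outlier) for each criterion
results = {}
for name, fn in losses.items():
    results[name] = (fn(clean, targets).item(), fn(outlier, targets).item())
    print(f"{name}: {results[name][0]:.4f} -> {results[name][1]:.4f}")
```

With these values MSE jumps from about 0.01 to about 24.5, while MAE goes from 0.1 to about 2.55 and Huber from 0.005 to about 2.35: the squared term lets a single bad point dominate the MSE.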

Regression Loss Code

python

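import torch
import torch.nn as nn

# Dummy data so the snippet runs on its own
predictions = torch.randn(8, 1, requires_grad=True)
targets = torch.randn(8, 1)

# Mean Squared Error
criterion_mse = nn.MSELoss()

# Mean Absolute Error
criterion_mae = nn.L1Loss()

# Huber (SmoothL1); with beta=1.0 this matches nn.HuberLoss(delta=1.0)
criterion_huber = nn.SmoothL1Loss(beta=1.0)

loss = criterion_mse(predictions, targets)
loss.backward()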

Classification Loss Code

python

import torch
import torch.nn as nn

# Binary classification
criterion = nn.BCELoss()             # expects probabilities (sigmoid already applied)
# or combined (more numerically stable):
criterion = nn.BCEWithLogitsLoss()   # expects raw logits; applies sigmoid internally

# Multiclass: pass raw logits, not softmax output
criterion = nn.CrossEntropyLoss()    # applies log_softmax internally
logits = torch.randn(4, 3)               # dummy batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])     # class indices, dtype torch.long
loss = criterion(logits, targets)

# Class-weighted loss for imbalanced data
weights = torch.tensor([1.0, 3.0, 2.0])  # one weight per class; weight minority classes higher
criterion = nn.CrossEntropyLoss(weight=weights)
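The "applies sigmoid internally" and "applies log_softmax internally" comments can be verified by computing each loss both ways. A short sketch with random dummy data (shapes and seed are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multiclass: CrossEntropyLoss(logits) equals NLLLoss(log_softmax(logits))
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=-1), targets)

# Binary: BCEWithLogitsLoss(logits) equals BCELoss(sigmoid(logits))
bin_logits = torch.randn(4)
bin_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(ce.item(), nll.item())
print(bce_logits.item(), bce_probs.item())
```

The pairs agree up to floating-point error on well-behaved inputs; the combined versions are preferred because they stay stable when logits are very large or very small.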

Choosing Loss for Your Task

text

Regression?
  ├─ Normal noise → MSE
  ├─ Outliers present → MAE or Huber
  └─ Distribution prediction → NLL / Gaussian NLL

Classification?
  ├─ Binary output → BCEWithLogitsLoss
  ├─ Multiclass (mutually exclusive) → CrossEntropyLoss
  ├─ Multilabel (multiple can be true) → BCEWithLogitsLoss per label
  └─ Extreme class imbalance → Focal Loss

Embeddings / Similarity?
  ├─ Two classes (similar/different) → Contrastive Loss
  ├─ Ranked similarity → Triplet Loss
  └─ Dense retrieval → InfoNCE / NT-Xent
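For the dense retrieval branch, InfoNCE with in-batch negatives reduces to cross-entropy over a similarity matrix. The sketch below is one common formulation, not a reference implementation; the function name `info_nce`, the temperature of 0.1, and the embedding sizes are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of `keys` is the positive for row i of `queries`."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.T / temperature          # (N, N) scaled cosine similarities
    labels = torch.arange(q.size(0))        # the positive for row i sits at column i
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
q = torch.randn(8, 16)
aligned_loss = info_nce(q, q + 0.01 * torch.randn(8, 16))   # keys close to their queries
random_loss = info_nce(q, torch.randn(8, 16))               # unrelated keys

print(aligned_loss.item(), random_loss.item())
```

When each key sits close to its query the loss is near zero; with unrelated keys it is far higher, typically around log(N) or above, where N is the batch size.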

Focal Loss (For Imbalanced Classification)

Standard cross-entropy gives equal weight to easy and hard examples. Focal loss down-weights easy examples:

python

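import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Binary focal loss on raw logits; targets are floats in {0, 1}."""
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma   # focusing strength; gamma = 0 removes the down-weighting
        self.alpha = alpha   # weight on the positive class; negatives get 1 - alpha

    def forward(self, logits, targets):
        # Per-element BCE equals -log(p_t), so exp(-ce) recovers p_t
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p_t = torch.exp(-ce)
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        loss = alpha_t * (1 - p_t) ** self.gamma * ce
        return loss.mean()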
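To see the down-weighting at work, the sketch below computes the focal modulating factor (1 - p_t)^γ for one easy and one hard example. The logit values are arbitrary, and α is left out on purpose to isolate the effect of γ.

```python
import torch
import torch.nn.functional as F

gamma = 2.0
# Two positive examples: one classified confidently and correctly, one misclassified
logits = torch.tensor([4.0, -1.0])
targets = torch.tensor([1.0, 1.0])

ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
p_t = torch.exp(-ce)                 # probability assigned to the true class
modulator = (1 - p_t) ** gamma       # focal down-weighting factor
focal = modulator * ce               # alpha omitted to isolate the gamma effect

print(p_t.tolist(), modulator.tolist(), focal.tolist())
```

The easy example (p_t near 0.98) has its loss scaled by a factor well below 0.001, while the hard example (p_t near 0.27) keeps more than half of its cross-entropy.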

Output Layer Summary

Task	Model Output (training)	Activation at Inference	Loss Function
Binary classification	Raw logit	Sigmoid	BCEWithLogitsLoss
Multiclass (1 of N)	Raw logits	Softmax	CrossEntropyLoss
Multilabel (M of N)	Raw logits	Sigmoid per label	BCEWithLogitsLoss
Regression	Raw value	None (linear)	MSE / MAE / Huber
Token generation	Raw logits over vocab	Softmax over vocab	CrossEntropyLoss
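Because `BCEWithLogitsLoss` and `CrossEntropyLoss` work on raw logits, the sigmoid or softmax is applied only when predictions are read out. A minimal sketch of that step, using random logits in place of a real model and an assumed 0.5 threshold:

```python
import torch

torch.manual_seed(0)

# Binary / multilabel: sigmoid, then threshold (0.5 is an assumed cutoff)
multilabel_logits = torch.randn(2, 5)             # 2 samples, 5 independent labels
multilabel_probs = torch.sigmoid(multilabel_logits)
multilabel_pred = (multilabel_probs > 0.5).int()

# Multiclass: softmax for probabilities, argmax for the predicted class
class_logits = torch.randn(2, 10)                 # 2 samples, 10 classes
class_probs = torch.softmax(class_logits, dim=-1)
class_pred = class_probs.argmax(dim=-1)

print(multilabel_pred)
print(class_pred)
```

Softmax preserves ordering, so taking `argmax` of the logits directly gives the same class; the softmax is only needed when the probabilities themselves are reported.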

Common Mistakes

Applying softmax inside the model AND using CrossEntropyLoss: the loss applies log_softmax internally, so softmax runs twice, which flattens the predicted distribution and shrinks gradients
Using MSE for classification: cross-entropy gives larger gradients when a confident prediction is wrong, so it usually trains better on probability outputs
Forgetting class weights on imbalanced datasets: the model can drift toward always predicting the majority class
Using ReLU in the output layer for regression: it clips negative predictions to 0, so negative targets can never be reached
Applying sigmoid to multiclass logits before CrossEntropyLoss: the loss then runs softmax over values squashed into (0, 1), which caps how confident the model can be and slows learning
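The first mistake is easy to reproduce. In the sketch below, a confident and correct prediction is scored once from raw logits and once after an extra softmax; the logit values are illustrative.

```python
import torch
import torch.nn.functional as F

# A confident, correct prediction for class 0
logits = torch.tensor([[10.0, 0.0, 0.0]])
target = torch.tensor([0])

correct_loss = F.cross_entropy(logits, target)                     # logits passed directly
double_loss = F.cross_entropy(F.softmax(logits, dim=-1), target)   # softmax applied twice

print(correct_loss.item(), double_loss.item())
```

With raw logits the loss is close to zero. With the extra softmax the inputs to the loss are confined to (0, 1), so for three classes the loss cannot drop below roughly 0.55 no matter how confident the model is.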
