AiTechWorlds

Transformer Architecture Cheat Sheet

Self-attention formula, encoder vs decoder, positional encoding, scaling laws, and efficient attention variants: the key transformer concepts in one place.

What Is a Transformer?

Transformers are the neural network architecture behind nearly every modern LLM, including GPT-4, Claude, Gemini, and LLaMA. Introduced in the 2017 paper Attention Is All You Need, they replaced recurrent networks (RNNs/LSTMs) with a fully attention-based design that parallelizes training across sequence positions and captures long-range dependencies more directly.

Core Components

Component	Role
Token Embeddings	Convert tokens (words/subwords) into dense vectors
Positional Encoding	Inject position information (sine/cosine or learned)
Multi-Head Self-Attention	Each token attends to every other token in parallel
Feed-Forward Network (FFN)	Two linear layers with ReLU/GELU — applied per token
Layer Normalization	Stabilize activations before (Pre-LN) or after (Post-LN) each sub-layer
Residual Connections	`output = x + sublayer(x)`; gives gradients a direct path and mitigates vanishing gradients

Self-Attention Formula

text

Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) · V

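Q (Query): what this token is looking for
K (Key): what each token offers for matching against queries
V (Value): the content each token contributes, weighted by its attention score
√dₖ: scale factor that keeps dot products from growing with the key dimension dₖ; without it, softmax saturates and its gradients vanish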
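To make the effect of the √dₖ scaling concrete, here is a small pure-Python sketch. It is not part of the original notes: the `softmax` helper and the sample scores are illustrative assumptions. With unscaled scores the softmax output is almost one-hot, which is the regime where gradients become tiny.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
# Hypothetical raw dot products of one query against four keys.
raw_scores = [16.0, 8.0, 0.0, -8.0]

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(p, 3) for p in unscaled])  # nearly all weight on the first key
print([round(p, 3) for p in scaled])    # weight spread across several keys
```

After scaling, the first key still gets the largest weight, but the other keys keep a meaningful share, so every score receives a usable gradient.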

Multi-head attention runs h attention heads in parallel, then concatenates:

text

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
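The bookkeeping behind the formula can be sketched as follows. This is a simplified illustration, assuming each head works on a slice of size d_k = d_model / h; the function names are hypothetical and the learned projection matrices are left out. In practice the split is applied to the projected Q, K and V, and the concatenated result is multiplied by the output projection.

```python
def split_heads(vector, n_heads):
    """Split one d_model-sized vector into n_heads chunks of size d_k."""
    d_model = len(vector)
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_k = d_model // n_heads
    return [vector[i * d_k:(i + 1) * d_k] for i in range(n_heads)]

def concat_heads(heads):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [value for head in heads for value in head]

token = [float(i) for i in range(8)]   # toy example with d_model = 8
heads = split_heads(token, n_heads=2)  # two heads, d_k = 4 each
restored = concat_heads(heads)         # same length as the input again
```

With the GPT-3 values listed in this sheet (d_model = 12,288 and 96 heads), the same division gives 128 dimensions per head.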

Encoder vs Decoder

Feature	Encoder	Decoder
Attention type	Bidirectional self-attention	Masked (causal) self-attention; cross-attention is added only in encoder-decoder models
Sees	All tokens at once	Only past tokens (autoregressive)
Used in	BERT, RoBERTa, DistilBERT	GPT, Claude, LLaMA
Task	Classification, NER, embeddings	Text generation, chat

Encoder-Decoder (T5, BART): Encoder reads input, decoder generates output — ideal for translation, summarization.

Attention Mask Types

Mask	Purpose	Used By
Padding mask	Ignore pad tokens in batched inputs	Any transformer that pads sequences
Causal (look-ahead) mask	Block attention to future tokens	GPT-style decoders
Cross-attention mask	Hide padded encoder positions when the decoder attends to encoder output	Encoder-decoder models
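A minimal sketch of how the causal and padding masks can be built and combined, in plain Python. It assumes the convention 1 = visible and 0 = blocked, and a pad token id of 0; both are assumptions for illustration, as are the function names.

```python
def causal_mask(seq_len):
    """mask[i][j] is 1 when query position i may attend to key position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def padding_mask(token_ids, pad_id=0):
    """1 for real tokens, 0 for padding (pad_id=0 is an assumption)."""
    return [0 if t == pad_id else 1 for t in token_ids]

def combine(causal, padding):
    """A key is visible only if it is neither in the future nor padding."""
    return [[c * p for c, p in zip(row, padding)] for row in causal]

token_ids = [5, 9, 3, 0]  # hypothetical ids; the last position is padding
mask = combine(causal_mask(len(token_ids)), padding_mask(token_ids))
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

Each row is one query position. The last column is zero everywhere because the padded key should never receive attention.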

Key Hyperparameters

Parameter	What It Controls	GPT-3 175B Value
`d_model`	Hidden dimension size	12,288
`n_heads`	Number of attention heads	96
`d_ff`	Feed-forward inner size	49,152
`n_layers`	Number of transformer blocks	96
`max_seq_len`	Context window length	2,048
`vocab_size`	Number of unique tokens	50,257
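As a rough sanity check, the values in the table can be plugged into a back-of-the-envelope parameter count. The sketch below is an approximation under stated assumptions: it counts only the attention and feed-forward weight matrices plus token and learned position embeddings, and ignores biases and layer-norm parameters. The function name is hypothetical.

```python
def estimate_parameters(d_model, n_layers, vocab_size, max_seq_len, d_ff=None):
    """Approximate weight count of a decoder-only transformer."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V and W_O
    ffn = 2 * d_model * d_ff           # the two linear layers of the FFN
    blocks = n_layers * (attention + ffn)
    embeddings = vocab_size * d_model + max_seq_len * d_model
    return blocks + embeddings

total = estimate_parameters(12288, 96, 50257, 2048, d_ff=49152)
print(f"{total / 1e9:.1f}B")  # 174.6B
```

The estimate lands close to the 175B usually quoted for GPT-3, and it shows that almost all of the parameters sit in the transformer blocks rather than in the embeddings.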

Positional Encoding

Sinusoidal (original paper):

text

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
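The two formulas translate directly into code. The sketch below computes the encoding for a single position in plain Python; the function name is illustrative and `d_model` is assumed to be even.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-sized encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe0 = sinusoidal_encoding(0, 8)  # position 0: every sine is 0 and every cosine is 1
pe3 = sinusoidal_encoding(3, 8)
```

Low dimensions oscillate quickly with position while high dimensions change slowly, so each position gets a distinct pattern across the vector.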

Rotary Position Embedding (RoPE) — used in LLaMA, Mistral:

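Rotates pairs of query and key dimensions by a position-dependent angle, so the attention score depends on the relative offset between tokens
Generally extrapolates to longer sequences better than absolute sinusoidal encodings, often combined with frequency-scaling tricks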
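The relative-position property of RoPE can be checked with a small sketch. This is a simplified illustration, assuming adjacent dimensions are paired (real implementations differ in how they pair dimensions); the `rope` and `dot` helpers and the sample vectors are hypothetical.

```python
import math

def rope(vector, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(vector)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vector[2 * i], vector[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

near = dot(rope(q, 2), rope(k, 5))        # query at 2, key at 5: offset 3
shifted = dot(rope(q, 12), rope(k, 15))   # same offset, later in the sequence
```

`near` and `shifted` agree up to floating-point error: moving both tokens by the same amount leaves the score unchanged, because only the offset between them enters the dot product.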

ALiBi — used in MPT, BLOOM:

Adds a head-specific linear penalty to attention scores that grows with the distance between query and key, with no position embeddings added to the inputs
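A minimal sketch of the ALiBi bias for one head, assuming a causal decoder and a number of heads that is a power of two (the case with the simple geometric slope schedule). The function names are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (n_heads assumed a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys at or before the query, 0 elsewhere.

    Future positions (j > i) are expected to be removed by the causal mask.
    """
    return [[-slope * (i - j) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)         # 0.5, 0.25, ..., 1/256
bias = alibi_bias(4, slopes[0])  # added to the attention scores before softmax
```

Heads with a steep slope focus on nearby tokens, while heads with a shallow slope can still look far back.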

Transformer Block (One Layer)

text

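h = x + MultiHeadAttention(LayerNorm(x))
y = h + FeedForward(LayerNorm(h))

This is the Pre-LN layout: each sub-layer normalizes its own input, and the residual path itself stays un-normalized. Pre-LN trains more stably than Post-LN in very deep models. The original 2017 paper used Post-LN, which normalizes after the residual addition: LayerNorm(x + Sublayer(x)).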
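The difference between the two orderings can be sketched in plain Python. This is an illustration under simplifying assumptions: `layer_norm` has no learned scale or shift, the sub-layers are passed in as callables, and `zero_sublayer` is a hypothetical stand-in that contributes nothing, used only to expose the residual path.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def norm_tokens(tokens):
    return [layer_norm(t) for t in tokens]

def add(a, b):
    """Element-wise sum of two lists of token vectors (the residual connection)."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]

def pre_ln_block(x, attention, ffn):
    h = add(x, attention(norm_tokens(x)))  # normalize the sub-layer input only
    return add(h, ffn(norm_tokens(h)))

def post_ln_block(x, attention, ffn):
    h = norm_tokens(add(x, attention(x)))  # normalize after the residual sum
    return norm_tokens(add(h, ffn(h)))

def zero_sublayer(tokens):
    """Stand-in for attention or the FFN that contributes nothing."""
    return [[0.0] * len(t) for t in tokens]

x = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
pre_out = pre_ln_block(x, zero_sublayer, zero_sublayer)    # equals x
post_out = post_ln_block(x, zero_sublayer, zero_sublayer)  # x has been rescaled
```

With Pre-LN the input passes through the block untouched when the sub-layers output zero, so the residual stream gives gradients an identity path through every layer. With Post-LN the stream is re-normalized at each step.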

Scaling Laws (Chinchilla)

Model Size	Optimal Training Tokens
1B parameters	~20B tokens
7B parameters	~140B tokens
70B parameters	~1.4T tokens

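Loss falls predictably as model size and data grow. The Chinchilla paper fits a parametric form L(N, D) = E + A/N^α + B/D^β, where N = parameters, D = training tokens, E is the irreducible loss, and A, B, α, β are fitted constants. The table above reflects the resulting rule of thumb of roughly 20 training tokens per parameter.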
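The table rows follow from a simple rule of thumb, sketched below. The 20-tokens-per-parameter ratio is taken from the table; the compute estimate uses the common approximation of about 6 FLOPs per parameter per training token, which is an assumption added here and not a figure from these notes. Function names are illustrative.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget for a compute-optimal model."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

for n in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    d = chinchilla_optimal_tokens(n)
    flops = training_flops(n, d)
    print(f"{n // 10**9}B params -> {d // 10**9}B tokens, ~{flops:.2e} FLOPs")
```

Because both parameters and tokens grow together, the training compute grows with the square of the model size under this rule.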

Efficient Attention Variants

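Variant	Technique	Benefit
FlashAttention	IO-aware tiling; computes exact attention without materializing the full attention matrix in GPU memory	Roughly 2–4× faster, with memory linear in sequence length
Multi-Query Attention (MQA)	One K,V head shared across all Q heads	Smaller KV cache, faster decoding
Grouped-Query Attention (GQA)	Groups of Q heads share one K,V head (LLaMA 2 70B)	Between full multi-head quality and MQA speed
Sliding Window Attention	Each token attends only to a fixed local window (Mistral 7B)	Cost linear in sequence length for a fixed window
Sparse Attention	Compute only a chosen subset of token pairs, such as local plus strided patterns	Sub-quadratic cost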
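The inference benefit of MQA and GQA comes mostly from a smaller KV cache, which a short calculation makes visible. The model configuration below is hypothetical, and 2 bytes per value assumes fp16 or bf16 storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values for every layer and position (fp16/bf16 assumed)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads of size 128, 4,096-token context.
config = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **config)  # one K,V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads share each K,V head
mqa = kv_cache_bytes(n_kv_heads=1, **config)   # all query heads share one K,V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size // 2**20} MiB per sequence")
# MHA: 2048 MiB per sequence
# GQA: 512 MiB per sequence
# MQA: 64 MiB per sequence
```

The cache shrinks in proportion to the number of K,V heads, which is why fewer K,V heads allow larger batches or longer contexts on the same hardware.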

Mini PyTorch Attention

python

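import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq, d_k); v: (..., seq_k, d_v); mask broadcastable to the scores, 0 = blocked."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Fill with the dtype's most negative finite value so fp16/bf16 also work.
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    return torch.matmul(weights, v)      # weighted sum of the value vectors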

Common Mistakes

Forgetting to scale by √dₖ — softmax gradients vanish with large dot products
Confusing encoder attention (bidirectional) with decoder attention (causal/masked)
Using Post-LN in very deep models — Pre-LN trains more stably
Not masking padding tokens — they pollute attention scores
Conflating model size (parameters) with capability: data quality and training recipe also matter a great deal

Last reviewed on June 13, 2026 by the AiTechWorlds Notes Team. Free cheat sheet — no signup required.

