AiTechWorlds
Self-attention formula, encoder vs decoder, positional encoding, HNSW, scaling laws — every key transformer concept.
Transformers are the neural network architecture behind nearly every modern LLM, including GPT-4, Claude, Gemini, and LLaMA. Introduced in the 2017 paper Attention Is All You Need, they replaced the recurrence of RNNs/LSTMs with a fully attention-based design: training parallelizes across all positions in a sequence, and any token can attend directly to any other, which helps with long-range dependencies.
| Component | Role |
|---|---|
| Token Embeddings | Convert tokens (words/subwords) into dense vectors |
| Positional Encoding | Inject position information (sine/cosine or learned) |
| Multi-Head Self-Attention | Each token attends to every other token in parallel |
| Feed-Forward Network (FFN) | Two linear layers with ReLU/GELU — applied per token |
| Layer Normalization | Stabilize activations before/after each sub-layer |
| Residual Connections | output = x + sublayer(x); mitigates vanishing gradients in deep stacks |
Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) · V
Multi-head attention runs h attention heads in parallel, then concatenates:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

| Feature | Encoder | Decoder |
|---|---|---|
| Attention type | Bidirectional self-attention | Masked (causal) self-attention, plus cross-attention in encoder-decoder models |
| Sees | All tokens at once | Only past tokens (autoregressive) |
| Used in | BERT, RoBERTa, DistilBERT | GPT, Claude, LLaMA |
| Task | Classification, NER, embeddings | Text generation, chat |
Encoder-Decoder (T5, BART): Encoder reads input, decoder generates output — ideal for translation, summarization.
| Mask | Purpose | Used By |
|---|---|---|
| Padding mask | Ignore pad tokens | All transformers |
| Causal (look-ahead) mask | Block future tokens during training | GPT-style decoders |
| Cross-attention mask | Hide padded encoder positions when the decoder attends to encoder output | Encoder-decoder models |
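
The padding and causal masks in the table can be built and combined as in the sketch below. It assumes the convention that 1 means "may attend" and 0 means "blocked" (the same convention as a `mask == 0` check before the softmax). The helper names, `pad_id=0`, and the toy batch are illustrative assumptions, not part of any library.

```python
import torch

def padding_mask(token_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Return a (batch, 1, 1, seq_len) mask: 1 for real tokens, 0 for padding."""
    return (token_ids != pad_id).long().unsqueeze(1).unsqueeze(2)

def causal_mask(seq_len: int) -> torch.Tensor:
    """Return a (1, 1, seq_len, seq_len) lower-triangular mask (no look-ahead)."""
    tri = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))
    return tri.view(1, 1, seq_len, seq_len)

# Hypothetical batch: two sequences of length 4, the second padded with pad_id=0.
token_ids = torch.tensor([[5, 7, 9, 2],
                          [4, 6, 0, 0]])

# Broadcasting gives (batch, 1, seq_len, seq_len): a position is visible only
# if it is both a real token and not in the future.
combined = padding_mask(token_ids) & causal_mask(token_ids.size(1))
```

The singleton dimension at index 1 lets the same mask broadcast across all attention heads.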
| Parameter | What It Controls | GPT-3 Value |
|---|---|---|
| d_model | Hidden dimension size | 12,288 |
| n_heads | Number of attention heads | 96 |
| d_ff | Feed-forward inner size | 49,152 |
| n_layers | Number of transformer blocks | 96 |
| max_seq_len | Context window length | 2,048 |
| vocab_size | Number of unique tokens | 50,257 |
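
As a sanity check on the table, the hyperparameters roughly reproduce GPT-3's headline size of about 175B parameters. The sketch below is an approximation under stated assumptions: it counts only the attention projections, the two FFN matrices, and the token and learned position embeddings, and it ignores biases and LayerNorm parameters. The function name is hypothetical.

```python
def approx_param_count(d_model: int, n_layers: int, d_ff: int,
                       vocab_size: int, max_seq_len: int) -> int:
    """Rough parameter count for a decoder-only transformer (weights only)."""
    attn = 4 * d_model * d_model      # W^Q, W^K, W^V, W^O
    ffn = 2 * d_model * d_ff          # the two linear layers of the FFN
    blocks = n_layers * (attn + ffn)
    # Assumption: one token embedding table plus learned position embeddings.
    embeddings = (vocab_size + max_seq_len) * d_model
    return blocks + embeddings

gpt3 = approx_param_count(d_model=12288, n_layers=96, d_ff=49152,
                          vocab_size=50257, max_seq_len=2048)
print(f"{gpt3 / 1e9:.1f}B")  # prints 174.6B
```

With d_ff = 4 × d_model, each block holds about 12 × d_model² weights, which is why the total is dominated by n_layers × d_model².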
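Sinusoidal (original paper):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Rotary Position Embedding (RoPE), used in LLaMA, Mistral: rotates pairs of query/key dimensions by a position-dependent angle, so attention scores depend on relative position.
ALiBi, used in MPT, BLOOM: adds a per-head linear penalty, proportional to query-key distance, directly to the attention scores instead of using position embeddings.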
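
Returning to the sinusoidal formula above, the following is a minimal sketch of how the PE table could be computed in PyTorch. It assumes an even `d_model`; the function name is illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sinusoidal encodings (d_model even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # the values 2i
    # 1 / 10000^(2i/d_model), computed in log space for numerical stability
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)   # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * inv_freq)   # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

The resulting table is typically added to the token embeddings before the first transformer block.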
x → LayerNorm → Multi-Head Attention → + x (residual)
  → LayerNorm → Feed-Forward Network → + x (residual)
Pre-LN (normalize before each sub-layer) is more stable than Post-LN when training very deep models.
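
The Pre-LN data flow above maps fairly directly onto a small PyTorch module. This is a sketch under assumptions rather than any particular model's implementation: the class name and the toy sizes are hypothetical, GELU is one of the two activations the components table lists, and dropout is omitted.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One transformer block with LayerNorm applied before each sub-layer."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x -> LayerNorm -> attention -> + x (residual)
        h = self.ln1(x)
        # Note: for nn.MultiheadAttention, a boolean attn_mask marks positions
        # that are NOT allowed to attend (True = blocked).
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # -> LayerNorm -> feed-forward -> + x (residual)
        return x + self.ffn(self.ln2(x))

block = PreLNBlock(d_model=64, n_heads=4, d_ff=256)
out = block(torch.randn(2, 10, 64))  # (batch, seq_len, d_model) in and out
```

A Post-LN variant would instead apply each LayerNorm after the residual addition, i.e. `x = ln(x + sublayer(x))`.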
| Model Size | Optimal Training Tokens |
|---|---|
| 1B parameters | ~20B tokens |
| 7B parameters | ~140B tokens |
| 70B parameters | ~1.4T tokens |
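The table follows the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Loss falls predictably as a power law in model size and data: L ≈ (N/N₀)^(-α_N) + (D/D₀)^(-α_D), where N = parameters, D = training tokens, and N₀, D₀, α_N, α_D are empirically fitted constants.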
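
The token counts in the table can be reproduced with simple arithmetic. The sketch below assumes the roughly 20-tokens-per-parameter ratio implied by the table, plus the common approximation that training costs about 6 FLOPs per parameter per token; both are rules of thumb, and the function names are illustrative.

```python
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token budget (assumed ~20 tokens per parameter)."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

for n in (1e9, 7e9, 70e9):
    d = compute_optimal_tokens(n)
    print(f"{n / 1e9:.0f}B params -> {d / 1e9:.0f}B tokens, "
          f"~{training_flops(n, d):.2e} FLOPs")
```

Because the token budget grows linearly with N, the estimated compute grows with N², so a 10× larger model needs about 100× the training FLOPs under these assumptions.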
| Variant | Technique | Speed Gain |
|---|---|---|
| FlashAttention | IO-aware tiling (no full attention matrix) | 2–4× |
| Multi-Query Attention (MQA) | One K,V head shared across all Q heads | Faster inference |
| Grouped-Query Attention (GQA) | Groups of Q heads share K,V (LLaMA 2) | Balance speed/quality |
| Sliding Window Attention | Attend only to local window (Mistral) | Linear complexity |
| Sparse Attention | Skip far-apart token pairs | Sub-quadratic |
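
To make the MQA and GQA rows concrete, here is a minimal sketch of grouped-query attention in which several query heads share each key/value head. It is an illustration, not the implementation used by any specific model; the function name and tensor sizes are assumptions. Setting the number of K/V heads to 1 gives MQA, and setting it equal to the number of query heads gives standard multi-head attention.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head)."""
    n_q_heads, n_kv_heads = q.size(1), k.size(1)
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each K/V head serves `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 5, 16)   # 8 query heads
k = torch.randn(1, 2, 5, 16)   # 2 shared K/V heads -> groups of 4
v = torch.randn(1, 2, 5, 16)
out = grouped_query_attention(q, k, v)
```

The inference benefit comes from caching only `n_kv_heads` key/value heads per layer instead of `n_q_heads`, which shrinks the KV cache by the group factor.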
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(QK^T / sqrt(d_k)) V; mask entries equal to 0 are blocked."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # dtype-aware fill value: a hard-coded -1e9 overflows in float16
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```
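
Building on the scaled dot-product function above and the MultiHead formula earlier in the sheet, a multi-head self-attention layer might look like the sketch below. The class name, the fused `w_qkv` projection, and the bias-free linear layers are illustrative choices, not taken from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: split into heads, attend, concat, project."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One fused projection that stacks W^Q, W^K, W^V for all heads.
        self.w_qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^O

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.w_qkv(x).chunk(3, dim=-1)
        # (b, t, d_model) -> (b, n_heads, t, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            # mask == 0 marks blocked positions
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
        heads = F.softmax(scores, dim=-1) @ v
        # Concat(head_1, ..., head_h), then apply W^O
        concat = heads.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.w_o(concat)

mha = MultiHeadSelfAttention(d_model=32, n_heads=4)
x = torch.randn(2, 6, 32)
causal = torch.tril(torch.ones(6, 6))   # 1 = may attend, 0 = future position
y = mha(x, mask=causal)                 # (2, 6, 32)
```

With the causal mask, the output at each position depends only on that position and earlier ones, which is the property decoder-style models rely on.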