Transformer Architecture Explained: The Architecture Behind All Modern AI
Transformer architecture explained clearly — attention mechanisms, encoder-decoder structure, positional encoding, and why transformers replaced RNNs for NLP and beyond.
In June 2017, Google researchers published "Attention Is All You Need", a paper that changed AI more than almost any before it. The transformer architecture it introduced displaced the recurrent neural networks that had dominated NLP, enabled training at a scale that was previously impractical, and became the foundation of nearly every major AI system in 2025.
GPT-4, Claude 3.5, Gemini 1.5, LLaMA 3, DALL-E, Whisper, Stable Diffusion: all are transformers or depend on transformer components (Stable Diffusion, for example, conditions its denoising network on a transformer text encoder through attention layers). Understanding the architecture means understanding the foundation of modern AI.
This guide builds the transformer from first principles, explaining every component with code and intuition before putting them together.
The Problem Transformers Solved
Before transformers, sequence models used RNNs (Recurrent Neural Networks) or LSTMs:
RNN processing "The cat sat on the mat":
Step 1: Process "The" → hidden state h₁
Step 2: Process "cat" → hidden state h₂ (depends on h₁)
Step 3: Process "sat" → hidden state h₃ (depends on h₂)
...sequential, can't parallelize...
Step 6: Process "mat" → final hidden state h₆
Problems:
1. Sequential processing: token N waits for token N-1 → slow training
2. Long-range dependencies: information about "The" fades by "mat"
3. Gradient vanishing: gradients shrink through many sequential steps
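Problem 3 can be made concrete with a toy calculation. The sketch below is not a real RNN: the scalar recurrence, the weights `w` and `u`, and the name `rnn_gradient_norms` are assumptions chosen only so the chain rule is easy to follow. It tracks how strongly the hidden state at step t still depends on the initial state.

```python
import math

def rnn_gradient_norms(inputs, w=0.5, u=1.0):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t).

    Returns |d h_t / d h_0| after each step, accumulated via the chain rule.
    """
    h, grad, norms = 0.0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # d h_t / d h_{t-1} = w * (1 - h_t^2); with |w| < 1 every factor shrinks the gradient
        grad *= w * (1.0 - h * h)
        norms.append(abs(grad))
    return norms

norms = rnn_gradient_norms([0.1] * 20)
print(f"after 1 step: {norms[0]:.3e}, after 20 steps: {norms[-1]:.3e}")
```

Each step multiplies the gradient by a factor below 0.5 in this setup, so after 20 steps almost nothing of the first input's influence is left to learn from.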
Transformers solve all three:
Transformer processing "The cat sat on the mat":
All tokens processed simultaneously (parallel)
Every token directly attends to every other token
No sequential dependency — gradients flow directly
Building Block 1: Self-Attention
Self-attention is the core mechanism. It allows each token to gather information from any other token in the sequence:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads  # Dimension per head
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, d_model = x.shape
        # Project input to Q, K, V
        Q = self.W_q(x)  # (batch, seq, d_model)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape for multi-head attention
        Q = Q.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # Shape: (batch, n_heads, seq_len, d_head)
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, heads, seq, seq)
        # Scale so large dot products don't saturate the softmax (which would shrink gradients)
        scores = scores / math.sqrt(self.d_head)
        # Apply mask (for causal/decoder attention: can't see future tokens)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Softmax → attention weights (probabilities)
        attn_weights = F.softmax(scores, dim=-1)
        # Apply attention weights to values
        context = torch.matmul(attn_weights, V)  # (batch, heads, seq, d_head)
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, seq_len, d_model)
        output = self.W_o(context)
        return output, attn_weights
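Why divide the scores by `math.sqrt(self.d_head)`? A rough numerical check, assuming queries and keys whose components are independent unit-normal values (an idealization, not a property of trained weights): the dot product of two such vectors has variance close to the vector length, and dividing by its square root brings the variance back to about 1. The helper name `dot_variance` is made up for this sketch.

```python
import math
import random

def dot_variance(d, scale, trials=2000, seed=0):
    """Sample variance of (q . k) / scale for random unit-normal vectors of length d."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)) / scale)
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

d_head = 64
raw = dot_variance(d_head, scale=1.0)                   # roughly d_head
scaled = dot_variance(d_head, scale=math.sqrt(d_head))  # roughly 1
print(f"variance unscaled: {raw:.1f}, scaled: {scaled:.2f}")
```

Scores with a spread of tens of units push the softmax toward a one-hot output, where gradients are close to zero; scores with unit variance keep it in a useful range.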
The intuition for Q, K, V:
Imagine a library lookup system:
- Query (Q): "What information am I looking for?" — the question
- Key (K): "What information does each book have?" — the index
- Value (V): "What does each book actually contain?" — the content
Attention score = how well the query matches each key
Attention output = weighted sum of values, weighted by scores
For "bank" in "The river bank":
- Q_bank = "what disambiguates me?"
- K_river = "I'm about water/geography"
- High score → V_river contributes heavily to bank's representation
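The lookup analogy can be traced by hand on three tokens. This is a minimal sketch in plain Python for a single query; the 2-dimensional query, key, and value vectors are hand-picked illustrative numbers, not learned projections, and `attend` is a hypothetical helper name.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted sum of the value vectors
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out

tokens = ["The", "river", "bank"]
q_bank = [1.0, 2.0]                              # query for "bank" (made up)
keys = [[0.1, 0.0], [1.0, 2.0], [0.5, 0.5]]      # keys for The, river, bank (made up)
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # values for The, river, bank (made up)

weights, out = attend(q_bank, keys, values)
for tok, w in zip(tokens, weights):
    print(f"{tok:>5}: {w:.3f}")
print("bank's new representation:", [round(x, 3) for x in out])
```

Because the key for "river" points in the same direction as the query for "bank", it receives most of the attention weight, and the output is dominated by the value vector of "river".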
Building Block 2: Positional Encoding
Transformers process tokens in parallel with no inherent order. Positional encoding adds position information:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Create position encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        # Sinusoidal encoding
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos
        pe = pe.unsqueeze(0)  # (1, max_seq_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encoding to token embeddings
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# Test
d_model = 512
pe = PositionalEncoding(d_model)
sample_embeddings = torch.randn(1, 10, d_model)  # batch=1, seq=10
output = pe(sample_embeddings)
print(f"Output shape: {output.shape}")  # torch.Size([1, 10, 512])
Why sinusoidal? Each position gets a unique pattern (absolute position), and for any fixed offset k the encoding at position pos + k is a linear function of the encoding at pos, which makes relative offsets easy for the model to pick up.
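The relative-position property follows from the angle-addition identities, and it can be checked numerically. The sketch below uses plain Python and the same frequencies as the `div_term` formula above; `sinusoid` and `shift` are hypothetical helper names. `shift` computes the encoding of position pos + k from the encoding of pos alone, without ever seeing pos.

```python
import math

def sinusoid(pos, d_model):
    """Sinusoidal encoding of one position: [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))   # frequency of this sin/cos pair
        pe += [math.sin(pos * w), math.cos(pos * w)]
    return pe

def shift(pe, k, d_model):
    """Rotate each (sin, cos) pair by k * w: a linear map that depends only on k."""
    out = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out += [s * math.cos(k * w) + c * math.sin(k * w),   # sin(w * (pos + k))
                c * math.cos(k * w) - s * math.sin(k * w)]   # cos(w * (pos + k))
    return out

d_model = 8
shifted = shift(sinusoid(5, d_model), 3, d_model)
direct = sinusoid(8, d_model)
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(f"max difference: {max_err:.2e}")
```

The two encodings agree to floating-point precision, which is the sense in which a fixed offset corresponds to a fixed linear transformation.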
Building Block 3: Feed-Forward Network
Each transformer layer has a position-wise feed-forward network applied identically to each position:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # Modern transformers use GELU, not ReLU
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

# In GPT-3: d_model=12288, d_ff=49152 (4x d_model)
# Research suggests the FFN stores much of the model's factual knowledge
Interestingly, research suggests the feed-forward layers store most of the factual knowledge, while attention handles the relational processing.
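One reason this is plausible is simple bookkeeping: the feed-forward layers hold most of each layer's weights. The arithmetic below is a rough sketch that assumes the layer layout used in this article (four attention projections and two FFN linear layers, all with biases, LayerNorm ignored) combined with GPT-3's published d_model, d_ff, and depth. It counts parameters only and says nothing by itself about what those parameters store.

```python
def attention_params(d_model):
    # Four d_model x d_model projections (W_q, W_k, W_v, W_o), each with a bias
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model, d_ff):
    # Two linear layers, d_model -> d_ff and d_ff -> d_model, each with a bias
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model, d_ff, n_layers = 12288, 49152, 96   # GPT-3 scale
attn = attention_params(d_model)
ffn = ffn_params(d_model, d_ff)
per_layer = attn + ffn
ffn_share = ffn / per_layer

print(f"attention per layer: {attn:,}")
print(f"FFN per layer:       {ffn:,}")
print(f"FFN share of layer:  {ffn_share:.1%}")
print(f"all layers:          {n_layers * per_layer / 1e9:.1f}B")
```

With d_ff = 4 x d_model, the FFN accounts for about two thirds of every layer, and the layers together come to roughly 174B parameters, close to GPT-3's 175B total once embeddings are added.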
Building Block 4: Layer Normalization and Residual Connections
class TransformerLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = SelfAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Residual connection 1: x + attention(norm(x))
        # "Pre-norm" used in GPT-2+: normalize before sublayer, not after
        attn_out, _ = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        # Residual connection 2: x + ff(norm(x))
        ff_out = self.ff(self.norm2(x))
        x = x + self.dropout(ff_out)
        return x
Why residual connections matter: Without them, gradients tend to vanish as they pass back through many layers. Residuals provide a "highway" that lets gradients flow directly from the output to early layers, which makes very deep networks trainable (GPT-3 has 96 layers).
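The "highway" effect shows up even in a scalar toy model. In the sketch below, each layer is the made-up function f(x) = gain * tanh(x), which is an assumption for illustration and not a transformer sublayer. Stacking plain layers multiplies the gradient by f'(x) at every level, while stacking residual layers multiplies it by 1 + f'(x).

```python
import math

def depth_gradient(n_layers, residual, x=0.5, gain=0.1):
    """d(output)/d(input) through n toy layers f(x) = gain * tanh(x)."""
    grad = 1.0
    for _ in range(n_layers):
        local = gain * (1.0 - math.tanh(x) ** 2)   # f'(x)
        if residual:
            grad *= 1.0 + local          # y = x + f(x)  ->  dy/dx = 1 + f'(x)
            x = x + gain * math.tanh(x)
        else:
            grad *= local                # y = f(x)      ->  dy/dx = f'(x)
            x = gain * math.tanh(x)
    return grad

plain = depth_gradient(24, residual=False)
skip = depth_gradient(24, residual=True)
print(f"24 plain layers:    {plain:.3e}")
print(f"24 residual layers: {skip:.3f}")
```

The identity term keeps every factor at or above 1, so the gradient reaching the first layer stays usable no matter how small each sublayer's own derivative is.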
Putting It Together: GPT-Style Decoder
class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6,
                 d_ff=2048, max_seq_len=1024, dropout=0.1):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len, dropout)
        self.layers = nn.ModuleList([
            TransformerLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.output = nn.Linear(d_model, vocab_size)  # Predict next token
        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        # Causal mask: token i can only attend to tokens 0..i
        mask = torch.tril(torch.ones(seq_len, seq_len, device=input_ids.device))
        mask = mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq): broadcasts over batch and heads
        # Embed tokens + add positional encoding
        x = self.pos_encoding(self.token_embedding(input_ids))
        # Pass through transformer layers
        for layer in self.layers:
            x = layer(x, mask)
        # Final norm and project to vocabulary
        x = self.norm(x)
        logits = self.output(x)  # (batch, seq_len, vocab_size)
        return logits

    @torch.no_grad()  # No gradients needed at inference; call model.eval() first to disable dropout
    def generate(self, input_ids, max_new_tokens=50, temperature=1.0):
        """Autoregressive generation"""
        for _ in range(max_new_tokens):
            # Forward pass (keep only the last max_seq_len tokens if over the limit)
            logits = self(input_ids[:, -self.max_seq_len:])
            # Get logits for the last token
            next_token_logits = logits[:, -1, :] / temperature
            # Sample from the distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            # Append to sequence
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids
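The `temperature` argument in `generate` divides the logits before the softmax. A small sketch in plain Python shows what that does to the sampling distribution; the three logits are hypothetical values, not output from a real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

for name, probs in (("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the top token (more deterministic output), while temperatures above 1 flatten the distribution (more varied output).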
Encoder vs Decoder: The Key Difference
Encoder (BERT-style):
- Bidirectional attention: token can attend to ALL other tokens
- Used for understanding tasks (reading comprehension, classification)
- Pre-training: fill in masked tokens
- Example: "The [MASK] sat on the mat" → "cat"
Decoder (GPT-style):
- Causal/unidirectional: token can only attend to itself and PRECEDING tokens
- Used for generation tasks (writing, coding, conversation)
- Pre-training: predict next token
- Example: "The cat sat on" → "the"
Encoder-Decoder (T5/BART):
- Encoder: bidirectional processing of input
- Decoder: causal generation of output, also attends to encoder output
- Used for: translation, summarization, seq2seq tasks
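The only mechanical difference between the encoder and decoder attention patterns is which positions each token is allowed to see. A minimal sketch in plain Python, using uniform scores so that the visibility pattern is the only thing affecting the weights; `masked_softmax_rows` is a hypothetical helper name.

```python
import math

def masked_softmax_rows(scores, causal):
    """Row-wise softmax over an n x n score matrix, optionally with a causal mask."""
    n = len(scores)
    rows = []
    for i in range(n):
        # Causal: position i sees positions 0..i. Bidirectional: it sees every position.
        visible = [j for j in range(n) if (j <= i or not causal)]
        m = max(scores[i][j] for j in visible)
        exps = [math.exp(scores[i][j] - m) if j in visible else 0.0 for j in range(n)]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

scores = [[0.0] * 4 for _ in range(4)]       # uniform scores for a 4-token sequence
encoder_weights = masked_softmax_rows(scores, causal=False)
decoder_weights = masked_softmax_rows(scores, causal=True)

print("encoder (bidirectional):")
for row in encoder_weights:
    print([round(w, 2) for w in row])
print("decoder (causal):")
for row in decoder_weights:
    print([round(w, 2) for w in row])
```

In the causal case the weight matrix is lower triangular: the first token attends only to itself, and each later token spreads its attention over itself and everything before it.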
Scale and Modern Transformers
The original transformer had 65M parameters in its base configuration (213M in the "big" one). Models since then:
| Model | Parameters | Training Tokens | Context Window |
|---|---|---|---|
| BERT-base | 110M | 3.3B words (corpus size) | 512 |
| GPT-2 | 1.5B | ~10B | 1,024 |
| GPT-3 | 175B | 300B | 2,048 |
| LLaMA 3.1 | 8B-405B | 15T | 128K |
| GPT-4 (unofficial est.) | ~1.8T MoE | ~10T+ | 128K (GPT-4 Turbo) |
Key architectural differences in modern vs. original transformer:
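- RoPE (Rotary Position Embedding): Encodes relative position by rotating queries and keys; handles long contexts better than sinusoidal encoding
- GQA (Grouped Query Attention): Shares each key/value head across a group of query heads, which reduces KV cache memory
- SwiGLU activation: Replaces ReLU/GELU in the FFN with a gated unit
- RMSNorm: Skips LayerNorm's mean-centering, so it is cheaper to compute
- Flash Attention: An exact, memory-efficient attention implementation for long sequences (a kernel-level optimization rather than a change to the model itself)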
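To get a feel for the KV cache saving from GQA, here is a back-of-the-envelope calculation. The layer count, head size, sequence length, and 16-bit storage are illustrative assumptions for a mid-sized model, not the specification of any particular release, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    # 2x for keys and values, cached per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

cfg = dict(n_layers=32, d_head=128, seq_len=8192)   # illustrative numbers, one sequence
mha = kv_cache_bytes(n_kv_heads=32, **cfg)          # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)           # 4 query heads share each KV head

print(f"multi-head attention: {mha / 2**30:.1f} GiB")
print(f"grouped-query (8 KV): {gqa / 2**30:.1f} GiB")
```

The cache scales linearly with the number of KV heads, so sharing each one across four query heads cuts it to a quarter, and the gap widens further with batch size and context length.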
Conclusion
The transformer's success comes from one insight: attention is all you need. Processing sequences by having every token attend to every other token — with learned Q, K, V projections — captures long-range dependencies, parallelizes across tokens, and scales with compute in ways RNNs cannot.
Every modern frontier model builds on this architecture with incremental improvements. Understanding the original lets you understand all of them.
For how transformers are trained at scale, see our guide to how LLMs work. For applying transformer models in code, see our Hugging Face guide and our NLP beginners guide.
Frequently Asked Questions
What is the transformer architecture?
A neural network processing sequences with self-attention — each token simultaneously attending to all other tokens. Introduced in 2017's "Attention Is All You Need." Foundation of BERT, GPT, T5, and all modern LLMs. Replaced RNNs for almost all sequence tasks.
What is the attention mechanism and how does it work?
Each token computes Q (query), K (key), V (value) vectors. Attention score = dot product of Q and K, scaled, softmax-normalized. Output = weighted sum of V vectors, weighted by attention scores. Multi-head attention runs this in parallel with different projections, each head learning different relationships.
What is positional encoding and why do transformers need it?
Transformers process tokens in parallel with no inherent order sense. Positional encoding adds position information to each token's embedding using sinusoidal functions. Modern models use RoPE or ALiBi for better relative position encoding and long-context generalization.
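As a rough illustration of why RoPE is described as a relative scheme, the sketch below rotates each pair of dimensions in a query or key by an angle proportional to its position. It follows the standard rotary formulation with base 10000; the 4-dimensional vectors are made-up numbers and `rope` is a hypothetical helper name. The attention score between a rotated query and a rotated key then depends only on the distance between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by the angle pos * base^(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]    # made-up query vector
k = [1.0, 0.4, -0.6, 0.9]    # made-up key vector

near_start = dot(rope(q, 3), rope(k, 1))     # positions 3 and 1, distance 2
far_along = dot(rope(q, 10), rope(k, 8))     # positions 10 and 8, distance 2
other_gap = dot(rope(q, 3), rope(k, 2))      # positions 3 and 2, distance 1

print(f"distance 2 near the start: {near_start:.6f}")
print(f"distance 2 further along:  {far_along:.6f}")
print(f"distance 1:                {other_gap:.6f}")
```

The first two scores match because only the offset between positions enters the dot product, while changing the offset changes the score.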
What is the difference between encoder-only, decoder-only, and encoder-decoder transformers?
Encoder-only (BERT): bidirectional attention, understanding tasks, masked language modeling. Decoder-only (GPT): causal attention, generation tasks, next-token prediction; most frontier LLMs use this design. Encoder-decoder (T5): bidirectional encoder + causal decoder, sequence-to-sequence tasks.
Why did transformers replace RNNs?
Three advantages: parallelism (all tokens processed simultaneously, enabling faster training); long-range dependencies (attention directly connects any two tokens); better scaling (quality keeps improving with compute; RNN quality plateaued). By 2019-2020, transformers had displaced RNNs for virtually all NLP tasks.
AiTechWorlds Team