What is the attention mechanism and how does it work?

Attention is the mechanism that allows each token to incorporate information from other tokens in the sequence when computing its representation. For each token, attention computes three vectors — Query (Q), Key (K), and Value (V) — from its embedding. The attention score between two tokens is the dot product of one's Query and the other's Key, scaled and softmax-normalized. These scores weight the Value vectors, which are summed to produce the context-aware representation. Intuitively: the Query is 'what information am I looking for?', the Key is 'what information do I have?', and the Value is 'what do I actually contribute?'. Multi-head attention runs this process in parallel with different projection matrices, allowing each head to attend to different aspects.

What is positional encoding and why do transformers need it?

Transformers process all tokens in parallel, unlike RNNs that process them in order. This parallelism makes them fast to train but means they inherently have no sense of token position or order. Positional encoding adds position information to each token's embedding. The original transformer uses sinusoidal functions of different frequencies: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). This encodes absolute position — the embedding for token at position 5 is different from position 50. Modern LLMs use Rotary Position Embedding (RoPE) or ALiBi, which better handle relative positions and extrapolate to longer sequences than the model was trained on.

What is the difference between encoder-only, decoder-only, and encoder-decoder transformers?

- Encoder-only (BERT-style): sees the full input in both directions (bidirectional attention). Pre-trained with masked token prediction. Excellent for understanding tasks: text classification, NER, question answering as a reader. Cannot generate text autoregressively.
- Decoder-only (GPT-style): sees only preceding tokens (causal/unidirectional attention). Pre-trained with next-token prediction. Excellent for generation: writing, code, conversation. The major LLM families (GPT-4, Claude, Gemini, LLaMA) are generally understood to use this architecture.
- Encoder-decoder (T5, BART): the encoder processes the input bidirectionally and the decoder generates the output causally. Good for sequence-to-sequence tasks (translation, summarization). It adds the cost of a separate encoder stack but suits structured transformation tasks.

Why did transformers replace RNNs and LSTMs for language tasks?

Three key advantages:
- Parallelism: RNNs process tokens sequentially, so token N must wait for token N-1. Transformers process all tokens simultaneously, which gives a large speedup on GPUs during training.
- Long-range dependencies: RNNs struggle to propagate information over long sequences (information 'fades' over 100+ steps). Attention directly connects any two tokens regardless of distance: 'the bank' at position 2 can attend directly to 'river' at position 200.
- Scaling: transformers scale better with compute. RNN quality plateaued, while transformer quality keeps improving with more parameters and data.
The 2017 'Attention Is All You Need' paper showed transformers outperforming RNNs on translation tasks. By 2019-2020, transformers had displaced RNNs in almost all NLP applications.

Transformer Architecture Explained: The Architecture Behind All Modern AI

⚡ Quick Answer

Transformer architecture explained clearly — attention mechanisms, encoder-decoder structure, positional encoding, and why transformers replaced RNNs for NLP and beyond.

AiTechWorlds Team May 27, 2026 8 min read

In June 2017, Google researchers published "Attention Is All You Need", a paper that changed AI more than almost any before it. The transformer architecture it introduced replaced the recurrent neural networks that had dominated NLP for years, enabled training at a previously impossible scale, and became the foundation of nearly every major AI system in use by 2025.

GPT-4, Claude 3.5, Gemini 1.5, LLaMA 3, DALL-E, Whisper, Stable Diffusion: all are transformers or rely on transformer components such as attention. Understanding the architecture means understanding the foundation of modern AI.

This guide builds the transformer from first principles, explaining every component with code and intuition before putting them together.

The Problem Transformers Solved

Before transformers, sequence models used RNNs (Recurrent Neural Networks) or LSTMs:

RNN processing "The cat sat on the mat":

Step 1: Process "The"    → hidden state h₁
Step 2: Process "cat"    → hidden state h₂ (depends on h₁)
Step 3: Process "sat"    → hidden state h₃ (depends on h₂)
...sequential, can't parallelize...
Step 6: Process "mat"    → final hidden state h₆

Problems:
1. Sequential processing: token N waits for token N-1 → slow training
2. Long-range dependencies: information about "The" fades by "mat"
3. Gradient vanishing: gradients shrink through many sequential steps

Transformers solve all three:

Transformer processing "The cat sat on the mat":

All tokens processed simultaneously (parallel)
Every token directly attends to every other token
No sequential dependency — gradients flow directly

Building Block 1: Self-Attention

Self-attention is the core mechanism. It allows each token to gather information from any other token in the sequence:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
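        # The heads split d_model evenly, so it must divide without remainder
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_head = d_model // n_heads  # Dimension per head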
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, seq_len, d_model = x.shape
        
        # Project input to Q, K, V
        Q = self.W_q(x)  # (batch, seq, d_model)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Reshape for multi-head attention
        Q = Q.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # Shape: (batch, n_heads, seq_len, d_head)
        
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, heads, seq, seq)
        scores = scores / math.sqrt(self.d_head)        # Scale so softmax does not saturate (saturation gives tiny gradients)
        
        # Apply mask (for causal/decoder attention: can't see future tokens)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Softmax → attention weights (probabilities)
        attn_weights = F.softmax(scores, dim=-1)
        
        # Apply attention weights to values
        context = torch.matmul(attn_weights, V)  # (batch, heads, seq, d_head)
        
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, seq_len, d_model)
        output = self.W_o(context)
        
        return output, attn_weights

The intuition for Q, K, V:

Imagine a library lookup system:
- Query (Q): "What information am I looking for?" — the question
- Key (K): "What information does each book have?" — the index
- Value (V): "What does each book actually contain?" — the content

Attention score = how well the query matches each key
Attention output = weighted sum of values, weighted by scores

For "bank" in "The river bank":
- Q_bank = "what disambiguates me?"
- K_river = "I'm about water/geography"
- High score → V_river contributes heavily to bank's representation
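
The library analogy can be traced numerically. The sketch below is a minimal, dependency-free version of scaled dot-product attention for a single query. The 2-dimensional key, value, and query vectors are invented for illustration and do not come from a trained model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Score = dot(query, key) / sqrt(d_k) for every key
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Output = weighted sum of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Toy vectors (assumed values, chosen so "river" matches the query best)
tokens = ["The", "river", "bank"]
keys = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
q_bank = [2.0, 0.0]

weights, output = attention(q_bank, keys, values)
for token, w in zip(tokens, weights):
    print(f"{token:>6}: {w:.3f}")
```

With these toy numbers the largest weight falls on "river", so its value vector dominates the output for "bank".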

Building Block 2: Positional Encoding

Transformers process tokens in parallel with no inherent order. Positional encoding adds position information:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create position encoding matrix
        pe = torch.zeros(max_seq_len, d_model)
        position = torch.arange(0, max_seq_len).unsqueeze(1).float()
        
        # Sinusoidal encoding
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions: cos
        
        pe = pe.unsqueeze(0)  # (1, max_seq_len, d_model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # Add positional encoding to token embeddings
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# Test
d_model = 512
pe = PositionalEncoding(d_model)
sample_embeddings = torch.randn(1, 10, d_model)  # batch=1, seq=10
output = pe(sample_embeddings)
print(f"Output shape: {output.shape}")  # (1, 10, 512)

Why sinusoidal? Each position gets a unique pattern (absolute position), and for any fixed offset k the encoding at pos + k is a linear function of the encoding at pos (a rotation of each sin/cos pair), which makes relative positions easy for the model to pick up.
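
The relative-position property can be checked directly. The sketch below, in plain Python, rebuilds the sinusoidal encoding for one position and shows that the encoding at pos + k is obtained by rotating each (sin, cos) pair by an angle that depends only on k. The helper names `positional_encoding` and `shift` are illustrative, and an even `d_model` is assumed.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe

def shift(pe, k, d_model):
    """Rotate each (sin, cos) pair by k * freq to reach the encoding at pos + k."""
    out = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        a = k * freq
        # Angle-addition identities for sin(p + a) and cos(p + a)
        out.extend([s * math.cos(a) + c * math.sin(a),
                    c * math.cos(a) - s * math.sin(a)])
    return out

d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_12 = positional_encoding(12, d_model)
shifted = shift(pe_5, 7, d_model)   # 5 + 7 = 12

max_error = max(abs(a - b) for a, b in zip(shifted, pe_12))
print(f"max difference: {max_error:.2e}")
```

The rotation inside `shift` never looks at the absolute position, only at the offset k, which is what lets a linear layer express "k tokens away".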

Building Block 3: Feed-Forward Network

Each transformer layer has a position-wise feed-forward network applied identically to each position:

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),              # GPT and BERT use GELU; the original transformer used ReLU
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
    
    def forward(self, x):
        return self.net(x)

# In GPT-3: d_model=12288, d_ff=49152 (4x d_model)
# With d_ff = 4x d_model, the FFN holds about two-thirds of each layer's weights

Interpretability research suggests the feed-forward layers act as a key-value memory that stores much of the model's factual knowledge, while attention moves information between positions.
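
To see how much of a layer the feed-forward network accounts for, the weights can be counted by hand. This is a rough sketch using the GPT-3 dimensions quoted above and the 96-layer depth mentioned in this article. It counts only the large weight matrices and ignores biases, layer norms, and embeddings.

```python
def layer_params(d_model, d_ff):
    """Weight-matrix parameters in one transformer layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff            # two linear layers: d_model -> d_ff -> d_model
    return attention, ffn

attn, ffn = layer_params(d_model=12288, d_ff=49152)
total = 96 * (attn + ffn)

print(f"attention per layer: {attn:,}")
print(f"ffn per layer:       {ffn:,}")
print(f"ffn share of layer:  {ffn / (attn + ffn):.0%}")
print(f"96 layers:           {total:,}")
```

The total comes to roughly 174B, which lines up with the 175B usually quoted for GPT-3 once embeddings and smaller terms are added.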

Building Block 4: Layer Normalization and Residual Connections

class TransformerLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = SelfAttention(d_model, n_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Residual connection 1: x + attention(norm(x))
        # "Pre-norm" used in GPT-2+: normalize before sublayer, not after
        attn_out, _ = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        
        # Residual connection 2: x + ff(norm(x))
        ff_out = self.ff(self.norm2(x))
        x = x + self.dropout(ff_out)
        
        return x

Why residual connections matter: Without them, gradients tend to vanish as they pass back through many layers. Residuals provide a "highway" that lets gradients flow directly from the output to early layers, which makes very deep networks trainable (96 layers in GPT-3).
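
A toy calculation makes the gradient "highway" concrete. The sketch below assumes a deliberately simplified scalar layer f(x) = g * x with a small local gradient g. Real layers are nonlinear and high-dimensional, so this only illustrates the trend.

```python
def gradient_through_stack(n_layers, local_grad, residual):
    """Gradient of the output with respect to the input for a stack of scalar layers.

    Each layer computes f(x) = local_grad * x. With a residual connection the
    layer output is x + f(x), so its derivative is 1 + local_grad.
    """
    per_layer = 1.0 + local_grad if residual else local_grad
    return per_layer ** n_layers

plain = gradient_through_stack(96, 0.01, residual=False)
skip = gradient_through_stack(96, 0.01, residual=True)

print(f"without residuals: {plain:.3e}")
print(f"with residuals:    {skip:.3f}")
```

Without the skip path the gradient is a product of 96 small numbers and collapses toward zero. With it, every factor stays close to 1, so the signal reaching the first layer remains usable.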

Putting It Together: GPT-Style Decoder

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6,
                 d_ff=2048, max_seq_len=1024, dropout=0.1):
        super().__init__()
        
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len, dropout)
        
        self.layers = nn.ModuleList([
            TransformerLayer(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
        self.output = nn.Linear(d_model, vocab_size)  # Predict next token
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        
        # Causal mask: token i can only attend to tokens 0..i
        mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)
        mask = mask.to(input_ids.device)
        
        # Embed tokens + add positional encoding
        x = self.pos_encoding(self.token_embedding(input_ids))
        
        # Pass through transformer layers
        for layer in self.layers:
            x = layer(x, mask)
        
        # Final norm and project to vocabulary
        x = self.norm(x)
        logits = self.output(x)  # (batch, seq_len, vocab_size)
        
        return logits
    
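    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=50, temperature=1.0):
        """Autoregressive generation. Call model.eval() first to disable dropout."""
        # Longest context the positional-encoding buffer supports (max_seq_len)
        max_len = self.pos_encoding.pe.size(1)
        for _ in range(max_new_tokens):
            # Forward pass on at most the last max_len tokens
            logits = self.forward(input_ids[:, -max_len:])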
            
            # Get logits for the last token
            next_token_logits = logits[:, -1, :] / temperature
            
            # Sample from the distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Append to sequence
            input_ids = torch.cat([input_ids, next_token], dim=1)
        
        return input_ids
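
The `temperature` argument in `generate` controls how sharp the sampling distribution is. A minimal plain-Python sketch of that one step follows. The three logits are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # assumed scores, for illustration only

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

Temperatures below 1 concentrate probability on the top token, giving more deterministic output. Temperatures above 1 flatten the distribution, giving more varied output.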

Encoder vs Decoder: The Key Difference

Encoder (BERT-style):
- Bidirectional attention: token can attend to ALL other tokens
- Used for understanding tasks (reading comprehension, classification)
- Pre-training: fill in masked tokens
- Example: "The [MASK] sat on the mat" → "cat"

Decoder (GPT-style):
- Causal/unidirectional: token can only attend to PRECEDING tokens
- Used for generation tasks (writing, coding, conversation)
- Pre-training: predict next token
- Example: "The cat sat on" → "the"

Encoder-Decoder (T5/BART):
- Encoder: bidirectional processing of input
- Decoder: causal generation of output, also attends to encoder output
- Used for: translation, summarization, seq2seq tasks
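
The difference between bidirectional and causal attention comes down to a mask. The sketch below, in plain Python, builds a causal mask and applies it to equal attention scores so that only the mask shapes the resulting weights. The function names are illustrative.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: 1 where attention is allowed, 0 where it is blocked."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; blocked positions get weight 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
mask = causal_mask(seq_len)
uniform_scores = [0.0] * seq_len  # equal scores for every position

# Decoder: row i may only look at positions 0..i
decoder_weights = [masked_softmax(uniform_scores, row) for row in mask]
# Encoder: every row may look at every position
encoder_weights = [masked_softmax(uniform_scores, [1] * seq_len) for _ in range(seq_len)]

for row in decoder_weights:
    print([round(w, 2) for w in row])
```

In the decoder, the first token attends only to itself and the last token spreads its attention over the whole prefix. In the encoder, every row is uniform over all four positions.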

Scale and Modern Transformers

The original (base) transformer had about 65M parameters. Modern models:

Model	Parameters	Training Tokens	Context Window
BERT-base	110M	3.3B	512
GPT-2	1.5B	~10B	1,024
GPT-3	175B	300B	2,048
LLaMA 3.1	8B-405B	15T	128K
GPT-4 (unconfirmed estimates)	~1.8T MoE	~10T+	128K

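Key differences between modern transformers and the original:

RoPE (Rotary Position Embedding): Encodes relative position by rotating queries and keys; handles long contexts better than sinusoidal encoding
GQA (Grouped Query Attention): Shares each key/value head across several query heads, which shrinks the KV cache
SwiGLU activation: Gated activation that replaces ReLU in the FFN
RMSNorm: Drops LayerNorm's mean-centering, so it is cheaper to compute
Flash Attention: Exact attention computed in tiles, so the full seq × seq score matrix never has to be stored, which saves memory on long sequences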
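
The KV cache saving from GQA is simple arithmetic. The sketch below uses a hypothetical model configuration (32 layers, head size 128, an 8,192-token context, 16-bit values). These numbers are assumptions for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Memory for cached keys and values of one sequence (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

# Standard multi-head attention: one K/V head per query head (32 of each)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=8192)
# Grouped query attention: 32 query heads share 8 K/V heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.1f} GiB per sequence")
print(f"GQA: {gqa / 2**30:.1f} GiB per sequence")
```

Cutting the number of key/value heads by 4 cuts the cache by the same factor, which matters most when serving many long sequences at once.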

Conclusion

The transformer's success comes from one insight: attention is all you need. Processing sequences by having every token attend to every other token — with learned Q, K, V projections — captures long-range dependencies, parallelizes across tokens, and scales with compute in ways RNNs cannot.

Every modern frontier model builds on this architecture with incremental improvements. Understanding the original lets you understand all of them.

For how transformers are trained at scale, see our guide to how LLMs work. For applying transformer models in code, see our Hugging Face guide and the NLP beginners guide.

Frequently Asked Questions

What is the transformer architecture?

The transformer is a neural network architecture introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al. (Google). It processes sequences using a self-attention mechanism — each token can attend to (be influenced by) every other token in the sequence simultaneously. Unlike RNNs that process tokens sequentially, transformers process all tokens in parallel, enabling much more efficient training on modern GPUs. The transformer consists of encoder and decoder components (or just one in GPT-style models), each made of stacked layers with self-attention and feed-forward sublayers. It's the foundation of BERT, GPT, T5, LLaMA, Claude, Gemini, and virtually every modern language model.


