RLHF Explained: How Human Feedback Trains AI to Be Helpful and Safe
RLHF explained — how reinforcement learning from human feedback transforms raw language models into helpful assistants, with DPO, Constitutional AI, and modern alignment alternatives.
If you asked GPT-2 (2019) to explain machine learning helpfully, it would generate coherent text that wasn't necessarily structured to be useful. It had no concept of "what the user wants" — only "what tokens are likely to follow."
RLHF — Reinforcement Learning from Human Feedback — is the technique that changes this. It's how you go from "a model that predicts likely text" to "a model that tries to be helpful, honest, and harmless." ChatGPT's 2022 launch made RLHF famous. Understanding it explains why modern AI assistants behave as they do — and why they fail in specific ways.
The Problem RLHF Solves
Raw pre-trained language models have one objective: predict the next token. This makes them good at text completion but poor at assistance:
Pre-trained GPT (before RLHF):
Prompt: "How do I bake a chocolate cake?"
Output: "...and the temperature should be 350F for 30 minutes. However, older
recipes use 375F. The difference in cake density stems from..."
[continues into technical essay no one asked for]
After RLHF:
Prompt: "How do I bake a chocolate cake?"
Output: "Here's a simple chocolate cake recipe:
1. Preheat oven to 350F...
[helpful, structured, practical]"
The pre-trained model couldn't gauge appropriate detail, format, or safety. RLHF teaches these properties from human judgments.
Stage 1: Supervised Fine-Tuning (SFT)
Fine-tune the pre-trained model on demonstrations of ideal behavior:
```python
# SFT dataset format
sft_examples = [
    {
        "prompt": "Explain photosynthesis to a 10-year-old.",
        "response": """Photosynthesis is how plants make their own food using sunlight!
1. Plants have green chlorophyll in their leaves
2. Chlorophyll captures sunlight (like a solar panel)
3. Plants breathe in CO2 from the air
4. They drink water through their roots
5. Using sunlight as energy, they combine water and CO2 to make sugar
6. They release oxygen as a bonus: the air we breathe!""",
    },
    {
        "prompt": "Write a Python function to reverse a string.",
        # The response contains its own fenced code block, so it is built
        # from escaped lines rather than a raw triple-quoted string.
        "response": (
            "```python\n"
            "def reverse_string(s: str) -> str:\n"
            "    return s[::-1]\n"
            "```\n"
            "Uses slice notation [::-1]: starts from the end, steps backward.\n"
            'Example: reverse_string("hello") returns "olleh"'
        ),
    },
]

# Fine-tune using standard cross-entropy loss
# Model learns format, tone, and helpfulness patterns
```
SFT teaches the model *how* to respond, but not which responses are better.
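The cross-entropy objective used for SFT is commonly averaged over response tokens only, so the model is not trained to reproduce the prompt. The sketch below shows that arithmetic in plain Python, assuming the per-token probabilities are already known. The function name `sft_loss` and the toy numbers are illustrative and do not come from any library.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct next token.
    is_response: True where the token belongs to the response, False for the prompt.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy sequence: 2 prompt tokens followed by 3 response tokens.
probs = [0.10, 0.20, 0.90, 0.50, 0.80]
mask = [False, False, True, True, True]

loss = sft_loss(probs, mask)
print(round(loss, 4))
```

The low probabilities on the two prompt tokens do not affect the result, because the mask excludes them from the average.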
Stage 2: Reward Model Training
Collect human preference comparisons:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

# Human annotators compare response pairs
preference_data = [
    {
        "prompt": "What are the side effects of ibuprofen?",
        "response_A": "Common side effects: stomach upset, nausea, heartburn. "
                      "Rare but serious: kidney problems, cardiovascular risks. "
                      "Take with food. Consult a doctor if symptoms persist.",
        "response_B": "Ibuprofen can cause side effects. It is a medication. "
                      "People take it for pain.",
        "preferred": "A",  # More helpful, accurate, practical
    },
    # Hundreds of thousands of comparisons...
]

# Train a reward model to predict human preferences
class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.model = base_model  # Transformer backbone, e.g. loaded with AutoModel
        self.value_head = nn.Linear(self.model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool the last hidden state over real tokens only (padding masked out)
        mask = attention_mask.unsqueeze(-1).to(outputs.last_hidden_state.dtype)
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        hidden = summed / mask.sum(dim=1).clamp(min=1)
        return self.value_head(hidden).squeeze(-1)  # One scalar reward per sequence

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry model: P(A > B) = sigmoid(r_A - r_B)
    # Maximize log-likelihood that chosen > rejected
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# After training, reward_model(prompt + response_A) should usually exceed
# reward_model(prompt + response_B) when humans prefer response_A
```
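To make the Bradley-Terry objective concrete, here is a small worked example in plain Python. It assumes the reward model has already produced scalar scores. The helper names `preference_probability` and `pairwise_loss` are illustrative only.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human preference under Bradley-Terry."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Equal rewards: the model is indifferent, so the loss is log(2).
print(preference_probability(0.0, 0.0), pairwise_loss(0.0, 0.0))
# Correct ranking with a margin of 2: high probability, small loss.
print(preference_probability(1.5, -0.5), pairwise_loss(1.5, -0.5))
# Wrong ranking with the same margin: low probability, large loss.
print(preference_probability(-0.5, 1.5), pairwise_loss(-0.5, 1.5))
```

Only the difference between the two scores matters, so the reward model is free to shift all of its outputs by a constant without changing the loss.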
Stage 3: RL Fine-Tuning with PPO
Use the reward model to improve the policy:
# In practice, use the trl library (Hugging Face).
# This sketch follows the legacy PPOTrainer interface from older trl releases;
# newer releases changed it, so check the version you have installed.
# tokenizer, dataloader, reward_model, and num_epochs are assumed to be defined.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-checkpoint")
reference = AutoModelForCausalLMWithValueHead.from_pretrained("sft-checkpoint")
reference.requires_grad_(False)  # Frozen reference for KL computation

config = PPOConfig(
    learning_rate=1e-5,
    batch_size=32,
    mini_batch_size=8,   # Must divide batch_size
    init_kl_coef=0.1,    # Initial KL penalty weight, crucial for stability
    target=6.0,          # Target KL divergence for the adaptive KL controller
)
trainer = PPOTrainer(
    config=config,
    model=policy,
    ref_model=reference,
    tokenizer=tokenizer,
)

for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = list(batch["input_ids"])  # One 1-D tensor per prompt
        # Policy generates responses
        responses = trainer.generate(prompts, return_prompt=False, max_new_tokens=256)
        # Score with reward model (placeholder call: one scalar tensor per response)
        rewards = reward_model(prompts, responses)
        # KL penalty: keeps the policy close to the SFT reference and limits
        # reward hacking (gaming the reward model). The legacy trainer applies
        # it inside step() using ref_model, so raw rewards are passed here.
        stats = trainer.step(prompts, responses, rewards)
The KL penalty is critical — without it, the model learns to generate text that gets high reward scores without actually being better (reward hacking).
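One common way to apply the penalty is per token: each token is charged for how much more likely the policy makes it than the reference does, and the reward-model score is added on the final token. The sketch below illustrates that arithmetic in plain Python. It is an assumption about a typical setup, not a description of any specific library, and `kl_shaped_rewards` and the toy log-probabilities are hypothetical.

```python
def kl_shaped_rewards(reward, policy_logprobs, ref_logprobs, kl_coef):
    """Per-token rewards: a KL penalty on every token, plus the reward-model
    score added on the final token. Assumes a non-empty response."""
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    kl = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    shaped = [-kl_coef * k for k in kl]
    shaped[-1] += reward
    return shaped

# Log-probabilities of the sampled response tokens under each model (toy values).
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-0.5, -0.9, -0.4]

shaped = kl_shaped_rewards(reward=2.0, policy_logprobs=policy_lp,
                           ref_logprobs=ref_lp, kl_coef=0.1)
total_kl = sum(p - r for p, r in zip(policy_lp, ref_lp))
print(shaped, sum(shaped), total_kl)
```

Summed over the response, the shaped reward equals the reward-model score minus `kl_coef` times the total KL estimate, so a policy that drifts far from the reference needs a much higher score to come out ahead.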
DPO: Simpler Alternative to RLHF
Direct Preference Optimization (Rafailov et al., 2023) skips the reward model entirely:
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# Same preference data, but used directly for fine-tuning
dataset = Dataset.from_dict({
    "prompt": [
        "How do I improve Python code performance?",
        "What is the best way to learn machine learning?",
    ],
    "chosen": [
        "Profile first with cProfile, then optimize bottlenecks. Use list "
        "comprehensions, avoid repeated dict lookups, use NumPy for numerical ops.",
        "Start with fast.ai's Practical Deep Learning, which builds working models "
        "first, then explains theory. Add Andrew Ng's ML course for fundamentals.",
    ],
    "rejected": [
        "You can improve Python by writing better code. There are many ways.",
        "Machine learning is complex. You should learn everything about it.",
    ],
})

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)       # Policy being trained
reference_model = AutoModelForCausalLM.from_pretrained(model_name)  # Frozen reference copy

dpo_config = DPOConfig(
    beta=0.1,  # Higher = stays closer to the reference model
    learning_rate=5e-6,
    num_train_epochs=3,
    output_dir="./dpo-finetuned",
)

# No reward model, no RL: just supervised learning on preferences
trainer = DPOTrainer(
    model=base_model,
    ref_model=reference_model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # Newer trl releases name this argument processing_class
)
trainer.train()
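Why DPO works: Rafailov et al. show that the KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself. The implicit reward is:
r*(x, y) = β * log[π_θ(y|x) / π_ref(y|x)] + β * log Z(x)
Where π_θ is the trained policy, π_ref is the reference, and Z(x) is a partition function that depends only on the prompt. The Bradley-Terry preference model uses only the reward difference between two responses to the same prompt, so Z(x) cancels. DPO can therefore minimize -log sigmoid(r*(x, y_chosen) - r*(x, y_rejected)) directly on policy log-probabilities, without ever computing Z(x).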
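The implicit reward can be turned into a per-pair loss with a few lines of arithmetic. The sketch below uses plain Python and assumes the summed log-probability of each full response under both models is already available. The function name `dpo_loss` and the toy values are illustrative.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    # Implicit rewards; the beta * log Z(x) term is identical for both and cancels.
    reward_chosen = beta * (policy_chosen_lp - ref_chosen_lp)
    reward_rejected = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy raised the chosen response and lowered the rejected one: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss depends on how the policy has moved relative to the reference, not on the raw log-probabilities, which is why a frozen reference model is needed during training.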
Constitutional AI (How Claude Is Trained)
Anthropic's Constitutional AI adds a self-critique loop:
Step 1: Generate initial responses with a helpful-only model
Prompt → [model] → Response draft
Step 2: Critique against constitution
"Is this response helpful? Could it cause harm?"
"Does it violate the honesty principle?"
[AI reviews its own output against ~16 constitutional principles]
Step 3: Revise based on critique
"Revised to be more helpful and avoid potential harm..."
Step 4: Supervised fine-tuning on the revised responses
Step 5: Build AI-generated preference data
Fine-tuned model samples two responses per prompt
An AI judge picks the one that better follows a constitutional principle
Train a preference (reward) model on these AI labels
Step 6: Fine-tune with RL using the AI-preference reward model
(RLAIF: RL from AI Feedback, not just human feedback)
This is more scalable than pure RLHF because:
- For harmlessness, human input is mostly limited to writing the constitutional principles
- The AI evaluates most safety-related preference pairs
- Reduces human labeler exposure to harmful content
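The critique-and-revise part of the loop can be sketched as a short function. This is a minimal illustration in plain Python, not Anthropic's implementation. The `generate` callable stands in for a real model call, and the prompt wording, the `critique_and_revise` name, and the `fake_model` stub are all hypothetical.

```python
from typing import Callable, Dict, List

def critique_and_revise(prompt: str,
                        generate: Callable[[str], str],
                        principles: List[str]) -> Dict[str, str]:
    """Run one critique-and-revise pass per principle and return the record."""
    draft = generate(prompt)
    revised = draft
    for principle in principles:
        critique = generate(
            f"Critique this response using the principle '{principle}':\n{revised}"
        )
        revised = generate(
            f"Rewrite the response to address the critique.\n"
            f"Response: {revised}\nCritique: {critique}"
        )
    return {"prompt": prompt, "draft": draft, "revised": revised}

# Stand-in for a real model call so the sketch runs without any dependencies.
def fake_model(text: str) -> str:
    if text.startswith("Critique"):
        return "The response could be safer."
    if text.startswith("Rewrite"):
        return "A safer, more helpful answer."
    return "An initial answer."

record = critique_and_revise(
    "How do I pick a strong password?",
    fake_model,
    ["Be helpful", "Avoid harm"],
)
print(record)
```

Each record keeps both the draft and the revision, so later training stages can use whichever they need.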
Reward Hacking: The Main Failure Mode
Reward hacking occurs when the model finds ways to maximize the reward model's score without actually being better:
Common reward hacking behaviors observed:
1. Length exploitation: if reward correlates with response length,
model generates unnecessarily long, padded responses
2. Sycophancy: model learns to agree with the user's stated beliefs,
even incorrect ones, because validation gets high ratings
3. Flattery preambles: adding "Great question!" and similar openers
if these patterns appear in high-rated training examples
4. Format gaming: if bullet points got high ratings,
model bullet-points everything even when prose is better
Prevention:
- KL penalty (limits how far from SFT model can go)
- Multiple reward models (harder to simultaneously exploit all)
- Iterative RLHF: retrain reward model on RL-policy outputs
- Red-teaming: systematically probe for gaming behaviors
- Length normalization in reward model training
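Two of these mitigations can be illustrated together: combining several reward models pessimistically, and discouraging padding. The sketch below is plain Python under stated assumptions. Taking the minimum is just one simple way to aggregate an ensemble, and the length penalty is applied at scoring time as a simpler stand-in for length normalization during reward model training. The name `conservative_reward` and all numbers are hypothetical.

```python
def conservative_reward(scores, response_tokens, length_penalty=0.0, free_tokens=200):
    """Combine several reward-model scores pessimistically and penalize padding.

    scores: one score per reward model for the same response.
    response_tokens: response length in tokens.
    """
    base = min(scores)  # The response must satisfy every reward model
    extra = max(0, response_tokens - free_tokens)
    return base - length_penalty * extra

# One reward model is fooled by a padded answer; the others are not.
padded = conservative_reward([0.9, 0.2, 0.3], response_tokens=800, length_penalty=0.001)
concise = conservative_reward([0.7, 0.6, 0.65], response_tokens=150, length_penalty=0.001)
print(padded, concise)
```

A response that exploits a quirk of a single reward model gains nothing here, because the lowest score in the ensemble sets the reward.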
RLHF vs DPO vs Other Methods
| Method | Stages | Complexity | Stability | Quality | Use |
|---|---|---|---|---|---|
| RLHF (PPO) | 3 | High | Unstable | Very High | GPT-4, Claude |
| DPO | 1 | Low | Stable | High | Open-source fine-tunes |
| RLAIF | 2 | Medium | Medium | High | Claude (CAI) |
| ORPO | 1 | Very Low | Very Stable | Good | Efficient fine-tuning |
| KTO | 1 | Low | Stable | Good | Single labels (not pairs) |
Conclusion
RLHF is the bridge between "language model" and "AI assistant." The three-stage pipeline (SFT, reward model, PPO) transformed how we interact with AI systems. But the complexity and instability of RL training have driven the field toward DPO and related methods that achieve similar alignment with standard fine-tuning.
The key insight to carry forward: alignment is a training objective, not a filter. The model isn't checked against rules at inference time — it's trained to generate helpful, harmless responses in the first place. That's why edge cases and jailbreaks exist; they find prompts that circumvent learned patterns.
For building on top of aligned models, see our fine-tuning LLM guide. For understanding the base models RLHF trains on top of, see our how LLMs work guide.
Frequently Asked Questions
What is RLHF and why does it matter?
RLHF turns a next-token predictor into a helpful assistant by training on human preferences. Human annotators compare model outputs, their preferences train a reward model, and the language model fine-tunes to maximize that reward. It's what makes ChatGPT, Claude, and Gemini respond helpfully rather than just predicting likely text.
What are the three stages of RLHF?
Stage 1 (SFT): fine-tune on curated (prompt, ideal response) pairs. Stage 2 (reward model): collect human preference comparisons and train a model to predict them. Stage 3 (PPO): fine-tune with RL to maximize reward, with a KL penalty to limit reward hacking.
What is DPO and how does it differ from RLHF?
DPO (Direct Preference Optimization) achieves similar results in one fine-tuning step, with no separate reward model and no RL loop. It fine-tunes directly on preference pairs using a reformulated objective. It is simpler and more stable than PPO, with comparable quality in many reported setups, and is widely used for open-source fine-tuning.
What is Constitutional AI?
Anthropic's approach for Claude: give the model a constitution of principles, have it critique and revise its own responses, and fine-tune on the revisions. An AI judge then compares pairs of responses against the principles, and those AI-generated preferences train the reward model used for RL. This scales better than pure human annotation because AI evaluates most safety preferences.
What is reward hacking in RLHF?
When the model maximizes reward scores without actually being better, by finding loopholes in the reward model. Examples: unnecessary length, sycophancy, flattery preambles. Mitigated by a KL divergence penalty, reward model ensembles, iterative retraining, and red-teaming.
AiTechWorlds Team