AiTechWorlds
Side-by-side comparison of prompt engineering, LoRA fine-tuning, and RLHF/DPO — when to use each and full code examples.
When you need an LLM to do something specific, you have three main levers:
1. Prompt Engineering — craft the input to guide the model
2. Fine-Tuning — train the model further on your data
3. RLHF — align the model to human preferences via reward modeling
Each approach has a different cost, complexity, and effectiveness ceiling.
| Factor | Prompt Engineering | Fine-Tuning | RLHF |
|---|---|---|---|
| Cost | Free / API cost only | $10–$10,000+ | $100,000+ |
| Data needed | None | 100–100K examples | Human feedback pairs |
| Technical skill | Low | Medium | Very high |
| Latency | Grows with prompt length | No change | No change |
| Reversibility | Instant | Requires retraining | Requires retraining |
| Output consistency | Variable | High | Very high |
| Best for | Iteration, prototyping | Domain adaptation | Safety and alignment |
Prompt engineering means writing system prompts, few-shot examples, and instruction structures that guide the base model's output without changing its weights.
| Technique | When to Use |
|---|---|
| Zero-shot | Simple, clear tasks |
| Few-shot | Format consistency, classification |
| Chain-of-thought (CoT) | Multi-step reasoning |
| Role prompting | Tone/persona control |
| Structured output (JSON) | Downstream parsing |
| Retrieval augmentation | Grounding responses in facts |
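As an illustration of the few-shot row above, here is a minimal sketch of a prompt builder. The function name `build_few_shot_prompt`, the `Review:`/`Sentiment:` labels, and the sample reviews are hypothetical choices for this example, not part of any library.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt as plain text."""
    lines = [instruction.strip(), ""]
    # Each worked example shows the model the exact output format to copy.
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unanswered so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("Loved it, would buy again.", "positive"),
    ("Broke after two days.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Works fine, nothing special.",
)
print(prompt)
```

The resulting string can be sent as the user message to any chat or completion API. Keeping the examples in a list makes it easy to swap them while iterating.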
Fine-tuning continues training a pre-trained LLM on your own supervised dataset. The model's weights are updated to produce outputs matching your examples.
| Type | How It Works | Data Needed |
|---|---|---|
| Full fine-tuning | Update all weights | 10K–1M examples |
| LoRA (Low-Rank Adaptation) | Add small adapter layers, freeze base | 100–10K examples |
| QLoRA | LoRA on quantized (4-bit) model | 100–10K examples |
| Instruction tuning | Fine-tune on instruction-response pairs | 1K–100K examples |
| Domain adaptation | Fine-tune on domain corpus | 1M+ tokens |
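To see why the LoRA rows in the table need far less training than full fine-tuning, a rough parameter count helps. This sketch assumes a single weight matrix of shape (d_in, d_out) with two adapter matrices of shapes (d_in, r) and (r, d_out). The 4096 dimension is an illustrative value, not taken from a specific model.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full fine-tuning params, LoRA adapter params) for one weight matrix."""
    full = d_in * d_out            # every entry of W is trainable
    lora = r * (d_in + d_out)      # only A (d_in x r) and B (r x d_out) are trainable
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```

At rank 8 the adapter trains well under 1% of the parameters of that one matrix, which is why the base model can stay frozen (and, with QLoRA, quantized).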
# Standard linear layer (W is the pre-trained weight)
output = x @ W
# LoRA: keep W frozen and train only A and B (small matrices)
output = x @ W + x @ (A @ B) * scale
# W shape: (d_in, d_out), frozen
# A shape: (d_in, r), B shape: (r, d_out), trainable
# r (rank) is typically 8, 16, or 64
RLHF is a three-phase process that aligns a model to human preferences. It is used to create the instruct/chat versions of models such as GPT, Claude, and LLaMA.
Phase 1 (supervised fine-tuning): Fine-tune the base model on human-written demonstrations.
Phase 2 (reward model training): Human annotators compare model outputs and rank them. A separate model, the reward model, is trained to predict these rankings.
Phase 3 (RL optimization): Use the reward model as the reward signal for reinforcement learning, typically Proximal Policy Optimization (PPO), to update the LLM toward higher-reward outputs.
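The text does not say which loss the reward model is trained with. A common choice, assumed here, is a pairwise ranking loss that pushes the score of the preferred output above the score of the rejected one. The function name and the sample scores below are hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the reward model already ranks the pair correctly,
# and large when it ranks the rejected output higher.
correct_ranking = pairwise_reward_loss(reward_chosen=3.0, reward_rejected=0.0)
wrong_ranking = pairwise_reward_loss(reward_chosen=0.0, reward_rejected=3.0)
print(correct_ranking, wrong_ranking)
```

In a real pipeline the two scores would come from a neural reward model and the loss would be averaged over a batch of annotated pairs.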
# DPO skips the explicit reward model
# Train directly on (preferred, rejected) response pairs
# Loss: maximize probability of preferred, minimize rejected
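# y_w = preferred response, y_l = rejected response, π_ref = frozen reference model
loss = -log(σ(β * log(π_θ(y_w)/π_ref(y_w)) - β * log(π_θ(y_l)/π_ref(y_l))))
DPO is now often preferred over PPO-based RLHF for alignment work: it is simpler to implement, more stable to train, and often reaches comparable quality.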
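The DPO loss formula can be sketched in plain Python for a single preference pair. This assumes the inputs are summed log-probabilities of each full response under the policy and under the frozen reference model. The function names and the sample values are hypothetical.

```python
import math


def _neg_log_sigmoid(z):
    """Stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair, given sequence log-probabilities."""
    # log(pi_theta(y_w) / pi_ref(y_w)) and log(pi_theta(y_l) / pi_ref(y_l))
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return _neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the preferred response: the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(baseline, improved)
```

The `beta` parameter controls how strongly the policy is penalized for drifting from the reference. The default of 0.1 is only an illustrative value.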
Need to guide an existing capable model?
└─ Try prompt engineering first (zero cost)
Model is capable but inconsistent in format/style?
└─ Fine-tune with LoRA (100-1K examples often enough)
Model lacks domain knowledge?
└─ RAG for factual queries OR domain fine-tuning for reasoning
Need safety, refusal, and alignment?
└─ RLHF/DPO at scale (requires significant resources)
Need smallest possible model for edge deployment?
└─ Fine-tune + quantize (QLoRA → GGUF/ONNX)
| Stage | Tool |
|---|---|
| Prompt testing | OpenAI Playground, PromptFlow, LangSmith |
| Fine-tuning | Hugging Face TRL, LLaMA-Factory, Axolotl |
| LoRA/QLoRA | Hugging Face `peft` library |
| DPO | Hugging Face TRL `DPOTrainer` |
| Reward models | TRL + custom preference datasets |
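The decision guide earlier in this note (the "Need to ...?" questions) can be encoded as a small lookup. This is only a sketch: the need labels and the function name `recommend` are hypothetical, and listing every matching recommendation after prompt engineering is an assumption about how to combine the branches.

```python
def recommend(needs):
    """Map a set of need labels to the approaches named in the decision guide."""
    guide = {
        "inconsistent_format": "Fine-tune with LoRA",
        "lacks_domain_knowledge": "RAG for factual queries or domain fine-tuning for reasoning",
        "safety_alignment": "RLHF/DPO at scale",
        "edge_deployment": "Fine-tune + quantize (QLoRA, then GGUF/ONNX)",
    }
    unknown = set(needs) - set(guide)
    if unknown:
        raise ValueError(f"Unknown need labels: {sorted(unknown)}")
    # Prompt engineering is always the first step, since it changes no weights.
    steps = ["Try prompt engineering first"]
    steps += [advice for label, advice in guide.items() if label in needs]
    return steps


print(recommend({"inconsistent_format", "edge_deployment"}))
```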