Fine-Tuning LLMs: When to Do It and How to Do It Right
Fine-tuning LLMs explained — when fine-tuning beats prompting, how to prepare data, run LoRA fine-tuning with minimal GPU, and evaluate results with real cost and time estimates.
The fine-tuning question comes up constantly in AI development: "Should I just fine-tune the model on our data?"
Most of the time, the answer is no — good prompting with a frontier model is faster, easier to maintain, and often performs just as well. But for certain problems, fine-tuning provides clear advantages that prompting can't match.
This guide covers exactly when fine-tuning is worth it, how to do it efficiently with LoRA on modest hardware, and how to evaluate whether it worked.
When to Fine-Tune vs. Prompt
Prompt first (the vast majority of cases)
Before considering fine-tuning, ensure you've exhausted the prompt engineering options:
# Few-shot prompting often achieves surprisingly strong results
system_prompt = """
You are a support ticket classifier. Classify tickets into:
billing, technical, account, feature_request, other.
Examples:
Input: "My invoice shows wrong amount"
Output: billing
Input: "App crashes when I upload PDF files"
Output: technical
Input: "How do I change my password?"
Output: account
Respond with only the category label, nothing else.
"""
# This approach may already achieve 90%+ accuracy
# without any fine-tuning
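The prompt above can also be assembled from a list of labeled examples, which makes it easier to add or swap examples while iterating. The following is a minimal sketch; `build_few_shot_prompt` and `parse_label` are hypothetical helper names introduced here for illustration, not part of any library.

```python
CATEGORIES = ["billing", "technical", "account", "feature_request", "other"]

def build_few_shot_prompt(examples, categories=CATEGORIES):
    """Assemble the classifier system prompt from (ticket_text, label) pairs."""
    lines = [
        "You are a support ticket classifier. Classify tickets into:",
        ", ".join(categories) + ".",
        "Examples:",
    ]
    for text, label in examples:
        lines.append(f'Input: "{text}"')
        lines.append(f"Output: {label}")
    lines.append("Respond with only the category label, nothing else.")
    return "\n".join(lines)

def parse_label(raw, categories=CATEGORIES, fallback="other"):
    """Normalize a model reply; anything that is not a known label becomes the fallback."""
    label = raw.strip().lower().rstrip(".")
    return label if label in categories else fallback

few_shot_examples = [
    ("My invoice shows wrong amount", "billing"),
    ("App crashes when I upload PDF files", "technical"),
    ("How do I change my password?", "account"),
]
system_prompt = build_few_shot_prompt(few_shot_examples)
print(system_prompt)
```

Parsing the reply against a fixed label set keeps downstream code safe when the model occasionally adds punctuation or extra words.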
Fine-tune when:
1. Consistent specialized output format
Scenario: Extracting structured JSON from medical notes
Problem: GPT-4 sometimes adds explanatory text, changes field names,
or omits optional fields — inconsistency breaks downstream code
Solution: Fine-tune on 500 medical note → JSON pairs
Result: Near-perfect output format compliance (still validate the JSON in code; no training run guarantees 100%)
2. Domain-specific style or terminology
Scenario: Legal document drafting firm
Problem: Model uses consumer-friendly language instead of legal terminology;
doesn't follow jurisdiction-specific formatting conventions
Solution: Fine-tune on 2,000 firm documents with their preferred style
Result: Outputs match firm style guide without extensive prompting
3. Production cost reduction
Scenario: High-volume classification (1M requests/day)
Problem: Each request needs a 1,500-token system prompt with examples
Cost: 1.5B tokens/day at $5/M = $7,500/day
Solution: Fine-tune Llama 3.1 8B to learn the task
Deploy locally or on cheaper inference endpoint
Cost: $50/day in compute vs $7,500/day in API costs
4. Size reduction with maintained quality
Scenario: Mobile/edge deployment
Problem: Can't run a 70B model on device
Solution: Fine-tune a 7B model on 1,000 examples of the 70B model's outputs
(knowledge distillation approach)
Result: 7B model can approach 70B-level quality on the specific task
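The arithmetic behind the cost scenario in item 3 can be reproduced in a few lines. This sketch uses only the figures quoted in that scenario (1M requests/day, a 1,500-token prompt, $5 per million tokens, $50/day of self-hosted compute); the function name is a hypothetical helper, and the calculation ignores the user input and output tokens.

```python
def daily_prompt_cost(requests_per_day, prompt_tokens, usd_per_million_tokens):
    """Daily API spend attributable to the repeated system prompt alone."""
    tokens_per_day = requests_per_day * prompt_tokens
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

api_cost = daily_prompt_cost(1_000_000, 1_500, 5.0)
self_hosted_cost = 50.0  # figure quoted in the scenario, not a measurement

print(f"API prompt cost: ${api_cost:,.0f}/day")              # API prompt cost: $7,500/day
print(f"Cost ratio: {api_cost / self_hosted_cost:.0f}x")     # Cost ratio: 150x
```

Plugging in your own request volume and prompt length shows quickly whether the savings are large enough to justify a training run.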
Don't fine-tune for:
- Tasks that good prompting already handles well
- When you have fewer than 100 high-quality examples
- Teaching the model new factual information (use RAG instead)
- Quick iteration and experimentation (fine-tuning takes hours)
The Fine-Tuning Stack in 2025
Popular combinations:
- Unsloth + Llama 3.1 8B + QLoRA: fastest, most memory-efficient
- Hugging Face TRL + any model: most flexible, best ecosystem
- OpenAI fine-tuning API: simplest if you use GPT-3.5/GPT-4o mini
Hardware requirements (QLoRA):
- 7B model fine-tuning: 12-16GB VRAM (RTX 3090, A10G, T4)
- 13B model fine-tuning: 24GB VRAM (A100 40GB, RTX 4090)
- 70B model fine-tuning: 48-80GB VRAM or multi-GPU
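As a rough sanity check on these numbers, the memory taken by the quantized base weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch under stated assumptions: 4 bits per parameter, 1 GB = 10^9 bytes, and no allowance for quantization constants, activations, LoRA optimizer state, or CUDA overhead, all of which account for the gap between these figures and the VRAM ranges listed above.

```python
def quantized_weight_gb(num_params, bits_per_param=4):
    """Approximate memory for the frozen base weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{quantized_weight_gb(params):.1f} GB for 4-bit weights")
```

The remaining headroom depends heavily on sequence length and batch size, so treat the listed VRAM ranges as starting points rather than guarantees.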
Data Preparation
Data quality is the most important factor in fine-tuning:
# Training data format (Alpaca/instruction-following style)
import json
training_examples = [
    {
        "instruction": "Classify this support ticket",
        "input": "My payment was charged twice for the same order",
        "output": "billing"
    },
    {
        "instruction": "Classify this support ticket",
        "input": "I can't log in after resetting my password",
        "output": "account"
    }
]
# Check data quality
def validate_training_data(examples):
    """Return a list of human-readable problems found in the examples."""
    issues = []
    for i, ex in enumerate(examples):
        # Missing, empty, and whitespace-only fields are all reported once
        if not str(ex.get('instruction', '')).strip():
            issues.append(f"Example {i}: missing instruction")
        if not str(ex.get('output', '')).strip():
            issues.append(f"Example {i}: missing or empty output")
    return issues
issues = validate_training_data(training_examples)
print(f"Issues found: {len(issues)}")
if issues:
    for issue in issues:
        print(f" - {issue}")
# Save in JSONL format (one JSON object per line)
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')
print(f"Training examples: {len(training_examples)}")
Data Quality Checklist
□ Consistent output format (exact same structure in every example)
□ No contradictions (two examples with same input but different output)
□ Representative distribution (covers all cases you'll see in production)
□ Edge cases included (examples with unusual inputs)
□ Held-out test set (10-20% never used in training)
□ Balanced classes (for classification tasks)
□ Quality > quantity (500 excellent >> 5,000 mediocre)
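Several items on this checklist can be checked automatically before training. The sketch below covers three of them: contradictory labels, class balance, and a held-out split. It assumes the Alpaca-style records used earlier (`instruction`, `input`, `output`); the function names and the sample records are hypothetical and exist only for illustration.

```python
import random
from collections import Counter, defaultdict

def find_contradictions(examples):
    """Return (instruction, input) pairs that appear with more than one distinct output."""
    outputs_by_input = defaultdict(set)
    for ex in examples:
        key = (ex["instruction"].strip().lower(), ex["input"].strip().lower())
        outputs_by_input[key].add(ex["output"].strip().lower())
    return {k: sorted(v) for k, v in outputs_by_input.items() if len(v) > 1}

def class_balance(examples):
    """Count examples per output label."""
    return Counter(ex["output"] for ex in examples)

def split_holdout(examples, test_fraction=0.15, seed=42):
    """Shuffle deterministically and return (train, test); test is never used in training."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

instruction = "Classify this support ticket"
sample = [
    {"instruction": instruction, "input": "My payment was charged twice", "output": "billing"},
    {"instruction": instruction, "input": "my payment was charged twice ", "output": "account"},
    {"instruction": instruction, "input": "I can't log in", "output": "account"},
    {"instruction": instruction, "input": "App crashes on upload", "output": "technical"},
]

print("Contradictions:", find_contradictions(sample))
print("Class balance:", dict(class_balance(sample)))
train, test = split_holdout(sample)
print(f"Train: {len(train)}, Test: {len(test)}")
```

Normalizing case and whitespace before comparing inputs catches near-duplicates that an exact string match would miss. Representativeness and edge-case coverage still need human review.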
Fine-Tuning with Unsloth + QLoRA
Unsloth makes fine-tuning significantly faster and more memory-efficient:
# Install
# pip install unsloth
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# 1. Load base model with QLoRA configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,         # Auto-detect: float16 or bfloat16
    load_in_4bit = True,  # QLoRA: quantize to 4-bit for memory efficiency
)
# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank (higher = more trainable parameters and capacity, but slower and more VRAM)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,  # Unsloth's fast path is optimized for 0 dropout
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
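# 3. Load and format dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Format each example as an Alpaca-style prompt (this is not the model's chat template)
def format_prompt(example):
    text = f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
    # Append EOS so the model learns to stop after the answer
    return {"text": text + tokenizer.eos_token}
# Create the "text" column that SFTTrainer reads via dataset_text_field
dataset = dataset.map(format_prompt)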
# 4. Training configuration
# Note: depending on your TRL version, dataset_text_field and max_seq_length
# may need to be passed through SFTConfig instead; check the installed release.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        output_dir = "./outputs",
        num_train_epochs = 3,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,  # Effective batch size = 4 x 4 = 16
        warmup_steps = 10,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 42,
    ),
)
# 5. Train
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Training loss: {trainer_stats.metrics['train_loss']:.4f}")
# 6. Save adapter
model.save_pretrained("my_finetuned_model")
tokenizer.save_pretrained("my_finetuned_model")
Inference with Your Fine-Tuned Model
import torch
from unsloth import FastLanguageModel
# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_finetuned_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
# Enable fast inference
FastLanguageModel.for_inference(model)
# Generate response
def classify_ticket(ticket_text):
    # The prompt must match the format used during training
    prompt = f"""### Instruction:
Classify this support ticket
### Input:
{ticket_text}
### Response:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            temperature=0.1,  # Low temperature for consistent classification
            do_sample=True,
        )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the response part
    category = response.split("### Response:")[-1].strip()
    return category
# Test
print(classify_ticket("My account shows duplicate charges from last week"))
# Expected output: billing
OpenAI Fine-Tuning API (Simpler Option)
For teams using OpenAI models, their API makes fine-tuning accessible:
from openai import OpenAI
import json
client = OpenAI()
# 1. Prepare data in OpenAI format
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a support ticket classifier."},
            {"role": "user", "content": "My payment failed twice"},
            {"role": "assistant", "content": "billing"}
        ]
    },
    # ... more examples
]
# Save as JSONL
with open('openai_training.jsonl', 'w') as f:
    for example in training_data:
        f.write(json.dumps(example) + '\n')
# 2. Upload training file
with open("openai_training.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
file_id = response.id
print(f"File uploaded: {file_id}")
# 3. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",  # Cheapest, fastest to fine-tune
    hyperparameters={"n_epochs": 3}
)
print(f"Fine-tuning job: {job.id}")
# 4. Check status
import time
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    # Stop on any terminal state; without "cancelled" the loop could run forever
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)
if status.status == "succeeded":
    fine_tuned_model_id = status.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model_id}")
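Because a malformed training file only fails after upload, it can save time to check the JSONL locally first. The sketch below validates the chat format shown above (a `messages` list with `system`, `user`, and `assistant` roles, ending in the assistant's target output). The helper names are hypothetical, and the checks are a simplification: they do not cover every field the API accepts, such as tool calls.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_chat_example(example):
    """Return a list of problems for one training record (empty list means OK)."""
    if not isinstance(example, dict):
        return ["record is not a JSON object"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant's target output")
    return problems

def validate_jsonl_lines(lines):
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    for n, line in enumerate(lines, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            report[n] = [f"invalid JSON: {err.msg}"]
            continue
        problems = validate_chat_example(example)
        if problems:
            report[n] = problems
    return report

sample_lines = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a support ticket classifier."},
        {"role": "user", "content": "My payment failed twice"},
        {"role": "assistant", "content": "billing"},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "No label here"}]}),
    "{not valid json",
]
for line_no, problems in validate_jsonl_lines(sample_lines).items():
    print(f"line {line_no}: {problems}")
```

To check a real file, pass the open file object directly, for example `validate_jsonl_lines(open("openai_training.jsonl"))`.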
Evaluation
Always evaluate systematically before deploying:
import json
from openai import OpenAI
def evaluate_model(model_id, test_data, client):
    """Compare fine-tuned model vs base model on test set"""
    results = {
        'fine_tuned': {'correct': 0, 'total': 0},
        'base': {'correct': 0, 'total': 0}
    }
    for example in test_data:
        question = example['input']
        ground_truth = example['output']
        for model_name in ['fine_tuned', 'base']:
            # Compare against the same snapshot the fine-tune was trained from
            model = model_id if model_name == 'fine_tuned' else 'gpt-4o-mini-2024-07-18'
            response = client.chat.completions.create(
                model=model,
                messages=[
                    # Same system prompt as in the training data
                    {"role": "system", "content": "You are a support ticket classifier."},
                    {"role": "user", "content": question}
                ],
                max_tokens=10,
                temperature=0
            )
            prediction = response.choices[0].message.content.strip()
            results[model_name]['total'] += 1
            if prediction.lower() == ground_truth.lower():
                results[model_name]['correct'] += 1
    for model_name, r in results.items():
        # Guard against an empty test set
        accuracy = r['correct'] / r['total'] if r['total'] else 0.0
        print(f"{model_name}: {accuracy:.1%} ({r['correct']}/{r['total']})")
    return results
# test_data: held-out examples (dicts with 'input' and 'output') never used in training
evaluate_model(fine_tuned_model_id, test_data, client)
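Accuracy alone can hide a class the model consistently gets wrong, especially when classes are imbalanced. A per-class and macro-averaged F1 score makes that visible. The sketch below is a dependency-free illustration with hypothetical function names and made-up labels; in practice a library such as scikit-learn provides equivalent metrics.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 score for every label that occurs in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Illustrative labels only, not real evaluation results
y_true = ["billing", "billing", "account", "technical"]
y_pred = ["billing", "account", "account", "technical"]

for label, score in per_class_f1(y_true, y_pred).items():
    print(f"{label}: F1 = {score:.2f}")
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Collecting the predictions and ground-truth labels into two lists during evaluation is enough to feed these functions.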
Cost Estimates
OpenAI fine-tuning (gpt-4o-mini):
- Training: $0.003/1K tokens
- 1,000 examples × 300 tokens avg × 3 epochs = 900K tokens = $2.70
- Inference: $0.30/1M input + $1.20/1M output (roughly double the base gpt-4o-mini rate of $0.15/$0.60; check current pricing)
Self-hosted QLoRA (Llama 3.1 8B):
- Cloud GPU for training: ~$1-3/hour (A10G or A100)
- 1,000 examples × 3 epochs: ~1-2 hours = $1-6
- Inference: fixed server cost (much cheaper at scale)
Time requirements:
- 100-500 examples: 30-60 min fine-tuning
- 1,000-5,000 examples: 1-4 hours
- 10,000+ examples: 4-12+ hours
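The training-cost figures above follow from simple multiplication, which is easy to rerun for your own dataset size. This sketch reuses the rates quoted in this section ($0.003 per 1K training tokens, hourly GPU rental); the function names are hypothetical and the prices are the article's figures, which may have changed.

```python
def openai_training_cost(n_examples, avg_tokens, epochs, usd_per_1k_tokens=0.003):
    """Return (billed training tokens, cost in USD) for a hosted fine-tuning job."""
    total_tokens = n_examples * avg_tokens * epochs
    return total_tokens, total_tokens / 1000 * usd_per_1k_tokens

def gpu_training_cost(hours, usd_per_hour):
    """Rental cost for a self-hosted training run."""
    return hours * usd_per_hour

tokens, cost = openai_training_cost(1_000, 300, 3)
print(f"{tokens:,} tokens -> ${cost:.2f}")   # 900,000 tokens -> $2.70
print(f"2 GPU-hours at $3/hour -> ${gpu_training_cost(2, 3.0):.2f}")
```

Training cost is usually small next to ongoing inference cost, so the per-request price tends to matter more when choosing between the two options.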
Conclusion
Fine-tuning is powerful but often overkill. Try prompt engineering first — many tasks that seem to need fine-tuning are solved by well-structured few-shot examples.
When fine-tuning is justified, QLoRA with Unsloth dramatically reduces the hardware requirements. A fine-tuned 7B model deployed on a single GPU can outperform prompted 70B models for specific tasks at a fraction of the inference cost.
For the broader LLM context, see our how LLMs work guide. For using LLMs in applications without fine-tuning, see our RAG guide.
Frequently Asked Questions
When should I fine-tune an LLM instead of using prompting?
Fine-tune for: consistent output formats prompting can't reliably achieve, domain-specific style/terminology, high-volume production where long system prompts are expensive, or size reduction for deployment. Don't fine-tune for: tasks prompting already handles, fewer than 100 examples, factual grounding (use RAG), or quick iteration.
What is LoRA fine-tuning and why is it preferred?
LoRA trains small adapter matrices (roughly 0.1-1% of the parameters) instead of updating all weights. QLoRA additionally quantizes the frozen base model to 4-bit, which makes 7B model fine-tuning feasible on a 16GB GPU. For most tasks the results are comparable to full fine-tuning at a fraction of the compute cost.
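The parameter fraction can be worked out by hand. For a weight matrix of shape (in, out), LoRA adds two low-rank matrices with rank × (in + out) parameters in total. The sketch below applies that to the seven target modules used in the Unsloth example with rank 16. The layer shapes are assumptions based on the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections), and the base parameter count is approximate.

```python
# Assumed Llama 3.1 8B projection shapes: (in_features, out_features)
LLAMA_3_1_8B = {
    "layers": 32,
    "shapes": {
        "q_proj": (4096, 4096),
        "k_proj": (4096, 1024),
        "v_proj": (4096, 1024),
        "o_proj": (4096, 4096),
        "gate_proj": (4096, 14336),
        "up_proj": (4096, 14336),
        "down_proj": (14336, 4096),
    },
}

def lora_trainable_params(config, rank):
    """Each adapted matrix gains A (in x rank) and B (rank x out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in config["shapes"].values())
    return per_layer * config["layers"]

trainable = lora_trainable_params(LLAMA_3_1_8B, rank=16)
base_params = 8.03e9  # approximate total parameter count

print(f"{trainable:,} trainable ({trainable / base_params:.2%} of base)")
# 41,943,040 trainable (0.52% of base)
```

Doubling the rank doubles the adapter size, which is why rank is the main lever for trading capacity against memory and speed.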
How much training data do I need to fine-tune an LLM?
For format/style: 100-500 quality examples. For domain adaptation: 1,000-5,000 examples. For complex behaviors: 5,000-50,000. Quality matters more than quantity: 500 excellent examples beat 5,000 mediocre ones consistently.
What base model should I fine-tune?
Llama 3.1 8B is the practical default for most new projects in 2025: widely supported, a strong baseline, and feasible to fine-tune on a single GPU with QLoRA. For smaller deployment: Phi-3 or Mistral 7B. For maximum quality: Llama 3.1 70B if you have the compute.
How do I evaluate whether my fine-tuned model is better?
Hold out a 10-20% test set. For classification, report accuracy and F1. For generation, use LLM-as-judge evaluation (for example, GPT-4 rating the outputs) and, ideally, human evaluation on a sample. Compare against both the base model and the few-shot prompted base model; fine-tuning should outperform both to justify the cost.
AiTechWorlds Team