What is LoRA fine-tuning and why is it preferred?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than updating all model parameters. Full fine-tuning of a 7B model requires storing and computing gradients (plus optimizer state) for 7 billion parameters, which is extremely expensive. LoRA freezes the base weights, adds pairs of small trainable low-rank matrices (rank 8-64) to selected weight matrices in each transformer layer, and trains only those, typically 0.1-1% of total parameters. Results are comparable to full fine-tuning for many tasks. QLoRA (Quantized LoRA) further reduces memory by quantizing the frozen base model to 4-bit, enabling fine-tuning of 7B models on a single 24GB GPU and 13B models on a single 40GB A100.
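To make the 0.1-1% figure concrete, here is a minimal sketch of the arithmetic. LoRA learns the update to a d_out × d_in weight matrix as the product of two low-rank factors, B (d_out × r) and A (r × d_in), so each adapted matrix adds r × (d_in + d_out) trainable parameters. The layer shapes and the base parameter count below are assumptions modeled on a Llama-3.1-8B-style configuration; check your own model's config before relying on them.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

# Assumed shapes for a Llama-3.1-8B-style model (not taken from the article).
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),      # grouped-query attention: narrower K/V
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}
BASE_PARAMS = 8_030_000_000  # approximate base model size (assumption)

def total_lora_params(r: int) -> int:
    """Sum adapter parameters over all projections and all layers."""
    per_layer = sum(lora_params(d_in, d_out, r)
                    for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS

for rank in (8, 16, 64):
    n = total_lora_params(rank)
    print(f"r={rank}: {n:,} trainable ({n / BASE_PARAMS:.2%} of base)")
```

Under these assumptions, rank 16 across all seven projection types comes to about 42M trainable parameters, roughly 0.5% of the base model. Adapting fewer projection types or lowering the rank shrinks that share further.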

How much training data do I need to fine-tune an LLM?

It depends heavily on the task. For teaching a consistent output format or style, 100-500 high-quality examples are often sufficient. For domain adaptation (teaching industry-specific language), 1,000-5,000 examples typically produce good results. For teaching complex new behaviors, expect 5,000-50,000 examples. For broad, general-purpose instruction following, plan on tens of thousands to 100K+ diverse examples (Alpaca has roughly 52K; the FLAN collections are far larger). Quality matters more than quantity: 500 carefully curated examples often outperform 5,000 noisy or inconsistent ones. The data should match the distribution of inputs and desired outputs you'll see in production.

What base model should I fine-tune?

In 2025, the most commonly fine-tuned base models are: Llama 3.1 (8B and 70B), Meta's open-weight models, well suited to fine-tuning and backed by a very active community; Mistral 7B / Mixtral 8x7B, with a strong performance-to-size ratio that suits production deployment; Phi-3 / Phi-4 (Microsoft), strong small models (3.8B-14B) for resource-constrained deployment; and Gemma 2 (Google), a solid open-weight option whose terms permit commercial use. For most new fine-tuning projects, Llama 3.1 8B is the practical default: it is widely supported in Hugging Face and Unsloth, has strong baseline performance, and is feasible to fine-tune on a single A100 GPU (or a 24GB card with QLoRA). Each of these models ships under its own license, so check the specific terms before commercial deployment.

How do I evaluate whether my fine-tuned model is better?

Evaluation requires a held-out test set (never use training data for evaluation). For classification tasks: standard metrics (accuracy, F1, AUC). For generation tasks: automatic metrics (BLEU, ROUGE for translation/summarization; METEOR) — though these are imperfect proxies. Better: LLM-as-judge evaluation — use GPT-4 or Claude to rate outputs on quality dimensions (accuracy, helpfulness, format compliance). Best: human evaluation on a sample. Practical workflow: hold out 10-20% of your data for testing; compare fine-tuned model vs. base model and vs. few-shot prompted base model on the same inputs; identify failure modes (look at worst-performing examples); decide if improvement justifies cost of fine-tuning and maintaining the model.
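For the classification case, the comparison described above can be sketched in plain Python. This is an illustrative sketch, not a replacement for a metrics library: the ticket texts and predictions are made-up placeholders, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

def failures(inputs, y_true, y_pred):
    """Return (input, expected, predicted) for every wrong prediction."""
    return [(x, t, p) for x, t, p in zip(inputs, y_true, y_pred) if t != p]

# Placeholder held-out data and predictions from two systems.
inputs = ["Charged twice", "App crashes on upload", "Reset my password", "Refund please"]
y_true = ["billing", "technical", "account", "billing"]
base_pred = ["billing", "account", "account", "other"]
tuned_pred = ["billing", "technical", "account", "billing"]

for name, pred in [("base", base_pred), ("fine-tuned", tuned_pred)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f}, macro-F1={macro_f1(y_true, pred):.2f}")

# Inspect the worst cases rather than only the aggregate score.
for text, expected, got in failures(inputs, y_true, base_pred):
    print(f"base missed: {text!r} expected={expected} got={got}")
```

Running both systems on the same held-out inputs, then reading the failures, is usually more informative than a single accuracy number.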

Fine-Tuning LLMs: When to Do It and How to Do It Right

⚡ Quick Answer

Fine-tuning LLMs explained — when fine-tuning beats prompting, how to prepare data, run LoRA fine-tuning with minimal GPU, and evaluate results with real cost and time estimates.

AiTechWorlds Team May 27, 2026 9 min read

The fine-tuning question comes up constantly in AI development: "Should I just fine-tune the model on our data?"

Most of the time, the answer is no — good prompting with a frontier model is faster, easier to maintain, and often performs just as well. But for certain problems, fine-tuning provides clear advantages that prompting can't match.

This guide covers exactly when fine-tuning is worth it, how to do it efficiently with LoRA on modest hardware, and how to evaluate whether it worked.

When to Fine-Tune vs. Prompt

Prompt first (the vast majority of cases)

Before considering fine-tuning, ensure you've exhausted the prompt engineering options:

# Few-shot prompting often achieves surprisingly strong results
system_prompt = """
You are a support ticket classifier. Classify tickets into: 
billing, technical, account, feature_request, other.

Examples:
Input: "My invoice shows wrong amount"
Output: billing

Input: "App crashes when I upload PDF files"
Output: technical

Input: "How do I change my password?"
Output: account

Respond with only the category label, nothing else.
"""

# This approach may already achieve 90%+ accuracy
# without any fine-tuning
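If the examples change often, it may help to build the prompt from data and to normalize whatever the model returns. The sketch below is one way to do that under simple assumptions; `build_system_prompt` and `parse_label` are hypothetical helper names, and the fallback to "other" is a design choice, not something the prompt itself guarantees.

```python
LABELS = ["billing", "technical", "account", "feature_request", "other"]

FEW_SHOT = [
    ("My invoice shows wrong amount", "billing"),
    ("App crashes when I upload PDF files", "technical"),
    ("How do I change my password?", "account"),
]

def build_system_prompt(labels, examples):
    """Assemble a few-shot classification prompt from labels and (text, label) pairs."""
    lines = [
        "You are a support ticket classifier. Classify tickets into:",
        ", ".join(labels) + ".",
        "",
        "Examples:",
    ]
    for text, label in examples:
        lines += [f'Input: "{text}"', f"Output: {label}", ""]
    lines.append("Respond with only the category label, nothing else.")
    return "\n".join(lines)

def parse_label(raw, labels, fallback="other"):
    """Normalize a model reply to a known label, or fall back if it is off-format."""
    cleaned = raw.strip().strip('."\'').lower()
    return cleaned if cleaned in labels else fallback

system_prompt = build_system_prompt(LABELS, FEW_SHOT)
print(system_prompt)
print(parse_label(" Billing.\n", LABELS))
print(parse_label("I think this is billing", LABELS))
```

Counting how often `parse_label` has to fall back gives a quick measure of format compliance, which is useful evidence when deciding whether prompting is good enough.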

Fine-tune when:

1. Consistent specialized output format

Scenario: Extracting structured JSON from medical notes
Problem: GPT-4 sometimes adds explanatory text, changes field names,
         or omits optional fields — inconsistency breaks downstream code
Solution: Fine-tune on 500 medical note → JSON pairs
Result: Near-100% output format compliance (still validate outputs in code)
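Format compliance is easy to measure, so it is worth checking before and after fine-tuning. The following sketch assumes a flat JSON object; the field names are hypothetical stand-ins for whatever schema your downstream code expects.

```python
import json

# Hypothetical schema for illustration only.
REQUIRED_FIELDS = {"patient_age", "diagnosis"}
OPTIONAL_FIELDS = {"medications", "follow_up"}

def check_format(raw_output):
    """Return a list of format problems; an empty list means the output is compliant."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON (extra text around the object?)"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    missing = REQUIRED_FIELDS - set(data)
    unknown = set(data) - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems

def compliance_rate(outputs):
    """Share of outputs with no format problems."""
    return sum(not check_format(o) for o in outputs) / len(outputs)

sample_outputs = [
    '{"patient_age": 54, "diagnosis": "influenza"}',
    'Here is the JSON: {"patient_age": 54, "diagnosis": "influenza"}',
    '{"age": 54, "diagnosis": "influenza"}',
]
for out in sample_outputs:
    print(check_format(out) or "ok")
print(f"compliance: {compliance_rate(sample_outputs):.0%}")
```

The three samples mirror the failure modes in the scenario: a clean object, an object wrapped in explanatory text, and a renamed field.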

2. Domain-specific style or terminology

Scenario: Legal document drafting firm
Problem: Model uses consumer-friendly language instead of legal terminology;
         doesn't follow jurisdiction-specific formatting conventions
Solution: Fine-tune on 2,000 firm documents with their preferred style
Result: Outputs match firm style guide without extensive prompting

3. Production cost reduction

Scenario: High-volume classification (1M requests/day)
Problem: Each request needs a 1,500-token system prompt with examples
         Cost: 1.5B tokens/day at $5/M = $7,500/day
Solution: Fine-tune Llama 3.1 8B to learn the task
         Deploy locally or on cheaper inference endpoint
Cost: $50/day in compute vs $7,500/day in API costs
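The arithmetic behind this scenario is simple enough to keep as a small script, so you can plug in your own volumes and prices. The figures below are the illustrative ones from the scenario; like the scenario, the sketch counts only the system-prompt tokens and ignores user input and output tokens.

```python
def daily_api_cost(requests_per_day, prompt_tokens, usd_per_million_tokens):
    """Daily cost of sending the same long prompt with every request."""
    tokens_per_day = requests_per_day * prompt_tokens
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

api_cost = daily_api_cost(requests_per_day=1_000_000,
                          prompt_tokens=1_500,
                          usd_per_million_tokens=5.0)
self_hosted_cost = 50.0  # illustrative daily compute budget from the scenario

print(f"API:         ${api_cost:,.0f}/day")
print(f"Self-hosted: ${self_hosted_cost:,.0f}/day")
print(f"Ratio:       {api_cost / self_hosted_cost:.0f}x")
```

At lower volumes the gap narrows quickly, so it is worth rerunning the numbers before committing to the engineering cost of hosting a model.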

4. Size reduction with maintained quality

Scenario: Mobile/edge deployment
Problem: Can't run a 70B model on device
Solution: Fine-tune a 7B model on 1,000 examples of the 70B model's outputs
          (knowledge distillation approach)
Result: 7B model can approach 70B-level quality on the specific task

Don't fine-tune for:

Tasks that good prompting already handles well
When you have fewer than 100 high-quality examples
Teaching the model new factual information (use RAG instead)
Quick iteration and experimentation (fine-tuning takes hours)

The Fine-Tuning Stack in 2025

Popular combinations:
- Unsloth + Llama 3.1 8B + QLoRA: fastest, most memory-efficient
- Hugging Face TRL + any model: most flexible, best ecosystem
- OpenAI fine-tuning API: simplest if you use GPT-3.5/GPT-4o mini

Hardware requirements (QLoRA):
- 7B model fine-tuning: 12-16GB VRAM (RTX 3090, A10G, T4)
- 13B model fine-tuning: 24GB VRAM (A100 40GB, RTX 4090)
- 70B model fine-tuning: 48-80GB VRAM or multi-GPU
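These VRAM figures are larger than the quantized weights alone, and a rough calculation shows why. The sketch below estimates weight storage only, assuming exactly 16 or 4 bits per parameter and 1 GB = 10^9 bytes; real 4-bit formats add some overhead for quantization constants, and activations, LoRA adapters, optimizer state, and framework overhead account for the remainder.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit (weights only)")
```

A 7B model drops from about 14 GB to about 3.5 GB of weights at 4-bit, which is what leaves room for training state on a 12-16GB card.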

Data Preparation

Data quality is the most important factor in fine-tuning:

# Training data format (Alpaca/instruction-following style)
import json

training_examples = [
    {
        "instruction": "Classify this support ticket",
        "input": "My payment was charged twice for the same order",
        "output": "billing"
    },
    {
        "instruction": "Classify this support ticket",
        "input": "I can't log in after resetting my password",
        "output": "account"
    }
]

# Check data quality
def validate_training_data(examples):
    """Return a list of human-readable problems found in the examples."""
    issues = []
    for i, ex in enumerate(examples):
        if not ex.get('instruction'):
            issues.append(f"Example {i}: missing instruction")
        output = ex.get('output')
        if output is None:
            issues.append(f"Example {i}: missing output")
        elif not str(output).strip():
            # Present but blank (empty or whitespace-only)
            issues.append(f"Example {i}: empty output")
    return issues

issues = validate_training_data(training_examples)
print(f"Issues found: {len(issues)}")
if issues:
    for issue in issues:
        print(f"  - {issue}")

# Save in JSONL format
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

print(f"Training examples: {len(training_examples)}")

Data Quality Checklist

□ Consistent output format (exact same structure in every example)
□ No contradictions (two examples with same input but different output)
□ Representative distribution (covers all cases you'll see in production)
□ Edge cases included (examples with unusual inputs)
□ Held-out test set (10-20% never used in training)
□ Balanced classes (for classification tasks)
□ Quality > quantity (500 excellent >> 5,000 mediocre)
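Several items on this checklist can be checked mechanically. The sketch below covers three of them (contradictions, class balance, and a reproducible held-out split) for data in the instruction/input/output format used in this article; the function names and sample tickets are hypothetical.

```python
import random
from collections import Counter, defaultdict

def find_contradictions(examples):
    """Map each (instruction, input) pair to its outputs when more than one exists."""
    outputs_by_input = defaultdict(set)
    for ex in examples:
        outputs_by_input[(ex["instruction"], ex["input"])].add(ex["output"])
    return {key: outs for key, outs in outputs_by_input.items() if len(outs) > 1}

def class_balance(examples):
    """Count examples per output label."""
    return Counter(ex["output"] for ex in examples)

def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and hold out a fraction that is never trained on."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

INSTRUCTION = "Classify this support ticket"
examples = [
    {"instruction": INSTRUCTION, "input": "I was charged twice", "output": "billing"},
    {"instruction": INSTRUCTION, "input": "I was charged twice", "output": "account"},
    {"instruction": INSTRUCTION, "input": "App crashes on upload", "output": "technical"},
    {"instruction": INSTRUCTION, "input": "Refund my last invoice", "output": "billing"},
    {"instruction": INSTRUCTION, "input": "Reset my password", "output": "account"},
]

print("Contradictions:", find_contradictions(examples))
print("Class balance:", dict(class_balance(examples)))
train_set, test_set = split_train_test(examples)
print(f"Train: {len(train_set)}, test: {len(test_set)}")
```

Resolve contradictions before splitting, so the same input cannot end up in both sets with different labels.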

Fine-Tuning with Unsloth + QLoRA

Unsloth makes fine-tuning significantly faster and more memory-efficient:

# Install
# pip install unsloth

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load base model with QLoRA configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,           # Auto-detect: float16 or bfloat16
    load_in_4bit = True,    # QLoRA: quantize to 4-bit for memory efficiency
)

# 2. Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                 # LoRA rank (higher = more trainable parameters and capacity, but slower)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,       # 0 uses Unsloth's optimized code path
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

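# 3. Load and format dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Format each example into an Alpaca-style prompt string
def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

# SFTTrainer reads the "text" column, so build it here. Appending the EOS token
# teaches the model to stop after the label.
dataset = dataset.map(lambda ex: {"text": format_prompt(ex) + tokenizer.eos_token})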

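# 4. Training configuration
# (These keyword arguments follow older TRL releases; newer ones move
#  dataset_text_field and max_seq_length into SFTConfig, so match your installed version.)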
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        output_dir = "./outputs",
        num_train_epochs = 3,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,  # Effective batch size = 16
        warmup_steps = 10,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 42,
    ),
)

# 5. Train
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Training loss: {trainer_stats.metrics['train_loss']:.4f}")

# 6. Save adapter
model.save_pretrained("my_finetuned_model")
tokenizer.save_pretrained("my_finetuned_model")

Inference with Your Fine-Tuned Model

from unsloth import FastLanguageModel
import torch  # needed for torch.no_grad() below

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_finetuned_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable fast inference
FastLanguageModel.for_inference(model)

# Generate response
def classify_ticket(ticket_text):
    prompt = f"""### Instruction:
Classify this support ticket

### Input:
{ticket_text}

### Response:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            temperature=0.1,    # Low temperature for consistent classification
            do_sample=True,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the response part
    category = response.split("### Response:")[-1].strip()
    return category

# Test
print(classify_ticket("My account shows duplicate charges from last week"))
# Output: billing

OpenAI Fine-Tuning API (Simpler Option)

For teams using OpenAI models, their API makes fine-tuning accessible:

from openai import OpenAI
import json

client = OpenAI()

# 1. Prepare data in OpenAI format
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a support ticket classifier."},
            {"role": "user", "content": "My payment failed twice"},
            {"role": "assistant", "content": "billing"}
        ]
    },
    # ... more examples
]

# Save as JSONL
with open('openai_training.jsonl', 'w') as f:
    for example in training_data:
        f.write(json.dumps(example) + '\n')

# 2. Upload training file
with open("openai_training.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
file_id = response.id
print(f"File uploaded: {file_id}")

# 3. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",  # Cheapest, fastest to fine-tune
    hyperparameters={"n_epochs": 3}
)
print(f"Fine-tuning job: {job.id}")

# 4. Check status (poll until the job reaches a terminal state)
import time
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)

if status.status == "succeeded":
    fine_tuned_model_id = status.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model_id}")
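A malformed training file only surfaces as an error after upload, so a quick local check can save a round trip. The sketch below validates the simple system/user/assistant format used in this article; it is deliberately stricter and narrower than the API itself (it does not cover tool calls or other message types), and the function names are hypothetical.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_record(record):
    """Return an error string for one parsed JSONL record, or None if it looks fine."""
    if not isinstance(record, dict):
        return "record is not a JSON object"
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return "missing or empty 'messages' list"
    for m in messages:
        if not isinstance(m, dict) or m.get("role") not in VALID_ROLES:
            return "message with missing or unknown role"
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return "message with empty content"
    if messages[-1]["role"] != "assistant":
        return "last message is not the assistant reply"
    return None

def validate_chat_jsonl(lines):
    """Check each JSONL line and collect problems with 1-based line numbers."""
    problems = []
    for n, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        error = validate_record(record)
        if error:
            problems.append(f"line {n}: {error}")
    return problems

good = json.dumps({"messages": [
    {"role": "system", "content": "You are a support ticket classifier."},
    {"role": "user", "content": "My payment failed twice"},
    {"role": "assistant", "content": "billing"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "My payment failed twice"}]})

problems = validate_chat_jsonl([good, bad, "not json"])
print(problems or "all records look valid")
```

To check a real file, pass the open file object (for example `validate_chat_jsonl(open("openai_training.jsonl"))`), since iterating a file yields its lines.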

Evaluation

Always evaluate systematically before deploying:

import json
from openai import OpenAI

def evaluate_model(model_id, test_data, client):
    """Compare fine-tuned model vs base model on test set"""
    results = {
        'fine_tuned': {'correct': 0, 'total': 0},
        'base': {'correct': 0, 'total': 0}
    }
    
    for example in test_data:
        question = example['input']
        ground_truth = example['output']
        
        for model_name in ['fine_tuned', 'base']:
            model = model_id if model_name == 'fine_tuned' else 'gpt-4o-mini'
            
            response = client.chat.completions.create(
                model=model,
                messages=[
                    # Use the same system prompt the model was fine-tuned with
                    {"role": "system", "content": "You are a support ticket classifier."},
                    {"role": "user", "content": question}
                ],
                max_tokens=10,
                temperature=0
            )
            
            prediction = response.choices[0].message.content.strip()
            results[model_name]['total'] += 1
            if prediction.lower() == ground_truth.lower():
                results[model_name]['correct'] += 1
    
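    for model_name, r in results.items():
        # Guard against an empty test set
        accuracy = r['correct'] / r['total'] if r['total'] else 0.0
        print(f"{model_name}: {accuracy:.1%} ({r['correct']}/{r['total']})")

# Load held-out examples (dicts with 'input' and 'output' keys).
# The filename is a placeholder; use your own held-out file.
with open('held_out_test.jsonl') as f:
    test_data = [json.loads(line) for line in f if line.strip()]

client = OpenAI()
# fine_tuned_model_id comes from the fine-tuning job in the previous section
evaluate_model(fine_tuned_model_id, test_data, client)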

Cost Estimates

OpenAI fine-tuning (gpt-4o-mini):
- Training: $0.003/1K tokens
- 1,000 examples × 300 tokens avg × 3 epochs = 900K tokens = $2.70
- Inference: $0.30/1M input + $1.20/1M output (about twice the base gpt-4o-mini rate; check current pricing)
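The training estimate above is just tokens times price, which is easy to rerun for your own dataset. This sketch uses the per-1K-token rate quoted above as an input; treat the rate as a parameter to update, since prices change.

```python
def openai_training_cost(n_examples, avg_tokens_per_example, epochs, usd_per_1k_tokens):
    """Return (billed training tokens, estimated cost in USD)."""
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens, total_tokens / 1_000 * usd_per_1k_tokens

tokens, cost = openai_training_cost(
    n_examples=1_000,
    avg_tokens_per_example=300,
    epochs=3,
    usd_per_1k_tokens=0.003,
)
print(f"{tokens:,} training tokens -> ${cost:.2f}")
```

Each epoch is a full pass over the data, so doubling the epochs doubles the billed tokens.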

Self-hosted QLoRA (Llama 3.1 8B):
- Cloud GPU for training: ~$1-3/hour (A10G or A100)
- 1,000 examples × 3 epochs: ~1-2 hours = roughly $1-6
- Inference: fixed server cost (much cheaper at scale)

Time requirements:
- 100-500 examples: 30-60 min fine-tuning
- 1,000-5,000 examples: 1-4 hours
- 10,000+ examples: 4-12+ hours

Conclusion

Fine-tuning is powerful but often overkill. Try prompt engineering first — many tasks that seem to need fine-tuning are solved by well-structured few-shot examples.

When fine-tuning is justified, QLoRA with Unsloth dramatically reduces the hardware requirements. A fine-tuned 7B model deployed on a single GPU can outperform prompted 70B models for specific tasks at a fraction of the inference cost.

For the broader LLM context, see our how LLMs work guide. For using LLMs in applications without fine-tuning, see our RAG guide.

Frequently Asked Questions

When should I fine-tune instead of prompting?

Fine-tune when: (1) you need a consistent output format or style that prompting can't reliably achieve; (2) you need domain-specific terminology or writing style that general models don't produce; (3) you run many inferences and the cost of long system prompts adds up; (4) you want to reduce model size for deployment, since a fine-tuned 7B model can outperform a 70B model prompted with many examples on a specific task. Don't fine-tune for tasks that good prompting can already handle, when you have fewer than 100 high-quality examples, to teach the model new facts (RAG is better for factual grounding), or when you need quick iteration, because fine-tuning takes hours to days.
