When should I use RAG vs fine-tuning?

RAG is the better choice when: the knowledge must stay current (you update the document store instead of retraining); you need citations and transparency (the system can show which documents it used); the knowledge base is large and domain-specific (company docs, product manuals, research papers); or you need to reduce hallucination on facts (answers are grounded in source documents, although the model can still misread them). Fine-tuning is the better choice when: you need a consistent output style or format that prompting alone can't achieve; you want to lower inference cost (a fine-tuned smaller model instead of a large model with long retrieved context); or the task depends on behavior patterns rather than factual knowledge. Many production systems use both: RAG for factual grounding and fine-tuning for style and format consistency.

What are vector databases and why does RAG use them?

Vector databases store documents as high-dimensional numerical vectors (embeddings) and support fast similarity search: given a query vector, they return the stored vectors closest to it. RAG relies on them because standard keyword search only finds documents that contain the query words, while vector search finds documents with similar meaning even when the wording differs. Example: a user asks 'how do I cancel my subscription?' and vector search can retrieve a document about 'membership termination procedures' even though it shares none of those words. Popular vector databases: Pinecone (managed, production-grade), Weaviate (open-source, full-featured), Chroma (open-source, lightweight, well suited to development), pgvector (PostgreSQL extension, suited to smaller scale).
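To make "similarity search" concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and document names are made up for illustration (real embeddings have hundreds of dimensions), and real vector databases use approximate indexes rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec.

    Brute-force scan over every stored vector.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings" (assumption: invented values, not model output)
index = {
    "membership_termination": [0.9, 0.1, 0.0],
    "shipping_times": [0.0, 0.2, 0.9],
    "billing_faq": [0.6, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "how do I cancel my subscription?"

results = top_k(query_vec, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The query vector points in nearly the same direction as the "membership_termination" vector, so that document ranks first even though no words are compared at all.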

What is the difference between dense and sparse retrieval in RAG?

Dense retrieval uses an embedding model to convert text into dense vectors (typically a few hundred to a few thousand dimensions, for example 384 for all-MiniLM-L6-v2 or 1536 for OpenAI's text-embedding-3-small) and retrieves by vector similarity, most often cosine similarity. It captures semantic meaning, so it finds relevant documents even when the vocabulary differs. Sparse retrieval uses keyword-based methods (BM25, TF-IDF) that represent text as sparse vectors of term weights and retrieve by lexical overlap. Hybrid retrieval combines both: dense for semantic understanding, sparse for exact keyword matching. Dense works better for conversational queries and paraphrases. Sparse works better for technical terms, product codes, and named entities where an exact match matters. Hybrid often outperforms either approach alone in RAG applications, though the gain depends on the corpus and the query mix. Tools: LangChain's EnsembleRetriever, Pinecone hybrid search, Weaviate hybrid search.
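The sparse side is easy to see in code. Below is a small, self-contained sketch of Okapi BM25 scoring with whitespace tokenization. The documents, the error code, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions, and production systems use a tuned library implementation instead.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical corpus for illustration
docs = [
    "error code E4012 means the payment gateway timed out",
    "membership termination procedures for annual plans",
    "how to reset your password",
]

exact_scores = bm25_scores("E4012", docs)
paraphrase_scores = bm25_scores("cancel subscription", docs)
print(exact_scores)
print(paraphrase_scores)
```

The exact code "E4012" matches only the first document, which is where sparse retrieval shines. The paraphrased query "cancel subscription" shares no words with any document and scores zero everywhere, which is the gap dense retrieval fills.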

How do I handle documents that are too long for the context window?

Chunking strategies for long documents: Fixed-size chunks (for example, split every 512 tokens) are simple but may split mid-sentence or mid-concept. Semantic chunks (split at paragraph or section boundaries) preserve context better but vary in size. Recursive character splitting (LangChain's RecursiveCharacterTextSplitter) tries paragraph breaks first, then line breaks, then spaces, so chunks stay as coherent as the size limit allows. Overlapping chunks (for example, 100-200 tokens shared between adjacent chunks) keep context from being lost at chunk boundaries. Hierarchical indexing stores both full documents and chunks, so the surrounding document can be pulled in for a relevant chunk. Parent document retrieval is a variant of this: search over small chunks for precision, then return the parent document (or a larger parent chunk, if the full document would not fit in the context window) for context.
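Fixed-size chunking with overlap is the simplest of these strategies to sketch. The function below works on a list of tokens. The function name and the tiny sizes are illustrative assumptions, and a real pipeline would use the tokenizer of its embedding model.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens
    with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.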


RAG Explained: How Retrieval-Augmented Generation Works (and When to Use It)

⚡ Quick Answer

RAG (Retrieval-Augmented Generation) explained — how it works, why it beats fine-tuning for factual accuracy, and how to build a RAG system with LangChain and vector databases.

AiTechWorlds Team May 27, 2026 7 min read


I built my first chatbot for a company's internal documentation by fine-tuning a language model on their docs. Three months later, the documentation was updated, and every answer the chatbot gave was outdated. Fine-tuning had baked the knowledge into the model's weights — changing it required a new fine-tuning run.

RAG (Retrieval-Augmented Generation) solved this problem directly. Instead of baking knowledge into weights, the system retrieves relevant documents at query time and includes them in the prompt. When documentation updates, you just update the document store — no retraining.

This guide covers how RAG works architecturally, when to use it versus alternatives, and how to build a complete RAG system with LangChain and a vector database.

The Core Problem RAG Solves

LLMs have knowledge limitations:

Knowledge cutoff: training data has a date; the model doesn't know what happened after
Hallucination: when uncertain, models generate plausible-sounding but false information
Private knowledge: company documents, proprietary data, personal files are not in training data
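Staleness: even facts from before the cutoff (prices, policies, version numbers) may have changed since the model was trained

RAG addresses all four by grounding generation in documents retrieved at query time, which reduces (but does not eliminate) hallucination: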

Without RAG:
User: "What is your refund policy?"
GPT-4: "Most companies offer 30-day returns..." (generic, not your policy)

With RAG:
User: "What is your refund policy?"
System: [searches document store] → retrieves refund_policy.pdf pages 3-4
GPT-4: "Per our policy document: Full refunds are available within 14 days
        of purchase. After 14 days, store credit only..." (accurate, sourced)

How RAG Works

Architecture Overview

Offline (Index Build):
Document → Chunk → Embed → Store in Vector DB

Online (Query):
User Query → Embed → Search Vector DB → Retrieve Top-K → 
Augment Prompt → LLM → Response
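Both phases fit in a few lines of plain Python. This is a toy sketch: the bag-of-words `embed` function stands in for a real embedding model, the list `index` stands in for a vector database, and the LLM call is left out, so the result is the augmented prompt that would be sent to the model. All function names here are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a real system calls an embedding model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents, chunk_words=12):
    """Offline phase: Document -> Chunk -> Embed -> Store."""
    index = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append((chunk, embed(chunk)))
    return index

def build_prompt(query, index, k=1):
    """Online phase: Embed query -> Search -> Retrieve Top-K -> Augment prompt."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: similarity(query_vec, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

documents = [
    "Full refunds are available within 14 days of purchase.",
    "Standard shipping takes 5-7 business days.",
]
index = build_index(documents)
prompt = build_prompt("Are refunds available after purchase?", index)
print(prompt)
```

Updating knowledge means calling `build_index` again on the new documents. Nothing about the model changes, which is the property that makes RAG attractive for fast-moving content.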

Step 1: Document Processing and Chunking

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents (PyPDFLoader returns one Document per PDF page)
loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")

# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Characters per chunk
    chunk_overlap=200,     # Overlap between chunks (preserves context at boundaries)
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # Split hierarchy
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} pages")

# Inspect chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.page_content)} chars")
    print(f"  Source: {chunk.metadata.get('source', 'unknown')}")
    print(f"  Preview: {chunk.page_content[:200]}...")

Step 2: Creating Embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# Pick ONE option. If both assignments run, the second overwrites the first.

# Option 1: OpenAI embeddings (hosted, requires an API key)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Option 2: Open-source embeddings (free, runs locally)
# all-MiniLM-L6-v2: fast, small (22M params), good quality
# embeddings = HuggingFaceEmbeddings(
#     model_name="sentence-transformers/all-MiniLM-L6-v2"
# )

# Test embedding
sample_text = "What is the refund policy?"
embedding_vector = embeddings.embed_query(sample_text)
print(f"Embedding dimension: {len(embedding_vector)}")  # 384 for MiniLM, 1536 for OpenAI

Step 3: Vector Database

from langchain_community.vectorstores import Chroma  # Local, development
# from langchain_pinecone import PineconeVectorStore  # Production

# Create and persist vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

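# Note: _collection is a private attribute of the Chroma wrapper; handy for a
# quick sanity check, but it may change between versions.
print(f"Vector store created with {vectorstore._collection.count()} chunks")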

# Load existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Test retrieval
query = "What is the refund policy for digital products?"
docs = vectorstore.similarity_search(query, k=4)

for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata.get('source')}")
    print(f"Content: {doc.page_content[:200]}...")

Step 4: RAG Chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Define the prompt template
template = """Use the following context to answer the question. 
If the answer is not in the context, say "I don't have that information in my documents."
Don't make up information not in the context.

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)

# Initialize LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",             # "stuff": put all retrieved docs in one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True    # Include source docs in response
)

# Query
result = qa_chain.invoke({"query": "What is the refund policy for digital products?"})

print("Answer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")

Advanced RAG Techniques

Hybrid Search (Combining Semantic + Keyword)

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

# Create BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Create vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Combine: weights follow the order of the retrievers list (50% keyword, 50% semantic)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

docs = ensemble_retriever.invoke("refund policy digital products")
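One common way to merge a keyword ranking with a semantic ranking is weighted Reciprocal Rank Fusion (RRF), which only needs each document's rank in each list and no comparable scores. The sketch below is a standalone illustration, not LangChain's code. The document IDs are made up, and the constant c=60 is the value conventionally used for RRF.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs.

    Each document receives weight / (c + rank) from every list it appears in;
    documents ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a keyword retriever and a vector retriever
keyword_ranking = ["doc_sku", "doc_refund", "doc_terms"]
semantic_ranking = ["doc_refund", "doc_returns", "doc_sku"]

fused = weighted_rrf([keyword_ranking, semantic_ranking], weights=[0.5, 0.5])
print(fused)
```

`doc_refund` wins because both retrievers rank it highly, while documents found by only one retriever fall behind. Shifting the weights toward the keyword list favors exact-match results such as product codes.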

Reranking for Better Precision

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Reranker: cross-encoder scores all candidates more accurately
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=reranker, top_n=3)

# First retrieve more candidates, then rerank to top-N
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # Get 20 candidates
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

docs = reranking_retriever.invoke("refund policy digital products")
# Returns top 3 most relevant after reranking 20 candidates
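The two-stage pattern is independent of any particular model. In this plain-Python sketch the first stage is a cheap word-overlap ranking and `toy_pair_score` is a stand-in for a cross-encoder. Both scoring functions and the corpus are illustrative assumptions, chosen so the effect of reranking is visible.

```python
def first_stage(query, corpus, k):
    """Cheap, recall-oriented retrieval: rank by number of shared words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: looks at query and document together.
    Here: the fraction of document words that appear in the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)

def retrieve_and_rerank(query, corpus, k=3, top_n=1):
    """Fetch k candidates cheaply, then keep the top_n by the better scorer."""
    candidates = first_stage(query, corpus, k)
    reranked = sorted(candidates, key=lambda doc: toy_pair_score(query, doc), reverse=True)
    return reranked[:top_n]

corpus = [
    "our refund policy covers physical products and digital products "
    "differently depending on region and payment method",
    "digital products refund policy",
    "standard shipping takes 5-7 days",
]
query = "refund policy digital products"

candidates = first_stage(query, corpus, k=2)
best = retrieve_and_rerank(query, corpus, k=2, top_n=1)
print(candidates[0])
print(best[0])
```

The first stage cannot separate the two refund documents because both contain all four query words, so the long, diluted one stays on top. The second scorer inspects each query-document pair more closely and promotes the focused document. Only the k candidates are rescored, which keeps the slower scorer affordable.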

Multi-Query RAG

Generate multiple phrasings of the user query for better recall:

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# User asks: "How do I get my money back?"
# Internally generates:
# 1. "What is the refund process?"
# 2. "How can I request a refund?"
# 3. "What are the conditions for returning a product?"
# Retrieves docs for all 3 queries, deduplicates
docs = multi_query_retriever.invoke("How do I get my money back?")
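The merge step behind multi-query retrieval is a union with de-duplication. The sketch below shows only that step. The query variants are hard-coded and `fake_retrieve` is a hypothetical stand-in for a real retriever, whereas in the LangChain version an LLM generates the variants.

```python
def multi_query_retrieve(query_variants, retrieve):
    """Run every query variant through `retrieve` and merge the results,
    keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for variant in query_variants:
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical retriever: maps each phrasing to the documents it would find
fake_results = {
    "How do I get my money back?": ["refund_policy"],
    "What is the refund process?": ["refund_policy", "refund_form"],
    "What are the conditions for returning a product?": ["returns_policy", "refund_policy"],
}

def fake_retrieve(query):
    return fake_results.get(query, [])

merged_docs = multi_query_retrieve(list(fake_results), fake_retrieve)
print(merged_docs)
```

The original phrasing alone finds one document, while the three phrasings together surface three. The trade-off is one retrieval call per variant, plus the LLM call that generates the variants.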

Production RAG Architecture

Production RAG System:

Document Pipeline (offline):
→ Document ingestion (S3/GCS trigger or scheduled)
→ Chunking and preprocessing
→ Embedding generation (batch)
→ Vector database upsert
→ Metadata indexing (for filtering)

Query Pipeline (online, <500ms target):
→ Query preprocessing (cleaning, intent detection)
→ Query embedding
→ Hybrid retrieval (dense + sparse)
→ Reranking
→ Context construction (include metadata, sources)
→ LLM generation with citations
→ Response streaming

Monitoring:
→ Retrieval quality metrics (are we finding relevant docs?)
→ Answer quality evaluation (LLM-as-judge or human review)
→ Latency and cost per query
→ Coverage gaps (queries with no relevant documents)
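Two retrieval quality metrics that are cheap to compute from logged queries are hit rate at k and mean reciprocal rank (MRR). The sketch below assumes you have, for each test query, the ranked document IDs the retriever returned and a hand-labeled set of relevant IDs. The data shown is invented for illustration.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for query_id, ranked in results.items()
        if set(ranked[:k]) & relevant[query_id]
    )
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant document (0 if none was retrieved)."""
    total = 0.0
    for query_id, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical evaluation set: retrieved rankings and labeled relevant documents
results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e"],
    "q3": ["f", "g"],
}
relevant = {
    "q1": {"b"},
    "q2": {"d"},
    "q3": {"z"},  # coverage gap: the relevant document was never retrieved
}

hit_at_1 = hit_rate_at_k(results, relevant, k=1)
hit_at_2 = hit_rate_at_k(results, relevant, k=2)
mrr = mean_reciprocal_rank(results, relevant)
print(f"hit@1={hit_at_1:.2f} hit@2={hit_at_2:.2f} MRR={mrr:.2f}")
```

Queries like q3, where no relevant document appears at any rank, are the coverage gaps worth reviewing first: they usually point to missing documents or poor chunking, not to a weak LLM.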

Evaluating Your RAG System

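# RAG evaluation with RAGAS
# (dataset column names and metric imports differ between RAGAS versions; check the version you install)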
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Build evaluation dataset
from datasets import Dataset

test_data = {
    "question": ["What is your refund policy?", "How long does shipping take?"],
    "answer": ["Full refunds within 14 days...", "Standard shipping takes 5-7 days..."],
    "contexts": [
        ["Policy doc content 1..."],  # Retrieved contexts for q1
        ["Shipping doc content 1..."],  # Retrieved contexts for q2
    ],
    "ground_truth": ["Official policy text...", "Official shipping text..."]
}

dataset = Dataset.from_dict(test_data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(result)
# Example output format (the scores below are illustrative, not measured results):
# faithfulness: 0.85 (how well is the answer supported by the retrieved context?)
# answer_relevancy: 0.92 (how relevant is the answer to the question?)
# context_precision: 0.88 (are the relevant retrieved chunks ranked near the top?)

RAG vs Fine-Tuning vs Prompt Engineering

Approach	Best For	Not For
Prompting	Quick prototypes, general tasks	Specific private knowledge
RAG	Dynamic knowledge, citations, anti-hallucination	Style/format consistency
Fine-tuning	Style, format, specific behavior patterns	Factual grounding
RAG + Fine-tuning	Production systems requiring both	Simple use cases

Conclusion

RAG is one of the highest-impact architectural patterns in applied LLM development. It directly addresses LLM hallucination on private or dynamic data, provides transparent sourcing, and allows knowledge updates without retraining.

The implementation path: start with LangChain + Chroma for development, add hybrid search and reranking for production quality, and migrate to a managed vector database (Pinecone, Weaviate, MongoDB Atlas) when you need scale.

For building the full application around RAG, see our RAG system tutorial and vector database guide.

Frequently Asked Questions

What is RAG?

RAG is an architecture that combines a retrieval system (searching a document database) with a language model (generating answers). Instead of relying solely on knowledge encoded in the model's weights, RAG retrieves relevant documents at inference time and includes them in the prompt as context. The model then generates an answer grounded in those documents. Example: asking GPT-4 'What is our refund policy?' on its own fails because the model doesn't know your policy. With RAG, the system searches your policy documents, retrieves the relevant sections, and includes them in the prompt, so GPT-4 can generate a specific answer based on your actual policy document rather than guessing.

