RAG Explained: How Retrieval-Augmented Generation Works (and When to Use It)
RAG (Retrieval-Augmented Generation) explained — how it works, why it beats fine-tuning for factual accuracy, and how to build a RAG system with LangChain and vector databases.
I built my first chatbot for a company's internal documentation by fine-tuning a language model on their docs. Three months later, the documentation was updated, and every answer the chatbot gave was outdated. Fine-tuning had baked the knowledge into the model's weights — changing it required a new fine-tuning run.
RAG (Retrieval-Augmented Generation) solved this problem directly. Instead of baking knowledge into weights, the system retrieves relevant documents at query time and includes them in the prompt. When documentation updates, you just update the document store — no retraining.
This guide covers how RAG works architecturally, when to use it versus alternatives, and how to build a complete RAG system with LangChain and a vector database.
The Core Problem RAG Solves
LLMs have knowledge limitations:
- Knowledge cutoff: training data has a date; the model doesn't know what happened after
- Hallucination: when uncertain, models generate plausible-sounding but false information
- Private knowledge: company documents, proprietary data, personal files are not in training data
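- Staleness: even facts from before the cutoff (prices, policies, API versions) can change after the model is trained
RAG addresses all four by grounding generation in retrieved documents (it reduces hallucination rather than eliminating it):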
Without RAG:
User: "What is your refund policy?"
GPT-4: "Most companies offer 30-day returns..." (generic, not your policy)
With RAG:
User: "What is your refund policy?"
System: [searches document store] → retrieves refund_policy.pdf pages 3-4
GPT-4: "Per our policy document: Full refunds are available within 14 days
of purchase. After 14 days, store credit only..." (accurate, sourced)
How RAG Works
Architecture Overview
Offline (Index Build):
Document → Chunk → Embed → Store in Vector DB
Online (Query):
User Query → Embed → Search Vector DB → Retrieve Top-K →
Augment Prompt → LLM → Response
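The two pipelines above can be sketched end to end in plain Python before bringing in any library. This is a toy under stated assumptions: the "embedding" is just a set of words, the "vector DB" is a list, and the sample documents are made up for illustration. Only the data flow matches a real system.

```python
import re

def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercase words (a real system uses a vector model)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Offline: Document -> Chunk -> Embed -> Store
documents = {  # hypothetical sample documents
    "refund_policy.txt": "Full refunds are available within 14 days of purchase. After 14 days, store credit only.",
    "shipping.txt": "Standard shipping takes 5-7 business days. Express shipping takes 2 days.",
}
index = []  # stands in for the vector DB: (source, chunk, embedding)
for source, text in documents.items():
    for chunk in text.split(". "):  # naive sentence-level chunking
        index.append((source, chunk, embed(chunk)))

# Online: Query -> Embed -> Search -> Top-K -> Augment Prompt
def retrieve(query: str, k: int = 2):
    q = embed(query)
    # Score = number of shared words; sorted() is stable, so ties keep index order
    return sorted(index, key=lambda item: len(q & item[2]), reverse=True)[:k]

def build_prompt(query: str, k: int = 2) -> str:
    context = "\n".join(f"[{source}] {chunk}" for source, chunk, _ in retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Are refunds available after purchase?")
print(prompt)  # this string is what would be sent to the LLM
```

Updating knowledge here means changing `documents` and rebuilding `index`; nothing about the model changes.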
Step 1: Document Processing and Chunking
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load documents (PyPDFLoader returns one Document per PDF page)
loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Maximum characters per chunk
    chunk_overlap=200,   # Overlap between chunks (preserves context at boundaries)
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # Tried in order: paragraphs, lines, words, characters
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} pages")
# Inspect chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.page_content)} chars")
    print(f"  Source: {chunk.metadata.get('source', 'unknown')}")
    print(f"  Preview: {chunk.page_content[:200]}...")
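To see what chunk_size and chunk_overlap actually do, here is a minimal fixed-size chunker in plain Python. It is a simplification, not LangChain's algorithm: RecursiveCharacterTextSplitter also tries to break on the separators listed above, while this sketch (with a hypothetical `chunk_text` helper) cuts purely by character position.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunks; each chunk starts `chunk_size - chunk_overlap` chars after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is how a sentence that straddles a boundary still appears whole in at least one chunk.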
Step 2: Creating Embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
# Pick ONE option, and use the same model for indexing and for queries.
# Option 1: OpenAI embeddings (hosted, requires API key)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Option 2: Open-source embeddings (free, runs locally)
# all-MiniLM-L6-v2: fast, small (22M params), good quality
# embeddings = HuggingFaceEmbeddings(
#     model_name="sentence-transformers/all-MiniLM-L6-v2"
# )
# Test embedding
sample_text = "What is the refund policy?"
embedding_vector = embeddings.embed_query(sample_text)
print(f"Embedding dimension: {len(embedding_vector)}")  # 384 for MiniLM, 1536 for text-embedding-3-small
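Embeddings are only useful because similar texts end up with similar vectors. The usual measure is cosine similarity, sketched below in plain Python. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions, as the output above shows.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for the example
refund_query = [0.9, 0.1, 0.0]
refund_doc = [0.8, 0.2, 0.1]
shipping_doc = [0.1, 0.0, 0.9]

print(cosine_similarity(refund_query, refund_doc))    # close to 1
print(cosine_similarity(refund_query, shipping_doc))  # close to 0
```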
Step 3: Vector Database
from langchain_community.vectorstores import Chroma  # Local, development
# from langchain_pinecone import PineconeVectorStore  # Production
# Create and persist vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# _collection is a private attribute; fine for a quick check, but it may change between versions
print(f"Vector store created with {vectorstore._collection.count()} chunks")
# On later runs, load the existing vector store instead of rebuilding it
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Test retrieval
query = "What is the refund policy for digital products?"
docs = vectorstore.similarity_search(query, k=4)
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata.get('source')}")
    print(f"Content: {doc.page_content[:200]}...")
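Conceptually, similarity_search is a nearest-neighbour lookup. The sketch below shows the brute-force version with a hypothetical `TinyVectorStore` class and made-up vectors. Real vector databases such as Chroma use approximate indexes to avoid scanning every row, so treat this as a model of the behaviour, not of the implementation.

```python
import heapq
import math

class TinyVectorStore:
    """In-memory stand-in for a vector DB: stores unit vectors, searches by brute force."""

    def __init__(self):
        self._rows = []  # list of (unit_vector, payload)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, payload):
        self._rows.append((self._normalize(vector), payload))

    def similarity_search(self, query_vector, k=4):
        q = self._normalize(query_vector)
        # With unit vectors, the dot product equals cosine similarity
        scored = [(sum(a * b for a, b in zip(q, v)), payload) for v, payload in self._rows]
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = TinyVectorStore()
store.add([0.8, 0.2, 0.1], {"source": "refund_policy.pdf"})
store.add([0.1, 0.0, 0.9], {"source": "shipping.pdf"})
store.add([0.7, 0.3, 0.0], {"source": "returns_faq.pdf"})

results = store.similarity_search([0.9, 0.1, 0.0], k=2)
for score, payload in results:
    print(f"{score:.3f}  {payload['source']}")
```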
Step 4: RAG Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Define the prompt template
template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have that information in my documents."
Don't make up information not in the context.
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)
# Initialize LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff": put all retrieved docs in one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True  # Include source docs in response
)
# Query
result = qa_chain.invoke({"query": "What is the refund policy for digital products?"})
print("Answer:", result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")
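The "stuff" strategy is simple enough to write by hand: concatenate the retrieved chunks into the {context} slot. The sketch below does that in plain Python and adds a character budget so the prompt cannot grow without bound. The `stuff_prompt` helper, the budget parameter, and the sample documents are my own illustration, not part of LangChain.

```python
TEMPLATE = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have that information in my documents."

Context:
{context}

Question: {question}
Answer:"""

def stuff_prompt(question: str, docs: list[dict], max_context_chars: int = 1500) -> str:
    """Fill the template with retrieved docs, in rank order, until the budget is used up."""
    parts, used = [], 0
    for doc in docs:
        block = f"[source: {doc['source']}]\n{doc['text']}"
        if used + len(block) > max_context_chars:
            break  # drop lower-ranked docs rather than truncating mid-sentence
        parts.append(block)
        used += len(block)  # approximate: separators between blocks are not counted
    return TEMPLATE.format(context="\n\n".join(parts), question=question)

docs = [  # hypothetical retrieval results, best match first
    {"source": "refund_policy.pdf", "text": "Full refunds are available within 14 days of purchase."},
    {"source": "shipping.pdf", "text": "Standard shipping takes 5-7 days."},
]
prompt = stuff_prompt("What is the refund policy?", docs)
print(prompt)
```

Labelling each block with its source is what lets the model cite where an answer came from.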
Advanced RAG Techniques
Hybrid Search (Combining Semantic + Keyword)
from langchain_community.retrievers import BM25Retriever  # requires: pip install rank_bm25
from langchain.retrievers import EnsembleRetriever
# Create BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Create vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Combine: 50% keyword, 50% semantic (weights follow the order of `retrievers`)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)
docs = ensemble_retriever.invoke("refund policy digital products")
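BM25 scores and cosine similarities are on different scales, so hybrid search usually merges ranks rather than raw scores. A common method is weighted Reciprocal Rank Fusion, which to my understanding is also the idea behind EnsembleRetriever. The sketch below is a plain-Python version with hypothetical document ids; the constant 60 is the conventional default, assumed here.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (c + rank) to every document it contains."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # keyword results, best first
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # semantic results, best first

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both retrievers (doc_b, doc_a) rise above documents found by only one, which is the behaviour hybrid search is meant to give.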
Reranking for Better Precision
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Reranker: cross-encoder scores all candidates more accurately
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=reranker, top_n=3)
# First retrieve more candidates, then rerank to top-N
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20}) # Get 20 candidates
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
docs = reranking_retriever.invoke("refund policy digital products")
# Returns top 3 most relevant after reranking 20 candidates
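The retrieve-then-rerank pattern does not depend on any particular model: a cheap scorer selects candidates, and a slower, more careful scorer orders them. The sketch below uses two toy scoring functions as stand-ins (word counts for the first stage, adjacent word pairs for the second). Neither is a real cross-encoder, and the corpus is invented.

```python
import re

def words(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cheap_score(query: str, doc: str) -> int:
    """First stage: total occurrences of query words (fast, but fooled by repetition)."""
    query_words = set(words(query))
    return sum(1 for w in words(doc) if w in query_words)

def precise_score(query: str, doc: str) -> float:
    """Second-stage stand-in: share of adjacent query word pairs that are adjacent in the doc."""
    q, d = words(query), words(doc)
    query_pairs = set(zip(q, q[1:]))
    doc_pairs = set(zip(d, d[1:]))
    return len(query_pairs & doc_pairs) / len(query_pairs) if query_pairs else 0.0

def retrieve_then_rerank(query: str, corpus: list[str], fetch_k: int = 3, top_n: int = 2) -> list[str]:
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)[:top_n]

corpus = [
    "policy policy policy refund products overview page",  # keyword-heavy, not actually relevant
    "refund policy for digital products",
    "shipping times for digital products",
    "company holiday schedule",
]
top = retrieve_then_rerank("refund policy digital products", corpus)
print(top)  # ['refund policy for digital products', 'shipping times for digital products']
```

The keyword-heavy page wins the first stage but is dropped by the second, which is the precision gain reranking is meant to provide.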
Multi-Query RAG
Generate multiple phrasings of the user query for better recall:
from langchain.retrievers.multi_query import MultiQueryRetriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)
# User asks: "How do I get my money back?"
# Internally generates:
# 1. "What is the refund process?"
# 2. "How can I request a refund?"
# 3. "What are the conditions for returning a product?"
# Retrieves docs for all 3 queries, deduplicates
docs = multi_query_retriever.invoke("How do I get my money back?")
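The merge step is the part worth understanding: run every phrasing, then keep each document once. A minimal sketch, assuming the rephrased queries already exist and using canned results in place of a real retriever (the ids and texts are hypothetical):

```python
def multi_query_retrieve(queries, retrieve):
    """Run each query and return the union of results, first occurrence wins."""
    seen, merged = set(), []
    for query in queries:
        for doc_id, text in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged

# Canned results standing in for a vector store lookup
fake_results = {
    "What is the refund process?": [("d1", "Refund process overview"), ("d2", "Refund request form")],
    "How can I request a refund?": [("d2", "Refund request form"), ("d3", "Contacting support")],
    "What are the conditions for returning a product?": [("d4", "Return conditions"), ("d1", "Refund process overview")],
}

docs = multi_query_retrieve(list(fake_results), lambda q: fake_results[q])
print([doc_id for doc_id, _ in docs])  # ['d1', 'd2', 'd3', 'd4']
```

Six raw hits collapse to four unique documents; recall improves because d3 and d4 were each found by only one phrasing.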
Production RAG Architecture
Production RAG System:
  Document Pipeline (offline):
    → Document ingestion (S3/GCS trigger or scheduled)
    → Chunking and preprocessing
    → Embedding generation (batch)
    → Vector database upsert
    → Metadata indexing (for filtering)
  Query Pipeline (online, <500ms target):
    → Query preprocessing (cleaning, intent detection)
    → Query embedding
    → Hybrid retrieval (dense + sparse)
    → Reranking
    → Context construction (include metadata, sources)
    → LLM generation with citations
    → Response streaming
  Monitoring:
    → Retrieval quality metrics (are we finding relevant docs?)
    → Answer quality evaluation (LLM-as-judge or human review)
    → Latency and cost per query
    → Coverage gaps (queries with no relevant documents)
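For the retrieval quality item under Monitoring, two standard metrics are hit rate at k and mean reciprocal rank (MRR). Both can be computed from logged retrievals plus a small labelled set of known-relevant documents. The log format and ids below are assumptions for illustration.

```python
def hit_rate_at_k(results: dict, relevant: dict, k: int) -> float:
    """Share of queries with at least one relevant doc in the top k."""
    hits = sum(
        1 for qid, ranked in results.items()
        if any(doc_id in relevant[qid] for doc_id in ranked[:k])
    )
    return hits / len(results)

def mean_reciprocal_rank(results: dict, relevant: dict) -> float:
    """Average of 1/rank of the first relevant doc (0 when none was retrieved)."""
    total = 0.0
    for qid, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical logs: retrieved ids per query, and hand-labelled relevant ids
results = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"], "q3": ["d5", "d6", "d8"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

print(hit_rate_at_k(results, relevant, k=3))    # 2 of 3 queries found a relevant doc
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 1 + 0) / 3 = 0.5
```

Queries like q3, where nothing relevant comes back, are the coverage gaps listed above.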
Evaluating Your RAG System
# RAG evaluation with RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
# Build evaluation dataset
from datasets import Dataset
test_data = {
    "question": ["What is your refund policy?", "How long does shipping take?"],
    "answer": ["Full refunds within 14 days...", "Standard shipping takes 5-7 days..."],
    "contexts": [
        ["Policy doc content 1..."],    # Retrieved contexts for q1
        ["Shipping doc content 1..."],  # Retrieved contexts for q2
    ],
    "ground_truth": ["Official policy text...", "Official shipping text..."]
}
# Column names differ between RAGAS versions; check the docs for the version you install.
dataset = Dataset.from_dict(test_data)
# evaluate() calls an LLM judge (OpenAI by default), so an API key must be configured
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
# Example output (values are illustrative, not measured):
# faithfulness: 0.85 (how grounded is the answer in retrieved context?)
# answer_relevancy: 0.92 (how relevant is the answer to the question?)
# context_precision: 0.88 (are the relevant retrieved chunks ranked near the top?)
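It helps to compute one of these metrics by hand. The sketch below follows the context precision formula as I understand it from the RAGAS documentation: average the precision at each rank where a relevant chunk appears. RAGAS uses an LLM to decide which chunks are relevant; here the relevance flags are supplied manually, so this is an illustration of the arithmetic only.

```python
def context_precision(relevance_flags: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    relevant_seen = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / k  # precision@k at this rank
    return precision_sum / relevant_seen if relevant_seen else 0.0

# Same three chunks, two relevant, in different retrieval orders
print(round(context_precision([True, False, True]), 4))  # 0.8333
print(round(context_precision([False, True, True]), 4))  # 0.5833
```

The score drops when an irrelevant chunk is ranked first, even though the same chunks were retrieved. That is why reranking tends to move this metric.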
RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Best For | Not For |
|---|---|---|
| Prompting | Quick prototypes, general tasks | Specific private knowledge |
| RAG | Dynamic knowledge, citations, anti-hallucination | Style/format consistency |
| Fine-tuning | Style, format, specific behavior patterns | Factual grounding |
| RAG + Fine-tuning | Production systems requiring both | Simple use cases |
Conclusion
RAG is one of the highest-impact architectural patterns in applied LLM development. It directly addresses LLM hallucination on private or dynamic data, provides transparent sourcing, and allows knowledge updates without retraining.
The implementation path: start with LangChain + Chroma for development, add hybrid search and reranking for production quality, and migrate to a managed vector database (Pinecone, Weaviate, MongoDB Atlas) when you need scale.
For building the full application around RAG, see our RAG system tutorial and vector database guide.
Frequently Asked Questions
What is RAG?
RAG is an architecture that combines retrieval (searching a document database) with generation (an LLM). At query time, the system retrieves relevant documents and includes them in the prompt. The LLM then answers from the retrieved documents rather than only its training data, which grounds its answers in private or frequently changing knowledge.
When should I use RAG vs fine-tuning?
Use RAG when knowledge needs to be current, when citations matter, or when the knowledge base is large or domain-specific. Use fine-tuning for consistent output style or format, or to reduce inference cost at scale. Many production systems use both.
What are vector databases and why does RAG use them?
Vector databases store documents as embedding vectors and support fast semantic similarity search, so they find documents with the same meaning as the query, not just the same words. Popular options: Pinecone (managed), Weaviate (open-source), Chroma (development), pgvector (PostgreSQL extension).
What is the difference between dense and sparse retrieval?
Dense retrieval is embedding-based semantic search. Sparse retrieval is keyword-based (for example, BM25). Hybrid retrieval combines both: dense for semantic understanding, sparse for exact technical terms. Hybrid often outperforms either alone in production RAG, though the gain depends on the corpus and the queries.
How do I handle documents too long for the context window?
Split them into chunks. Common strategies are fixed-size, semantic (paragraph boundaries), recursive (LangChain default), and overlapping chunks. Hierarchical indexing stores both chunks and full documents. A more advanced option is parent document retrieval: search over small chunks for precision, then return the parent document for context.
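Parent document retrieval can be sketched in a few lines: index small child chunks, remember which parent each came from, and return the parents of the best-matching children. The word-overlap scoring and the sample documents below are stand-ins chosen for illustration; a real system would search child embeddings in a vector store.

```python
import re

parents = {  # hypothetical full documents
    "refunds": "Refund policy. Full refunds are available within 14 days of purchase. After 14 days, store credit only.",
    "shipping": "Shipping policy. Standard shipping takes 5-7 days. Express shipping is available at checkout.",
}

# Index sentence-sized children, each tagged with its parent id
children = [(pid, sentence) for pid, text in parents.items() for sentence in text.split(". ")]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_parents(query: str, k: int = 2) -> list[str]:
    """Match against small chunks, then return each matched chunk's full parent once."""
    q = tokens(query)
    ranked = sorted(children, key=lambda child: len(q & tokens(child[1])), reverse=True)
    parent_ids = []
    for pid, _ in ranked[:k]:
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]

context = retrieve_parents("store credit after 14 days")
print(context)  # one entry: the full refund policy text
```

Both top-matching sentences belong to the refund document, so the LLM receives that document once, with the surrounding sentences a single small chunk would have lost.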
AiTechWorlds Team