RAG System Tutorial: Build a Production Retrieval-Augmented Generation System
RAG system tutorial — build a production-ready retrieval-augmented generation system with document ingestion, hybrid search, reranking, and evaluation from scratch in Python.
My first RAG system worked great on the documents I tested during development. In production, 30% of user queries got wrong answers — not because the model hallucinated, but because the right chunks weren't being retrieved.
The gap between a RAG prototype and a production system is significant. This tutorial builds the full stack: document processing, hybrid retrieval, reranking, evaluation, and monitoring. Each component addresses a specific failure mode I encountered in real deployments.
Architecture Overview
Production RAG System:
  Document Ingestion Pipeline:
    → Parse PDFs/HTML/DOCX (preserve structure)
    → Chunk with overlap
    → Generate dense embeddings
    → Generate sparse BM25 index
    → Store in vector database with metadata
  Query Pipeline:
    → Query preprocessing
    → Dense retrieval (semantic)
    → Sparse retrieval (keyword)
    → Fusion (RRF or weighted)
    → Reranking (cross-encoder)
    → Context assembly
    → LLM generation with prompt
    → Response streaming
  Evaluation Layer:
    → Faithfulness score
    → Context precision
    → Answer relevancy
    → Latency tracking
Part 1: Document Processing
# pip install langchain langchain-community langchain-text-splitters langchain-openai langchain-chroma
# pip install unstructured pdfplumber pypdf beautifulsoup4
import logging
from pathlib import Path

from langchain_community.document_loaders import (
    BSHTMLLoader,           # Local HTML files
    PyPDFLoader,
    TextLoader,
    UnstructuredPDFLoader,  # Better for complex PDFs
    WebBaseLoader,          # HTML fetched from a URL
)
from langchain_text_splitters import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)


class DocumentProcessor:
    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 64):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,  # Sizes are measured in characters, not tokens
            separators=["\n\n", "\n", ". ", " ", ""],
        )

    def load_pdf(self, file_path: str, use_unstructured: bool = False):
        """Load PDF with appropriate loader."""
        if use_unstructured:
            # Better for complex layouts, tables, multi-column
            loader = UnstructuredPDFLoader(
                file_path,
                mode="elements",    # Preserves tables as separate elements
                strategy="hi_res",  # Better accuracy, slower
            )
        else:
            loader = PyPDFLoader(file_path)
        return loader.load()

    def process_documents(self, file_paths: list[str]) -> list:
        all_chunks = []
        for path in file_paths:
            logger.info(f"Processing: {path}")
            ext = Path(path).suffix.lower()
            if ext == ".pdf":
                docs = self.load_pdf(path)
            elif ext in [".html", ".htm"]:
                # WebBaseLoader fetches over HTTP; local files need BSHTMLLoader
                if path.startswith(("http://", "https://")):
                    docs = WebBaseLoader(path).load()
                else:
                    docs = BSHTMLLoader(path).load()
            else:
                docs = TextLoader(path).load()
            chunks = self.splitter.split_documents(docs)
            # Add source metadata
            for i, chunk in enumerate(chunks):
                chunk.metadata.update({
                    "source": path,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                })
            all_chunks.extend(chunks)
            logger.info(f"  Created {len(chunks)} chunks from {path}")
        logger.info(f"Total chunks: {len(all_chunks)}")
        return all_chunks
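To make the effect of chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in and not what RecursiveCharacterTextSplitter does internally (that splitter also tries the separators in order). The function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

Each window repeats the last chunk_overlap characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.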
Part 2: Hybrid Vector Store
# pip install rank_bm25
from langchain.retrievers import EnsembleRetriever
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_openai import OpenAIEmbeddings


class HybridVectorStore:
    def __init__(self, chunks: list, persist_dir: str = "./rag_db"):
        self.chunks = chunks
        self.persist_dir = persist_dir
        # Dense (semantic) store: embeds every chunk and persists it to disk
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=persist_dir,
        )
        self._build_retrievers()

    def _build_retrievers(self):
        # Sparse (keyword) retriever: BM25, rebuilt in memory from the chunks
        self.bm25_retriever = BM25Retriever.from_documents(self.chunks)
        self.bm25_retriever.k = 10
        # Dense retriever
        self.dense_retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 10}
        )
        # Hybrid: 40% BM25, 60% semantic
        self.hybrid_retriever = EnsembleRetriever(
            retrievers=[self.bm25_retriever, self.dense_retriever],
            weights=[0.4, 0.6],
        )

    def retrieve(self, query: str, k: int = 6) -> list:
        """Retrieve top-k most relevant chunks."""
        docs = self.hybrid_retriever.invoke(query)
        return docs[:k]  # EnsembleRetriever returns merged, deduplicated results

    @classmethod
    def load(cls, persist_dir: str, chunks: list):
        """Load existing vector store without re-embedding the chunks."""
        instance = cls.__new__(cls)
        instance.chunks = chunks
        instance.persist_dir = persist_dir
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        instance.vectorstore = Chroma(
            persist_directory=persist_dir,
            embedding_function=embeddings,
        )
        # The BM25 index is not persisted, so the chunks are still required
        instance._build_retrievers()
        return instance
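The weights in the ensemble are easier to reason about with the fusion step written out. As far as I know, EnsembleRetriever combines its result lists with a weighted form of Reciprocal Rank Fusion (RRF); the sketch below shows that idea in plain Python. The function name weighted_rrf, the document IDs, and the constant c=60 are assumptions for illustration, not LangChain's code.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Each document scores sum(weight / (rank + c)) over the lists it appears in,
    where rank starts at 1. Hypothetical helper for illustration.
    """
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # keyword results, best first
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # semantic results, best first
fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
print(fused)
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized onto a common scale. A document that both retrievers rank highly (doc_b here) rises above one that only a single retriever found.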
Part 3: Reranking
# pip install sentence-transformers
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder


def create_reranking_retriever(base_retriever, top_n: int = 4):
    """Wrap retriever with cross-encoder reranking."""
    # Cross-encoder is more accurate than bi-encoder for ranking
    # First retrieve more candidates with fast bi-encoder, then rerank
    reranker = HuggingFaceCrossEncoder(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
    )
    compressor = CrossEncoderReranker(model=reranker, top_n=top_n)
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )


# Usage (hybrid_store is the HybridVectorStore from Part 2):
# each retriever returns 10 candidates; the merged list is reranked to the top 4
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4,
)
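The retrieve-then-rerank pattern itself is independent of the model. Here is a minimal sketch with the scorer passed in as a function; token_overlap_score is a toy stand-in for the cross-encoder so the example runs without downloading a model, and both function names are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 4) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n highest scoring."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


candidates = [
    "Shipping takes five days",
    "The laptop warranty lasts two years",
    "Warranty claims need a receipt",
]
top = rerank("laptop warranty length", candidates, token_overlap_score, top_n=2)
print(top)
```

The cost profile is the reason for the two stages: the scorer runs once per candidate at query time, so it is only affordable on the short list that the cheaper retrievers return.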
Part 4: Generation with Citations
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class RAGPipeline:
    def __init__(self, retriever, model: str = "gpt-4o-mini"):
        self.retriever = retriever
        self.llm = ChatOpenAI(model=model, temperature=0)
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on provided documents.
Rules:
1. Answer ONLY from the provided context
2. If the answer isn't in the context, say "I don't have that information in the documents."
3. Cite sources using [Doc 1], [Doc 2] format
4. Be concise and specific
Context:
{context}"""),
            ("human", "{question}"),
        ])
        self.output_parser = StrOutputParser()
        # Build the chain once and reuse it for query() and stream_query()
        self.chain = self.prompt | self.llm | self.output_parser

    def format_docs_with_sources(self, docs) -> str:
        formatted = []
        for i, doc in enumerate(docs, start=1):
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page")
            # Compare against None: PyPDFLoader page numbers start at 0
            page_label = f", Page {page}" if page is not None else ""
            formatted.append(
                f"[Doc {i}] (Source: {source}{page_label})\n"
                f"{doc.page_content}"
            )
        return "\n\n---\n\n".join(formatted)

    def query(self, question: str) -> dict:
        # Retrieve
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        # Generate
        answer = self.chain.invoke({"question": question, "context": context})
        return {
            "answer": answer,
            "sources": [doc.metadata.get("source") for doc in docs],
            "retrieved_chunks": len(docs),
        }

    def stream_query(self, question: str):
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        # Yield the answer piece by piece as the model produces it
        for chunk in self.chain.stream({"question": question, "context": context}):
            yield chunk
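Since the prompt asks for [Doc N] citations and query() returns sources in the same order as the numbered context, the two can be joined after generation. The following is a sketch under that assumption; extract_cited_sources is a hypothetical helper, and the sample answer and file paths are made up.

```python
import re


def extract_cited_sources(answer: str, sources: list[str]) -> list[str]:
    """Return the sources behind each [Doc N] marker, in order of first citation.

    Markers that point outside the retrieved list are skipped.
    """
    cited = []
    for match in re.finditer(r"\[Doc (\d+)\]", answer):
        index = int(match.group(1)) - 1  # [Doc 1] refers to sources[0]
        if 0 <= index < len(sources) and sources[index] not in cited:
            cited.append(sources[index])
    return cited


# Made-up answer and sources for illustration
answer = "Laptops have a two-year warranty [Doc 2]. Claims need a receipt [Doc 1][Doc 2]."
sources = ["./docs/faq.pdf", "./docs/manual.pdf", "./docs/returns.pdf"]
print(extract_cited_sources(answer, sources))
```

An answer that cites nothing, or cites a document number that was never retrieved, is a cheap signal worth logging, because it often points to an ungrounded response.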
Part 5: Evaluation with RAGAS
# pip install ragas datasets
# Column names and dict-style score access below follow ragas 0.1.x;
# they may differ in newer releases, so check the docs for your installed version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall


def evaluate_rag_system(rag_pipeline, test_questions: list[dict]) -> dict:
    """
    rag_pipeline: a RAGPipeline instance from Part 4
    test_questions: list of {"question": "...", "ground_truth": "..."}
    """
    results = []
    for item in test_questions:
        result = rag_pipeline.query(item["question"])
        results.append({
            "question": item["question"],
            "answer": result["answer"],
            # Note: this runs retrieval a second time for the same question
            "contexts": [
                doc.page_content
                for doc in rag_pipeline.retriever.invoke(item["question"])
            ],
            "ground_truth": item["ground_truth"],
        })
    # Create evaluation dataset
    dataset = Dataset.from_list(results)
    # Evaluate
    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    return {
        "faithfulness": scores["faithfulness"],
        "answer_relevancy": scores["answer_relevancy"],
        "context_precision": scores["context_precision"],
        "context_recall": scores["context_recall"],
        "num_evaluated": len(test_questions),
    }


# Example test set
test_set = [
    {
        "question": "What is the return policy for electronics?",
        "ground_truth": "Electronics can be returned within 30 days of purchase with original packaging.",
    },
    {
        "question": "How long does standard shipping take?",
        "ground_truth": "Standard shipping takes 5-7 business days.",
    },
]
scores = evaluate_rag_system(rag_pipeline, test_set)
print(f"Faithfulness: {scores['faithfulness']:.2f}")  # Target: > 0.8
print(f"Answer Relevancy: {scores['answer_relevancy']:.2f}")  # Target: > 0.85
print(f"Context Precision: {scores['context_precision']:.2f}")  # Target: > 0.7
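The targets in those comments can be turned into an automated gate, so a change that drops a metric is caught before deployment. This is a minimal sketch: check_thresholds is a hypothetical helper, and the scores shown are made-up numbers, not results from this system.

```python
def check_thresholds(scores: dict[str, float], targets: dict[str, float]) -> list[str]:
    """Return the names of metrics whose score is not above its target.

    A metric missing from scores counts as failing.
    """
    return [
        name for name, target in targets.items()
        if scores.get(name, 0.0) <= target
    ]


# Targets taken from the comments in the example above
TARGETS = {"faithfulness": 0.8, "answer_relevancy": 0.85, "context_precision": 0.7}

# Made-up scores for illustration, not real results
example_scores = {"faithfulness": 0.91, "answer_relevancy": 0.80, "context_precision": 0.75}
failing = check_thresholds(example_scores, TARGETS)
print(failing)
```

Running this in CI against a fixed test set gives a before/after comparison for every retrieval or prompt change.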
Putting It All Together
# Complete pipeline initialization

# 1. Process documents
processor = DocumentProcessor(chunk_size=512, chunk_overlap=64)
chunks = processor.process_documents(["./docs/manual.pdf", "./docs/faq.pdf"])

# 2. Build hybrid vector store
hybrid_store = HybridVectorStore(chunks, persist_dir="./rag_production_db")

# 3. Add reranking
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4,
)

# 4. Create RAG pipeline
rag = RAGPipeline(retriever=reranking_retriever, model="gpt-4o-mini")

# 5. Query
result = rag.query("What is the warranty on laptops?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

# 6. Stream
for chunk in rag.stream_query("How do I contact customer support?"):
    print(chunk, end="", flush=True)
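The architecture overview lists latency tracking, which the pipeline above does not yet record. One way to add it is a small timer wrapped around each stage. The sketch below is an assumption about how that could look: StageTimer is a hypothetical helper, and the sleep calls stand in for the real retrieval and generation calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations per pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.durations: dict[str, float] = {}

    @contextmanager
    def track(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage that runs twice is summed
            elapsed = time.perf_counter() - start
            self.durations[stage] = self.durations.get(stage, 0.0) + elapsed


timer = StageTimer()
with timer.track("retrieve"):
    time.sleep(0.01)  # stand-in for retriever.invoke(question)
with timer.track("generate"):
    time.sleep(0.02)  # stand-in for chain.invoke(...)

for stage, seconds in timer.durations.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
```

Splitting the measurement by stage matters because reranking and generation usually have very different cost profiles, and a single end-to-end number hides which one regressed.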
Conclusion
A production RAG system is built in layers — each one addressing a specific failure mode. Start with the basic retrieval + generation pipeline, measure with RAGAS, then add hybrid search, reranking, and better document processing where the metrics show gaps.
The reranking step alone typically improves answer quality by 15-25% in my experience. It's the single highest-ROI improvement after switching from basic semantic search to hybrid retrieval.
For the vector database underlying this system, see our vector database guide. For the RAG architecture concepts, see our RAG explained guide.
Frequently Asked Questions
What makes a RAG system production-ready?
Quality document processing (handle tables, images), hybrid search (dense + sparse), reranking, evaluation metrics (faithfulness, precision), monitoring for retrieval failures, and cost controls. Most tutorials show prototypes; production needs all these layers.
What chunk size should I use for RAG?
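512 tokens with 50-100 tokens of overlap is a good default for Q&A; use 1024 for technical docs that need more context. Note that the DocumentProcessor in Part 1 sets length_function=len, so its chunk_size of 512 counts characters; pass a token-counting length function if you want token-sized chunks. Use parent document retrieval for the best of both: small chunks for precision, parent chunks for context. Test on your specific dataset, because chunk size has a large impact on quality.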
How do I evaluate my RAG system?
RAGAS metrics: faithfulness (answers grounded in context), context precision (relevant retrieved content), context recall (all needed info retrieved), answer relevancy (answer addresses question). Create 50-100 test (question, ground truth) pairs from your domain and run RAGAS against them before and after changes.
What is parent document retrieval?
Retrieve small chunks (256 tokens) for precision, return the larger parent chunk (1024+ tokens) for context. Avoids the trade-off between retrieval precision and answer quality. Implemented as ParentDocumentRetriever in LangChain.
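The idea can be sketched without any library. In this illustration the child chunks are a few words each and matching is plain keyword overlap; a real setup would use token-sized chunks and embedding similarity. The function names and sample texts are hypothetical, and this is not how ParentDocumentRetriever is implemented.

```python
def build_child_index(parents: dict[str, str], child_size: int) -> list[tuple[str, str]]:
    """Split each parent text into small word-based chunks, keeping the parent ID."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            index.append((parent_id, " ".join(words[start:start + child_size])))
    return index


def retrieve_parents(query: str, index: list[tuple[str, str]],
                     parents: dict[str, str], k: int = 1) -> list[str]:
    """Match the query against child chunks, then return the distinct parents."""
    query_tokens = set(query.lower().split())

    def overlap(child_text: str) -> int:
        # Toy keyword match; a real system would use embedding similarity
        return len(query_tokens & set(child_text.lower().split()))

    ranked = sorted(index, key=lambda item: overlap(item[1]), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [parents[pid] for pid in parent_ids]


# Made-up parent documents for illustration
parents = {
    "returns": "Electronics can be returned within 30 days. Keep the original packaging.",
    "shipping": "Standard shipping takes 5-7 business days. Express is faster.",
}
index = build_child_index(parents, child_size=4)
print(retrieve_parents("standard shipping time", index, parents, k=1))
```

The match happens on the four-word child "Standard shipping takes 5-7", but the caller receives the full shipping passage, which is what the LLM needs to answer well.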
How do I handle tables and images in RAG?
Tables: use Unstructured.io or LlamaParse for complex PDFs. Never use a text splitter that splits a table row. Images: extract and describe with GPT-4 Vision, store descriptions as searchable text. LlamaParse ($0.003/page) is worth the cost for complex document layouts.
AiTechWorlds Team