
RAG System Tutorial: Build a Production Retrieval-Augmented Generation System

RAG system tutorial — build a production-ready retrieval-augmented generation system with document ingestion, hybrid search, reranking, and evaluation from scratch in Python.

AiTechWorlds Team
May 27, 2026 7 min read

My first RAG system worked great on the documents I tested during development. In production, 30% of user queries got wrong answers — not because the model hallucinated, but because the right chunks weren't being retrieved.

The gap between a RAG prototype and a production system is significant. This tutorial builds the full stack: document processing, hybrid retrieval, reranking, generation with citations, and evaluation. Each component addresses a specific failure mode I encountered in real deployments.


Architecture Overview

Production RAG System:

Document Ingestion Pipeline:
  → Parse PDFs/HTML/DOCX (preserve structure)
  → Chunk with overlap
  → Generate dense embeddings
  → Generate sparse BM25 index
  → Store in vector database with metadata

Query Pipeline:
  → Query preprocessing
  → Dense retrieval (semantic)
  → Sparse retrieval (keyword)
  → Fusion (RRF or weighted)
  → Reranking (cross-encoder)
  → Context assembly
  → LLM generation with prompt
  → Response streaming

Evaluation Layer:
  → Faithfulness score
  → Context precision
  → Answer relevancy
  → Latency tracking

Part 1: Document Processing

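# pip install langchain langchain-community langchain-openai langchain-chroma unstructured pdfplumber pypdf
# Import paths in this tutorial follow the LangChain 0.3 package layout; newer major versions may have moved some modules.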

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredPDFLoader,  # Better for complex PDFs
    WebBaseLoader,
    DirectoryLoader
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
import logging

logger = logging.getLogger(__name__)

class DocumentProcessor:
    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 64):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,  # counts characters, so chunk_size is in characters, not tokens
            separators=["\n\n", "\n", ". ", " ", ""],
        )
    
    def load_pdf(self, file_path: str, use_unstructured: bool = False):
        """Load PDF with appropriate loader."""
        if use_unstructured:
            # Better for complex layouts, tables, multi-column
            loader = UnstructuredPDFLoader(
                file_path,
                mode="elements",  # Preserves tables as separate elements
                strategy="hi_res"  # Better accuracy, slower
            )
        else:
            loader = PyPDFLoader(file_path)
        
        return loader.load()
    
    def process_documents(self, file_paths: list[str]) -> list:
        all_chunks = []
        
        for path in file_paths:
            logger.info(f"Processing: {path}")
            
            ext = Path(path).suffix.lower()
            if ext == ".pdf":
                docs = self.load_pdf(path)
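            elif ext in [".html", ".htm"]:
                # WebBaseLoader fetches URLs; BSHTMLLoader reads local HTML files
                # (requires: pip install beautifulsoup4)
                from langchain_community.document_loaders import BSHTMLLoader
                docs = BSHTMLLoader(path).load()
            else:
                from langchain_community.document_loaders import TextLoader
                docs = TextLoader(path).load()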
            
            chunks = self.splitter.split_documents(docs)
            
            # Add source metadata
            for i, chunk in enumerate(chunks):
                chunk.metadata.update({
                    "source": path,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                })
            
            all_chunks.extend(chunks)
            logger.info(f"  Created {len(chunks)} chunks from {path}")
        
        logger.info(f"Total chunks: {len(all_chunks)}")
        return all_chunks
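
To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of overlapping chunking using fixed character windows. It is a simplification: RecursiveCharacterTextSplitter also tries to break on the separators listed above, which this sketch ignores. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_with_overlap(sample, chunk_size=512, chunk_overlap=64)
print([len(c) for c in chunks])           # [512, 512, 104]
print(chunks[0][-64:] == chunks[1][:64])  # True: neighbours share 64 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.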

Part 2: Hybrid Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
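from langchain.retrievers import EnsembleRetriever
# BM25Retriever lives in langchain_community and requires: pip install rank_bm25
from langchain_community.retrievers import BM25Retriever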

class HybridVectorStore:
    def __init__(self, chunks: list, persist_dir: str = "./rag_db"):
        self.chunks = chunks
        self.persist_dir = persist_dir
        
        # Dense (semantic) retriever
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=persist_dir
        )
        
        # Sparse (keyword) retriever — BM25
        self.bm25_retriever = BM25Retriever.from_documents(chunks)
        self.bm25_retriever.k = 10
        
        # Dense retriever
        self.dense_retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 10}
        )
        
        # Hybrid: 40% BM25, 60% semantic
        self.hybrid_retriever = EnsembleRetriever(
            retrievers=[self.bm25_retriever, self.dense_retriever],
            weights=[0.4, 0.6]
        )
    
    def retrieve(self, query: str, k: int = 6) -> list:
        """Retrieve top-k most relevant chunks."""
        docs = self.hybrid_retriever.invoke(query)
        return docs[:k]  # EnsembleRetriever returns merged, deduplicated results
    
    @classmethod
    def load(cls, persist_dir: str, chunks: list):
        """Load an existing vector store without re-embedding the documents.

        The BM25 index lives in memory only, so the chunks are still required.
        """
        instance = cls.__new__(cls)
        instance.chunks = chunks
        instance.persist_dir = persist_dir
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        instance.vectorstore = Chroma(
            persist_directory=persist_dir,
            embedding_function=embeddings
        )
        instance.bm25_retriever = BM25Retriever.from_documents(chunks)
        instance.bm25_retriever.k = 10  # match __init__; otherwise the smaller default k applies
        instance.dense_retriever = instance.vectorstore.as_retriever(search_kwargs={"k": 10})
        instance.hybrid_retriever = EnsembleRetriever(
            retrievers=[instance.bm25_retriever, instance.dense_retriever],
            weights=[0.4, 0.6]
        )
        return instance
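
The architecture overview lists RRF as one fusion option. The sketch below shows weighted Reciprocal Rank Fusion in plain Python to illustrate how two ranked lists can be merged with the 40/60 weighting used above. It is an illustration of the technique, not LangChain's code; the function name weighted_rrf, the constant c=60, and the document IDs are assumptions.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each document earns weight / (rank + c) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; documents found by both retrievers accumulate two contributions
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # keyword retriever output (hypothetical IDs)
dense_ranking = ["doc_c", "doc_a", "doc_d"]  # semantic retriever output (hypothetical IDs)

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
print(fused)  # ['doc_a', 'doc_c', 'doc_d', 'doc_b']
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.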

Part 3: Reranking

# pip install sentence-transformers  (needed by HuggingFaceCrossEncoder)
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

def create_reranking_retriever(base_retriever, top_n: int = 4):
    """Wrap retriever with cross-encoder reranking."""
    
    # Cross-encoder is more accurate than bi-encoder for ranking
    # First retrieve more candidates with fast bi-encoder, then rerank
    reranker = HuggingFaceCrossEncoder(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
    )
    compressor = CrossEncoderReranker(model=reranker, top_n=top_n)
    
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

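# Usage (hybrid_store is created in "Putting It All Together"):
# each retriever fetches 10 candidates, then the cross-encoder keeps the top 4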
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4
)
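
The retrieve-then-rerank pattern itself is independent of the model used. The sketch below shows its shape with a toy scoring function standing in for the cross-encoder, so it runs without downloading a model. The names overlap_score and rerank are hypothetical, and token overlap is only a placeholder for a real relevance model.

```python
import re
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens found in the passage.

    Stand-in for a cross-encoder, which would score the (query, passage) pair jointly.
    """
    query_tokens = set(re.findall(r"\w+", query.lower()))
    passage_tokens = set(re.findall(r"\w+", passage.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(query: str, candidates: list[str], top_n: int = 4,
           score_fn: Callable[[str, str], float] = overlap_score) -> list[str]:
    """Score every candidate against the query and keep the top_n highest-scoring ones."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort: ties keep retrieval order
    return [passage for _, passage in scored[:top_n]]


candidates = [
    "Shipping takes 5-7 business days.",
    "The laptop warranty lasts two years.",
    "Warranty claims need a receipt.",
    "Returns are accepted within 30 days.",
]
print(rerank("laptop warranty", candidates, top_n=2))
# ['The laptop warranty lasts two years.', 'Warranty claims need a receipt.']
```

The first stage is cheap and casts a wide net; the second stage is more expensive per pair, which is why it only runs on the small candidate set.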

Part 4: Generation with Citations

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

class RAGPipeline:
    def __init__(self, retriever, model: str = "gpt-4o-mini"):
        self.retriever = retriever
        self.llm = ChatOpenAI(model=model, temperature=0)
        
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on provided documents.

Rules:
1. Answer ONLY from the provided context
2. If the answer isn't in the context, say "I don't have that information in the documents."
3. Cite sources using [Doc 1], [Doc 2] format
4. Be concise and specific

Context:
{context}"""),
            ("human", "{question}")
        ])
        
        self.output_parser = StrOutputParser()
    
    def format_docs_with_sources(self, docs) -> str:
        formatted = []
        for i, doc in enumerate(docs):
            source = doc.metadata.get("source", "Unknown")
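            page = doc.metadata.get("page")
            # Compare against None: PyPDFLoader pages are 0-indexed, so page 0 is valid
            page_label = f", Page {page}" if page is not None else ""
            formatted.append(
                f"[Doc {i+1}] (Source: {source}{page_label})\n"
                f"{doc.page_content}"
            )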
        return "\n\n---\n\n".join(formatted)
    
    def query(self, question: str) -> dict:
        # Retrieve
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        
        # Generate
        chain = self.prompt | self.llm | self.output_parser
        answer = chain.invoke({"question": question, "context": context})
        
        return {
            "answer": answer,
            "sources": [doc.metadata.get("source") for doc in docs],
            "retrieved_chunks": len(docs)
        }
    
    def stream_query(self, question: str):
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        
        chain = self.prompt | self.llm | self.output_parser
        
        for chunk in chain.stream({"question": question, "context": context}):
            yield chunk
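
The prompt asks the model to cite sources as [Doc 1], [Doc 2], but nothing above verifies that it does. A minimal sketch of a post-generation check follows, assuming the [Doc N] format from the system prompt; the function name check_citations is hypothetical.

```python
import re


def check_citations(answer: str, num_docs: int) -> dict:
    """Extract [Doc N] citations and flag any that point outside the retrieved set."""
    cited = sorted({int(n) for n in re.findall(r"\[Doc (\d+)\]", answer)})
    invalid = [n for n in cited if n < 1 or n > num_docs]
    return {"cited": cited, "invalid": invalid, "has_citation": bool(cited)}


# Two chunks were retrieved, but the answer cites a third one
print(check_citations("Laptops have a two-year warranty [Doc 1][Doc 3].", num_docs=2))
# {'cited': [1, 3], 'invalid': [3], 'has_citation': True}
```

An answer with no citations, or with citations to documents that were never retrieved, is a reasonable candidate for logging and review.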

Part 5: Evaluation with RAGAS

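# pip install ragas datasets
# Written against the ragas 0.1-style API (question/answer/contexts/ground_truth columns);
# later releases renamed the dataset columns, so pin the version if this fails.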

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

def evaluate_rag_system(rag_pipeline, test_questions: list[dict]) -> dict:
    """
    test_questions: list of {"question": "...", "ground_truth": "..."}
    """
    
    results = []
    for item in test_questions:
        result = rag_pipeline.query(item["question"])
        results.append({
            "question": item["question"],
            "answer": result["answer"],
            "contexts": [
                doc.page_content 
                for doc in rag_pipeline.retriever.invoke(item["question"])
            ],
            "ground_truth": item["ground_truth"]
        })
    
    # Create evaluation dataset
    dataset = Dataset.from_list(results)
    
    # Evaluate
    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )
    
    return {
        "faithfulness": scores["faithfulness"],
        "answer_relevancy": scores["answer_relevancy"],
        "context_precision": scores["context_precision"],
        "context_recall": scores["context_recall"],
        "num_evaluated": len(test_questions)
    }

# Example test set
test_set = [
    {
        "question": "What is the return policy for electronics?",
        "ground_truth": "Electronics can be returned within 30 days of purchase with original packaging."
    },
    {
        "question": "How long does standard shipping take?",
        "ground_truth": "Standard shipping takes 5-7 business days."
    },
]

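# rag_pipeline is a RAGPipeline instance (built as `rag` in "Putting It All Together")
scores = evaluate_rag_system(rag_pipeline, test_set)
print(f"Faithfulness: {scores['faithfulness']:.2f}")  # Target: > 0.8
print(f"Answer Relevancy: {scores['answer_relevancy']:.2f}")  # Target: > 0.85
print(f"Context Precision: {scores['context_precision']:.2f}")  # Target: > 0.7
print(f"Context Recall: {scores['context_recall']:.2f}")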
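
The targets in the comments above can be turned into an automated check that runs before and after each change. This is a minimal sketch: the thresholds are the ones quoted in the comments, while the names TARGETS and failing_metrics and the example scores are hypothetical.

```python
# Targets taken from the comments in the evaluation example
TARGETS = {"faithfulness": 0.8, "answer_relevancy": 0.85, "context_precision": 0.7}


def failing_metrics(scores: dict, targets: dict = TARGETS) -> dict:
    """Return {metric: (score, target)} for every metric that is missing or not above its target."""
    failures = {}
    for name, target in targets.items():
        value = scores.get(name)
        if value is None or value <= target:
            failures[name] = (value, target)
    return failures


# Hypothetical scores, for illustration only
example_scores = {"faithfulness": 0.91, "answer_relevancy": 0.80, "context_precision": 0.75}
print(failing_metrics(example_scores))  # {'answer_relevancy': (0.8, 0.85)}
```

In a CI job, a non-empty result could fail the build so that a retrieval regression is caught before deployment.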

Putting It All Together

# Complete pipeline initialization

# 1. Process documents
processor = DocumentProcessor(chunk_size=512, chunk_overlap=64)
chunks = processor.process_documents(["./docs/manual.pdf", "./docs/faq.pdf"])

# 2. Build hybrid vector store
hybrid_store = HybridVectorStore(chunks, persist_dir="./rag_production_db")

# 3. Add reranking
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4
)

# 4. Create RAG pipeline
rag = RAGPipeline(retriever=reranking_retriever, model="gpt-4o-mini")

# 5. Query
result = rag.query("What is the warranty on laptops?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

# 6. Stream
for chunk in rag.stream_query("How do I contact customer support?"):
    print(chunk, end="", flush=True)

Conclusion

A production RAG system is built in layers — each one addressing a specific failure mode. Start with the basic retrieval + generation pipeline, measure with RAGAS, then add hybrid search, reranking, and better document processing where the metrics show gaps.

The reranking step alone typically improves answer quality by 15-25% in my experience. It's the single highest-ROI improvement after switching from basic semantic search to hybrid retrieval.

For the vector database underlying this system, see our vector database guide. For the RAG architecture concepts, see our RAG explained guide.


Frequently Asked Questions

What makes a RAG system production-ready?

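A production-ready system needs quality document processing (PDFs, tables, and images, not just plain text), hybrid search (dense + sparse), reranking, careful context construction (how you format retrieved chunks matters), evaluation metrics (faithfulness, relevancy, precision), monitoring for queries that fail to retrieve relevant context, and cost controls such as caching frequent queries. Most tutorials show prototypes; production needs all of these layers.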
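
As a starting point for the monitoring layer, the sketch below wraps any retrieval function to record latency and flag queries that return too few chunks. It is one possible shape under stated assumptions: the name monitored_retrieve, the logger name, and the min_docs threshold are all hypothetical.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("rag.monitoring")


def monitored_retrieve(retrieve_fn: Callable[[str], list], query: str,
                       min_docs: int = 1) -> tuple[list, dict]:
    """Call retrieve_fn(query), timing it and flagging results with fewer than min_docs chunks."""
    start = time.perf_counter()
    docs = retrieve_fn(query)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "query": query,
        "num_docs": len(docs),
        "latency_ms": round(latency_ms, 2),
        "retrieval_failed": len(docs) < min_docs,
    }
    if record["retrieval_failed"]:
        logger.warning("Retrieval returned %d docs for query: %r", len(docs), query)
    return docs, record


# Usage with a stand-in retriever that finds nothing
docs, record = monitored_retrieve(lambda q: [], "What is the warranty on laptops?")
print(record["retrieval_failed"])  # True
```

With the pipeline from this tutorial, retrieve_fn could be the retriever's invoke method; the returned records can be shipped to whatever metrics store you already use.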

What chunk size should I use for RAG?

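A good default for Q&A is 512 tokens with 50-100 tokens of overlap; use 1024 for technical documents that need more surrounding context. Note that the splitter in Part 1 uses length_function=len, so its chunk_size=512 counts characters, not tokens. Parent document retrieval gives you the best of both: small chunks for precise matching, larger parents for context. Test on your own dataset, because chunk size has a large impact on quality.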

How do I evaluate my RAG system?

Use the RAGAS metrics: faithfulness (is the answer grounded in the context), context precision (is the retrieved content relevant), context recall (was all the needed information retrieved), and answer relevancy (does the answer address the question). Create 50-100 (question, ground truth) test pairs from your domain and run RAGAS against them before and after each change.

What is parent document retrieval?

Retrieve small chunks (256 tokens) for precision, return the larger parent chunk (1024+ tokens) for context. Avoids the trade-off between retrieval precision and answer quality. Implemented as ParentDocumentRetriever in LangChain.
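
The idea can be sketched without any framework. The example below is an illustration of the concept, not LangChain's ParentDocumentRetriever: it approximates tokens with whitespace-separated words, uses simple word overlap in place of embeddings, and the names build_children and retrieve_parents are hypothetical.

```python
import re


def build_children(parents: dict[str, str], child_words: int = 6) -> list[tuple[str, str]]:
    """Split each parent into small word windows, remembering the parent each one came from."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            children.append((parent_id, " ".join(words[start:start + child_words])))
    return children


def retrieve_parents(query: str, children: list[tuple[str, str]],
                     parents: dict[str, str], k: int = 2) -> list[str]:
    """Rank the small children against the query, then return their deduplicated parents."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    ranked = sorted(
        children,
        key=lambda child: len(query_tokens & set(re.findall(r"\w+", child[1].lower()))),
        reverse=True,
    )
    seen, results = set(), []
    for parent_id, _ in ranked[:k]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


parents = {
    "warranty": "Laptops include a two-year warranty. Claims require the original receipt and the serial number.",
    "shipping": "Standard shipping takes 5-7 business days. Express shipping takes 2 days.",
}
children = build_children(parents)
print(retrieve_parents("original receipt", children, parents, k=1))
```

The query matches only the small child about receipts, yet the model receives the whole warranty passage, including the sentence about the two-year term.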

How do I handle tables and images in RAG?

Tables: use Unstructured.io or LlamaParse for complex PDFs. Never use a text splitter that splits a table row. Images: extract and describe with GPT-4 Vision, store descriptions as searchable text. LlamaParse ($0.003/page) is worth the cost for complex document layouts.
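
One way to avoid splitting a table is to chunk at the element level and treat tables as atomic. The sketch below assumes the parser yields (category, text) pairs with tables labeled "Table", which is how I understand Unstructured's element mode to work; treat that label, the function name pack_elements, and the sample data as assumptions.

```python
def pack_elements(elements: list[tuple[str, str]], max_chars: int = 512) -> list[str]:
    """Pack (category, text) elements into chunks; a "Table" element is never split or merged."""
    chunks: list[str] = []
    current = ""
    for category, text in elements:
        if category == "Table":
            if current:  # flush pending text so the table stays separate
                chunks.append(current)
                current = ""
            chunks.append(text)  # emitted whole, even if longer than max_chars
            continue
        if current and len(current) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks


elements = [
    ("NarrativeText", "Intro text."),
    ("Table", "Model | Warranty\nA | 1 year\nB | 2 years"),
    ("NarrativeText", "Closing note."),
]
print(pack_elements(elements, max_chars=20))
# ['Intro text.', 'Model | Warranty\nA | 1 year\nB | 2 years', 'Closing note.']
```

For brevity this sketch also keeps an oversized text element whole; in practice you would pass long non-table elements through a text splitter.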
