What chunk size should I use for RAG?

There is no universal answer: chunk size depends on your document type and query patterns.

Smaller chunks (256-512 tokens) give better precision, because retrieval can land on the exact relevant sentence, but they do worse on questions that need context spread across sentences. Larger chunks (1024-2048 tokens) carry more context per chunk and suit synthesis questions, at the cost of lower precision.

Good defaults: 512 tokens with 50-100 tokens of overlap for general Q&A, and 1024 tokens for technical documentation where context matters. Parent document retrieval gets the best of both: retrieve small chunks for precision, then return the parent chunk for context. Test on your own dataset; chunk size often has a larger impact than most other parameters.
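
To make "512 tokens with overlap" concrete, here is a minimal sketch of a sliding-window chunker. The function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokens, which is an assumption; a production system would count tokens with the embedding model's tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace "tokens" are a stand-in for a real tokenizer (assumption).
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words, chunk_size=512, overlap=64)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.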

How do I evaluate my RAG system?

RAGAS (RAG Assessment) provides four core metrics:
  → Faithfulness: are answers grounded in the retrieved context, or hallucinated?
  → Context Precision: what fraction of the retrieved context is actually relevant?
  → Context Recall: what fraction of the needed information was retrieved?
  → Answer Relevancy: does the answer address the question?

Create a test set of 50-100 (question, ground truth answer) pairs from your domain, run your RAG pipeline on it, and score the results with RAGAS. As rules of thumb, faithfulness below 0.7 suggests hallucination is happening, and context precision below 0.5 suggests too much irrelevant content is being retrieved. A/B test changes against this test set before deploying.
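
The thresholds and the A/B step above can be turned into a small check that runs after every evaluation. This is a sketch only: the helper names `diagnose` and `compare` are hypothetical, and the score dictionaries are made-up illustrative numbers, not measured results.

```python
def diagnose(scores: dict[str, float]) -> list[str]:
    """Translate metric scores into the rule-of-thumb warnings from the text."""
    warnings = []
    if scores.get("faithfulness", 1.0) < 0.7:
        warnings.append("faithfulness < 0.7: answers are likely not grounded in the context")
    if scores.get("context_precision", 1.0) < 0.5:
        warnings.append("context_precision < 0.5: too much irrelevant content is retrieved")
    return warnings


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric change (candidate minus baseline) for an A/B comparison."""
    shared = baseline.keys() & candidate.keys()
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


# Illustrative numbers only (assumption), e.g. before and after a retrieval change.
baseline = {"faithfulness": 0.65, "context_precision": 0.55}
candidate = {"faithfulness": 0.82, "context_precision": 0.45}

print(diagnose(baseline))
print(diagnose(candidate))
print(compare(baseline, candidate))
```

In this made-up case the candidate fixes faithfulness but regresses context precision, which is the kind of trade-off a single aggregate score would hide.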

What is parent document retrieval?

Parent document retrieval maintains two granularities: small child chunks (around 256 tokens) for precise retrieval, and larger parent chunks (1024+ tokens) for context. At query time, the small chunks are searched by similarity, and the parent chunk that contains each match is returned. Small chunks improve retrieval precision because a match pinpoints where the answer is; parent chunks give the model enough surrounding context to answer well. This avoids the trade-off between retrieval precision (short chunks) and answer quality (long chunks). LangChain implements it as ParentDocumentRetriever.
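
The mechanism can be sketched without any library. The code below is an illustration under stated assumptions, not LangChain's implementation: words stand in for tokens, word overlap stands in for embedding similarity, and all function names are hypothetical.

```python
def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents: list[str], parent_size: int = 1024, child_size: int = 256):
    """Return (parents, children), where each child remembers its parent's id."""
    parents: list[str] = []
    children: list[tuple[str, int]] = []
    for doc in documents:
        for parent in split_words(doc, parent_size):
            parent_id = len(parents)
            parents.append(parent)
            for child in split_words(parent, child_size):
                children.append((child, parent_id))
    return parents, children


def overlap_score(query: str, text: str) -> int:
    """Toy similarity: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents, children, k: int = 2) -> list[str]:
    """Search the small chunks, but return the (deduplicated) parent chunks."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c[0]), reverse=True)
    seen, results = set(), []
    for child_text, parent_id in ranked:
        if overlap_score(query, child_text) == 0:
            break  # remaining children do not match at all
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results


documents = [
    " ".join(f"filler{i}" for i in range(30))
    + " warranty covers laptops for two years "
    + " ".join(f"pad{i}" for i in range(30)),
    "standard shipping takes five to seven business days",
]
parents, children = build_index(documents, parent_size=40, child_size=10)
results = retrieve_parents("laptops warranty", parents, children)
print(len(results), len(results[0].split()))
```

The match is found in a 10-word child, yet the caller receives the 40-word parent around it, which is the whole point of the technique.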

How do I handle tables and images in RAG?

Tables in PDFs are notoriously hard to parse correctly. Options:
  → Unstructured.io library: extracts tables as structured data and preserves layout.
  → LlamaParse: commercial tool that handles complex PDF layouts well.
  → Camelot or pdfplumber: for table-heavy documents.

For images, extract them with pdfminer, then use GPT-4 Vision to describe each image and add the descriptions as text chunks, so they are stored as searchable text alongside the visual content.

For production, LlamaParse ($0.003/page) is often worth the cost for complex documents. For simple text PDFs, PyPDFLoader is sufficient. Never use a text splitter that can split a table row in half.
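
One way to respect the "never split a row" rule is to chunk tables by whole rows and repeat the header in every chunk. The sketch below assumes the table has already been extracted into a header and a list of rows (for example by one of the tools above); `chunk_table` and the sample data are hypothetical.

```python
def chunk_table(header: list[str], rows: list[list], max_rows: int = 20) -> list[str]:
    """Render a table as text chunks that keep rows whole and repeat the header."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")

    def render(cells: list) -> str:
        # Extractors may return None for empty cells, so normalise to "".
        return "| " + " | ".join("" if c is None else str(c) for c in cells) + " |"

    chunks = []
    for start in range(0, len(rows), max_rows):
        lines = [render(header)]
        lines += [render(row) for row in rows[start:start + max_rows]]
        chunks.append("\n".join(lines))
    return chunks


# Made-up sample table (assumption) to show the chunk boundaries.
header = ["SKU", "Price"]
rows = [["A1", "10"], ["A2", "12"], ["A3", None], ["A4", "9"], ["A5", "30"]]
table_chunks = chunk_table(header, rows, max_rows=2)
for chunk in table_chunks:
    print(chunk, end="\n\n")
```

Because every chunk carries the header, a retrieved fragment still tells the model what each column means.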


RAG System Tutorial: Build a Production Retrieval-Augmented Generation System

⚡ Quick Answer

RAG system tutorial — build a production-ready retrieval-augmented generation system with document ingestion, hybrid search, reranking, and evaluation from scratch in Python.

AiTechWorlds Team May 27, 2026 7 min read


My first RAG system worked great on the documents I tested during development. In production, 30% of user queries got wrong answers — not because the model hallucinated, but because the right chunks weren't being retrieved.

The gap between a RAG prototype and a production system is significant. This tutorial builds the full stack: document processing, hybrid retrieval, reranking, evaluation, and monitoring. Each component addresses a specific failure mode I encountered in real deployments.

Architecture Overview

Production RAG System:

Document Ingestion Pipeline:
  → Parse PDFs/HTML/DOCX (preserve structure)
  → Chunk with overlap
  → Generate dense embeddings
  → Generate sparse BM25 index
  → Store in vector database with metadata

Query Pipeline:
  → Query preprocessing
  → Dense retrieval (semantic)
  → Sparse retrieval (keyword)
  → Fusion (RRF or weighted)
  → Reranking (cross-encoder)
  → Context assembly
  → LLM generation with prompt
  → Response streaming

Evaluation Layer:
  → Faithfulness score
  → Context precision
  → Answer relevancy
  → Latency tracking

Part 1: Document Processing

# pip install langchain langchain-community langchain-openai langchain-chroma unstructured pdfplumber pypdf rank_bm25 beautifulsoup4

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredPDFLoader,  # Better for complex PDFs
    WebBaseLoader,
    DirectoryLoader
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
import logging

logger = logging.getLogger(__name__)

class DocumentProcessor:
    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 64):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,  # counts characters, so chunk_size=512 means 512 characters, not tokens
            separators=["\n\n", "\n", ". ", " ", ""],
        )
    
    def load_pdf(self, file_path: str, use_unstructured: bool = False):
        """Load PDF with appropriate loader."""
        if use_unstructured:
            # Better for complex layouts, tables, multi-column
            loader = UnstructuredPDFLoader(
                file_path,
                mode="elements",  # Preserves tables as separate elements
                strategy="hi_res"  # Better accuracy, slower
            )
        else:
            loader = PyPDFLoader(file_path)
        
        return loader.load()
    
    def process_documents(self, file_paths: list[str]) -> list:
        all_chunks = []
        
        for path in file_paths:
            logger.info(f"Processing: {path}")
            
            ext = Path(path).suffix.lower()
            if ext == ".pdf":
                docs = self.load_pdf(path)
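            elif ext in [".html", ".htm"]:
                # WebBaseLoader fetches URLs; a local HTML file needs a file-based loader
                from langchain_community.document_loaders import BSHTMLLoader
                docs = BSHTMLLoader(path).load()
            else:
                from langchain_community.document_loaders import TextLoader
                docs = TextLoader(path).load()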
            
            chunks = self.splitter.split_documents(docs)
            
            # Add source metadata
            for i, chunk in enumerate(chunks):
                chunk.metadata.update({
                    "source": path,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                })
            
            all_chunks.extend(chunks)
            logger.info(f"  Created {len(chunks)} chunks from {path}")
        
        logger.info(f"Total chunks: {len(all_chunks)}")
        return all_chunks

Part 2: Hybrid Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package
from langchain.retrievers import EnsembleRetriever

class HybridVectorStore:
    def __init__(self, chunks: list, persist_dir: str = "./rag_db"):
        self.chunks = chunks
        self.persist_dir = persist_dir
        
        # Dense (semantic) retriever
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=persist_dir
        )
        
        # Sparse (keyword) retriever — BM25
        self.bm25_retriever = BM25Retriever.from_documents(chunks)
        self.bm25_retriever.k = 10
        
        # Dense retriever
        self.dense_retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": 10}
        )
        
        # Hybrid: 40% BM25, 60% semantic
        self.hybrid_retriever = EnsembleRetriever(
            retrievers=[self.bm25_retriever, self.dense_retriever],
            weights=[0.4, 0.6]
        )
    
    def retrieve(self, query: str, k: int = 6) -> list:
        """Retrieve top-k most relevant chunks."""
        docs = self.hybrid_retriever.invoke(query)
        return docs[:k]  # EnsembleRetriever returns merged, deduplicated results
    
    @classmethod
    def load(cls, persist_dir: str, chunks: list):
        """Load existing vector store."""
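        instance = cls.__new__(cls)
        instance.chunks = chunks
        instance.persist_dir = persist_dir
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        instance.vectorstore = Chroma(
            persist_directory=persist_dir,
            embedding_function=embeddings
        )
        # BM25 is not persisted here, so it is rebuilt in memory from the chunks
        instance.bm25_retriever = BM25Retriever.from_documents(chunks)
        instance.bm25_retriever.k = 10  # match __init__; otherwise BM25 uses its smaller default k
        instance.dense_retriever = instance.vectorstore.as_retriever(search_kwargs={"k": 10})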
        instance.hybrid_retriever = EnsembleRetriever(
            retrievers=[instance.bm25_retriever, instance.dense_retriever],
            weights=[0.4, 0.6]
        )
        return instance
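
To show what the 40/60 weights do, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF), the fusion method named in the architecture overview. As far as I know EnsembleRetriever uses a weighted variant of RRF, but treat this as an illustration rather than its exact implementation; the function name, the constant `c=60`, and the document ids are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")

    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical document ids, best match first in each list.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d4", "d3"]
fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
print(fused)
```

RRF only looks at ranks, never at raw scores, so BM25 scores and cosine similarities do not need to be normalised onto a common scale before fusing. Documents found by both retrievers (here `d1` and `d3`) rise above documents found by only one.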

Part 3: Reranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

def create_reranking_retriever(base_retriever, top_n: int = 4):
    """Wrap retriever with cross-encoder reranking."""
    
    # Cross-encoder is more accurate than bi-encoder for ranking
    # First retrieve more candidates with fast bi-encoder, then rerank
    reranker = HuggingFaceCrossEncoder(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
    )
    compressor = CrossEncoderReranker(model=reranker, top_n=top_n)
    
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

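# Usage: BM25 and dense retrieval each return 10 candidates; the reranker keeps the top 4
# (hybrid_store is the HybridVectorStore instance built from your chunks)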
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4
)
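
The two-stage pattern behind this wrapper (a cheap retriever over-fetches, a slower scorer reorders) can be sketched with the scorer passed in as a function. `toy_score` below is a deliberately crude stand-in for a cross-encoder and is an assumption, as are the function names and sample texts.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 4) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n highest."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of the candidate's words that appear in the query."""
    query_words = set(query.lower().split())
    words = text.lower().split()
    return sum(1 for w in words if w in query_words) / max(len(words), 1)


candidates = [
    "shipping takes five days",
    "laptop warranty lasts two years",
    "warranty claims need a receipt",
    "our office is closed on sunday",
]
top = rerank("laptop warranty", candidates, toy_score, top_n=2)
print(top)
```

A real cross-encoder reads the query and candidate together, which is why it is more accurate than comparing two independent embeddings, and also why it is only affordable on a short candidate list.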

Part 4: Generation with Citations

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

class RAGPipeline:
    def __init__(self, retriever, model: str = "gpt-4o-mini"):
        self.retriever = retriever
        self.llm = ChatOpenAI(model=model, temperature=0)
        
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful assistant that answers questions based on provided documents.

Rules:
1. Answer ONLY from the provided context
2. If the answer isn't in the context, say "I don't have that information in the documents."
3. Cite sources using [Doc 1], [Doc 2] format
4. Be concise and specific

Context:
{context}"""),
            ("human", "{question}")
        ])
        
        self.output_parser = StrOutputParser()
    
    def format_docs_with_sources(self, docs) -> str:
        formatted = []
        for i, doc in enumerate(docs):
            source = doc.metadata.get("source", "Unknown")
            # PyPDFLoader page numbers start at 0, so test for None rather than truthiness
            page = doc.metadata.get("page")
            formatted.append(
                f"[Doc {i+1}] (Source: {source}{', Page ' + str(page) if page is not None else ''})\n"
                f"{doc.page_content}"
            )
        return "\n\n---\n\n".join(formatted)
    
    def query(self, question: str) -> dict:
        # Retrieve
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        
        # Generate
        chain = self.prompt | self.llm | self.output_parser
        answer = chain.invoke({"question": question, "context": context})
        
        return {
            "answer": answer,
            "sources": [doc.metadata.get("source") for doc in docs],
            "retrieved_chunks": len(docs)
        }
    
    def stream_query(self, question: str):
        docs = self.retriever.invoke(question)
        context = self.format_docs_with_sources(docs)
        
        chain = self.prompt | self.llm | self.output_parser
        
        for chunk in chain.stream({"question": question, "context": context}):
            yield chunk
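
Since the prompt asks for [Doc N] citations, it is cheap to verify afterwards that the model actually cited something and that every cited number refers to a retrieved chunk. A minimal sketch, with a hypothetical helper name and a made-up answer string:

```python
import re


def check_citations(answer: str, num_docs: int) -> dict:
    """Extract [Doc N] citations and flag numbers outside 1..num_docs."""
    cited = sorted({int(n) for n in re.findall(r"\[Doc (\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_docs]
    return {"cited": cited, "invalid": invalid, "has_citation": bool(cited)}


# Made-up answer (assumption): four chunks were retrieved, but Doc 5 is cited.
answer = "Laptops have a one-year warranty [Doc 2]. Extended plans exist [Doc 5]."
report = check_citations(answer, num_docs=4)
print(report)
```

An answer with no citations, or with a citation to a document that was never in the context, is a useful signal to log even when it does not block the response.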

Part 5: Evaluation with RAGAS

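# pip install ragas datasets
# Note: the column names and metric imports below follow the older ragas API; newer releases may use different names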

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

def evaluate_rag_system(rag_pipeline, test_questions: list[dict]) -> dict:
    """
    test_questions: list of {"question": "...", "ground_truth": "..."}
    """
    
    results = []
    for item in test_questions:
        result = rag_pipeline.query(item["question"])
        results.append({
            "question": item["question"],
            "answer": result["answer"],
            "contexts": [
                doc.page_content 
                for doc in rag_pipeline.retriever.invoke(item["question"])
            ],
            "ground_truth": item["ground_truth"]
        })
    
    # Create evaluation dataset
    dataset = Dataset.from_list(results)
    
    # Evaluate
    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )
    
    return {
        "faithfulness": scores["faithfulness"],
        "answer_relevancy": scores["answer_relevancy"],
        "context_precision": scores["context_precision"],
        "context_recall": scores["context_recall"],
        "num_evaluated": len(test_questions)
    }

# Example test set
test_set = [
    {
        "question": "What is the return policy for electronics?",
        "ground_truth": "Electronics can be returned within 30 days of purchase with original packaging."
    },
    {
        "question": "How long does standard shipping take?",
        "ground_truth": "Standard shipping takes 5-7 business days."
    },
]

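# rag_pipeline is a RAGPipeline instance (built as `rag` in "Putting It All Together")
scores = evaluate_rag_system(rag_pipeline, test_set)
print(f"Faithfulness: {scores['faithfulness']:.2f}")  # Target: > 0.8
print(f"Answer Relevancy: {scores['answer_relevancy']:.2f}")  # Target: > 0.85
print(f"Context Precision: {scores['context_precision']:.2f}")  # Target: > 0.7
print(f"Context Recall: {scores['context_recall']:.2f}")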
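
The Evaluation Layer in the architecture overview also lists latency tracking, which RAGAS does not measure. A minimal sketch of a timing wrapper follows; the class name is hypothetical, and the lambda stands in for a call such as `rag.query`.

```python
import math
import statistics
import time


class LatencyTracker:
    """Record wall-clock durations of calls and summarise them."""

    def __init__(self) -> None:
        self.samples: list[float] = []

    def timed(self, fn, *args, **kwargs):
        """Call fn, store how long it took in seconds, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.samples.append(time.perf_counter() - start)
        return result

    def summary(self) -> dict:
        if not self.samples:
            return {"count": 0}
        ordered = sorted(self.samples)
        # Nearest-rank 95th percentile
        p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
        return {
            "count": len(ordered),
            "mean": statistics.mean(ordered),
            "p95": ordered[p95_index],
            "max": ordered[-1],
        }


tracker = LatencyTracker()
# Stand-in for tracker.timed(rag.query, question) (assumption).
answers = [tracker.timed(lambda q: q.upper(), q) for q in ["warranty?", "shipping?"]]
stats = tracker.summary()
print(stats["count"], f"{stats['p95']:.6f}s")
```

Tracking a high percentile alongside the mean matters here because reranking and generation produce a long tail that an average hides.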

Putting It All Together

# Complete pipeline initialization

# 1. Process documents
processor = DocumentProcessor(chunk_size=512, chunk_overlap=64)
chunks = processor.process_documents(["./docs/manual.pdf", "./docs/faq.pdf"])

# 2. Build hybrid vector store
hybrid_store = HybridVectorStore(chunks, persist_dir="./rag_production_db")

# 3. Add reranking
reranking_retriever = create_reranking_retriever(
    hybrid_store.hybrid_retriever,
    top_n=4
)

# 4. Create RAG pipeline
rag = RAGPipeline(retriever=reranking_retriever, model="gpt-4o-mini")

# 5. Query
result = rag.query("What is the warranty on laptops?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

# 6. Stream
for chunk in rag.stream_query("How do I contact customer support?"):
    print(chunk, end="", flush=True)

Conclusion

A production RAG system is built in layers — each one addressing a specific failure mode. Start with the basic retrieval + generation pipeline, measure with RAGAS, then add hybrid search, reranking, and better document processing where the metrics show gaps.

The reranking step alone typically improves answer quality by 15-25% in my experience. It's the single highest-ROI improvement after switching from basic semantic search to hybrid retrieval.

For the vector database underlying this system, see our vector database guide. For the RAG architecture concepts, see our RAG explained guide.

Frequently Asked Questions

A production RAG system needs more than basic document retrieval: quality document processing (handle PDFs, tables, images, not just plain text), hybrid search (dense + sparse), reranking for precision, context construction (how you format retrieved chunks matters), evaluation metrics (faithfulness, relevancy, precision), monitoring (what queries fail to retrieve relevant context), and cost management (cache frequent queries, use cheaper models for classification). Most RAG tutorials show a working prototype; production requires adding each of these layers systematically.
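
For the cost-management point, caching frequent queries can start as simply as an in-memory LRU cache keyed by a normalised question. This is a sketch under assumptions: the class name is hypothetical, `fake_pipeline` stands in for a real pipeline call, and exact-match caching will not catch paraphrases.

```python
from collections import OrderedDict
from typing import Callable


class QueryCache:
    """Small LRU cache keyed by the lower-cased, whitespace-normalised question."""

    def __init__(self, max_size: int = 256) -> None:
        self.max_size = max_size
        self.hits = 0
        self.misses = 0
        self._store: OrderedDict[str, str] = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        return " ".join(question.lower().split())

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = self._key(question)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(question)
        self._store[key] = value
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value


calls: list[str] = []


def fake_pipeline(question: str) -> str:
    """Stand-in for an expensive RAG call (assumption)."""
    calls.append(question)
    return f"answer to: {question}"


cache = QueryCache(max_size=2)
first = cache.get_or_compute("What is RAG?", fake_pipeline)
second = cache.get_or_compute("  what is   rag? ", fake_pipeline)
print(first == second, len(calls), cache.hits, cache.misses)
```

Cached answers go stale when the underlying documents change, so in practice the cache needs to be cleared or versioned whenever the index is rebuilt.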


