How to Build a RAG Pipeline with LangChain (Step-by-Step)

⚡ Quick Answer

Build a complete RAG pipeline with LangChain, Chroma, and OpenAI embeddings — document loading, chunking, vector storage, and retrieval in one guide.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG #Chroma #OpenAI Embeddings #Vector Store

Retrieval-Augmented Generation is one of those patterns that sounds complex until you see it in practice. Then it clicks. You have documents. You want to ask questions about them. You can't fit all the documents in a single prompt. So you embed them, store the embeddings, retrieve the relevant ones at query time, and include only those in the prompt.

That's RAG. The complexity is in the details — how you split documents, which embedding model you use, how you tune retrieval, what you do when retrieval fails.

I've built RAG systems for document search, internal knowledge bases, customer support automation, and research assistants. The pattern is the same each time, but the tuning decisions matter enormously. This guide walks through the full pipeline with working code, then covers the decisions that actually move the needle.

For context on where RAG fits in the broader LangChain ecosystem, check the LangChain tutorial 2025 first if you're new to the framework.

What Makes a Good RAG System

Most RAG tutorials stop at "it returns answers." Production RAG needs to:

Return accurate answers (not hallucinations)
Return relevant answers (right context retrieved)
Handle edge cases gracefully (no relevant docs, ambiguous queries)
Be fast enough for user-facing apps (sub-2 second ideally)
Be cheap enough to run at scale

These goals sometimes conflict. Better accuracy often means slower retrieval. Better coverage means more embedding costs. The pipeline we build here optimizes for accuracy and clarity first, then covers optimization options.

According to a 2024 survey published on arXiv, RAG-based systems reduced hallucination rates by 40-60% compared to pure generation across multiple benchmark tasks. It's not a cure-all, but it's the most practical way to ground LLM responses in real data.

Setting Up Dependencies

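pip install langchain langchain-openai langchain-community langchain-chroma langchain-text-splitters
pip install chromadb python-dotenv pypdf beautifulsoup4 tqdm
pip install langchain-experimental  # only needed for SemanticChunker later in the guide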

from dotenv import load_dotenv
import os

load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "Missing OPENAI_API_KEY"

Step 1: Loading Documents

Before you can embed anything, you need to load your documents. LangChain has loaders for PDFs, HTML pages, Word docs, CSVs, YouTube transcripts, and dozens of other formats.

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders.text import TextLoader

# Load a single PDF
pdf_loader = PyPDFLoader("./documents/research_paper.pdf")
pages = pdf_loader.load()
print(f"Loaded {len(pages)} pages from PDF")
print(f"First page preview: {pages[0].page_content[:200]}")

# Load all PDFs from a directory
dir_loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} document pages total")

# Loading from text files
text_loader = TextLoader("./documents/notes.txt", encoding="utf-8")
text_docs = text_loader.load()

# Loading from web pages
from langchain_community.document_loaders import WebBaseLoader
import bs4

web_loader = WebBaseLoader(
    web_paths=["https://python.langchain.com/docs/introduction"],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("article", "main-content"))
    )
)
web_docs = web_loader.load()

Each loaded document is a Document object with two main attributes: page_content (the text) and metadata (source, page number, etc.). That metadata becomes important during retrieval.

Step 2: Splitting Documents into Chunks

You can't embed a 50-page PDF as one unit. You split it into chunks that are small enough to retrieve individually but large enough to contain meaningful context.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Characters per chunk
    chunk_overlap=200,      # Overlap between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)

# Split the loaded documents
chunks = splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} pages into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")

# Preview a chunk and its metadata
print(f"\nChunk 0 content:\n{chunks[0].page_content}")
print(f"\nChunk 0 metadata: {chunks[0].metadata}")

Choosing the Right Splitter

The RecursiveCharacterTextSplitter is the default choice for most document types. For specialized content, use the purpose-built splitters:

# For markdown documents
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)

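# markdown_content is a placeholder here; replace it with your own Markdown text
markdown_content = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall steps."
md_chunks = markdown_splitter.split_text(markdown_content)
# Each chunk retains header metadata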

# For Python/code files
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=100
)

The overlap parameter (200 in our example) keeps information that spans a chunk boundary from being lost. With chunk_size=1000 and chunk_overlap=200, chunk 1 starts around character 800, so a sentence running from position 950 to 1020 is cut off at the end of chunk 0 but appears in full in chunk 1.
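
To make the overlap concrete, here is a minimal sketch of a fixed-size sliding window in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (the real splitter prefers separator boundaries, so chunk edges shift). The helper names `sliding_chunks` and `fully_inside` are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Return (start, end) character offsets for fixed-size overlapping chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        spans.append((start, end))
        if end == len(text):
            break
        start += step
    return spans


def fully_inside(span, sentence_start: int, sentence_end: int) -> bool:
    """True if the sentence lies entirely within the chunk span."""
    return span[0] <= sentence_start and sentence_end <= span[1]


spans = sliding_chunks("x" * 2500)
print(spans)

# A sentence occupying characters 950..1020 is cut off in chunk 0
# but sits fully inside chunk 1.
print(fully_inside(spans[0], 950, 1020), fully_inside(spans[1], 950, 1020))
```

The step between chunk starts is chunk_size minus chunk_overlap, which is why a larger overlap produces more chunks for the same text.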

Step 3: Creating Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts end up close in vector space. This is what makes retrieval possible.
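
As a rough illustration of "close in vector space", the sketch below ranks a few texts by cosine similarity to a query. The three-dimensional vectors are made up for the example; real embedding models return hundreds or thousands of dimensions, and the texts and numbers here are assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
toy_vectors = {
    "python memory management": [0.9, 0.1, 0.0],
    "garbage collection in python": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]

ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, toy_vectors[text]):.3f}  {text}")
```

The two Python-related texts score close to 1.0 while the unrelated one scores near 0, which is the property a vector store relies on when it returns the nearest chunks.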

from langchain_openai import OpenAIEmbeddings

# OpenAI's latest embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Good balance of quality and cost
    # model="text-embedding-3-large",  # Better quality, higher cost
)

# Test the embedding model
test_text = "What is Python?"
embedding_vector = embeddings.embed_query(test_text)
print(f"Embedding dimensions: {len(embedding_vector)}")  # 1536 for small, 3072 for large

The text-embedding-3-small model produces 1536-dimensional vectors and costs roughly $0.02 per million tokens. For most projects, it's the right choice. Switch to text-embedding-3-large only if you see retrieval quality problems that persist after you have tuned chunking and retrieval with the smaller model.
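
For a back-of-the-envelope budget, the arithmetic can be sketched as below. It uses the $0.02 per million tokens figure quoted above and assumes roughly 4 characters per token, which is only a rule of thumb for English prose; the function name is hypothetical and actual token counts depend on the tokenizer.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the figure quoted in the text
CHARS_PER_TOKEN = 4              # assumption: rough average for English prose


def estimate_embedding_cost(num_chunks: int, avg_chunk_chars: int) -> float:
    """Approximate one-time cost in USD of embedding a corpus."""
    total_tokens = num_chunks * avg_chunk_chars / CHARS_PER_TOKEN
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


cost = estimate_embedding_cost(num_chunks=10_000, avg_chunk_chars=1000)
print(f"~${cost:.2f}")  # ~$0.05
```

Under these assumptions, 10,000 chunks of about 1,000 characters come to around 2.5 million tokens, so the initial embedding pass costs cents rather than dollars.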

Step 4: Storing in a Vector Database

Now we store the chunks as embeddings. Chroma is the easiest option for local development and small production deployments.

from langchain_chroma import Chroma

# Create vector store from documents (embeds and stores in one step)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # Persist to disk
    collection_name="my_documents"
)

print(f"Stored {vectorstore._collection.count()} chunks in Chroma")

# Loading an existing vector store (after the first run)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_documents"
)

# Test similarity search directly (Chroma returns a distance: lower means more similar)
results = vectorstore.similarity_search_with_score(
    "How does Python handle memory management?",
    k=3
)
for i, (doc, score) in enumerate(results):
    print(f"\n[Result {i+1}] Score: {score:.4f} | Metadata: {doc.metadata}")
    print(doc.page_content[:200])

Adding New Documents to an Existing Store

new_docs = pdf_loader.load()
new_chunks = splitter.split_documents(new_docs)

# Add to existing store without recreating
vectorstore.add_documents(new_chunks)
print(f"Now have {vectorstore._collection.count()} total chunks")
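
Re-running the snippet above on a file that is already in the store would embed and insert its chunks a second time. One common way to guard against that is to derive a deterministic ID from each chunk's source and content. The sketch below shows only the ID logic in plain Python; `chunk_id` is a hypothetical helper, and passing the IDs through `add_documents(..., ids=...)` is shown as a comment because how an existing ID is handled can vary by vector store and version.

```python
import hashlib


def chunk_id(source: str, content: str) -> str:
    """Stable ID: the same source and text always map to the same string."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]


# (source, text) pairs standing in for chunk.metadata["source"] and chunk.page_content
demo_chunks = [
    ("report.pdf", "First chunk"),
    ("report.pdf", "Second chunk"),
    ("report.pdf", "First chunk"),  # accidental duplicate
]

ids = [chunk_id(source, text) for source, text in demo_chunks]

# Keep the first chunk seen for each ID
unique_chunks = {}
for cid, chunk in zip(ids, demo_chunks):
    unique_chunks.setdefault(cid, chunk)

print(len(demo_chunks), "chunks ->", len(unique_chunks), "unique")

# With a real store this might look like (assumption, verify against your version):
# vectorstore.add_documents(new_chunks, ids=[
#     chunk_id(c.metadata.get("source", ""), c.page_content) for c in new_chunks
# ])
```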

Step 5: Building the Retriever

The retriever is the component that takes a query and returns the most relevant chunks.

# Basic similarity retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 chunks
)

# MMR (Maximal Marginal Relevance) — reduces redundancy in results
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,
        "fetch_k": 20,      # Fetch 20 candidates
        "lambda_mult": 0.7  # 0=max diversity, 1=max similarity
    }
)

# Similarity with score threshold
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7, "k": 4}
)

# Test retrieval
docs = retriever.invoke("What is the main contribution of this paper?")
print(f"Retrieved {len(docs)} chunks")
for doc in docs:
    print(f"\nSource: {doc.metadata.get('source', 'unknown')}")
    print(f"Content: {doc.page_content[:150]}...")

MMR is often better than pure similarity retrieval because it penalizes redundant results. If your top 4 similar chunks are all essentially the same paragraph, MMR will diversify the results to cover more ground.
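
The selection rule behind MMR can be sketched in a few lines of plain Python. This is an illustrative version of the scoring idea (relevance to the query minus similarity to what is already selected, weighted by lambda_mult), not LangChain's own implementation; the toy 2-D vectors and the function names are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.7):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing picked yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [
    [0.99, 0.10],  # 0: highly relevant
    [0.98, 0.12],  # 1: near-duplicate of 0
    [0.60, -0.80], # 2: less relevant, but covers a different direction
]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure similarity: [0, 1]
print(mmr(query, docs, k=2, lambda_mult=0.5))  # more diversity:   [0, 2]
```

With lambda_mult=1.0 the near-duplicate is returned alongside the top hit; lowering it to 0.5 swaps the duplicate for the chunk that adds new information.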

Step 6: The Complete RAG Chain

Now we put it all together into a full question-answering system:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# RAG prompt
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant that answers questions based on provided documents.
    
Answer the question using ONLY the information in the context below.
If the answer is not in the context, say "I don't have enough information to answer that."
Always mention which document/source you're drawing from when relevant.

Context:
{context}"""),
    ("human", "{question}")
])

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'Unknown')}]\n{doc.page_content}"
        for doc in docs
    )

# Full RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Ask questions
answer = rag_chain.invoke("What are the main findings of the research?")
print(answer)
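
The prompt asks the model to admit when the context lacks an answer, but when the retriever returns nothing at all (which can happen with a score threshold), you can skip the LLM call entirely. The sketch below shows that guard in plain Python. `answer_with_guard` and `generate` are hypothetical names: `generate` stands in for the prompt, model, and parser part of the chain, and the chunks are plain strings rather than Document objects.

```python
from typing import Callable, Sequence

FALLBACK = "I don't have enough information to answer that."


def answer_with_guard(
    question: str,
    chunks: Sequence[str],
    generate: Callable[[str, str], str],
) -> str:
    """Return a fixed fallback when retrieval is empty; otherwise call the generator."""
    if not chunks:
        return FALLBACK
    context = "\n\n---\n\n".join(chunks)
    return generate(context, question)


def fake_generate(context: str, question: str) -> str:
    # Placeholder for the real LLM call, used only to keep the example offline
    return f"ANSWER using {len(context)} characters of context"


print(answer_with_guard("Who are the authors?", [], fake_generate))
print(answer_with_guard("Who are the authors?", ["chunk one", "chunk two"], fake_generate))
```

Returning early saves a model call and guarantees a consistent message for the "no relevant documents" case.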

Adding Source Citations

For production systems, you usually need to know where answers come from:

from langchain_core.runnables import RunnablePassthrough
from typing import TypedDict, List
from langchain_core.documents import Document

class RAGResponse(TypedDict):
    question: str
    answer: str
    sources: List[Document]

# Chain that returns both answer and sources (retrieves once, then reuses the result)
rag_chain_with_sources = (
    RunnablePassthrough.assign(
        sources=lambda x: retriever.invoke(x["question"])
    )
    | RunnablePassthrough.assign(
        context=lambda x: format_docs(x["sources"])
    )
    | {
        "answer": rag_prompt | llm | StrOutputParser(),
        "sources": lambda x: x["sources"],
        "question": lambda x: x["question"]
    }
)

response = rag_chain_with_sources.invoke({"question": "Who are the authors?"})
print(f"Answer: {response['answer']}")
print(f"\nSources used:")
for doc in response['sources']:
    print(f"  - {doc.metadata.get('source')}, page {doc.metadata.get('page', 'N/A')}")
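
When several retrieved chunks come from the same page, the source list above repeats itself. A small sketch of collapsing it to unique (source, page) pairs while preserving retrieval order follows; the plain dicts stand in for each doc.metadata, and `unique_citations` is a hypothetical helper.

```python
def unique_citations(metadatas):
    """Collapse chunk metadata into ordered, de-duplicated (source, page) pairs."""
    seen = set()
    citations = []
    for meta in metadatas:
        key = (meta.get("source", "unknown"), meta.get("page", "N/A"))
        if key not in seen:
            seen.add(key)
            citations.append(key)
    return citations


# Example metadata as it might come back from a retriever (made-up values)
retrieved_metadata = [
    {"source": "paper.pdf", "page": 0},
    {"source": "paper.pdf", "page": 0},
    {"source": "paper.pdf", "page": 3},
    {"source": "notes.txt"},
]

for source, page in unique_citations(retrieved_metadata):
    print(f"  - {source}, page {page}")
```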

Vector Database Comparison for Local RAG

Choosing a vector database matters a lot for performance and cost. Here's my honest comparison of the main options for LangChain RAG:

Database	Hosting	Cost	ANN Algorithm	Metadata Filtering	Best For
FAISS	Local	Free	IVF / HNSW	Limited	Fast local dev, no server needed
Chroma	Local / Hosted	Free (self-host)	HNSW	Full	Dev and small-medium production
Pinecone	Cloud only	$0.08/million queries	Proprietary	Excellent	Large-scale production, SaaS
Qdrant	Local + Cloud	Free (self-host)	HNSW	Excellent	Production, complex filtering
Weaviate	Local + Cloud	Free (self-host)	HNSW	Excellent	Hybrid search, multi-modal
Milvus	Local + Cloud	Free (self-host)	HNSW / IVF	Good	High throughput, enterprise
PGVector	Local (Postgres)	Free (self-host)	HNSW / IVF	Excellent	Existing Postgres stack

For local RAG development: start with Chroma. It's the easiest to set up and works well up to a few hundred thousand documents. For production RAG at scale, Qdrant gives you the best combination of performance, filtering, and self-hosting flexibility. Our deeper vector database guide covers each one in more detail.

Tuning RAG Quality

Basic RAG gets you 60% of the way there. Tuning gets you the rest. Here are the highest-impact changes you can make.

Hypothetical Document Embeddings (HyDE)

HyDE generates a hypothetical answer first, then uses that to retrieve. It works surprisingly well for question-answering tasks.

from typing import List
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Generate hypothetical document
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short paragraph that would directly answer this question:\n{question}"
)
hyde_chain = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str) -> List[Document]:
    # Generate a hypothetical answer
    hypothetical = hyde_chain.invoke({"question": question})
    # Use it for retrieval instead of the raw question
    return retriever.invoke(hypothetical)

# Use in RAG chain
hyde_rag_chain = (
    {
        "context": lambda x: format_docs(hyde_retrieve(x["question"])),
        "question": lambda x: x["question"]
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

Semantic Chunking

Instead of fixed-size chunks, semantic chunking splits on meaning boundaries:

from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

semantic_chunks = semantic_splitter.split_documents(all_docs)
print(f"Semantic chunks: {len(semantic_chunks)}")

Semantic chunking produces better chunks for retrieval, at the cost of being slower to create. Worth it for document types with clear semantic structure.
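
The percentile breakpoint idea can be sketched without any library: measure the distance between each pair of neighboring sentence vectors and start a new chunk wherever that distance exceeds the chosen percentile. This is a simplified illustration under stated assumptions (toy 2-D vectors, no sentence buffering), not the SemanticChunker implementation, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def semantic_groups(sentences, vectors, pct=95):
    """Group consecutive sentences, breaking where the topic shifts sharply."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump: close the current chunk
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


sentences = ["Cats purr.", "Cats nap.", "Cats climb.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.05, 0.95]]  # made up
groups = semantic_groups(sentences, vectors)
print(groups)
```

A higher percentile means fewer, larger chunks because only the sharpest topic shifts count as boundaries.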

Multi-Query Retrieval

Generate multiple query variations and combine the retrieved results:

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=llm,
    include_original=True  # Include original query results too
)

# This automatically generates 3 query variations and deduplicates results
docs = multi_query_retriever.invoke("What methods did they use?")

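Multi-query retrieval typically improves recall. The cost is one extra LLM call to generate the query variations, plus a separate vector search for each variation (3-4x the retrieval work of a single query). For high-stakes Q&A where missing relevant context is worse than extra cost, it's usually worth it.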
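
The merge step is easy to picture in plain Python: run each query variation through the search function, then take the union of the results while dropping duplicates. In the sketch below, `fake_index` and the lookup-based `search` are stand-ins for a real retriever, and the function names are hypothetical.

```python
def unique_union(result_lists):
    """Flatten several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def multi_query_retrieve(queries, search):
    """Run every query variation and merge the de-duplicated results."""
    return unique_union([search(query) for query in queries])


# Stand-in for a vector store: each query maps to the chunks it would return
fake_index = {
    "What methods did they use?": ["chunk A", "chunk B"],
    "Which techniques were applied?": ["chunk B", "chunk C"],
    "How was the study conducted?": ["chunk A", "chunk D"],
}

merged = multi_query_retrieve(list(fake_index), lambda q: fake_index.get(q, []))
print(merged)  # ['chunk A', 'chunk B', 'chunk C', 'chunk D']
```

Three searches returned six results, but only four distinct chunks reach the prompt, which is where the recall gain comes from.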

For a full exploration of advanced retrieval patterns, see our LangChain advanced RAG strategies guide. The semantic search tutorial also covers the embedding fundamentals in more depth.

Building a Complete RAG App

Let's wrap everything into a clean, reusable class:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
from typing import List, Optional

class RAGPipeline:
    def __init__(
        self,
        persist_dir: str = "./rag_db",
        model: str = "gpt-4o-mini",
        embedding_model: str = "text-embedding-3-small",
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        k: int = 4
    ):
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.llm = ChatOpenAI(model=model, temperature=0)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        self.persist_dir = persist_dir
        self.k = k
        
        # Load or create vector store
        if Path(persist_dir).exists():
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing store with {self.vectorstore._collection.count()} chunks")
        else:
            self.vectorstore = None
            print("No existing store found. Add documents first.")
        
        self._build_chain()
    
    def add_documents(self, documents: list) -> None:
        chunks = self.splitter.split_documents(documents)
        
        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=chunks,
                embedding=self.embeddings,
                persist_directory=self.persist_dir
            )
        else:
            self.vectorstore.add_documents(chunks)
        
        print(f"Added {len(chunks)} chunks. Total: {self.vectorstore._collection.count()}")
        self._build_chain()
    
    def _build_chain(self) -> None:
        if self.vectorstore is None:
            return
        
        retriever = self.vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={"k": self.k, "fetch_k": self.k * 5}
        )
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """Answer questions using only the provided context.
If the answer isn't in the context, say so clearly.

Context: {context}"""),
            ("human", "{question}")
        ])
        
        def format_docs(docs):
            return "\n\n".join(d.page_content for d in docs)
        
        self.chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )
    
    def ask(self, question: str) -> str:
        if self.vectorstore is None:
            return "No documents loaded. Call add_documents() first."
        return self.chain.invoke(question)

# Usage
pipeline = RAGPipeline(persist_dir="./my_rag_db")

# Add documents
from langchain_community.document_loaders import PyPDFLoader
docs = PyPDFLoader("./my_document.pdf").load()
pipeline.add_documents(docs)

# Ask questions
print(pipeline.ask("What is the main topic of this document?"))
print(pipeline.ask("What are the key findings?"))

Conclusion

Building a RAG pipeline with LangChain involves five clear steps: load documents, split them into chunks, embed and store those chunks, build a retriever, then wire it into an LLM chain. The basic version takes maybe 50 lines of Python. The production version with proper error handling, metadata filtering, and retrieval tuning takes more work — but the scaffold is always the same.

The biggest quality improvements come from: better chunking strategy (semantic over character), MMR retrieval to reduce redundancy, and multi-query retrieval to improve recall. Start simple, measure your retrieval quality, then add complexity where it actually helps.

From here, explore the LangChain advanced RAG strategies guide for reranking, hybrid search, and contextual compression — the techniques that take a good RAG system to a great one.

Frequently Asked Questions

What chunk size should I use for RAG?

Start with 1000–1500 characters with 150–200 character overlap. Smaller chunks (500–800) work better for precise factual retrieval. Larger chunks (2000+) work better for complex reasoning tasks that need more context. Always experiment with your specific documents and measure retrieval quality.
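
To see how these settings affect index size, a quick estimate of the chunk count helps. The sketch assumes fixed-size windows (real splitters break on separators, so counts will differ somewhat) and a hypothetical 50-page PDF of about 3,000 characters per page; `estimate_chunk_count` is an illustrative helper.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate number of fixed-size chunks for a text of total_chars characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_chars <= chunk_size:
        return 1 if total_chars > 0 else 0
    step = chunk_size - chunk_overlap
    return math.ceil((total_chars - chunk_overlap) / step)


total_chars = 50 * 3000  # assumption: 50 pages at roughly 3,000 characters each

for size, overlap in [(500, 100), (1000, 200), (2000, 200)]:
    count = estimate_chunk_count(total_chars, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: ~{count} chunks")
```

Halving the chunk size roughly doubles the number of vectors to embed and store, which is worth weighing against the precision gain.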

How many documents should I retrieve per query (k value)?

k=3 to k=5 is a good starting point. Too few documents means missing relevant context. Too many means flooding the prompt with noise and increasing cost. Use reranking for large k values to filter down to the most relevant results.

Is Chroma good enough for production RAG, or do I need Pinecone?

Chroma is excellent for development and small-to-medium production workloads (under a million documents). For millions of documents, high availability requirements, or multi-tenant setups, a managed service like Pinecone or a self-hosted Qdrant/Weaviate deployment is worth the extra infrastructure investment.

Langchain

How to Build a RAG Pipeline with LangChain (Step-by-Step)

⚡ Quick Answer

Build a complete RAG pipeline with LangChain, Chroma, and OpenAI embeddings — document loading, chunking, vector storage, and retrieval in one guide.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG #Chroma #OpenAI Embeddings #Vector Store

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

That's RAG. The complexity is in the details — how you split documents, which embedding model you use, how you tune retrieval, what you do when retrieval fails.

For context on where RAG fits in the broader LangChain ecosystem, check the LangChain tutorial 2025 first if you're new to the framework.

What Makes a Good RAG System

Most RAG tutorials stop at "it returns answers." Production RAG needs to:

Return accurate answers (not hallucinations)
Return relevant answers (right context retrieved)
Handle edge cases gracefully (no relevant docs, ambiguous queries)
Be fast enough for user-facing apps (sub-2 second ideally)
Be cheap enough to run at scale

Setting Up Dependencies

pip install langchain langchain-openai langchain-community langchain-chroma
pip install chromadb python-dotenv pypdf

from dotenv import load_dotenv
import os

load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "Missing OPENAI_API_KEY"

Step 1: Loading Documents

Before you can embed anything, you need to load your documents. LangChain has loaders for PDFs, HTML pages, Word docs, CSVs, YouTube transcripts, and dozens of other formats.

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders.text import TextLoader

# Load a single PDF
pdf_loader = PyPDFLoader("./documents/research_paper.pdf")
pages = pdf_loader.load()
print(f"Loaded {len(pages)} pages from PDF")
print(f"First page preview: {pages[0].page_content[:200]}")

# Load all PDFs from a directory
dir_loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} document pages total")

# Loading from text files
text_loader = TextLoader("./documents/notes.txt", encoding="utf-8")
text_docs = text_loader.load()

# Loading from web pages
from langchain_community.document_loaders import WebBaseLoader
import bs4

web_loader = WebBaseLoader(
    web_paths=["https://python.langchain.com/docs/introduction"],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("article", "main-content"))
    )
)
web_docs = web_loader.load()

Each loaded document is a Document object with two main attributes: page_content (the text) and metadata (source, page number, etc.). That metadata becomes important during retrieval.

Step 2: Splitting Documents into Chunks

You can't embed a 50-page PDF as one unit. You split it into chunks that are small enough to retrieve individually but large enough to contain meaningful context.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Characters per chunk
    chunk_overlap=200,      # Overlap between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)

# Split the loaded documents
chunks = splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} pages into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")

# Preview a chunk and its metadata
print(f"\nChunk 0 content:\n{chunks[0].page_content}")
print(f"\nChunk 0 metadata: {chunks[0].metadata}")

Choosing the Right Splitter

The RecursiveCharacterTextSplitter is the default choice for most document types. For specialized content, use the purpose-built splitters:

# For markdown documents
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)

md_chunks = markdown_splitter.split_text(markdown_content)
# Each chunk retains header metadata

# For Python/code files
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=100
)

Step 3: Creating Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts end up close in vector space. This is what makes retrieval possible.

from langchain_openai import OpenAIEmbeddings

# OpenAI's latest embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Good balance of quality and cost
    # model="text-embedding-3-large",  # Better quality, higher cost
)

# Test the embedding model
test_text = "What is Python?"
embedding_vector = embeddings.embed_query(test_text)
print(f"Embedding dimensions: {len(embedding_vector)}")  # 1536 for small, 3072 for large

Step 4: Storing in a Vector Database

Now we store the chunks as embeddings. Chroma is the easiest option for local development and small production deployments.

from langchain_chroma import Chroma

# Create vector store from documents (embeds and stores in one step)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # Persist to disk
    collection_name="my_documents"
)

print(f"Stored {vectorstore._collection.count()} chunks in Chroma")

# Loading an existing vector store (after the first run)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_documents"
)

# Test similarity search directly
results = vectorstore.similarity_search(
    "How does Python handle memory management?",
    k=3
)
for i, doc in enumerate(results):
    print(f"\n[Result {i+1}] Score: {doc.metadata}")
    print(doc.page_content[:200])

Adding New Documents to an Existing Store

new_docs = pdf_loader.load()
new_chunks = splitter.split_documents(new_docs)

# Add to existing store without recreating
vectorstore.add_documents(new_chunks)
print(f"Now have {vectorstore._collection.count()} total chunks")

Step 5: Building the Retriever

The retriever is the component that takes a query and returns the most relevant chunks.

# Basic similarity retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 chunks
)

# MMR (Maximal Marginal Relevance) — reduces redundancy in results
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,
        "fetch_k": 20,      # Fetch 20 candidates
        "lambda_mult": 0.7  # 0=max diversity, 1=max similarity
    }
)

# Similarity with score threshold
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7, "k": 4}
)

# Test retrieval
docs = retriever.invoke("What is the main contribution of this paper?")
print(f"Retrieved {len(docs)} chunks")
for doc in docs:
    print(f"\nSource: {doc.metadata.get('source', 'unknown')}")
    print(f"Content: {doc.page_content[:150]}...")

Step 6: The Complete RAG Chain

Now we put it all together into a full question-answering system:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# RAG prompt
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant that answers questions based on provided documents.
    
Answer the question using ONLY the information in the context below.
If the answer is not in the context, say "I don't have enough information to answer that."
Always mention which document/source you're drawing from when relevant.

Context:
{context}"""),
    ("human", "{question}")
])

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'Unknown')}]\n{doc.page_content}"
        for doc in docs
    )

# Full RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Ask questions
answer = rag_chain.invoke("What are the main findings of the research?")
print(answer)
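
The prompt tells the model what to say when the context lacks an answer, but you can also skip the LLM call entirely when retrieval comes back empty. This matters mostly with the score-threshold retriever, which can return an empty list. The sketch below shows one way to do that; `answer_with_guard`, the `Doc` stand-in, and the fake retriever and generator are hypothetical names so the example runs without API keys.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

FALLBACK = "I don't have enough information to answer that."

def format_docs(docs):
    """Same formatting as the chain above: source tag, then chunk text."""
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'Unknown')}]\n{d.page_content}"
        for d in docs
    )

def answer_with_guard(question, retrieve, generate):
    """Return the fallback without calling the LLM when nothing is retrieved."""
    docs = retrieve(question)
    if not docs:
        return FALLBACK
    return generate(format_docs(docs), question)

# Fake retriever and generator (assumptions for this demo only).
def fake_retrieve(question):
    if "refund" in question.lower():
        return [Doc("Refunds are issued within 14 days.", {"source": "policy.pdf"})]
    return []

def fake_generate(context, question):
    return f"Answer based on:\n{context}"

print(answer_with_guard("What is the refund window?", fake_retrieve, fake_generate))
print(answer_with_guard("Who won the 1998 World Cup?", fake_retrieve, fake_generate))
```

In the real pipeline you would pass `threshold_retriever.invoke` as `retrieve` and a small function wrapping `rag_prompt | llm | StrOutputParser()` as `generate`.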

Adding Source Citations

For production systems, you usually need to know where answers come from:

from typing import List, TypedDict

from langchain_core.documents import Document
from langchain_core.runnables import RunnablePassthrough

# Shape of the dict the chain returns
class RAGResponse(TypedDict):
    question: str
    answer: str
    sources: List[Document]

# Chain that returns both answer and sources.
# Retrieve once, then build the context from those same documents, so the
# cited sources always match what the model saw and the query is embedded
# only one time.
rag_chain_with_sources = (
    RunnablePassthrough.assign(
        sources=lambda x: retriever.invoke(x["question"])
    )
    | RunnablePassthrough.assign(
        context=lambda x: format_docs(x["sources"])
    )
    | {
        "answer": rag_prompt | llm | StrOutputParser(),
        "sources": lambda x: x["sources"],
        "question": lambda x: x["question"]
    }
)

response = rag_chain_with_sources.invoke({"question": "Who are the authors?"})
print(f"Answer: {response['answer']}")
print(f"\nSources used:")
for doc in response['sources']:
    print(f"  - {doc.metadata.get('source')}, page {doc.metadata.get('page', 'N/A')}")
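
Several retrieved chunks often come from the same page, so the printed source list can repeat itself. A small helper can collapse duplicates while keeping retrieval order. This is a sketch: `unique_sources` and the `Doc` stand-in are hypothetical, and it assumes your loader sets `source` and `page` metadata the way the loop above expects.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_sources(docs):
    """Return (source, page) pairs in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for doc in docs:
        key = (doc.metadata.get("source", "unknown"), doc.metadata.get("page", "N/A"))
        if key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered

# Made-up retrieval result: two chunks share the same page.
docs = [
    Doc("chunk a", {"source": "paper.pdf", "page": 2}),
    Doc("chunk b", {"source": "paper.pdf", "page": 2}),
    Doc("chunk c", {"source": "paper.pdf", "page": 5}),
    Doc("chunk d", {"source": "notes.txt"}),
]

for source, page in unique_sources(docs):
    print(f"  - {source}, page {page}")
```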

Vector Database Comparison for Local RAG

Choosing a vector database matters a lot for performance and cost. Here's my honest comparison of the main options for LangChain RAG:

Database	Hosting	Cost	ANN Algorithm	Metadata Filtering	Best For
FAISS	Local	Free	IVF / HNSW	Limited	Fast local dev, no server needed
Chroma	Local / Hosted	Free (self-host)	HNSW	Full	Dev and small-medium production
Pinecone	Cloud only	Paid, usage-based (check current pricing)	Proprietary	Excellent	Large-scale production, SaaS
Qdrant	Local + Cloud	Free (self-host)	HNSW	Excellent	Production, complex filtering
Weaviate	Local + Cloud	Free (self-host)	HNSW	Excellent	Hybrid search, multi-modal
Milvus	Local + Cloud	Free (self-host)	HNSW / IVF	Good	High throughput, enterprise
PGVector	Local (Postgres)	Free (self-host)	HNSW / IVFFlat	Excellent	Existing Postgres stack

Tuning RAG Quality

Basic RAG gets you 60% of the way there. Tuning gets you the rest. Here are the highest-impact changes you can make.

Hypothetical Document Embeddings (HyDE)

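HyDE generates a hypothetical answer first, then embeds that answer and retrieves with it instead of the raw question. A hypothetical answer tends to sit closer in embedding space to real answer passages than a short question does, so it often works well for question-answering tasks, at the cost of one extra LLM call per query.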

from typing import List

from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Generate hypothetical document
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short paragraph that would directly answer this question:\n{question}"
)
hyde_chain = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str) -> List[Document]:
    # Generate a hypothetical answer
    hypothetical = hyde_chain.invoke({"question": question})
    # Use it for retrieval instead of the raw question
    return retriever.invoke(hypothetical)

# Use in RAG chain. This chain expects a dict input:
# hyde_rag_chain.invoke({"question": "..."})
hyde_rag_chain = (
    {
        "context": lambda x: format_docs(hyde_retrieve(x["question"])),
        "question": lambda x: x["question"]
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

Semantic Chunking

Instead of fixed-size chunks, semantic chunking splits on meaning boundaries:

from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

semantic_chunks = semantic_splitter.split_documents(all_docs)
print(f"Semantic chunks: {len(semantic_chunks)}")

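Semantic chunking often produces more coherent chunks for retrieval, but it is slower and costs more to create, because the text has to be embedded sentence by sentence before the split points can be chosen. Worth it for document types with clear semantic structure.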
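
To show what a percentile breakpoint means, here is a rough pure-Python sketch of the idea: measure the distance between consecutive sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile. This is not the `SemanticChunker` source; the function names and the toy 2-D vectors standing in for embeddings are assumptions for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values, pct):
    """Linearly interpolated percentile (0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def split_on_breakpoints(sentences, vectors, pct=95):
    """Group sentences into chunks, splitting at unusually large distances."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy data: two sentences about cats, two about markets.
sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0], [0.2, 0.98]]

print(split_on_breakpoints(sentences, vectors, pct=95))
```

Lowering the percentile makes more distances count as breakpoints, which yields more, smaller chunks.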

Multi-Query Retrieval

Generate multiple query variations and combine the retrieved results:

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=llm,
    include_original=True  # Include original query results too
)

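# By default this generates 3 query variations, runs each one against the
# retriever, and deduplicates the combined results
docs = multi_query_retriever.invoke("What methods did they use?")

Multi-query retrieval usually improves recall. The cost is one extra LLM call to generate the variations, plus 3-4x more retriever calls (query embedding and vector search) per question. For high-stakes Q&A where missing relevant context is worse than the extra latency and cost, it's usually worth it.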
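
The combining step is simple to picture. The sketch below merges the ranked lists returned for each query variation and keeps the first occurrence of each chunk. It is only an illustration: `merge_unique`, the reworded queries, and the chunk labels are made up, and the retriever's own deduplication may differ in detail.

```python
def merge_unique(result_lists):
    """Merge several ranked result lists, keeping each chunk once, in first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Hypothetical per-query results; in practice each list comes from the retriever.
results_by_query = {
    "What methods did they use?": ["chunk-methods", "chunk-intro"],
    "Which techniques were applied in the study?": ["chunk-methods", "chunk-experiments"],
    "How was the research carried out?": ["chunk-experiments", "chunk-setup"],
}

merged = merge_unique(results_by_query.values())
print(merged)
```

The original query alone would have surfaced two chunks; the union of all three variations surfaces four, which is where the recall gain comes from.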

For a full exploration of advanced retrieval patterns, see our LangChain advanced RAG strategies guide. The semantic search tutorial also covers the embedding fundamentals in more depth.

Building a Complete RAG App

Let's wrap everything into a clean, reusable class:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
from typing import List, Optional

class RAGPipeline:
    def __init__(
        self,
        persist_dir: str = "./rag_db",
        model: str = "gpt-4o-mini",
        embedding_model: str = "text-embedding-3-small",
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        k: int = 4
    ):
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.llm = ChatOpenAI(model=model, temperature=0)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        self.persist_dir = persist_dir
        self.k = k
        
        # Load or create vector store
        if Path(persist_dir).exists():
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing store with {self.vectorstore._collection.count()} chunks")
        else:
            self.vectorstore = None
            print("No existing store found. Add documents first.")
        
        self._build_chain()
    
    def add_documents(self, documents: list) -> None:
        chunks = self.splitter.split_documents(documents)
        
        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=chunks,
                embedding=self.embeddings,
                persist_directory=self.persist_dir
            )
        else:
            self.vectorstore.add_documents(chunks)
        
        print(f"Added {len(chunks)} chunks. Total: {self.vectorstore._collection.count()}")
        self._build_chain()
    
    def _build_chain(self) -> None:
        if self.vectorstore is None:
            return
        
        retriever = self.vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={"k": self.k, "fetch_k": self.k * 5}
        )
        
        prompt = ChatPromptTemplate.from_messages([
            ("system", """Answer questions using only the provided context.
If the answer isn't in the context, say so clearly.

Context: {context}"""),
            ("human", "{question}")
        ])
        
        def format_docs(docs):
            return "\n\n".join(d.page_content for d in docs)
        
        self.chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )
    
    def ask(self, question: str) -> str:
        if self.vectorstore is None:
            return "No documents loaded. Call add_documents() first."
        return self.chain.invoke(question)

# Usage
pipeline = RAGPipeline(persist_dir="./my_rag_db")

# Add documents
from langchain_community.document_loaders import PyPDFLoader
docs = PyPDFLoader("./my_document.pdf").load()
pipeline.add_documents(docs)

# Ask questions
print(pipeline.ask("What is the main topic of this document?"))
print(pipeline.ask("What are the key findings?"))

Conclusion

From here, explore the LangChain advanced RAG strategies guide for reranking, hybrid search, and contextual compression — the techniques that take a good RAG system to a great one.

Frequently Asked Questions

What chunk size should I use for RAG?

How many documents should I retrieve per query (k value)?

Is Chroma good enough for production RAG, or do I need Pinecone?




