How to Build a RAG Pipeline with LangChain (Step-by-Step)
Build a complete RAG pipeline with LangChain, Chroma, and OpenAI embeddings — document loading, chunking, vector storage, and retrieval in one guide.
Retrieval-Augmented Generation is one of those patterns that sounds complex until you see it in practice. Then it clicks. You have documents. You want to ask questions about them. You can't fit all the documents in a single prompt. So you embed them, store the embeddings, retrieve the relevant ones at query time, and include only those in the prompt.
That's RAG. The complexity is in the details — how you split documents, which embedding model you use, how you tune retrieval, what you do when retrieval fails.
I've built RAG systems for document search, internal knowledge bases, customer support automation, and research assistants. The pattern is the same each time, but the tuning decisions matter enormously. This guide walks through the full pipeline with working code, then covers the decisions that actually move the needle.
For context on where RAG fits in the broader LangChain ecosystem, check the LangChain tutorial 2025 first if you're new to the framework.
What Makes a Good RAG System
Most RAG tutorials stop at "it returns answers." Production RAG needs to:
- Return accurate answers (not hallucinations)
- Return relevant answers (right context retrieved)
- Handle edge cases gracefully (no relevant docs, ambiguous queries)
- Be fast enough for user-facing apps (sub-2 second ideally)
- Be cheap enough to run at scale
These goals sometimes conflict. Better accuracy often means slower retrieval. Better coverage means more embedding costs. The pipeline we build here optimizes for accuracy and clarity first, then covers optimization options.
According to a 2024 survey published on arXiv, RAG-based systems reduced hallucination rates by 40-60% compared to pure generation across multiple benchmark tasks. It's not a cure-all, but it's the most practical way to ground LLM responses in real data.
Setting Up Dependencies
pip install langchain langchain-openai langchain-community langchain-chroma
pip install chromadb python-dotenv pypdf beautifulsoup4 langchain-experimental
from dotenv import load_dotenv
import os
load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "Missing OPENAI_API_KEY"
Step 1: Loading Documents
Before you can embed anything, you need to load your documents. LangChain has loaders for PDFs, HTML pages, Word docs, CSVs, YouTube transcripts, and dozens of other formats.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders.text import TextLoader
# Load a single PDF
pdf_loader = PyPDFLoader("./documents/research_paper.pdf")
pages = pdf_loader.load()
print(f"Loaded {len(pages)} pages from PDF")
print(f"First page preview: {pages[0].page_content[:200]}")
# Load all PDFs from a directory
dir_loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)
all_docs = dir_loader.load()
print(f"Loaded {len(all_docs)} document pages total")
# Loading from text files
text_loader = TextLoader("./documents/notes.txt", encoding="utf-8")
text_docs = text_loader.load()
# Loading from web pages
from langchain_community.document_loaders import WebBaseLoader
import bs4
web_loader = WebBaseLoader(
    web_paths=["https://python.langchain.com/docs/introduction"],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("article", "main-content"))
    )
)
web_docs = web_loader.load()
Each loaded document is a Document object with two main attributes: page_content (the text) and metadata (source, page number, etc.). That metadata becomes important during retrieval.
Step 2: Splitting Documents into Chunks
You can't embed a 50-page PDF as one unit. You split it into chunks that are small enough to retrieve individually but large enough to contain meaningful context.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Characters per chunk
    chunk_overlap=200,  # Overlap between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
# Split the loaded documents
chunks = splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} pages into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
# Preview a chunk and its metadata
print(f"\nChunk 0 content:\n{chunks[0].page_content}")
print(f"\nChunk 0 metadata: {chunks[0].metadata}")
Choosing the Right Splitter
The RecursiveCharacterTextSplitter is the default choice for most document types. For specialized content, use the purpose-built splitters:
# For markdown documents
from langchain_text_splitters import MarkdownHeaderTextSplitter
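# Sample input for illustration; replace with your own markdown text
markdown_content = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall steps.\n\n### Notes\n\nExtra details."
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)
md_chunks = markdown_splitter.split_text(markdown_content)
# Each chunk retains header metadata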
# For Python/code files
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=100
)
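The overlap parameter (200 in our example) keeps information that spans a chunk boundary from being lost. Suppose a sentence runs from position 950 to 1020. Chunk 0 ends near position 1000 and holds only the beginning of it, but chunk 1 starts about 200 characters earlier, near position 800, so it contains the whole sentence. The splitter prefers to break on separators, so actual boundaries will vary.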
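To make the overlap concrete, here is a minimal sketch of a purely position-based splitter. It is a simplification: RecursiveCharacterTextSplitter breaks on separators where it can, so real chunk boundaries will differ. The helper `split_with_overlap` and the sample text are hypothetical and exist only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A 1,200-character text with a marker sentence that straddles position 1000
sentence = "THIS SENTENCE CROSSES THE BOUNDARY."
text = "a" * 980 + sentence + "b" * (1200 - 980 - len(sentence))

chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))            # number of chunks produced
print(sentence in chunks[0])  # chunk 0 is cut off mid-sentence
print(sentence in chunks[1])  # chunk 1 starts at position 800 and holds the whole sentence
```

The last 200 characters of chunk 0 and the first 200 characters of chunk 1 are identical, which is the price of the overlap: some text is embedded and stored twice.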
Step 3: Creating Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts end up close in vector space. This is what makes retrieval possible.
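A small sketch of what "close in vector space" means in practice. The three-dimensional vectors below are made up for illustration (real embedding vectors have over a thousand dimensions), and `cosine_similarity` is a hand-written helper, not a LangChain function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings": the first two texts are about the same topic
toy_vectors = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to recover account access": [0.8, 0.3, 0.1],
    "Quarterly revenue grew 12%": [0.0, 0.2, 0.9],
}

query_vec = toy_vectors["How do I reset my password?"]
for text, vec in toy_vectors.items():
    print(f"{cosine_similarity(query_vec, vec):.3f}  {text}")
```

The two account-related texts score much closer to each other than either does to the revenue sentence. A vector store runs this kind of comparison between the query embedding and every stored chunk, usually through an approximate index rather than a full scan.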
from langchain_openai import OpenAIEmbeddings
# OpenAI's latest embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Good balance of quality and cost
    # model="text-embedding-3-large",  # Better quality, higher cost
)
# Test the embedding model
test_text = "What is Python?"
embedding_vector = embeddings.embed_query(test_text)
print(f"Embedding dimensions: {len(embedding_vector)}") # 1536 for small, 3072 for large
The text-embedding-3-small model at 1536 dimensions costs roughly $0.02 per million tokens. For most projects, it's the right choice. Use text-embedding-3-large only if you're seeing retrieval quality issues that smaller dimensions can't solve.
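For budgeting, the arithmetic can be sketched as follows. The price is the figure quoted above; the 4-characters-per-token ratio is a rough assumption for English text and will vary with your content, so treat the result as an order-of-magnitude estimate. `estimate_embedding_cost` is a hypothetical helper.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small figure quoted above
CHARS_PER_TOKEN = 4              # rough assumption for English text; varies by content


def estimate_embedding_cost(total_chars: int) -> float:
    """Approximate one-time cost in USD of embedding total_chars characters."""
    tokens = total_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


# Example: 10,000 chunks of about 1,000 characters each
cost = estimate_embedding_cost(10_000 * 1_000)
print(f"Estimated one-time embedding cost: ${cost:.4f}")
```

With the chunks from Step 2 you could pass `sum(len(c.page_content) for c in chunks)` as the character count. Overlap counts too, since overlapping text is embedded once per chunk it appears in.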
Step 4: Storing in a Vector Database
Now we store the chunks as embeddings. Chroma is the easiest option for local development and small production deployments.
from langchain_chroma import Chroma
# Create vector store from documents (embeds and stores in one step)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # Persist to disk
    collection_name="my_documents"
)
print(f"Stored {vectorstore._collection.count()} chunks in Chroma")
# Loading an existing vector store (after the first run)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_documents"
)
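# Test similarity search directly, returning (document, score) pairs
results = vectorstore.similarity_search_with_score(
    "How does Python handle memory management?",
    k=3
)
for i, (doc, score) in enumerate(results):
    # With Chroma this score is a distance, so lower means more similar
    print(f"\n[Result {i+1}] Score: {score:.4f}")
    print(doc.page_content[:200])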
Adding New Documents to an Existing Store
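# Load a file that is not in the store yet (hypothetical path);
# re-adding an already indexed file would create duplicate chunks
new_docs = PyPDFLoader("./documents/new_report.pdf").load()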
new_chunks = splitter.split_documents(new_docs)
# Add to existing store without recreating
vectorstore.add_documents(new_chunks)
print(f"Now have {vectorstore._collection.count()} total chunks")
Step 5: Building the Retriever
The retriever is the component that takes a query and returns the most relevant chunks.
# Basic similarity retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 chunks
)
# MMR (Maximal Marginal Relevance): reduces redundancy in results
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,
        "fetch_k": 20,  # Fetch 20 candidates
        "lambda_mult": 0.7  # 0=max diversity, 1=max similarity
    }
)
# Similarity with score threshold
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7, "k": 4}
)
# Test retrieval
docs = retriever.invoke("What is the main contribution of this paper?")
print(f"Retrieved {len(docs)} chunks")
for doc in docs:
    print(f"\nSource: {doc.metadata.get('source', 'unknown')}")
    print(f"Content: {doc.page_content[:150]}...")
MMR is often better than pure similarity retrieval because it penalizes redundant results. If your top 4 similar chunks are all essentially the same paragraph, MMR will diversify the results to cover more ground.
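To show what MMR does under the hood, here is a minimal pure-Python sketch of the standard formulation: each pick maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The two-dimensional vectors are made up, and the vector store's own implementation may differ in details, so read this as an illustration of the idea rather than the library's code.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [0.99, 0.10],  # doc 0: very relevant
    [0.98, 0.12],  # doc 1: near-duplicate of doc 0
    [0.60, -0.80], # doc 2: less relevant, but covers different ground
]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure similarity ranking
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # balances relevance and diversity
```

With `lambda_mult=1.0` the near-duplicate is picked second; at `0.5` the redundancy penalty pushes it out in favor of the document that adds new information. The `fetch_k` setting in the retriever controls how many candidates enter this selection loop.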
Step 6: The Complete RAG Chain
Now we put it all together into a full question-answering system:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# RAG prompt
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant that answers questions based on provided documents.
Answer the question using ONLY the information in the context below.
If the answer is not in the context, say "I don't have enough information to answer that."
Always mention which document/source you're drawing from when relevant.
Context:
{context}"""),
    ("human", "{question}")
])
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'Unknown')}]\n{doc.page_content}"
        for doc in docs
    )
# Full RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)
# Ask questions
answer = rag_chain.invoke("What are the main findings of the research?")
print(answer)
Adding Source Citations
For production systems, you usually need to know where answers come from:
from langchain_core.runnables import RunnablePassthrough
from typing import TypedDict, List
from langchain_core.documents import Document
class RAGResponse(TypedDict):
    question: str
    answer: str
    sources: List[Document]
# Chain that returns both answer and sources.
# Retrieve once, then build the prompt context from those same documents,
# so the cited sources are exactly what the model saw.
rag_chain_with_sources = (
    RunnablePassthrough.assign(
        sources=lambda x: retriever.invoke(x["question"])
    )
    | RunnablePassthrough.assign(
        context=lambda x: format_docs(x["sources"])
    )
    | {
        "answer": rag_prompt | llm | StrOutputParser(),
        "sources": lambda x: x["sources"],
        "question": lambda x: x["question"]
    }
)
response = rag_chain_with_sources.invoke({"question": "Who are the authors?"})
print(f"Answer: {response['answer']}")
print(f"\nSources used:")
for doc in response['sources']:
    print(f" - {doc.metadata.get('source')}, page {doc.metadata.get('page', 'N/A')}")
Vector Database Comparison for Local RAG
Choosing a vector database matters a lot for performance and cost. Here's my honest comparison of the main options for LangChain RAG:
| Database | Hosting | Cost | ANN Algorithm | Metadata Filtering | Best For |
|---|---|---|---|---|---|
| FAISS | Local | Free | IVF / HNSW | Limited | Fast local dev, no server needed |
| Chroma | Local / Hosted | Free (self-host) | HNSW | Full | Dev and small-medium production |
| Pinecone | Cloud only | Usage-based pricing | Proprietary | Excellent | Large-scale production, SaaS |
| Qdrant | Local + Cloud | Free (self-host) | HNSW | Excellent | Production, complex filtering |
| Weaviate | Local + Cloud | Free (self-host) | HNSW | Excellent | Hybrid search, multi-modal |
| Milvus | Local + Cloud | Free (self-host) | HNSW / IVF | Good | High throughput, enterprise |
| PGVector | Local (Postgres) | Free (self-host) | HNSW / IVF | Excellent | Existing Postgres stack |
For local RAG development: start with Chroma. It's the easiest to set up and works well up to a few hundred thousand documents. For production RAG at scale, Qdrant gives you the best combination of performance, filtering, and self-hosting flexibility. Our deeper vector database guide covers each one in more detail.
Tuning RAG Quality
Basic RAG gets you 60% of the way there. Tuning gets you the rest. Here are the highest-impact changes you can make.
Hypothetical Document Embeddings (HyDE)
HyDE generates a hypothetical answer first, then uses that to retrieve. It works surprisingly well for question-answering tasks.
from typing import List
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Generate hypothetical document
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short paragraph that would directly answer this question:\n{question}"
)
hyde_chain = hyde_prompt | llm | StrOutputParser()
def hyde_retrieve(question: str) -> List[Document]:
    # Generate a hypothetical answer
    hypothetical = hyde_chain.invoke({"question": question})
    # Use it for retrieval instead of the raw question
    return retriever.invoke(hypothetical)
# Use in RAG chain
hyde_rag_chain = (
    {
        "context": lambda x: format_docs(hyde_retrieve(x["question"])),
        "question": lambda x: x["question"]
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)
Semantic Chunking
Instead of fixed-size chunks, semantic chunking splits on meaning boundaries:
from langchain_experimental.text_splitter import SemanticChunker
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
semantic_chunks = semantic_splitter.split_documents(all_docs)
print(f"Semantic chunks: {len(semantic_chunks)}")
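Semantic chunking often produces more coherent chunks for retrieval. The trade-off is that it is slower and costlier to create, because the text has to be embedded during splitting to find the boundaries. Worth it for document types with clear semantic structure.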
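The percentile idea can be sketched in a few lines of plain Python: measure the distance between each pair of consecutive sentence embeddings, then start a new chunk wherever that distance exceeds the chosen percentile of all distances. This is a simplified illustration under assumed toy vectors, not the SemanticChunker source; `percentile` and `semantic_split` are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


def percentile(values, pct):
    """Linear-interpolation percentile for pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_split(sentences, vectors, threshold_pct):
    if len(sentences) < 2:
        return list(sentences)
    # Distance between each sentence and the one that follows it
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Chroma stores vectors on disk.",
    "It uses HNSW for search.",
    "Collections group related vectors.",
    "The invoice is due Friday.",
    "Late payments incur a fee.",
]
# Made-up 2-D embeddings: the first three point one way, the last two another
vectors = [[1.0, 0.0], [0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.05, 0.95]]

for chunk in semantic_split(sentences, vectors, threshold_pct=70):
    print(chunk)
```

A higher percentile means fewer, larger chunks, because only the sharpest topic shifts count as boundaries.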
Multi-Query Retrieval
Generate multiple query variations and combine the retrieved results:
from langchain.retrievers.multi_query import MultiQueryRetriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=llm,
    include_original=True  # Include original query results too
)
# This automatically generates 3 query variations and deduplicates results
docs = multi_query_retriever.invoke("What methods did they use?")
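Multi-query retrieval usually improves recall. The cost is one extra LLM call to generate the variations, plus a separate retrieval (and query embedding) for each variation, so expect roughly 3-4x more retrieval calls per question. For high-stakes Q&A where missing relevant context is worse than extra cost and latency, it's usually worth it.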
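The combining step is simple to picture: run each query variation, then merge the result lists while dropping duplicates. The sketch below uses a dictionary as a stand-in for the vector store, with made-up query variations and chunk IDs; `unique_union` is a hypothetical helper written for this example.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Stand-in for retrieval: maps each query variation to the chunk IDs it returns
fake_results = {
    "What methods did they use?": ["chunk-12", "chunk-07"],
    "Which techniques were applied in the study?": ["chunk-07", "chunk-31"],
    "How was the experiment conducted?": ["chunk-31", "chunk-44"],
}

merged = unique_union(fake_results.values())
print(merged)
```

No single phrasing found all four chunks here, which is the recall gain in miniature. The merged list can also be longer than `k`, so keep an eye on prompt size.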
For a full exploration of advanced retrieval patterns, see our LangChain advanced RAG strategies guide. The semantic search tutorial also covers the embedding fundamentals in more depth.
Building a Complete RAG App
Let's wrap everything into a clean, reusable class:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
from typing import List, Optional
class RAGPipeline:
    def __init__(
        self,
        persist_dir: str = "./rag_db",
        model: str = "gpt-4o-mini",
        embedding_model: str = "text-embedding-3-small",
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        k: int = 4
    ):
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.llm = ChatOpenAI(model=model, temperature=0)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        self.persist_dir = persist_dir
        self.k = k
        # Load or create vector store
        if Path(persist_dir).exists():
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing store with {self.vectorstore._collection.count()} chunks")
        else:
            self.vectorstore = None
            print("No existing store found. Add documents first.")
        self._build_chain()

    def add_documents(self, documents: list) -> None:
        chunks = self.splitter.split_documents(documents)
        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=chunks,
                embedding=self.embeddings,
                persist_directory=self.persist_dir
            )
        else:
            self.vectorstore.add_documents(chunks)
        print(f"Added {len(chunks)} chunks. Total: {self.vectorstore._collection.count()}")
        self._build_chain()

    def _build_chain(self) -> None:
        if self.vectorstore is None:
            return
        retriever = self.vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={"k": self.k, "fetch_k": self.k * 5}
        )
        prompt = ChatPromptTemplate.from_messages([
            ("system", """Answer questions using only the provided context.
If the answer isn't in the context, say so clearly.
Context: {context}"""),
            ("human", "{question}")
        ])

        def format_docs(docs):
            return "\n\n".join(d.page_content for d in docs)

        self.chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

    def ask(self, question: str) -> str:
        if self.vectorstore is None:
            return "No documents loaded. Call add_documents() first."
        return self.chain.invoke(question)
# Usage
pipeline = RAGPipeline(persist_dir="./my_rag_db")
# Add documents
from langchain_community.document_loaders import PyPDFLoader
docs = PyPDFLoader("./my_document.pdf").load()
pipeline.add_documents(docs)
# Ask questions
print(pipeline.ask("What is the main topic of this document?"))
print(pipeline.ask("What are the key findings?"))
Conclusion
Building a RAG pipeline with LangChain involves five clear steps: load documents, split them into chunks, embed and store those chunks, build a retriever, then wire it into an LLM chain. The basic version takes maybe 50 lines of Python. The production version with proper error handling, metadata filtering, and retrieval tuning takes more work — but the scaffold is always the same.
The biggest quality improvements come from: better chunking strategy (semantic over character), MMR retrieval to reduce redundancy, and multi-query retrieval to improve recall. Start simple, measure your retrieval quality, then add complexity where it actually helps.
From here, explore the LangChain advanced RAG strategies guide for reranking, hybrid search, and contextual compression — the techniques that take a good RAG system to a great one.
Frequently Asked Questions
What chunk size should I use for RAG?
Start with 1000–1500 characters with 150–200 character overlap. Smaller chunks (500–800) work better for precise factual retrieval. Larger chunks (2000+) work better for complex reasoning tasks that need more context. Always experiment with your specific documents and measure retrieval quality.
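One way to "measure retrieval quality" when comparing chunk sizes is a hit rate: for a small set of questions where you know which source should come back, count how often it appears in the top k results. The sketch below uses a fake retriever and made-up file names; `hit_rate_at_k` is a hypothetical helper, and this is only one of several reasonable metrics.

```python
def hit_rate_at_k(retrieve, labeled_queries, k):
    """Fraction of questions whose expected source appears in the top k results.

    retrieve: function mapping a question to a ranked list of source names.
    labeled_queries: list of (question, expected_source) pairs.
    """
    hits = 0
    for question, expected_source in labeled_queries:
        if expected_source in retrieve(question)[:k]:
            hits += 1
    return hits / len(labeled_queries)


# Made-up ranked results standing in for a real retriever
fake_ranked = {
    "q1": ["a.pdf", "x.pdf"],
    "q2": ["x.pdf", "b.pdf"],
    "q3": ["x.pdf", "y.pdf"],
    "q4": ["d.pdf", "z.pdf"],
}
labeled = [("q1", "a.pdf"), ("q2", "b.pdf"), ("q3", "c.pdf"), ("q4", "d.pdf")]

print(hit_rate_at_k(fake_ranked.get, labeled, k=1))
print(hit_rate_at_k(fake_ranked.get, labeled, k=2))
```

To run this against the pipeline from this guide, the retrieve function could be something like `lambda q: [d.metadata.get("source") for d in retriever.invoke(q)]`. Rebuild the store with each candidate chunk size, rerun the same labeled questions, and compare the numbers.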
How many documents should I retrieve per query (k value)?
k=3 to k=5 is a good starting point. Too few documents means missing relevant context. Too many means flooding the prompt with noise and increasing cost. Use reranking for large k values to filter down to the most relevant results.
Is Chroma good enough for production RAG, or do I need Pinecone?
Chroma is excellent for development and small-to-medium production workloads (under a million documents). For millions of documents, high availability requirements, or multi-tenant setups, a managed service like Pinecone or a self-hosted Qdrant/Weaviate deployment is worth the extra infrastructure investment.
AiTechWorlds Team