10 LangChain Retrieval Strategies for Better RAG Results
Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.
Basic similarity search works. Until it does not. You embed the user's question, find the nearest document chunks, and hope the phrasing of the question matches the phrasing of the answer. In practice, a question like "what causes memory leaks in Python?" might match poorly against documentation that says "unreferenced objects prevent garbage collection" — the concepts are identical but the words are different.
That gap between query phrasing and document phrasing is where most RAG pipelines lose accuracy. The retrieval strategies in this guide close that gap using different techniques: query expansion, hybrid search, better chunking, and hypothetical document generation. I will give you working code for all 10, plus honest guidance on when each one is worth the added complexity.
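To make the phrasing gap concrete, here is a small standalone sketch (not part of the LangChain pipeline) that measures plain word overlap between the example question and the example documentation sentence. The helper names `tokens` and `jaccard` are illustrative only. Dense embeddings narrow this gap considerably, but they do not remove it.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude stand-in for real tokenization."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "what causes memory leaks in Python?"
doc = "unreferenced objects prevent garbage collection"

overlap = jaccard(query, doc)
print(f"Word overlap: {overlap:.2f}")
```

The two texts describe the same concept yet share no words at all, which is exactly the situation a keyword-only retriever cannot handle and an embedding-only retriever handles imperfectly.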
If you are new to RAG, start with the RAG system tutorial first — this guide assumes you already have a basic retrieval pipeline running. The vector database guide is also worth reading to understand the storage layer these strategies build on.
Setup
pip install langchain langchain-openai langchain-community \
chromadb rank-bm25 python-dotenv lark
Base setup used across all examples:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
load_dotenv()
# Sample documents for demonstration
raw_docs = [
Document(page_content="""Python memory management relies on reference counting and a cyclic garbage collector.
When an object's reference count reaches zero, CPython immediately deallocates it.
However, circular references prevent the reference count from reaching zero.
The cyclic garbage collector handles these circular references by periodically scanning for unreachable cycles.""",
metadata={"source": "python_internals.pdf", "page": 1}),
Document(page_content="""Memory leaks in Python occur when objects accumulate in memory without being freed.
Common causes include: global variables holding references, closures capturing variables,
event listeners not being removed, and circular references between objects.
The tracemalloc module can help identify which code is allocating the most memory.""",
metadata={"source": "python_internals.pdf", "page": 2}),
Document(page_content="""Profiling Python applications for memory usage requires specialized tools.
memory_profiler decorates functions to show line-by-line memory usage.
objgraph visualizes object reference graphs to find unexpected references.
The gc module provides access to the garbage collector and can force collection cycles.""",
metadata={"source": "python_debugging.pdf", "page": 5}),
Document(page_content="""Python's garbage collector uses a generational approach with three generations.
New objects start in generation 0. Objects that survive a collection move to generation 1.
Objects in generation 2 are considered long-lived. The collector runs most frequently on generation 0.""",
metadata={"source": "python_internals.pdf", "page": 3}),
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)
Strategy 1: Basic Similarity Search (Baseline)
Start here so you have a baseline to compare against.
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(
documents=raw_docs,
embedding=embeddings,
collection_name="baseline"
)
basic_retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 3}
)
query = "what causes memory leaks in Python?"
results = basic_retriever.invoke(query)
print(f"Basic similarity — {len(results)} results")
for doc in results:
    print(f" Source: {doc.metadata['source']} p.{doc.metadata['page']}")
    print(f" Preview: {doc.page_content[:100]}...")
Strategy 2: MMR (Maximal Marginal Relevance)
Standard similarity search can return 3 documents that all say nearly the same thing. MMR balances relevance with diversity — it picks the next document that is both relevant to the query AND different from what was already selected.
mmr_retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 3, # documents to return
"fetch_k": 10, # candidates to consider before MMR reranking
"lambda_mult": 0.7 # 0=max diversity, 1=max relevance
}
)
mmr_results = mmr_retriever.invoke("Python memory management")
print(f"MMR results: {len(mmr_results)}")
# These should be more diverse than basic similarity results
When to use it: When you have a large number of documents with significant overlap (e.g., multiple pages from the same document saying similar things). MMR reduces redundancy in retrieved context.
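For intuition about what `lambda_mult` does, here is a minimal pure-Python sketch of greedy MMR selection. It is a simplified illustration with hand-made 2D vectors, not the vector store's actual implementation, and the function names are my own.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.7):
    """Greedy MMR: return indices of k docs, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected doc (0 if none selected yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],  # 0: highly relevant
    [1.0, 0.12],  # 1: near-duplicate of doc 0
    [0.7, 0.70],  # 2: less relevant, but different
]

pure_relevance = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.3)
print(pure_relevance, diverse)
```

With `lambda_mult=1.0` the near-duplicate is picked second; lowering it to 0.3 swaps in the different document instead. In this toy data the near-duplicate still wins at 0.7, so expect to experiment with the value on your own corpus.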
Strategy 3: Similarity Score Threshold
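Return only documents whose relevance score clears a threshold. This reduces the risk of hallucinations caused by low-relevance retrievals. The right threshold depends on your embedding model and the vector store's distance metric, so treat 0.75 as a starting point and tune it on your own data.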
threshold_retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.75, # only return docs above this similarity
"k": 5
}
)
results = threshold_retriever.invoke("Python memory leaks debugging")
print(f"Threshold retriever: {len(results)} results above 0.75 threshold")
# For out-of-scope queries, this returns fewer or no results
oos_results = threshold_retriever.invoke("JavaScript async await patterns")
print(f"Out-of-scope query: {len(oos_results)} results (should be 0 or 1)")
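The filtering step itself is simple. The sketch below shows it on hypothetical (document, score) pairs, along with one way to turn an empty result into an explicit "no answer". The scores and the `answer_or_decline` helper are invented for illustration; real relevance scores depend on your embeddings and vector store.

```python
def filter_by_score(scored_docs, score_threshold, k):
    """Keep at most k (doc, score) pairs whose score is at or above the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

def answer_or_decline(scored_docs, threshold=0.75):
    """Decline explicitly when nothing clears the threshold."""
    hits = filter_by_score(scored_docs, threshold, k=5)
    if not hits:
        return "I could not find that in the documents."
    return f"Answering from {len(hits)} retrieved chunk(s)."

# Hypothetical relevance scores in [0, 1]
in_scope = [("leak causes", 0.82), ("gc generations", 0.77), ("profiling tools", 0.61)]
out_of_scope = [("leak causes", 0.41), ("gc generations", 0.38)]

print(answer_or_decline(in_scope))
print(answer_or_decline(out_of_scope))
```

Declining when the filtered list is empty is what makes the threshold useful: the LLM never sees weak context it might otherwise try to answer from.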
Strategy 4: ParentDocumentRetriever
This is one of the most impactful strategies in practice. It stores small child chunks for scoring (precise matching) but returns larger parent chunks (more context).
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Larger parent chunks (what gets returned)
parent_splitter = RecursiveCharacterTextSplitter(
chunk_size=2000,
chunk_overlap=200,
)
# Smaller child chunks (what gets indexed and scored)
child_splitter = RecursiveCharacterTextSplitter(
chunk_size=400,
chunk_overlap=50,
)
parent_vectorstore = Chroma(
collection_name="parent_child",
embedding_function=embeddings
)
# InMemoryStore holds the full parent documents
parent_store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
vectorstore=parent_vectorstore,
docstore=parent_store,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
parent_retriever.add_documents(raw_docs)
results = parent_retriever.invoke("Python garbage collector generations")
print(f"ParentDocumentRetriever: {len(results)} results")
for doc in results:
    print(f" Content length: {len(doc.page_content)} chars (larger parent chunk)")
When to use it: Almost always better than basic chunking for technical documentation, legal documents, and any content where context spans multiple paragraphs.
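The child-to-parent bookkeeping can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the retriever's real internals: it splits on sentences instead of using `RecursiveCharacterTextSplitter`, and it matches by substring instead of by embedding similarity. All names are illustrative.

```python
def split_sentences(text):
    """Naive sentence splitter, standing in for the child splitter."""
    return [s.strip() for s in text.split(". ") if s.strip()]

def build_index(parents):
    """Map every child chunk to the id of the parent it came from."""
    index = []  # list of (child_text, parent_id)
    for parent_id, parent in enumerate(parents):
        for child in split_sentences(parent):
            index.append((child, parent_id))
    return index

def retrieve_parents(term, index, parents):
    """Match on children, then return each matching parent once, in match order."""
    seen, results = set(), []
    for child, parent_id in index:
        if term in child and parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

parents = [
    "New objects start in generation 0. Survivors move to generation 1. "
    "Objects in generation 2 are considered long-lived.",
    "The tracemalloc module can help identify which code allocates the most memory.",
]
index = build_index(parents)
hits = retrieve_parents("generation 2", index, parents)
print(len(index), "children indexed;", len(hits), "parent returned")
```

The match happens on one short sentence, but the caller receives the whole parent, including the sentences about generations 0 and 1 that give the match its context.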
Strategy 5: MultiQueryRetriever
The LLM rewrites the user's question into multiple different phrasings, runs all of them, and deduplicates the results. This catches documents that match one phrasing but not another.
from langchain.retrievers.multi_query import MultiQueryRetriever
import logging
# Enable logging to see generated queries
# basicConfig() attaches a handler; without it INFO messages are not printed
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
multi_query_retriever = MultiQueryRetriever.from_llm(
retriever=basic_retriever,
llm=llm,
)
# The retriever automatically generates alternative phrasings
results = multi_query_retriever.invoke("what makes Python programs use too much memory?")
print(f"MultiQuery results: {len(results)} unique documents")
# The log output shows the generated alternative queries, something like:
# - "Python memory leak causes"
# - "unreferenced objects Python memory"
# - "Python program memory consumption increase"
When to use it: When your users phrase questions very differently from how your documents are written. Especially useful for customer support knowledge bases.
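The merge step is a union with de-duplication. Here is a minimal sketch of that step alone, using a toy keyword retriever and hand-written query variants in place of LLM output. Everything named here is hypothetical.

```python
def keyword_retriever(query, corpus):
    """Toy retriever: ids of docs sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items() if words & set(text.lower().split())]

def multi_query_retrieve(queries, corpus):
    """Run every query variant and keep each document once, in first-seen order."""
    seen, merged = set(), []
    for q in queries:
        for doc_id in keyword_retriever(q, corpus):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

corpus = {
    "leaks": "objects accumulate without being freed",
    "gc": "the cyclic collector scans for unreachable cycles",
    "tools": "tracemalloc shows which code allocates memory",
}

original = "python programs memory"
# Hand-written stand-ins for the alternative phrasings an LLM would generate
variants = [original, "objects never freed", "unreachable cycles"]

single = keyword_retriever(original, corpus)
merged = multi_query_retrieve(variants, corpus)
print(single, merged)
```

The original query finds one document; the three phrasings together find all three, each listed once.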
Strategy 6: EnsembleRetriever (Hybrid Search)
Combines vector search (semantic) with BM25 keyword search (lexical). This is consistently the strongest single upgrade you can make to a basic RAG pipeline.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# BM25 is a keyword-based retriever — no embeddings needed
bm25_retriever = BM25Retriever.from_documents(
raw_docs,
k=3
)
# Vector-based retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# Combine them with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.5, 0.5] # equal weight; tune based on your data
)
results = ensemble_retriever.invoke("tracemalloc memory profiling")
print(f"Ensemble results: {len(results)} documents")
# BM25 will find "tracemalloc" exactly; vector search adds semantic context
A 2024 benchmarking study on RAG systems found hybrid retrieval (BM25 + vector) outperformed pure vector search by an average of 12% on NDCG@10 across multiple datasets. It is one of the highest-ROI improvements you can make.
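Weighted Reciprocal Rank Fusion is short enough to write out. The sketch below is a standalone illustration of the scoring rule, where each retriever contributes `weight / (rank + c)` for every document it returns. It uses made-up rankings and the smoothing constant of 60 that is commonly used for RRF.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers
bm25_ranking = ["tools", "leaks", "gc"]
vector_ranking = ["leaks", "refcount", "tools"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
keyword_heavy = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.9, 0.1])
print(fused)
print(keyword_heavy)
```

Because only ranks are used, the fusion does not care that BM25 scores and cosine similarities live on different scales. Documents that both retrievers rank well rise to the top, and the weights let you lean toward keyword or semantic matching.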
Strategy 7: Contextual Compression Retriever
Retrieve chunks, then use an LLM to compress each chunk down to only the parts relevant to the query. Cuts noise from retrieved context.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# The compressor uses an LLM to extract only relevant portions
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=basic_retriever,
)
results = compression_retriever.invoke("how does cyclic garbage collection work?")
print(f"Compression retriever: {len(results)} results")
for doc in results:
    print(f"\nCompressed content ({len(doc.page_content)} chars):")
    print(doc.page_content)
# The content should be much shorter than the original chunk
# and contain only text relevant to cyclic garbage collection
When to use it: When your chunks are large and contain a lot of off-topic content relative to the query. Reduces hallucination risk from irrelevant context.
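To show the shape of the compression step without an LLM call, here is a rough sketch that keeps only the sentences sharing a meaningful word with the query. This keyword heuristic is a swapped-in stand-in for `LLMChainExtractor`, which makes the relevance judgment with a model; the `compress` function and its stopword list are assumptions made for illustration.

```python
import re

STOPWORDS = frozenset({"how", "does", "the", "a", "in", "of", "to", "is", "work"})

def words(text):
    """Lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compress(chunk, query):
    """Keep only sentences that share a non-stopword with the query."""
    query_terms = words(query) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if query_terms & words(s)]
    return " ".join(kept)

chunk = (
    "Python memory management relies on reference counting. "
    "The cyclic garbage collector scans for unreachable cycles. "
    "The tracemalloc module reports allocation sizes."
)

compressed = compress(chunk, "how does cyclic garbage collection work?")
print(compressed)
```

Only the middle sentence survives. An LLM-based extractor makes the same kind of cut, but it can also recognize relevance when no words overlap.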
Strategy 8: HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query, then embed that and search with it. This works because the hypothetical document has the same phrasing style as real documents.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# HyDE embeddings — generates hypothetical docs before embedding
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
llm=llm,
base_embeddings=embeddings,
prompt_key="web_search", # from_llm needs a prompt_key (or a custom_prompt)
)
# Build a vector store with HyDE embeddings
hyde_vectorstore = Chroma.from_documents(
documents=raw_docs,
embedding=hyde_embeddings, # note: uses the HyDE embedder
collection_name="hyde"
)
hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 3})
# At query time, the question "what causes memory leaks" gets turned into
# a hypothetical answer like "Memory leaks in Python are caused by..."
# That hypothetical answer is then embedded and used for similarity search
results = hyde_retriever.invoke("what causes memory leaks in Python?")
print(f"HyDE results: {len(results)}")
When to use it: When queries are short questions and documents are longer technical explanations. A 5-word question and a 200-word answer paragraph can land far apart in embedding space, which hurts retrieval; HyDE narrows that gap by making the query look like an answer.
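The effect can be illustrated offline with a toy setup. In this sketch a bag-of-words counter stands in for the embedding model and `fake_llm` returns a canned passage in place of a real LLM call, so both are assumptions rather than anything LangChain provides. Bag-of-words exaggerates the effect compared with real embeddings, but the direction is the same.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(question):
    """Hypothetical stand-in for the LLM: returns a canned answer-style passage."""
    return ("Memory leaks occur when objects are not freed because references "
            "such as global variables or circular references keep them alive.")

doc = ("Memory leaks in Python occur when objects accumulate in memory without being freed. "
       "Common causes include global variables holding references and circular references.")
question = "why does my Python app keep eating RAM?"

direct = cosine(embed(question), embed(doc))            # embed the question itself
hyde = cosine(embed(fake_llm(question)), embed(doc))    # embed the hypothetical answer
print(f"direct={direct:.2f} hyde={hyde:.2f}")
```

The hypothetical answer does not need to be factually right. It only needs to use the vocabulary and style of the documents you are searching.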
Strategy 9: Self-Query Retriever
The LLM translates natural language filter conditions into structured metadata filters. "Documents from 2025 about Python" becomes a vector search + metadata filter {"year": 2025, "topic": "Python"}.
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# Describe your metadata fields to the LLM
metadata_field_info = [
AttributeInfo(
name="source",
description="The source PDF file name",
type="string",
),
AttributeInfo(
name="page",
description="Page number in the source document",
type="integer",
),
]
document_content_description = "Technical documentation about Python internals and debugging"
self_query_retriever = SelfQueryRetriever.from_llm(
llm=llm,
vectorstore=vectorstore,
document_contents=document_content_description,
metadata_field_info=metadata_field_info,
verbose=True,
)
# This should filter to python_debugging.pdf only
results = self_query_retriever.invoke(
"What debugging tools are mentioned in the debugging documentation?"
)
print(f"Self-query results: {len(results)}")
for doc in results:
    print(f" Source: {doc.metadata['source']}")
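Conceptually, the LLM's job is to split the request into a free-text query and a metadata filter, and the store then applies the filter. The sketch below models only the second half, with a hand-written structured query standing in for the LLM's output. The `StructuredQuery` class here is my own simplified illustration and supports equality only, whereas the real retriever also handles comparisons such as greater-than.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Simplified model of the LLM's output: free-text query plus metadata filter."""
    query: str
    filter: dict = field(default_factory=dict)

def apply_filter(docs, structured):
    """Keep docs whose metadata equals every key/value pair in the filter."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in structured.filter.items())
    ]

docs = [
    {"text": "reference counting", "metadata": {"source": "python_internals.pdf", "page": 1}},
    {"text": "memory_profiler and objgraph", "metadata": {"source": "python_debugging.pdf", "page": 5}},
]

# Hand-written stand-in for what the LLM would produce from the user's request
structured = StructuredQuery(
    query="debugging tools",
    filter={"source": "python_debugging.pdf"},
)

filtered = apply_filter(docs, structured)
print([d["metadata"]["source"] for d in filtered])
```

The similarity search then runs only over the filtered subset, so a document from the wrong source cannot be returned no matter how similar its text is.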
Strategy 10: Multi-Vector Retriever
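Store multiple embeddings per document, such as its summary and hypothetical questions it answers (and, optionally, smaller chunks of the document itself). A match on any of them returns the original document. The example below indexes summaries and generated questions.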
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
import uuid
mv_vectorstore = Chroma(
collection_name="multi_vector",
embedding_function=embeddings
)
mv_docstore = InMemoryStore()
mv_retriever = MultiVectorRetriever(
vectorstore=mv_vectorstore,
docstore=mv_docstore,
id_key="doc_id",
)
# Generate summaries for each document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
summary_chain = (
ChatPromptTemplate.from_messages([
("system", "Summarize this document in 2-3 sentences for retrieval purposes."),
("human", "{doc}"),
])
| llm
| StrOutputParser()
)
# Generate questions each document could answer
question_chain = (
ChatPromptTemplate.from_messages([
("system", "Generate 3 questions this document could answer. One per line."),
("human", "{doc}"),
])
| llm
| StrOutputParser()
)
# Add documents with multiple embedding types
all_embed_docs = []
doc_ids = []
for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_ids.append(doc_id)
    # Summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    summary_doc = Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    )
    # Questions embedding
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            question_doc = Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            )
            all_embed_docs.append(question_doc)
    all_embed_docs.append(summary_doc)
    # Store original document
    doc.metadata["doc_id"] = doc_id
mv_retriever.vectorstore.add_documents(all_embed_docs)
mv_retriever.docstore.mset(list(zip(doc_ids, raw_docs)))
results = mv_retriever.invoke("Python memory leak debugging tools")
print(f"MultiVector results: {len(results)}")
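The lookup that `id_key` enables can be sketched without any LangChain objects: several indexed entries point at one stored document, and retrieval collapses the matches back to unique originals. The dictionaries, the substring match, and the `multi_vector_lookup` helper below are simplified stand-ins for the vector store, the similarity search, and the retriever.

```python
# Several indexed entries can share one doc_id
vector_entries = [
    {"text": "Summary: tools for profiling memory", "doc_id": "d1", "embed_type": "summary"},
    {"text": "Which tool shows line-by-line memory usage?", "doc_id": "d1", "embed_type": "question"},
    {"text": "How many GC generations are there?", "doc_id": "d2", "embed_type": "question"},
]

# The docstore holds the full originals, keyed by the same id
docstore = {
    "d1": "Full text of the profiling document",
    "d2": "Full text of the garbage collector document",
}

def multi_vector_lookup(matched_entries, docstore, id_key="doc_id"):
    """Collapse matched entries to unique parent ids, then fetch the originals."""
    ids = []
    for entry in matched_entries:
        if entry[id_key] not in ids:
            ids.append(entry[id_key])
    return [docstore[i] for i in ids if i in docstore]

# Substring match stands in for the similarity search
matches = [e for e in vector_entries if "memory" in e["text"].lower()]
originals = multi_vector_lookup(matches, docstore)
print(len(matches), "entries matched ->", len(originals), "document returned")
```

Two entries match, but they share a `doc_id`, so the caller gets the full document once instead of two short derived texts.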
Strategy Comparison Table
| Strategy | Accuracy Gain | Added Latency | Added Cost | Best For |
|---|---|---|---|---|
| Basic Similarity | Baseline | Baseline | Baseline | Simple Q&A, prototypes |
| MMR | Slight (diversity) | ~5% | None | Redundant document collections |
| Score Threshold | Precision up | ~5% | None | When "no answer" is better than wrong answer |
| ParentDocumentRetriever | High (context) | ~10% | None | Technical docs, long content |
| MultiQueryRetriever | Medium | +1 LLM call | ~$0.001 | Query style mismatch |
| EnsembleRetriever | High (+12% avg) | ~20% | Minimal | Almost any production RAG |
| Compression Retriever | Medium | +1 LLM call | ~$0.003 | Large, noisy chunks |
| HyDE | High (Q&A tasks) | +1 LLM call | ~$0.002 | Short queries vs long docs |
| Self-Query | High (filtered) | +1 LLM call | ~$0.001 | Metadata-rich document collections |
| Multi-Vector | Very High | Slower indexing | Higher indexing | High-precision knowledge bases |
Recommended Starting Strategy
If I were starting a new RAG project today, this is the order I would add strategies:
- Start with EnsembleRetriever (BM25 + vector) — biggest single improvement, minimal complexity
- Add ParentDocumentRetriever — better context without extra LLM calls
- Add score threshold — prevents confidently wrong answers
- Add MultiQueryRetriever if accuracy is still low — especially if users ask questions in unpredictable ways
- Add HyDE if your queries are very short and documents are long
Do not add all of them at once. Each one adds complexity and some add latency. Add them incrementally and measure whether accuracy actually improves on your specific data.
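Measuring does not require heavy tooling to begin with. Here is a minimal sketch of two common retrieval metrics, hit rate at k and mean reciprocal rank, computed over a tiny hand-labeled set. The query ids, document ids, and rankings are invented for illustration; in practice you would label a few dozen real queries from your own corpus.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for q, ranked in results.items() if relevant[q] & set(ranked[:k]))
    return hits / len(results)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant doc (0 when none is retrieved)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical labels: which doc ids count as correct for each query
relevant = {"q1": {"leaks"}, "q2": {"gc"}}

# Hypothetical ranked outputs from two retrieval setups
baseline = {"q1": ["tools", "leaks", "gc"], "q2": ["leaks", "tools", "refcount"]}
ensemble = {"q1": ["leaks", "tools", "gc"], "q2": ["tools", "gc", "leaks"]}

for name, results in [("baseline", baseline), ("ensemble", ensemble)]:
    print(name, hit_rate_at_k(results, relevant, k=3), mrr(results, relevant))
```

Run the same labeled queries through each candidate retriever and keep a strategy only if these numbers move in the right direction.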
For evaluation tooling, the semantic search tutorial covers how to set up evaluation metrics, and the LangSmith guide shows how to run A/B comparisons between retrieval strategies.
What to Build Next
These retrieval strategies pair with the full pipeline covered in Build AI agent with LangChain. For building a complete document Q&A system, AI research agent build shows how to combine retrieval with an agent that can ask clarifying questions when the retrieved context is insufficient.
If you want to evaluate which strategy works best for your specific corpus, look into RAGAs evaluation framework — it gives you automated metrics for faithfulness, answer relevance, and context recall.
Conclusion
The difference between a RAG pipeline that frustrates users and one that genuinely helps them often comes down to retrieval quality. Basic similarity search is a starting point, not a finish line. The 10 strategies in this guide give you a toolkit to systematically improve retrieval accuracy for your specific use case.
My practical recommendation: implement EnsembleRetriever and ParentDocumentRetriever first. Together they address the two most common failure modes — phrasing mismatch and lost context — with minimal added complexity. Measure, then add more strategies only where you see persistent failures.
Questions about a specific retrieval failure pattern you are hitting? Drop a comment below with details about your use case.
FAQs
Which retrieval strategy gives the best accuracy improvement over basic similarity search? EnsembleRetriever combining BM25 and vector search consistently produces the largest accuracy gains in benchmarks — typically 10-20% improvement over either method alone. HyDE is the second strongest for question-answering tasks where the query style differs significantly from the document style.
When should I use ParentDocumentRetriever instead of standard chunking? Use ParentDocumentRetriever when your documents contain important context that spans multiple paragraphs. Standard small chunks often miss this inter-paragraph context. ParentDocumentRetriever scores based on small chunks (for precision) but returns larger parent chunks (for context), giving you the best of both approaches.
Is HyDE expensive to use compared to standard retrieval? Yes, HyDE adds one LLM call per query to generate the hypothetical document, roughly doubling retrieval latency and adding about $0.001-0.005 per query with a small model such as GPT-4o-mini; the gpt-4o model used in this guide's examples costs more per call. For latency-sensitive applications, use EnsembleRetriever instead. HyDE is worth the cost for batch processing or when accuracy is the priority.
AiTechWorlds Team