10 LangChain Retrieval Strategies for Better RAG Results

⚡ Quick Answer

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG #retrieval #vector search #advanced

Basic similarity search works. Until it does not. You embed the user's question, find the nearest document chunks, and hope the phrasing of the question matches the phrasing of the answer. In practice, a question like "what causes memory leaks in Python?" might match poorly against documentation that says "unreferenced objects prevent garbage collection" — the concepts are identical but the words are different.

That gap between query phrasing and document phrasing is where most RAG pipelines lose accuracy. The retrieval strategies in this guide close that gap using different techniques: query expansion, hybrid search, better chunking, and hypothetical document generation. I will give you working code for all 10, plus honest guidance on when each one is worth the added complexity.

If you are new to RAG, start with the RAG system tutorial first — this guide assumes you already have a basic retrieval pipeline running. The vector database guide is also worth reading to understand the storage layer these strategies build on.

Setup

pip install langchain langchain-openai langchain-community \
    chromadb rank-bm25 python-dotenv lark  # lark is needed by SelfQueryRetriever (Strategy 9)

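Base setup used across all examples. The langchain.retrievers and langchain.chains import paths used in this guide follow the 0.x package layout; LangChain 1.x moves these legacy modules into the separate langchain-classic package, so either pin langchain<1.0 or adjust the imports accordingly: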

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

load_dotenv()

# Sample documents for demonstration
raw_docs = [
    Document(page_content="""Python memory management relies on reference counting and a cyclic garbage collector.
    When an object's reference count reaches zero, CPython immediately deallocates it.
    However, circular references prevent the reference count from reaching zero.
    The cyclic garbage collector handles these circular references by periodically scanning for unreachable cycles.""",
    metadata={"source": "python_internals.pdf", "page": 1}),

    Document(page_content="""Memory leaks in Python occur when objects accumulate in memory without being freed.
    Common causes include: global variables holding references, closures capturing variables,
    event listeners not being removed, and circular references between objects.
    The tracemalloc module can help identify which code is allocating the most memory.""",
    metadata={"source": "python_internals.pdf", "page": 2}),

    Document(page_content="""Profiling Python applications for memory usage requires specialized tools.
    memory_profiler decorates functions to show line-by-line memory usage.
    objgraph visualizes object reference graphs to find unexpected references.
    The gc module provides access to the garbage collector and can force collection cycles.""",
    metadata={"source": "python_debugging.pdf", "page": 5}),

    Document(page_content="""Python's garbage collector uses a generational approach with three generations.
    New objects start in generation 0. Objects that survive a collection move to generation 1.
    Objects in generation 2 are considered long-lived. The collector runs most frequently on generation 0.""",
    metadata={"source": "python_internals.pdf", "page": 3}),
]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

Strategy 1: Basic Similarity Search (Baseline)

Start here so you have a baseline to compare against.

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=embeddings,
    collection_name="baseline"
)

basic_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

query = "what causes memory leaks in Python?"
results = basic_retriever.invoke(query)
print(f"Basic similarity — {len(results)} results")
for doc in results:
    print(f"  Source: {doc.metadata['source']} p.{doc.metadata['page']}")
    print(f"  Preview: {doc.page_content[:100]}...")

Strategy 2: MMR (Maximal Marginal Relevance)

Standard similarity search can return 3 documents that all say nearly the same thing. MMR balances relevance with diversity — it picks the next document that is both relevant to the query AND different from what was already selected.

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,           # documents to return
        "fetch_k": 10,    # candidates to consider before MMR reranking
        "lambda_mult": 0.7  # 0=max diversity, 1=max relevance
    }
)

mmr_results = mmr_retriever.invoke("Python memory management")
print(f"MMR results: {len(mmr_results)}")
# These should be more diverse than basic similarity results

When to use it: When you have a large number of documents with significant overlap (e.g., multiple pages from the same document saying similar things). MMR reduces redundancy in retrieved context.
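To make the relevance/diversity trade-off concrete, here is a minimal pure-Python sketch of the MMR selection loop. It is not LangChain's implementation: the 2-D vectors are made-up toy embeddings, and `mmr_select` is a hypothetical helper name. It assumes the usual scoring rule of `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy = similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

top_by_similarity = sorted(
    range(len(doc_vecs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)[:2]
top_by_mmr = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)

print(top_by_similarity)  # the two near-duplicates
print(top_by_mmr)         # one near-duplicate plus the different document
```

Plain similarity picks both near-duplicates, while MMR swaps the second one for the less similar document. Raising `lambda_mult` toward 1 pushes the result back toward plain similarity ranking.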

Strategy 3: Similarity Score Threshold

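Only return documents whose relevance score clears a minimum threshold. This keeps low-relevance chunks out of the prompt, which lowers the chance that the model builds an answer on unrelated context. Relevance scores are meant to fall between 0 and 1, but their distribution depends on the embedding model and the distance metric, so treat 0.75 as a starting point to tune on your own data.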

threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.75,  # only return docs above this similarity
        "k": 5
    }
)

results = threshold_retriever.invoke("Python memory leaks debugging")
print(f"Threshold retriever: {len(results)} results above 0.75 threshold")

# For out-of-scope queries, this returns fewer or no results
oos_results = threshold_retriever.invoke("JavaScript async await patterns")
print(f"Out-of-scope query: {len(oos_results)} results (should be 0 or 1)")
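The value of a threshold shows up in what you do when nothing passes it. The sketch below assumes you already have (text, score) pairs; the scores are invented for illustration, and `filter_by_score` and `build_context` are hypothetical helper names, not LangChain functions.

```python
def filter_by_score(scored_docs, score_threshold=0.75, k=5):
    """Keep at most k (text, score) pairs at or above the threshold."""
    kept = [pair for pair in scored_docs if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

def build_context(scored_docs, **kwargs):
    """Return joined context text, or None to signal 'no answer available'."""
    kept = filter_by_score(scored_docs, **kwargs)
    if not kept:
        return None
    return "\n\n".join(text for text, _ in kept)

# Hypothetical relevance scores, made up for this example.
in_scope = [
    ("Memory leaks occur when objects accumulate.", 0.86),
    ("tracemalloc identifies allocations.", 0.79),
    ("Generation 0 is collected most often.", 0.58),
]
out_of_scope = [
    ("Memory leaks occur when objects accumulate.", 0.41),
    ("Generation 0 is collected most often.", 0.37),
]

context = build_context(in_scope)
no_context = build_context(out_of_scope)

print(context)
print(no_context)  # None, so the caller can answer "I don't know" instead
```

Returning `None` lets the calling code skip the LLM call entirely and reply with a fallback message, which is usually the behavior you want for out-of-scope questions.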

Strategy 4: ParentDocumentRetriever

This is one of the most impactful strategies in practice. It stores small child chunks for scoring (precise matching) but returns larger parent chunks (more context).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger parent chunks (what gets returned)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Smaller child chunks (what gets indexed and scored)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
)

parent_vectorstore = Chroma(
    collection_name="parent_child",
    embedding_function=embeddings
)

# InMemoryStore holds the full parent documents
parent_store = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=parent_vectorstore,
    docstore=parent_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

parent_retriever.add_documents(raw_docs)

results = parent_retriever.invoke("Python garbage collector generations")
print(f"ParentDocumentRetriever: {len(results)} results")
for doc in results:
    print(f"  Content length: {len(doc.page_content)} chars (larger parent chunk)")

When to use it: Almost always better than basic chunking for technical documentation, legal documents, and any content where context spans multiple paragraphs.
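The child-to-parent lookup can be sketched without any vector store. In this simplified illustration a substring match stands in for embedding similarity, and `split_fixed` and `retrieve_parents` are hypothetical helpers; the point is only the indirection from a small matched chunk back to its full parent.

```python
def split_fixed(text, size):
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# The "docstore": full parent texts keyed by ID.
parents = {
    "p1": "Reference counting frees objects immediately. "
          "Cycles need the cyclic collector to run.",
    "p2": "The tracemalloc module reports allocations. "
          "objgraph draws reference graphs.",
}

# The "vector store": small child chunks, each tagged with its parent ID.
child_index = []
for parent_id, text in parents.items():
    for chunk in split_fixed(text, 40):
        child_index.append((chunk, parent_id))

def retrieve_parents(term, child_index, parents):
    """Match against child chunks, then return the de-duplicated parents."""
    parent_ids = []
    for chunk, parent_id in child_index:
        if term.lower() in chunk.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

results = retrieve_parents("tracemalloc", child_index, parents)
print(results)
```

The match happens on a 40-character child chunk, but the caller receives the whole parent, including the sentence about objgraph that the child chunk did not contain.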

Strategy 5: MultiQueryRetriever

The LLM rewrites the user's question into multiple different phrasings, runs all of them, and deduplicates the results. This catches documents that match one phrasing but not another.

from langchain.retrievers.multi_query import MultiQueryRetriever
import logging

# Enable logging to see generated queries
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=basic_retriever,
    llm=llm,
)

# The retriever automatically generates alternative phrasings
results = multi_query_retriever.invoke("what makes Python programs use too much memory?")

print(f"MultiQuery results: {len(results)} unique documents")
# The log output shows the generated alternative queries, something like:
# - "Python memory leak causes"
# - "unreferenced objects Python memory"
# - "Python program memory consumption increase"

When to use it: When your users phrase questions very differently from how your documents are written. Especially useful for customer support knowledge bases.
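The merge step is simple enough to sketch by hand. Below, a hard-coded list of rephrasings stands in for the LLM-generated queries, and `keyword_search` is a toy word-overlap search standing in for the vector store; both are assumptions for illustration only.

```python
def tokens(text):
    """Lowercased words longer than four characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 4}

corpus = [
    "Circular references keep objects alive until the cyclic collector runs.",
    "Global variables holding references cause memory leaks.",
    "Generation 0 is collected most frequently.",
]

def keyword_search(query):
    """Toy stand-in for a retriever: documents sharing any long word."""
    return [doc for doc in corpus if tokens(query) & tokens(doc)]

def multi_query_retrieve(queries, search_fn):
    """Run every query and return the union of results, first hit first."""
    seen, merged = set(), []
    for query in queries:
        for doc in search_fn(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

original = "why does my program use too much memory?"
# Hypothetical rephrasings an LLM might produce for the original question.
rephrasings = ["circular references between objects"]

single = keyword_search(original)
merged = multi_query_retrieve([original] + rephrasings, keyword_search)

print(len(single), len(merged))
```

The original question only reaches the document that mentions "memory"; the rephrasing pulls in the circular-references document as well, and the duplicate hit is dropped.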

Strategy 6: EnsembleRetriever (Hybrid Search)

Combines vector search (semantic) with BM25 keyword search (lexical). This is consistently the strongest single upgrade you can make to a basic RAG pipeline.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 is a keyword-based retriever — no embeddings needed
bm25_retriever = BM25Retriever.from_documents(
    raw_docs,
    k=3
)

# Vector-based retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine them with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]   # equal weight; tune based on your data
)

results = ensemble_retriever.invoke("tracemalloc memory profiling")
print(f"Ensemble results: {len(results)} documents")
# BM25 will find "tracemalloc" exactly; vector search adds semantic context

A 2024 benchmarking study on RAG systems found hybrid retrieval (BM25 + vector) outperformed pure vector search by an average of 12% on NDCG@10 across multiple datasets. It is one of the highest-ROI improvements you can make.
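Reciprocal Rank Fusion itself is only a few lines. This sketch assumes the common weighted form, where each retriever contributes `weight / (rank + c)` for every document it returns, with `c = 60` as the conventional constant. The document IDs and rankings are invented, and `weighted_rrf` is a hypothetical helper name.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Made-up rankings from the two retrievers for one query.
bm25_ranking = ["tracemalloc_page", "profiling_page", "gc_page"]
vector_ranking = ["profiling_page", "leaks_page", "tracemalloc_page"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

Because RRF works on ranks rather than raw scores, it does not matter that BM25 scores and cosine similarities live on different scales. A document that both retrievers rank reasonably well ends up ahead of one that only a single retriever found.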

Strategy 7: Contextual Compression Retriever

Retrieve chunks, then use an LLM to compress each chunk down to only the parts relevant to the query. Cuts noise from retrieved context.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor uses an LLM to extract only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=basic_retriever,
)

results = compression_retriever.invoke("how does cyclic garbage collection work?")

print(f"Compression retriever: {len(results)} results")
for doc in results:
    print(f"\nCompressed content ({len(doc.page_content)} chars):")
    print(doc.page_content)
    # The content should be much shorter than the original chunk
    # and contain only text relevant to cyclic garbage collection

When to use it: When your chunks are large and contain a lot of off-topic content relative to the query. Reduces hallucination risk from irrelevant context.
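To see what "compressing a chunk" means in practice, here is a deliberately crude extractive sketch. A keyword-overlap filter stands in for the LLM, so it is far less capable than LLMChainExtractor; `compress` is a hypothetical helper and the chunk text is adapted from the sample documents.

```python
import re

def words(text):
    """Lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compress(chunk, query, min_overlap=1):
    """Keep only sentences that share enough longer words with the query."""
    query_terms = {w for w in words(query) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(query_terms & words(s)) >= min_overlap]
    return " ".join(kept)

chunk = (
    "Python memory management relies on reference counting. "
    "The cyclic garbage collector scans for unreachable cycles. "
    "memory_profiler shows line-by-line usage."
)

compressed = compress(chunk, "how does cyclic garbage collection work?")
print(compressed)
print(f"{len(chunk)} chars -> {len(compressed)} chars")
```

Only the sentence about the cyclic collector survives. An LLM-based extractor makes the same kind of cut but can recognize relevance without exact word overlap, at the price of one model call per retrieved chunk.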

Strategy 8: HyDE (Hypothetical Document Embeddings)

Instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query, then embed that and search with it. This works because the hypothetical document has the same phrasing style as real documents.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# HyDE embeddings — generates hypothetical docs before embedding
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search",  # from_llm needs a built-in prompt key or a custom_prompt
)

# Build a vector store with HyDE embeddings
hyde_vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=hyde_embeddings,  # note: uses the HyDE embedder
    collection_name="hyde"
)

hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 3})

# At query time, the question "what causes memory leaks" gets turned into
# a hypothetical answer like "Memory leaks in Python are caused by..."
# That hypothetical answer is then embedded and used for similarity search
results = hyde_retriever.invoke("what causes memory leaks in Python?")
print(f"HyDE results: {len(results)}")

When to use it: When queries are short questions and documents are longer technical explanations. The embedding space mismatch between a 5-word question and a 200-word answer paragraph causes poor retrieval — HyDE fixes this by making the query look like an answer.
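The effect is easy to demonstrate with a toy model. In the sketch below, a bag-of-words counter stands in for a real embedding model and `fake_llm` returns a canned hypothetical answer; both are assumptions for illustration, and real dense embeddings narrow the gap less dramatically than word counts do.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. A real system would use a dense model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fake_llm(question):
    """Hypothetical stand-in for the LLM call that drafts an answer."""
    return ("Unreferenced objects that are never freed prevent garbage "
            "collection and accumulate in memory.")

doc = "Unreferenced objects prevent garbage collection, so they accumulate in memory."
question = "what causes memory leaks in Python?"

direct_score = cosine(embed(question), embed(doc))
hyde_score = cosine(embed(fake_llm(question)), embed(doc))

print(f"question vs doc:            {direct_score:.2f}")
print(f"hypothetical answer vs doc: {hyde_score:.2f}")
```

The drafted answer does not need to be factually correct to help. It only needs to land closer to the real documents than the bare question does.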

Strategy 9: Self-Query Retriever

The LLM translates natural language filter conditions into structured metadata filters. "Documents from 2025 about Python" becomes a vector search + metadata filter {"year": 2025, "topic": "Python"}.

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Describe your metadata fields to the LLM
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source PDF file name",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="Page number in the source document",
        type="integer",
    ),
]

document_content_description = "Technical documentation about Python internals and debugging"

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,
)

# This should filter to python_debugging.pdf only
results = self_query_retriever.invoke(
    "What debugging tools are mentioned in the debugging documentation?"
)
print(f"Self-query results: {len(results)}")
for doc in results:
    print(f"  Source: {doc.metadata['source']}")

Strategy 10: Multi-Vector Retriever

Store multiple embeddings per document: the document itself, its summary, and hypothetical questions it answers. Retrieval checks all three.

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
import uuid

mv_vectorstore = Chroma(
    collection_name="multi_vector",
    embedding_function=embeddings
)

mv_docstore = InMemoryStore()

mv_retriever = MultiVectorRetriever(
    vectorstore=mv_vectorstore,
    docstore=mv_docstore,
    id_key="doc_id",
)

# Generate summaries for each document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

summary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Summarize this document in 2-3 sentences for retrieval purposes."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Generate questions each document could answer
question_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Generate 3 questions this document could answer. One per line."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Add documents with multiple embedding types
all_embed_docs = []
doc_ids = []

for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_ids.append(doc_id)

    # Original text embedding, so retrieval can also match the document itself
    all_embed_docs.append(Document(
        page_content=doc.page_content,
        metadata={"doc_id": doc_id, "embed_type": "original"}
    ))

    # Summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    all_embed_docs.append(Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    ))

    # Question embeddings (one vector per generated question)
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            all_embed_docs.append(Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            ))

    # Tag the original document with its ID before storing it
    doc.metadata["doc_id"] = doc_id

# Every embedded variant points back to its full document through doc_id
mv_retriever.vectorstore.add_documents(all_embed_docs)
mv_retriever.docstore.mset(list(zip(doc_ids, raw_docs)))

results = mv_retriever.invoke("Python memory leak debugging tools")
print(f"MultiVector results: {len(results)}")

Strategy Comparison Table

Strategy	Accuracy Gain	Added Latency	Added Cost	Best For
Basic Similarity	Baseline	Baseline	Baseline	Simple Q&A, prototypes
MMR	Slight (diversity)	~5%	None	Redundant document collections
Score Threshold	Precision up	~5%	None	When "no answer" is better than wrong answer
ParentDocumentRetriever	High (context)	~10%	None	Technical docs, long content
MultiQueryRetriever	Medium	+1 LLM call	~$0.001	Query style mismatch
EnsembleRetriever	High (+12% avg)	~20%	Minimal	Almost any production RAG
Compression Retriever	Medium	+1 LLM call	~$0.003	Large, noisy chunks
HyDE	High (Q&A tasks)	+1 LLM call	~$0.002	Short queries vs long docs
Self-Query	High (filtered)	+1 LLM call	~$0.001	Metadata-rich document collections
Multi-Vector	Very High	Slower indexing	Higher indexing	High-precision knowledge bases

Recommended Starting Strategy

If I were starting a new RAG project today, this is the order I would add strategies:

1. Start with EnsembleRetriever (BM25 + vector): biggest single improvement, minimal complexity
2. Add ParentDocumentRetriever: better context without extra LLM calls
3. Add score threshold: prevents confidently wrong answers
4. Add MultiQueryRetriever if accuracy is still low, especially if users ask questions in unpredictable ways
5. Add HyDE if your queries are very short and documents are long

Do not add all of them at once. Each one adds complexity and some add latency. Add them incrementally and measure whether accuracy actually improves on your specific data.
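Measuring does not require heavy tooling to get started. A small labeled set of (question, expected document ID) pairs and a hit-rate function are enough for a first comparison. In this sketch the two retrievers are canned lookups with invented results, and `hit_rate_at_k` is a hypothetical helper; in a real project each `retrieve` function would wrap one of the retrievers above and return document IDs.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=3):
    """Fraction of queries whose expected document appears in the top k."""
    hits = sum(
        1 for query, relevant_id in labeled_queries
        if relevant_id in retrieve(query)[:k]
    )
    return hits / len(labeled_queries)

labeled_queries = [
    ("memory leak causes", "doc_leaks"),
    ("tracemalloc usage", "doc_leaks"),
    ("gc generations", "doc_generations"),
]

# Invented results standing in for two retrieval strategies.
baseline_results = {
    "memory leak causes": ["doc_leaks", "doc_refcount", "doc_profiling"],
    "tracemalloc usage": ["doc_profiling", "doc_refcount", "doc_generations"],
    "gc generations": ["doc_generations", "doc_refcount", "doc_leaks"],
}
hybrid_results = {
    "memory leak causes": ["doc_leaks", "doc_profiling", "doc_refcount"],
    "tracemalloc usage": ["doc_leaks", "doc_profiling", "doc_refcount"],
    "gc generations": ["doc_generations", "doc_refcount", "doc_leaks"],
}

baseline_score = hit_rate_at_k(baseline_results.get, labeled_queries)
hybrid_score = hit_rate_at_k(hybrid_results.get, labeled_queries)

print(f"baseline hit rate@3: {baseline_score:.2f}")
print(f"hybrid hit rate@3:   {hybrid_score:.2f}")
```

Run the same labeled set after each change you make to the pipeline. If the number does not move, the added strategy is costing latency or money without paying for itself on your data.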

For evaluation tooling, the semantic search tutorial covers how to set up evaluation metrics, and the LangSmith guide (linked below) shows how to run A/B comparisons between retrieval strategies.

What to Build Next

These retrieval strategies pair with the full pipeline covered in Build AI agent with LangChain. For building a complete document Q&A system, AI research agent build shows how to combine retrieval with an agent that can ask clarifying questions when the retrieved context is insufficient.

If you want to evaluate which strategy works best for your specific corpus, look into RAGAs evaluation framework — it gives you automated metrics for faithfulness, answer relevance, and context recall.

Conclusion

The difference between a RAG pipeline that frustrates users and one that genuinely helps them often comes down to retrieval quality. Basic similarity search is a starting point, not a finish line. The 10 strategies in this guide give you a toolkit to systematically improve retrieval accuracy for your specific use case.

My practical recommendation: implement EnsembleRetriever and ParentDocumentRetriever first. Together they address the two most common failure modes — phrasing mismatch and lost context — with minimal added complexity. Measure, then add more strategies only where you see persistent failures.

Questions about a specific retrieval failure pattern you are hitting? Drop a comment below with details about your use case.

FAQs

Which retrieval strategy gives the best accuracy improvement over basic similarity search? EnsembleRetriever combining BM25 and vector search consistently produces the largest accuracy gains in benchmarks — typically 10-20% improvement over either method alone. HyDE is the second strongest for question-answering tasks where the query style differs significantly from the document style.

When should I use ParentDocumentRetriever instead of standard chunking? Use ParentDocumentRetriever when your documents contain important context that spans multiple paragraphs. Standard small chunks often miss this inter-paragraph context. ParentDocumentRetriever scores based on small chunks (for precision) but returns larger parent chunks (for context), giving you the best of both approaches.

Is HyDE expensive to use compared to standard retrieval? Yes. HyDE adds one LLM call per query to generate the hypothetical document, which roughly doubles retrieval latency and adds about $0.001-0.005 per query, depending on the model and the length of the generated text. For latency-sensitive applications, use EnsembleRetriever instead. HyDE is worth the cost for batch processing or when accuracy is the priority.

Langchain

10 LangChain Retrieval Strategies for Better RAG Results

⚡ Quick Answer

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

AiTechWorlds Team May 31, 2026 13 min read

#LangChain #RAG #retrieval #vector search #advanced

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Setup

pip install langchain langchain-openai langchain-community \
    chromadb rank-bm25 python-dotenv

Base setup used across all examples:

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

load_dotenv()

# Sample documents for demonstration
raw_docs = [
    Document(page_content="""Python memory management relies on reference counting and a cyclic garbage collector.
    When an object's reference count reaches zero, CPython immediately deallocates it.
    However, circular references prevent the reference count from reaching zero.
    The cyclic garbage collector handles these circular references by periodically scanning for unreachable cycles.""",
    metadata={"source": "python_internals.pdf", "page": 1}),

    Document(page_content="""Memory leaks in Python occur when objects accumulate in memory without being freed.
    Common causes include: global variables holding references, closures capturing variables,
    event listeners not being removed, and circular references between objects.
    The tracemalloc module can help identify which code is allocating the most memory.""",
    metadata={"source": "python_internals.pdf", "page": 2}),

    Document(page_content="""Profiling Python applications for memory usage requires specialized tools.
    memory_profiler decorates functions to show line-by-line memory usage.
    objgraph visualizes object reference graphs to find unexpected references.
    The gc module provides access to the garbage collector and can force collection cycles.""",
    metadata={"source": "python_debugging.pdf", "page": 5}),

    Document(page_content="""Python's garbage collector uses a generational approach with three generations.
    New objects start in generation 0. Objects that survive a collection move to generation 1.
    Objects in generation 2 are considered long-lived. The collector runs most frequently on generation 0.""",
    metadata={"source": "python_internals.pdf", "page": 3}),
]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

Strategy 1: Basic Similarity Search (Baseline)

Start here so you have a baseline to compare against.

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=embeddings,
    collection_name="baseline"
)

basic_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

query = "what causes memory leaks in Python?"
results = basic_retriever.invoke(query)
print(f"Basic similarity — {len(results)} results")
for doc in results:
    print(f"  Source: {doc.metadata['source']} p.{doc.metadata['page']}")
    print(f"  Preview: {doc.page_content[:100]}...")

Strategy 2: MMR (Maximal Marginal Relevance)

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,           # documents to return
        "fetch_k": 10,    # candidates to consider before MMR reranking
        "lambda_mult": 0.7  # 0=max diversity, 1=max relevance
    }
)

mmr_results = mmr_retriever.invoke("Python memory management")
print(f"MMR results: {len(mmr_results)}")
# These should be more diverse than basic similarity results

Strategy 3: Similarity Score Threshold

Only return documents above a confidence threshold. Prevents hallucinations from low-relevance retrievals.

threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.75,  # only return docs above this similarity
        "k": 5
    }
)

results = threshold_retriever.invoke("Python memory leaks debugging")
print(f"Threshold retriever: {len(results)} results above 0.75 threshold")

# For out-of-scope queries, this returns fewer or no results
oos_results = threshold_retriever.invoke("JavaScript async await patterns")
print(f"Out-of-scope query: {len(oos_results)} results (should be 0 or 1)")

Strategy 4: ParentDocumentRetriever

This is one of the most impactful strategies in practice. It stores small child chunks for scoring (precise matching) but returns larger parent chunks (more context).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Larger parent chunks (what gets returned)
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)

# Smaller child chunks (what gets indexed and scored)
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
)

parent_vectorstore = Chroma(
    collection_name="parent_child",
    embedding_function=embeddings
)

# InMemoryStore holds the full parent documents
parent_store = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=parent_vectorstore,
    docstore=parent_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

parent_retriever.add_documents(raw_docs)

results = parent_retriever.invoke("Python garbage collector generations")
print(f"ParentDocumentRetriever: {len(results)} results")
for doc in results:
    print(f"  Content length: {len(doc.page_content)} chars (larger parent chunk)")

When to use it: Almost always better than basic chunking for technical documentation, legal documents, and any content where context spans multiple paragraphs.

Strategy 5: MultiQueryRetriever

The LLM rewrites the user's question into multiple different phrasings, runs all of them, and deduplicates the results. This catches documents that match one phrasing but not another.

from langchain.retrievers.multi_query import MultiQueryRetriever
import logging

# Enable logging to see generated queries
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=basic_retriever,
    llm=llm,
)

# The retriever automatically generates alternative phrasings
results = multi_query_retriever.invoke("what makes Python programs use too much memory?")

print(f"MultiQuery results: {len(results)} unique documents")
# The log output shows the generated alternative queries, something like:
# - "Python memory leak causes"
# - "unreferenced objects Python memory"
# - "Python program memory consumption increase"

When to use it: When your users phrase questions very differently from how your documents are written. Especially useful for customer support knowledge bases.

Strategy 6: EnsembleRetriever (Hybrid Search)

Combines vector search (semantic) with BM25 keyword search (lexical). This is consistently the strongest single upgrade you can make to a basic RAG pipeline.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 is a keyword-based retriever — no embeddings needed
bm25_retriever = BM25Retriever.from_documents(
    raw_docs,
    k=3
)

# Vector-based retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine them with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]   # equal weight; tune based on your data
)

results = ensemble_retriever.invoke("tracemalloc memory profiling")
print(f"Ensemble results: {len(results)} documents")
# BM25 will find "tracemalloc" exactly; vector search adds semantic context

Strategy 7: Contextual Compression Retriever

Retrieve chunks, then use an LLM to compress each chunk down to only the parts relevant to the query. Cuts noise from retrieved context.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor uses an LLM to extract only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=basic_retriever,
)

results = compression_retriever.invoke("how does cyclic garbage collection work?")

print(f"Compression retriever: {len(results)} results")
for doc in results:
    print(f"\nCompressed content ({len(doc.page_content)} chars):")
    print(doc.page_content)
    # The content should be much shorter than the original chunk
    # and contain only text relevant to cyclic garbage collection

When to use it: When your chunks are large and contain a lot of off-topic content relative to the query. Reduces hallucination risk from irrelevant context.

Strategy 8: HyDE (Hypothetical Document Embeddings)

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

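# HyDE embedder: generates a hypothetical document for each query before embedding
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search",  # from_llm needs a prompt_key or a custom_prompt
)

# Build a vector store with the HyDE embedder. Documents are still embedded
# with base_embeddings; only queries go through the hypothetical-answer step
hyde_vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=hyde_embeddings,  # note: uses the HyDE embedder
    collection_name="hyde"
)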

hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 3})

# At query time, the question "what causes memory leaks" gets turned into
# a hypothetical answer like "Memory leaks in Python are caused by..."
# That hypothetical answer is then embedded and used for similarity search
results = hyde_retriever.invoke("what causes memory leaks in Python?")
print(f"HyDE results: {len(results)}")
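
Why does embedding a made-up answer help? A rough sketch with a toy bag-of-words "embedding" shows the effect. Everything here is a stand-in: `toy_embed` is not a real embedding model, and `fake_llm_answer` is a hypothetical replacement for the LLM call.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fake_llm_answer(question):
    """Hypothetical stand-in for the LLM that drafts an answer."""
    return ("Unreferenced objects and reference cycles prevent "
            "garbage collection and leak memory.")

docs = [
    "Unreferenced objects prevent garbage collection when reference cycles exist.",
    "Use virtual environments to isolate project dependencies.",
]

question = "what causes memory leaks in Python?"
direct_score = cosine(toy_embed(question), toy_embed(docs[0]))
hyde_score = cosine(toy_embed(fake_llm_answer(question)), toy_embed(docs[0]))

print(f"question vs doc:            {direct_score:.2f}")
print(f"hypothetical answer vs doc: {hyde_score:.2f}")
```

The question shares no vocabulary with the relevant document, while the hypothetical answer does. Real embeddings are far less literal than word counts, but the same pull toward answer-like phrasing is what HyDE relies on.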

Strategy 9: Self-Query Retriever

# SelfQueryRetriever needs the lark package to parse the generated filter:
#   pip install lark
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Describe your metadata fields to the LLM
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The source PDF file name",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="Page number in the source document",
        type="integer",
    ),
]

document_content_description = "Technical documentation about Python internals and debugging"

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,
)

# This should filter to python_debugging.pdf only
results = self_query_retriever.invoke(
    "What debugging tools are mentioned in the debugging documentation?"
)
print(f"Self-query results: {len(results)}")
for doc in results:
    print(f"  Source: {doc.metadata['source']}")
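
Conceptually, the LLM splits the user's question into a semantic query plus a metadata filter, and the filter narrows the candidates before similarity search runs. The sketch below imitates that split in plain Python. The `StructuredQuery` class, the `apply_filters` helper, and the `python_internals.pdf` file name are all hypothetical, used only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Hypothetical shape of what the LLM extracts from a question."""
    query: str
    filters: dict = field(default_factory=dict)

def apply_filters(docs, filters):
    """Keep only documents whose metadata matches every filter exactly."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in filters.items())
    ]

docs = [
    {"text": "pdb lets you step through code.",
     "metadata": {"source": "python_debugging.pdf", "page": 3}},
    {"text": "tracemalloc tracks allocations.",
     "metadata": {"source": "python_debugging.pdf", "page": 7}},
    {"text": "The GIL serializes bytecode execution.",
     "metadata": {"source": "python_internals.pdf", "page": 2}},
]

# What the LLM might produce for the question in the example above
parsed = StructuredQuery(
    query="debugging tools",
    filters={"source": "python_debugging.pdf"},
)

candidates = apply_filters(docs, parsed.filters)
for doc in candidates:
    print(doc["metadata"]["source"], "-", doc["text"])
```

Only the filtered candidates would then be ranked against `parsed.query`. If the LLM picks the wrong filter, relevant documents are excluded entirely, so it is worth checking the verbose output while you tune the field descriptions.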

Strategy 10: Multi-Vector Retriever

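Store multiple embeddings per document: typically the document itself, its summary, and hypothetical questions it answers. A match on any of them returns the full original document. The example below indexes summaries and generated questions, and keeps the originals in a separate docstore.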

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
import uuid

mv_vectorstore = Chroma(
    collection_name="multi_vector",
    embedding_function=embeddings
)

mv_docstore = InMemoryStore()

mv_retriever = MultiVectorRetriever(
    vectorstore=mv_vectorstore,
    docstore=mv_docstore,
    id_key="doc_id",
)

# Generate summaries for each document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

summary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Summarize this document in 2-3 sentences for retrieval purposes."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Generate questions each document could answer
question_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Generate 3 questions this document could answer. One per line."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Add documents with multiple embedding types
all_embed_docs = []
doc_ids = []

for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_ids.append(doc_id)

    # Summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    summary_doc = Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    )

    # Questions embedding
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            question_doc = Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            )
            all_embed_docs.append(question_doc)

    all_embed_docs.append(summary_doc)

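    # Tag the original document with the same ID so it can be traced back
    doc.metadata["doc_id"] = doc_id

# Index the summaries and questions; store the originals in the docstore by ID
mv_retriever.vectorstore.add_documents(all_embed_docs)
mv_retriever.docstore.mset(list(zip(doc_ids, raw_docs)))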

results = mv_retriever.invoke("Python memory leak debugging tools")
print(f"MultiVector results: {len(results)}")
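
At query time the retriever searches the small "child" entries (summaries and questions), then follows each hit's `doc_id` back to the full document. A minimal sketch of that lookup, assuming a plain dict as the docstore and dicts as hits (the `resolve_parents` helper and the sample data are illustrative, not library code):

```python
def resolve_parents(child_hits, docstore, id_key="doc_id"):
    """Map ranked child hits to unique parent documents, keeping rank order."""
    parent_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"][id_key]
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Skip IDs that are missing from the docstore instead of failing
    return [docstore[pid] for pid in parent_ids if pid in docstore]

# Hypothetical docstore and search hits
docstore = {
    "a1": "Full text of the memory profiling guide",
    "b2": "Full text of the garbage collection guide",
}
child_hits = [
    {"page_content": "Which tools find memory leaks?",
     "metadata": {"doc_id": "a1", "embed_type": "question"}},
    {"page_content": "Summary of the profiling guide",
     "metadata": {"doc_id": "a1", "embed_type": "summary"}},
    {"page_content": "How does the cyclic collector work?",
     "metadata": {"doc_id": "b2", "embed_type": "question"}},
]

parents = resolve_parents(child_hits, docstore)
print(parents)
```

Two hits pointing at the same parent collapse into one result, so the number of documents returned can be lower than the number of child matches.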

Strategy Comparison Table

Strategy	Accuracy Gain	Added Latency	Added Cost	Best For
Basic Similarity	Baseline	Baseline	Baseline	Simple Q&A, prototypes
MMR	Slight (diversity)	~5%	None	Redundant document collections
Score Threshold	Precision up	~5%	None	When "no answer" is better than wrong answer
ParentDocumentRetriever	High (context)	~10%	None	Technical docs, long content
MultiQueryRetriever	Medium	+1 LLM call	~$0.001	Query style mismatch
EnsembleRetriever	High (+12% avg)	~20%	Minimal	Almost any production RAG
Compression Retriever	Medium	+1 LLM call per retrieved chunk	~$0.003	Large, noisy chunks
HyDE	High (Q&A tasks)	+1 LLM call	~$0.002	Short queries vs long docs
Self-Query	High (filtered)	+1 LLM call	~$0.001	Metadata-rich document collections
Multi-Vector	Very High	Slower indexing	Higher indexing	High-precision knowledge bases

Recommended Starting Strategy

If I were starting a new RAG project today, this is the order in which I would add strategies:

1. Start with EnsembleRetriever (BM25 + vector): biggest single improvement, minimal complexity
2. Add ParentDocumentRetriever: better context without extra LLM calls
3. Add score threshold: prevents confidently wrong answers
4. Add MultiQueryRetriever if accuracy is still low, especially if users ask questions in unpredictable ways
5. Add HyDE if your queries are very short and documents are long

Do not add all of them at once. Each one adds complexity and some add latency. Add them incrementally and measure whether accuracy actually improves on your specific data.
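
A small labeled test set is enough to start measuring. The sketch below computes hit rate at k and mean reciprocal rank (MRR) for two retrievers. The query IDs, document IDs, and rankings are invented for illustration; in practice you would fill `results` from your own retrievers and `relevant` from hand-labeled answers.

```python
def hit_rate_at_k(results, relevant, k=3):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for query, ranked in results.items()
        if any(doc_id in relevant[query] for doc_id in ranked[:k])
    )
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant document (0 if none found)."""
    total = 0.0
    for query, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical labels and retriever outputs
relevant = {"q1": {"d1"}, "q2": {"d4"}}
baseline = {"q1": ["d2", "d1", "d3"], "q2": ["d5", "d6", "d7"]}
ensemble = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4", "d6"]}

for name, results in [("baseline", baseline), ("ensemble", ensemble)]:
    print(name,
          f"hit@3={hit_rate_at_k(results, relevant):.2f}",
          f"MRR={mean_reciprocal_rank(results, relevant):.2f}")
```

Run the same questions through each configuration and only keep a strategy if these numbers move on your data.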

For evaluation tooling, the semantic search tutorial covers how to set up evaluation metrics, and the LangSmith guide (linked below) shows how to run A/B comparisons between retrieval strategies.

Conclusion

Questions about a specific retrieval failure pattern you are hitting? Drop a comment below with details about your use case.
