5 LangChain Vector Store Retrievers (MMR, Similarity, Compression)

⚡ Quick Answer

Master LangChain's 5 core retriever types — SimilaritySearch, MMR, ContextualCompression, MultiVectorRetriever, and SelfQueryRetriever — with code, benchmarks, and guidance.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #vector store #retrieval #MMR #RAG

Choosing the right retriever is the decision that most affects your RAG system's answer quality — more than model choice, more than prompt engineering, in many cases more than the quality of the documents themselves. I have seen teams swap from basic similarity search to MMR and watch answer diversity jump immediately. I have also seen teams add contextual compression to a system and cut hallucinations noticeably.

This guide covers five retrievers in practical detail: what they actually do under the hood, when to use them, and real code you can drop into your pipeline. We will also benchmark them side by side on the same queries so you can see the actual differences in output.

If you want to understand the broader retrieval landscape before diving into these specifics, the RAG system tutorial gives a good foundation. The vector database guide covers the storage layer these retrievers sit on top of.

Setup

pip install langchain langchain-openai langchain-community \
    chromadb rank-bm25 python-dotenv lark

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# A realistic document corpus — multiple chunks with some overlap
raw_docs = [
    Document(
        page_content="Python's asyncio library provides tools for writing concurrent code using coroutines. "
                     "The event loop runs coroutines and handles I/O operations efficiently. "
                     "async def creates a coroutine function, await suspends execution until a result is ready.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Async/await in Python enables writing non-blocking I/O code that looks synchronous. "
                     "The asyncio.gather() function runs multiple coroutines concurrently. "
                     "aiohttp is a popular third-party library for making async HTTP requests.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Python threading uses OS threads and suits I/O-bound tasks, because the GIL is released during blocking I/O. "
                     "The GIL (Global Interpreter Lock) prevents true CPU parallelism in Python threads. "
                     "For CPU-bound work, use multiprocessing instead of threading.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="Python multiprocessing spawns separate processes, each with their own Python interpreter. "
                     "This bypasses the GIL limitation for CPU-intensive tasks. "
                     "ProcessPoolExecutor provides a high-level interface for parallel CPU work.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="FastAPI is an async web framework built on Starlette and Pydantic. "
                     "It uses Python type hints for automatic request validation and documentation. "
                     "Async route handlers using async def run on the asyncio event loop.",
        metadata={"topic": "python", "subtopic": "web", "difficulty": "beginner"}
    ),
    Document(
        page_content="SQLAlchemy async support allows database queries without blocking the event loop. "
                     "The AsyncSession class wraps the standard Session for use with asyncio. "
                     "AsyncEngine and create_async_engine set up async database connections.",
        metadata={"topic": "python", "subtopic": "database", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Concurrency patterns in Python: use asyncio for I/O-bound concurrent tasks, "
                     "threading for I/O-bound parallel tasks needing shared state, "
                     "and multiprocessing for CPU-bound parallel computation.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "intermediate"}
    ),
]

# Build the vector store
vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=embeddings,
    collection_name="python_docs"
)

Retriever 1: SimilaritySearch

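The baseline. It embeds the query and returns the k documents whose embeddings are closest to it. The exact distance metric depends on how the vector store is configured; cosine similarity is the usual mental model.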

similarity_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

query = "how does async work in Python?"
results = similarity_retriever.invoke(query)

print(f"Similarity search: {len(results)} results for '{query}'")
for i, doc in enumerate(results):
    print(f"\n  [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:120]}...")

What you will typically see: the top results are dominated by asyncio-related chunks because they match the query most closely. This is correct, but the results can be repetitive when the corpus has several near-identical asyncio chunks.
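
To make the ranking step concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors and their labels are hypothetical stand-ins for real embeddings, and `cosine_similarity` and `top_k` are illustrative helpers, not LangChain functions. It shows why near-duplicate chunks crowd the top of the list.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (label, score) pairs with the highest similarity."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings: the two asyncio chunks are near-duplicates.
toy_vectors = {
    "asyncio-1": [0.9, 0.1, 0.0],
    "asyncio-2": [0.88, 0.12, 0.0],
    "threading": [0.3, 0.9, 0.1],
    "database": [0.4, 0.2, 0.8],
}
query_vec = [1.0, 0.0, 0.0]

for label, score in top_k(query_vec, toy_vectors, k=2):
    print(f"{label}: {score:.3f}")
```

Both slots go to the asyncio chunks, which is the redundancy that MMR is designed to reduce.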

With Score Threshold

threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.78,
        "k": 5
    }
)

results = threshold_retriever.invoke("Python async HTTP requests")
print(f"Threshold retriever: {len(results)} results above 0.78 threshold")

# For out-of-scope queries, this returns fewer results
oos_results = threshold_retriever.invoke("JavaScript promises and callbacks")
print(f"Out-of-scope: {len(oos_results)} results (should be 0 or 1)")

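The threshold prevents the retriever from returning weakly relevant documents. LangChain applies it to a relevance score normalized to the 0 to 1 range, so the right cutoff depends on your embedding model and the vector store's distance metric; treat 0.78 as a starting point to tune, not a universal value. This is worth adding even to basic pipelines, because returning nothing is usually better than returning unrelated content that invites hallucinations.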
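
The filtering logic itself is simple. The sketch below assumes relevance scores already normalized to the 0 to 1 range; the scores and the `filter_by_threshold` helper are hypothetical and only illustrate the behaviour, not LangChain's internals.

```python
def filter_by_threshold(scored_docs, score_threshold, k):
    """Keep at most k documents whose relevance score meets the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical relevance scores for an in-scope and an out-of-scope query.
in_scope = [("asyncio chunk", 0.86), ("FastAPI chunk", 0.80), ("multiprocessing chunk", 0.61)]
out_of_scope = [("asyncio chunk", 0.55), ("FastAPI chunk", 0.49)]

print(filter_by_threshold(in_scope, score_threshold=0.78, k=5))
print(filter_by_threshold(out_of_scope, score_threshold=0.78, k=5))
```

The out-of-scope query yields an empty list, so the downstream prompt can say "I don't know" instead of being fed unrelated chunks.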

Retriever 2: MMR (Maximal Marginal Relevance)

MMR solves the redundancy problem in similarity search. When multiple documents are very similar to each other and to the query, basic similarity search returns all of them. MMR picks documents that are both relevant AND diverse from each other.

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,           # final documents to return
        "fetch_k": 10,    # candidates to consider before MMR selection
        "lambda_mult": 0.6  # balance: 0=max diversity, 1=max relevance
    }
)

query = "Python concurrent programming"
mmr_results = mmr_retriever.invoke(query)

print(f"MMR retriever: {len(mmr_results)} results")
for i, doc in enumerate(mmr_results):
    print(f"\n  [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:120]}...")

What you will typically see with MMR: Instead of three asyncio documents, you might get asyncio, threading, and FastAPI — more diverse coverage of "concurrent programming" topics. The diversity is especially valuable when you have a broad query that could be answered from multiple angles.
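
For intuition about what the selection step is doing, here is a small sketch of greedy MMR in plain Python. The toy vectors are hypothetical, and `mmr_select` is an illustrative helper, not the LangChain implementation: each round picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidates, k, lambda_mult):
    """Greedy MMR over a dict of label -> vector; returns the selected labels in order."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        best_label, best_score = None, -math.inf
        for label, vec in remaining.items():
            relevance = cosine(query_vec, vec)
            # Redundancy is 0 for the first pick, when nothing is selected yet.
            redundancy = max((cosine(vec, candidates[s]) for s in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_label, best_score = label, score
        selected.append(best_label)
        del remaining[best_label]
    return selected

# Hypothetical toy embeddings: two near-duplicate asyncio chunks and one threading chunk.
toy_vectors = {
    "asyncio-1": [0.9, 0.1, 0.0],
    "asyncio-2": [0.85, 0.15, 0.0],
    "threading": [0.5, 0.0, 0.85],
}
query_vec = [1.0, 0.0, 0.0]

for lambda_mult in (0.9, 0.3):
    print(lambda_mult, mmr_select(query_vec, toy_vectors, k=2, lambda_mult=lambda_mult))
```

With these toy vectors, the high-lambda run keeps both asyncio chunks, while the low-lambda run swaps the second one for the threading chunk.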

Tuning the Lambda Parameter

# Compare different lambda values on the same query
for lambda_val in [0.2, 0.5, 0.8]:
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": lambda_val}
    )
    results = retriever.invoke("Python concurrency")
    subtopics = [doc.metadata["subtopic"] for doc in results]
    print(f"lambda={lambda_val}: subtopics = {subtopics}")

# lambda=0.2: high diversity, may include less relevant docs
# lambda=0.5: balanced (usually best default)
# lambda=0.8: similar to basic similarity, less diverse

Retriever 3: ContextualCompressionRetriever

This retriever has two stages. First, it retrieves chunks using any base retriever. Then, an LLM compresses each chunk to only the parts directly relevant to the query. The result: shorter, more focused context with less noise.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The extractor uses an LLM to pull relevant content from each chunk
extractor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=extractor,
    base_retriever=similarity_retriever,
)

query = "how does the GIL affect Python threading?"
results = compression_retriever.invoke(query)

print(f"Compression retriever: {len(results)} results")
for i, doc in enumerate(results):
    print(f"\n  [{i+1}] Compressed content ({len(doc.page_content)} chars):")
    print(f"  {doc.page_content}")
    # Should show only the GIL-relevant sentence, not the full chunk
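
If the two-stage shape is unclear, the following sketch shows it without any API calls. Note the loud assumption: `keyword_extractor` is a crude keyword-overlap stand-in for the LLM extractor, and both function names are hypothetical. Only the control flow (retrieve, compress each chunk, drop chunks that compress to nothing) mirrors what the retriever does.

```python
import re

def keyword_extractor(query, chunk):
    """Stand-in for the LLM extractor: keep sentences that share a keyword with the query."""
    stopwords = {"how", "does", "the", "a", "an", "in", "of", "for", "to", "is"}
    keywords = {w for w in re.findall(r"[a-z]+", query.lower()) if w not in stopwords}
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)

def compress(query, retrieved_chunks):
    """Stage 2: compress each retrieved chunk and drop the ones left empty."""
    compressed = (keyword_extractor(query, chunk) for chunk in retrieved_chunks)
    return [text for text in compressed if text]

# Stage 1 output is assumed here: two chunks returned by some base retriever.
retrieved = [
    "Python threading uses OS threads. The GIL prevents true CPU parallelism in Python threads. "
    "For CPU-bound work, use multiprocessing instead.",
    "FastAPI is an async web framework built on Starlette and Pydantic.",
]

print(compress("how does the GIL affect threading?", retrieved))
```

The multiprocessing sentence and the whole FastAPI chunk are dropped, so less off-topic text reaches the prompt. A real LLM extractor judges relevance semantically, not by shared words.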

Using LLMChainFilter Instead

LLMChainFilter makes a binary decision (keep or drop) rather than extracting specific text. It is faster but less precise:

from langchain.retrievers.document_compressors import LLMChainFilter

filter_compressor = LLMChainFilter.from_llm(llm)

filter_retriever = ContextualCompressionRetriever(
    base_compressor=filter_compressor,
    base_retriever=similarity_retriever,
)

# This filters out chunks that don't actually address the query
results = filter_retriever.invoke("how to handle database connections asynchronously?")
print(f"Filter retriever: {len(results)} chunks passed the filter")
for doc in results:
    print(f"  Subtopic: {doc.metadata['subtopic']}")

EmbeddingsFilter — Cheaper Alternative

If LLM-based compression is too expensive for your use case, EmbeddingsFilter does the filtering using embedding similarity instead of an LLM call:

from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)

cheap_compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=similarity_retriever,
)

results = cheap_compression_retriever.invoke("asyncio event loop")
print(f"Embeddings filter: {len(results)} results")

Retriever 4: MultiVectorRetriever

Standard vector stores index one embedding per document chunk. MultiVectorRetriever indexes multiple embeddings per document — for example, the original text, a summary, and hypothetical questions the document answers. Retrieval checks all of these embeddings, so the document is findable via any matching phrasing.

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import uuid

# Separate vector store and document store for MultiVector
mv_vectorstore = Chroma(
    collection_name="multi_vector",
    embedding_function=embeddings
)
mv_docstore = InMemoryStore()

mv_retriever = MultiVectorRetriever(
    vectorstore=mv_vectorstore,
    docstore=mv_docstore,
    id_key="doc_id",
)

# Chain to generate summaries
summary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write a concise 1-2 sentence summary of this text for search indexing."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Chain to generate hypothetical questions
question_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write 2 short questions that this text directly answers. One per line."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Index documents with multiple embeddings
all_index_docs = []
doc_id_map = []

for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_id_map.append(doc_id)

    # Generate and store summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    summary_doc = Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    )
    all_index_docs.append(summary_doc)

    # Generate and store question embeddings
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            q_doc = Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            )
            all_index_docs.append(q_doc)

    # Tag original doc with its ID
    doc.metadata["doc_id"] = doc_id

# Add index docs to vector store, originals to doc store
mv_vectorstore.add_documents(all_index_docs)
mv_docstore.mset(zip(doc_id_map, raw_docs))

# Query — will find documents via summary or question matches
results = mv_retriever.invoke("non-blocking database operations")
print(f"MultiVector results: {len(results)}")
for doc in results:
    print(f"  Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:100]}...")

The beauty of this approach: a user who asks "non-blocking database operations" might not match the exact document text ("AsyncSession class wraps the standard Session"), but the generated question "How do you run database queries without blocking the event loop?" is a much stronger match.
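
The lookup path can be sketched in a few lines of plain Python. Everything here is illustrative: `word_overlap` is a toy replacement for embedding similarity, and the IDs and texts are hypothetical. The point is the indirection through `doc_id`: the search runs over the small index entries, and the retriever returns the parent documents they point to.

```python
def word_overlap(query, text):
    """Toy relevance score: number of shared lowercase whitespace-separated tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def multi_vector_lookup(query, index_entries, docstore, id_key="doc_id", k=2):
    """Search the index entries, then return the de-duplicated parent documents."""
    ranked = sorted(index_entries, key=lambda entry: word_overlap(query, entry["text"]), reverse=True)
    parent_ids = []
    for entry in ranked[:k]:
        if entry[id_key] not in parent_ids:
            parent_ids.append(entry[id_key])
    return [docstore[parent_id] for parent_id in parent_ids]

# Parent documents live in the docstore, keyed by ID.
docstore = {
    "doc-sql": "The AsyncSession class wraps the standard Session for use with asyncio.",
    "doc-gil": "The GIL prevents true CPU parallelism in Python threads.",
}

# Index entries (generated questions and summaries) point back to their parent.
index_entries = [
    {"doc_id": "doc-sql", "text": "How do you run database queries without blocking the event loop?"},
    {"doc_id": "doc-sql", "text": "Summary: async database access with SQLAlchemy."},
    {"doc_id": "doc-gil", "text": "Why do Python threads not speed up CPU-bound work?"},
]

query = "non-blocking database queries"
print(word_overlap(query, docstore["doc-sql"]))  # the parent text itself shares no tokens
print(multi_vector_lookup(query, index_entries, docstore))
```

Two index entries match, but both point to the same parent, so the document is returned once.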

Retriever 5: SelfQueryRetriever

This retriever lets the LLM translate natural language filter conditions into structured metadata filters automatically. A query like "beginner Python async tutorials" generates both a vector search AND a metadata filter {"difficulty": "beginner"}.

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Describe the metadata fields to the LLM
metadata_field_info = [
    AttributeInfo(
        name="topic",
        description="The main programming topic. One of: python, javascript, rust",
        type="string",
    ),
    AttributeInfo(
        name="subtopic",
        description="The specific subtopic within the topic. Examples: asyncio, concurrency, web, database",
        type="string",
    ),
    AttributeInfo(
        name="difficulty",
        description="The difficulty level. One of: beginner, intermediate, advanced",
        type="string",
    ),
]

document_content_description = "Technical documentation about Python programming topics"

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,   # shows the generated filter in logs
)

# These queries should generate metadata filters automatically
test_queries = [
    "What are the beginner-level Python concepts?",
    "Show me advanced Python content about concurrency",
    "Find intermediate content about web frameworks",
]

for query in test_queries:
    print(f"\nQuery: {query}")
    results = self_query_retriever.invoke(query)
    for doc in results:
        print(f"  [{doc.metadata['difficulty']}] {doc.metadata['subtopic']}: {doc.page_content[:80]}...")
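
To see what the generated filter does once it exists, here is a minimal sketch. It assumes the LLM step has already produced a structured query; the dictionary shape, the sample documents, and `apply_structured_query` are hypothetical. In the real retriever the filter is translated into the vector store's own filter syntax and applied alongside the vector search.

```python
def apply_structured_query(structured, docs):
    """Keep only documents whose metadata matches every equality filter."""
    filters = structured.get("filter", {})
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]

docs = [
    {"text": "FastAPI is an async web framework.",
     "metadata": {"subtopic": "web", "difficulty": "beginner"}},
    {"text": "asyncio provides an event loop.",
     "metadata": {"subtopic": "asyncio", "difficulty": "intermediate"}},
    {"text": "The GIL limits CPU parallelism in threads.",
     "metadata": {"subtopic": "concurrency", "difficulty": "advanced"}},
]

# Assumed LLM output for the request "beginner Python async tutorials".
structured = {"query": "Python async tutorials", "filter": {"difficulty": "beginner"}}

matches = apply_structured_query(structured, docs)
print([doc["metadata"]["subtopic"] for doc in matches])
```

Only the beginner document survives the filter. The remaining candidates would then be ranked against the `query` text by the usual vector search, which is not shown here.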

Side-by-Side Comparison

Let me run all five retrievers on the same query and compare:

query = "Python async and threading differences"

retrievers = {
    "similarity": similarity_retriever,
    "mmr": mmr_retriever,
    "compression": compression_retriever,
    "multi_vector": mv_retriever,
    "self_query": self_query_retriever,
}

print(f"Query: '{query}'\n{'='*60}")

for name, retriever in retrievers.items():
    try:
        results = retriever.invoke(query)
        subtopics = [doc.metadata.get("subtopic", "unknown") for doc in results]
        print(f"\n{name.upper()} ({len(results)} results):")
        print(f"  Subtopics: {subtopics}")
        print(f"  First result ({len(results[0].page_content)} chars): {results[0].page_content[:120]}...")
    except Exception as e:
        print(f"\n{name.upper()}: Error — {e}")

Retriever Performance Comparison

Retriever	Typical Latency	Extra Cost	Best For
SimilaritySearch	Fast (1 embed call)	None	Simple Q&A, homogeneous corpus
MMR	Slightly slower (1 embed + ranking)	None	Diverse corpus, redundancy issues
Score Threshold	Fast (1 embed call)	None	Preventing low-confidence retrievals
ContextualCompression (LLM)	Slow (1 embed + k LLM calls)	$0.001-0.01 per query	Large noisy chunks
ContextualCompression (Embeddings)	Medium (k embed calls)	Minimal	Budget-conscious noise reduction
MultiVectorRetriever	Medium indexing, fast query	Higher indexing cost	Query-document phrasing mismatch
SelfQueryRetriever	Medium (1 extra LLM call)	~$0.001 per query	Metadata-rich collections

Combining Retrievers

You can chain and combine these retrievers. Here is a pattern I use often in production: EnsembleRetriever combining MMR with BM25, then wrapped in contextual compression:

from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

# BM25 for keyword matching
bm25 = BM25Retriever.from_documents(raw_docs, k=3)

# MMR for semantic diversity
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 8, "lambda_mult": 0.6}
)

# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[bm25, mmr],
    weights=[0.4, 0.6]   # slight preference for semantic
)

# Add embedding-based compression to filter noise
embeddings_compressor = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)

final_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_compressor,
    base_retriever=ensemble,
)

results = final_retriever.invoke("async Python database queries")
print(f"Combined retriever: {len(results)} results")
for doc in results:
    print(f"  {doc.metadata['subtopic']}: {doc.page_content[:100]}...")
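
The fusion step is worth understanding on its own. Below is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. The rankings are hypothetical, `weighted_rrf` is an illustrative helper, and the constant `c=60` is the value commonly used for RRF, which is an assumption here, not something read from your installed version. Each retriever contributes `weight / (rank + c)` for every document it returns.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings; each list contributes weight / (rank + c) per document."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a keyword retriever and a semantic retriever.
bm25_ranking = ["database", "asyncio-1", "web"]
mmr_ranking = ["asyncio-1", "database", "concurrency"]

fused = weighted_rrf([bm25_ranking, mmr_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that both retrievers return rise to the top, and the weights break ties in favour of the semantic ranking. Because only ranks are used, the incompatible score scales of BM25 and embedding similarity never have to be reconciled.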

Wiring a Retriever Into a RAG Chain

Once you have chosen and configured your retriever, wiring it into a Q&A chain is straightforward:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n---\n".join(doc.page_content for doc in docs)

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a technical assistant. Answer the question based on the provided context.
If the context does not contain enough information to answer, say so.

Context:
{context}"""),
    ("human", "{question}"),
])

# Swap out the retriever here based on your needs
active_retriever = mmr_retriever   # or any other retriever from this guide

rag_chain = (
    RunnableParallel({
        "context": active_retriever | format_docs,
        "question": RunnablePassthrough(),
    })
    | qa_prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What should I use for CPU-intensive parallel work in Python?")
print(answer)

Common Mistakes

Using only basic similarity search in production — The investment to add MMR and a score threshold is small and almost always improves output quality, especially for technical knowledge bases where multiple chunks overlap significantly.

Setting k too high — Passing 10 chunks to the LLM when 3-4 would suffice wastes tokens and can dilute the relevant content. Start with k=3, measure accuracy, and only increase it if retrieval is missing important context.

Not testing out-of-scope queries — Add score_threshold and test your retriever with questions clearly outside your corpus. A retriever that confidently returns irrelevant results for off-topic queries is worse than one that returns nothing.

Using LLM compression for every query — ContextualCompressionRetriever with LLMChainExtractor is powerful but adds one LLM call per retrieved chunk. For high-traffic systems, use EmbeddingsFilter instead.

What to Build Next

These retrievers feed into the broader RAG architecture covered in the RAG system tutorial. For building an agent that uses retrieval as a tool rather than as a direct chain, the Build AI agent with LangChain guide shows that pattern. If you are evaluating retrieval quality systematically, the LangSmith debugging and tracing guide covers how to set up evaluation datasets for retrieval benchmarking.

For production retrieval quality, also look at the semantic search tutorial for embedding optimization techniques that work alongside the retriever strategies here.

Conclusion

The right retriever for your project depends on your corpus characteristics and query patterns. If your documents have a lot of similar content, MMR prevents redundant results. If your chunks are long and noisy, contextual compression reduces irrelevant content. If users phrase questions differently from how documents are written, MultiVectorRetriever closes that gap.

My recommendation for most projects starting out: basic similarity with a score threshold, then add MMR when you notice redundancy, then add EnsembleRetriever with BM25 when you want the best accuracy for production. Each addition is incremental and measurable.

Build something, measure retrieval quality with LangSmith, and let the data guide which retriever upgrades are worth the added complexity for your specific use case.

FAQs

What is the difference between MMR and similarity search in LangChain? Similarity search returns the top-k documents most similar to the query, which can result in redundant, nearly identical results. MMR (Maximal Marginal Relevance) adds a diversity constraint — each new document must be both relevant to the query AND different from already-selected documents. MMR is better when your corpus has many similar chunks that would otherwise dominate the results.

When should I use ContextualCompressionRetriever? Use ContextualCompressionRetriever when your chunks contain significant off-topic content relative to the user's query. It retrieves chunks normally, then uses an LLM to extract only the relevant portions. This reduces noise in the context window, which tends to improve answer quality — but it adds one LLM call per retrieved chunk, so use it selectively.

Can I combine multiple retriever types together? Yes. EnsembleRetriever lets you combine any two or more retrievers with configurable weights using Reciprocal Rank Fusion. A common pattern is combining an MMR vector retriever with a BM25 keyword retriever for hybrid search. You can even nest retrievers — for example, wrapping an EnsembleRetriever with a ContextualCompressionRetriever.

Langchain

5 LangChain Vector Store Retrievers (MMR, Similarity, Compression)

⚡ Quick Answer

Master LangChain's 5 core retriever types — SimilaritySearch, MMR, ContextualCompression, MultiVectorRetriever, and SelfQueryRetriever — with code, benchmarks, and guidance.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #vector store #retrieval #MMR #RAG

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Setup

pip install langchain langchain-openai langchain-community \
    chromadb rank-bm25 python-dotenv

import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# A realistic document corpus — multiple chunks with some overlap
raw_docs = [
    Document(
        page_content="Python's asyncio library provides tools for writing concurrent code using coroutines. "
                     "The event loop runs coroutines and handles I/O operations efficiently. "
                     "async def creates a coroutine function, await suspends execution until a result is ready.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Async/await in Python enables writing non-blocking I/O code that looks synchronous. "
                     "The asyncio.gather() function runs multiple coroutines concurrently. "
                     "aiohttp is the standard library for making async HTTP requests.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Python threading uses OS threads for true parallelism with I/O-bound tasks. "
                     "The GIL (Global Interpreter Lock) prevents true CPU parallelism in Python threads. "
                     "For CPU-bound work, use multiprocessing instead of threading.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="Python multiprocessing spawns separate processes, each with their own Python interpreter. "
                     "This bypasses the GIL limitation for CPU-intensive tasks. "
                     "ProcessPoolExecutor provides a high-level interface for parallel CPU work.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="FastAPI is an async web framework built on Starlette and Pydantic. "
                     "It uses Python type hints for automatic request validation and documentation. "
                     "Async route handlers using async def run on the asyncio event loop.",
        metadata={"topic": "python", "subtopic": "web", "difficulty": "beginner"}
    ),
    Document(
        page_content="SQLAlchemy async support allows database queries without blocking the event loop. "
                     "The AsyncSession class wraps the standard Session for use with asyncio. "
                     "AsyncEngine and create_async_engine set up async database connections.",
        metadata={"topic": "python", "subtopic": "database", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Concurrency patterns in Python: use asyncio for I/O-bound concurrent tasks, "
                     "threading for I/O-bound parallel tasks needing shared state, "
                     "and multiprocessing for CPU-bound parallel computation.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "intermediate"}
    ),
]

# Build the vector store
vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=embeddings,
    collection_name="python_docs"
)

Retriever 1: SimilaritySearch

The baseline. Embeds the query and returns the k documents with highest cosine similarity.

similarity_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

query = "how does async work in Python?"
results = similarity_retriever.invoke(query)

print(f"Similarity search: {len(results)} results for '{query}'")
for i, doc in enumerate(results):
    print(f"\n  [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:120]}...")

With Score Threshold

threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.78,
        "k": 5
    }
)

results = threshold_retriever.invoke("Python async HTTP requests")
print(f"Threshold retriever: {len(results)} results above 0.78 threshold")

# For out-of-scope queries, this returns fewer results
oos_results = threshold_retriever.invoke("JavaScript promises and callbacks")
print(f"Out-of-scope: {len(oos_results)} results (should be 0 or 1)")

Retriever 2: MMR (Maximal Marginal Relevance)

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,           # final documents to return
        "fetch_k": 10,    # candidates to consider before MMR selection
        "lambda_mult": 0.6  # balance: 0=max diversity, 1=max relevance
    }
)

query = "Python concurrent programming"
mmr_results = mmr_retriever.invoke(query)

print(f"MMR retriever: {len(mmr_results)} results")
for i, doc in enumerate(mmr_results):
    print(f"\n  [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:120]}...")

Tuning the Lambda Parameter

# Compare different lambda values on the same query
for lambda_val in [0.2, 0.5, 0.8]:
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": lambda_val}
    )
    results = retriever.invoke("Python concurrency")
    subtopics = [doc.metadata["subtopic"] for doc in results]
    print(f"lambda={lambda_val}: subtopics = {subtopics}")

# lambda=0.2: high diversity, may include less relevant docs
# lambda=0.5: balanced (usually best default)
# lambda=0.8: similar to basic similarity, less diverse
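The effect of lambda_mult is easier to see in a stripped-down version of the selection loop. The sketch below is a greedy MMR over toy vectors, assuming cosine similarity for both relevance and redundancy. The function names and vectors are hypothetical, and it is meant to show the idea rather than reproduce LangChain's implementation line for line.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, cand_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: pick k candidates, trading relevance against redundancy."""
    relevance = [cosine(query_vec, v) for v in cand_vecs]
    selected = []
    remaining = list(range(len(cand_vecs)))

    while remaining and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected

# Hypothetical candidates: index 1 is a near-duplicate of index 0
candidates = [[1.0, 0.0], [0.98, 0.2], [0.5, 0.87]]
query_vec = [1.0, 0.0]

print(mmr_select(query_vec, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query_vec, candidates, k=2, lambda_mult=0.3))  # favours diversity
```

With lambda_mult=1.0 the near-duplicate is picked second. With 0.3 the redundancy penalty pushes it out in favour of the more distant candidate.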

Retriever 3: ContextualCompressionRetriever

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The extractor uses an LLM to pull relevant content from each chunk
extractor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=extractor,
    base_retriever=similarity_retriever,
)

query = "how does the GIL affect Python threading?"
results = compression_retriever.invoke(query)

print(f"Compression retriever: {len(results)} results")
for i, doc in enumerate(results):
    print(f"\n  [{i+1}] Compressed content ({len(doc.page_content)} chars):")
    print(f"  {doc.page_content}")
    # Should show only the GIL-relevant sentence, not the full chunk

Using LLMChainFilter Instead

LLMChainFilter makes a binary decision (keep or drop) rather than extracting specific text. It is faster but less precise:

from langchain.retrievers.document_compressors import LLMChainFilter

filter_compressor = LLMChainFilter.from_llm(llm)

filter_retriever = ContextualCompressionRetriever(
    base_compressor=filter_compressor,
    base_retriever=similarity_retriever,
)

# This filters out chunks that don't actually address the query
results = filter_retriever.invoke("how to handle database connections asynchronously?")
print(f"Filter retriever: {len(results)} chunks passed the filter")
for doc in results:
    print(f"  Subtopic: {doc.metadata['subtopic']}")

EmbeddingsFilter — Cheaper Alternative

If LLM-based compression is too expensive for your use case, EmbeddingsFilter does the filtering using embedding similarity instead of an LLM call:

from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)

cheap_compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=similarity_retriever,
)

results = cheap_compression_retriever.invoke("asyncio event loop")
print(f"Embeddings filter: {len(results)} results")
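Conceptually, the embeddings filter is a threshold applied after retrieval. The sketch below shows that idea on already-embedded chunks. The function name, the (text, vector) pair layout, and the toy vectors are assumptions for illustration; the real EmbeddingsFilter embeds the documents itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_by_similarity(query_vec, embedded_chunks, similarity_threshold=0.76):
    """Keep only chunks whose embedding is close enough to the query."""
    return [
        text
        for text, vec in embedded_chunks
        if cosine(query_vec, vec) >= similarity_threshold
    ]

# Hypothetical retrieved chunks with toy 2-D embeddings
retrieved = [
    ("asyncio event loop basics", [0.9, 0.1]),
    ("CSS grid layout", [0.1, 0.9]),
]

kept = filter_by_similarity([1.0, 0.0], retrieved, similarity_threshold=0.76)
print(kept)
```

Unlike the LLM-based compressors, this keeps or drops whole chunks and never shortens them, which is why it is cheaper but coarser.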

Retriever 4: MultiVectorRetriever

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import uuid

# Separate vector store and document store for MultiVector
mv_vectorstore = Chroma(
    collection_name="multi_vector",
    embedding_function=embeddings
)
mv_docstore = InMemoryStore()

mv_retriever = MultiVectorRetriever(
    vectorstore=mv_vectorstore,
    docstore=mv_docstore,
    id_key="doc_id",
)

# Chain to generate summaries
summary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write a concise 1-2 sentence summary of this text for search indexing."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Chain to generate hypothetical questions
question_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write 2 short questions that this text directly answers. One per line."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)

# Index documents with multiple embeddings
all_index_docs = []
doc_id_map = []

for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_id_map.append(doc_id)

    # Generate and store summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    summary_doc = Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    )
    all_index_docs.append(summary_doc)

    # Generate and store question embeddings
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            q_doc = Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            )
            all_index_docs.append(q_doc)

    # Tag original doc with its ID
    doc.metadata["doc_id"] = doc_id

# Add index docs to vector store, originals to doc store
mv_vectorstore.add_documents(all_index_docs)
mv_docstore.mset(list(zip(doc_id_map, raw_docs)))  # mset expects a sequence of (key, value) pairs

# Query — will find documents via summary or question matches
results = mv_retriever.invoke("non-blocking database operations")
print(f"MultiVector results: {len(results)}")
for doc in results:
    print(f"  Subtopic: {doc.metadata['subtopic']}")
    print(f"  Preview: {doc.page_content[:100]}...")
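The query step is worth spelling out: the vector search matches summaries and questions, but what comes back is the original document. Here is a rough pure-Python sketch of that lookup, with plain dicts standing in for the vector store hits and the document store. The function name and sample data are hypothetical.

```python
def multi_vector_lookup(hits, docstore, id_key="doc_id"):
    """Map index hits (summaries, questions) back to unique parent documents."""
    parent_ids = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        # Several hits can point at the same parent; keep first-seen order
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids if pid in docstore]

# Hypothetical search hits: two of them point at the same parent document
hits = [
    {"page_content": "How do I run queries without blocking?", "metadata": {"doc_id": "a"}},
    {"page_content": "Summary of async database drivers.", "metadata": {"doc_id": "a"}},
    {"page_content": "Summary of thread pools.", "metadata": {"doc_id": "b"}},
]
docstore = {"a": "Full chunk on async database access", "b": "Full chunk on threading"}

print(multi_vector_lookup(hits, docstore))
```

Because duplicates collapse to one parent, the number of documents returned can be smaller than the number of vector hits.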

Retriever 5: SelfQueryRetriever

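# SelfQueryRetriever depends on the lark parser, which the setup step does not install:
#   pip install lark
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo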

# Describe the metadata fields to the LLM
metadata_field_info = [
    AttributeInfo(
        name="topic",
        description="The main programming topic. One of: python, javascript, rust",
        type="string",
    ),
    AttributeInfo(
        name="subtopic",
        description="The specific subtopic within the topic. Examples: asyncio, concurrency, web, database",
        type="string",
    ),
    AttributeInfo(
        name="difficulty",
        description="The difficulty level. One of: beginner, intermediate, advanced",
        type="string",
    ),
]

document_content_description = "Technical documentation about Python programming topics"

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,   # shows the generated filter in logs
)

# These queries should generate metadata filters automatically
test_queries = [
    "What are the beginner-level Python concepts?",
    "Show me advanced Python content about concurrency",
    "Find intermediate content about web frameworks",
]

for query in test_queries:
    print(f"\nQuery: {query}")
    results = self_query_retriever.invoke(query)
    for doc in results:
        print(f"  [{doc.metadata['difficulty']}] {doc.metadata['subtopic']}: {doc.page_content[:80]}...")
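It helps to picture what the LLM produces behind the scenes: a semantic search string plus a metadata filter. The sketch below hand-writes one such structured query and applies it to toy documents. The StructuredQuery class here, the filter format, and the sample data are all hypothetical simplifications, and the filter an actual LLM generates for a given query may differ.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Simplified stand-in for the query + filter the LLM generates."""
    query: str
    filters: dict = field(default_factory=dict)

def apply_filters(docs, structured):
    """Keep documents whose metadata matches every filter exactly."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in structured.filters.items())
    ]

# Hypothetical corpus
toy_corpus = [
    {"text": "Intro to lists", "metadata": {"topic": "python", "difficulty": "beginner"}},
    {"text": "GIL internals", "metadata": {"topic": "python", "difficulty": "advanced"}},
    {"text": "Borrow checker", "metadata": {"topic": "rust", "difficulty": "advanced"}},
]

# Hand-written example of what "Show me advanced Python content about concurrency"
# could be turned into
structured = StructuredQuery(
    query="concurrency",
    filters={"topic": "python", "difficulty": "advanced"},
)

matches = apply_filters(toy_corpus, structured)
print([d["text"] for d in matches])
```

In the real retriever the filter is passed to the vector store, and the semantic search for the query string runs only over the documents that match it.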

Side-by-Side Comparison

Let me run all five retrievers on the same query and compare:

query = "Python async and threading differences"

retrievers = {
    "similarity": similarity_retriever,
    "mmr": mmr_retriever,
    "compression": compression_retriever,
    "multi_vector": mv_retriever,
    "self_query": self_query_retriever,
}

print(f"Query: '{query}'\n{'='*60}")

for name, retriever in retrievers.items():
    try:
        results = retriever.invoke(query)
        subtopics = [doc.metadata.get("subtopic", "unknown") for doc in results]
        print(f"\n{name.upper()} ({len(results)} results):")
        print(f"  Subtopics: {subtopics}")
        if results:
            print(f"  First result ({len(results[0].page_content)} chars): {results[0].page_content[:120]}...")
        else:
            print("  No results returned")
    except Exception as e:
        print(f"\n{name.upper()}: Error — {e}")

Retriever Performance Comparison

Retriever	Typical Latency	Extra Cost	Best For
SimilaritySearch	Fast (1 embed call)	None	Simple Q&A, homogeneous corpus
MMR	Slightly slower (1 embed + ranking)	None	Diverse corpus, redundancy issues
Score Threshold	Fast (1 embed call)	None	Preventing low-confidence retrievals
ContextualCompression (LLM)	Slow (1 embed + k LLM calls)	$0.001-0.01 per query	Large noisy chunks
ContextualCompression (Embeddings)	Medium (k embed calls)	Minimal	Budget-conscious noise reduction
MultiVectorRetriever	Medium indexing, fast query	Higher indexing cost	Query-document phrasing mismatch
SelfQueryRetriever	Medium (1 extra LLM call)	~$0.001 per query	Metadata-rich collections

Combining Retrievers

You can chain and combine these retrievers. Here is a pattern I use often in production: EnsembleRetriever combining MMR with BM25, then wrapped in contextual compression:

from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

# BM25 for keyword matching
bm25 = BM25Retriever.from_documents(raw_docs, k=3)

# MMR for semantic diversity
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 8, "lambda_mult": 0.6}
)

# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[bm25, mmr],
    weights=[0.4, 0.6]   # slight preference for semantic
)

# Add embedding-based compression to filter noise
embeddings_compressor = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)

final_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_compressor,
    base_retriever=ensemble,
)

results = final_retriever.invoke("async Python database queries")
print(f"Combined retriever: {len(results)} results")
for doc in results:
    print(f"  {doc.metadata['subtopic']}: {doc.page_content[:100]}...")
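The fusion step is simple enough to sketch in a few lines. Below is a weighted Reciprocal Rank Fusion over two ranked lists of document IDs, assuming 1-based ranks and the commonly used smoothing constant of 60. The function name and the sample rankings are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers
bm25_ranking = ["a", "b", "c"]
mmr_ranking = ["b", "d", "a"]

fused = weighted_rrf([bm25_ranking, mmr_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to rise above documents that only one retriever found. The weights tilt the outcome toward whichever retriever you trust more.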

Wiring a Retriever Into a RAG Chain

Once you have chosen and configured your retriever, wiring it into a Q&A chain is straightforward:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n---\n".join(doc.page_content for doc in docs)

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a technical assistant. Answer the question based on the provided context.
If the context does not contain enough information to answer, say so.

Context:
{context}"""),
    ("human", "{question}"),
])

# Swap out the retriever here based on your needs
active_retriever = mmr_retriever   # or any other retriever from this guide

rag_chain = (
    RunnableParallel({
        "context": active_retriever | format_docs,
        "question": RunnablePassthrough(),
    })
    | qa_prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What should I use for CPU-intensive parallel work in Python?")
print(answer)

Common Mistakes

What to Build Next

For production retrieval quality, also look at the semantic search tutorial, which covers embedding optimization techniques that work alongside the retriever strategies here.

Conclusion

Build something, measure retrieval quality with LangSmith, and let the data guide which retriever upgrades are worth the added complexity for your specific use case.

FAQs
