5 LangChain Vector Store Retrievers (MMR, Similarity, Compression)
Master LangChain's 5 core retriever types (SimilaritySearch, MMR, ContextualCompression, MultiVectorRetriever, and SelfQueryRetriever) with code, a side-by-side comparison, and guidance.
Choosing the right retriever is often the decision that most affects your RAG system's answer quality: more than model choice, more than prompt engineering, and in many cases more than the quality of the documents themselves. I have seen teams swap from basic similarity search to MMR and watch answer diversity jump immediately. I have also seen teams add contextual compression to a system and cut hallucinations noticeably.
This guide covers five retrievers in practical detail: what they actually do under the hood, when to use them, and real code you can drop into your pipeline. We will also run them side by side on the same query so you can see the actual differences in output.
If you want to understand the broader retrieval landscape before diving into these specifics, the RAG system tutorial gives a good foundation. The vector database guide covers the storage layer these retrievers sit on top of.
Setup
# lark is required by SelfQueryRetriever (Retriever 5)
pip install langchain langchain-openai langchain-community \
    chromadb rank-bm25 lark python-dotenv
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
load_dotenv()
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# A realistic document corpus — multiple chunks with some overlap
raw_docs = [
    Document(
        page_content="Python's asyncio library provides tools for writing concurrent code using coroutines. "
                     "The event loop runs coroutines and handles I/O operations efficiently. "
                     "async def creates a coroutine function, await suspends execution until a result is ready.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Async/await in Python enables writing non-blocking I/O code that looks synchronous. "
                     "The asyncio.gather() function runs multiple coroutines concurrently. "
                     "aiohttp is a popular third-party library for making async HTTP requests.",
        metadata={"topic": "python", "subtopic": "asyncio", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Python threading uses OS threads, which works well for I/O-bound tasks that spend time waiting. "
                     "The GIL (Global Interpreter Lock) prevents true CPU parallelism in Python threads. "
                     "For CPU-bound work, use multiprocessing instead of threading.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="Python multiprocessing spawns separate processes, each with their own Python interpreter. "
                     "This bypasses the GIL limitation for CPU-intensive tasks. "
                     "ProcessPoolExecutor provides a high-level interface for parallel CPU work.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "advanced"}
    ),
    Document(
        page_content="FastAPI is an async web framework built on Starlette and Pydantic. "
                     "It uses Python type hints for automatic request validation and documentation. "
                     "Async route handlers using async def run on the asyncio event loop.",
        metadata={"topic": "python", "subtopic": "web", "difficulty": "beginner"}
    ),
    Document(
        page_content="SQLAlchemy async support allows database queries without blocking the event loop. "
                     "The AsyncSession class wraps the standard Session for use with asyncio. "
                     "AsyncEngine and create_async_engine set up async database connections.",
        metadata={"topic": "python", "subtopic": "database", "difficulty": "intermediate"}
    ),
    Document(
        page_content="Concurrency patterns in Python: use asyncio for I/O-bound concurrent tasks, "
                     "threading for I/O-bound parallel tasks needing shared state, "
                     "and multiprocessing for CPU-bound parallel computation.",
        metadata={"topic": "python", "subtopic": "concurrency", "difficulty": "intermediate"}
    ),
]
# Build the vector store
vectorstore = Chroma.from_documents(
    documents=raw_docs,
    embedding=embeddings,
    collection_name="python_docs"
)
Retriever 1: SimilaritySearch
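The baseline. It embeds the query and returns the k documents closest to it in embedding space. Chroma's default distance metric is L2; because OpenAI embeddings are normalized to unit length, that ranking matches a cosine-similarity ranking.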
similarity_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
query = "how does async work in Python?"
results = similarity_retriever.invoke(query)
print(f"Similarity search: {len(results)} results for '{query}'")
for i, doc in enumerate(results):
    print(f"\n [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f" Preview: {doc.page_content[:120]}...")
What you will typically see: The top 3 documents are all asyncio-related because that matches the query most closely. This is correct, but there may be repetition if the corpus has multiple similar asyncio chunks.
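To make the ranking step concrete, here is a minimal pure-Python sketch of cosine-similarity top-k. The 2-D vectors are made-up toy values, and the function names are hypothetical; a real vector store does the same thing over high-dimensional embeddings with an approximate index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.05]

ranked = top_k(toy_query, toy_docs, k=2)
print([i for i, _ in ranked])  # indices of the two closest documents
```

Documents 0 and 1 point in almost the same direction as the query, so they fill both slots. That is the redundancy pattern described above: near-duplicates crowd out everything else.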
With Score Threshold
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.78,
        "k": 5
    }
)
results = threshold_retriever.invoke("Python async HTTP requests")
print(f"Threshold retriever: {len(results)} results above 0.78 threshold")
# For out-of-scope queries, this returns fewer results
oos_results = threshold_retriever.invoke("JavaScript promises and callbacks")
print(f"Out-of-scope: {len(oos_results)} results (should be 0 or 1)")
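The threshold prevents the retriever from returning weakly relevant documents. This is worth adding even to basic pipelines: returning nothing is usually better than returning unrelated content that causes hallucinations. A workable threshold depends on the embedding model and the distance metric, so treat 0.78 as a starting point and calibrate it against in-scope and out-of-scope queries from your own corpus.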
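The filtering logic itself is simple. The sketch below assumes you already have (document, score) pairs where a higher score means more relevant; the scores and the `filter_by_threshold` name are hypothetical and only illustrate the behavior, not LangChain's internals.

```python
def filter_by_threshold(scored_docs, score_threshold, k):
    """Keep at most k (doc, score) pairs whose score meets the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical relevance scores for an in-scope and an out-of-scope query
in_scope = [("asyncio chunk", 0.84), ("aiohttp chunk", 0.80), ("GIL chunk", 0.41)]
out_of_scope = [("asyncio chunk", 0.35), ("FastAPI chunk", 0.29)]

in_scope_hits = filter_by_threshold(in_scope, score_threshold=0.78, k=5)
out_of_scope_hits = filter_by_threshold(out_of_scope, score_threshold=0.78, k=5)
print(len(in_scope_hits), len(out_of_scope_hits))
```

The out-of-scope query ends up with an empty list, which lets the downstream prompt say "I don't know" instead of reasoning over unrelated chunks.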
Retriever 2: MMR (Maximal Marginal Relevance)
MMR solves the redundancy problem in similarity search. When multiple documents are very similar to each other and to the query, basic similarity search returns all of them. MMR picks documents that are both relevant AND diverse from each other.
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,             # final documents to return
        "fetch_k": 10,      # candidates to consider before MMR selection
        "lambda_mult": 0.6  # balance: 0=max diversity, 1=max relevance
    }
)
query = "Python concurrent programming"
mmr_results = mmr_retriever.invoke(query)
print(f"MMR retriever: {len(mmr_results)} results")
for i, doc in enumerate(mmr_results):
    print(f"\n [{i+1}] Subtopic: {doc.metadata['subtopic']}")
    print(f" Preview: {doc.page_content[:120]}...")
What you will typically see with MMR: Instead of three asyncio documents, you might get asyncio, threading, and FastAPI — more diverse coverage of "concurrent programming" topics. The diversity is especially valuable when you have a broad query that could be answered from multiple angles.
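For intuition, here is a minimal sketch of the greedy MMR selection rule in pure Python. It is an illustration under stated assumptions, not the library's code: `doc_vecs` stands in for the `fetch_k` candidates, the vectors are toy values, and `mmr_select` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of k documents balancing relevance and diversity."""
    relevance = [cosine(query_vec, v) for v in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Redundancy = similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy candidates: 0 and 1 are near-duplicates, 2 and 3 point elsewhere
toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]
toy_query = [1.0, 0.2]

relevance_only = mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0)
balanced = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With `lambda_mult=1.0` the second pick is the near-duplicate of the first. At 0.5 the redundancy penalty pushes that duplicate out in favor of a document pointing in a different direction, at the cost of some relevance.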
Tuning the Lambda Parameter
# Compare different lambda values on the same query
for lambda_val in [0.2, 0.5, 0.8]:
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": lambda_val}
    )
    results = retriever.invoke("Python concurrency")
    subtopics = [doc.metadata["subtopic"] for doc in results]
    print(f"lambda={lambda_val}: subtopics = {subtopics}")
# lambda=0.2: high diversity, may include less relevant docs
# lambda=0.5: balanced (usually best default)
# lambda=0.8: similar to basic similarity, less diverse
Retriever 3: ContextualCompressionRetriever
This retriever has two stages. First, it retrieves chunks using any base retriever. Then, an LLM compresses each chunk to only the parts directly relevant to the query. The result: shorter, more focused context with less noise.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# The extractor uses an LLM to pull relevant content from each chunk
extractor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=extractor,
    base_retriever=similarity_retriever,
)
query = "how does the GIL affect Python threading?"
results = compression_retriever.invoke(query)
print(f"Compression retriever: {len(results)} results")
for i, doc in enumerate(results):
    print(f"\n [{i+1}] Compressed content ({len(doc.page_content)} chars):")
    print(f" {doc.page_content}")
# Should show only the GIL-relevant sentence, not the full chunk
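The two-stage shape (retrieve, then shrink each chunk) can be sketched without any LLM. In the example below a keyword-overlap rule stands in for the LLM extractor. That substitution is an assumption made purely for illustration, and the helper names and stopword list are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not exhaustive)
STOPWORDS = {"how", "does", "the", "a", "an", "is", "in", "of", "to", "for", "and", "what"}

def words(text):
    """Lowercased alphabetic tokens in text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def extract_relevant_sentences(chunk, query):
    """Stand-in compressor: keep sentences that share a non-stopword with the query."""
    query_terms = words(query) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return " ".join(s for s in sentences if words(s) & query_terms)

def compress_results(chunks, query):
    """Stage 2: compress every retrieved chunk and drop any that end up empty."""
    compressed = [extract_relevant_sentences(chunk, query) for chunk in chunks]
    return [text for text in compressed if text]

# Stage 1 output (pretend these came back from the base retriever)
retrieved = [
    "Python threading uses OS threads. The GIL prevents true CPU parallelism in threads. "
    "Use multiprocessing for CPU-bound work.",
    "FastAPI is an async web framework. It validates requests with type hints.",
]
compressed = compress_results(retrieved, "how does the GIL affect Python threading?")
print(compressed)
```

The off-topic FastAPI chunk disappears entirely and the threading chunk loses its unrelated last sentence. An LLM extractor makes the same kind of cut, but with semantic judgment instead of word overlap.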
Using LLMChainFilter Instead
LLMChainFilter makes a binary decision (keep or drop) rather than extracting specific text. It is faster but less precise:
from langchain.retrievers.document_compressors import LLMChainFilter
filter_compressor = LLMChainFilter.from_llm(llm)
filter_retriever = ContextualCompressionRetriever(
    base_compressor=filter_compressor,
    base_retriever=similarity_retriever,
)
# This filters out chunks that don't actually address the query
results = filter_retriever.invoke("how to handle database connections asynchronously?")
print(f"Filter retriever: {len(results)} chunks passed the filter")
for doc in results:
    print(f" Subtopic: {doc.metadata['subtopic']}")
EmbeddingsFilter — Cheaper Alternative
If LLM-based compression is too expensive for your use case, EmbeddingsFilter does the filtering using embedding similarity instead of an LLM call:
from langchain.retrievers.document_compressors import EmbeddingsFilter
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)
cheap_compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=similarity_retriever,
)
results = cheap_compression_retriever.invoke("asyncio event loop")
print(f"Embeddings filter: {len(results)} results")
Retriever 4: MultiVectorRetriever
Standard vector stores index one embedding per document chunk. MultiVectorRetriever indexes multiple embeddings per document — for example, the original text, a summary, and hypothetical questions the document answers. Retrieval checks all of these embeddings, so the document is findable via any matching phrasing.
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import uuid
# Separate vector store and document store for MultiVector
mv_vectorstore = Chroma(
    collection_name="multi_vector",
    embedding_function=embeddings
)
mv_docstore = InMemoryStore()
mv_retriever = MultiVectorRetriever(
    vectorstore=mv_vectorstore,
    docstore=mv_docstore,
    id_key="doc_id",
)
# Chain to generate summaries
summary_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write a concise 1-2 sentence summary of this text for search indexing."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)
# Chain to generate hypothetical questions
question_chain = (
    ChatPromptTemplate.from_messages([
        ("system", "Write 2 short questions that this text directly answers. One per line."),
        ("human", "{doc}"),
    ])
    | llm
    | StrOutputParser()
)
# Index documents with multiple embeddings
all_index_docs = []
doc_id_map = []
for doc in raw_docs:
    doc_id = str(uuid.uuid4())
    doc_id_map.append(doc_id)
    # Generate and store summary embedding
    summary = summary_chain.invoke({"doc": doc.page_content})
    summary_doc = Document(
        page_content=summary,
        metadata={"doc_id": doc_id, "embed_type": "summary"}
    )
    all_index_docs.append(summary_doc)
    # Generate and store question embeddings
    questions = question_chain.invoke({"doc": doc.page_content})
    for q in questions.strip().split("\n"):
        if q.strip():
            q_doc = Document(
                page_content=q.strip(),
                metadata={"doc_id": doc_id, "embed_type": "question"}
            )
            all_index_docs.append(q_doc)
    # Tag original doc with its ID
    doc.metadata["doc_id"] = doc_id
# Add index docs to vector store, originals to doc store
mv_vectorstore.add_documents(all_index_docs)
# mset expects a sequence of (key, value) pairs, so materialize the zip
mv_docstore.mset(list(zip(doc_id_map, raw_docs)))
# Query: finds documents via summary or question matches
results = mv_retriever.invoke("non-blocking database operations")
print(f"MultiVector results: {len(results)}")
for doc in results:
    print(f" Subtopic: {doc.metadata['subtopic']}")
    print(f" Preview: {doc.page_content[:100]}...")
The beauty of this approach: a user who asks "non-blocking database operations" might not match the exact document text ("AsyncSession class wraps the standard Session"), but the generated question "How do you run database queries without blocking the event loop?" is a much stronger match.
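The query-time step is worth seeing in isolation: several index entries can point at the same parent, so hits have to be mapped back and de-duplicated. This is a minimal sketch with plain dictionaries standing in for the vector store and doc store; the data and the `multi_vector_lookup` name are hypothetical.

```python
def multi_vector_lookup(hits, docstore, id_key="doc_id"):
    """Map index-entry hits back to unique parent documents, keeping rank order."""
    seen = set()
    parents = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id in docstore and parent_id not in seen:
            seen.add(parent_id)
            parents.append(docstore[parent_id])
    return parents

# Hypothetical data: two parent chunks and the index entries that point at them
docstore = {
    "doc-a": "SQLAlchemy async support allows database queries without blocking the event loop.",
    "doc-b": "FastAPI is an async web framework built on Starlette and Pydantic.",
}
hits = [  # assume these are already ranked by similarity to the query
    {"text": "How do you run database queries without blocking the event loop?",
     "metadata": {"doc_id": "doc-a", "embed_type": "question"}},
    {"text": "Summary: async database access with SQLAlchemy.",
     "metadata": {"doc_id": "doc-a", "embed_type": "summary"}},
    {"text": "Summary: FastAPI is an async web framework.",
     "metadata": {"doc_id": "doc-b", "embed_type": "summary"}},
]

parents = multi_vector_lookup(hits, docstore)
print(len(parents))
```

Three index hits collapse to two parent documents, and the LLM receives the full original chunks rather than the short summaries or questions that matched.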
Retriever 5: SelfQueryRetriever
This retriever lets the LLM translate natural language filter conditions into structured metadata filters automatically. A query like "beginner Python async tutorials" generates both a vector search AND a metadata filter {"difficulty": "beginner"}.
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# Describe the metadata fields to the LLM
metadata_field_info = [
    AttributeInfo(
        name="topic",
        description="The main programming topic. One of: python, javascript, rust",
        type="string",
    ),
    AttributeInfo(
        name="subtopic",
        description="The specific subtopic within the topic. Examples: asyncio, concurrency, web, database",
        type="string",
    ),
    AttributeInfo(
        name="difficulty",
        description="The difficulty level. One of: beginner, intermediate, advanced",
        type="string",
    ),
]
document_content_description = "Technical documentation about Python programming topics"
self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True,  # shows the generated filter in logs
)
# These queries should generate metadata filters automatically
test_queries = [
    "What are the beginner-level Python concepts?",
    "Show me advanced Python content about concurrency",
    "Find intermediate content about web frameworks",
]
for query in test_queries:
    print(f"\nQuery: {query}")
    results = self_query_retriever.invoke(query)
    for doc in results:
        print(f" [{doc.metadata['difficulty']}] {doc.metadata['subtopic']}: {doc.page_content[:80]}...")
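Conceptually, the LLM turns one natural-language request into two parts: text to search for and a filter to apply. The sketch below hand-writes that structured output (the LLM translation step is skipped, and `StructuredQuery` and `apply_filters` are hypothetical names) to show what the filter half does to the candidate set.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Hypothetical stand-in for the LLM's output: search text plus equality filters."""
    query: str
    filters: dict = field(default_factory=dict)

def apply_filters(docs, structured):
    """Keep documents whose metadata matches every filter exactly."""
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in structured.filters.items())
    ]

# Metadata mirroring a few chunks from the corpus in this guide
corpus = [
    {"metadata": {"subtopic": "asyncio", "difficulty": "intermediate"}},
    {"metadata": {"subtopic": "concurrency", "difficulty": "advanced"}},
    {"metadata": {"subtopic": "web", "difficulty": "beginner"}},
]

# Hand-written translation of "Show me advanced Python content about concurrency"
structured = StructuredQuery(query="concurrency", filters={"difficulty": "advanced"})
matches = apply_filters(corpus, structured)
print([doc["metadata"]["subtopic"] for doc in matches])
```

In the real retriever the surviving documents are then ranked by vector similarity to the search text. The filter only narrows the pool, which is why accurate field descriptions in `metadata_field_info` matter so much.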
Side-by-Side Comparison
Let me run all five retrievers on the same query and compare:
query = "Python async and threading differences"
retrievers = {
    "similarity": similarity_retriever,
    "mmr": mmr_retriever,
    "compression": compression_retriever,
    "multi_vector": mv_retriever,
    "self_query": self_query_retriever,
}
print(f"Query: '{query}'\n{'='*60}")
for name, retriever in retrievers.items():
    try:
        results = retriever.invoke(query)
        subtopics = [doc.metadata.get("subtopic", "unknown") for doc in results]
        print(f"\n{name.upper()} ({len(results)} results):")
        print(f" Subtopics: {subtopics}")
        # Guard against empty result lists before indexing
        if results:
            print(f" First result ({len(results[0].page_content)} chars): {results[0].page_content[:120]}...")
    except Exception as e:
        print(f"\n{name.upper()}: Error: {e}")
Retriever Performance Comparison
| Retriever | Typical Latency | Extra Cost | Best For |
|---|---|---|---|
| SimilaritySearch | Fast (1 embed call) | None | Simple Q&A, homogeneous corpus |
| MMR | Slightly slower (1 embed + ranking) | None | Diverse corpus, redundancy issues |
| Score Threshold | Fast (1 embed call) | None | Preventing low-confidence retrievals |
| ContextualCompression (LLM) | Slow (1 embed + k LLM calls) | $0.001-0.01 per query | Large noisy chunks |
| ContextualCompression (Embeddings) | Medium (k embed calls) | Minimal | Budget-conscious noise reduction |
| MultiVectorRetriever | Medium indexing, fast query | Higher indexing cost | Query-document phrasing mismatch |
| SelfQueryRetriever | Medium (1 extra LLM call) | ~$0.001 per query | Metadata-rich collections |
Combining Retrievers
You can chain and combine these retrievers. Here is a pattern I use often in production: EnsembleRetriever combining MMR with BM25, then wrapped in contextual compression:
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
# BM25 for keyword matching
bm25 = BM25Retriever.from_documents(raw_docs, k=3)
# MMR for semantic diversity
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 8, "lambda_mult": 0.6}
)
# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[bm25, mmr],
    weights=[0.4, 0.6]  # slight preference for semantic
)
# Add embedding-based compression to filter noise
embeddings_compressor = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)
final_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_compressor,
    base_retriever=ensemble,
)
results = final_retriever.invoke("async Python database queries")
print(f"Combined retriever: {len(results)} results")
for doc in results:
    print(f" {doc.metadata['subtopic']}: {doc.page_content[:100]}...")
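Weighted Reciprocal Rank Fusion is easy to work through by hand. Here is a minimal sketch, assuming each retriever returns a ranked list of document IDs; the constant `c=60` is a conventional choice and an assumption here, and the rankings are made up for illustration.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc IDs; each list adds weight / (c + rank) to a doc's score."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a keyword retriever and a semantic retriever
bm25_ranking = ["database", "asyncio", "web"]
mmr_ranking = ["asyncio", "database", "concurrency"]

fused = weighted_rrf([bm25_ranking, mmr_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that both retrievers agree on rise to the top, and the 0.6 weight breaks the near-tie in favor of the semantic retriever's first choice. Because only ranks are used, the BM25 and embedding scores never need to be on the same scale.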
Wiring a Retriever Into a RAG Chain
Once you have chosen and configured your retriever, wiring it into a Q&A chain is straightforward:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
def format_docs(docs):
    return "\n\n---\n".join(doc.page_content for doc in docs)
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a technical assistant. Answer the question based on the provided context.
If the context does not contain enough information to answer, say so.
Context:
{context}"""),
    ("human", "{question}"),
])
# Swap out the retriever here based on your needs
active_retriever = mmr_retriever  # or any other retriever from this guide
rag_chain = (
    RunnableParallel({
        "context": active_retriever | format_docs,
        "question": RunnablePassthrough(),
    })
    | qa_prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("What should I use for CPU-intensive parallel work in Python?")
print(answer)
Common Mistakes
Using only basic similarity search in production — The investment to add MMR and a score threshold is small and almost always improves output quality, especially for technical knowledge bases where multiple chunks overlap significantly.
Setting k too high — Passing 10 chunks to the LLM when 3-4 would suffice wastes tokens and can dilute the relevant content. Start with k=3, measure accuracy, and only increase it if retrieval is missing important context.
Not testing out-of-scope queries — Add score_threshold and test your retriever with questions clearly outside your corpus. A retriever that confidently returns irrelevant results for off-topic queries is worse than one that returns nothing.
Using LLM compression for every query — ContextualCompressionRetriever with LLMChainExtractor is powerful but adds one LLM call per retrieved chunk. For high-traffic systems, use EmbeddingsFilter instead.
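The out-of-scope check from the third point is easy to automate. Below is a minimal sketch of such a harness; `find_leaky_queries` is a hypothetical helper, and the stub retriever is a stand-in so the example runs without a vector store.

```python
def find_leaky_queries(retrieve, out_of_scope_queries, max_results=0):
    """Return {query: result_count} for off-topic queries that returned too many results."""
    leaks = {}
    for query in out_of_scope_queries:
        count = len(retrieve(query))
        if count > max_results:
            leaks[query] = count
    return leaks

# Hypothetical stub standing in for a real retriever's invoke method
def stub_retrieve(query):
    return ["some chunk"] if "python" in query.lower() else []

off_topic = [
    "JavaScript promises and callbacks",
    "Rust borrow checker",
    "Python packaging on Mars",
]
leaks = find_leaky_queries(stub_retrieve, off_topic)
print(leaks)
```

To run it against a real pipeline, pass a bound method such as `threshold_retriever.invoke` as `retrieve`. Any query that shows up in the returned dictionary is a candidate for a stricter threshold.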
What to Build Next
These retrievers feed into the broader RAG architecture covered in the RAG system tutorial. For building an agent that uses retrieval as a tool, rather than a direct chain, the Build AI agent with LangChain guide shows that pattern. If you are evaluating retrieval quality systematically, the LangSmith debugging and tracing guide covers how to set up evaluation datasets for retrieval benchmarking.
For production retrieval quality, also look at the semantic search tutorial for embedding optimization techniques that work alongside the retriever strategies here.
Conclusion
The right retriever for your project depends on your corpus characteristics and query patterns. If your documents have a lot of similar content, MMR prevents redundant results. If your chunks are long and noisy, contextual compression reduces irrelevant content. If users phrase questions differently from how documents are written, MultiVectorRetriever closes that gap.
My recommendation for most projects starting out: basic similarity with a score threshold, then add MMR when you notice redundancy, then add EnsembleRetriever with BM25 when you want the best accuracy for production. Each addition is incremental and measurable.
Build something, measure retrieval quality with LangSmith, and let the data guide which retriever upgrades are worth the added complexity for your specific use case.
FAQs
What is the difference between MMR and similarity search in LangChain? Similarity search returns the top-k documents most similar to the query, which can result in redundant, nearly identical results. MMR (Maximal Marginal Relevance) adds a diversity constraint — each new document must be both relevant to the query AND different from already-selected documents. MMR is better when your corpus has many similar chunks that would otherwise dominate the results.
When should I use ContextualCompressionRetriever? Use ContextualCompressionRetriever when your chunks contain significant off-topic content relative to the user's query. It retrieves chunks normally, then uses an LLM to extract only the relevant portions. This reduces noise in the context window, which tends to improve answer quality — but it adds one LLM call per retrieved chunk, so use it selectively.
Can I combine multiple retriever types together? Yes. EnsembleRetriever lets you combine any two or more retrievers with configurable weights using Reciprocal Rank Fusion. A common pattern is combining an MMR vector retriever with a BM25 keyword retriever for hybrid search. You can even nest retrievers — for example, wrapping an EnsembleRetriever with a ContextualCompressionRetriever.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.