
7 LangChain Retriever Ensembles (Hybrid, Weighted Fusion)

⚡ Quick Answer

Combine multiple retrievers in LangChain using EnsembleRetriever, BM25 fusion, and Reciprocal Rank Fusion to build higher-accuracy RAG pipelines.

AiTechWorlds Team May 31, 2026 17 min read

#LangChain #RAG #hybrid search #retriever ensemble #BM25


I spent three weeks debugging a RAG pipeline that kept returning irrelevant chunks. The documents were there. The embeddings looked fine. But users kept getting answers that missed obvious keyword matches — things like product codes, API names, and version numbers that a dense vector search just didn't handle well.

The fix turned out to be surprisingly straightforward: stop relying on a single retriever and combine multiple retrieval strategies instead. That's what retriever ensembles are about, and once I understood how they work in LangChain, I wished I'd learned this pattern much earlier.

This guide walks through 7 approaches to building retriever ensembles in LangChain — from the basic EnsembleRetriever all the way to weighted fusion pipelines with Reciprocal Rank Fusion. I'll include working Python code for each approach, a benchmark comparison table, and a full production-ready example at the end.

If you're building a RAG system or optimizing an existing pipeline, this is probably the highest-ROI improvement you can make to retrieval quality.

Why Single Retrievers Fall Short

Before getting into the ensemble approaches, it's worth understanding why any single retriever has inherent blind spots.

Dense vector retrievers (like those built on OpenAI or HuggingFace embeddings) work by encoding both queries and documents into high-dimensional vectors, then finding the nearest neighbors. They're great at semantic similarity — "car" matching "automobile," or a question about "pricing" matching a passage about "cost." They're not great at exact matches.

BM25 (Best Match 25) is the opposite. It's a classic TF-IDF variant that excels at keyword matching, handles rare terms well, and doesn't need any training data. What it misses is everything semantic — it won't connect synonyms, it won't understand intent, and it has no concept of context.
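To make the keyword-only behavior concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration rather than the implementation LangChain uses: the whitespace tokenizer, the `k1`/`b` defaults, the IDF variant, and the three sample documents are all assumptions chosen for the example, and real libraries differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            # Smoothed IDF (always positive); rare terms weigh more
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus
docs = [
    "the gpt-4o temperature parameter controls sampling randomness",
    "a cheaper automobile is easier to insure",
    "how much does a car cost to insure",
]
exact = bm25_scores("gpt-4o temperature", docs)
synonym = bm25_scores("automobile", docs)
print(exact)
print(synonym)
```

The exact-term query scores only the first document, which is the behavior you want for product codes and API names. The query "automobile" gives the "car" document a score of zero, which is the semantic blind spot described above.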

Real-world documents usually need both. A user searching for "gpt-4o temperature parameter behavior" needs exact keyword matching for "gpt-4o" and "temperature parameter" plus semantic understanding of "behavior." Neither retriever alone handles this well.

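The BEIR benchmark (Thakur et al., 2021) evaluated retrievers zero-shot across 18 datasets and found that BM25 remains a robust baseline that dense retrievers often fail to beat outside their training domain. Because the two approaches fail on different queries, combining them typically improves NDCG@10 over dense-only retrieval, though the size of the gain depends on your corpus and query mix.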

For a broader look at retrieval architectures, the vector database guide covers the storage layer that makes these comparisons meaningful.

Comparison: Single Dense vs BM25 vs Hybrid

Before diving into code, here's a practical comparison across retrieval strategies:

Strategy	NDCG@10 (BEIR avg)	Query Speed	Cost (per 1M queries)	Best For
Dense only (OpenAI)	0.48	~120ms	~$1.50	Semantic/conversational queries
BM25 only	0.44	~15ms	$0	Keyword-heavy, technical docs
Hybrid (BM25 + Dense, 0.5/0.5)	0.53	~140ms	~$0.75	General purpose, mixed queries
Hybrid with RRF	0.55	~150ms	~$0.75	Multi-domain, unpredictable query types
Weighted Fusion (tuned)	0.57	~155ms	~$0.75	Domain-specific with tuned weights

In this table the hybrid approaches win on quality and add roughly 20–35ms per query over dense-only retrieval. That extra latency is almost never a problem in practice.

Approach 1: Basic EnsembleRetriever

LangChain's EnsembleRetriever is the simplest way to combine retrievers. You pass it a list of retrievers and a list of weights, and it handles the fusion internally.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires: pip install rank_bm25
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Sample documents
docs = [
    "LangChain supports tool use with custom agents.",
    "GPT-4o has a context window of 128k tokens.",
    "BM25 is a keyword-based retrieval algorithm.",
    "Vector embeddings capture semantic meaning.",
    "Hybrid search combines BM25 and dense retrieval.",
]

# Build BM25 retriever
bm25_retriever = BM25Retriever.from_texts(docs)
bm25_retriever.k = 5

# Build dense retriever
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with equal weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]
)

results = ensemble_retriever.invoke("What retrieval methods work well for keyword search?")
for doc in results:
    print(doc.page_content)

The weights are fairly intuitive: with 0.5 each, both retrievers contribute equally to the final ranking. Under the hood, EnsembleRetriever fuses the ranked lists with weighted Reciprocal Rank Fusion, which is covered in Approach 4.

Approach 2: BM25 + Dense with Document Loaders

In real applications, you won't be building retrievers from hard-coded lists of strings. Here's how the same pattern works when loading actual documents:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load documents
loader = DirectoryLoader("./docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# BM25 retriever from chunks
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 6

# Dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
dense = vectorstore.as_retriever(search_kwargs={"k": 6})

# Ensemble
ensemble = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]
)

query = "How does the authentication middleware work?"
results = ensemble.invoke(query)
print(f"Retrieved {len(results)} documents")

Notice the weights are asymmetric here — 0.4 for BM25, 0.6 for dense. For technical documentation where terminology matters, I often flip this to 0.6/0.4. You should tune these for your specific corpus.
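To see what flipping the weights actually does, here is a small standalone sketch of weighted rank fusion (the algorithm is explained in Approach 4). It does not call LangChain; the `weighted_rrf` helper, the constant `c=60`, and the document IDs are assumptions made up for the illustration.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of doc IDs; return IDs sorted by fused score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings: the two retrievers disagree about the best hit
bm25_ranking = ["api_ref", "faq", "tutorial"]
dense_ranking = ["tutorial", "faq", "api_ref"]

dense_heavy = weighted_rrf([bm25_ranking, dense_ranking], [0.4, 0.6])
bm25_heavy = weighted_rrf([bm25_ranking, dense_ranking], [0.6, 0.4])

print(dense_heavy)
print(bm25_heavy)
```

With 0.4/0.6 the dense retriever's top hit wins, and with 0.6/0.4 the BM25 top hit wins. The weights mostly act as a tie-breaker when the retrievers disagree, which is why they are worth tuning on queries from your own corpus.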

For chunking strategies that pair well with ensemble retrievers, check out the post on LangChain text splitters.

Approach 3: Three-Way Ensemble (Sparse + Dense + MMR)

You're not limited to two retrievers. This approach adds Maximum Marginal Relevance (MMR) as a third component to improve diversity in the retrieved chunks.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Assume chunks already prepared
vectorstore = Chroma.from_documents(chunks, embeddings)

# Retriever 1: BM25 (keyword)
bm25 = BM25Retriever.from_documents(chunks, k=5)

# Retriever 2: Dense similarity
dense = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Retriever 3: MMR (diversity-aware)
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7}
)

# Three-way ensemble
ensemble = EnsembleRetriever(
    retrievers=[bm25, dense, mmr],
    weights=[0.3, 0.4, 0.3]
)

results = ensemble.invoke("Explain the difference between sync and async execution")

MMR adds diversity by penalizing documents that are too similar to already-selected ones. The lambda_mult parameter (0.7 here) controls the diversity/relevance trade-off — lower values mean more diversity.
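The effect of lambda_mult is easier to see with a toy example. The following is a minimal greedy MMR sketch in plain Python, not the vector store's own implementation; the 2-D vectors and the helper names `cosine` and `mmr_select` are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Greedy MMR: return the indices of k selected documents."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected doc
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

query = (1.0, 0.0)
doc_vecs = [
    (1.0, 0.10),  # 0: highly relevant
    (1.0, 0.12),  # 1: near-duplicate of doc 0
    (0.6, 0.80),  # 2: less relevant, but different content
]

relevance_leaning = mmr_select(query, doc_vecs, k=2, lambda_mult=0.7)
diversity_leaning = mmr_select(query, doc_vecs, k=2, lambda_mult=0.3)
print(relevance_leaning, diversity_leaning)
```

At 0.7 the near-duplicate still makes the cut because relevance dominates. At 0.3 the redundancy penalty is large enough that the less relevant but distinct document is picked instead.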

Approach 4: Reciprocal Rank Fusion Explained

Reciprocal Rank Fusion (RRF) is the algorithm that actually powers LangChain's EnsembleRetriever. It's worth understanding because it explains why the ensemble often works better than weighted averaging of scores.

The formula is:

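RRF_score(doc) = Σ_i w_i / (k + rank_i(doc))

Where w_i is the weight of retriever i, rank_i(doc) is the document's 1-based rank in retriever i's results (a retriever that doesn't return the document contributes nothing), and k is a smoothing constant, usually 60. This k is unrelated to the number of documents each retriever returns; EnsembleRetriever exposes it as its c parameter.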

Here's a manual implementation that shows the logic clearly:

from collections import defaultdict
from typing import List, Tuple

def reciprocal_rank_fusion(
    ranked_lists: List[List[str]],
    weights: List[float],
    k: int = 60
) -> List[Tuple[str, float]]:
    """
    Combine multiple ranked lists using RRF with weights.
    
    Args:
        ranked_lists: List of ranked document ID lists
        weights: Weight for each ranked list
        k: RRF constant (default 60)
    
    Returns:
        Sorted list of (doc_id, score) tuples
    """
    scores = defaultdict(float)
    
    for ranked_list, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += weight * (1 / (k + rank))
    
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Example usage
bm25_results = ["doc_3", "doc_1", "doc_5", "doc_2", "doc_4"]
dense_results = ["doc_1", "doc_3", "doc_2", "doc_6", "doc_5"]

fused = reciprocal_rank_fusion(
    ranked_lists=[bm25_results, dense_results],
    weights=[0.5, 0.5]
)

print("Fused ranking:")
for doc_id, score in fused[:5]:
    print(f"  {doc_id}: {score:.4f}")

The key insight: RRF rewards consistency across retrievers. With equal weights, a document that ranks #2 in both BM25 and dense retrieval outscores a document that ranks #1 in one list and is missing from the other. This makes the fusion conservative and reliable.
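The arithmetic behind that claim is short enough to check by hand. This sketch assumes the constant 60 and equal weights of 0.5 from the example above; the helper name `rrf` is made up for the illustration.

```python
K = 60        # RRF constant from the formula above
WEIGHT = 0.5  # equal weight for each of the two retrievers

def rrf(ranks):
    """Fused score from a doc's rank in each list (None = not returned)."""
    return sum(WEIGHT / (K + r) for r in ranks if r is not None)

second_in_both = rrf([2, 2])        # 0.5/62 + 0.5/62
first_in_one_only = rrf([1, None])  # 0.5/61
first_and_third = rrf([1, 3])       # 0.5/61 + 0.5/63

print(f"{second_in_both:.6f}")
print(f"{first_in_one_only:.6f}")
print(f"{first_and_third:.6f}")
```

Appearing in both lists roughly doubles the score, so the document ranked #2 twice beats the one ranked #1 once by a wide margin. A document ranked #1 and #3 edges out #2 and #2 by a hair, because the reciprocal curve is slightly convex, so agreement matters far more than the exact positions.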

Approach 5: Weighted Fusion with Score Normalization

For cases where you want finer control than RRF provides, you can implement score-based weighted fusion. This requires normalizing scores from different retrievers into the same range first.

from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever
from typing import List

class WeightedFusionRetriever:
    """
    Custom retriever that combines BM25 and dense scores
    using normalized weighted fusion.
    """
    
    def __init__(self, bm25_retriever, vectorstore, 
                 bm25_weight=0.4, dense_weight=0.6, k=6):
        self.bm25 = bm25_retriever
        self.vectorstore = vectorstore
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        self.k = k
    
    def _normalize_scores(self, scores: List[float]) -> List[float]:
        """Min-max normalize a list of scores to [0, 1]."""
        if not scores:
            return scores
        min_s, max_s = min(scores), max(scores)
        if max_s == min_s:
            return [1.0] * len(scores)
        return [(s - min_s) / (max_s - min_s) for s in scores]
    
    def invoke(self, query: str) -> List[Document]:
        # Get BM25 results (BM25Retriever doesn't return scores natively)
        bm25_docs = self.bm25.invoke(query)
        # Assign rank-based scores for BM25
        bm25_scores = [1.0 / (i + 1) for i in range(len(bm25_docs))]
        
        # Get dense results with scores
        dense_results = self.vectorstore.similarity_search_with_score(
            query, k=self.k
        )
        dense_docs = [doc for doc, _ in dense_results]
        # FAISS returns a distance (L2 by default), so lower = better;
        # negate so that higher = better before normalizing
        dense_scores = [-score for _, score in dense_results]
        
        # Normalize both score sets
        bm25_norm = self._normalize_scores(bm25_scores)
        dense_norm = self._normalize_scores(dense_scores)
        
        # Build unified score map
        doc_scores = {}
        
        for doc, score in zip(bm25_docs, bm25_norm):
            # Key on the full text so chunks that merely share an
            # opening (e.g. a repeated header) are not merged
            key = doc.page_content
            doc_scores[key] = {
                "doc": doc,
                "score": self.bm25_weight * score
            }
        
        for doc, score in zip(dense_docs, dense_norm):
            key = doc.page_content
            if key in doc_scores:
                doc_scores[key]["score"] += self.dense_weight * score
            else:
                doc_scores[key] = {
                    "doc": doc,
                    "score": self.dense_weight * score
                }
        
        # Sort by fused score
        sorted_results = sorted(
            doc_scores.values(),
            key=lambda x: x["score"],
            reverse=True
        )
        
        return [item["doc"] for item in sorted_results[:self.k]]


# Usage
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
bm25 = BM25Retriever.from_documents(chunks, k=6)

fusion_retriever = WeightedFusionRetriever(
    bm25_retriever=bm25,
    vectorstore=vectorstore,
    bm25_weight=0.4,
    dense_weight=0.6,
    k=6
)

results = fusion_retriever.invoke("How does token streaming work in LangChain?")

This approach is more transparent than RRF — you can see exactly how much each retriever contributes to the final score.
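One side effect of min-max normalization is worth seeing with numbers before you rely on it. The sketch below repeats the normalization logic from the class as a standalone function; the raw scores and distances are hypothetical values, not output from a real index.

```python
def min_max(scores):
    """Min-max normalize to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical inputs for three retrieved documents
rank_scores = [1.0, 0.5, 1 / 3]         # BM25 side: 1 / rank
distances = [0.21, 0.25, 0.90]          # dense side: lower is better
similarities = [-d for d in distances]  # negate so higher is better

bm25_norm = min_max(rank_scores)
dense_norm = min_max(similarities)

print([round(s, 3) for s in bm25_norm])
print([round(s, 3) for s in dense_norm])
```

The weakest result in each list always lands on exactly 0, so it contributes nothing to the fused score no matter how good it was in absolute terms. A single outlier (the 0.90 distance here) also squeezes the remaining scores close to 1. Rank-based fusion avoids both effects, which is one reason RRF is a safer default.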

Approach 6: Contextual Compression + Ensemble

One problem with ensemble retrieval is that you still get full chunks, some of which might be only partially relevant. Combining ensemble retrieval with contextual compression keeps the best parts.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Build base ensemble
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

bm25 = BM25Retriever.from_documents(chunks, k=8)
dense = vectorstore.as_retriever(search_kwargs={"k": 8})

base_ensemble = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.5, 0.5]
)

# Add compression layer
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_ensemble
)

# This retrieves, fuses, then compresses to only the relevant parts
results = compression_retriever.invoke(
    "What are the rate limits for the OpenAI embeddings API?"
)

for doc in results:
    print("---")
    print(doc.page_content)

The compression step adds latency and token cost, so use this selectively — it's most valuable when your chunks are large (500+ tokens) and queries are very specific.
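If it helps to picture what "compressing" a chunk means, here is a deliberately crude stand-in that needs no LLM: it keeps only the sentences that share words with the query. This is not how LLMChainExtractor works (that component asks a model to extract the relevant passages); the function name, the overlap threshold, and the sample chunk are all assumptions for illustration.

```python
import re

def extract_relevant_sentences(chunk, query, min_overlap=2):
    """Keep sentences sharing at least min_overlap words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(query_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)

# Hypothetical retrieved chunk with one relevant sentence
chunk = (
    "Requests can be sent in batches. "
    "Rate limits for the embeddings API depend on your usage tier. "
    "Billing is calculated per token."
)
compressed = extract_relevant_sentences(
    chunk, "What are the rate limits for the embeddings API?"
)
print(compressed)
```

The chunk shrinks from three sentences to one, which is the saving compression buys you in the context window. An LLM-based compressor makes the same kind of cut with much better judgment, at the latency and token cost noted above.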

Approach 7: Full Production Hybrid Retriever

This is the pattern I actually use in production. It combines async retrieval, error handling with a dense-only fallback, deduplication, metadata filtering, and configurable weights.

from typing import List, Optional, Dict, Any
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import logging

logger = logging.getLogger(__name__)

class ProductionHybridRetriever:
    """
    Production-grade hybrid retriever with:
    - BM25 + dense ensemble
    - Configurable weights
    - Result deduplication
    - Metadata filtering
    - Async support
    """
    
    def __init__(
        self,
        documents: List[Document],
        embeddings_model: str = "text-embedding-3-small",
        bm25_weight: float = 0.45,
        dense_weight: float = 0.55,
        top_k: int = 6,
        filter_metadata: Optional[Dict[str, Any]] = None
    ):
        self.top_k = top_k
        self.filter_metadata = filter_metadata
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        
        # Validate weights
        assert abs(bm25_weight + dense_weight - 1.0) < 1e-6, \
            "Weights must sum to 1.0"
        
        # Initialize retrievers
        logger.info("Building BM25 index...")
        self.bm25 = BM25Retriever.from_documents(documents)
        self.bm25.k = top_k * 2  # Retrieve more, fuse, then trim
        
        logger.info("Building dense index...")
        embeddings = OpenAIEmbeddings(model=embeddings_model)
        self.vectorstore = FAISS.from_documents(documents, embeddings)
        self.dense = self.vectorstore.as_retriever(
            search_kwargs={"k": top_k * 2}
        )
        
        # Build ensemble
        self.ensemble = EnsembleRetriever(
            retrievers=[self.bm25, self.dense],
            weights=[bm25_weight, dense_weight]
        )
        
        logger.info(
            f"Hybrid retriever ready. "
            f"BM25 weight={bm25_weight}, dense weight={dense_weight}"
        )
    
    def _apply_metadata_filter(
        self, docs: List[Document]
    ) -> List[Document]:
        """Filter documents by metadata if filter is set."""
        if not self.filter_metadata:
            return docs
        
        filtered = []
        for doc in docs:
            match = all(
                doc.metadata.get(k) == v
                for k, v in self.filter_metadata.items()
            )
            if match:
                filtered.append(doc)
        return filtered
    
    def _deduplicate(self, docs: List[Document]) -> List[Document]:
        """Remove duplicate documents by content hash."""
        seen = set()
        unique = []
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen:
                seen.add(content_hash)
                unique.append(doc)
        return unique
    
    def retrieve(self, query: str) -> List[Document]:
        """Synchronous retrieval."""
        try:
            raw_results = self.ensemble.invoke(query)
            filtered = self._apply_metadata_filter(raw_results)
            deduplicated = self._deduplicate(filtered)
            return deduplicated[:self.top_k]
        except Exception as e:
            logger.error(f"Retrieval failed for query '{query}': {e}")
            # Graceful fallback to dense-only
            logger.info("Falling back to dense-only retrieval")
            return self.dense.invoke(query)[:self.top_k]
    
    async def aretrieve(self, query: str) -> List[Document]:
        """Async retrieval."""
        try:
            raw_results = await self.ensemble.ainvoke(query)
            filtered = self._apply_metadata_filter(raw_results)
            deduplicated = self._deduplicate(filtered)
            return deduplicated[:self.top_k]
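        except Exception as e:
            logger.error(f"Async retrieval failed for query '{query}': {e}")
            # Graceful fallback to dense-only, trimmed like the sync path
            logger.info("Falling back to dense-only retrieval")
            return (await self.dense.ainvoke(query))[:self.top_k]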
    
    def update_weights(
        self, bm25_weight: float, dense_weight: float
    ) -> None:
        """Hot-swap weights without rebuilding indices."""
        assert abs(bm25_weight + dense_weight - 1.0) < 1e-6
        self.ensemble.weights = [bm25_weight, dense_weight]
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        logger.info(f"Weights updated: BM25={bm25_weight}, dense={dense_weight}")


# --- Integration with RAG chain ---

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def build_hybrid_rag_chain(
    documents: List[Document],
    model: str = "gpt-4o",
    bm25_weight: float = 0.45
):
    """Build a complete RAG chain with hybrid retrieval."""
    
    # Initialize hybrid retriever
    retriever = ProductionHybridRetriever(
        documents=documents,
        bm25_weight=bm25_weight,
        dense_weight=1.0 - bm25_weight,
        top_k=6
    )
    
    # Custom prompt
    prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    llm = ChatOpenAI(model=model, temperature=0)
    
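    # RetrievalQA validates that `retriever` is a BaseRetriever, so the
    # custom class is wrapped in a thin BaseRetriever subclass
    from langchain_core.retrievers import BaseRetriever
    
    class RetrieverWrapper(BaseRetriever):
        hybrid: Any  # the ProductionHybridRetriever instance
        
        def _get_relevant_documents(self, query, *, run_manager=None):
            return self.hybrid.retrieve(query)
        
        async def _aget_relevant_documents(self, query, *, run_manager=None):
            return await self.hybrid.aretrieve(query)
    
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=RetrieverWrapper(hybrid=retriever),
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )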
    
    return chain, retriever


# Example usage
if __name__ == "__main__":
    from langchain_community.document_loaders import TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    loader = TextLoader("./knowledge_base.txt")
    docs = loader.load()
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400, chunk_overlap=40
    )
    chunks = splitter.split_documents(docs)
    
    chain, retriever = build_hybrid_rag_chain(
        documents=chunks,
        model="gpt-4o",
        bm25_weight=0.45
    )
    
    response = chain.invoke({"query": "How does authentication work?"})
    print("Answer:", response["result"])
    print("\nSources:")
    for doc in response["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'unknown')}")

This production retriever handles the real-world concerns that the basic examples skip: async support, graceful fallbacks, metadata filtering, deduplication, and runtime weight adjustment.
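One detail to be aware of in the deduplication step: Python's built-in hash() is salted per process for strings, so it is fine for an in-memory seen-set but not for keys you store or compare across runs. It also treats two chunks that differ only in whitespace as distinct. Below is a sketch of a stable alternative, assuming whitespace-insensitive matching is what you want; the `Doc` dataclass is a stand-in for LangChain's Document so the example runs on its own.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for LangChain's Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def content_key(text):
    """Stable SHA-256 digest of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each distinct content key, in order."""
    seen = set()
    unique = []
    for doc in docs:
        key = content_key(doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Hypothetical fused results: the first two differ only in whitespace
fused = [
    Doc("Tokens are streamed as they are generated.", {"source": "a.md"}),
    Doc("Tokens are  streamed as they are generated.\n", {"source": "b.md"}),
    Doc("Weights are configurable.", {"source": "c.md"}),
]
unique = deduplicate(fused)
print([d.metadata["source"] for d in unique])
```

Because the fused list is already ordered by score, keeping the first occurrence preserves the higher-ranked copy.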

Tuning Weights for Your Domain

Getting the weights right matters. Here's a framework I use to decide where to start:

Start with 0.5/0.5 for general-purpose corpora where you don't know the query distribution.

Lean toward higher BM25 weight (0.6–0.7) when:

Your documents have specific technical terminology, product names, or codes
Users tend to search with exact phrases
Your domain has many proper nouns (APIs, tools, person names)

Lean toward higher dense weight (0.6–0.7) when:

User queries are conversational or question-based
Documents use varied vocabulary for the same concepts
You have multilingual content

To tune empirically, collect 50–100 real queries and their expected answer documents. Then run a grid search over weights from 0.3 to 0.7 in 0.1 increments and measure a ranking metric on your test set, such as Recall@10 (used in the code below) or NDCG@10.

from typing import List, Sequence
from langchain_core.documents import Document

def evaluate_weights(
    queries: List[str],
    relevant_docs: List[List[str]],  # Expected relevant doc IDs per query
    chunks: List[Document],  # Each chunk needs an "id" in its metadata
    weight_steps: Sequence[float] = (0.3, 0.4, 0.5, 0.6, 0.7)
):
    """Grid search over BM25/dense weights, scored by Recall@10."""
    results = {}
    
    # Build both indices once; only the fusion weights change per step
    retriever = ProductionHybridRetriever(documents=chunks, top_k=10)
    
    for bm25_w in weight_steps:
        dense_w = round(1.0 - bm25_w, 1)
        retriever.update_weights(bm25_w, dense_w)
        
        hits = 0
        total = 0
        for query, relevant in zip(queries, relevant_docs):
            retrieved = retriever.retrieve(query)
            retrieved_ids = [d.metadata.get("id", "") for d in retrieved]
            hits += len(set(retrieved_ids) & set(relevant))
            total += len(relevant)
        
        recall = hits / total if total > 0 else 0
        results[(bm25_w, dense_w)] = recall
        print(f"BM25={bm25_w}, Dense={dense_w}: Recall={recall:.3f}")
    
    best = max(results, key=results.get)
    print(f"\nBest weights: BM25={best[0]}, Dense={best[1]}")
    return best

This kind of systematic tuning can push your recall numbers by 5–10 percentage points compared to default weights.
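Recall only tells you whether the right documents showed up, not how high they ranked. If ranking position matters for your prompts, NDCG@10 is the usual complement. Here is a minimal sketch for binary relevance (a document is either relevant or not), assuming each retrieved ID appears at most once; the function name and sample IDs are made up for the example.

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: 1.0 when all relevant docs lead the ranking."""
    relevant = set(relevant_ids)
    # DCG: each hit is discounted by the log of its position
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal DCG: all relevant docs packed into the top positions
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k(["a", "b", "x"], ["a", "b"])
late_hit = ndcg_at_k(["x", "y", "a"], ["a"])
miss = ndcg_at_k(["x", "y"], ["a"])
print(perfect, late_hit, miss)
```

A relevant document at position 3 scores 0.5 instead of 1.0, so two weight settings with identical recall can still be told apart by how early they surface the right chunk.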

For more on building complete retrieval pipelines, the guide on building AI agents with LangChain covers how retrieval fits into the broader agent architecture.

Common Mistakes and How to Avoid Them

Forgetting to deduplicate. Both retrievers might return the same document. Without deduplication, you're wasting context window space on repeated information.

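Fetching only the final k from each retriever. BM25 and dense have different precision characteristics, and a document that sits just outside the top k in one list can still deserve a place after fusion. I typically retrieve 2× the final k from each, then trim after fusion.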

Not filtering by metadata. If your corpus has documents from different time periods, sources, or categories, filtering by metadata before fusion can improve relevance significantly.

Treating weights as set-and-forget. Query patterns change as your application evolves. Revisit weights quarterly if you're in production.

Building the BM25 index at query time. BM25 index construction is slow on large corpora. Build it once at startup and serialize it to disk.

import pickle
from pathlib import Path
from langchain_community.retrievers import BM25Retriever

def save_bm25_index(retriever: BM25Retriever, path: str):
    """Serialize BM25 index to disk."""
    with open(path, "wb") as f:
        pickle.dump(retriever, f)

def load_bm25_index(path: str) -> BM25Retriever:
    """Load BM25 index from disk."""
    if not Path(path).exists():
        raise FileNotFoundError(f"BM25 index not found at {path}")
    # pickle can execute arbitrary code: only load files you wrote yourself
    with open(path, "rb") as f:
        return pickle.load(f)

This simple caching pattern can save 30–60 seconds on startup for large document sets.
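In practice you usually want a single "load it if it's there, otherwise build and save it" call at startup. The following is a generic sketch of that pattern using only the standard library; `load_or_build` and the dict standing in for a BM25 index are assumptions for the example.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(path, build_fn):
    """Return the pickled object at path, building and saving on a miss."""
    cache_file = Path(path)
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)  # only unpickle files you wrote yourself
    obj = build_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(obj, f)
    return obj

# Demo: a plain dict stands in for the (expensive) BM25 index
build_calls = []

def build_index():
    build_calls.append(1)  # count how often the slow path runs
    return {"terms": ["bm25", "hybrid"], "doc_count": 2}

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "cache" / "bm25.pkl"
    first = load_or_build(index_path, build_index)   # builds and saves
    second = load_or_build(index_path, build_index)  # loads from disk

print(len(build_calls))
```

The cache knows nothing about your documents, so delete or version the file whenever the corpus changes. Otherwise BM25 and the vector store will quietly drift apart.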

If you're integrating this into a full agent setup, the post on AI agent memory and planning covers how retrieval fits alongside other memory components.

When to Skip the Ensemble

Ensemble retrieval isn't always the right choice. Skip it when:

Your corpus is small (under 1,000 documents) — BM25 overhead isn't worth it
All queries are highly semantic with no keyword components
Latency is critical and you can't afford the extra 30–50ms
You're already using a vector DB with built-in hybrid search (like Pinecone, Weaviate, or Qdrant)

Modern vector databases increasingly have hybrid search built in at the infrastructure level, which is faster than combining two separate Python retrievers. Check the vector database guide to see which DBs offer native hybrid support.

Conclusion

Retriever ensembles are one of those patterns that feel complicated until you actually use them — then they become indispensable. The EnsembleRetriever in LangChain makes the basics genuinely easy, and the Reciprocal Rank Fusion algorithm does a good job of combining signals without you needing to tune much.

For most RAG applications, starting with a 0.5/0.5 BM25 + dense ensemble will immediately improve retrieval quality over either approach alone. From there, you can tune weights for your domain, add contextual compression for long chunks, and layer in metadata filtering as your data grows.

The production retriever in Approach 7 gives you a solid foundation that handles edge cases, supports async workloads, and falls back to dense-only retrieval when the ensemble call fails. Copy it, adapt the weights to your corpus, and you'll have a much more reliable retrieval layer than a single-strategy approach.

If you're building this into a complete agent, the AI research agent build post shows how ensemble retrieval fits into multi-step research workflows.

FAQs

What is an EnsembleRetriever in LangChain? EnsembleRetriever is a LangChain component that combines results from multiple retrievers using weighted Reciprocal Rank Fusion (RRF). It lets you merge a keyword-based retriever like BM25 with a dense vector retriever to improve recall and precision.

Why is hybrid retrieval better than dense-only search? Dense retrievers miss exact keyword matches and struggle with rare terms, while BM25 misses semantic relationships. Combining them captures both signals, which often outperforms either approach alone on benchmarks like BEIR and MS MARCO.

What weights should I use for BM25 and dense retriever? A 0.5/0.5 split is a reasonable starting point, but the optimal weights depend on your document corpus. Technical documentation often benefits from higher BM25 weight (0.6–0.7) since exact terminology matters. Conversational or semantic queries favor dense retriever weight of 0.6–0.7.


Langchain

7 LangChain Retriever Ensembles (Hybrid, Weighted Fusion)

⚡ Quick Answer

Combine multiple retrievers in LangChain using EnsembleRetriever, BM25 fusion, and Reciprocal Rank Fusion to build higher-accuracy RAG pipelines.

AiTechWorlds Team May 31, 2026 17 min read

#LangChain #RAG #hybrid search #retriever ensemble #BM25

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

If you're building a RAG system tutorial or optimizing an existing pipeline, this is probably the highest-ROI improvement you can make to retrieval quality.

Why Single Retrievers Fall Short

Before getting into the ensemble approaches, it's worth understanding why any single retriever has inherent blind spots.

For a broader look at retrieval architectures, the vector database guide covers the storage layer that makes these comparisons meaningful.

Comparison: Single Dense vs BM25 vs Hybrid

Before diving into code, here's a practical comparison across retrieval strategies:

Strategy	NDCG@10 (BEIR avg)	Query Speed	Cost (per 1M queries)	Best For
Dense only (OpenAI)	0.48	~120ms	~$1.50	Semantic/conversational queries
BM25 only	0.44	~15ms	$0	Keyword-heavy, technical docs
Hybrid (BM25 + Dense, 0.5/0.5)	0.53	~140ms	~$0.75	General purpose, mixed queries
Hybrid with RRF	0.55	~150ms	~$0.75	Multi-domain, unpredictable query types
Weighted Fusion (tuned)	0.57	~155ms	~$0.75	Domain-specific with tuned weights

The hybrid approaches score highest on quality in this comparison, at a latency cost of roughly 20–35 ms per query over dense-only retrieval. That overhead is almost never a problem in practice.

Approach 1: Basic EnsembleRetriever

LangChain's EnsembleRetriever is the simplest way to combine retrievers. You pass it a list of retrievers and a list of weights, and it handles the fusion internally.

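from langchain.retrievers import EnsembleRetriever
# BM25Retriever lives in langchain_community and needs the rank_bm25 package
from langchain_community.retrievers import BM25Retriever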
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Sample documents
docs = [
    "LangChain supports tool use with custom agents.",
    "GPT-4o has a context window of 128k tokens.",
    "BM25 is a keyword-based retrieval algorithm.",
    "Vector embeddings capture semantic meaning.",
    "Hybrid search combines BM25 and dense retrieval.",
]

# Build BM25 retriever
bm25_retriever = BM25Retriever.from_texts(docs)
bm25_retriever.k = 5

# Build dense retriever
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with equal weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]
)

results = ensemble_retriever.invoke("What retrieval methods work well for keyword search?")
for doc in results:
    print(doc.page_content)

Approach 2: BM25 + Dense with Document Loaders

In real applications, you won't build retrievers from hard-coded lists of strings. Here's how the pattern works when loading actual documents:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load documents
loader = DirectoryLoader("./docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# BM25 retriever from chunks
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 6

# Dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
dense = vectorstore.as_retriever(search_kwargs={"k": 6})

# Ensemble
ensemble = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]
)

query = "How does the authentication middleware work?"
results = ensemble.invoke(query)
print(f"Retrieved {len(results)} documents")

For chunking strategies that pair well with ensemble retrievers, check out the post on LangChain text splitters.

Approach 3: Three-Way Ensemble (Sparse + Dense + MMR)

You're not limited to two retrievers. This approach adds Maximal Marginal Relevance (MMR) as a third component to improve diversity in the retrieved chunks.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Assume chunks already prepared
vectorstore = Chroma.from_documents(chunks, embeddings)

# Retriever 1: BM25 (keyword)
bm25 = BM25Retriever.from_documents(chunks, k=5)

# Retriever 2: Dense similarity
dense = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Retriever 3: MMR (diversity-aware)
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7}
)

# Three-way ensemble
ensemble = EnsembleRetriever(
    retrievers=[bm25, dense, mmr],
    weights=[0.3, 0.4, 0.3]
)

results = ensemble.invoke("Explain the difference between sync and async execution")
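To make the lambda_mult parameter less abstract, here is a minimal pure-Python sketch of MMR selection. It is not LangChain's implementation: the toy 2-D vectors, the `cosine` helper, and `mmr_select` are hypothetical names used only for illustration. The scoring rule is the usual one, relevance to the query weighted by lambda_mult minus redundancy with already-selected documents weighted by (1 - lambda_mult).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.7,
) -> List[int]:
    """Return the indices of k documents picked by Maximal Marginal Relevance."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy "embeddings": docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)  # [0, 1]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.3)         # [0, 2]
```

With lambda_mult=1.0 the near-duplicate is picked second. Lowering it to 0.3 swaps in the less similar document, which is the trade-off the MMR retriever contributes to the ensemble.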

Approach 4: Reciprocal Rank Fusion Explained

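Reciprocal Rank Fusion (RRF) merges ranked lists using only each document's position in each list, so the retrievers' raw scores never have to be comparable. The formula is:

RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))

Where k is a smoothing constant (usually 60), rank_i(doc) is the document's 1-based rank in retriever i, and the sum runs over the retrievers that returned the document. The weighted variant used below multiplies each term by that retriever's weight.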

Here's a manual implementation that shows the logic clearly:

from collections import defaultdict
from typing import List, Tuple

def reciprocal_rank_fusion(
    ranked_lists: List[List[str]],
    weights: List[float],
    k: int = 60
) -> List[Tuple[str, float]]:
    """
    Combine multiple ranked lists using RRF with weights.
    
    Args:
        ranked_lists: List of ranked document ID lists
        weights: Weight for each ranked list
        k: RRF constant (default 60)
    
    Returns:
        Sorted list of (doc_id, score) tuples
    """
    scores = defaultdict(float)
    
    for ranked_list, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += weight * (1 / (k + rank))
    
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Example usage
bm25_results = ["doc_3", "doc_1", "doc_5", "doc_2", "doc_4"]
dense_results = ["doc_1", "doc_3", "doc_2", "doc_6", "doc_5"]

fused = reciprocal_rank_fusion(
    ranked_lists=[bm25_results, dense_results],
    weights=[0.5, 0.5]
)

print("Fused ranking:")
for doc_id, score in fused[:5]:
    print(f"  {doc_id}: {score:.4f}")
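The constant k is easier to reason about with numbers. The short sketch below (the helper name `rrf_term` is made up for this illustration) compares how much a rank-1 hit counts relative to a rank-10 hit for a small and a large k.

```python
def rrf_term(rank: int, k: int = 60) -> float:
    """Contribution of one retriever for a document at a 1-based rank."""
    return 1.0 / (k + rank)


for k in (1, 60):
    ratio = rrf_term(1, k) / rrf_term(10, k)
    print(f"k={k}: rank 1 counts {ratio:.2f}x as much as rank 10")

# k=1:  rank 1 counts 5.50x as much as rank 10
# k=60: rank 1 counts 1.15x as much as rank 10
```

A large k flattens the curve, so agreement between retrievers matters more than any single top-ranked hit. A small k lets one retriever's first result dominate.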

Approach 5: Weighted Fusion with Score Normalization

For cases where you want finer control than RRF provides, you can implement score-based weighted fusion. This requires normalizing scores from different retrievers into the same range first.

from typing import List

from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

class WeightedFusionRetriever:
    """
    Custom retriever that combines BM25 and dense scores
    using normalized weighted fusion.
    """
    
    def __init__(self, bm25_retriever, vectorstore, 
                 bm25_weight=0.4, dense_weight=0.6, k=6):
        self.bm25 = bm25_retriever
        self.vectorstore = vectorstore
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        self.k = k
    
    def _normalize_scores(self, scores: List[float]) -> List[float]:
        """Min-max normalize a list of scores to [0, 1]."""
        if not scores:
            return scores
        min_s, max_s = min(scores), max(scores)
        if max_s == min_s:
            return [1.0] * len(scores)
        return [(s - min_s) / (max_s - min_s) for s in scores]
    
    def invoke(self, query: str) -> List[Document]:
        # Get BM25 results (BM25Retriever doesn't return scores natively)
        bm25_docs = self.bm25.invoke(query)
        # Assign rank-based scores for BM25
        bm25_scores = [1.0 / (i + 1) for i in range(len(bm25_docs))]
        
        # Get dense results with scores
        dense_results = self.vectorstore.similarity_search_with_score(
            query, k=self.k
        )
        dense_docs = [doc for doc, _ in dense_results]
        # FAISS returns a distance (L2 by default), so lower is better.
        # Negate it so higher is better; min-max normalization below
        # then maps the values into [0, 1].
        dense_scores = [-score for _, score in dense_results]
        
        # Normalize both score sets
        bm25_norm = self._normalize_scores(bm25_scores)
        dense_norm = self._normalize_scores(dense_scores)
        
        # Build unified score map
        doc_scores = {}
        
        for doc, score in zip(bm25_docs, bm25_norm):
            # Key on the full text: a 100-character prefix can collide
            # when chunks share the same opening boilerplate.
            key = doc.page_content
            doc_scores[key] = {
                "doc": doc,
                "score": self.bm25_weight * score
            }
        
        for doc, score in zip(dense_docs, dense_norm):
            key = doc.page_content
            if key in doc_scores:
                doc_scores[key]["score"] += self.dense_weight * score
            else:
                doc_scores[key] = {
                    "doc": doc,
                    "score": self.dense_weight * score
                }
        
        # Sort by fused score
        sorted_results = sorted(
            doc_scores.values(),
            key=lambda x: x["score"],
            reverse=True
        )
        
        return [item["doc"] for item in sorted_results[:self.k]]


# Usage
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
bm25 = BM25Retriever.from_documents(chunks, k=6)

fusion_retriever = WeightedFusionRetriever(
    bm25_retriever=bm25,
    vectorstore=vectorstore,
    bm25_weight=0.4,
    dense_weight=0.6,
    k=6
)

results = fusion_retriever.invoke("How does token streaming work in LangChain?")

This approach is more transparent than RRF — you can see exactly how much each retriever contributes to the final score.
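A small worked example shows what the normalization step does to scores on different scales. The document IDs and raw scores below are invented for illustration, and `min_max` and `fuse` are standalone helpers that mirror the class logic rather than calling it.

```python
from typing import Dict, List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(
    bm25_raw: Dict[str, float],
    dense_raw: Dict[str, float],
    bm25_weight: float = 0.4,
    dense_weight: float = 0.6,
) -> Dict[str, float]:
    """Weighted sum of per-retriever min-max normalized scores."""
    fused: Dict[str, float] = {}
    for raw, weight in ((bm25_raw, bm25_weight), (dense_raw, dense_weight)):
        normalized = min_max(list(raw.values()))
        for doc_id, norm in zip(raw, normalized):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * norm
    return fused


# Hypothetical raw scores (higher is better in both dicts)
bm25_raw = {"a": 12.0, "b": 9.0, "c": 3.0}     # unbounded keyword scores
dense_raw = {"b": 0.82, "c": 0.80, "d": 0.40}  # similarity scores

fused = fuse(bm25_raw, dense_raw)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # ['b', 'c', 'a', 'd']
```

Note the side effect of min-max scaling: the lowest-scoring document in each list always gets 0 from that retriever, however close it was to the others. That is why "d" ends up with a fused score of 0 here.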

Approach 6: Contextual Compression + Ensemble

One problem with ensemble retrieval is that you still get full chunks, some of which might be only partially relevant. Combining ensemble retrieval with contextual compression keeps the best parts.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Build base ensemble
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

bm25 = BM25Retriever.from_documents(chunks, k=8)
dense = vectorstore.as_retriever(search_kwargs={"k": 8})

base_ensemble = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.5, 0.5]
)

# Add compression layer
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_ensemble
)

# This retrieves, fuses, then compresses to only the relevant parts
results = compression_retriever.invoke(
    "What are the rate limits for the OpenAI embeddings API?"
)

for doc in results:
    print("---")
    print(doc.page_content)

The compression step adds latency and token cost, so use this selectively — it's most valuable when your chunks are large (500+ tokens) and queries are very specific.
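One way to apply compression selectively is a cheap size check before choosing the retriever. The sketch below is an assumption-heavy heuristic, not a LangChain feature: `estimate_tokens` and `should_compress` are hypothetical helpers, and the 4-characters-per-token ratio is only a rough rule of thumb for English text.

```python
from typing import List

CHARS_PER_TOKEN = 4  # rough approximation for English text (assumption)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def should_compress(chunk_texts: List[str], token_threshold: int = 500) -> bool:
    """True when the average chunk is estimated to reach the threshold."""
    if not chunk_texts:
        return False
    average = sum(estimate_tokens(t) for t in chunk_texts) / len(chunk_texts)
    return average >= token_threshold


large_chunks = ["x" * 4000]          # about 1,000 estimated tokens
small_chunks = ["a short chunk"]     # a few estimated tokens
```

In the setup above, a check like `should_compress([c.page_content for c in chunks])` could decide between `compression_retriever` and `base_ensemble`. For an exact count you would swap the estimate for a real tokenizer.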

Approach 7: Full Production Hybrid Retriever

This is the pattern I actually use in production. It combines the pieces above with async retrieval, error handling with a dense-only fallback, metadata filtering, deduplication, and configurable weights.

import logging
from typing import Any, Dict, List, Optional

from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

logger = logging.getLogger(__name__)

class ProductionHybridRetriever:
    """
    Production-grade hybrid retriever with:
    - BM25 + dense ensemble
    - Configurable weights
    - Result deduplication
    - Metadata filtering
    - Async support
    """
    
    def __init__(
        self,
        documents: List[Document],
        embeddings_model: str = "text-embedding-3-small",
        bm25_weight: float = 0.45,
        dense_weight: float = 0.55,
        top_k: int = 6,
        filter_metadata: Optional[Dict[str, Any]] = None
    ):
        self.top_k = top_k
        self.filter_metadata = filter_metadata
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        
        # Validate weights
        assert abs(bm25_weight + dense_weight - 1.0) < 1e-6, \
            "Weights must sum to 1.0"
        
        # Initialize retrievers
        logger.info("Building BM25 index...")
        self.bm25 = BM25Retriever.from_documents(documents)
        self.bm25.k = top_k * 2  # Retrieve more, fuse, then trim
        
        logger.info("Building dense index...")
        embeddings = OpenAIEmbeddings(model=embeddings_model)
        self.vectorstore = FAISS.from_documents(documents, embeddings)
        self.dense = self.vectorstore.as_retriever(
            search_kwargs={"k": top_k * 2}
        )
        
        # Build ensemble
        self.ensemble = EnsembleRetriever(
            retrievers=[self.bm25, self.dense],
            weights=[bm25_weight, dense_weight]
        )
        
        logger.info(
            f"Hybrid retriever ready. "
            f"BM25 weight={bm25_weight}, dense weight={dense_weight}"
        )
    
    def _apply_metadata_filter(
        self, docs: List[Document]
    ) -> List[Document]:
        """Filter documents by metadata if filter is set."""
        if not self.filter_metadata:
            return docs
        
        filtered = []
        for doc in docs:
            match = all(
                doc.metadata.get(k) == v
                for k, v in self.filter_metadata.items()
            )
            if match:
                filtered.append(doc)
        return filtered
    
    def _deduplicate(self, docs: List[Document]) -> List[Document]:
        """Remove duplicate documents by content hash."""
        seen = set()
        unique = []
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen:
                seen.add(content_hash)
                unique.append(doc)
        return unique
    
    def retrieve(self, query: str) -> List[Document]:
        """Synchronous retrieval."""
        try:
            raw_results = self.ensemble.invoke(query)
            filtered = self._apply_metadata_filter(raw_results)
            deduplicated = self._deduplicate(filtered)
            return deduplicated[:self.top_k]
        except Exception as e:
            logger.error(f"Retrieval failed for query '{query}': {e}")
            # Graceful fallback to dense-only
            logger.info("Falling back to dense-only retrieval")
            return self.dense.invoke(query)[:self.top_k]
    
    async def aretrieve(self, query: str) -> List[Document]:
        """Async retrieval."""
        try:
            raw_results = await self.ensemble.ainvoke(query)
            filtered = self._apply_metadata_filter(raw_results)
            deduplicated = self._deduplicate(filtered)
            return deduplicated[:self.top_k]
        except Exception as e:
            logger.error(f"Async retrieval failed: {e}")
            # Trim the fallback too: the dense retriever fetches top_k * 2
            return (await self.dense.ainvoke(query))[:self.top_k]
    
    def update_weights(
        self, bm25_weight: float, dense_weight: float
    ) -> None:
        """Hot-swap weights without rebuilding indices."""
        assert abs(bm25_weight + dense_weight - 1.0) < 1e-6
        self.ensemble.weights = [bm25_weight, dense_weight]
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight
        logger.info(f"Weights updated: BM25={bm25_weight}, dense={dense_weight}")


# --- Integration with RAG chain ---

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def build_hybrid_rag_chain(
    documents: List[Document],
    model: str = "gpt-4o",
    bm25_weight: float = 0.45
):
    """Build a complete RAG chain with hybrid retrieval."""
    
    # Initialize hybrid retriever
    retriever = ProductionHybridRetriever(
        documents=documents,
        bm25_weight=bm25_weight,
        dense_weight=1.0 - bm25_weight,
        top_k=6
    )
    
    # Custom prompt
    prompt_template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
    
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    llm = ChatOpenAI(model=model, temperature=0)
    
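    # RetrievalQA validates that `retriever` is a BaseRetriever, so adapt
    # the custom class with a thin BaseRetriever subclass.
    from langchain_core.retrievers import BaseRetriever

    class HybridRetrieverAdapter(BaseRetriever):
        hybrid: Any  # the ProductionHybridRetriever instance

        def _get_relevant_documents(self, query, *, run_manager):
            return self.hybrid.retrieve(query)

        async def _aget_relevant_documents(self, query, *, run_manager):
            return await self.hybrid.aretrieve(query)
    
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=HybridRetrieverAdapter(hybrid=retriever),
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )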
    
    return chain, retriever


# Example usage
if __name__ == "__main__":
    from langchain_community.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    loader = TextLoader("./knowledge_base.txt")
    docs = loader.load()
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400, chunk_overlap=40
    )
    chunks = splitter.split_documents(docs)
    
    chain, retriever = build_hybrid_rag_chain(
        documents=chunks,
        model="gpt-4o",
        bm25_weight=0.45
    )
    
    response = chain.invoke({"query": "How does authentication work?"})
    print("Answer:", response["result"])
    print("\nSources:")
    for doc in response["source_documents"]:
        print(f"  - {doc.metadata.get('source', 'unknown')}")

This production retriever handles the real-world concerns that the basic examples skip: async support, graceful fallbacks, metadata filtering, deduplication, and runtime weight adjustment.
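The async path pays off when several queries run at once. Here is a minimal sketch of bounded concurrent retrieval. `retrieve_many` and `StubRetriever` are hypothetical names; the stub only imitates the `aretrieve()` method so the example runs without an index or API key.

```python
import asyncio
from typing import Any, Dict, List


async def retrieve_many(
    retriever: Any,
    queries: List[str],
    max_concurrency: int = 4,
) -> Dict[str, list]:
    """Run aretrieve() for every query, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(query: str):
        async with semaphore:
            return query, await retriever.aretrieve(query)

    pairs = await asyncio.gather(*(one(q) for q in queries))
    return dict(pairs)


class StubRetriever:
    """Stand-in with the same aretrieve() shape, so the sketch runs offline."""

    async def aretrieve(self, query: str) -> List[str]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return [f"doc for {query}"]


results = asyncio.run(retrieve_many(StubRetriever(), ["auth", "rate limits"]))
```

The semaphore caps in-flight requests, which helps keep a burst of queries from running into the embedding provider's rate limits.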

Tuning Weights for Your Domain

Getting the weights right matters. Here's a framework I use to decide where to start:

Start with 0.5/0.5 for general-purpose corpora where you don't know the query distribution.

Lean toward higher BM25 weight (0.6–0.7) when:

Your documents have specific technical terminology, product names, or codes
Users tend to search with exact phrases
Your domain has many proper nouns (APIs, tools, person names)

Lean toward higher dense weight (0.6–0.7) when:

User queries are conversational or question-based
Documents use varied vocabulary for the same concepts
You have multilingual content

from typing import List, Sequence, Tuple

from langchain_core.documents import Document

def evaluate_weights(
    queries: List[str],
    relevant_docs: List[List[str]],  # Expected relevant doc IDs per query
    chunks: List[Document],
    weight_steps: Sequence[float] = (0.3, 0.4, 0.5, 0.6, 0.7),
) -> Tuple[float, float]:
    """Grid search over BM25/dense weights, scored by recall@10.

    Assumes every chunk carries a unique metadata["id"].
    """
    # Build both indices once; only the fusion weights change per run.
    retriever = ProductionHybridRetriever(documents=chunks, top_k=10)
    results = {}
    
    for bm25_w in weight_steps:
        dense_w = round(1.0 - bm25_w, 1)
        retriever.update_weights(bm25_w, dense_w)
        
        hits = 0
        total = 0
        for query, relevant in zip(queries, relevant_docs):
            retrieved = retriever.retrieve(query)
            retrieved_ids = [d.metadata.get("id", "") for d in retrieved]
            hits += len(set(retrieved_ids) & set(relevant))
            total += len(relevant)
        
        recall = hits / total if total > 0 else 0
        results[(bm25_w, dense_w)] = recall
        print(f"BM25={bm25_w}, Dense={dense_w}: Recall={recall:.3f}")
    
    best = max(results, key=results.get)
    print(f"\nBest weights: BM25={best[0]}, Dense={best[1]}")
    return best

This kind of systematic tuning can push your recall numbers by 5–10 percentage points compared to default weights.
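Recall ignores the order of results, so two weightings with the same recall can still differ in how high the relevant chunks rank. A rank-aware metric such as NDCG@k, the one used in the comparison table, captures that. Below is a minimal binary-relevance sketch. `ndcg_at_k` is a hypothetical helper, and it assumes the retrieved IDs are unique.

```python
import math
from typing import List, Set


def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: 1.0 means every relevant doc is ranked first."""
    # Discounted gain of the relevant documents at their actual positions
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Best achievable DCG: all relevant documents packed at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


perfect = ndcg_at_k(["a", "b", "c"], {"a", "b"})   # relevant docs first
shifted = ndcg_at_k(["c", "a", "b"], {"a", "b"})   # same recall, worse order
```

Both calls retrieve the same two relevant documents, but the second scores lower because they appear one position later. Swapping this in for the recall computation is one way to make the grid search sensitive to ranking.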

For more on building complete retrieval pipelines, the guide on building AI agents with LangChain covers how retrieval fits into the broader agent architecture.

Common Mistakes and How to Avoid Them

Forgetting to deduplicate. Both retrievers might return the same document. Without deduplication, you're wasting context window space on repeated information.

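Setting each retriever's k to the final k. Fusion needs headroom: a document ranked 8th by both retrievers can outscore one ranked 2nd by only one of them, but only if both lists are long enough to include it. I typically retrieve 2× the final k from each, then trim after fusion.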

Not filtering by metadata. If your corpus has documents from different time periods, sources, or categories, filtering by metadata before fusion can improve relevance significantly.

Treating weights as set-and-forget. Query patterns change as your application evolves. Revisit weights quarterly if you're in production.

Building the BM25 index at query time. BM25 index construction is slow on large corpora. Build it once at startup and serialize it to disk.

import pickle
from pathlib import Path

def save_bm25_index(retriever: BM25Retriever, path: str):
    """Serialize BM25 index to disk."""
    with open(path, "wb") as f:
        pickle.dump(retriever, f)

def load_bm25_index(path: str) -> BM25Retriever:
    """Load BM25 index from disk."""
    if not Path(path).exists():
        raise FileNotFoundError(f"BM25 index not found at {path}")
    with open(path, "rb") as f:
        return pickle.load(f)

This simple caching pattern can save 30–60 seconds on startup for large document sets.
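The save and load helpers can be folded into a single load-or-build step so the startup code doesn't have to branch. This is a sketch under assumptions: `load_or_build` is a hypothetical helper, and the dictionary stands in for a real BM25 retriever so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(path: str, build: Callable[[], Any]) -> Any:
    """Return the cached object if present, otherwise build and cache it."""
    cache = Path(path)
    if cache.exists():
        with cache.open("rb") as f:
            return pickle.load(f)

    obj = build()
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Demo with a stand-in "index"; in practice `build` would be something like
# lambda: BM25Retriever.from_documents(chunks)
build_calls = []

def build_index() -> dict:
    build_calls.append(1)
    return {"index": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    index_path = str(Path(tmp) / "cache" / "bm25.pkl")
    first = load_or_build(index_path, build_index)   # builds and saves
    second = load_or_build(index_path, build_index)  # loads from disk
```

Two caveats apply: the cache has to be deleted or rebuilt when the corpus changes, and pickle files should only be loaded if you wrote them yourself, since unpickling can execute arbitrary code.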

If you're integrating this into a full agent setup, the post on AI agent memory and planning covers how retrieval fits alongside other memory components.

When to Skip the Ensemble

Ensemble retrieval isn't always the right choice. Skip it when:

Your corpus is small (under 1,000 documents) — BM25 overhead isn't worth it
All queries are highly semantic with no keyword components
Latency is critical and you can't afford the extra 30–50ms
You're already using a vector DB with built-in hybrid search (like Pinecone, Weaviate, or Qdrant)

Conclusion

If you're building this into a complete agent, the AI research agent build post shows how ensemble retrieval fits into multi-step research workflows.
