10 LangChain RAG Rerankers: Cohere, Cross-Encoder and More
Improve RAG relevance with LangChain rerankers — CohereRerank, CrossEncoderReranker, FlashrankRerank, RankGPT, and more, with BEIR benchmark results and code.
Your RAG pipeline retrieves the top 5 documents, feeds them to GPT-4o, and produces an answer. It works well on simple queries. Then a user asks something nuanced, your embedding-based retrieval returns five thematically related but not quite right documents, and the LLM produces a confident but wrong answer.
The fix is a reranker. You retrieve a wider set of candidates first (say, 20 documents), then run a more expensive but more accurate scoring model to pick the best 5. This two-stage approach, broad retrieval followed by precise reranking, is a standard pattern in production RAG systems.
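The retrieve-then-rerank flow can be sketched without any library. This is only an illustration of the control flow, not LangChain's implementation: `cheap_score` and `precise_score` are hypothetical stand-ins (word overlap, and word overlap plus an exact-phrase bonus) for an embedding similarity and a cross-encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def two_stage_search(query: str, corpus: List[str], cheap: Scorer,
                     precise: Scorer, fetch_k: int = 20, top_n: int = 5) -> List[str]:
    """Stage 1 scores the whole corpus cheaply; stage 2 rescores only the candidates."""
    candidates = sorted(corpus, key=lambda d: cheap(query, d), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: precise(query, d), reverse=True)[:top_n]


def cheap_score(query: str, doc: str) -> float:
    # Toy stand-in for embedding similarity: number of shared words
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def precise_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: also rewards the exact query phrase
    bonus = 5.0 if query.lower() in doc.lower() else 0.0
    return cheap_score(query, doc) + bonus


corpus = [
    "attention spans drop when the self checkout queue is long",
    "self attention lets transformers weigh every token against every other",
    "vector databases store embeddings",
]
top = two_stage_search("self attention", corpus, cheap_score, precise_score,
                       fetch_k=2, top_n=1)
print(top)
```

Both of the first two documents share the same words with the query, so the cheap scorer cannot separate them. Only the second-stage scorer, which looks at the query and document together, promotes the one that is actually about self-attention.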
This guide covers ten reranking approaches available in LangChain, with working code, benchmark comparisons, and guidance on when to use each one.
Why Reranking Works
Vector search computes a single embedding for the query, a single embedding for each document, and measures cosine similarity. This is fast but lossy — the embedding captures the general topic but loses fine-grained relevance signals.
A cross-encoder reranker processes the query and each candidate document together as a single sequence, letting the model attend to interactions between query terms and document terms. This is the same mechanism that made BERT-based retrieval systems dramatically better than TF-IDF, applied to the reranking stage.
A 2024 analysis of the BEIR benchmark dataset by the MS MARCO team found that reranking vector search results improved NDCG@10 by an average of 18% across 18 retrieval tasks. For enterprise document search specifically, the improvement is often higher, because domain-specific terms tend to appear verbatim in the relevant documents and a single embedding vector can blur that exact-match signal.
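Since NDCG@10 is the metric quoted throughout this guide, here is a minimal sketch of how it can be computed. It assumes linear gain (the relevance label itself, rather than 2^rel - 1) and that the list of labels contains every relevant document, so the ideal ordering is just the same labels sorted. Evaluation toolkits may differ on both points.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each label is divided by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order (1 = relevant, 0 = not relevant)
before_rerank = [0, 1, 0, 1, 0]   # relevant documents at positions 2 and 4
after_rerank = [1, 1, 0, 0, 0]    # same documents moved to positions 1 and 2

print(f"before: {ndcg_at_k(before_rerank):.3f}")
print(f"after:  {ndcg_at_k(after_rerank):.3f}")
```

The set of retrieved documents is identical in both lists; only the order changes. That is exactly the effect a reranker has, and NDCG rewards it because relevant documents near the top are discounted less.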
Reranker Setup
pip install langchain langchain-community langchain-openai langchain-cohere
pip install chromadb sentence-transformers flashrank rank_bm25
pip install cohere # direct Cohere client, used in Reranker 10
We'll use this retriever as the base for all examples:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Sample knowledge base for demonstration
sample_docs = [
Document(page_content="The transformer architecture uses self-attention to process sequences in parallel.", metadata={"source": "ml_basics.md"}),
Document(page_content="BERT is a transformer model pre-trained on masked language modeling and next sentence prediction.", metadata={"source": "bert_paper.md"}),
Document(page_content="GPT models use a decoder-only transformer architecture trained autoregressively.", metadata={"source": "gpt_overview.md"}),
Document(page_content="Attention mechanisms allow models to focus on different parts of the input.", metadata={"source": "attention.md"}),
Document(page_content="Vector databases store embeddings and support approximate nearest neighbor search.", metadata={"source": "vector_db.md"}),
Document(page_content="RAG combines retrieval with generation to reduce hallucination in LLMs.", metadata={"source": "rag_guide.md"}),
Document(page_content="Cross-encoders score query-document pairs jointly for precise relevance estimation.", metadata={"source": "reranking.md"}),
Document(page_content="Bi-encoders create independent embeddings for query and documents.", metadata={"source": "encoders.md"}),
Document(page_content="The BEIR benchmark evaluates information retrieval across 18 diverse datasets.", metadata={"source": "beir.md"}),
Document(page_content="Fine-tuning a reranker on domain-specific data improves retrieval quality significantly.", metadata={"source": "fine_tuning.md"}),
]
vectorstore = Chroma.from_documents(sample_docs, embeddings, collection_name="reranker_demo")
# Base retriever — fetches 20 candidates for reranking
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
Reranker 1: CohereRerank
Cohere's hosted reranker is the easiest to integrate and consistently strong on benchmarks:
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
import os
cohere_reranker = CohereRerank(
cohere_api_key=os.environ["COHERE_API_KEY"],
model="rerank-english-v3.0", # or rerank-multilingual-v3.0 for non-English
top_n=5 # Return top 5 after reranking
)
cohere_retriever = ContextualCompressionRetriever(
base_compressor=cohere_reranker,
base_retriever=base_retriever
)
query = "How do transformer models process sequences efficiently?"
results = cohere_retriever.invoke(query)
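for i, doc in enumerate(results, 1):
    score = doc.metadata.get("relevance_score")
    # Format the score only when it is present; ":.4f" raises on a non-numeric fallback
    score_text = f"{score:.4f}" if score is not None else "N/A"
    print(f"{i}. Score: {score_text} | {doc.page_content[:150]}")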
CohereRerank attaches a relevance_score to each returned document's metadata, making it easy to add score-based filtering:
def get_cohere_reranked_with_threshold(
    query: str,
    retriever,
    min_score: float = 0.5
) -> list:
    """Rerank and filter by minimum relevance score."""
    results = retriever.invoke(query)
    filtered = [
        doc for doc in results
        if doc.metadata.get("relevance_score", 0) >= min_score
    ]
    if not filtered:
        # Fall back to top result if none pass threshold
        return results[:1] if results else []
    return filtered
high_quality_docs = get_cohere_reranked_with_threshold(
"What is the BEIR benchmark?",
cohere_retriever,
min_score=0.3
)
Reranker 2: Cross-Encoder (Local, Sentence Transformers)
For privacy-sensitive applications where you cannot send documents to an external API:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
# Popular cross-encoder models:
# - cross-encoder/ms-marco-MiniLM-L-6-v2 (fast, good quality)
# - cross-encoder/ms-marco-electra-base (slower, better quality)
# - BAAI/bge-reranker-v2-m3 (multilingual, excellent)
# - mixedbread-ai/mxbai-rerank-large-v1 (state-of-art, large)
cross_encoder = HuggingFaceCrossEncoder(
model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
cross_encoder_reranker = CrossEncoderReranker(
model=cross_encoder,
top_n=5
)
ce_retriever = ContextualCompressionRetriever(
base_compressor=cross_encoder_reranker,
base_retriever=base_retriever
)
query = "cross-encoder vs bi-encoder retrieval"
results = ce_retriever.invoke(query)
for doc in results:
    print(f"{doc.page_content[:150]}")
    print("---")
For production use where latency matters, run the model on a GPU when one is available:
import torch
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
cross_encoder_gpu = HuggingFaceCrossEncoder(
model_name="BAAI/bge-reranker-v2-m3",
model_kwargs={"device": device}
)
Reranker 3: FlashrankRerank (Fastest Local Option)
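FlashRank uses quantized models optimized for CPU inference, which makes it several times faster than a standard cross-encoder on the same hardware (compare the latency column in the benchmark table below):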
from langchain_community.document_compressors import FlashrankRerank
flashrank_reranker = FlashrankRerank(
model="ms-marco-MiniLM-L-12-v2", # Small, fast
top_n=5
)
flashrank_retriever = ContextualCompressionRetriever(
base_compressor=flashrank_reranker,
base_retriever=base_retriever
)
import time
query = "What is RAG and how does it reduce hallucinations?"
# Note: this times the whole retriever call (query embedding, vector search, and reranking)
start = time.perf_counter()
results = flashrank_retriever.invoke(query)
elapsed = (time.perf_counter() - start) * 1000
print(f"Retrieval + FlashRank reranking: {elapsed:.1f}ms for {len(results)} results")
for doc in results:
    print(f" {doc.page_content[:120]}")
FlashrankRerank is the right default for CPU-only environments, edge deployments, or applications where sub-100ms reranking is required.
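A single timed call is a noisy basis for a latency claim like "sub-100ms". The helper below is a small sketch for taking a more stable measurement: it runs a few warmup calls, then reports the median and maximum over several timed runs. `measure_latency_ms` and `fake_rerank` are illustrative names, not part of LangChain or FlashRank.

```python
import statistics
import time
from typing import Any, Callable, Dict


def measure_latency_ms(fn: Callable[..., Any], *args: Any,
                       warmup: int = 1, runs: int = 5) -> Dict[str, float]:
    """Call fn(*args) repeatedly and summarise wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # Warmup calls are not timed (caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def fake_rerank(query: str) -> list:
    # Placeholder workload standing in for a real retriever call
    return sorted(query.split(), key=len, reverse=True)


stats = measure_latency_ms(fake_rerank, "what is rag and how does it work", runs=5)
print(f"median {stats['median_ms']:.3f}ms, max {stats['max_ms']:.3f}ms")
```

With the retriever from this section, the call would look like `measure_latency_ms(flashrank_retriever.invoke, query)`. Keep in mind that this still includes the embedding request and the vector search, so to isolate the reranker you would time the compressor on a fixed list of documents instead.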
Reranker 4: RankGPT (LLM-Based Reranking)
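Using an LLM as the relevance judge needs no dedicated reranking model, but it is the slowest and most expensive option in this guide. The function below is a simplified single-prompt variant of the RankGPT idea, not the reference implementation: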
from langchain_openai import ChatOpenAI
import json
def rankgpt_rerank(query: str, documents: list, llm, top_n: int = 5) -> list:
    """Rerank documents using an LLM as the scoring model."""
    # Build a numbered list of document excerpts
    doc_list = "\n\n".join([
        f"[{i+1}] {doc.page_content[:300]}"
        for i, doc in enumerate(documents)
    ])
    rerank_prompt = f"""You are an expert relevance judge.
Query: {query}
Documents:
{doc_list}
Rank these documents by relevance to the query from most to least relevant.
Return ONLY a JSON list of document numbers in order of relevance, e.g.: [3, 1, 5, 2, 4]
Your ranking:"""
    response = llm.invoke([{"role": "user", "content": rerank_prompt}])
    try:
        # Parse the ranked list from the response
        text = response.content.strip()
        # Find the JSON array in the response
        start = text.find("[")
        end = text.rfind("]") + 1
        if start >= 0 and end > start:
            ranked_indices = json.loads(text[start:end])
            # Convert 1-indexed to 0-indexed, skipping invalid or repeated numbers
            reranked = []
            seen = set()
            for idx in ranked_indices:
                if isinstance(idx, int) and 1 <= idx <= len(documents) and idx not in seen:
                    seen.add(idx)
                    reranked.append(documents[idx - 1])
            if reranked:
                return reranked[:top_n]
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    # Fall back to original order if parsing fails or no valid numbers were returned
    return documents[:top_n]
reranking_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
base_docs = base_retriever.invoke("How do attention mechanisms work in transformers?")
reranked_docs = rankgpt_rerank(
query="How do attention mechanisms work in transformers?",
documents=base_docs,
llm=reranking_llm,
top_n=5
)
print("RankGPT results:")
for i, doc in enumerate(reranked_docs, 1):
    print(f"{i}. {doc.page_content[:150]}")
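The function above sends every candidate in one prompt, which stops scaling once the candidate list outgrows the context window. The RankGPT paper handles this with a sliding window that moves from the bottom of the list to the top, reranking one overlapping window at a time. The sketch below shows only that windowing logic. `rank_window` is a hypothetical callable that reorders a short list, and the toy ranker used here simply sorts alphabetically so the behaviour is easy to follow.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(docs: Sequence[str],
                          rank_window: Callable[[List[str]], List[str]],
                          window: int = 4, step: int = 2) -> List[str]:
    """Rerank overlapping windows from the end of the list toward the front."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    ranked = list(docs)
    if not ranked:
        return ranked
    start = max(len(ranked) - window, 0)
    while True:
        end = min(start + window, len(ranked))
        # Reorder only the current window; strong items move toward its front
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        start = max(start - step, 0)  # Overlap lets winners keep moving up
    return ranked


def toy_ranker(items: List[str]) -> List[str]:
    # Stand-in for an LLM call: pretend "a" is most relevant, then "b", and so on
    return sorted(items)


result = sliding_window_rerank(["d", "b", "e", "a", "c", "f"], toy_ranker,
                               window=4, step=2)
print(result)
```

In this guide's setup, `rank_window` could wrap the earlier function, for example `lambda docs: rankgpt_rerank(query, docs, reranking_llm, top_n=len(docs))`. Each window is one LLM call, so cost grows with the number of windows.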
Reranker 5: BGE Reranker (BAAI)
BGE models from BAAI are state-of-the-art for both English and multilingual reranking:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
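# BGE reranker v2-m3 is multilingual and ranks among the best on BEIR
# model_kwargs are forwarded to sentence_transformers.CrossEncoder, so pass only
# arguments its constructor accepts (e.g. "device"); "torch_dtype" is not one of them
bge_cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3"
)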
bge_reranker = CrossEncoderReranker(
model=bge_cross_encoder,
top_n=5
)
bge_retriever = ContextualCompressionRetriever(
base_compressor=bge_reranker,
base_retriever=base_retriever
)
# Works well for multilingual content
results_en = bge_retriever.invoke("transformer architecture self-attention")
results_de = bge_retriever.invoke("Transformer-Architektur Selbst-Aufmerksamkeit") # German
Reranker 6: Reciprocal Rank Fusion (No Model Required)
RRF merges results from multiple retrievers without any learned model — useful when you have diverse retrieval signals:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Build a BM25 retriever from the same documents
bm25_retriever = BM25Retriever.from_documents(sample_docs, k=20)
# Ensemble with RRF — no API call, no model inference
ensemble_retriever = EnsembleRetriever(
retrievers=[base_retriever, bm25_retriever],
weights=[0.6, 0.4] # Weight vector search slightly higher
)
query = "BERT pre-training objectives"
ensemble_results = ensemble_retriever.invoke(query)
print(f"Ensemble returned {len(ensemble_results)} results:")
for doc in ensemble_results[:5]:
    print(f" {doc.page_content[:150]}")
RRF needs no model inference and no API call, so the fusion step itself adds negligible latency and no cost. Add it as a first stage before an expensive reranker to improve the candidate pool the reranker sees.
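To make the fusion step concrete, here is a minimal standalone sketch of weighted Reciprocal Rank Fusion. Each document receives `weight / (c + rank)` from every ranked list it appears in, and the contributions are summed. The constant `c = 60` is the value commonly used in the RRF literature; treat it and the document IDs below as assumptions for illustration, not as a copy of EnsembleRetriever's internals.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(ranked_lists: Sequence[Sequence[str]],
                           weights: Optional[Sequence[float]] = None,
                           c: int = 60) -> List[str]:
    """Fuse several rankings of document IDs into one, best first."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions (small rank) contribute more to the fused score
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_ranking = ["bert", "gpt", "attention"]   # hypothetical vector search order
bm25_ranking = ["bert", "beir", "gpt"]          # hypothetical BM25 order

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still survives further down the list. Only ranks are used, never raw scores, which is why RRF can combine BM25 and cosine similarity without any score normalisation.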
Reranker 7: mxbai-rerank (MixedBread AI)
MixedBread's reranker is one of the top performers on MTEB benchmarks as of 2026:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
mxbai_encoder = HuggingFaceCrossEncoder(
model_name="mixedbread-ai/mxbai-rerank-large-v1"
)
mxbai_reranker = CrossEncoderReranker(
model=mxbai_encoder,
top_n=5
)
mxbai_retriever = ContextualCompressionRetriever(
base_compressor=mxbai_reranker,
base_retriever=base_retriever
)
results = mxbai_retriever.invoke("fine-tuning reranker on domain data")
for doc in results:
    print(doc.page_content[:150])
Reranker 8: Cohere Multilingual
For non-English content, Cohere's multilingual model handles 100+ languages:
from langchain_cohere import CohereRerank
multilingual_reranker = CohereRerank(
cohere_api_key=os.environ["COHERE_API_KEY"],
model="rerank-multilingual-v3.0",
top_n=5
)
multilingual_retriever = ContextualCompressionRetriever(
base_compressor=multilingual_reranker,
base_retriever=base_retriever
)
# Test with French query
fr_results = multilingual_retriever.invoke("Comment fonctionne l'attention dans les transformers?")
for doc in fr_results:
    print(doc.page_content[:150])
Reranker 9: Two-Stage Reranking Pipeline
For maximum quality, use a fast reranker for a first pass and a slow reranker for the final selection:
from langchain_community.document_compressors import FlashrankRerank
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
def build_two_stage_retriever(base_retriever, cohere_api_key: str):
    """
    Stage 1: Vector search retrieves top 50
    Stage 2: FlashRank narrows to top 15 (fast, local)
    Stage 3: Cohere narrows to top 5 (slower, most precise)
    """
    # Stage 2: Fast local reranking
    fast_reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=15)
    stage2_retriever = ContextualCompressionRetriever(
        base_compressor=fast_reranker,
        base_retriever=base_retriever
    )
    # Stage 3: High-quality API reranking
    precise_reranker = CohereRerank(
        cohere_api_key=cohere_api_key,
        model="rerank-english-v3.0",
        top_n=5
    )
    stage3_retriever = ContextualCompressionRetriever(
        base_compressor=precise_reranker,
        base_retriever=stage2_retriever
    )
    return stage3_retriever
two_stage = build_two_stage_retriever(
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50}),
cohere_api_key=os.environ.get("COHERE_API_KEY", "")
)
Reranker 10: Custom Score-Based Filtering
Sometimes you want to combine reranking with hard score thresholds:
from langchain_core.documents import Document
from typing import List, Tuple
import cohere
def rerank_with_threshold(
    query: str,
    documents: List[Document],
    cohere_api_key: str,
    top_n: int = 10,
    min_relevance_score: float = 0.4
) -> Tuple[List[Document], List[float]]:
    """Rerank documents and filter by minimum relevance score."""
    co = cohere.Client(cohere_api_key)
    # Format documents for Cohere
    doc_texts = [doc.page_content for doc in documents]
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=doc_texts,
        top_n=top_n,
        return_documents=True
    )
    filtered_docs = []
    filtered_scores = []
    for result in response.results:
        if result.relevance_score >= min_relevance_score:
            # result.index points back into the original documents list
            original_doc = documents[result.index]
            filtered_docs.append(Document(
                page_content=result.document.text,
                metadata={**original_doc.metadata, "relevance_score": result.relevance_score}
            ))
            filtered_scores.append(result.relevance_score)
    return filtered_docs, filtered_scores
# Get base results first
base_docs = base_retriever.invoke("transformer attention mechanism")
# Rerank with score filtering
docs, scores = rerank_with_threshold(
query="transformer attention mechanism",
documents=base_docs,
cohere_api_key=os.environ.get("COHERE_API_KEY", ""),
top_n=10,
min_relevance_score=0.3
)
print(f"Documents above threshold: {len(docs)}")
for doc, score in zip(docs, scores):
    print(f"Score {score:.4f}: {doc.page_content[:150]}")
Building a Full RAG Chain with Reranking
Putting everything together in a production RAG chain:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Use the fastest local reranker as the default (swap in CohereRerank for higher quality)
from langchain_community.document_compressors import FlashrankRerank
reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=5)
reranked_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
rag_prompt = ChatPromptTemplate.from_template("""
Use the following context to answer the question accurately.
If you can't find the answer in the context, say so explicitly.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')} | "
        f"Score: {doc.metadata.get('relevance_score', 'N/A')}]\n{doc.page_content}"
        for doc in docs
    )
rag_with_reranking = (
{"context": reranked_retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
answer = rag_with_reranking.invoke("How do cross-encoders improve retrieval compared to bi-encoders?")
print(answer)
This chain follows the RAG system tutorial architecture and applies directly to the patterns in Build AI agent with LangChain where retrieval quality directly affects agent decision quality.
Reranker Benchmark Comparison (BEIR)
| Reranker | NDCG@10 (BEIR avg) | Latency (20 docs) | Cost | Deployment |
|---|---|---|---|---|
| No reranking (vector only) | 0.41 | 0ms | None | Local |
| BM25 + RRF | 0.43 | 0ms | None | Local |
| FlashrankRerank (MiniLM) | 0.51 | 30-50ms | None | Local CPU |
| CrossEncoder (MiniLM-L6) | 0.53 | 150-250ms | None | Local CPU |
| BGE-reranker-v2-m3 | 0.57 | 300-500ms | None | Local GPU |
| mxbai-rerank-large-v1 | 0.58 | 400-600ms | None | Local GPU |
| CohereRerank v3.0 | 0.59 | 100-300ms | ~$0.001/query | API |
| RankGPT (gpt-4o-mini) | 0.56 | 800-1500ms | ~$0.003/query | API |
Source: Adapted from the BEIR benchmark paper and community evaluations on MTEB leaderboard, 2024-2025.
Selecting the Right Reranker
Use these rules of thumb:
- No budget, need it fast: FlashrankRerank on CPU
- Privacy sensitive (can't use external APIs): CrossEncoder or BGE local
- Multilingual content: BGE-reranker-v2-m3 or Cohere multilingual
- Maximum quality, cloud deployment: CohereRerank v3.0
- Already using OpenAI, want zero new dependencies: RankGPT with gpt-4o-mini
- High volume, cost-sensitive: FlashrankRerank (zero marginal cost)
For the semantic search tutorial, FlashrankRerank is the best default. For production RAG systems serving enterprise users, CohereRerank v3.0 offers the best quality-to-latency ratio.
The OpenAI API integration guide covers how to combine reranking costs with the broader API cost management strategy — important when you are paying per query for both retrieval and reranking.
Key Takeaways
Reranking is one of the highest-leverage improvements you can make to an existing RAG pipeline. A 15-20% improvement in NDCG@10 means more of the context passed to the LLM is genuinely relevant, which tends to reduce hallucinations and improve answer quality.
The two-stage approach (wide retrieval, narrow reranking) is the production pattern: keep your initial retrieval fast and broad, then spend the extra latency budget on a precise reranker. For most applications, FlashrankRerank is a sensible starting point: in the benchmark table above it reaches about 86% of CohereRerank's NDCG@10 (0.51 vs 0.59) at zero additional cost.
Frequently Asked Questions
Why use a reranker if my vector search already returns relevant results? Vector search embeddings are optimized for fast approximate nearest neighbor lookup, not precise relevance ranking. A reranker does a deeper pairwise comparison of the query and each candidate document, catching relevance signals that the embedding model missed. On typical enterprise document collections, reranking improves NDCG@3 by 12-25%.
What is the performance cost of adding a reranker? Cohere's hosted reranker adds 100-300ms of API latency. A local cross-encoder on CPU adds roughly 150-500ms for 20 documents, depending on model size. FlashrankRerank is the fastest local option, typically under 50ms for 20 documents on CPU. The quality improvement usually justifies the latency cost.
How many documents should I retrieve before reranking? Retrieve 20-50 documents with vector search, then rerank to your final top-k (typically 4-8). The wider initial retrieval increases recall — you're more likely to catch relevant documents — and the reranker filters down to the most precisely relevant subset.
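One way to choose the first-stage size is to measure recall at different values of k on a small labelled set. A reranker can only reorder what the retriever hands it, so any relevant document missing from the candidates is lost. The sketch below uses made-up document IDs purely for illustration.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical first-stage ranking and ground-truth labels for one query
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = {"d1", "d4"}

for k in (3, 4, 5):
    print(f"recall@{k} = {recall_at_k(ranking, relevant, k):.2f}")
```

In this toy case a first stage limited to k=3 gives the reranker nothing useful to work with, while k=5 captures both relevant documents. Averaging this over a set of real queries shows where recall flattens out, and that point is a reasonable first-stage k.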
AiTechWorlds Team