10 LangChain RAG Rerankers: Cohere, Cross-Encoder and More

⚡ Quick Answer

Improve RAG relevance with LangChain rerankers — CohereRerank, CrossEncoderReranker, FlashrankRerank, RankGPT, and more, with BEIR benchmark results and code.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #RAG #reranker #Cohere #cross-encoder

Your RAG pipeline retrieves the top 5 documents, feeds them to GPT-4o, and produces an answer. It works well on simple queries. Then a user asks something nuanced, your embedding-based retrieval returns five thematically related but not quite right documents, and the LLM produces a confident but wrong answer.

The fix is a reranker. You retrieve a wider set of candidates first (say, 20 documents), then run a more expensive but more accurate scoring model to pick the best 5. This two-stage approach, broad retrieval followed by precise reranking, is a common pattern in production RAG systems.
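The retrieve-then-rerank flow can be sketched without any library. This is a toy illustration rather than LangChain code: `retrieve_then_rerank`, `word_overlap`, and `ordered_overlap` are hypothetical names, and the two scorers are simple stand-ins for a vector-similarity lookup and a cross-encoder.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    fetch_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    """Stage 1 keeps fetch_k candidates by the cheap score.
    Stage 2 re-orders only those candidates by the precise score."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]


# Hypothetical stand-in scorers, for illustration only.
def word_overlap(query: str, doc: str) -> float:
    """Cheap scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def ordered_overlap(query: str, doc: str) -> float:
    """'Precise' scorer: shared words plus a bonus when the query appears verbatim."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "quality of retrieval depends on embeddings",
    "reranking improves retrieval quality",
    "cats sleep most of the day",
]
top = retrieve_then_rerank(
    "retrieval quality", corpus, word_overlap, ordered_overlap, fetch_k=2, top_n=1
)
print(top)
```

The cheap scorer cannot separate the first two documents, so both survive stage 1; the second scorer then promotes the one that matches the query phrase exactly.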

This guide covers ten reranking approaches available in LangChain, with working code, benchmark comparisons, and guidance on when to use each one.

Why Reranking Works

Vector search computes a single embedding for the query, a single embedding for each document, and measures cosine similarity. This is fast but lossy — the embedding captures the general topic but loses fine-grained relevance signals.
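As a rough illustration of that scoring step, the sketch below ranks documents by cosine similarity. The vectors are made-up three-dimensional numbers rather than real embeddings, and `cosine_similarity` is a hand-written helper, not a LangChain or OpenAI call.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "embeddings": each document is reduced to one fixed vector,
# computed without ever seeing the query.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

Because each document vector is fixed before the query arrives, any detail that did not make it into the vector cannot influence the ranking. That is the gap a reranker is meant to close.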

A cross-encoder reranker processes the query and each candidate document together as a single sequence, letting the model attend to interactions between query terms and document terms. This is the same mechanism that made BERT-based retrieval systems dramatically better than TF-IDF, applied to the reranking stage.

A 2024 analysis of the BEIR benchmark, credited to the MS MARCO team, reported that reranking vector search results improved NDCG@10 by an average of 18% across 18 retrieval tasks. For enterprise document search the improvement is often higher, because queries there depend on exact domain-specific terms that a single embedding vector tends to blur.
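For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k with linear gain. It makes one simplifying assumption: the ideal ordering is built from the judged results in the list itself, whereas a full evaluation would use every relevant document in the collection. The relevance labels are invented for the example.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the best possible ordering of the same results."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Invented labels: 1 = relevant, 0 = not relevant, listed in ranked order.
before = [0, 1, 0, 1, 0]  # relevant documents sit at ranks 2 and 4
after = [1, 1, 0, 0, 0]   # the same documents moved to ranks 1 and 2

print(f"before reranking: {ndcg_at_k(before):.3f}")
print(f"after reranking:  {ndcg_at_k(after):.3f}")
```

The metric rewards placing relevant documents early, which matches what a RAG prompt needs when only the first few results are passed to the LLM.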

Reranker Setup

pip install langchain langchain-community langchain-openai langchain-cohere
pip install chromadb sentence-transformers flashrank rank_bm25
pip install cohere  # direct Cohere client, used in Reranker 10

We'll use this retriever as the base for all examples:

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Sample knowledge base for demonstration
sample_docs = [
    Document(page_content="The transformer architecture uses self-attention to process sequences in parallel.", metadata={"source": "ml_basics.md"}),
    Document(page_content="BERT is a transformer model pre-trained on masked language modeling and next sentence prediction.", metadata={"source": "bert_paper.md"}),
    Document(page_content="GPT models use a decoder-only transformer architecture trained autoregressively.", metadata={"source": "gpt_overview.md"}),
    Document(page_content="Attention mechanisms allow models to focus on different parts of the input.", metadata={"source": "attention.md"}),
    Document(page_content="Vector databases store embeddings and support approximate nearest neighbor search.", metadata={"source": "vector_db.md"}),
    Document(page_content="RAG combines retrieval with generation to reduce hallucination in LLMs.", metadata={"source": "rag_guide.md"}),
    Document(page_content="Cross-encoders score query-document pairs jointly for precise relevance estimation.", metadata={"source": "reranking.md"}),
    Document(page_content="Bi-encoders create independent embeddings for query and documents.", metadata={"source": "encoders.md"}),
    Document(page_content="The BEIR benchmark evaluates information retrieval across 18 diverse datasets.", metadata={"source": "beir.md"}),
    Document(page_content="Fine-tuning a reranker on domain-specific data improves retrieval quality significantly.", metadata={"source": "fine_tuning.md"}),
]

vectorstore = Chroma.from_documents(sample_docs, embeddings, collection_name="reranker_demo")

# Base retriever: fetches up to 20 candidates for reranking
# (this demo corpus has only 10 documents, so at most 10 come back)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

Reranker 1: CohereRerank

Cohere's hosted reranker is the easiest to integrate and consistently strong on benchmarks:

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
import os

cohere_reranker = CohereRerank(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="rerank-english-v3.0",  # or rerank-multilingual-v3.0 for non-English
    top_n=5  # Return top 5 after reranking
)

cohere_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=base_retriever
)

query = "How do transformer models process sequences efficiently?"
results = cohere_retriever.invoke(query)

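for i, doc in enumerate(results, 1):
    score = doc.metadata.get("relevance_score")
    # Format the score only when present; ":.4f" raises on a string fallback
    score_text = f"{score:.4f}" if score is not None else "N/A"
    print(f"{i}. Score: {score_text} | {doc.page_content[:150]}")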

CohereRerank attaches a relevance_score to each returned document's metadata, making it easy to add score-based filtering:

def get_cohere_reranked_with_threshold(
    query: str,
    retriever,
    min_score: float = 0.5
) -> list:
    """Rerank and filter by minimum relevance score."""
    results = retriever.invoke(query)
    filtered = [
        doc for doc in results
        if doc.metadata.get("relevance_score", 0) >= min_score
    ]
    
    if not filtered:
        # Fall back to top result if none pass threshold
        return results[:1] if results else []
    
    return filtered

high_quality_docs = get_cohere_reranked_with_threshold(
    "What is the BEIR benchmark?",
    cohere_retriever,
    min_score=0.3
)

Reranker 2: Cross-Encoder (Local, Sentence Transformers)

For privacy-sensitive applications where you cannot send documents to an external API:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Popular cross-encoder models:
# - cross-encoder/ms-marco-MiniLM-L-6-v2  (fast, good quality)
# - cross-encoder/ms-marco-electra-base    (slower, better quality)
# - BAAI/bge-reranker-v2-m3               (multilingual, excellent)
# - mixedbread-ai/mxbai-rerank-large-v1   (state-of-art, large)

cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

cross_encoder_reranker = CrossEncoderReranker(
    model=cross_encoder,
    top_n=5
)

ce_retriever = ContextualCompressionRetriever(
    base_compressor=cross_encoder_reranker,
    base_retriever=base_retriever
)

query = "cross-encoder vs bi-encoder retrieval"
results = ce_retriever.invoke(query)

for doc in results:
    print(f"{doc.page_content[:150]}")
    print("---")

For production use where latency matters, use the model on GPU:

import torch

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

cross_encoder_gpu = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3",
    model_kwargs={"device": device}
)

Reranker 3: FlashrankRerank (Fastest Local Option)

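FlashRank uses quantized models optimized for CPU inference, which makes it noticeably faster than a standard cross-encoder on the same hardware (roughly 30-50ms versus 150-250ms for 20 documents in the benchmark table later in this article):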

from langchain_community.document_compressors import FlashrankRerank

flashrank_reranker = FlashrankRerank(
    model="ms-marco-MiniLM-L-12-v2",  # Small, fast
    top_n=5
)

flashrank_retriever = ContextualCompressionRetriever(
    base_compressor=flashrank_reranker,
    base_retriever=base_retriever
)

import time

query = "What is RAG and how does it reduce hallucinations?"

start = time.time()
results = flashrank_retriever.invoke(query)
elapsed = (time.time() - start) * 1000

print(f"FlashRank reranking: {elapsed:.1f}ms for {len(results)} results")
for doc in results:
    print(f"  {doc.page_content[:120]}")

FlashrankRerank is the right default for CPU-only environments, edge deployments, or applications where sub-100ms reranking is required.

Reranker 4: RankGPT (LLM-Based Reranking)

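Using an LLM to judge relevance can give strong results on nuanced queries, but at higher cost and latency. The function below is a simplified single-prompt listwise reranker in the spirit of RankGPT, not the full sliding-window method: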

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
import json

def rankgpt_rerank(query: str, documents: list, llm, top_n: int = 5) -> list:
    """Rerank documents using an LLM as the scoring model."""
    
    # Build a numbered list of document excerpts
    doc_list = "\n\n".join([
        f"[{i+1}] {doc.page_content[:300]}"
        for i, doc in enumerate(documents)
    ])
    
    rerank_prompt = f"""You are an expert relevance judge.

Query: {query}

Documents:
{doc_list}

Rank these documents by relevance to the query from most to least relevant.
Return ONLY a JSON list of document numbers in order of relevance, e.g.: [3, 1, 5, 2, 4]
Your ranking:"""
    
    response = llm.invoke([{"role": "user", "content": rerank_prompt}])
    
    try:
        # Parse the ranked list from the response
        text = response.content.strip()
        # Find the JSON array in the response
        start = text.find("[")
        end = text.rfind("]") + 1
        if start >= 0 and end > start:
            ranked_indices = json.loads(text[start:end])
            # Convert 1-indexed to 0-indexed, skipping invalid or repeated numbers
            reranked, seen = [], set()
            for idx in ranked_indices:
                if isinstance(idx, int) and 1 <= idx <= len(documents) and idx not in seen:
                    seen.add(idx)
                    reranked.append(documents[idx - 1])
                if len(reranked) >= top_n:
                    break
            if reranked:
                return reranked
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    
    # Fall back to original order if parsing fails or nothing valid was returned
    return documents[:top_n]

reranking_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

base_docs = base_retriever.invoke("How do attention mechanisms work in transformers?")
reranked_docs = rankgpt_rerank(
    query="How do attention mechanisms work in transformers?",
    documents=base_docs,
    llm=reranking_llm,
    top_n=5
)

print("RankGPT results:")
for i, doc in enumerate(reranked_docs, 1):
    print(f"{i}. {doc.page_content[:150]}")

Reranker 5: BGE Reranker (BAAI)

BGE rerankers from BAAI are strong open-weight options for both English and multilingual reranking:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

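# BGE reranker v2-m3 is multilingual and ranks among the best on BEIR
# model_kwargs is forwarded to sentence_transformers.CrossEncoder, so only
# pass arguments that constructor accepts (e.g. device, max_length)
bge_cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3",
    model_kwargs={"device": "cpu"}  # set to "cuda" if a GPU is available
)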

bge_reranker = CrossEncoderReranker(
    model=bge_cross_encoder,
    top_n=5
)

bge_retriever = ContextualCompressionRetriever(
    base_compressor=bge_reranker,
    base_retriever=base_retriever
)

# Works well for multilingual content
results_en = bge_retriever.invoke("transformer architecture self-attention")
results_de = bge_retriever.invoke("Transformer-Architektur Selbst-Aufmerksamkeit")  # German

Reranker 6: Reciprocal Rank Fusion (No Model Required)

RRF merges results from multiple retrievers without any learned model — useful when you have diverse retrieval signals:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Build a BM25 retriever from the same documents
bm25_retriever = BM25Retriever.from_documents(sample_docs, k=20)

# Ensemble with weighted RRF; the fusion step adds no model inference of its own
ensemble_retriever = EnsembleRetriever(
    retrievers=[base_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Weight vector search slightly higher
)

query = "BERT pre-training objectives"
ensemble_results = ensemble_retriever.invoke(query)

print(f"Ensemble returned {len(ensemble_results)} results:")
for doc in ensemble_results[:5]:
    print(f"  {doc.page_content[:150]}")

The fusion step itself adds negligible latency and no API cost beyond running the two retrievers. Add it as a first stage before an expensive reranker to improve the quality of the candidate pool the reranker sees.
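The fusion rule is short enough to write out. The sketch below is a standalone version of weighted reciprocal rank fusion, where each document scores the sum of weight / (c + rank) over the lists it appears in. The function name and the toy document ids are hypothetical. The constant c = 60 is the value commonly used for RRF and, as far as I know, also the EnsembleRetriever default; treat that as an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    weights: Optional[Sequence[float]] = None,
    c: int = 60,
) -> List[str]:
    """Fuse ranked lists of document ids: score(d) = sum_i w_i / (c + rank_i(d))."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings from two retrievers (ids are made up).
vector_ranking = ["a", "b", "c"]
bm25_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], weights=[0.6, 0.4])
print(fused)
```

Only ranks enter the formula, so the two retrievers do not need comparable score scales, which is why BM25 and cosine similarity can be combined this way.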

Reranker 7: mxbai-rerank (MixedBread AI)

MixedBread's reranker is one of the top performers on MTEB benchmarks as of 2026:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

mxbai_encoder = HuggingFaceCrossEncoder(
    model_name="mixedbread-ai/mxbai-rerank-large-v1"
)

mxbai_reranker = CrossEncoderReranker(
    model=mxbai_encoder,
    top_n=5
)

mxbai_retriever = ContextualCompressionRetriever(
    base_compressor=mxbai_reranker,
    base_retriever=base_retriever
)

results = mxbai_retriever.invoke("fine-tuning reranker on domain data")
for doc in results:
    print(doc.page_content[:150])

Reranker 8: Cohere Multilingual

For non-English content, Cohere's multilingual model handles 100+ languages:

from langchain_cohere import CohereRerank

multilingual_reranker = CohereRerank(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="rerank-multilingual-v3.0",
    top_n=5
)

multilingual_retriever = ContextualCompressionRetriever(
    base_compressor=multilingual_reranker,
    base_retriever=base_retriever
)

# Test with French query
fr_results = multilingual_retriever.invoke("Comment fonctionne l'attention dans les transformers?")
for doc in fr_results:
    print(doc.page_content[:150])

Reranker 9: Two-Stage Reranking Pipeline

For maximum quality, use a fast reranker for a first pass and a slow reranker for the final selection:

from langchain_community.document_compressors import FlashrankRerank
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

def build_two_stage_retriever(base_retriever, cohere_api_key: str):
    """
    Stage 1: Vector search retrieves top 50
    Stage 2: FlashRank narrows to top 15 (fast, local)
    Stage 3: Cohere narrows to top 5 (slower, most precise)
    """
    # Stage 2: Fast local reranking
    fast_reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=15)
    stage2_retriever = ContextualCompressionRetriever(
        base_compressor=fast_reranker,
        base_retriever=base_retriever
    )
    
    # Stage 3: High-quality API reranking
    precise_reranker = CohereRerank(
        cohere_api_key=cohere_api_key,
        model="rerank-english-v3.0",
        top_n=5
    )
    stage3_retriever = ContextualCompressionRetriever(
        base_compressor=precise_reranker,
        base_retriever=stage2_retriever
    )
    
    return stage3_retriever

two_stage = build_two_stage_retriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50}),
    cohere_api_key=os.environ.get("COHERE_API_KEY", "")
)

Reranker 10: Custom Score-Based Filtering

Sometimes you want to combine reranking with hard score thresholds:

from langchain_core.documents import Document
from typing import List, Tuple
import cohere

def rerank_with_threshold(
    query: str,
    documents: List[Document],
    cohere_api_key: str,
    top_n: int = 10,
    min_relevance_score: float = 0.4
) -> Tuple[List[Document], List[float]]:
    """Rerank documents and filter by minimum relevance score."""
    
    co = cohere.Client(cohere_api_key)
    
    # Format documents for Cohere
    doc_texts = [doc.page_content for doc in documents]
    
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=doc_texts,
        top_n=top_n,
        return_documents=True
    )
    
    filtered_docs = []
    filtered_scores = []
    
    for result in response.results:
        if result.relevance_score >= min_relevance_score:
            original_doc = documents[result.index]
            filtered_docs.append(Document(
                page_content=result.document.text,
                metadata={**original_doc.metadata, "relevance_score": result.relevance_score}
            ))
            filtered_scores.append(result.relevance_score)
    
    return filtered_docs, filtered_scores

# Get base results first
base_docs = base_retriever.invoke("transformer attention mechanism")

# Rerank with score filtering
docs, scores = rerank_with_threshold(
    query="transformer attention mechanism",
    documents=base_docs,
    cohere_api_key=os.environ.get("COHERE_API_KEY", ""),
    top_n=10,
    min_relevance_score=0.3
)

print(f"Documents above threshold: {len(docs)}")
for doc, score in zip(docs, scores):
    print(f"Score {score:.4f}: {doc.page_content[:150]}")

Building a Full RAG Chain with Reranking

Putting everything together in a production RAG chain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Use the fastest local reranker as default (swap for Cohere in production)
from langchain_community.document_compressors import FlashrankRerank

reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=5)
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

rag_prompt = ChatPromptTemplate.from_template("""
Use the following context to answer the question accurately.
If you can't find the answer in the context, say so explicitly.

Context:
{context}

Question: {question}

Answer:""")

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')} | "
        f"Score: {doc.metadata.get('relevance_score', 'N/A')}]\n{doc.page_content}"
        for doc in docs
    )

rag_with_reranking = (
    {"context": reranked_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = rag_with_reranking.invoke("How do cross-encoders improve retrieval compared to bi-encoders?")
print(answer)

This chain follows the architecture from the RAG system tutorial, and it carries over to the patterns in the "Build AI agent with LangChain" guide, where retrieval quality directly affects the quality of the agent's decisions.

Reranker Benchmark Comparison (BEIR)

Reranker	NDCG@10 (BEIR avg)	Added latency (20 docs)	Cost	Deployment
No reranking (vector only)	0.41	0ms	None	Local
BM25 + RRF	0.43	negligible	None	Local
FlashrankRerank (MiniLM)	0.51	30-50ms	None	Local CPU
CrossEncoder (MiniLM-L6)	0.53	150-250ms	None	Local CPU
BGE-reranker-v2-m3	0.57	300-500ms	None	Local GPU
mxbai-rerank-large-v1	0.58	400-600ms	None	Local GPU
CohereRerank v3.0	0.59	100-300ms	~$0.001/query	API
RankGPT (gpt-4o-mini)	0.56	800-1500ms	~$0.003/query	API

Source: Adapted from the BEIR benchmark paper and community evaluations on the MTEB leaderboard, 2024-2025.

Selecting the Right Reranker

Use this decision tree:

No budget, need it fast: FlashrankRerank on CPU
Privacy sensitive (can't use external APIs): CrossEncoder or BGE local
Multilingual content: BGE-reranker-v2-m3 or Cohere multilingual
Maximum quality, cloud deployment: CohereRerank v3.0
Already using OpenAI, want zero new dependencies: RankGPT with gpt-4o-mini
High volume, cost-sensitive: FlashrankRerank (zero marginal cost)
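The list above can also be written as a small helper. The function name, its boolean parameters, and the order in which the rules are checked are all assumptions made for this sketch; the returned names are the options from the list.

```python
def choose_reranker(
    can_use_external_api: bool,
    multilingual: bool,
    needs_low_latency: bool,
) -> str:
    """Return the reranker suggested by the selection list (rule order is an assumption)."""
    if multilingual:
        # Multilingual content: hosted Cohere model if allowed, otherwise local BGE
        if can_use_external_api:
            return "Cohere rerank-multilingual-v3.0"
        return "BGE-reranker-v2-m3"
    if needs_low_latency:
        # Fast and free: quantized CPU reranker
        return "FlashrankRerank"
    if can_use_external_api:
        # Maximum quality with a cloud deployment
        return "CohereRerank v3.0"
    # Privacy-sensitive: keep everything local
    return "CrossEncoder or BGE (local)"


print(choose_reranker(can_use_external_api=False, multilingual=False, needs_low_latency=True))
```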

For the semantic search tutorial, FlashrankRerank is the best default. For production RAG systems serving enterprise users, CohereRerank v3.0 offers the best quality-to-latency ratio.

The OpenAI API integration guide covers how to combine reranking costs with the broader API cost management strategy — important when you are paying per query for both retrieval and reranking.
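As a back-of-the-envelope check on per-query pricing, the sketch below multiplies out the approximate costs from the benchmark table (~$0.001 per query for CohereRerank, ~$0.003 for RankGPT). The traffic figure of 10,000 queries per day is invented for the example, and the helper name is hypothetical.

```python
def monthly_rerank_cost(
    queries_per_day: int,
    cost_per_query_usd: float,
    days: int = 30,
) -> float:
    """Estimated monthly reranking spend in USD at a flat per-query price."""
    return queries_per_day * days * cost_per_query_usd


# Approximate per-query prices taken from the benchmark table; traffic is made up.
cohere_monthly = monthly_rerank_cost(10_000, 0.001)
rankgpt_monthly = monthly_rerank_cost(10_000, 0.003)

print(f"CohereRerank: ~${cohere_monthly:,.0f}/month")
print(f"RankGPT:      ~${rankgpt_monthly:,.0f}/month")
```

Real pricing depends on the provider's current rates and on document length, so the result is an order-of-magnitude estimate at best.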

Key Takeaways

Reranking is one of the highest-leverage improvements you can make to an existing RAG pipeline. A 15-20% improvement in NDCG@10 tends to translate into fewer hallucinations and higher user satisfaction, because the LLM generates better answers when the context it receives is genuinely relevant.

The two-stage approach (wide retrieval, narrow reranking) is the production pattern: keep your initial retrieval fast and broad, then spend the extra latency budget on a precise reranker. For most applications, FlashrankRerank gives roughly 85% of CohereRerank's NDCG@10 in the benchmark table above (0.51 versus 0.59) at zero additional cost.

Frequently Asked Questions

Why use a reranker if my vector search already returns relevant results?

Vector search embeddings are optimized for fast approximate nearest neighbor lookup, not precise relevance ranking. A reranker does a deeper pairwise comparison of the query and each candidate document, catching relevance signals that the embedding model missed. On typical enterprise document collections, reranking improves NDCG@3 by 12-25%.

What is the performance cost of adding a reranker?

Cohere's hosted reranker adds 100-300ms API latency. A local cross-encoder on CPU adds 200-500ms for 20 documents. FlashrankRerank is the fastest local option, typically under 50ms for 20 documents on CPU. The quality improvement usually justifies the latency cost.
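Latency figures like these vary with hardware and document length, so it is worth measuring on your own setup. A minimal sketch of a timing helper follows; `median_latency_ms` is a hypothetical name, and the workload here is a stand-in so the snippet runs on its own. With the retrievers from this article you would pass something like `lambda: flashrank_retriever.invoke(query)` instead.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Median wall-clock time of fn() in milliseconds, measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: model loading and caches should not be counted
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload so the example is self-contained.
latency = median_latency_ms(lambda: sum(range(10_000)))
print(f"median latency: {latency:.3f}ms")
```

The median is used instead of the mean so that a single slow run, such as a cold cache or a network hiccup, does not dominate the result.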

How many documents should I retrieve before reranking?

Retrieve 20-50 documents with vector search, then rerank to your final top-k (typically 4-8). The wider initial retrieval increases recall (you're more likely to catch relevant documents), and the reranker filters down to the most precisely relevant subset.
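The effect of a wider first stage can be checked with recall@k on a small labelled set. The sketch below uses invented document ids and relevance labels; `recall_at_k` is a hand-written helper, not a library function.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the first k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


# Invented example: first-stage ranking and the ids a human judged relevant.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d1", "d2", "d3"}

for k in (3, 6):
    print(f"recall@{k}: {recall_at_k(retrieved, relevant, k):.2f}")
```

In this toy case, widening from 3 to 6 candidates brings all three relevant documents into the pool. The reranker can only promote documents that the first stage actually returned.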


Langchain

10 LangChain RAG Rerankers: Cohere, Cross-Encoder and More

⚡ Quick Answer

Improve RAG relevance with LangChain rerankers — CohereRerank, CrossEncoderReranker, FlashrankRerank, RankGPT, and more, with BEIR benchmark results and code.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #RAG #reranker #Cohere #cross-encoder

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide covers ten reranking approaches available in LangChain, with working code, benchmark comparisons, and guidance on when to use each one.

Why Reranking Works

Reranker Setup

pip install langchain-community langchain-cohere sentence-transformers flashrank
pip install cohere  # for CohereRerank

We'll use this retriever as the base for all examples:

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Sample knowledge base for demonstration
sample_docs = [
    Document(page_content="The transformer architecture uses self-attention to process sequences in parallel.", metadata={"source": "ml_basics.md"}),
    Document(page_content="BERT is a transformer model pre-trained on masked language modeling and next sentence prediction.", metadata={"source": "bert_paper.md"}),
    Document(page_content="GPT models use a decoder-only transformer architecture trained autoregressively.", metadata={"source": "gpt_overview.md"}),
    Document(page_content="Attention mechanisms allow models to focus on different parts of the input.", metadata={"source": "attention.md"}),
    Document(page_content="Vector databases store embeddings and support approximate nearest neighbor search.", metadata={"source": "vector_db.md"}),
    Document(page_content="RAG combines retrieval with generation to reduce hallucination in LLMs.", metadata={"source": "rag_guide.md"}),
    Document(page_content="Cross-encoders score query-document pairs jointly for precise relevance estimation.", metadata={"source": "reranking.md"}),
    Document(page_content="Bi-encoders create independent embeddings for query and documents.", metadata={"source": "encoders.md"}),
    Document(page_content="The BEIR benchmark evaluates information retrieval across 18 diverse datasets.", metadata={"source": "beir.md"}),
    Document(page_content="Fine-tuning a reranker on domain-specific data improves retrieval quality significantly.", metadata={"source": "fine_tuning.md"}),
]

vectorstore = Chroma.from_documents(sample_docs, embeddings, collection_name="reranker_demo")

# Base retriever — fetches 20 candidates for reranking
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

Reranker 1: CohereRerank

Cohere's hosted reranker is the easiest to integrate and consistently strong on benchmarks:

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
import os

cohere_reranker = CohereRerank(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="rerank-english-v3.0",  # or rerank-multilingual-v3.0 for non-English
    top_n=5  # Return top 5 after reranking
)

cohere_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=base_retriever
)

query = "How do transformer models process sequences efficiently?"
results = cohere_retriever.invoke(query)

for i, doc in enumerate(results, 1):
    score = doc.metadata.get("relevance_score", "N/A")
    print(f"{i}. Score: {score:.4f} | {doc.page_content[:150]}")

CohereRerank attaches a relevance_score to each returned document's metadata, making it easy to add score-based filtering:

def get_cohere_reranked_with_threshold(
    query: str,
    retriever,
    min_score: float = 0.5
) -> list:
    """Rerank and filter by minimum relevance score."""
    results = retriever.invoke(query)
    filtered = [
        doc for doc in results
        if doc.metadata.get("relevance_score", 0) >= min_score
    ]
    
    if not filtered:
        # Fall back to top result if none pass threshold
        return results[:1] if results else []
    
    return filtered

high_quality_docs = get_cohere_reranked_with_threshold(
    "What is the BEIR benchmark?",
    cohere_retriever,
    min_score=0.3
)

Reranker 2: Cross-Encoder (Local, Sentence Transformers)

For privacy-sensitive applications where you cannot send documents to an external API:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Popular cross-encoder models:
# - cross-encoder/ms-marco-MiniLM-L-6-v2  (fast, good quality)
# - cross-encoder/ms-marco-electra-base    (slower, better quality)
# - BAAI/bge-reranker-v2-m3               (multilingual, excellent)
# - mixedbread-ai/mxbai-rerank-large-v1   (state-of-art, large)

cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

cross_encoder_reranker = CrossEncoderReranker(
    model=cross_encoder,
    top_n=5
)

ce_retriever = ContextualCompressionRetriever(
    base_compressor=cross_encoder_reranker,
    base_retriever=base_retriever
)

query = "cross-encoder vs bi-encoder retrieval"
results = ce_retriever.invoke(query)

for doc in results:
    print(f"{doc.page_content[:150]}")
    print("---")

For production use where latency matters, use the model on GPU:

import torch

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

cross_encoder_gpu = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3",
    model_kwargs={"device": device}
)

Reranker 3: FlashrankRerank (Fastest Local Option)

FlashRank uses quantized models optimized for CPU inference — 10-50x faster than standard cross-encoders:

from langchain_community.document_compressors import FlashrankRerank

flashrank_reranker = FlashrankRerank(
    model="ms-marco-MiniLM-L-12-v2",  # Small, fast
    top_n=5
)

flashrank_retriever = ContextualCompressionRetriever(
    base_compressor=flashrank_reranker,
    base_retriever=base_retriever
)

import time

query = "What is RAG and how does it reduce hallucinations?"

start = time.time()
results = flashrank_retriever.invoke(query)
elapsed = (time.time() - start) * 1000

print(f"FlashRank reranking: {elapsed:.1f}ms for {len(results)} results")
for doc in results:
    print(f"  {doc.page_content[:120]}")

FlashrankRerank is the right default for CPU-only environments, edge deployments, or applications where sub-100ms reranking is required.

Reranker 4: RankGPT (LLM-Based Reranking)

Using an LLM to score relevance provides maximum quality but at higher cost:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document
import json

def rankgpt_rerank(query: str, documents: list, llm, top_n: int = 5) -> list:
    """Rerank documents using an LLM as the scoring model."""
    
    # Build a numbered list of document excerpts
    doc_list = "\n\n".join([
        f"[{i+1}] {doc.page_content[:300]}"
        for i, doc in enumerate(documents)
    ])
    
    rerank_prompt = f"""You are an expert relevance judge.

Query: {query}

Documents:
{doc_list}

Rank these documents by relevance to the query from most to least relevant.
Return ONLY a JSON list of document numbers in order of relevance, e.g.: [3, 1, 5, 2, 4]
Your ranking:"""
    
    response = llm.invoke([{"role": "user", "content": rerank_prompt}])
    
    try:
        # Parse the ranked list from the response
        text = response.content.strip()
        # Find the JSON array in the response
        start = text.find("[")
        end = text.rfind("]") + 1
        if start >= 0 and end > start:
            ranked_indices = json.loads(text[start:end])
            # Convert 1-indexed to 0-indexed, skipping invalid or repeated numbers
            reranked = []
            seen = set()
            for idx in ranked_indices:
                if isinstance(idx, int) and 1 <= idx <= len(documents) and idx not in seen:
                    seen.add(idx)
                    reranked.append(documents[idx - 1])
            if reranked:
                return reranked[:top_n]
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    
    # Fall back to original order if parsing fails or yields no valid documents
    return documents[:top_n]

reranking_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

base_docs = base_retriever.invoke("How do attention mechanisms work in transformers?")
reranked_docs = rankgpt_rerank(
    query="How do attention mechanisms work in transformers?",
    documents=base_docs,
    llm=reranking_llm,
    top_n=5
)

print("RankGPT results:")
for i, doc in enumerate(reranked_docs, 1):
    print(f"{i}. {doc.page_content[:150]}")
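
LLM replies are not guaranteed to be clean JSON, so the parsing step is where this approach tends to break. As a rough sketch, the hypothetical `parse_ranking` helper below (an illustration, not part of LangChain or the RankGPT paper) extracts document numbers with a regular expression, drops duplicates and out-of-range values, and appends any documents the model left out in their original retrieval order:

```python
import re
from typing import List


def parse_ranking(text: str, n_docs: int) -> List[int]:
    """Turn an LLM reply such as '[3, 1, 2]' into 0-based document indices."""
    seen = set()
    order = []
    for token in re.findall(r"\d+", text):
        idx = int(token) - 1  # The prompt numbers documents starting at 1
        if 0 <= idx < n_docs and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Append anything the model omitted, preserving retrieval order
    order.extend(i for i in range(n_docs) if i not in seen)
    return order


print(parse_ranking("Ranking: [3, 1, 3, 9]", n_docs=4))  # [2, 0, 1, 3]
```

Because this picks up every number in the reply, it assumes the model returns little besides the ranking itself. A stricter prompt or structured output would reduce that risk.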

Reranker 5: BGE Reranker (BAAI)

BGE rerankers from BAAI are among the strongest open-weight options for both English and multilingual reranking:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# BGE reranker v2-m3 is multilingual and ranks among the best on BEIR
bge_cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3",
    # Forwarded to sentence_transformers.CrossEncoder; use "cuda" if a GPU is available
    model_kwargs={"device": "cpu"}
)

bge_reranker = CrossEncoderReranker(
    model=bge_cross_encoder,
    top_n=5
)

bge_retriever = ContextualCompressionRetriever(
    base_compressor=bge_reranker,
    base_retriever=base_retriever
)

# Works well for multilingual content
results_en = bge_retriever.invoke("transformer architecture self-attention")
results_de = bge_retriever.invoke("Transformer-Architektur Selbst-Aufmerksamkeit")  # German

Reranker 6: Reciprocal Rank Fusion (No Model Required)

RRF merges results from multiple retrievers without any learned model — useful when you have diverse retrieval signals:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Build a BM25 retriever from the same documents
bm25_retriever = BM25Retriever.from_documents(sample_docs, k=20)

# Ensemble with RRF — no API call, no model inference
ensemble_retriever = EnsembleRetriever(
    retrievers=[base_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Weight vector search slightly higher
)

query = "BERT pre-training objectives"
ensemble_results = ensemble_retriever.invoke(query)

print(f"Ensemble returned {len(ensemble_results)} results:")
for doc in ensemble_results[:5]:
    print(f"  {doc.page_content[:150]}")

RRF itself adds negligible latency and no API cost, since it only combines rank positions. Add it as a first stage before an expensive reranker to improve the quality of the candidate pool the reranker sees.
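
To make the fusion step concrete: weighted RRF gives each document a score of weight / (c + rank) from every list it appears in and sums those contributions. The sketch below is a standalone illustration with a hypothetical `weighted_rrf` function and made-up document IDs; it assumes the commonly used constant c = 60 and is not a copy of the EnsembleRetriever internals.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("Need exactly one weight per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank); better ranks add more
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]  # Hypothetical vector search order
bm25_hits = ["c", "a", "d"]    # Hypothetical BM25 order
print(weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4]))
# ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above documents found by only one retriever, which is the behavior that makes RRF useful for combining keyword and vector signals.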

Reranker 7: mxbai-rerank (MixedBread AI)

MixedBread's reranker is one of the top performers on MTEB benchmarks as of 2026:

from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

mxbai_encoder = HuggingFaceCrossEncoder(
    model_name="mixedbread-ai/mxbai-rerank-large-v1"
)

mxbai_reranker = CrossEncoderReranker(
    model=mxbai_encoder,
    top_n=5
)

mxbai_retriever = ContextualCompressionRetriever(
    base_compressor=mxbai_reranker,
    base_retriever=base_retriever
)

results = mxbai_retriever.invoke("fine-tuning reranker on domain data")
for doc in results:
    print(doc.page_content[:150])

Reranker 8: Cohere Multilingual

For non-English content, Cohere's multilingual model handles 100+ languages:

from langchain_cohere import CohereRerank

multilingual_reranker = CohereRerank(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="rerank-multilingual-v3.0",
    top_n=5
)

multilingual_retriever = ContextualCompressionRetriever(
    base_compressor=multilingual_reranker,
    base_retriever=base_retriever
)

# Test with French query
fr_results = multilingual_retriever.invoke("Comment fonctionne l'attention dans les transformers?")
for doc in fr_results:
    print(doc.page_content[:150])

Reranker 9: Two-Stage Reranking Pipeline

To rerank a large candidate pool without paying API cost and latency on every candidate, use a fast local reranker for the first pass and a slower, more precise reranker for the final selection:

from langchain_community.document_compressors import FlashrankRerank
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

def build_two_stage_retriever(base_retriever, cohere_api_key: str):
    """
    Stage 1: Vector search retrieves top 50
    Stage 2: FlashRank narrows to top 15 (fast, local)
    Stage 3: Cohere narrows to top 5 (slower, most precise)
    """
    # Stage 2: Fast local reranking
    fast_reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=15)
    stage2_retriever = ContextualCompressionRetriever(
        base_compressor=fast_reranker,
        base_retriever=base_retriever
    )
    
    # Stage 3: High-quality API reranking
    precise_reranker = CohereRerank(
        cohere_api_key=cohere_api_key,
        model="rerank-english-v3.0",
        top_n=5
    )
    stage3_retriever = ContextualCompressionRetriever(
        base_compressor=precise_reranker,
        base_retriever=stage2_retriever
    )
    
    return stage3_retriever

two_stage = build_two_stage_retriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 50}),
    cohere_api_key=os.environ.get("COHERE_API_KEY", "")
)
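
The narrowing pattern itself does not depend on any particular library. Here is a minimal sketch with a hypothetical `rerank_funnel` function and two toy scorers (simple term overlap standing in for FlashRank and Cohere) that shows how each stage scores only what the previous stage kept:

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]  # (query, document text) -> relevance score


def rerank_funnel(
    query: str,
    candidates: Sequence[str],
    stages: Sequence[Tuple[Scorer, int]],
) -> List[str]:
    """Apply each (scorer, keep) stage in order, keeping the top `keep` docs."""
    current = list(candidates)
    for scorer, keep in stages:
        current = sorted(current, key=lambda doc: scorer(query, doc), reverse=True)[:keep]
    return current


def term_overlap(query: str, doc: str) -> float:
    """Toy cheap scorer: number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_density(query: str, doc: str) -> float:
    """Toy precise scorer: shared terms divided by document length."""
    return term_overlap(query, doc) / max(len(doc.split()), 1)


docs = [
    "attention in transformers",
    "transformers use attention to weigh tokens in a sequence",
    "cooking pasta at home",
    "attention spans in classrooms",
]
top = rerank_funnel(
    "attention in transformers",
    docs,
    stages=[(term_overlap, 3), (overlap_density, 1)],
)
print(top)  # ['attention in transformers']
```

The expensive scorer runs on 3 documents instead of 4 here; with 50 candidates narrowed to 15, the saving on a paid API stage is proportionally larger.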

Reranker 10: Custom Score-Based Filtering

Sometimes you want to combine reranking with hard score thresholds:

from langchain_core.documents import Document
from typing import List, Tuple
import cohere

def rerank_with_threshold(
    query: str,
    documents: List[Document],
    cohere_api_key: str,
    top_n: int = 10,
    min_relevance_score: float = 0.4
) -> Tuple[List[Document], List[float]]:
    """Rerank documents and filter by minimum relevance score."""
    
    co = cohere.Client(cohere_api_key)
    
    # Format documents for Cohere
    doc_texts = [doc.page_content for doc in documents]
    
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=doc_texts,
        top_n=top_n,
        return_documents=True
    )
    
    filtered_docs = []
    filtered_scores = []
    
    for result in response.results:
        if result.relevance_score >= min_relevance_score:
            original_doc = documents[result.index]
            filtered_docs.append(Document(
                page_content=result.document.text,
                metadata={**original_doc.metadata, "relevance_score": result.relevance_score}
            ))
            filtered_scores.append(result.relevance_score)
    
    return filtered_docs, filtered_scores

# Get base results first
base_docs = base_retriever.invoke("transformer attention mechanism")

# Rerank with score filtering
docs, scores = rerank_with_threshold(
    query="transformer attention mechanism",
    documents=base_docs,
    cohere_api_key=os.environ.get("COHERE_API_KEY", ""),
    top_n=10,
    min_relevance_score=0.3
)

print(f"Documents above threshold: {len(docs)}")
for doc, score in zip(docs, scores):
    print(f"Score {score:.4f}: {doc.page_content[:150]}")
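
One thing to watch with hard thresholds is that a strict cutoff can filter out every document and leave the LLM with an empty context. A possible guard, sketched here with a hypothetical `filter_by_score` helper and made-up scores, keeps a minimum number of the best-scoring items even when none clear the threshold:

```python
from typing import List, Sequence, Tuple


def filter_by_score(
    scored: Sequence[Tuple[str, float]],
    min_score: float,
    min_keep: int = 1,
) -> List[Tuple[str, float]]:
    """Keep items scoring at least min_score, but never fewer than min_keep."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= min_score]
    if len(kept) < min_keep:
        kept = ranked[:min_keep]  # Fall back to the best available items
    return kept


scored_docs = [("doc-a", 0.12), ("doc-b", 0.27), ("doc-c", 0.05)]
print(filter_by_score(scored_docs, min_score=0.3))  # [('doc-b', 0.27)]
```

Whether to fall back is a design choice. Setting `min_keep=0` returns an empty list instead, which may be preferable when you would rather have the LLM say it found nothing than answer from weak context.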

Building a Full RAG Chain with Reranking

Putting everything together in a production RAG chain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Use the fastest local reranker as default (swap for Cohere in production)
from langchain_community.document_compressors import FlashrankRerank

reranker = FlashrankRerank(model="ms-marco-MiniLM-L-12-v2", top_n=5)
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

rag_prompt = ChatPromptTemplate.from_template("""
Use the following context to answer the question accurately.
If you can't find the answer in the context, say so explicitly.

Context:
{context}

Question: {question}

Answer:""")

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')} | "
        f"Score: {doc.metadata.get('relevance_score', 'N/A')}]\n{doc.page_content}"
        for doc in docs
    )

rag_with_reranking = (
    {"context": reranked_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = rag_with_reranking.invoke("How do cross-encoders improve retrieval compared to bi-encoders?")
print(answer)

This chain follows the architecture from the RAG system tutorial and applies directly to the patterns in Build AI agent with LangChain, where retrieval quality affects the quality of the agent's decisions.

Reranker Benchmark Comparison (BEIR)

Reranker	NDCG@10 (BEIR avg)	Added latency (20 docs)	Cost	Deployment
No reranking (vector only)	0.41	0ms	None	Local
BM25 + RRF	0.43	0ms	None	Local
FlashrankRerank (MiniLM)	0.51	30-50ms	None	Local CPU
CrossEncoder (MiniLM-L6)	0.53	150-250ms	None	Local CPU
BGE-reranker-v2-m3	0.57	300-500ms	None	Local GPU
mxbai-rerank-large-v1	0.58	400-600ms	None	Local GPU
CohereRerank v3.0	0.59	100-300ms	~$0.001/query	API
RankGPT (gpt-4o-mini)	0.56	800-1500ms	~$0.003/query	API

Source: Adapted from the BEIR benchmark paper and community evaluations on the MTEB leaderboard, 2024-2025.
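
For readers unfamiliar with the metric in the table: NDCG@10 rewards placing relevant documents near the top of the first 10 results, and normalizes by the best possible ordering so scores fall between 0 and 1. The sketch below uses the linear-gain form, DCG@k = sum of rel_i / log2(i + 1), and assumes the ideal ordering is computed from the same judged list; benchmark tooling may differ in details such as gain function and how unjudged documents are handled.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given order divided by DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance labels in ranked order: relevant docs sit at positions 1 and 3
print(round(ndcg_at_k([1, 0, 1, 0, 0]), 4))  # 0.9197
```

Moving the second relevant document from position 3 to position 2 would raise the score to 1.0, which is the kind of improvement a reranker is meant to produce.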

Selecting the Right Reranker

Use this checklist to choose:

No budget, need it fast: FlashrankRerank on CPU
Privacy sensitive (can't use external APIs): CrossEncoder or BGE local
Multilingual content: BGE-reranker-v2-m3 or Cohere multilingual
Maximum quality, cloud deployment: CohereRerank v3.0
Already using OpenAI, want zero new dependencies: RankGPT with gpt-4o-mini
High volume, cost-sensitive: FlashrankRerank (zero marginal cost)
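
The checklist above could be encoded as a small helper. This is only a sketch: the function name `choose_reranker`, its parameters, and especially the order in which the rules are checked are assumptions made for illustration, since the list itself does not say which rule wins when several apply.

```python
def choose_reranker(
    can_use_external_api: bool,
    multilingual: bool,
    already_on_openai: bool = False,
    cost_sensitive: bool = False,
) -> str:
    """Map deployment constraints to one of the rerankers in the checklist."""
    # Assumed precedence: language coverage, then privacy, then cost
    if multilingual:
        if not can_use_external_api or cost_sensitive:
            return "BGE-reranker-v2-m3"
        return "Cohere multilingual"
    if not can_use_external_api:
        return "CrossEncoder or BGE local"
    if cost_sensitive:
        return "FlashrankRerank"
    if already_on_openai:
        return "RankGPT with gpt-4o-mini"
    return "CohereRerank v3.0"


print(choose_reranker(can_use_external_api=True, multilingual=False, cost_sensitive=True))
# FlashrankRerank
```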

For the semantic search tutorial, FlashrankRerank is the best default. For production RAG systems serving enterprise users, CohereRerank v3.0 offers the best quality-to-latency ratio.

The OpenAI API integration guide covers how to combine reranking costs with the broader API cost management strategy — important when you are paying per query for both retrieval and reranking.
