10 LangChain Document Compressors to Reduce Context Length

⚡ Quick Answer

Learn 10 LangChain document compressors that slash context length, cut LLM costs, and keep RAG pipelines fast and accurate in production.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #document compressors #context length #RAG #embeddings

Context windows are getting bigger, but your LLM bill isn't getting smaller. Stuffing 20 raw chunks from a vector store into every prompt is wasteful — most of that text doesn't help answer the query. LangChain's document compressor layer sits between retrieval and generation, cutting the noise before it reaches the model.

This guide covers 10 practical compressors, shows you real Python code for each, and gives you a cost-savings table so you can pick the right tool for your budget.

If you haven't set up retrieval yet, the RAG system tutorial and Vector database guide are good starting points.

Why Context Length Matters More Than You Think

GPT-4o charges $5 per million input tokens. Claude 3.5 Sonnet charges $3. Gemini 1.5 Pro charges $1.25. A typical RAG query retrieves 5–10 chunks of 500 tokens each. That's 2,500–5,000 tokens per query just for context. At 10,000 daily queries, that adds up to 25–50 million context tokens a day, or roughly $125–$250 per day at GPT-4o's rate (about $31–$62 at Gemini 1.5 Pro's), before you count system prompts and output.

Document compressors address three problems simultaneously:

Cost — fewer tokens in means a lower API bill
Quality — less noise means the LLM focuses on relevant evidence
Latency — smaller prompts process faster

A 2024 study from Databricks found that contextual compression improved RAG answer accuracy by 18% while reducing context tokens by 60% on average. Those numbers are consistent with what teams building production RAG systems report in the wild.

The ContextualCompressionRetriever Wrapper

All LangChain compressors plug into ContextualCompressionRetriever, which wraps any base retriever:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Build base vector store retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

# Wrap with compressor
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

docs = compression_retriever.invoke("What are the side effects of metformin?")
for doc in docs:
    print(f"[{len(doc.page_content)} chars] {doc.page_content[:200]}")

Eight raw retrievals become two or three tightly focused passages. Now let's look at each compressor in detail.

Compressor 1: LLMChainExtractor

LLMChainExtractor is the original LangChain compressor. It sends each chunk to an LLM with the query and asks the model to pull out only the relevant sentences.

from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
extractor = LLMChainExtractor.from_llm(llm)

doc = Document(
    page_content="""
    Metformin is a first-line medication for type 2 diabetes.
    It works by decreasing glucose production in the liver.
    Common side effects include nausea, vomiting, and diarrhea.
    Rare but serious side effects include lactic acidosis.
    The drug was first synthesized in 1922 by Emil Werner.
    It is on the WHO list of essential medicines.
    """,
    metadata={"source": "medical_db"}
)

query = "What are the side effects of metformin?"
compressed = extractor.compress_documents([doc], query)
print(compressed[0].page_content)
# → "Common side effects include nausea, vomiting, and diarrhea.
#    Rare but serious side effects include lactic acidosis."

Best for: High-quality extraction where accuracy matters more than cost. Expect one LLM call per retrieved chunk, so retrieving k chunks means k calls per query.

Compressor 2: LLMChainFilter

LLMChainFilter takes a binary approach — it asks the LLM whether a chunk is relevant at all, then keeps or drops it wholesale. No extraction, just a relevance judgment.

from langchain.retrievers.document_compressors import LLMChainFilter

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
filter_compressor = LLMChainFilter.from_llm(llm)

docs = [
    Document(page_content="Metformin reduces liver glucose production.", metadata={"id": 1}),
    Document(page_content="The history of diabetes treatment began in ancient Egypt.", metadata={"id": 2}),
    Document(page_content="Lactic acidosis is a rare but serious risk with metformin.", metadata={"id": 3}),
]

query = "What are the risks of taking metformin?"
filtered = filter_compressor.compress_documents(docs, query)
print(f"Kept {len(filtered)}/{len(docs)} documents")
# → Kept 2/3 documents (drops the history doc)

This approach is faster than LLMChainExtractor because the LLM only outputs YES or NO rather than generating extracted text. You still pay for the input tokens on each chunk, but the output is minimal.

Compressor 3: EmbeddingsFilter

No LLM needed. EmbeddingsFilter computes cosine similarity between the query embedding and each chunk's embedding, then drops chunks below a threshold.

from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76  # tune this per domain
)

docs = [
    Document(page_content="Metformin controls blood sugar in type 2 diabetes."),
    Document(page_content="Ancient Rome had a thriving olive oil trade."),
    Document(page_content="Side effects of metformin include GI upset."),
]

query = "metformin side effects"
filtered = emb_filter.compress_documents(docs, query)
print(f"Kept {len(filtered)} docs after embedding filter")

Cost: Only embedding API calls — roughly $0.00002 per 1,000 tokens with text-embedding-3-small. Orders of magnitude cheaper than LLM filtering. For high-throughput systems, this is almost always the first compressor you should add.

Tradeoff: Embedding similarity doesn't always capture semantic relevance perfectly. A chunk about "drug interactions with blood sugar medications" might score lower than expected against a query like "metformin side effects" even though it's highly relevant.
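
To make the threshold mechanics concrete, here is a minimal stdlib-only sketch of the idea behind an embedding filter. The three-dimensional vectors are made-up stand-ins for real embeddings, `filter_by_similarity` is a hypothetical helper, and the `>=` comparison is an assumption rather than a statement about LangChain's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def filter_by_similarity(query_vec, chunks, threshold):
    """Keep (text, vector) chunks whose similarity to the query meets the threshold."""
    kept = []
    for text, vec in chunks:
        score = cosine_similarity(query_vec, vec)
        if score >= threshold:
            kept.append((text, round(score, 3)))
    return kept

# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 0.0]
chunks = [
    ("Side effects of metformin include GI upset.", [0.9, 0.1, 0.0]),
    ("Metformin controls blood sugar in type 2 diabetes.", [0.6, 0.8, 0.0]),
    ("Ancient Rome had a thriving olive oil trade.", [0.0, 0.1, 0.9]),
]

kept = filter_by_similarity(query_vec, chunks, threshold=0.76)
for text, score in kept:
    print(score, text)
```

With these toy values only the first chunk clears 0.76. The second chunk is on topic but scores 0.6, which is the kind of borderline case the tradeoff above describes.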

Compressor 4: CohereRerank

Cohere's Rerank API is purpose-built for this problem. It takes a query and a list of documents, runs a cross-encoder model, and returns relevance scores. Top-k documents survive.

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,  # keep top 3 after reranking
    cohere_api_key="your-cohere-api-key"
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever  # fetches k=8 candidates, as configured earlier
)

results = compression_retriever.invoke("diabetes medication side effects")
for doc in results:
    print(doc.metadata.get("relevance_score", "N/A"), doc.page_content[:100])

Cohere Rerank is one of the most effective compressors for production RAG. The cross-encoder architecture understands query-document interaction better than bi-encoder similarity because it looks at both inputs together rather than independently.

Pricing: roughly $1 per 1,000 searches, billed per request rather than per token. Rates and per-request document limits vary by model version, so check Cohere's pricing page before budgeting.
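
The "score every candidate, keep the top few" step can be sketched without any API. In the sketch below, `toy_relevance` is a hypothetical keyword-overlap scorer standing in for the reranking model. It is not how Cohere computes relevance, and it only illustrates the selection logic that `top_n` controls.

```python
def toy_relevance(query, text):
    """Hypothetical stand-in for a reranker score: fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().replace(".", "").split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)

def rerank_top_n(query, texts, top_n, score_fn=toy_relevance):
    """Score every candidate against the query, then keep the top_n highest."""
    scored = [(score_fn(query, t), t) for t in texts]
    # sorted() is stable, so tied candidates keep their retrieval order
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

candidates = [
    "The history of diabetes treatment began in ancient Egypt.",
    "Common metformin side effects include nausea and diarrhea.",
    "Metformin is taken with meals to reduce stomach upset.",
    "Serious side effects of diabetes medication include lactic acidosis.",
]

top = rerank_top_n("diabetes medication side effects", candidates, top_n=2)
for score, text in top:
    print(f"{score:.2f} {text}")
```

A real reranker replaces `toy_relevance` with a model that reads the query and document together. The surrounding logic stays the same.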

Compressor 5: FlashrankRerank (Local)

If you can't send data to Cohere, FlashrankRerank runs a small cross-encoder locally with no API costs.

# pip install flashrank
from langchain_community.document_compressors import FlashrankRerank

local_reranker = FlashrankRerank(
    model_name="ms-marco-MiniLM-L-12-v2",
    top_n=3
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=local_reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("metformin dosage for elderly patients")
for doc in results:
    print(doc.page_content[:150])

Tradeoff: Slightly lower quality than Cohere's hosted model, but zero data leaves your infrastructure. Works offline. Perfect for healthcare and finance applications with strict data residency requirements.

Compressor 6: DocumentCompressorPipeline

Stack multiple compressors. The most powerful pattern is a fast pre-filter followed by a high-quality extractor.

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

# Stage 1: re-split long chunks into smaller pieces
splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separator=". "
)

# Stage 2: embedding similarity filter (fast, cheap)
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)

# Stage 3: LLM extraction (accurate, expensive — runs on fewer docs now)
llm_extractor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

pipeline = DocumentCompressorPipeline(
    transformers=[splitter, emb_filter, llm_extractor]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)

This three-stage pipeline typically reduces context by 70–80% while maintaining answer quality. The LLM extractor only runs on the small set of chunks that passed embedding filtering, keeping costs low while preserving accuracy where it matters.
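
The reason the ordering matters can be shown with a small stdlib sketch. Every name here is hypothetical: the keyword filter stands in for the embedding filter, and `expensive_stage` stands in for the LLM extractor, recording how many pieces it was asked to process.

```python
def make_pipeline(stages):
    """Compose stages; each stage takes (docs, query) and returns a new list of docs."""
    def run(docs, query):
        for stage in stages:
            docs = stage(docs, query)
        return docs
    return run

def split_sentences(docs, query):
    """Stage 1: break each chunk into sentence-sized pieces."""
    pieces = []
    for doc in docs:
        pieces.extend(s.strip() + "." for s in doc.split(".") if s.strip())
    return pieces

def cheap_keyword_filter(docs, query):
    """Stage 2: cheap pre-filter (stand-in for an embedding filter)."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.lower().rstrip(".").split())]

expensive_calls = []

def expensive_stage(docs, query):
    """Stage 3: stand-in for an LLM extractor; records how many docs it received."""
    expensive_calls.append(len(docs))
    return docs

pipeline = make_pipeline([split_sentences, cheap_keyword_filter, expensive_stage])
chunks = [
    "Metformin lowers glucose. Common side effects include nausea.",
    "Rome traded olive oil. The empire built roads.",
]
result = pipeline(chunks, "metformin side effects")
print(result, expensive_calls)
```

Four sentence pieces enter the filter, but only two reach the expensive stage. Reversing the order would send all four to the costly step.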

Compressor 7: LLMListwiseRerank

LLMListwiseRerank sends all retrieved documents to an LLM at once and asks it to rank them by relevance. This is slower but can capture complex cross-document reasoning.

from langchain.retrievers.document_compressors import LLMListwiseRerank

listwise_reranker = LLMListwiseRerank.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    num_docs_to_keep=3
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=listwise_reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("Compare first-line diabetes medications")
for i, doc in enumerate(results):
    print(f"Rank {i+1}: {doc.page_content[:200]}")

Use case: Complex analytical queries where context between documents matters. Not suitable for high-throughput applications due to cost and latency. Works best when you're building a research or analysis tool rather than a real-time chatbot.
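
As a rough illustration of what "listwise" means, the sketch below builds one prompt containing every candidate and then applies the ranking the model returns. The prompt wording, the JSON reply format, and both function names are assumptions made for illustration; they are not LangChain's actual prompt or parser.

```python
import json

def build_listwise_prompt(query, docs):
    """Put every candidate in one prompt, each tagged with a numeric id."""
    lines = [f"Query: {query}", "", "Documents:"]
    for i, doc in enumerate(docs):
        lines.append(f"[{i}] {doc}")
    lines.append("")
    lines.append("Return a JSON list of document ids, most relevant first.")
    return "\n".join(lines)

def apply_ranking(docs, llm_reply, keep):
    """Parse the model's id list and return the top `keep` documents in that order."""
    ranked_ids = json.loads(llm_reply)
    seen = set()
    ordered = []
    for doc_id in ranked_ids:
        # Ignore ids that are out of range or repeated in the reply.
        if isinstance(doc_id, int) and 0 <= doc_id < len(docs) and doc_id not in seen:
            seen.add(doc_id)
            ordered.append(docs[doc_id])
    return ordered[:keep]

docs = [
    "Metformin is usually the first drug prescribed for type 2 diabetes.",
    "Olive oil was a staple of Roman trade.",
    "Sulfonylureas are an older alternative with a higher hypoglycemia risk.",
]
prompt_text = build_listwise_prompt("Compare first-line diabetes medications", docs)
fake_reply = "[2, 0, 2, 7]"  # simulated model output, including a duplicate and a bad id
ranked = apply_ranking(docs, fake_reply, keep=2)
print(ranked)
```

Because all documents share one prompt, the cost grows with the total length of the candidate set, which is why this approach suits small k and low query volumes.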

Compressor 8: Custom LLM Compressor with Structured Output

Sometimes you need domain-specific compression logic. Build a custom compressor using LCEL:

from langchain_core.documents import Document
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from pydantic import BaseModel, Field
from typing import List, Sequence, Optional
from langchain_core.callbacks import Callbacks

class StructuredCompressor(BaseDocumentCompressor):
    llm: object

    class Config:
        arbitrary_types_allowed = True

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None
    ) -> List[Document]:

        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract the most relevant excerpt from this document for the query. Return JSON with keys: excerpt, relevance_score (0-1), reasoning."),
            ("human", "Query: {query}\n\nDocument: {document}")
        ])

        chain = prompt | self.llm | JsonOutputParser()

        compressed = []
        for doc in documents:
            try:
                result = chain.invoke({
                    "query": query,
                    "document": doc.page_content
                })
                if result["relevance_score"] > 0.5:
                    compressed.append(Document(
                        page_content=result["excerpt"],
                        metadata={
                            **doc.metadata,
                            "relevance_score": result["relevance_score"],
                            "reasoning": result["reasoning"]
                        }
                    ))
            except Exception:
                pass  # drop malformed responses

        return sorted(
            compressed,
            key=lambda x: x.metadata["relevance_score"],
            reverse=True
        )

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
custom_compressor = StructuredCompressor(llm=llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=custom_compressor,
    base_retriever=base_retriever
)

The structured output gives you explainability — you can log why each chunk was kept or dropped, which is valuable for debugging and compliance.
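
One way to get that audit trail is to validate and log each model reply before deciding whether to keep the chunk. The following is a sketch under assumptions: `parse_judgment` and the logger name are hypothetical, and the JSON keys mirror the ones requested in the system prompt above.

```python
import json
import logging

logger = logging.getLogger("compressor")

REQUIRED_KEYS = {"excerpt", "relevance_score", "reasoning"}

def parse_judgment(raw_reply, min_score=0.5):
    """Return the parsed judgment if it is valid and relevant, else None (logging why)."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        logger.warning("dropped: reply was not valid JSON")
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        logger.warning("dropped: reply missing required keys")
        return None
    try:
        score = float(data["relevance_score"])
    except (TypeError, ValueError):
        logger.warning("dropped: relevance_score is not a number")
        return None
    if score <= min_score:
        logger.info("dropped: score %.2f, reason: %s", score, data["reasoning"])
        return None
    logger.info("kept: score %.2f, reason: %s", score, data["reasoning"])
    return {**data, "relevance_score": score}

good = parse_judgment('{"excerpt": "Nausea is common.", "relevance_score": 0.9, "reasoning": "lists a side effect"}')
weak = parse_judgment('{"excerpt": "Founded in 1922.", "relevance_score": 0.2, "reasoning": "history only"}')
broken = parse_judgment("not json at all")
print(good, weak, broken)
```

Logging each rejection separately distinguishes "the model judged it irrelevant" from "the model returned something unparseable", which a bare `except: pass` hides.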

Compressor 9: Semantic Chunking + Embedding Filter

Combine semantic chunking with embedding filtering for better chunk boundaries:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Semantic splitter creates chunks at natural topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)

# Then filter by similarity
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.78
)

pipeline = DocumentCompressorPipeline(
    transformers=[semantic_splitter, emb_filter]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)

Semantic chunking ensures that when a chunk is split, it doesn't cut mid-thought. This improves embedding filter accuracy because each chunk represents a coherent idea rather than an arbitrary text window.

For more on embedding strategies, see the semantic search tutorial.

Compressor 10: Parent Document Retriever with Compression

Search over small child chunks for precise matching, return the larger parent documents they came from, then run a compressor over those parents before they reach the LLM:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

# Child splitter for embedding (small, precise)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Parent splitter for LLM context (larger, richer)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents
loader = WebBaseLoader("https://example.com/medical-docs")
docs = loader.load()
parent_retriever.add_documents(docs)

# Combine with LLM compressor for final context reduction
llm_filter = LLMChainFilter.from_llm(ChatOpenAI(model="gpt-4o-mini"))

final_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=parent_retriever
)

results = final_retriever.invoke("What dosage adjustments are needed for renal impairment?")

The parent-child architecture retrieves semantically precise child chunks but stores parent context. After filtering, the LLM receives rich parent-level passages for high-quality answers without the overhead of embedding entire large documents.
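
The child-to-parent lookup at the heart of this pattern can be sketched in plain Python. Everything below is illustrative: the fixed-width splitter and the keyword score are simple stand-ins for a real text splitter and vector search, and the function names are hypothetical.

```python
def build_child_index(parents, child_size):
    """Split each parent into fixed-size child chunks, remembering the parent id."""
    index = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            index.append((parent_id, text[start:start + child_size]))
    return index

def retrieve_parents(query, index, parents, k):
    """Rank child chunks by a toy keyword score, then return their de-duplicated parents."""
    terms = query.lower().split()
    scored = sorted(
        index,
        key=lambda item: sum(item[1].lower().count(t) for t in terms),
        reverse=True,
    )
    parent_ids = []
    for parent_id, _child in scored[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

parents = {
    "dosing": "Metformin dosing starts low. Reduce the dose in renal impairment and monitor eGFR.",
    "history": "Metformin was first described in 1922. It entered wide clinical use decades later.",
}
index = build_child_index(parents, child_size=40)
found = retrieve_parents("renal impairment", index, parents, k=2)
print(found)
```

The match happens on a 40-character child, but the caller gets the full parent passage back, and two children from the same parent collapse into one result.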

Cost Savings Comparison Table

Compressor	API Calls per Query	Avg Token Reduction	Cost per 1K Queries	Best Use Case
LLMChainExtractor	k LLM calls	65–75%	$0.30–$1.20	High accuracy needs
LLMChainFilter	k LLM calls	40–60% (drops chunks)	$0.15–$0.60	Binary relevance
EmbeddingsFilter	k embed calls	50–70%	$0.001–$0.005	High throughput
CohereRerank	1 Cohere call	60–80%	$1.00	Production RAG
FlashrankRerank	0 API calls	60–80%	$0.000	Air-gapped / private
Pipeline (emb→LLM)	k embed + filtered LLM	70–85%	$0.05–$0.20	Balanced cost/quality
LLMListwiseRerank	1 large LLM call	70–80%	$0.50–$2.00	Complex queries
Custom Structured	k LLM calls	60–75%	$0.20–$0.80	Domain-specific logic
Semantic + Emb	k embed calls	55–72%	$0.002–$0.008	Semantic coherence
Parent-Child + Filter	k LLM calls	40–55%	$0.15–$0.50	Long-doc retrieval

Example cost savings calculation:

Baseline: 8 chunks × 500 tokens = 4,000 tokens at $5/M = $0.020 per query
With CohereRerank (keep top 3): 3 × 500 = 1,500 tokens = $0.0075 + $0.001 Cohere = $0.0085
Savings: 57% cost reduction per query
At 100,000 queries/month: $2,000 baseline → $850 with reranking → $1,150 monthly savings
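
The same arithmetic as a small script, so you can plug in your own chunk counts and prices. The function name is hypothetical; the inputs are the figures from the calculation above ($5 per million input tokens and $0.001 per rerank call).

```python
def context_cost(num_chunks, tokens_per_chunk, price_per_million):
    """Input-token cost in dollars for the retrieved context of one query."""
    return num_chunks * tokens_per_chunk * price_per_million / 1_000_000

RERANK_FEE = 0.001  # per-query reranking fee used in the example above

baseline = context_cost(8, 500, 5.0)
reranked = context_cost(3, 500, 5.0) + RERANK_FEE
savings_pct = (baseline - reranked) / baseline * 100
monthly_queries = 100_000

print(f"baseline ${baseline:.4f}, reranked ${reranked:.4f}, savings {savings_pct:.1f}%")
print(f"monthly: ${baseline * monthly_queries:,.0f} -> ${reranked * monthly_queries:,.0f}")
```

Changing `price_per_million` shows how quickly the savings shrink on cheaper models, where the fixed reranking fee becomes a larger share of the per-query cost.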

Measuring Compression Quality

Don't tune compressors blindly. Track context precision with RAGAS:

# pip install ragas
from datasets import Dataset
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness

questions = ["What are metformin side effects?", "How does metformin work?"]
results_compressed = []

# Build the answer chain once instead of on every loop iteration
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
answer_chain = prompt | ChatOpenAI(model="gpt-4o")

for q in questions:
    compressed_docs = compression_retriever.invoke(q)
    context = "\n\n".join(d.page_content for d in compressed_docs)

    answer = answer_chain.invoke({
        "context": context,
        "question": q
    }).content

    results_compressed.append({
        "question": q,
        "answer": answer,
        "contexts": [d.page_content for d in compressed_docs],
        "ground_truth": "Reference answer here"
    })

eval_results = evaluate(
    Dataset.from_list(results_compressed),
    metrics=[context_precision, faithfulness]
)
print(eval_results)

A healthy production setup should show 50–80% token reduction with latency under 500ms for embedding-based filters and under 2 seconds for LLM-based extractors.
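
Both numbers are easy to track with a few lines of instrumentation. This sketch assumes a whitespace word count as a rough proxy for tokens (a real tokenizer will give different absolute counts), and the helper names are hypothetical.

```python
import time

def approx_tokens(text):
    """Rough token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())

def token_reduction(raw_docs, compressed_docs):
    """Percentage of context tokens removed by compression."""
    before = sum(approx_tokens(d) for d in raw_docs)
    after = sum(approx_tokens(d) for d in compressed_docs)
    if before == 0:
        return 0.0
    return (before - after) / before * 100

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

raw = ["one two three four five six seven eight nine ten"]
compressed = ["one two three"]
reduction, elapsed_ms = timed(token_reduction, raw, compressed)
print(f"{reduction:.0f}% fewer tokens, measured in {elapsed_ms:.3f} ms")
```

In practice you would wrap the compression retriever call with `timed` and compare its output against the base retriever's documents for the same query.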

Tuning the EmbeddingsFilter Threshold

The similarity_threshold parameter is the most important knob to tune:

import numpy as np

def find_optimal_threshold(queries, docs, ground_truth_relevant, embeddings):
    thresholds = np.arange(0.60, 0.95, 0.05)
    results = []

    for threshold in thresholds:
        emb_filter = EmbeddingsFilter(
            embeddings=embeddings,
            similarity_threshold=float(threshold)
        )

        total_tp = total_fp = total_fn = 0
        for query, doc_list, relevant_ids in zip(queries, docs, ground_truth_relevant):
            filtered = emb_filter.compress_documents(doc_list, query)
            filtered_ids = {d.metadata["id"] for d in filtered}

            tp = len(filtered_ids & relevant_ids)
            fp = len(filtered_ids - relevant_ids)
            fn = len(relevant_ids - filtered_ids)

            total_tp += tp
            total_fp += fp
            total_fn += fn

        precision = total_tp / (total_tp + total_fp + 1e-9)
        recall = total_tp / (total_tp + total_fn + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)

        results.append({
            "threshold": threshold,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })

    best = max(results, key=lambda x: x["f1"])
    print(f"Optimal threshold: {best['threshold']:.2f} (F1={best['f1']:.3f})")
    return best["threshold"]

Most RAG applications land between 0.72 and 0.82. Medical and legal domains often need higher thresholds (0.80+) to avoid including loosely related content.

Production Architecture Pattern

For a production RAG pipeline, combine compressors with caching and async retrieval:

import asyncio
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based only on the provided context. Say so if the answer isn't there."),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    RunnableParallel({
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)

# Async for concurrent requests
async def answer_questions(questions: list[str]) -> list[str]:
    tasks = [rag_chain.ainvoke(q) for q in questions]
    return await asyncio.gather(*tasks)

# Stream individual responses
async def stream_answer(question: str):
    async for chunk in rag_chain.astream(question):
        print(chunk, end="", flush=True)
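
The chain above covers async and streaming. Caching could be layered in front of retrieval in many ways, and the following is one minimal in-process sketch. `TTLCache`, `normalize`, and `fake_retrieve` are hypothetical names, and `fake_retrieve` stands in for a call to the compression retriever.

```python
import time

class TTLCache:
    """Tiny in-memory cache that forgets entries after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value

def normalize(query):
    """Collapse case and whitespace so trivially different queries share a cache entry."""
    return " ".join(query.lower().split())

calls = []

def fake_retrieve(query):
    """Stand-in for the compression retriever; records each real invocation."""
    calls.append(query)
    return [f"compressed context for: {query}"]

cache = TTLCache(ttl_seconds=300)

def cached_retrieve(query):
    key = normalize(query)
    return cache.get_or_compute(key, lambda: fake_retrieve(key))

first = cached_retrieve("Metformin  side effects")
second = cached_retrieve("metformin side effects")
print(len(calls), first == second)
```

Caching compressed results is most useful when compression involves LLM calls, since a cache hit then saves both the retrieval and the per-chunk model cost. A shared store would be needed once you run more than one worker process.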

For complete RAG pipeline examples, see the RAG system tutorial and Build AI agent with LangChain.

Choosing the Right Compressor

High throughput (>10K queries/day): EmbeddingsFilter or CohereRerank. Both are fast and cheap. Combine them in a pipeline for best results.

High accuracy (medical, legal, financial): LLMChainExtractor or Pipeline(EmbeddingsFilter → LLMChainExtractor). The extra LLM cost is worth it when wrong answers have real consequences.

Private data / air-gapped: FlashrankRerank is your best option without API calls. Performance is surprisingly good for most use cases.

Complex multi-document reasoning: LLMListwiseRerank. Expensive, but it can compare documents against each other to find the most comprehensive answer.

Cost-optimized production: Pipeline with EmbeddingsFilter (threshold 0.76) → CohereRerank (top_n=3). This reduces LLM context by 75–85% with minimal accuracy loss.

The OpenAI API integration guide covers token counting utilities that help you measure compression ratios in production.

Document compressors are one of the highest-ROI optimizations you can make to a RAG system. A well-tuned compression pipeline pays for itself within days through reduced API costs, and the answer quality improvements are often immediately visible. Start with EmbeddingsFilter for quick wins, add CohereRerank for production accuracy, and build a DocumentCompressorPipeline when you need both speed and quality.

For the full picture, explore the LangChain tutorial 2025, AI agent memory and planning, and Deploy AI model to production.

Frequently Asked Questions

What is a LangChain document compressor? A document compressor is a post-retrieval filter that takes the chunks returned by your vector store and strips out sentences or passages that aren't relevant to the query before sending them to the LLM, reducing token usage and improving answer quality.

Which compressor is cheapest to run? EmbeddingsFilter is the cheapest because it uses only embedding similarity comparisons and never calls an LLM. CohereRerank is also cost-effective at scale because Cohere charges per 1,000 API calls, not per token.

Can I chain multiple compressors together? Yes. Use DocumentCompressorPipeline to stack compressors in order. A common pattern is EmbeddingsFilter first to remove clearly irrelevant chunks, then LLMChainExtractor to compress the survivors.

Langchain

10 LangChain Document Compressors to Reduce Context Length

⚡ Quick Answer

Learn 10 LangChain document compressors that slash context length, cut LLM costs, and keep RAG pipelines fast and accurate in production.

AiTechWorlds Team May 31, 2026 14 min read

#LangChain #document compressors #context length #RAG #embeddings

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide covers 10 practical compressors, shows you real Python code for each, and gives you a cost-savings table so you can pick the right tool for your budget.

If you haven't set up retrieval yet, the RAG system tutorial and Vector database guide are good starting points.

Why Context Length Matters More Than You Think

Document compressors address three problems simultaneously:

Cost — fewer tokens in means a lower API bill
Quality — less noise means the LLM focuses on relevant evidence
Latency — smaller prompts process faster

The ContextualCompressionRetriever Wrapper

All LangChain compressors plug into ContextualCompressionRetriever, which wraps any base retriever:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Build base vector store retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

# Wrap with compressor
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

docs = compression_retriever.invoke("What are the side effects of metformin?")
for doc in docs:
    print(f"[{len(doc.page_content)} chars] {doc.page_content[:200]}")

Eight raw retrievals become two or three tightly focused passages. Now let's look at each compressor in detail.

Compressor 1: LLMChainExtractor

LLMChainExtractor is the original LangChain compressor. It sends each chunk to an LLM with the query and asks the model to pull out only the relevant sentences.

from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
extractor = LLMChainExtractor.from_llm(llm)

doc = Document(
    page_content="""
    Metformin is a first-line medication for type 2 diabetes.
    It works by decreasing glucose production in the liver.
    Common side effects include nausea, vomiting, and diarrhea.
    Rare but serious side effects include lactic acidosis.
    The drug was first synthesized in 1922 by Emil Werner.
    It is on the WHO list of essential medicines.
    """,
    metadata={"source": "medical_db"}
)

query = "What are the side effects of metformin?"
compressed = extractor.compress_documents([doc], query)
print(compressed[0].page_content)
# → "Common side effects include nausea, vomiting, and diarrhea.
#    Rare but serious side effects include lactic acidosis."

Best for: High-quality extraction where accuracy matters more than cost. Expect one to three LLM calls per retrieved chunk.

Compressor 2: LLMChainFilter

LLMChainFilter takes a binary approach — it asks the LLM whether a chunk is relevant at all, then keeps or drops it wholesale. No extraction, just a relevance judgment.

from langchain.retrievers.document_compressors import LLMChainFilter

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
filter_compressor = LLMChainFilter.from_llm(llm)

docs = [
    Document(page_content="Metformin reduces liver glucose production.", metadata={"id": 1}),
    Document(page_content="The history of diabetes treatment began in ancient Egypt.", metadata={"id": 2}),
    Document(page_content="Lactic acidosis is a rare but serious risk with metformin.", metadata={"id": 3}),
]

query = "What are the risks of taking metformin?"
filtered = filter_compressor.compress_documents(docs, query)
print(f"Kept {len(filtered)}/{len(docs)} documents")
# → Kept 2/3 documents (drops the history doc)

Compressor 3: EmbeddingsFilter

No LLM needed. EmbeddingsFilter computes cosine similarity between the query embedding and each chunk's embedding, then drops chunks below a threshold.

from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76  # tune this per domain
)

docs = [
    Document(page_content="Metformin controls blood sugar in type 2 diabetes."),
    Document(page_content="Ancient Rome had a thriving olive oil trade."),
    Document(page_content="Side effects of metformin include GI upset."),
]

query = "metformin side effects"
filtered = emb_filter.compress_documents(docs, query)
print(f"Kept {len(filtered)} docs after embedding filter")
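To make the mechanism concrete, here is a minimal pure-Python sketch of threshold filtering on cosine similarity. The toy 3-dimensional vectors and the helper names (cosine_similarity, filter_by_similarity) are illustrative assumptions, not LangChain internals. Real embeddings have hundreds or thousands of dimensions, and the library's exact comparison may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Keep the texts whose vector is at or above the similarity threshold."""
    return [
        text
        for text, vec in chunks
        if cosine_similarity(query_vec, vec) >= threshold
    ]


# Hypothetical toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
chunks = [
    ("Side effects of metformin include GI upset.", [0.9, 0.1, 0.0]),
    ("Ancient Rome had a thriving olive oil trade.", [0.0, 0.2, 1.0]),
]

kept = filter_by_similarity(query_vec, chunks, threshold=0.76)
print(kept)
```

Because the filter is only arithmetic on vectors that already exist, its cost is dominated by the embedding calls, not by the comparison itself.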

Compressor 4: CohereRerank

Cohere's Rerank API is purpose-built for this problem. It takes a query and a list of documents, runs a cross-encoder model, and returns relevance scores. Top-k documents survive.

from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,  # keep top 3 after reranking
    cohere_api_key="your-cohere-api-key"
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever  # retrieves k=10 initially
)

results = compression_retriever.invoke("diabetes medication side effects")
for doc in results:
    print(doc.metadata.get("relevance_score", "N/A"), doc.page_content[:100])

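Pricing: about $1 per 1,000 searches at the time of writing. Cohere bills per search rather than per document, with a cap on how many documents one search can cover, so confirm the current rate and cap on Cohere's pricing page. Very affordable at scale.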

Compressor 5: FlashrankRerank (Local)

If you can't send data to Cohere, FlashrankRerank runs a small cross-encoder locally with no API costs.

# pip install flashrank
from langchain_community.document_compressors import FlashrankRerank

local_reranker = FlashrankRerank(
    model="ms-marco-MiniLM-L-12-v2",  # the field is named `model`, not `model_name`
    top_n=3
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=local_reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("metformin dosage for elderly patients")
for doc in results:
    print(doc.page_content[:150])
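Both rerankers follow the same score, sort, truncate pattern. The sketch below illustrates it with a deliberately naive word-overlap scorer standing in for a cross-encoder. The names rerank_top_n and word_overlap_score are hypothetical, and a real cross-encoder scores the query and document jointly with a neural model rather than by counting shared words.

```python
from typing import Callable, List, Tuple


def word_overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank_top_n(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Score every doc against the query, sort descending, keep the best top_n."""
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


docs = [
    "olive oil trade in ancient rome",
    "metformin is a diabetes drug",
    "metformin dosage for elderly patients should start low",
]

ranked = rerank_top_n("metformin dosage elderly", docs, word_overlap_score, top_n=2)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

The base retriever should fetch more candidates than top_n (for example k=10 for top_n=3), otherwise the reranker has nothing to discard.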

Compressor 6: DocumentCompressorPipeline

Stack multiple compressors. The most powerful pattern is a fast pre-filter followed by a high-quality extractor.

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

# Stage 1: re-split long chunks into smaller pieces
splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separator=". "
)

# Stage 2: embedding similarity filter (fast, cheap)
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)

# Stage 3: LLM extraction (accurate, expensive — runs on fewer docs now)
llm_extractor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

pipeline = DocumentCompressorPipeline(
    transformers=[splitter, emb_filter, llm_extractor]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)
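The reason stage order matters is easier to see with a toy version. The following sketch chains three plain functions the way a pipeline chains transformers, and counts how many pieces reach the expensive stage. Everything here (run_pipeline, split_sentences, keyword_filter, CountingExtractor) is a hypothetical stand-in, not the LangChain implementation: the keyword filter plays the role of the embedding filter, and the counter plays the role of the LLM extractor.

```python
from typing import Callable, List

Stage = Callable[[List[str], str], List[str]]


def run_pipeline(stages: List[Stage], docs: List[str], query: str) -> List[str]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        docs = stage(docs, query)
    return docs


def split_sentences(docs: List[str], query: str) -> List[str]:
    """Stage 1: break each document into sentence-sized pieces."""
    pieces: List[str] = []
    for doc in docs:
        pieces.extend(s.strip() for s in doc.split(". ") if s.strip())
    return pieces


def keyword_filter(docs: List[str], query: str) -> List[str]:
    """Stage 2 (cheap): keep pieces that share at least one word with the query."""
    query_words = set(query.lower().split())
    return [d for d in docs if query_words & set(d.lower().split())]


class CountingExtractor:
    """Stage 3 (expensive stand-in): records how many pieces it had to process."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, docs: List[str], query: str) -> List[str]:
        self.calls += len(docs)
        return docs


extractor = CountingExtractor()
raw = ["Metformin lowers glucose. Rome traded olive oil. Metformin can cause nausea"]
result = run_pipeline([split_sentences, keyword_filter, extractor], raw, "metformin")
print(result)
print("expensive calls:", extractor.calls)
```

Putting the cheap filter before the expensive stage means the expensive stage processes two pieces here instead of three. The same ordering logic applies when the last stage is a paid LLM call.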

Compressor 7: LLMListwiseRerank

LLMListwiseRerank sends all retrieved documents to an LLM at once and asks it to rank them by relevance. This is slower but can capture complex cross-document reasoning.

from langchain.retrievers.document_compressors import LLMListwiseRerank

listwise_reranker = LLMListwiseRerank.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    top_n=3  # number of documents to keep after ranking
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=listwise_reranker,
    base_retriever=base_retriever
)

results = compression_retriever.invoke("Compare first-line diabetes medications")
for i, doc in enumerate(results):
    print(f"Rank {i+1}: {doc.page_content[:200]}")

Compressor 8: Custom LLM Compressor with Structured Output

Sometimes you need domain-specific compression logic. Build a custom compressor using LCEL:

from langchain_core.documents import Document
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from pydantic import BaseModel, Field
from typing import List, Sequence, Optional
from langchain_core.callbacks import Callbacks

class StructuredCompressor(BaseDocumentCompressor):
    llm: object

    class Config:
        arbitrary_types_allowed = True

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None
    ) -> List[Document]:

        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract the most relevant excerpt from this document for the query. Return JSON with keys: excerpt, relevance_score (0-1), reasoning."),
            ("human", "Query: {query}\n\nDocument: {document}")
        ])

        chain = prompt | self.llm | JsonOutputParser()

        compressed = []
        for doc in documents:
            try:
                result = chain.invoke({
                    "query": query,
                    "document": doc.page_content
                })
                if result["relevance_score"] > 0.5:
                    compressed.append(Document(
                        page_content=result["excerpt"],
                        metadata={
                            **doc.metadata,
                            "relevance_score": result["relevance_score"],
                            "reasoning": result["reasoning"]
                        }
                    ))
            except Exception:
                pass  # drop malformed responses

        return sorted(
            compressed,
            key=lambda x: x.metadata["relevance_score"],
            reverse=True
        )

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
custom_compressor = StructuredCompressor(llm=llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=custom_compressor,
    base_retriever=base_retriever
)

The structured output gives you explainability: each kept chunk carries its score and reasoning in metadata, and logging the same fields for dropped chunks tells you why they were filtered out. Both are valuable for debugging and compliance.
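LLM-produced JSON is not guaranteed to match the requested shape, so it can help to validate each payload explicitly instead of relying on a broad except clause. Here is one possible validation helper, assuming the same three keys as the prompt above. The function name parse_judgment and the min_score default are hypothetical choices.

```python
from collections.abc import Mapping
from typing import Any, Optional, Tuple


def parse_judgment(
    result: Any, min_score: float = 0.5
) -> Optional[Tuple[str, float, str]]:
    """Return (excerpt, score, reasoning) for a usable, relevant payload, else None."""
    if not isinstance(result, Mapping):
        return None  # the model returned something other than a JSON object

    excerpt = result.get("excerpt")
    if not isinstance(excerpt, str) or not excerpt.strip():
        return None  # missing or empty excerpt

    try:
        # Models sometimes return the score as a string such as "0.8".
        score = float(result.get("relevance_score"))
    except (TypeError, ValueError):
        return None

    # Reject out-of-range values (including NaN) and anything not above the cutoff.
    if not 0.0 <= score <= 1.0 or score <= min_score:
        return None

    return excerpt.strip(), score, str(result.get("reasoning", ""))


print(parse_judgment({"excerpt": " Lactic acidosis is rare. ",
                      "relevance_score": "0.8",
                      "reasoning": "names a risk"}))
print(parse_judgment({"excerpt": "History of diabetes.", "relevance_score": 0.2}))
print(parse_judgment("not json"))
```

A helper like this could replace the direct dictionary lookups inside compress_documents, and the None branch is a natural place to log why a chunk was dropped.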

Compressor 9: Semantic Chunking + Embedding Filter

Combine semantic chunking with embedding filtering for better chunk boundaries:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Semantic splitter creates chunks at natural topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)

# Then filter by similarity
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.78
)

pipeline = DocumentCompressorPipeline(
    transformers=[semantic_splitter, emb_filter]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)

For more on embedding strategies, see the semantic search tutorial.

Compressor 10: Parent Document Retriever with Compression

Index small child chunks for precise matching, return their larger parent chunks for context, then run a compressor over those parents so only the relevant ones reach the LLM:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

# Child splitter for embedding (small, precise)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Parent splitter for LLM context (larger, richer)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents
loader = WebBaseLoader("https://example.com/medical-docs")
docs = loader.load()
parent_retriever.add_documents(docs)

# Combine with LLM compressor for final context reduction
llm_filter = LLMChainFilter.from_llm(ChatOpenAI(model="gpt-4o-mini"))

final_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=parent_retriever
)

results = final_retriever.invoke("What dosage adjustments are needed for renal impairment?")
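The child-to-parent lookup is the core idea here, and it fits in a few lines. The sketch below is an assumption-laden toy: substring matching stands in for vector search, and split_words, build_child_index, and retrieve_parents are hypothetical helpers rather than LangChain APIs.

```python
from typing import Dict, List


def split_words(text: str, words_per_chunk: int) -> List[str]:
    """Split text into chunks of a fixed number of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def build_child_index(parents: Dict[str, str], words_per_chunk: int) -> List[dict]:
    """Create child chunks that each remember which parent they came from."""
    index = []
    for parent_id, text in parents.items():
        for chunk in split_words(text, words_per_chunk):
            index.append({"parent_id": parent_id, "text": chunk})
    return index


def retrieve_parents(query: str, index: List[dict], parents: Dict[str, str]) -> List[str]:
    """Match against children, then return each matching parent once."""
    matched_ids: List[str] = []
    for child in index:
        hit = query.lower() in child["text"].lower()  # stand-in for vector search
        if hit and child["parent_id"] not in matched_ids:
            matched_ids.append(child["parent_id"])
    return [parents[pid] for pid in matched_ids]


parents = {
    "p1": "Metformin dosing varies by patient. Reduce the metformin dose in renal impairment.",
    "p2": "Olive oil was traded widely in ancient Rome.",
}
index = build_child_index(parents, words_per_chunk=4)
hits = retrieve_parents("metformin", index, parents)
print(len(index), "child chunks,", len(hits), "parent returned")
```

Two child chunks of p1 match the query, but the parent is returned only once. Parents are large by design, which is why a compressor after this step still pays off.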

Cost Savings Comparison Table

Compressor	API Calls per Query	Avg Token Reduction	Cost per 1K Queries	Best Use Case
LLMChainExtractor	k LLM calls	65–75%	$0.30–$1.20	High accuracy needs
LLMChainFilter	k LLM calls	40–60% (drops chunks)	$0.15–$0.60	Binary relevance
EmbeddingsFilter	k embed calls	50–70%	$0.001–$0.005	High throughput
CohereRerank	1 Cohere call	60–80%	$1.00	Production RAG
FlashrankRerank	0 API calls	60–80%	$0.000	Air-gapped / private
Pipeline (emb→LLM)	k embed + filtered LLM	70–85%	$0.05–$0.20	Balanced cost/quality
LLMListwiseRerank	1 large LLM call	70–80%	$0.50–$2.00	Complex queries
Custom Structured	k LLM calls	60–75%	$0.20–$0.80	Domain-specific logic
Semantic + Emb	k embed calls	55–72%	$0.002–$0.008	Semantic coherence
Parent-Child + Filter	k LLM calls	40–55%	$0.15–$0.50	Long-doc retrieval

Example cost savings calculation:

Baseline: 8 chunks × 500 tokens = 4,000 tokens at $5/M = $0.020 per query
With CohereRerank (keep top 3): 3 × 500 = 1,500 tokens = $0.0075 + $0.001 Cohere = $0.0085
Savings: 57.5% cost reduction per query
At 100,000 queries/month: $2,000 baseline → $850 with reranking → $1,150 monthly savings
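The same arithmetic as a small script, so you can plug in your own numbers. The function name context_cost is hypothetical, and the prices are the ones used in the example above, not live pricing.

```python
def context_cost(chunks: int, tokens_per_chunk: int, price_per_million: float) -> float:
    """Dollar cost of the context tokens for a single query."""
    return chunks * tokens_per_chunk * price_per_million / 1_000_000


PRICE_PER_MILLION = 5.0      # $ per million input tokens (from the example above)
RERANK_FEE_PER_QUERY = 0.001  # $ per rerank call (from the example above)
QUERIES_PER_MONTH = 100_000

baseline = context_cost(8, 500, PRICE_PER_MILLION)
compressed = context_cost(3, 500, PRICE_PER_MILLION) + RERANK_FEE_PER_QUERY
savings_ratio = 1 - compressed / baseline
monthly_savings = (baseline - compressed) * QUERIES_PER_MONTH

print(f"baseline per query:   ${baseline:.4f}")
print(f"compressed per query: ${compressed:.4f}")
print(f"savings:              {savings_ratio:.1%}")
print(f"monthly savings:      ${monthly_savings:,.0f}")
```

Note that the rerank fee is a fixed cost per query, so the savings ratio shrinks as the baseline context gets smaller.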

Measuring Compression Quality

Don't tune compressors blindly. Track context precision with RAGAS:

# pip install ragas datasets
from datasets import Dataset
from langchain_core.prompts import ChatPromptTemplate
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness

questions = ["What are metformin side effects?", "How does metformin work?"]

# Build the prompt and answer chain once, outside the loop
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
answer_chain = prompt | ChatOpenAI(model="gpt-4o")

results_compressed = []
for q in questions:
    compressed_docs = compression_retriever.invoke(q)
    context = "\n\n".join(d.page_content for d in compressed_docs)
    answer = answer_chain.invoke({"context": context, "question": q}).content

    # Column names vary across ragas versions; adjust them to match yours
    results_compressed.append({
        "question": q,
        "answer": answer,
        "contexts": [d.page_content for d in compressed_docs],
        "ground_truth": "Reference answer here"  # replace with a real reference per question
    })

eval_results = evaluate(
    Dataset.from_list(results_compressed),
    metrics=[context_precision, faithfulness]
)
print(eval_results)

A healthy production setup should show 50–80% token reduction with latency under 500ms for embedding-based filters and under 2 seconds for LLM-based extractors.
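Token reduction is simple to track alongside the RAGAS metrics. The helper below is a rough sketch that approximates tokens by whitespace-separated words, which is an assumption made to keep the example dependency-free. Swap count_tokens for your model's tokenizer to get billing-accurate numbers. Both function names are hypothetical.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def token_reduction(original_docs: List[str], compressed_docs: List[str]) -> float:
    """Fraction of tokens removed by compression, between 0.0 and 1.0."""
    before = sum(count_tokens(d) for d in original_docs)
    after = sum(count_tokens(d) for d in compressed_docs)
    if before == 0:
        return 0.0  # nothing retrieved, so nothing to reduce
    return 1 - after / before


original_docs = [
    "Metformin is a first-line drug for type 2 diabetes. "
    "Common side effects include nausea and diarrhea.",
    "The history of diabetes treatment began in ancient Egypt.",
]
compressed_docs = ["Common side effects include nausea and diarrhea."]

reduction = token_reduction(original_docs, compressed_docs)
print(f"token reduction: {reduction:.0%}")
```

Logging this value per query makes it easy to spot a compressor that has quietly stopped filtering, or one that is filtering everything.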

Tuning the EmbeddingsFilter Threshold

The similarity_threshold parameter is the most important knob to tune:

import numpy as np

def find_optimal_threshold(queries, docs, ground_truth_relevant, embeddings):
    thresholds = np.linspace(0.60, 0.90, 7)  # 0.60, 0.65, ..., 0.90 without float-step drift
    results = []

    for threshold in thresholds:
        emb_filter = EmbeddingsFilter(
            embeddings=embeddings,
            similarity_threshold=float(threshold)
        )

        total_tp = total_fp = total_fn = 0
        for query, doc_list, relevant_ids in zip(queries, docs, ground_truth_relevant):
            filtered = emb_filter.compress_documents(doc_list, query)
            filtered_ids = {d.metadata["id"] for d in filtered}

            tp = len(filtered_ids & relevant_ids)
            fp = len(filtered_ids - relevant_ids)
            fn = len(relevant_ids - filtered_ids)

            total_tp += tp
            total_fp += fp
            total_fn += fn

        precision = total_tp / (total_tp + total_fp + 1e-9)
        recall = total_tp / (total_tp + total_fn + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)

        results.append({
            "threshold": threshold,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })

    best = max(results, key=lambda x: x["f1"])
    print(f"Optimal threshold: {best['threshold']:.2f} (F1={best['f1']:.3f})")
    return best["threshold"]

Most RAG applications land between 0.72 and 0.82. Medical and legal domains often need higher thresholds (0.80+) to avoid including loosely related content.
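If embedding calls are the bottleneck, one option is to compute each query-chunk similarity once and sweep thresholds over the cached scores. The sketch below shows only the sweep. The scores and relevance labels are made-up illustration data, and sweep_thresholds is a hypothetical helper.

```python
from typing import List, Tuple


def sweep_thresholds(
    scored: List[Tuple[float, bool]], thresholds: List[float]
) -> Tuple[float, float, float, float]:
    """Return (threshold, precision, recall, f1) for the threshold with the best F1.

    `scored` holds (similarity, is_relevant) pairs computed once up front.
    """
    results = []
    for t in thresholds:
        tp = sum(1 for s, rel in scored if s >= t and rel)
        fp = sum(1 for s, rel in scored if s >= t and not rel)
        fn = sum(1 for s, rel in scored if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((t, precision, recall, f1))
    return max(results, key=lambda r: r[3])


# Made-up similarity scores with hand-labeled relevance.
scored = [
    (0.91, True), (0.84, True), (0.79, False),
    (0.74, True), (0.66, False), (0.61, False),
]
thresholds = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]

best_t, best_p, best_r, best_f1 = sweep_thresholds(scored, thresholds)
print(f"best threshold: {best_t:.2f}  P={best_p:.2f}  R={best_r:.2f}  F1={best_f1:.3f}")
```

With a handful of labeled pairs the result is noisy, so treat the chosen threshold as a starting point and re-check it on a larger labeled set.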

Production Architecture Pattern

For a production RAG pipeline, wire the compression retriever into an LCEL chain that supports async and streaming calls (caching is a separate layer and is not shown here):

import asyncio
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based only on the provided context. Say so if the answer isn't there."),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    RunnableParallel({
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)

# Async for concurrent requests
async def answer_questions(questions: list[str]) -> list[str]:
    tasks = [rag_chain.ainvoke(q) for q in questions]
    return await asyncio.gather(*tasks)

# Stream individual responses
async def stream_answer(question: str):
    async for chunk in rag_chain.astream(question):
        print(chunk, end="", flush=True)
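Launching every request at once with asyncio.gather can run into provider rate limits on large batches. One common way to cap in-flight requests is a semaphore, sketched below. Here fake_chain is a hypothetical stand-in for rag_chain.ainvoke, and the limit value is an arbitrary assumption to tune against your own rate limits.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = 5,
) -> List[str]:
    """Run worker over items concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: str) -> str:
        async with semaphore:  # wait here if `limit` calls are already running
            return await worker(item)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(run_one(item) for item in items)))


async def fake_chain(question: str) -> str:
    """Hypothetical stand-in for an async chain call."""
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


answers = asyncio.run(gather_bounded(["q1", "q2", "q3"], fake_chain, limit=2))
print(answers)
```

To use it with the chain above, pass rag_chain.ainvoke as the worker.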

For complete RAG pipeline examples, see the RAG system tutorial and Build AI agent with LangChain.

Choosing the Right Compressor

High throughput (>10K queries/day): EmbeddingsFilter or CohereRerank. Both are fast and cheap. Combine them in a pipeline for best results.

High accuracy (medical, legal, financial): LLMChainExtractor or Pipeline(EmbeddingsFilter → LLMChainExtractor). The extra LLM cost is worth it when wrong answers have real consequences.

Private data / air-gapped: FlashrankRerank is your best option without API calls. Performance is surprisingly good for most use cases.

Complex multi-document reasoning: LLMListwiseRerank. Expensive, but it can compare documents against each other to find the most comprehensive answer.

Cost-optimized production: Pipeline with EmbeddingsFilter (threshold 0.76) → CohereRerank (top_n=3). This reduces LLM context by 75–85% with minimal accuracy loss.

The OpenAI API integration guide covers token counting utilities that help you measure compression ratios in production.

For the full picture, explore the LangChain tutorial 2025, AI agent memory and planning, and Deploy AI model to production.



