10 LangChain Document Compressors to Reduce Context Length
Learn 10 LangChain document compressors that slash context length, cut LLM costs, and keep RAG pipelines fast and accurate in production.
Context windows are getting bigger, but your LLM bill isn't getting smaller. Stuffing 20 raw chunks from a vector store into every prompt is wasteful — most of that text doesn't help answer the query. LangChain's document compressor layer sits between retrieval and generation, cutting the noise before it reaches the model.
This guide covers 10 practical compressors, shows you real Python code for each, and gives you a cost-savings table so you can pick the right tool for your budget.
If you haven't set up retrieval yet, the RAG system tutorial and Vector database guide are good starting points.
Why Context Length Matters More Than You Think
GPT-4o charges $5 per million input tokens. Claude 3.5 Sonnet charges $3. Gemini 1.5 Pro charges $1.25. A typical RAG query retrieves 5–10 chunks of 500 tokens each. That's 2,500–5,000 tokens per query just for context. At 10,000 daily queries, that is 25–50 million context tokens per day, or $125–$250 per day at GPT-4o's rate, before you count system prompts and output.
Document compressors address three problems simultaneously:
- Cost — fewer tokens in means a lower API bill
- Quality — less noise means the LLM focuses on relevant evidence
- Latency — smaller prompts process faster
A 2024 study from Databricks found that contextual compression improved RAG answer accuracy by 18% while reducing context tokens by 60% on average. Those numbers are consistent with what teams building production RAG systems report in the wild.
The ContextualCompressionRetriever Wrapper
All LangChain compressors plug into ContextualCompressionRetriever, which wraps any base retriever:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Build base vector store retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
# Wrap with compressor
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)
docs = compression_retriever.invoke("What are the side effects of metformin?")
for doc in docs:
    print(f"[{len(doc.page_content)} chars] {doc.page_content[:200]}")
Eight raw retrievals become two or three tightly focused passages. Now let's look at each compressor in detail.
Compressor 1: LLMChainExtractor
LLMChainExtractor is the original LangChain compressor. It sends each chunk to an LLM with the query and asks the model to pull out only the relevant sentences.
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
extractor = LLMChainExtractor.from_llm(llm)
doc = Document(
    page_content="""
    Metformin is a first-line medication for type 2 diabetes.
    It works by decreasing glucose production in the liver.
    Common side effects include nausea, vomiting, and diarrhea.
    Rare but serious side effects include lactic acidosis.
    The drug was first synthesized in 1922 by Emil Werner.
    It is on the WHO list of essential medicines.
    """,
    metadata={"source": "medical_db"}
)
query = "What are the side effects of metformin?"
compressed = extractor.compress_documents([doc], query)
print(compressed[0].page_content)
# → "Common side effects include nausea, vomiting, and diarrhea.
# Rare but serious side effects include lactic acidosis."
Best for: High-quality extraction where accuracy matters more than cost. Expect one LLM call per retrieved chunk.
Compressor 2: LLMChainFilter
LLMChainFilter takes a binary approach — it asks the LLM whether a chunk is relevant at all, then keeps or drops it wholesale. No extraction, just a relevance judgment.
from langchain.retrievers.document_compressors import LLMChainFilter
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
filter_compressor = LLMChainFilter.from_llm(llm)
docs = [
    Document(page_content="Metformin reduces liver glucose production.", metadata={"id": 1}),
    Document(page_content="The history of diabetes treatment began in ancient Egypt.", metadata={"id": 2}),
    Document(page_content="Lactic acidosis is a rare but serious risk with metformin.", metadata={"id": 3}),
]
query = "What are the risks of taking metformin?"
filtered = filter_compressor.compress_documents(docs, query)
print(f"Kept {len(filtered)}/{len(docs)} documents")
# → Kept 2/3 documents (drops the history doc)
This approach is faster than LLMChainExtractor because the LLM only outputs YES or NO rather than generating extracted text. You still pay for the input tokens on each chunk, but the output is minimal.
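The keep-or-drop behaviour can be sketched without calling a model. In this minimal sketch, `filter_chunks` and `keyword_judge` are hypothetical names, and the keyword check only stands in for the LLM's YES/NO verdict; it is not how LangChain implements the filter.

```python
from typing import Callable, List


def filter_chunks(chunks: List[str], query: str,
                  judge: Callable[[str, str], bool]) -> List[str]:
    """Keep each chunk whole if the judge calls it relevant, otherwise drop it."""
    return [chunk for chunk in chunks if judge(query, chunk)]


def keyword_judge(query: str, chunk: str) -> bool:
    # Hypothetical stand-in for the LLM's YES/NO answer:
    # relevant if the chunk contains any query word longer than 4 letters.
    words = [w.strip("?.,").lower() for w in query.split()]
    return any(len(w) > 4 and w in chunk.lower() for w in words)


chunks = [
    "Metformin reduces liver glucose production.",
    "The history of diabetes treatment began in ancient Egypt.",
    "Lactic acidosis is a rare but serious risk with metformin.",
]
kept = filter_chunks(chunks, "What are the risks of taking metformin?", keyword_judge)
print(f"Kept {len(kept)}/{len(chunks)} chunks")
```

Because the judge returns only a boolean, surviving chunks are passed through unchanged, which is the main difference from an extractor that rewrites them.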
Compressor 3: EmbeddingsFilter
No LLM needed. EmbeddingsFilter computes cosine similarity between the query embedding and each chunk's embedding, then drops chunks below a threshold.
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76  # tune this per domain
)
docs = [
    Document(page_content="Metformin controls blood sugar in type 2 diabetes."),
    Document(page_content="Ancient Rome had a thriving olive oil trade."),
    Document(page_content="Side effects of metformin include GI upset."),
]
query = "metformin side effects"
filtered = emb_filter.compress_documents(docs, query)
print(f"Kept {len(filtered)} docs after embedding filter")
Cost: Only embedding API calls — roughly $0.00002 per 1,000 tokens with text-embedding-3-small. Orders of magnitude cheaper than LLM filtering. For high-throughput systems, this is almost always the first compressor you should add.
Tradeoff: Embedding similarity doesn't always capture semantic relevance perfectly. A chunk about "drug interactions with blood sugar medications" might score lower than expected against a query like "metformin side effects" even though it's highly relevant.
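To make the threshold mechanics and the tradeoff concrete, here is a minimal pure-Python sketch of cosine-similarity filtering. The three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), `threshold_filter` is a hypothetical helper, and the `>=` comparison is an assumption of this sketch rather than a statement about LangChain's internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def threshold_filter(query_vec: Sequence[float],
                     chunks: List[Tuple[str, Sequence[float]]],
                     threshold: float) -> List[str]:
    """Keep the text of every chunk whose similarity to the query meets the threshold."""
    return [text for text, vec in chunks
            if cosine_similarity(query_vec, vec) >= threshold]


# Toy vectors, invented for illustration only.
query_vec = [1.0, 0.0, 0.0]
chunks = [
    ("Side effects of metformin include GI upset.", [0.9, 0.1, 0.0]),         # similarity ~0.99
    ("Metformin controls blood sugar in type 2 diabetes.", [0.7, 0.7, 0.0]),  # similarity ~0.71
    ("Ancient Rome had a thriving olive oil trade.", [0.0, 0.1, 0.9]),        # similarity 0.0
]
kept = threshold_filter(query_vec, chunks, threshold=0.76)
print(kept)
```

With these toy numbers the second chunk is on-topic but falls just under 0.76 and is dropped, which is the kind of miss the tradeoff above describes.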
Compressor 4: CohereRerank
Cohere's Rerank API is purpose-built for this problem. It takes a query and a list of documents, runs a cross-encoder model, and returns relevance scores. Top-k documents survive.
from langchain_cohere import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=3,  # keep top 3 after reranking
    cohere_api_key="your-cohere-api-key"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever  # the k=8 retriever built earlier; raise k to give the reranker more candidates
)
results = compression_retriever.invoke("diabetes medication side effects")
for doc in results:
    print(doc.metadata.get("relevance_score", "N/A"), doc.page_content[:100])
Cohere Rerank is one of the most effective compressors for production RAG. The cross-encoder architecture understands query-document interaction better than bi-encoder similarity because it looks at both inputs together rather than independently.
Pricing: $1 per 1,000 API calls (each call can rank up to 1,000 documents). Very affordable at scale.
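The score-then-truncate step that any reranker performs can be sketched in a few lines. In this sketch `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a cross-encoder: it shares the property of looking at the query and document together, but nothing else.

```python
from typing import Callable, List, Tuple


def rerank(query: str, docs: List[str],
           score: Callable[[str, str], float],
           top_n: int) -> List[Tuple[float, str]]:
    """Score every (query, doc) pair, sort by score, and keep the top_n."""
    scored = [(score(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer (not a cross-encoder): fraction of query words found in the doc.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().replace(".", "").split())
    return len(q_words & d_words) / len(q_words)


docs = [
    "Olive oil was traded widely in ancient Rome.",
    "Metformin side effects include nausea and diarrhea.",
    "Diabetes medication can cause side effects such as GI upset.",
]
top = rerank("diabetes medication side effects", docs, overlap_score, top_n=2)
for score, doc in top:
    print(f"{score:.2f}  {doc}")
```

The retriever should fetch more candidates than `top_n`; otherwise the reranker has nothing to discard.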
Compressor 5: FlashrankRerank (Local)
If you can't send data to Cohere, FlashrankRerank runs a small cross-encoder locally with no API costs.
# pip install flashrank
from langchain_community.document_compressors import FlashrankRerank
local_reranker = FlashrankRerank(
    model_name="ms-marco-MiniLM-L-12-v2",
    top_n=3
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=local_reranker,
    base_retriever=base_retriever
)
results = compression_retriever.invoke("metformin dosage for elderly patients")
for doc in results:
    print(doc.page_content[:150])
Tradeoff: Slightly lower quality than Cohere's hosted model, but zero data leaves your infrastructure. Works offline. Perfect for healthcare and finance applications with strict data residency requirements.
Compressor 6: DocumentCompressorPipeline
Stack multiple compressors. The most powerful pattern is a fast pre-filter followed by a high-quality extractor.
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter
# Stage 1: re-split long chunks into smaller pieces
splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separator=". "
)
# Stage 2: embedding similarity filter (fast, cheap)
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.75
)
# Stage 3: LLM extraction (accurate but expensive; runs on fewer docs now)
llm_extractor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
pipeline = DocumentCompressorPipeline(
    transformers=[splitter, emb_filter, llm_extractor]
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)
This three-stage pipeline typically reduces context by 70–80% while maintaining answer quality. The LLM extractor only runs on the small set of chunks that passed embedding filtering, keeping costs low while preserving accuracy where it matters.
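The reason ordering matters is that each stage only sees what the previous stage let through. A minimal sketch of that idea with plain functions follows; `run_pipeline`, `split_sentences`, and `keyword_filter` are hypothetical names, and the keyword filter merely stands in for an embedding filter.

```python
from typing import Callable, List

# A stage takes (documents, query) and returns a new list of documents.
Stage = Callable[[List[str], str], List[str]]


def split_sentences(docs: List[str], query: str) -> List[str]:
    # Stage 1: break each chunk into sentence-sized pieces.
    pieces: List[str] = []
    for doc in docs:
        pieces.extend(s.strip() for s in doc.split(". ") if s.strip())
    return pieces


def keyword_filter(docs: List[str], query: str) -> List[str]:
    # Stage 2: cheap pre-filter (stand-in for an embedding similarity filter).
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.lower().rstrip(".").split())]


def run_pipeline(stages: List[Stage], docs: List[str], query: str) -> List[str]:
    # Each stage receives the previous stage's output, so later
    # (more expensive) stages process fewer documents.
    for stage in stages:
        docs = stage(docs, query)
    return docs


raw = ["Metformin lowers glucose. It was first synthesized in 1922. Metformin can cause nausea."]
result = run_pipeline([split_sentences, keyword_filter], raw, "metformin nausea")
print(result)
```

An expensive stage such as an LLM extractor would go last in the list, where it runs only on the pieces that survived the cheap stages.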
Compressor 7: LLMListwiseRerank
LLMListwiseRerank sends all retrieved documents to an LLM at once and asks it to rank them by relevance. This is slower but can capture complex cross-document reasoning.
from langchain.retrievers.document_compressors import LLMListwiseRerank
listwise_reranker = LLMListwiseRerank.from_llm(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    num_docs_to_keep=3
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=listwise_reranker,
    base_retriever=base_retriever
)
results = compression_retriever.invoke("Compare first-line diabetes medications")
for i, doc in enumerate(results):
    print(f"Rank {i+1}: {doc.page_content[:200]}")
Use case: Complex analytical queries where context between documents matters. Not suitable for high-throughput applications due to cost and latency. Works best when you're building a research or analysis tool rather than a real-time chatbot.
Compressor 8: Custom LLM Compressor with Structured Output
Sometimes you need domain-specific compression logic. Build a custom compressor using LCEL:
from typing import List, Optional, Sequence
from langchain_core.callbacks import Callbacks
from langchain_core.documents import BaseDocumentCompressor, Document
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
class StructuredCompressor(BaseDocumentCompressor):
    llm: object
    class Config:
        arbitrary_types_allowed = True
    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None
    ) -> List[Document]:
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Extract the most relevant excerpt from this document for the query. Return JSON with keys: excerpt, relevance_score (0-1), reasoning."),
            ("human", "Query: {query}\n\nDocument: {document}")
        ])
        chain = prompt | self.llm | JsonOutputParser()
        compressed = []
        for doc in documents:
            try:
                result = chain.invoke({
                    "query": query,
                    "document": doc.page_content
                })
                # Keep only excerpts the model scored above 0.5
                if result["relevance_score"] > 0.5:
                    compressed.append(Document(
                        page_content=result["excerpt"],
                        metadata={
                            **doc.metadata,
                            "relevance_score": result["relevance_score"],
                            "reasoning": result["reasoning"]
                        }
                    ))
            except Exception:
                pass  # drop malformed responses (bad JSON or missing keys)
        # Highest-scoring excerpts first
        return sorted(
            compressed,
            key=lambda x: x.metadata["relevance_score"],
            reverse=True
        )
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
custom_compressor = StructuredCompressor(llm=llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=custom_compressor,
    base_retriever=base_retriever
)
The structured output gives you explainability — you can log why each chunk was kept or dropped, which is valuable for debugging and compliance.
Compressor 9: Semantic Chunking + Embedding Filter
Combine semantic chunking with embedding filtering for better chunk boundaries:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# Semantic splitter creates chunks at natural topic boundaries
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)
# Then filter by similarity
emb_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.78
)
pipeline = DocumentCompressorPipeline(
    transformers=[semantic_splitter, emb_filter]
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=base_retriever
)
Semantic chunking ensures that when a chunk is split, it doesn't cut mid-thought. This improves embedding filter accuracy because each chunk represents a coherent idea rather than an arbitrary text window.
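The percentile idea behind topic-boundary splitting can be illustrated with toy data. The sketch below is an assumption about the general technique, not the library's implementation: it measures the cosine distance between each sentence and the next, then starts a new chunk wherever that distance exceeds the chosen percentile. The two-dimensional vectors are invented for illustration, and `semantic_split` and `percentile` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values: List[float], p: float) -> float:
    # Linear-interpolation percentile over the sorted values.
    ranked = sorted(values)
    pos = (len(ranked) - 1) * p / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)


def semantic_split(sentences: List[str], vectors: List[Sequence[float]],
                   p: float = 90) -> List[str]:
    # Distance between each sentence and the one that follows it.
    gaps = [cosine_distance(vectors[i], vectors[i + 1])
            for i in range(len(vectors) - 1)]
    cutoff = percentile(gaps, p)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > cutoff:  # unusually large jump in meaning: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "It works by decreasing glucose production in the liver.",
    "The drug was first synthesized in 1922 by Emil Werner.",
    "It is on the WHO list of essential medicines.",
]
# Toy vectors: the first two sentences point one way, the last two another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_split(sentences, vectors, p=90)
for chunk in chunks:
    print(chunk)
```

Here the only large jump sits between the pharmacology sentences and the history sentences, so the text splits into two chunks at that point.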
For more on embedding strategies, see the semantic search tutorial.
Compressor 10: Parent Document Retriever with Compression
Retrieve parent documents (large context) but use compressed child documents for the LLM:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
# Child splitter for embedding (small, precise)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# Parent splitter for LLM context (larger, richer)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
# Add documents
loader = WebBaseLoader("https://example.com/medical-docs")
docs = loader.load()
parent_retriever.add_documents(docs)
# Combine with LLM compressor for final context reduction
llm_filter = LLMChainFilter.from_llm(ChatOpenAI(model="gpt-4o-mini"))
final_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=parent_retriever
)
results = final_retriever.invoke("What dosage adjustments are needed for renal impairment?")
The parent-child architecture retrieves semantically precise child chunks but stores parent context. After filtering, the LLM receives rich parent-level passages for high-quality answers without the overhead of embedding entire large documents.
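The lookup from child hits back to parent passages can be sketched with two in-memory structures. Everything here is a simplified illustration: `parents`, `children`, and `retrieve_parents` are hypothetical names, and word overlap stands in for vector search over the child chunks.

```python
from typing import Dict, List

# Hypothetical stores: parent_id -> full passage, plus child chunks tagged with their parent.
parents: Dict[str, str] = {
    "p1": "Common side effects include nausea, vomiting, and diarrhea. "
          "Rare but serious side effects include lactic acidosis.",
    "p2": "The drug was first synthesized in 1922 by Emil Werner. "
          "It is on the WHO list of essential medicines.",
}
children = [
    {"parent_id": "p1", "text": "Common side effects include nausea, vomiting, and diarrhea."},
    {"parent_id": "p1", "text": "Rare but serious side effects include lactic acidosis."},
    {"parent_id": "p2", "text": "The drug was first synthesized in 1922 by Emil Werner."},
    {"parent_id": "p2", "text": "It is on the WHO list of essential medicines."},
]


def retrieve_parents(query: str, top_k: int = 2) -> List[str]:
    q_words = set(query.lower().split())

    def score(child: dict) -> int:
        # Word overlap stands in for similarity search over child embeddings.
        return len(q_words & set(child["text"].lower().rstrip(".").split()))

    best_children = sorted(children, key=score, reverse=True)[:top_k]
    # Return each matching child's parent once, preserving rank order.
    seen, results = set(), []
    for child in best_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


result = retrieve_parents("serious side effects")
print(result)
```

Both top-ranked children share a parent, so a single parent passage comes back; deduplicating this way keeps the same passage from being sent to the LLM twice.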
Cost Savings Comparison Table
| Compressor | API Calls per Query | Avg Token Reduction | Cost per 1K Queries | Best Use Case |
|---|---|---|---|---|
| LLMChainExtractor | k LLM calls | 65–75% | $0.30–$1.20 | High accuracy needs |
| LLMChainFilter | k LLM calls | 40–60% (drops chunks) | $0.15–$0.60 | Binary relevance |
| EmbeddingsFilter | k embed calls | 50–70% | $0.001–$0.005 | High throughput |
| CohereRerank | 1 Cohere call | 60–80% | $1.00 | Production RAG |
| FlashrankRerank | 0 API calls | 60–80% | $0.000 | Air-gapped / private |
| Pipeline (emb→LLM) | k embed + filtered LLM | 70–85% | $0.05–$0.20 | Balanced cost/quality |
| LLMListwiseRerank | 1 large LLM call | 70–80% | $0.50–$2.00 | Complex queries |
| Custom Structured | k LLM calls | 60–75% | $0.20–$0.80 | Domain-specific logic |
| Semantic + Emb | k embed calls | 55–72% | $0.002–$0.008 | Semantic coherence |
| Parent-Child + Filter | k LLM calls | 40–55% | $0.15–$0.50 | Long-doc retrieval |
Example cost savings calculation:
- Baseline: 8 chunks × 500 tokens = 4,000 tokens at $5/M = $0.020 per query
- With CohereRerank (keep top 3): 3 × 500 = 1,500 tokens = $0.0075 + $0.001 Cohere = $0.0085
- Savings: 57% cost reduction per query
- At 100,000 queries/month: $2,000 baseline → $850 with reranking → $1,150 monthly savings
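The arithmetic above can be reproduced with a small helper, which makes it easy to substitute your own chunk counts and prices. This is a sketch: `context_cost` is a hypothetical function, and the $0.001 rerank fee per query is the figure assumed in the example above.

```python
def context_cost(chunks: int, tokens_per_chunk: int, usd_per_million: float) -> float:
    """Input-token cost in USD of the retrieved context for one query."""
    return chunks * tokens_per_chunk * usd_per_million / 1_000_000


RERANK_FEE_PER_QUERY = 0.001  # assumption taken from the example above

baseline = context_cost(chunks=8, tokens_per_chunk=500, usd_per_million=5.0)
reranked = context_cost(chunks=3, tokens_per_chunk=500, usd_per_million=5.0) + RERANK_FEE_PER_QUERY
savings_pct = (baseline - reranked) / baseline * 100
monthly_savings = (baseline - reranked) * 100_000

print(f"Baseline per query:  ${baseline:.4f}")
print(f"Reranked per query:  ${reranked:.4f}")
print(f"Savings per query:   {savings_pct:.1f}%")
print(f"Monthly savings:     ${monthly_savings:,.0f}")
```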
Measuring Compression Quality
Don't tune compressors blindly. Track context precision with RAGAS:
# pip install ragas
from datasets import Dataset
from langchain_core.prompts import ChatPromptTemplate
from ragas import evaluate
from ragas.metrics import context_precision, faithfulness
questions = ["What are metformin side effects?", "How does metformin work?"]
# Build the prompt and answer chain once, outside the loop
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
answer_chain = prompt | ChatOpenAI(model="gpt-4o")
results_compressed = []
for q in questions:
    compressed_docs = compression_retriever.invoke(q)
    context = "\n\n".join(d.page_content for d in compressed_docs)
    answer = answer_chain.invoke({
        "context": context,
        "question": q
    }).content
    results_compressed.append({
        "question": q,
        "answer": answer,
        "contexts": [d.page_content for d in compressed_docs],
        "ground_truth": "Reference answer here"  # placeholder: supply a real reference answer per question
    })
eval_results = evaluate(
    Dataset.from_list(results_compressed),
    metrics=[context_precision, faithfulness]
)
print(eval_results)
A healthy production setup should show 50–80% token reduction with latency under 500ms for embedding-based filters and under 2 seconds for LLM-based extractors.
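Token reduction itself is simple to track alongside the quality metrics. The sketch below uses a whitespace split as a rough proxy for token count, which is an assumption of this example; a real tokenizer will give different absolute numbers, though the ratio is usually in the same range. `token_reduction` is a hypothetical helper.

```python
from typing import List


def rough_token_count(text: str) -> int:
    # Whitespace split is only a rough proxy for a model tokenizer.
    return len(text.split())


def token_reduction(raw_docs: List[str], compressed_docs: List[str]) -> float:
    """Fraction of context removed by compression (0.0 = nothing, 1.0 = everything)."""
    before = sum(rough_token_count(d) for d in raw_docs)
    after = sum(rough_token_count(d) for d in compressed_docs)
    return 0.0 if before == 0 else 1 - after / before


raw_docs = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "Common side effects include nausea, vomiting, and diarrhea.",
    "The drug was first synthesized in 1922 by Emil Werner.",
]
compressed_docs = ["Common side effects include nausea, vomiting, and diarrhea."]

reduction = token_reduction(raw_docs, compressed_docs)
print(f"Token reduction: {reduction:.0%}")
```

Logging this ratio per query next to context precision shows whether a more aggressive threshold is actually buying savings or just discarding evidence.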
Tuning the EmbeddingsFilter Threshold
The similarity_threshold parameter is the most important knob to tune:
import numpy as np
def find_optimal_threshold(queries, docs, ground_truth_relevant, embeddings):
    thresholds = np.arange(0.60, 0.95, 0.05)
    results = []
    for threshold in thresholds:
        emb_filter = EmbeddingsFilter(
            embeddings=embeddings,
            similarity_threshold=float(threshold)
        )
        total_tp = total_fp = total_fn = 0
        for query, doc_list, relevant_ids in zip(queries, docs, ground_truth_relevant):
            filtered = emb_filter.compress_documents(doc_list, query)
            filtered_ids = {d.metadata["id"] for d in filtered}
            tp = len(filtered_ids & relevant_ids)
            fp = len(filtered_ids - relevant_ids)
            fn = len(relevant_ids - filtered_ids)
            total_tp += tp
            total_fp += fp
            total_fn += fn
        precision = total_tp / (total_tp + total_fp + 1e-9)
        recall = total_tp / (total_tp + total_fn + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)
        results.append({
            "threshold": threshold,
            "precision": precision,
            "recall": recall,
            "f1": f1
        })
    best = max(results, key=lambda x: x["f1"])
    print(f"Optimal threshold: {best['threshold']:.2f} (F1={best['f1']:.3f})")
    return best["threshold"]
Most RAG applications land between 0.72 and 0.82. Medical and legal domains often need higher thresholds (0.80+) to avoid including loosely related content.
Production Architecture Pattern
For a production RAG pipeline, combine compressors with caching and async retrieval:
import asyncio
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based only on the provided context. Say so if the answer isn't there."),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )
rag_chain = (
    RunnableParallel({
        "context": compression_retriever | format_docs,
        "question": RunnablePassthrough()
    })
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)
# Async for concurrent requests
async def answer_questions(questions: list[str]) -> list[str]:
    tasks = [rag_chain.ainvoke(q) for q in questions]
    return await asyncio.gather(*tasks)
# Stream individual responses
async def stream_answer(question: str):
    async for chunk in rag_chain.astream(question):
        print(chunk, end="", flush=True)
For complete RAG pipeline examples, see the RAG system tutorial and Build AI agent with LangChain.
Choosing the Right Compressor
High throughput (>10K queries/day): EmbeddingsFilter or CohereRerank. Both are fast and cheap. Combine them in a pipeline for best results.
High accuracy (medical, legal, financial): LLMChainExtractor or Pipeline(EmbeddingsFilter → LLMChainExtractor). The extra LLM cost is worth it when wrong answers have real consequences.
Private data / air-gapped: FlashrankRerank is your best option without API calls. Performance is surprisingly good for most use cases.
Complex multi-document reasoning: LLMListwiseRerank. Expensive, but it can compare documents against each other to find the most comprehensive answer.
Cost-optimized production: Pipeline with EmbeddingsFilter (threshold 0.76) → CohereRerank (top_n=3). This reduces LLM context by 75–85% with minimal accuracy loss.
The OpenAI API integration guide covers token counting utilities that help you measure compression ratios in production.
Document compressors are one of the highest-ROI optimizations you can make to a RAG system. A well-tuned compression pipeline pays for itself within days through reduced API costs, and the answer quality improvements are often immediately visible. Start with EmbeddingsFilter for quick wins, add CohereRerank for production accuracy, and build a DocumentCompressorPipeline when you need both speed and quality.
For the full picture, explore the LangChain tutorial 2025, AI agent memory and planning, and Deploy AI model to production.
Frequently Asked Questions
What is a LangChain document compressor? A document compressor is a post-retrieval filter that takes the chunks returned by your vector store and strips out sentences or passages that aren't relevant to the query before sending them to the LLM, reducing token usage and improving answer quality.
Which compressor is cheapest to run? EmbeddingsFilter is the cheapest because it uses only embedding similarity comparisons and never calls an LLM. CohereRerank is also cost-effective at scale because Cohere charges per 1,000 API calls, not per token.
Can I chain multiple compressors together? Yes. Use DocumentCompressorPipeline to stack compressors in order. A common pattern is EmbeddingsFilter first to remove clearly irrelevant chunks, then LLMChainExtractor to compress the survivors.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.