7 LangChain Document Transformers (Splitters, Filters, Embeddings)
Master LangChain document transformers to preprocess documents for RAG — splitters, filters, embeddings, and redundancy removal in Python.
Garbage in, garbage out. That rule hits harder in RAG pipelines than anywhere else in AI. You can have the best retriever in the world, but if your documents are 8,000-token walls of text with duplicate paragraphs and irrelevant boilerplate, retrieval quality suffers and your LLM responses will reflect that.
Document transformers are the preprocessing layer that stands between raw content and your vector store. LangChain ships a rich set of them (text splitters, metadata enrichers, redundancy filters, and embedding-based filters), and understanding how each one works will directly improve the results of your RAG system.
This guide covers seven of the most useful document transformers with working Python code for each one.
Why Document Transformation Matters
Most raw documents are not retrieval-friendly out of the box. A PDF research paper might have 40 pages. A web scrape might contain navigation menus, cookie banners, and repeated footer text. A codebase might have 1,000-line files where a single function spans hundreds of lines.
When you embed raw documents without transformation:
- Long chunks exceed model context windows and get truncated mid-sentence
- Similar content creates noisy nearest-neighbor results
- Irrelevant boilerplate skews embedding directions away from the actual content
- Mixed content types confuse semantic search scoring
A well-designed transformation pipeline breaks each document into clean, focused, appropriately sized chunks before they ever reach your vector database. The retriever then has a much cleaner signal to work with.
According to benchmarks on BEIR (a standard retrieval evaluation suite), proper chunking strategies alone improve retrieval recall by 15–25% compared to naive whole-document embedding. That is a meaningful gain before you have even touched your retrieval algorithm.
Setup
Install the packages you need:
pip install langchain langchain-openai langchain-community chromadb tiktoken
Set your API key:
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
Transformer 1: RecursiveCharacterTextSplitter
This is the workhorse of LangChain splitting. With its default separators it tries to split on paragraph breaks first, then line breaks, then spaces, and finally individual characters, working recursively until chunks are small enough. Because it falls back to a finer separator only when a piece is still too large, it usually produces splits at natural text boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
raw_text = """
LangChain is a framework for building applications powered by language models.
It provides tools for chaining together LLM calls, managing prompts, and connecting
to external data sources. The core abstraction is the Chain — a sequence of steps
that each take some input and produce some output.
Retrieval-Augmented Generation (RAG) is one of the most popular patterns built
with LangChain. In a RAG pipeline, documents are indexed in a vector store, and
at query time the most relevant chunks are retrieved and passed to the LLM as
additional context.
Agents extend this further by giving the LLM access to tools — search engines,
calculators, APIs, databases — and allowing it to decide which tools to call
based on the user's request.
"""
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # target chunk size in characters
    chunk_overlap=40,    # overlap between chunks for context continuity
    length_function=len,
    is_separator_regex=False,
)
docs = splitter.create_documents([raw_text])
for i, doc in enumerate(docs):
    print(f"Chunk {i}: {len(doc.page_content)} chars")
    print(doc.page_content[:100])
    print("---")
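To make the recursion concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration under stated assumptions, not LangChain's implementation: the `recursive_split` helper is hypothetical, it ignores `chunk_overlap`, and it drops the separator wherever it cuts.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing only into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)       # current chunk is full, emit it
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = "First paragraph with several words in it.\n\nSecond paragraph.\nIt has two lines."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

The point of the sketch is the fallback order: a paragraph that fits is never cut at a space, and a hard character cut happens only when no whitespace separator helps.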
For token-aware splitting, which matters when you know your model's exact context window:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
def tiktoken_len(text):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,   # now measured in tokens, not characters
    chunk_overlap=32,
    length_function=tiktoken_len,
)
Language-aware splitting for code, which prefers to cut at class and function boundaries so that definitions stay intact whenever they fit within chunk_size:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
python_code = """
def calculate_embeddings(texts, model="text-embedding-3-small"):
    from openai import OpenAI
    client = OpenAI()
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

class VectorStore:
    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector, metadata=None):
        self.vectors.append({"vector": vector, "metadata": metadata or {}})

    def search(self, query_vector, k=5):
        import numpy as np
        scores = []
        for item in self.vectors:
            similarity = np.dot(query_vector, item["vector"])
            scores.append((similarity, item["metadata"]))
        return sorted(scores, reverse=True)[:k]
"""
code_chunks = python_splitter.create_documents([python_code])
print(f"Split into {len(code_chunks)} chunks")
for chunk in code_chunks:
    print(f"  {chunk.page_content[:60]}...")
The from_language method sets separator priorities that match the language's syntax. For Python it tries class definitions first, then function definitions, before resorting to blank lines and raw newlines.
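As a rough illustration of what a syntax-aware separator buys you, the sketch below cuts Python source just before each top-level `class` or `def` line using only the standard library. The `split_at_definitions` helper and its regex are assumptions made for this example; LangChain's real separator list is longer and also falls back to finer separators when a definition exceeds the chunk size.

```python
import re

# Assumed boundary for this sketch: a line that starts with "class " or "def ".
BOUNDARY = re.compile(r"(?m)^(?=class |def )")


def split_at_definitions(source: str) -> list[str]:
    """Cut Python source just before each top-level class or def line."""
    parts = BOUNDARY.split(source)
    return [part.strip("\n") for part in parts if part.strip()]


source = '''import os

def load(path):
    return open(path).read()

class Store:
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)
'''

for part in split_at_definitions(source):
    print(part.splitlines()[0])
```

Indented methods do not match the boundary, so a class and its methods stay in one piece.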
Transformer 2: MarkdownTextSplitter
Markdown documents have inherent structure — headers create sections, code blocks have delimiters, bullet lists group related items. The MarkdownTextSplitter respects that structure by preferring splits at header boundaries, which keeps related content together.
from langchain.text_splitter import MarkdownTextSplitter
markdown_doc = """
# Introduction to Transformers
Transformer models revolutionized NLP in 2017 with the "Attention Is All You Need" paper.
They replaced recurrent networks with self-attention mechanisms that capture long-range dependencies.
## Self-Attention Mechanism
The attention mechanism allows each token to attend to all other tokens in the sequence.
This enables capturing relationships that RNNs struggled with due to vanishing gradients.
### Scaled Dot-Product Attention
Attention scores are computed as the dot product of queries and keys,
scaled by the square root of the key dimension to prevent gradient saturation.
## Positional Encoding
Since attention has no inherent notion of sequence order, positional encodings are added
to inject position information into the token representations.
## Applications
Transformers are now used in vision, audio, code generation, and multimodal tasks.
The architecture has proven remarkably general across very different data types.
"""
md_splitter = MarkdownTextSplitter(chunk_size=300, chunk_overlap=50)
md_chunks = md_splitter.create_documents([markdown_doc])
for chunk in md_chunks:
    print(f"Content: {chunk.page_content[:150]}")
    print("---")
Transformer 3: TokenTextSplitter
When you want strict token-count guarantees rather than character estimates, TokenTextSplitter gives you exactly that. This matters most when assembling prompts manually where you need to guarantee you stay under a specific token budget.
from langchain.text_splitter import TokenTextSplitter
token_splitter = TokenTextSplitter(
    model_name="gpt-4o",  # uses tiktoken encoding for this model
    chunk_size=512,
    chunk_overlap=64,
)
# Generate a long text to demonstrate
sample_text = " ".join([
    f"This is sentence number {i} with some extra filler content to make it longer."
    for i in range(300)
])
token_chunks = token_splitter.create_documents([sample_text])
print(f"Created {len(token_chunks)} chunks from {len(sample_text)} characters")
# Verify actual token counts
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
for i, chunk in enumerate(token_chunks[:3]):
    token_count = len(enc.encode(chunk.page_content))
    print(f"Chunk {i}: {token_count} tokens (target was 512)")
The token counts will stay very close to the target — typically within 1–2 tokens of chunk_size due to encoding edge cases at boundaries.
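The windowing behind a token-based splitter can be sketched in a few lines. This is an illustrative stand-in, not the library's code: `window_tokens` is a hypothetical helper, and the string "tokens" below take the place of the integer IDs a real tokenizer such as tiktoken would produce.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a fixed-size window over a token list, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Placeholder tokens standing in for real tokenizer output.
tokens = [f"t{i}" for i in range(20)]
chunks = window_tokens(tokens, chunk_size=8, chunk_overlap=2)
for chunk in chunks:
    print(len(chunk), chunk[0], "...", chunk[-1])
```

Because the window is measured in tokens, no chunk can exceed chunk_size, and each chunk begins with the last chunk_overlap tokens of the previous one.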
Transformer 4: EmbeddingsFilter (Semantic Relevance Filtering)
Once documents are retrieved from your vector store, some may not actually answer the user's query — they just happened to be nearby in embedding space. EmbeddingsFilter drops those irrelevant documents by computing direct similarity between each document and the query text.
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
sample_docs = [
    Document(page_content="Python is a high-level programming language.", metadata={"source": "wiki"}),
    Document(page_content="The Eiffel Tower is located in Paris, France.", metadata={"source": "wiki"}),
    Document(page_content="Machine learning uses statistical algorithms to find patterns.", metadata={"source": "wiki"}),
    Document(page_content="Python is widely used for data science and machine learning.", metadata={"source": "wiki"}),
    Document(page_content="Neural networks are inspired by biological brain structures.", metadata={"source": "wiki"}),
]
vectorstore = Chroma.from_documents(sample_docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Drop docs with similarity below 0.76 to the query
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)
query = "What programming language is good for AI?"
filtered_docs = compression_retriever.invoke(query)
print(f"Retrieved and filtered: {len(filtered_docs)} relevant documents")
for doc in filtered_docs:
    print(f"  - {doc.page_content}")
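With a well-calibrated threshold, the Eiffel Tower document gets filtered out because it has low semantic similarity to an AI programming question, while the Python and machine learning documents stay. This prevents off-topic content from confusing your LLM. Absolute similarity scores vary by embedding model, so treat 0.76 as a starting point: check the scores for a few known-relevant and known-irrelevant pairs, and tune the threshold if too many or too few documents survive.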
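The filtering step itself is just a cosine-similarity comparison against a cutoff. The sketch below shows that logic with toy three-dimensional vectors invented for illustration; `cosine` and `filter_by_similarity` are hypothetical helpers, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(query_vec, doc_vecs, threshold):
    """Return indices of documents whose similarity to the query meets the threshold."""
    return [i for i, vec in enumerate(doc_vecs) if cosine(query_vec, vec) >= threshold]


# Toy vectors, made up for this example.
query = [1.0, 0.2, 0.0]
docs = [
    [0.9, 0.3, 0.0],  # points in nearly the same direction as the query
    [0.0, 0.1, 1.0],  # nearly orthogonal to the query
    [0.7, 0.7, 0.1],  # somewhere in between
]

for threshold in (0.76, 0.90):
    print(threshold, filter_by_similarity(query, docs, threshold))
```

Running it with two thresholds shows how sensitive the surviving set is to the cutoff, which is why the value needs tuning per embedding model.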
Transformer 5: EmbeddingsRedundantFilter
Duplicate and near-duplicate content inflates your context window and causes the LLM to repeat information in its response. This is particularly common when scraping multiple versions of the same page, or when source documents have overlapping sections. EmbeddingsRedundantFilter removes documents that are too semantically similar to each other.
# EmbeddingsRedundantFilter lives in the document_transformers module, not document_compressors
from langchain_community.document_transformers import EmbeddingsRedundantFilter
# Simulate retrieved documents with near-duplicates
duplicate_docs = [
    Document(page_content="LangChain helps you build LLM applications easily."),
    Document(page_content="LangChain is a framework for building LLM-powered apps."),  # near-duplicate
    Document(page_content="LangChain makes it easy to create language model applications."),  # near-duplicate
    Document(page_content="Vector databases store embeddings for semantic search."),
    Document(page_content="Embeddings are numerical representations of text meaning."),
]
redundancy_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings,
    similarity_threshold=0.90  # drop docs with more than 0.90 cosine similarity to a kept doc
)
filtered = redundancy_filter.transform_documents(duplicate_docs)
print(f"Before: {len(duplicate_docs)} docs, After: {len(filtered)} docs")
for doc in filtered:
    print(f"  KEPT: {doc.page_content}")
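With a threshold that suits your embedding model, the three near-duplicate LangChain descriptions collapse down to one: the filter keeps a single representative of each group of near-duplicates and drops the rest. If all five documents survive, lower similarity_threshold until the paraphrases are caught.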
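One simple way to picture redundancy filtering is a greedy pass that keeps a document only if it is not too similar to anything already kept. The sketch below does exactly that with toy two-dimensional vectors. It is an assumption-laden illustration: `drop_near_duplicates` is hypothetical, and keeping the first member of each group is a choice made here for simplicity, so LangChain's filter may retain a different member of a similar pair.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(vectors, threshold):
    """Greedy pass: keep index i only if it is not too similar to any kept vector."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: 0 and 1 are near-duplicates, and so are 2 and 3.
vectors = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
]

print(drop_near_duplicates(vectors, threshold=0.90))
```

Each new vector is compared against every kept vector, so the cost grows quadratically with the number of documents, which is one reason to run deduplication offline.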
Transformer 6: LongContextReorder
Research from Liu et al. (2023) showed that LLMs perform worst when critical information sits in the middle of a long context — the "lost in the middle" problem. LongContextReorder addresses this by placing the most relevant documents at the beginning and end of the context, leaving the least relevant in the middle where the model pays less attention.
from langchain_community.document_transformers import LongContextReorder
# Documents ordered by relevance score (most relevant first)
ordered_docs = [
    Document(page_content="Most relevant: directly answers the question with specific facts."),
    Document(page_content="Second: very closely related to the query topic."),
    Document(page_content="Third: somewhat related with partial information."),
    Document(page_content="Fourth: tangentially related background context."),
    Document(page_content="Fifth: general context that might help."),
    Document(page_content="Sixth: loosely related information."),
    Document(page_content="Seventh: barely related to the query."),
    Document(page_content="Eighth: least relevant of the retrieved set."),
]
reorder = LongContextReorder()
reordered = reorder.transform_documents(ordered_docs)
print("New order (best docs at edges, worst in middle):")
for i, doc in enumerate(reordered):
    print(f"  Position {i}: {doc.page_content[:55]}")
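The algorithm walks the ranked list and alternates documents between the front and the back of the output, so the top-ranked documents end up at the two edges and the lowest-ranked ones meet in the middle. Key information stays at the edges, where attention is highest. Which edge receives the single most relevant document can depend on the list length and library version, so print the result rather than assuming a fixed position.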
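The edges-first idea can be sketched without any library. The function below is one possible ordering, written for illustration only: `edges_first` is a hypothetical helper, and LongContextReorder may place the documents differently (for example, with the top document at the end rather than the start).

```python
def edges_first(ranked):
    """Given items ranked best-first, return them with the best at both ends."""
    front, back = [], []
    for position, item in enumerate(ranked):
        # Even ranks fill the front in order, odd ranks fill the back.
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reversing the back half pushes its weakest items toward the middle.
    return front + back[::-1]


ranks = [1, 2, 3, 4, 5, 6, 7, 8]  # 1 = most relevant
print(edges_first(ranks))
```

In the output, ranks 1 and 2 sit at the two ends and ranks 7 and 8 are adjacent in the middle, which is the property the reordering is meant to guarantee.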
Transformer 7: HTMLHeaderTextSplitter
HTML documents from web scraping have semantic structure in their heading hierarchy that you want to preserve. HTMLHeaderTextSplitter keeps heading context as metadata on each chunk, which makes filtered retrieval much more precise.
from langchain.text_splitter import HTMLHeaderTextSplitter
html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>LangChain Complete Guide</h1>
<p>LangChain is a comprehensive framework for building LLM applications.</p>
<h2>Core Components</h2>
<p>The main components are chains, agents, memory, and retrievers.</p>
<h3>Chains</h3>
<p>Chains connect multiple LLM calls together using LCEL syntax.</p>
<p>They can be composed declaratively using the pipe operator.</p>
<h3>Agents</h3>
<p>Agents use LLMs to decide dynamically which tools to call.</p>
<p>They handle open-ended tasks that require multi-step reasoning.</p>
<h2>Getting Started</h2>
<p>Install LangChain with pip install langchain and set your API key.</p>
</body>
</html>
"""
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_chunks = html_splitter.split_text(html_content)
for chunk in html_chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:120]}")
    print("---")
Each chunk carries its full heading hierarchy in metadata. When you filter by {"Header 2": "Core Components"}, you retrieve only chunks from that section — more precise than pure semantic search alone.
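To see how heading context can travel with each piece of text, here is a small standard-library sketch that tracks the headings in effect while walking the HTML. It is a simplified illustration, not the splitter's implementation: `HeadingTracker` is hypothetical, it emits one entry per paragraph, and it only looks at h1 to h3 and p tags.

```python
from html.parser import HTMLParser


class HeadingTracker(HTMLParser):
    """Collect <p> text together with the headings in effect at that point."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.current = {}     # heading level -> heading text currently in effect
        self.open_tag = None  # the heading or paragraph tag being collected
        self.buffer = []
        self.chunks = []      # list of (metadata dict, paragraph text)

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS or tag == "p":
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.open_tag:
            return
        text = "".join(self.buffer).strip()
        if tag in self.LEVELS:
            level = self.LEVELS[tag]
            # A new heading replaces its own level and clears any deeper ones.
            self.current = {k: v for k, v in self.current.items() if k < level}
            self.current[level] = text
        elif text:
            metadata = {f"Header {k}": v for k, v in sorted(self.current.items())}
            self.chunks.append((metadata, text))
        self.open_tag = None


html = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Core</h2><h3>Chains</h3><p>Chains text.</p>"
    "<h2>Start</h2><p>Install.</p>"
)
tracker = HeadingTracker()
tracker.feed(html)
for metadata, text in tracker.chunks:
    print(metadata, "->", text)
```

Note how the "Chains" heading disappears from the metadata once the next h2 starts: a new heading resets everything below its own level.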
Building a Full Transformation Pipeline
Here is how you combine multiple transformers into a production-ready ingestion and retrieval pipeline:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.document_compressors import EmbeddingsFilter
# Both the redundancy filter and the reorderer live in document_transformers
from langchain_community.document_transformers import (
    EmbeddingsRedundantFilter,
    LongContextReorder,
)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def preprocess_documents(raw_documents: list) -> list:
    """Full preprocessing pipeline for RAG ingestion."""
    # Step 1: Split into appropriately-sized chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    split_docs = splitter.transform_documents(raw_documents)
    print(f"After splitting: {len(split_docs)} chunks")
    # Step 2: Remove near-duplicate chunks before indexing
    dedup_filter = EmbeddingsRedundantFilter(
        embeddings=embeddings,
        similarity_threshold=0.92
    )
    unique_docs = dedup_filter.transform_documents(split_docs)
    print(f"After deduplication: {len(unique_docs)} chunks")
    return unique_docs

def retrieve_with_pipeline(
    query: str,
    vectorstore,
    k_initial: int = 20,
    k_final: int = 6
) -> list:
    """Retrieve, filter by relevance, reorder for LLM consumption."""
    # Over-retrieve to give the filter enough candidates
    initial_docs = vectorstore.similarity_search(query, k=k_initial)
    # Keep only docs actually relevant to this query
    relevance_filter = EmbeddingsFilter(
        embeddings=embeddings,
        similarity_threshold=0.75,
        k=k_final
    )
    relevant_docs = relevance_filter.compress_documents(initial_docs, query)
    print(f"After relevance filter: {len(relevant_docs)} docs")
    # Reorder to combat the lost-in-the-middle effect
    reorder = LongContextReorder()
    final_docs = reorder.transform_documents(relevant_docs)
    return final_docs

# Build the index
sample_documents = [
    Document(
        page_content="""
Vector databases store high-dimensional embeddings for efficient similarity search.
Popular options include Chroma, Pinecone, Weaviate, and Qdrant.
Chroma is excellent for local development, with no API key required.
Pinecone is a managed service that scales to billions of vectors automatically.
Weaviate supports hybrid search combining dense and sparse retrieval.
""",
        metadata={"source": "db_guide.txt", "category": "infrastructure"}
    ),
]
processed = preprocess_documents(sample_documents)
vectorstore = Chroma.from_documents(
    processed,
    embeddings,
    persist_directory="./chroma_db"
)
# Query with the full pipeline
results = retrieve_with_pipeline(
    "Which vector database should I use for a production app?",
    vectorstore
)
print(f"\nFinal context for LLM: {len(results)} documents")
Comparison Table: LangChain Document Transformers
| Transformer | Best For | Preserves Structure | Speed | Token Awareness |
|---|---|---|---|---|
| RecursiveCharacterTextSplitter | General text, code | Partial | Fast | Optional (tiktoken) |
| MarkdownTextSplitter | Markdown docs | Yes — headers | Fast | No |
| TokenTextSplitter | Strict token limits | No | Fast | Yes |
| HTMLHeaderTextSplitter | HTML / web content | Yes — headings | Fast | No |
| EmbeddingsFilter | Relevance filtering | N/A | Slow (API) | N/A |
| EmbeddingsRedundantFilter | Deduplication | N/A | Slow (API) | N/A |
| LongContextReorder | Context ordering | N/A | Very fast | N/A |
Metadata Enrichment During Splitting
You can attach rich metadata during splitting that enables filtered retrieval later — a capability most tutorials skip:
import hashlib
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_with_rich_metadata(
    text: str,
    source: str,
    category: str,
    chunk_size: int = 512
) -> list:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=64,
    )
    chunks = splitter.create_documents([text])
    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "source": source,
            "category": category,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk.page_content),
            "chunk_hash": hashlib.md5(chunk.page_content.encode()).hexdigest()[:8],
            "is_first": i == 0,
            "is_last": i == len(chunks) - 1,
        })
    return chunks

# Retrieve only from a specific category later
engineering_results = vectorstore.similarity_search(
    query="installation procedure",
    k=5,
    filter={"category": "engineering"}
)
Performance Considerations
Chunk size is empirical. Start with 512 tokens, run retrieval evaluation, then adjust. Shorter chunks (128–256 tokens) work better for factoid Q&A. Longer chunks (512–1024) work better when questions require synthesizing multiple paragraphs.
Overlap has diminishing returns above 25%. Ten to fifteen percent overlap maintains context continuity across chunk boundaries. Higher values waste storage and slow retrieval without measurable quality improvement.
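The storage cost of overlap follows from simple arithmetic: each new chunk advances by chunk_size minus overlap, so more overlap means more chunks for the same text. The sketch below works this out for a fixed-size sliding window; `chunk_count` is a hypothetical helper and the 10,000-token document is an assumed example, not a measurement.

```python
import math


def chunk_count(total_len, chunk_size, overlap):
    """Number of fixed-size windows needed to cover total_len units with the given overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_len <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_len - chunk_size) / step)


# Assumed example: a 10,000-token document split into 500-token chunks.
for overlap in (0, 50, 125, 250):
    n = chunk_count(10_000, 500, overlap)
    print(f"overlap={overlap:>3} ({overlap / 500:.0%}): {n} chunks")
```

Moving from 10% to 50% overlap roughly doubles the number of chunks to embed and store, which is the cost to weigh against any retrieval gain you measure.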
EmbeddingsFilter adds latency and cost. Every invocation calls the embedding API. For large pipelines, run EmbeddingsRedundantFilter at ingestion time (once, offline) rather than at query time (every request).
Batch your embedding calls. The OpenAI embeddings API allows 2,048 inputs per request:
# Efficient — single API call for all documents
texts = [doc.page_content for doc in documents]
all_embeddings = embeddings.embed_documents(texts) # batched automatically
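If you call an embeddings endpoint directly instead of going through embed_documents, you need to respect the per-request input limit yourself. A minimal sketch, assuming the 2,048-input limit quoted above; `batch_texts` is a hypothetical helper and no API call is made here.

```python
def batch_texts(texts, batch_size=2048):
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"document {i}" for i in range(5000)]
sizes = [len(batch) for batch in batch_texts(texts)]
print(sizes)
```

Each yielded batch would then be sent as one request, so 5,000 texts cost three calls rather than 5,000.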
Integration with Agent Pipelines
Document transformers connect naturally to LangChain agent pipelines. The transformed and indexed documents become the knowledge base that agents query when they need factual grounding. Combined with agent memory and planning, you get agents that both retrieve from documents and remember conversation history.
For an end-to-end pipeline that uses RecursiveCharacterTextSplitter, EmbeddingsRedundantFilter, and LongContextReorder together, see the "AI research agent build" article. For the full indexing workflow from raw files to production retriever, the "LangChain tutorial 2025" article covers every step.
Common Mistakes to Avoid
Splitting before setting metadata. Always set metadata on the source Document object before splitting. split_documents() and transform_documents() copy the parent's metadata onto every chunk, but create_documents() called on bare strings starts with empty metadata unless you pass metadatas explicitly, so those chunks lose the link to their parent document.
Using default separators for code. With the default separators, RecursiveCharacterTextSplitter can cut a function definition in half. Use from_language() for code files so that splits prefer class and function boundaries.
Skipping deduplication. Web-scraped content almost always has near-duplicates. Running EmbeddingsRedundantFilter once at ingestion time prevents retrieval results from being dominated by slightly different phrasings of the same fact.
Ignoring chunk overlap for follow-up questions. Content straddling two chunks with zero overlap causes missed context. Ten percent overlap prevents most of these failures without excessive storage cost.
Frequently Asked Questions
What is the best text splitter for code files in LangChain?
RecursiveCharacterTextSplitter with language-specific separators works best for code. Pass language=Language.PYTHON to from_language() to get syntax-aware splits that respect function and class boundaries rather than splitting mid-definition. Supported languages include Python, JavaScript, TypeScript, Go, Rust, Java, C++, and more.
How does EmbeddingsRedundantFilter work in LangChain?
EmbeddingsRedundantFilter computes cosine similarity between document embeddings and, for every pair whose similarity exceeds the threshold, drops one of the two documents, so a single representative of each near-duplicate cluster survives. A similarity_threshold of 0.95 is conservative (only removes very close duplicates), while 0.85 is more aggressive (removes paraphrases too).
Can I chain multiple document transformers together?
Yes. Call each transformer sequentially — the output list from one feeds directly into the next as a plain list of Document objects. A typical production chain is: split with RecursiveCharacterTextSplitter, deduplicate with EmbeddingsRedundantFilter at ingestion time, then at query time filter with EmbeddingsFilter and reorder with LongContextReorder before passing context to the LLM.
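The chaining pattern is plain function composition over a list. The sketch below shows it with stand-in steps that work on strings so it runs without any dependencies; `run_pipeline`, `split_on_blank_lines`, and `drop_exact_duplicates` are hypothetical helpers. With real transformers you would pass bound methods such as `splitter.transform_documents` as the steps.

```python
def run_pipeline(documents, steps):
    """Apply each step to the output of the previous one and return the final list."""
    for step in steps:
        documents = step(documents)
    return documents


def split_on_blank_lines(docs):
    """Stand-in splitter: break each text on blank lines."""
    return [part.strip() for doc in docs for part in doc.split("\n\n") if part.strip()]


def drop_exact_duplicates(docs):
    """Stand-in deduplicator: keep the first occurrence of each text."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique


raw = ["alpha\n\nbeta\n\nalpha", "gamma"]
result = run_pipeline(raw, [split_on_blank_lines, drop_exact_duplicates])
print(result)
```

Keeping the steps in a list makes the order explicit and lets you log the document count after each stage.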
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.