7 LangChain Document Transformers (Splitters, Filters, Embeddings)

⚡ Quick Answer

Master LangChain document transformers to preprocess documents for RAG — splitters, filters, embeddings, and redundancy removal in Python.

AiTechWorlds Team May 31, 2026 15 min read

Garbage in, garbage out. That rule hits harder in RAG pipelines than anywhere else in AI. You can have the best retriever in the world, but if your documents are 8,000-token walls of text with duplicate paragraphs and irrelevant boilerplate, retrieval quality suffers and your LLM responses will reflect that.

Document transformers are the preprocessing layer that stands between raw content and your vector store. LangChain ships a rich set of them, including text splitters, metadata enrichers, redundancy filters, and embedding-based filters, and understanding how each one works will directly improve your RAG results.

This guide covers seven of the most useful document transformers with working Python code for each one.

Why Document Transformation Matters

Most raw documents are not retrieval-friendly out of the box. A PDF research paper might have 40 pages. A web scrape might contain navigation menus, cookie banners, and repeated footer text. A codebase might have 1,000-line files where a single function spans hundreds of lines.

When you embed raw documents without transformation:

Long chunks exceed model context windows and get truncated mid-sentence
Similar content creates noisy nearest-neighbor results
Irrelevant boilerplate skews embedding directions away from the actual content
Mixed content types confuse semantic search scoring

A well-designed transformation pipeline breaks each document into clean, focused, appropriately sized chunks before they ever reach your vector database. The retriever then has a much cleaner signal to work with.

According to benchmarks on BEIR (a standard retrieval evaluation suite), proper chunking strategies alone improve retrieval recall by 15–25% compared to naive whole-document embedding. That is a meaningful gain before you have even touched your retrieval algorithm.

Setup

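Install the packages you need (lxml and beautifulsoup4 are only used by the HTML splitter in Transformer 7):

pip install langchain langchain-openai langchain-community langchain-text-splitters chromadb tiktoken lxml beautifulsoup4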

Set your API key:

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Transformer 1: RecursiveCharacterTextSplitter

This is the workhorse of LangChain splitting. It tries to split on paragraph breaks first, then sentences, then words, then individual characters — working recursively until chunks are small enough. The recursive approach means it almost always produces clean splits at natural language boundaries.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

raw_text = """
LangChain is a framework for building applications powered by language models.

It provides tools for chaining together LLM calls, managing prompts, and connecting
to external data sources. The core abstraction is the Chain — a sequence of steps
that each take some input and produce some output.

Retrieval-Augmented Generation (RAG) is one of the most popular patterns built
with LangChain. In a RAG pipeline, documents are indexed in a vector store, and
at query time the most relevant chunks are retrieved and passed to the LLM as
additional context.

Agents extend this further by giving the LLM access to tools — search engines,
calculators, APIs, databases — and allowing it to decide which tools to call
based on the user's request.
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,           # target chunk size in characters
    chunk_overlap=40,         # overlap between chunks for context continuity
    length_function=len,
    is_separator_regex=False,
)

docs = splitter.create_documents([raw_text])

for i, doc in enumerate(docs):
    print(f"Chunk {i}: {len(doc.page_content)} chars")
    print(doc.page_content[:100])
    print("---")
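To make the recursion concrete, here is a simplified, standard-library-only sketch of the idea. It is not LangChain's implementation: it ignores chunk_overlap, and recursive_split is a hypothetical helper name. It packs pieces together at the coarsest separator and only falls back to a finer separator for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration only: no overlap handling.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current)  # current chunk is full, flush it
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with finer separators.
            current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


demo = "Alpha beta.\n\nGamma delta epsilon zeta eta theta.\n\nIota."
for piece in recursive_split(demo, chunk_size=20):
    print(repr(piece))
```

The short paragraphs survive intact, and only the long middle paragraph is broken at word boundaries.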

For token-aware splitting, which matters when you know your model's exact context window:

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the encoding once instead of on every length check
tokenizer = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text):
    # disallowed_special=() treats special-token strings as ordinary text
    return len(tokenizer.encode(text, disallowed_special=()))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,       # now measured in tokens, not characters
    chunk_overlap=32,
    length_function=tiktoken_len,
)

Language-aware splitting for code, which prefers to break at class and function boundaries:

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

python_code = """
def calculate_embeddings(texts, model="text-embedding-3-small"):
    from openai import OpenAI
    client = OpenAI()
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

class VectorStore:
    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector, metadata=None):
        self.vectors.append({"vector": vector, "metadata": metadata or {}})

    def search(self, query_vector, k=5):
        import numpy as np
        scores = []
        for item in self.vectors:
            similarity = np.dot(query_vector, item["vector"])
            scores.append((similarity, item["metadata"]))
        return sorted(scores, reverse=True)[:k]
"""

code_chunks = python_splitter.create_documents([python_code])
print(f"Split into {len(code_chunks)} chunks")
for chunk in code_chunks:
    print(f"  {chunk.page_content[:60]}...")

The from_language method sets separator priorities that match the language's syntax. For Python it tries class definitions and function definitions first, then falls back to blank lines, single newlines, and spaces. A definition longer than chunk_size will still be split.

Transformer 2: MarkdownTextSplitter

Markdown documents have inherent structure — headers create sections, code blocks have delimiters, bullet lists group related items. The MarkdownTextSplitter respects that structure by preferring splits at header boundaries, which keeps related content together.

from langchain_text_splitters import MarkdownTextSplitter

markdown_doc = """
# Introduction to Transformers

Transformer models revolutionized NLP in 2017 with the "Attention Is All You Need" paper.
They replaced recurrent networks with self-attention mechanisms that capture long-range dependencies.

## Self-Attention Mechanism

The attention mechanism allows each token to attend to all other tokens in the sequence.
This enables capturing relationships that RNNs struggled with due to vanishing gradients.

### Scaled Dot-Product Attention

Attention scores are computed as the dot product of queries and keys,
scaled by the square root of the key dimension to prevent gradient saturation.

## Positional Encoding

Since attention has no inherent notion of sequence order, positional encodings are added
to inject position information into the token representations.

## Applications

Transformers are now used in vision, audio, code generation, and multimodal tasks.
The architecture has proven remarkably general across very different data types.
"""

md_splitter = MarkdownTextSplitter(chunk_size=300, chunk_overlap=50)
md_chunks = md_splitter.create_documents([markdown_doc])

for chunk in md_chunks:
    print(f"Content: {chunk.page_content[:150]}")
    print("---")
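As a rough illustration of what "preferring header boundaries" means, the sketch below cuts a Markdown string at every header line using only the standard library. It is an assumption-laden toy, not the MarkdownTextSplitter algorithm: split_markdown_sections is a hypothetical helper, it has no size limit, and it does not skip "#" lines inside fenced code blocks.

```python
import re


def split_markdown_sections(markdown):
    """Return (heading, body) pairs, one per Markdown header line."""
    sections, heading, body = [], None, []

    def flush():
        # Emit the section collected so far, if it has any content.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections


sample = "# Title\n\nIntro text.\n\n## Part A\n\nBody A.\n\n## Part B\n\nBody B."
for heading, body in split_markdown_sections(sample):
    print(heading, "->", body)
```

Each section keeps its heading next to its body, which is the property that makes header-aligned chunks easier to retrieve.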

Transformer 3: TokenTextSplitter

When you want strict token-count guarantees rather than character estimates, TokenTextSplitter gives you exactly that. This matters most when assembling prompts manually where you need to guarantee you stay under a specific token budget.

from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    model_name="gpt-4o",   # uses tiktoken encoding for this model
    chunk_size=512,
    chunk_overlap=64,
)

# Generate a long text to demonstrate
sample_text = " ".join([
    f"This is sentence number {i} with some extra filler content to make it longer."
    for i in range(300)
])

token_chunks = token_splitter.create_documents([sample_text])
print(f"Created {len(token_chunks)} chunks from {len(sample_text)} characters")

# Verify actual token counts
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
for i, chunk in enumerate(token_chunks[:3]):
    token_count = len(enc.encode(chunk.page_content))
    print(f"Chunk {i}: {token_count} tokens (target was 512)")

Each full chunk is cut at chunk_size tokens. Re-encoding the decoded text can shift the count by a token or two at chunk boundaries, and the final chunk is usually shorter than the target.
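The windowing behind token-based splitting can be sketched without any tokenizer. In the example below, plain integers stand in for token IDs (an assumption for illustration), and token_windows is a hypothetical helper rather than a LangChain function.

```python
def token_windows(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by size minus overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


fake_token_ids = list(range(10))  # stand-in for real tokenizer output
for window in token_windows(fake_token_ids, chunk_size=4, chunk_overlap=1):
    print(window)
```

Every window except possibly the last has exactly chunk_size tokens, and consecutive windows share chunk_overlap tokens.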

Transformer 4: EmbeddingsFilter (Semantic Relevance Filtering)

Once documents are retrieved from your vector store, some may not actually answer the user's query — they just happened to be nearby in embedding space. EmbeddingsFilter drops those irrelevant documents by computing direct similarity between each document and the query text.

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

sample_docs = [
    Document(page_content="Python is a high-level programming language.", metadata={"source": "wiki"}),
    Document(page_content="The Eiffel Tower is located in Paris, France.", metadata={"source": "wiki"}),
    Document(page_content="Machine learning uses statistical algorithms to find patterns.", metadata={"source": "wiki"}),
    Document(page_content="Python is widely used for data science and machine learning.", metadata={"source": "wiki"}),
    Document(page_content="Neural networks are inspired by biological brain structures.", metadata={"source": "wiki"}),
]

vectorstore = Chroma.from_documents(sample_docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Drop docs with similarity below 0.76 to the query
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)

query = "What programming language is good for AI?"
filtered_docs = compression_retriever.invoke(query)

print(f"Retrieved and filtered: {len(filtered_docs)} relevant documents")
for doc in filtered_docs:
    print(f"  - {doc.page_content}")

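With a suitable threshold, the Eiffel Tower document is filtered out because it has low semantic similarity to a question about AI programming, while the Python and neural network documents are the ones most likely to stay. Similarity scores vary by embedding model, so a threshold tuned for one model can drop every document with another; inspect the scores on your own data before settling on 0.76. Filtering this way keeps off-topic content from confusing your LLM.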
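The filtering step itself is just a cosine-similarity cutoff. The sketch below shows it with tiny hand-made 2-D vectors in place of real embeddings (an assumption made so it runs offline); cosine and filter_by_similarity are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(query_vec, doc_vecs, threshold):
    """Return (index, score) for documents scoring above the threshold."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return [(i, score) for i, score in scored if score > threshold]


query = [1.0, 0.0]
docs = [
    [0.9, 0.1],  # points almost the same way as the query
    [0.0, 1.0],  # orthogonal to the query
    [0.6, 0.8],  # partly related
]
for index, score in filter_by_similarity(query, docs, threshold=0.76):
    print(index, round(score, 3))
```

Printing the scores like this on a sample of real queries is a practical way to pick a threshold for your embedding model.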

Transformer 5: EmbeddingsRedundantFilter

Duplicate and near-duplicate content inflates your context window and causes the LLM to repeat information in its response. This is particularly common when scraping multiple versions of the same page, or when source documents have overlapping sections. EmbeddingsRedundantFilter removes documents that are too semantically similar to each other.

from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Simulate retrieved documents with near-duplicates
duplicate_docs = [
    Document(page_content="LangChain helps you build LLM applications easily."),
    Document(page_content="LangChain is a framework for building LLM-powered apps."),      # near-duplicate
    Document(page_content="LangChain makes it easy to create language model applications."), # near-duplicate
    Document(page_content="Vector databases store embeddings for semantic search."),
    Document(page_content="Embeddings are numerical representations of text meaning."),
]

redundancy_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings,
    similarity_threshold=0.90   # drop docs with more than 90% similarity to a kept doc
)

filtered = redundancy_filter.transform_documents(duplicate_docs)
print(f"Before: {len(duplicate_docs)} docs, After: {len(filtered)} docs")
for doc in filtered:
    print(f"  KEPT: {doc.page_content}")

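When their pairwise similarity exceeds the threshold, the near-duplicate LangChain descriptions collapse to a single document. Whether these three sentences actually clear 0.90 depends on the embedding model. The filter keeps one member of each similar pair based on its position in the list, not on which version is most informative, so check which one survives if that matters to you.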
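A minimal sketch of embedding-based deduplication follows, again with toy 2-D vectors instead of real embeddings. It uses a greedy keep-first rule, which is one reasonable choice and an assumption here; the library's own tie-breaking may differ. drop_near_duplicates is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(vectors, threshold):
    """Return indices to keep: a vector is kept only if it is not too
    similar to any vector that was already kept (greedy, keep-first)."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


vectors = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of the first vector
    [0.0, 1.0],
    [0.05, 0.99],  # near-duplicate of the third vector
]
print(drop_near_duplicates(vectors, threshold=0.90))
```

Lowering the threshold makes the filter more aggressive, which is the same trade-off the similarity_threshold parameter controls.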

Transformer 6: LongContextReorder

Research from Liu et al. (2023) showed that LLMs perform worst when critical information sits in the middle of a long context — the "lost in the middle" problem. LongContextReorder addresses this by placing the most relevant documents at the beginning and end of the context, leaving the least relevant in the middle where the model pays less attention.

from langchain_community.document_transformers import LongContextReorder

# Documents ordered by relevance score (most relevant first)
ordered_docs = [
    Document(page_content="Most relevant: directly answers the question with specific facts."),
    Document(page_content="Second: very closely related to the query topic."),
    Document(page_content="Third: somewhat related with partial information."),
    Document(page_content="Fourth: tangentially related background context."),
    Document(page_content="Fifth: general context that might help."),
    Document(page_content="Sixth: loosely related information."),
    Document(page_content="Seventh: barely related to the query."),
    Document(page_content="Eighth: least relevant of the retrieved set."),
]

reorder = LongContextReorder()
reordered = reorder.transform_documents(ordered_docs)

print("New order (best docs at edges, worst in middle):")
for i, doc in enumerate(reordered):
    print(f"  Position {i}: {doc.page_content[:55]}")

The algorithm alternates documents between the two ends of the list, so the highest-ranked documents land at the first and last positions and the lowest-ranked ones end up in the middle. Key information stays at the edges, where models tend to use it most reliably.
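One way to get an edge-heavy ordering is to walk the ranked list from least to most relevant and alternate between the front and the back of the output. The sketch below is a plain-Python illustration of that idea, not a copy of the library's source, and reorder_for_long_context is a hypothetical name.

```python
def reorder_for_long_context(docs_by_relevance):
    """Take a list sorted most-relevant-first and return a new list with the
    strongest items at both ends and the weakest in the middle."""
    reordered = []
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)     # odd steps go to the back
        else:
            reordered.insert(0, doc)  # even steps go to the front
    return reordered


ranks = [1, 2, 3, 4, 5, 6, 7, 8]  # 1 = most relevant
print(reorder_for_long_context(ranks))  # [2, 4, 6, 8, 7, 5, 3, 1]
```

In this sketch the two best documents (ranks 1 and 2) sit at the last and first positions, while ranks 7 and 8 are buried in the centre.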

Transformer 7: HTMLHeaderTextSplitter

HTML documents from web scraping have semantic structure in their heading hierarchy that you want to preserve. HTMLHeaderTextSplitter keeps heading context as metadata on each chunk, which makes filtered retrieval much more precise.

from langchain_text_splitters import HTMLHeaderTextSplitter

html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>LangChain Complete Guide</h1>
<p>LangChain is a comprehensive framework for building LLM applications.</p>

<h2>Core Components</h2>
<p>The main components are chains, agents, memory, and retrievers.</p>

<h3>Chains</h3>
<p>Chains connect multiple LLM calls together using LCEL syntax.</p>
<p>They can be composed declaratively using the pipe operator.</p>

<h3>Agents</h3>
<p>Agents use LLMs to decide dynamically which tools to call.</p>
<p>They handle open-ended tasks that require multi-step reasoning.</p>

<h2>Getting Started</h2>
<p>Install LangChain with pip install langchain and set your API key.</p>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_chunks = html_splitter.split_text(html_content)

for chunk in html_chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:120]}")
    print("---")

Each chunk carries its full heading hierarchy in metadata. When you filter by {"Header 2": "Core Components"}, you retrieve only chunks from that section — more precise than pure semantic search alone.
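To show how heading context can travel with each chunk, here is a small standard-library sketch that tracks the current h1/h2/h3 path while reading paragraphs. It is a simplified stand-in, not the HTMLHeaderTextSplitter implementation; HeadingTracker and split_html_by_headings are hypothetical names, and the "Header N" keys simply mirror the labels used above.

```python
from html.parser import HTMLParser


class HeadingTracker(HTMLParser):
    """Collect <p> text together with the headings it sits under."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.path = {}     # heading level -> current heading text
        self.chunks = []   # list of (metadata dict, paragraph text)
        self._tag = None   # heading or <p> tag currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS or tag == "p":
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "p":
            meta = {f"Header {lvl}": t for lvl, t in sorted(self.path.items())}
            self.chunks.append((meta, text))
        else:
            level = self.LEVELS[tag]
            # A new heading replaces its own level and clears deeper ones.
            self.path = {l: t for l, t in self.path.items() if l < level}
            self.path[level] = text
        self._tag = None


def split_html_by_headings(html):
    tracker = HeadingTracker()
    tracker.feed(html)
    return tracker.chunks


page = "<h1>Guide</h1><p>Intro.</p><h2>Core</h2><h3>Chains</h3><p>Chains text.</p>"
for metadata, text in split_html_by_headings(page):
    print(metadata, text)
```

Because each paragraph carries its full heading path, a metadata filter on any level can narrow retrieval to one section.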

Building a Full Transformation Pipeline

Here is how you combine multiple transformers into a production-ready ingestion and retrieval pipeline:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_community.document_transformers import (
    EmbeddingsRedundantFilter,
    LongContextReorder,
)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os

os.environ["OPENAI_API_KEY"] = "your-key-here"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def preprocess_documents(raw_documents: list) -> list:
    """Full preprocessing pipeline for RAG ingestion."""

    # Step 1: Split into appropriately-sized chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    split_docs = splitter.transform_documents(raw_documents)
    print(f"After splitting: {len(split_docs)} chunks")

    # Step 2: Remove near-duplicate chunks before indexing
    dedup_filter = EmbeddingsRedundantFilter(
        embeddings=embeddings,
        similarity_threshold=0.92
    )
    unique_docs = dedup_filter.transform_documents(split_docs)
    print(f"After deduplication: {len(unique_docs)} chunks")

    return unique_docs


def retrieve_with_pipeline(
    query: str,
    vectorstore,
    k_initial: int = 20,
    k_final: int = 6
) -> list:
    """Retrieve, filter by relevance, reorder for LLM consumption."""

    # Over-retrieve to give the filter enough candidates
    initial_docs = vectorstore.similarity_search(query, k=k_initial)

    # Keep only docs actually relevant to this query
    relevance_filter = EmbeddingsFilter(
        embeddings=embeddings,
        similarity_threshold=0.75,
        k=k_final
    )
    relevant_docs = relevance_filter.compress_documents(initial_docs, query)
    print(f"After relevance filter: {len(relevant_docs)} docs")

    # Reorder to combat the lost-in-the-middle effect
    reorder = LongContextReorder()
    final_docs = reorder.transform_documents(relevant_docs)

    return final_docs


# Build the index
sample_documents = [
    Document(
        page_content="""
        Vector databases store high-dimensional embeddings for efficient similarity search.
        Popular options include Chroma, Pinecone, Weaviate, and Qdrant.
        Chroma is excellent for local development — no API key required.
        Pinecone is a managed service that scales to billions of vectors automatically.
        Weaviate supports hybrid search combining dense and sparse retrieval.
        """,
        metadata={"source": "db_guide.txt", "category": "infrastructure"}
    ),
]

processed = preprocess_documents(sample_documents)
vectorstore = Chroma.from_documents(
    processed,
    embeddings,
    persist_directory="./chroma_db"
)

# Query with the full pipeline
results = retrieve_with_pipeline(
    "Which vector database should I use for a production app?",
    vectorstore
)
print(f"\nFinal context for LLM: {len(results)} documents")

Comparison Table: LangChain Document Transformers

Transformer	Best For	Preserves Structure	Speed	Token Awareness
RecursiveCharacterTextSplitter	General text, code	Partial	Fast	Optional (tiktoken)
MarkdownTextSplitter	Markdown docs	Yes — headers	Fast	No
TokenTextSplitter	Strict token limits	No	Fast	Yes
HTMLHeaderTextSplitter	HTML / web content	Yes — headings	Fast	No
EmbeddingsFilter	Relevance filtering	N/A	Slow (API)	N/A
EmbeddingsRedundantFilter	Deduplication	N/A	Slow (API)	N/A
LongContextReorder	Context ordering	N/A	Very fast	N/A

Metadata Enrichment During Splitting

You can attach rich metadata during splitting that enables filtered retrieval later — a capability most tutorials skip:

import hashlib
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_with_rich_metadata(
    text: str,
    source: str,
    category: str,
    chunk_size: int = 512
) -> list:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=64,
    )

    chunks = splitter.create_documents([text])

    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "source": source,
            "category": category,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk.page_content),
            "chunk_hash": hashlib.md5(chunk.page_content.encode()).hexdigest()[:8],
            "is_first": i == 0,
            "is_last": i == len(chunks) - 1,
        })

    return chunks


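# Retrieve only from a specific category later.
# Assumes the vector store holds chunks tagged category="engineering";
# the sample index built earlier only contains "infrastructure".
engineering_results = vectorstore.similarity_search(
    query="installation procedure",
    k=5,
    filter={"category": "engineering"}
)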

Performance Considerations

Chunk size is empirical. Start with 512 tokens, run retrieval evaluation, then adjust. Shorter chunks (128–256 tokens) work better for factoid Q&A. Longer chunks (512–1024) work better when questions require synthesizing multiple paragraphs.
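"Run retrieval evaluation" can be as simple as computing recall@k over a handful of labelled queries for each chunk size you try. The sketch below assumes you already have, per query, the retrieved chunk IDs in ranked order and the set of IDs judged relevant; both function names and the sample IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Made-up results for two queries, for illustration only
runs = [
    (["c1", "c7", "c3"], ["c7", "c9"]),  # finds one of two relevant chunks
    (["c4", "c2", "c8"], ["c4"]),        # finds the only relevant chunk
]
print(mean_recall_at_k(runs, k=3))  # 0.75
```

Re-index with a different chunk_size, rerun the same queries, and compare the averages to see which setting retrieves more of what you need.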

Overlap has diminishing returns above 25%. Ten to fifteen percent overlap maintains context continuity across chunk boundaries. Higher values waste storage and slow retrieval without measurable quality improvement.
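The storage cost of overlap is easy to estimate with a little arithmetic. The sketch below assumes idealized fixed-size token windows (real splitters cut at separators, so actual counts will differ somewhat); chunk_count is a hypothetical helper.

```python
import math


def chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Number of fixed-size windows needed to cover total_tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)


corpus_tokens = 100_000
baseline = chunk_count(corpus_tokens, 512, 0)
for overlap in (0, 64, 128, 256):
    n = chunk_count(corpus_tokens, 512, overlap)
    print(f"overlap={overlap:>3}  chunks={n}  vs. no overlap: {n / baseline:.2f}x")
```

For a 100,000-token corpus with 512-token chunks this gives 196 chunks with no overlap, 224 at 64 tokens (12.5%), 261 at 128 tokens (25%), and 390 at 256 tokens (50%), so the index roughly doubles in size at 50% overlap.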

EmbeddingsFilter adds latency and cost. Every invocation calls the embedding API. For large pipelines, run EmbeddingsRedundantFilter at ingestion time (once, offline) rather than at query time (every request).

Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request:

# Efficient: pass the whole list and let the client batch the requests
# (assumes `documents` is a list of Document objects)
texts = [doc.page_content for doc in documents]
all_embeddings = embeddings.embed_documents(texts)  # batched internally
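If you call an embeddings endpoint directly instead of going through a client that batches for you, a small helper can keep each request under the input limit. This is a generic sketch: batched is a hypothetical helper, and other limits, such as total tokens per request, may also apply.

```python
def batched(items, batch_size=2048):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(5000)]  # placeholder chunk texts
for batch in batched(texts):
    # Each batch would be sent as one embedding request.
    print(len(batch))
```

With 5,000 texts this produces three requests of 2,048, 2,048, and 904 inputs.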

Integration with Agent Pipelines

Document transformers connect naturally to LangChain agent pipelines. The transformed and indexed documents become the knowledge base that agents query when they need factual grounding. Combined with agent memory and planning, you get agents that both retrieve from documents and remember conversation history.

For an end-to-end pipeline that uses RecursiveCharacterTextSplitter, EmbeddingsRedundantFilter, and LongContextReorder together, see the AI research agent build guide. For the full indexing workflow from raw files to production retriever, the LangChain tutorial 2025 article covers every step.

Common Mistakes to Avoid

Splitting before setting metadata. Always set metadata on the source Document object before splitting. split_documents copies the parent's metadata onto every chunk, but create_documents on raw strings produces chunks with empty metadata unless you pass a metadatas list.

Using default separators for code. With the default separators, RecursiveCharacterTextSplitter can split a function definition in half. Use from_language() for code files so that splits prefer class and function boundaries.

Skipping deduplication. Web-scraped content almost always has near-duplicates. Running EmbeddingsRedundantFilter once at ingestion time prevents retrieval results from being dominated by slightly different phrasings of the same fact.

Ignoring chunk overlap for follow-up questions. Content straddling two chunks with zero overlap causes missed context. Ten percent overlap prevents most of these failures without excessive storage cost.

Frequently Asked Questions

What is the best text splitter for code files in LangChain?

RecursiveCharacterTextSplitter with language-specific separators works best for code. Pass language=Language.PYTHON to from_language() to get syntax-aware splits that prefer function and class boundaries over mid-definition cuts. Supported languages include Python, JavaScript, TypeScript, Go, Rust, Java, C++, and more.

How does EmbeddingsRedundantFilter work in LangChain?

EmbeddingsRedundantFilter computes cosine similarity between document embeddings and, for each pair above the threshold, drops one of the two documents. Which member of a pair survives is decided by list position rather than content quality. A similarity_threshold of 0.95 is conservative (only removes very close duplicates), while 0.85 is more aggressive (removes paraphrases too).

Can I chain multiple document transformers together?

Yes. Call each transformer sequentially — the output list from one feeds directly into the next as a plain list of Document objects. A typical production chain is: split with RecursiveCharacterTextSplitter, deduplicate with EmbeddingsRedundantFilter at ingestion time, then at query time filter with EmbeddingsFilter and reorder with LongContextReorder before passing context to the LLM.
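Because every step takes a list and returns a list, chaining can be expressed as simple function composition. The sketch below uses plain strings and two toy steps in place of Document objects and real transformers (an assumption so it runs without any dependencies); compose, split_lines, and drop_exact_duplicates are hypothetical names.

```python
from functools import reduce


def compose(*steps):
    """Build one callable that feeds each step's output list into the next."""
    return lambda docs: reduce(lambda acc, step: step(acc), steps, docs)


def split_lines(docs):
    """Toy splitter: one chunk per non-empty line."""
    return [line for doc in docs for line in doc.split("\n") if line.strip()]


def drop_exact_duplicates(docs):
    """Toy deduplicator: keep the first copy of each chunk, preserving order."""
    return list(dict.fromkeys(docs))


ingest = compose(split_lines, drop_exact_duplicates)
print(ingest(["alpha\nbeta", "beta\ngamma"]))  # ['alpha', 'beta', 'gamma']
```

With real transformers, each step would be a bound method such as a splitter's transform_documents, wired together in the same order described above.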


Langchain

7 LangChain Document Transformers (Splitters, Filters, Embeddings)

⚡ Quick Answer

Master LangChain document transformers to preprocess documents for RAG — splitters, filters, embeddings, and redundancy removal in Python.

AiTechWorlds Team May 31, 2026 15 min read

#LangChain #document transformers #text splitters #embeddings #RAG

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

This guide covers seven of the most useful document transformers with working Python code for each one.

Why Document Transformation Matters

When you embed raw documents without transformation:

Long chunks exceed model context windows and get truncated mid-sentence
Similar content creates noisy nearest-neighbor results
Irrelevant boilerplate skews embedding directions away from the actual content
Mixed content types confuse semantic search scoring

Setup

Install the packages you need:

pip install langchain langchain-openai langchain-community chromadb tiktoken

Set your API key:

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Transformer 1: RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

raw_text = """
LangChain is a framework for building applications powered by language models.

It provides tools for chaining together LLM calls, managing prompts, and connecting
to external data sources. The core abstraction is the Chain — a sequence of steps
that each take some input and produce some output.

Retrieval-Augmented Generation (RAG) is one of the most popular patterns built
with LangChain. In a RAG pipeline, documents are indexed in a vector store, and
at query time the most relevant chunks are retrieved and passed to the LLM as
additional context.

Agents extend this further by giving the LLM access to tools — search engines,
calculators, APIs, databases — and allowing it to decide which tools to call
based on the user's request.
"""

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,           # target chunk size in characters
    chunk_overlap=40,         # overlap between chunks for context continuity
    length_function=len,
    is_separator_regex=False,
)

docs = splitter.create_documents([raw_text])

for i, doc in enumerate(docs):
    print(f"Chunk {i}: {len(doc.page_content)} chars")
    print(doc.page_content[:100])
    print("---")

For token-aware splitting, which matters when you know your model's exact context window:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,       # now measured in tokens, not characters
    chunk_overlap=32,
    length_function=tiktoken_len,
)

Language-aware splitting for code — this keeps function definitions intact:

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

python_code = """
def calculate_embeddings(texts, model="text-embedding-3-small"):
    from openai import OpenAI
    client = OpenAI()
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

class VectorStore:
    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector, metadata=None):
        self.vectors.append({"vector": vector, "metadata": metadata or {}})

    def search(self, query_vector, k=5):
        import numpy as np
        scores = []
        for item in self.vectors:
            similarity = np.dot(query_vector, item["vector"])
            scores.append((similarity, item["metadata"]))
        return sorted(scores, reverse=True)[:k]
"""

code_chunks = python_splitter.create_documents([python_code])
print(f"Split into {len(code_chunks)} chunks")
for chunk in code_chunks:
    print(f"  {chunk.page_content[:60]}...")

Transformer 2: MarkdownTextSplitter

from langchain.text_splitter import MarkdownTextSplitter

markdown_doc = """
# Introduction to Transformers

Transformer models revolutionized NLP in 2017 with the "Attention Is All You Need" paper.
They replaced recurrent networks with self-attention mechanisms that capture long-range dependencies.

## Self-Attention Mechanism

The attention mechanism allows each token to attend to all other tokens in the sequence.
This enables capturing relationships that RNNs struggled with due to vanishing gradients.

### Scaled Dot-Product Attention

Attention scores are computed as the dot product of queries and keys,
scaled by the square root of the key dimension to prevent gradient saturation.

## Positional Encoding

Since attention has no inherent notion of sequence order, positional encodings are added
to inject position information into the token representations.

## Applications

Transformers are now used in vision, audio, code generation, and multimodal tasks.
The architecture has proven remarkably general across very different data types.
"""

md_splitter = MarkdownTextSplitter(chunk_size=300, chunk_overlap=50)
md_chunks = md_splitter.create_documents([markdown_doc])

for chunk in md_chunks:
    print(f"Content: {chunk.page_content[:150]}")
    print("---")

Transformer 3: TokenTextSplitter

from langchain.text_splitter import TokenTextSplitter

token_splitter = TokenTextSplitter(
    model_name="gpt-4o",   # uses tiktoken encoding for this model
    chunk_size=512,
    chunk_overlap=64,
)

# Generate a long text to demonstrate
sample_text = " ".join([
    f"This is sentence number {i} with some extra filler content to make it longer."
    for i in range(300)
])

token_chunks = token_splitter.create_documents([sample_text])
print(f"Created {len(token_chunks)} chunks from {len(sample_text)} characters")

# Verify actual token counts
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
for i, chunk in enumerate(token_chunks[:3]):
    token_count = len(enc.encode(chunk.page_content))
    print(f"Chunk {i}: {token_count} tokens (target was 512)")

Each chunk except the last should re-encode to roughly chunk_size tokens. Small differences of a token or two can appear because decoding a chunk and re-encoding it does not always round-trip exactly at the boundaries.
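
The sliding-window idea behind token-based chunking can be sketched without any library. This is an illustration over a plain list of token IDs, not TokenTextSplitter's actual code; split_token_ids is a hypothetical helper, and the integers stand in for real token IDs.

```python
def split_token_ids(token_ids, chunk_size, chunk_overlap):
    """Cut a token sequence into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end
        start += step
    return chunks


token_ids = list(range(1200))  # stand-in for real token IDs
chunks = split_token_ids(token_ids, chunk_size=512, chunk_overlap=64)

print([len(chunk) for chunk in chunks])
```

Every window except the last holds exactly chunk_size tokens, and the final 64 tokens of one window reappear as the first 64 of the next.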

Transformer 4: EmbeddingsFilter (Semantic Relevance Filtering)

from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

sample_docs = [
    Document(page_content="Python is a high-level programming language.", metadata={"source": "wiki"}),
    Document(page_content="The Eiffel Tower is located in Paris, France.", metadata={"source": "wiki"}),
    Document(page_content="Machine learning uses statistical algorithms to find patterns.", metadata={"source": "wiki"}),
    Document(page_content="Python is widely used for data science and machine learning.", metadata={"source": "wiki"}),
    Document(page_content="Neural networks are inspired by biological brain structures.", metadata={"source": "wiki"}),
]

vectorstore = Chroma.from_documents(sample_docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Drop docs with similarity below 0.76 to the query
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=base_retriever,
)

query = "What programming language is good for AI?"
filtered_docs = compression_retriever.invoke(query)

print(f"Retrieved and filtered: {len(filtered_docs)} relevant documents")
for doc in filtered_docs:
    print(f"  - {doc.page_content}")

Transformer 5: EmbeddingsRedundantFilter

# EmbeddingsRedundantFilter lives in the document_transformers module, not document_compressors
from langchain_community.document_transformers import EmbeddingsRedundantFilter

# Simulate retrieved documents with near-duplicates
duplicate_docs = [
    Document(page_content="LangChain helps you build LLM applications easily."),
    Document(page_content="LangChain is a framework for building LLM-powered apps."),      # near-duplicate
    Document(page_content="LangChain makes it easy to create language model applications."), # near-duplicate
    Document(page_content="Vector databases store embeddings for semantic search."),
    Document(page_content="Embeddings are numerical representations of text meaning."),
]

redundancy_filter = EmbeddingsRedundantFilter(
    embeddings=embeddings,
    similarity_threshold=0.90   # drop docs with more than 90% similarity to a kept doc
)

filtered = redundancy_filter.transform_documents(duplicate_docs)
print(f"Before: {len(duplicate_docs)} docs, After: {len(filtered)} docs")
for doc in filtered:
    print(f"  KEPT: {doc.page_content}")

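If the three near-duplicate LangChain descriptions score above the threshold against each other, they collapse down to one. The filter does not judge which version is most informative; it simply keeps one document from each highly similar pair. Whether paraphrases like these clear 0.90 depends on the embedding model, so lower the threshold if too many duplicates survive.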
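
The underlying idea, dropping any document whose embedding is too close to one already kept, can be sketched in plain Python. This is a greedy first-seen-wins illustration with hand-made 2-D vectors, not LangChain's implementation. The names cosine and drop_redundant are assumptions, and the real filter may keep a different member of each similar pair.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant(vectors, threshold):
    """Return indices of vectors that are not too similar to any earlier kept vector."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: indices 0, 1 and 3 point in nearly the same direction
toy_vectors = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.98, 0.1],
]

kept = drop_redundant(toy_vectors, threshold=0.90)
print(kept)
```

Raising the threshold keeps more near-duplicates; lowering it removes more aggressively and risks dropping documents that merely share a topic.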

Transformer 6: LongContextReorder

from langchain_community.document_transformers import LongContextReorder

# Documents ordered by relevance score (most relevant first)
ordered_docs = [
    Document(page_content="Most relevant: directly answers the question with specific facts."),
    Document(page_content="Second: very closely related to the query topic."),
    Document(page_content="Third: somewhat related with partial information."),
    Document(page_content="Fourth: tangentially related background context."),
    Document(page_content="Fifth: general context that might help."),
    Document(page_content="Sixth: loosely related information."),
    Document(page_content="Seventh: barely related to the query."),
    Document(page_content="Eighth: least relevant of the retrieved set."),
]

reorder = LongContextReorder()
reordered = reorder.transform_documents(ordered_docs)

print("New order (best docs at edges, worst in middle):")
for i, doc in enumerate(reordered):
    print(f"  Position {i}: {doc.page_content[:55]}")
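
The reordering itself is simple enough to sketch by hand. The version below is an illustration of the "best at the edges" idea, not LangChain's code: reorder_for_long_context is a hypothetical helper, and the exact positions it produces may differ from what LongContextReorder returns.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant items at both ends and the least relevant in the middle.

    Expects the input sorted from most to least relevant.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks fill from the start, odd ranks fill from the end
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = [1, 2, 3, 4, 5, 6, 7, 8]  # 1 = most relevant, 8 = least relevant
reordered = reorder_for_long_context(ranked)
print(reordered)
```

The two strongest documents end up first and last, where long-context models tend to pay the most attention, while the weakest sit in the middle.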

Transformer 7: HTMLHeaderTextSplitter

from langchain.text_splitter import HTMLHeaderTextSplitter

html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>LangChain Complete Guide</h1>
<p>LangChain is a comprehensive framework for building LLM applications.</p>

<h2>Core Components</h2>
<p>The main components are chains, agents, memory, and retrievers.</p>

<h3>Chains</h3>
<p>Chains connect multiple LLM calls together using LCEL syntax.</p>
<p>They can be composed declaratively using the pipe operator.</p>

<h3>Agents</h3>
<p>Agents use LLMs to decide dynamically which tools to call.</p>
<p>They handle open-ended tasks that require multi-step reasoning.</p>

<h2>Getting Started</h2>
<p>Install LangChain with pip install langchain and set your API key.</p>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_chunks = html_splitter.split_text(html_content)

for chunk in html_chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:120]}")
    print("---")
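
For intuition about where the header metadata comes from, here is a rough standard-library sketch of header-based splitting. It is an assumption-laden illustration, not how HTMLHeaderTextSplitter is implemented: HeaderSplitter and split_html_by_headers are hypothetical names, chunks are plain dicts rather than Document objects, and the order of keys in header_tags is taken as the heading hierarchy.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect body text into chunks tagged with the headers they appear under."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = header_tags      # e.g. {"h1": "Header 1", "h2": "Header 2"}
        self.levels = list(header_tags)     # insertion order defines the hierarchy
        self.current_headers = {}
        self.current_tag = None             # header tag currently open, if any
        self.header_text = []
        self.buffer = []
        self.chunks = []

    def flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            self.chunks.append({"content": text, "metadata": dict(self.current_headers)})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self.flush()                    # a new header closes the previous chunk
            self.current_tag = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            # Forget this level and anything deeper, then record the new title
            for level in self.levels[self.levels.index(tag):]:
                self.current_headers.pop(self.header_tags[level], None)
            self.current_headers[self.header_tags[tag]] = "".join(self.header_text).strip()
            self.current_tag = None

    def handle_data(self, data):
        if self.current_tag:
            self.header_text.append(data)
        elif data.strip():
            self.buffer.append(data.strip())


def split_html_by_headers(html, header_tags):
    parser = HeaderSplitter(header_tags)
    parser.feed(html)
    parser.close()
    parser.flush()                          # emit whatever follows the last header
    return parser.chunks


html = (
    "<h1>Guide</h1><p>Intro.</p>"
    "<h2>Chains</h2><p>Chains text.</p>"
    "<h2>Agents</h2><p>Agents text.</p>"
)
chunks = split_html_by_headers(html, {"h1": "Header 1", "h2": "Header 2"})
for chunk in chunks:
    print(chunk)
```

Each chunk carries the full trail of headers above it, which is what makes filtered retrieval by section possible later.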

Building a Full Transformation Pipeline

Here is how you combine multiple transformers into a production-ready ingestion and retrieval pipeline:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_community.document_transformers import (
    EmbeddingsRedundantFilter,
    LongContextReorder,
)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
import os

os.environ["OPENAI_API_KEY"] = "your-key-here"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def preprocess_documents(raw_documents: list) -> list:
    """Full preprocessing pipeline for RAG ingestion."""

    # Step 1: Split into appropriately-sized chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    split_docs = splitter.transform_documents(raw_documents)
    print(f"After splitting: {len(split_docs)} chunks")

    # Step 2: Remove near-duplicate chunks before indexing
    dedup_filter = EmbeddingsRedundantFilter(
        embeddings=embeddings,
        similarity_threshold=0.92
    )
    unique_docs = dedup_filter.transform_documents(split_docs)
    print(f"After deduplication: {len(unique_docs)} chunks")

    return unique_docs


def retrieve_with_pipeline(
    query: str,
    vectorstore,
    k_initial: int = 20,
    k_final: int = 6
) -> list:
    """Retrieve, filter by relevance, reorder for LLM consumption."""

    # Over-retrieve to give the filter enough candidates
    initial_docs = vectorstore.similarity_search(query, k=k_initial)

    # Keep only docs actually relevant to this query
    relevance_filter = EmbeddingsFilter(
        embeddings=embeddings,
        similarity_threshold=0.75,
        k=k_final
    )
    relevant_docs = relevance_filter.compress_documents(initial_docs, query)
    print(f"After relevance filter: {len(relevant_docs)} docs")

    # Reorder to combat the lost-in-the-middle effect
    reorder = LongContextReorder()
    final_docs = reorder.transform_documents(relevant_docs)

    return final_docs


# Build the index
sample_documents = [
    Document(
        page_content="""
        Vector databases store high-dimensional embeddings for efficient similarity search.
        Popular options include Chroma, Pinecone, Weaviate, and Qdrant.
        Chroma is excellent for local development — no API key required.
        Pinecone is a managed service that scales to billions of vectors automatically.
        Weaviate supports hybrid search combining dense and sparse retrieval.
        """,
        metadata={"source": "db_guide.txt", "category": "infrastructure"}
    ),
]

processed = preprocess_documents(sample_documents)
vectorstore = Chroma.from_documents(
    processed,
    embeddings,
    persist_directory="./chroma_db"
)

# Query with the full pipeline
results = retrieve_with_pipeline(
    "Which vector database should I use for a production app?",
    vectorstore
)
print(f"\nFinal context for LLM: {len(results)} documents")

Comparison Table: LangChain Document Transformers

Transformer	Best For	Preserves Structure	Speed	Token Awareness
RecursiveCharacterTextSplitter	General text, code	Partial	Fast	Optional (tiktoken)
MarkdownTextSplitter	Markdown docs	Yes — headers	Fast	No
TokenTextSplitter	Strict token limits	No	Fast	Yes
HTMLHeaderTextSplitter	HTML / web content	Yes — headings	Fast	No
EmbeddingsFilter	Relevance filtering	N/A	Slow (API)	N/A
EmbeddingsRedundantFilter	Deduplication	N/A	Slow (API)	N/A
LongContextReorder	Context ordering	N/A	Very fast	N/A

Metadata Enrichment During Splitting

You can attach rich metadata during splitting that enables filtered retrieval later — a capability most tutorials skip:

import hashlib
from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_with_rich_metadata(
    text: str,
    source: str,
    category: str,
    chunk_size: int = 512
) -> list:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=64,
    )

    chunks = splitter.create_documents([text])

    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "source": source,
            "category": category,
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk.page_content),
            "chunk_hash": hashlib.md5(chunk.page_content.encode()).hexdigest()[:8],
            "is_first": i == 0,
            "is_last": i == len(chunks) - 1,
        })

    return chunks


# Retrieve only from a specific category later
engineering_results = vectorstore.similarity_search(
    query="installation procedure",
    k=5,
    filter={"category": "engineering"}
)

Performance Considerations

Batch your embedding calls. The OpenAI embeddings API allows 2,048 inputs per request:

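# Efficient: one call from your code; the client splits it into API-sized batches
# `documents` is your list of Document chunks (for example, the output of preprocess_documents)
texts = [doc.page_content for doc in documents]
all_embeddings = embeddings.embed_documents(texts)  # batched automatically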
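
If you call an embedding endpoint directly instead of going through a client that batches for you, the batching logic is only a few lines. The sketch below is illustrative: batched, embed_in_batches and fake_embed are hypothetical names, and fake_embed stands in for a real API call so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=2048):
    """Call embed_fn once per batch and concatenate the results in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


calls = []  # records how many texts each simulated request carried


def fake_embed(batch):
    # Stand-in for a real embeddings request: one 1-D "vector" per text
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"chunk {i}" for i in range(5000)]
vectors = embed_in_batches(texts, fake_embed)

print(f"{len(vectors)} vectors from {len(calls)} requests: {calls}")
```

Five thousand chunks go out in three requests instead of five thousand, which is where most of the ingestion speedup comes from.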

Integration with Agent Pipelines

Common Mistakes to Avoid

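Splitting before setting metadata. Set metadata on the source Document before splitting. split_documents() and transform_documents() copy the parent's metadata onto every chunk, so anything you set up front propagates automatically. If you split raw strings with create_documents() and pass no metadatas, the chunks start with empty metadata and lose the parent-document relationship unless you attach it yourself afterward.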

Frequently Asked Questions

What is the best text splitter for code files in LangChain?

How does EmbeddingsRedundantFilter work in LangChain?

Can I chain multiple document transformers together?


