10 LangChain Text Splitters: Recursive, Markdown, Code (2026)

⚡ Quick Answer

A practical guide to all 10 LangChain text splitters — Recursive, Markdown, Code, HTML, Semantic, Token — with comparison table and chunking best practices.

AiTechWorlds Team May 31, 2026 16 min read

#LangChain #text splitters #chunking #RAG #document processing

Chunking is the step everyone underestimates. I've seen teams spend weeks optimizing their embedding models and retrieval algorithms while leaving their text splitter at default settings — and then wonder why their RAG system keeps returning irrelevant or cut-off answers.

The truth is that poor chunking creates problems that no amount of retrieval optimization can fix. If a chunk cuts a sentence in half, the embedding captures noise. If a function is split across two chunks, code search breaks down. If you chunk a Markdown document without respecting headers, you lose the structural signal that makes document hierarchies useful.

LangChain ships with a surprisingly complete set of text splitters, each designed for a specific document type or splitting strategy. This guide covers all 10 in practical depth — what each one does, when to use it, and working Python code for each. I'll also include a comparison table and the key decisions that determine chunk quality.

Before diving in, if you're building a full retrieval pipeline around these chunks, the RAG system tutorial covers the storage and retrieval side in detail.

Why Chunking Strategy Matters More Than You Think

Here's a simple mental model: an embedding model compresses a text chunk into a single vector. If the chunk is incoherent — half a sentence, a code snippet ripped from its context, three unrelated paragraphs — the embedding averages out to something that doesn't represent anything clearly.

Coherent chunks produce embeddings that cluster reliably in the vector space. Incoherent chunks produce noise.

Research from LlamaIndex's chunking benchmarks (2024) shows that semantic chunking approaches improve retrieval MRR (Mean Reciprocal Rank) by 12–18% compared to naive fixed-size splitting, especially for long-form documents with varied structure.

The other factor is chunk size. The rule of thumb I use: your chunk should be the smallest unit of text that can answer a question completely on its own. For prose, that's usually 2–4 paragraphs. For code, it's usually one function. For API documentation, it's usually one endpoint.

Comparison Table: All 10 Splitters

Splitter	Chunk Coherence	Overlap Handling	Code Awareness	Speed	Best For
RecursiveCharacterTextSplitter	High	Yes	None	Fast	General prose, mixed documents
CharacterTextSplitter	Medium	Yes	None	Fastest	Simple text, quick prototyping
MarkdownHeaderTextSplitter	Very High	No (header-based)	None	Fast	Markdown docs, wikis
MarkdownTextSplitter	High	Yes	None	Fast	Long Markdown without strict headers
CodeTextSplitter	Very High	Yes	Full	Fast	Source code (15 languages)
HTMLHeaderTextSplitter	Very High	No (tag-based)	None	Fast	HTML docs, web content
HTMLSectionSplitter	High	No	None	Fast	HTML with section tags
TokenTextSplitter	Medium	Yes	None	Medium	LLM context window management
SentenceTransformersTokenTextSplitter	High	Yes	None	Slow	Semantic model alignment
SemanticChunker	Very High	Auto	None	Slowest	Long-form unstructured documents

1. RecursiveCharacterTextSplitter

This is the one I reach for first with any general document. It tries a hierarchy of separators in order. With the list configured below, that means paragraph breaks (\n\n), then line breaks (\n), then sentence endings (. ), then spaces, then individual characters; the library's default list omits the sentence separator. It only moves to the next separator when a piece is still larger than chunk_size.

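# Recent releases ship the splitters in the langchain-text-splitters package;
# older ones exposed the same classes under langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter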

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len  # Can also use tiktoken for token-based sizing
)

text = """
LangChain is a framework for building LLM applications.
It provides abstractions for chains, agents, and memory.

The retrieval module supports multiple vector stores including
FAISS, Chroma, Pinecone, and LanceDB. Each has different 
performance characteristics depending on dataset size.

Agent frameworks in LangChain support tool use, planning,
and multi-step reasoning patterns.
"""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

# With documents
from langchain_core.documents import Document
docs = [Document(page_content=text, metadata={"source": "intro.txt"})]
doc_chunks = splitter.split_documents(docs)
print(f"Created {len(doc_chunks)} chunks")

For token-based sizing (better for LLM context management):

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Look up the encoding once instead of on every length call
ENCODING = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    return len(ENCODING.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,      # tokens, not characters
    chunk_overlap=40,
    length_function=tiktoken_len
)

Using tiktoken_len instead of len sizes chunks by actual token count, so a chunk never holds more tokens than you budgeted for. That matters because some embedding models silently truncate over-long input, while others reject it outright.
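To make the separator hierarchy concrete, here is a stdlib-only sketch of the recursive idea. It is not LangChain's implementation: the function name recursive_split is hypothetical, overlap is left out, and separators are dropped from the output for brevity.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; retry oversized pieces with the finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry with finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a bit longer and will need a finer split. "
    "It keeps going for a while.\n\n"
    "Short end."
)
demo_chunks = recursive_split(sample, ["\n\n", "\n", ". ", " "], chunk_size=60)
for c in demo_chunks:
    print(len(c), repr(c))
```

The short paragraphs survive intact, and only the oversized middle paragraph is broken down at the sentence level, which is the behavior that makes this splitter a sensible default.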

2. CharacterTextSplitter

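The simplest splitter: it splits on a single separator string and merges the pieces back up to chunk_size. Useful when you know your document has a predictable structure. Note that a piece with no separator inside it is kept whole, even if it is longer than chunk_size.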

from langchain_text_splitters import CharacterTextSplitter

# Split on double newlines (paragraph boundaries)
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=100
)

chunks = splitter.split_text(text)

# Split on custom delimiter (e.g., Markdown horizontal rules in formatted docs)
hr_splitter = CharacterTextSplitter(
    separator="---",
    chunk_size=800,
    chunk_overlap=0  # No overlap when sections are self-contained
)

I use CharacterTextSplitter mostly for preprocessed documents where sections are already clearly delimited. For anything organic, RecursiveCharacterTextSplitter handles edge cases better.
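The edge case worth knowing about is a piece that contains no separator at all. The sketch below is a simplified stand-in, not the library's code (split_on_separator is a hypothetical name and overlap is omitted), but it shows why a single-separator strategy cannot enforce chunk_size on its own.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on one separator and merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece  # an oversized piece is kept whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


delimited = "short one\n\n" + "x" * 50 + "\n\nshort two"
separator_chunks = split_on_separator(delimited, "\n\n", chunk_size=20)
print([len(c) for c in separator_chunks])
```

The middle piece stays at 50 characters despite chunk_size=20, because there is nothing finer to split on. A recursive splitter would fall through to the next separator instead.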

3. MarkdownHeaderTextSplitter

This is the right choice for any Markdown-structured content — documentation, wikis, README files. It splits on header levels and preserves the header hierarchy in chunk metadata.

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """
# Introduction

This guide covers LangChain fundamentals.

## Installation

Install with pip:

```bash
pip install langchain
```

## Core Concepts

### Chains

Chains connect multiple components together.

### Agents

Agents use LLMs to decide which tools to call.

## Configuration

Set your API key as an environment variable.
"""

# Define which headers to split on
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # Keep header text in chunk content
)

chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print("Content:", chunk.page_content[:100])
    print("Metadata:", chunk.metadata)
    print("---")

Each chunk's metadata includes the header path, so you know exactly where in the document hierarchy each chunk came from. This is invaluable for citation and source attribution in RAG responses.

For longer Markdown documents where header-based chunks are still too large, chain it with RecursiveCharacterTextSplitter:

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter
)

# First split by headers
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
header_chunks = header_splitter.split_text(long_markdown)

# Then split large sections further
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
final_chunks = char_splitter.split_documents(header_chunks)
# Metadata from header splitting is preserved!

This two-stage approach preserves structural metadata while keeping chunk sizes manageable.
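If it helps to see how a header path ends up in metadata, here is a stdlib-only sketch of the idea. It is an illustration under simplifying assumptions (ATX-style "#" headers only, a hypothetical split_by_headers function), not the library's parser.

```python
from typing import Dict, List, Tuple


def split_by_headers(markdown: str, max_level: int = 3) -> List[Tuple[Dict[str, str], str]]:
    """Return (metadata, content) pairs, one per header-delimited section."""
    sections: List[Tuple[Dict[str, str], str]] = []
    path: Dict[int, str] = {}  # header level -> title currently in effect
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            meta = {f"h{level}": title for level, title in sorted(path.items())}
            sections.append((meta, content))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        marker, _, title = line.partition(" ")
        is_header = (
            not in_fence
            and 1 <= len(marker) <= max_level
            and set(marker) == {"#"}
            and title.strip() != ""
        )
        if is_header:
            flush()
            level = len(marker)
            # A new header invalidates any deeper headers recorded before it
            path = {lvl: t for lvl, t in path.items() if lvl < level}
            path[level] = title.strip()
        else:
            body.append(line)
    flush()
    return sections


doc = (
    "# Guide\n\nIntro text.\n\n"
    "## Install\n\nRun pip.\n\n"
    "## Concepts\n\n### Chains\n\nChains connect things.\n"
)
header_sections = split_by_headers(doc)
for meta, content in header_sections:
    print(meta, "->", content)
```

Each section carries the full chain of headers above it, so a chunk taken from "Chains" still knows it lives under "Concepts" in "Guide". That is the information the second-stage character splitter preserves when it copies metadata onto the smaller chunks.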

4. CodeTextSplitter

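Code is fundamentally different from prose. The natural unit is a function, method, or class, not a paragraph. LangChain has no class literally named CodeTextSplitter; the name is used here as shorthand for RecursiveCharacterTextSplitter.from_language, which uses language-specific separators to respect code structure.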

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Python code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

python_code = '''
import os
from typing import List

def load_documents(path: str) -> List[str]:
    """Load all text files from a directory."""
    documents = []
    for filename in os.listdir(path):
        if filename.endswith(".txt"):
            with open(os.path.join(path, filename)) as f:
                documents.append(f.read())
    return documents

class DocumentProcessor:
    def __init__(self, chunk_size: int = 500):
        self.chunk_size = chunk_size
    
    def process(self, documents: List[str]) -> List[str]:
        """Process and chunk documents."""
        return [doc[:self.chunk_size] for doc in documents]

def main():
    docs = load_documents("./data")
    processor = DocumentProcessor(chunk_size=400)
    chunks = processor.process(docs)
    print(f"Processed {len(chunks)} chunks")
'''

chunks = python_splitter.split_text(python_code)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk[:200])
    print("---")

# Supported languages (the exact list depends on the installed version)
print("Supported:", [lang.value for lang in Language])
# e.g. python, js, ts, markdown, latex, html, sol, rust, go, cpp, c, scala, ruby, cobol, lua, ...

The Python separators try to split on class definitions, function definitions, and method definitions first, so you rarely get a chunk that cuts through a function body.
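The first level of that hierarchy can be approximated with a single regular expression. The sketch below is a rough stand-in that assumes top-level "class " and "def " lines are the only boundaries of interest; split_top_level is a hypothetical helper, and it ignores decorators, async def, and chunk_size.

```python
import re
from typing import List

# Zero-width match at the start of any unindented class or def line
TOP_LEVEL_DEF = re.compile(r"^(?=class |def )", flags=re.MULTILINE)


def split_top_level(source: str) -> List[str]:
    """Split Python source so each piece starts at a top-level class or def."""
    pieces = TOP_LEVEL_DEF.split(source)
    return [piece.strip("\n") for piece in pieces if piece.strip()]


example_source = (
    "import os\n\n"
    "def load(path):\n    return os.listdir(path)\n\n"
    "class Processor:\n    def run(self):\n        return 1\n\n"
    "def main():\n    pass\n"
)
code_pieces = split_top_level(example_source)
for piece in code_pieces:
    print(piece.splitlines()[0])
```

Because the pattern only matches unindented lines, the run method stays inside its class instead of becoming a separate piece. Finer separators only come into play when a class or function is itself larger than chunk_size.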

For JavaScript/TypeScript (common in full-stack codebases):

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=800,
    chunk_overlap=80
)

ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS,
    chunk_size=800,
    chunk_overlap=80
)

For a code search application built on this pattern, the build AI agent with LangChain post shows how code chunks can feed a tool-using agent.

5. HTMLHeaderTextSplitter

For HTML documents — scraped web pages, exported documentation, HTML reports — this splitter preserves the document's heading hierarchy in metadata, similar to MarkdownHeaderTextSplitter.

from langchain_text_splitters import HTMLHeaderTextSplitter

html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>LangChain Documentation</h1>
<p>LangChain is a framework for building LLM applications.</p>

<h2>Getting Started</h2>
<p>Install LangChain with pip install langchain.</p>

<h2>Core Components</h2>
<h3>Chains</h3>
<p>Chains connect LLM calls with other components.</p>
<h3>Agents</h3>
<p>Agents use LLMs to make decisions about tool use.</p>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_content)

for chunk in chunks:
    print("Content:", chunk.page_content)
    print("Metadata:", chunk.metadata)
    print("---")

This splitter is especially useful for web scraping pipelines where you've collected HTML pages and want to preserve navigational context in each chunk's metadata.
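For intuition about what "heading hierarchy in metadata" means for HTML, the same bookkeeping can be sketched with the standard library's html.parser. This is a simplified illustration, not how the LangChain splitter is implemented, and HeaderSectionParser is a hypothetical class.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class HeaderSectionParser(HTMLParser):
    """Collect text under each heading, tagged with the active heading path."""

    def __init__(self, header_tags: Tuple[str, ...] = ("h1", "h2", "h3")) -> None:
        super().__init__()
        self.header_tags = header_tags
        self.path: Dict[str, str] = {}
        self.sections: List[Tuple[Dict[str, str], str]] = []
        self._in_header: Optional[str] = None
        self._text: List[str] = []

    def _flush(self) -> None:
        content = " ".join(" ".join(self._text).split())  # normalize whitespace
        if content:
            self.sections.append((dict(self.path), content))
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()
            # Drop this level and anything deeper before recording the new heading
            for deeper in self.header_tags[self.header_tags.index(tag):]:
                self.path.pop(deeper, None)
            self._in_header = tag
            self.path[tag] = ""

    def handle_endtag(self, tag):
        if tag == self._in_header:
            self.path[tag] = self.path[tag].strip()
            self._in_header = None

    def handle_data(self, data):
        if self._in_header:
            self.path[self._in_header] += data
        else:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()


html_parser = HeaderSectionParser()
html_parser.feed(
    "<h1>Docs</h1><p>Intro.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Core</h2><h3>Chains</h3><p>Chains link calls.</p>"
)
html_parser.close()
for meta, content in html_parser.sections:
    print(meta, "->", content)
```

The last section is tagged with all three heading levels, which is the navigational context a scraped page would otherwise lose once it is cut into chunks.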

6. HTMLSectionSplitter

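Similar to HTMLHeaderTextSplitter, but it groups the page into sections, one per header tag you configure, and returns each section as a Document with the header stored in metadata. It can also apply an XSLT transformation before splitting, which helps when a page marks headings with styling rather than heading tags.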

from langchain_text_splitters import HTMLSectionSplitter

html_with_sections = """
<article>
  <section>
    <h2>Introduction</h2>
    <p>This covers the basics of LangChain.</p>
  </section>
  <section>
    <h2>Advanced Usage</h2>
    <p>Advanced patterns include custom chains and agents.</p>
  </section>
</article>
"""

splitter = HTMLSectionSplitter(
    headers_to_split_on=[("h2", "section_title")]
)
chunks = splitter.split_text(html_with_sections)

7. TokenTextSplitter

When you need precise control over token counts — especially for LLM calls where you're managing context windows — TokenTextSplitter splits based on actual token counts rather than character counts.

from langchain_text_splitters import TokenTextSplitter

# Uses tiktoken under the hood
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # OpenAI's encoding
    chunk_size=512,    # tokens
    chunk_overlap=50   # tokens
)

long_text = "..." * 5000  # Your long document here

chunks = splitter.split_text(long_text)
print(f"Number of chunks: {len(chunks)}")

# Verify token counts
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for i, chunk in enumerate(chunks[:3]):
    token_count = len(enc.encode(chunk))
    print(f"Chunk {i+1}: {token_count} tokens")

I use TokenTextSplitter primarily when I'm feeding chunks directly into LLM prompts with strict context limits — for example, when summarizing or classifying chunks in a batch job. For embedding-based retrieval, RecursiveCharacterTextSplitter with a tiktoken_len function is usually cleaner.
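The windowing logic behind token-based splitting is simple enough to sketch without a tokenizer. The example below uses placeholder string tokens instead of tiktoken IDs, and token_windows is a hypothetical helper, so treat it as an illustration of the stride arithmetic only.

```python
from typing import List


def token_windows(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a fixed-size window over tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return windows


demo_tokens = [f"t{i}" for i in range(10)]
demo_windows = token_windows(demo_tokens, chunk_size=4, chunk_overlap=1)
for window in demo_windows:
    print(window)
```

Each window starts on the last token of the previous one, which is what chunk_overlap=1 means in token terms. Note that windows are cut purely by position, so they can end mid-sentence; that is the trade-off for exact token counts.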

8. SentenceTransformersTokenTextSplitter

For projects using sentence-transformers models (BERT-based, not OpenAI), this splitter aligns chunk sizes with the model's tokenizer rather than tiktoken.

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Sized for all-MiniLM-L6-v2 (truncates input beyond 256 word pieces by default)
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    tokens_per_chunk=256,
    chunk_overlap=25
)

chunks = splitter.split_text(text)

# Each chunk will fit within the model's context window
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)
print(f"Embedded {len(chunks)} chunks")

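This matters because sentence-transformers models have a fixed maximum sequence length that varies by model: all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default, and many BERT-based models stop at 512 tokens. Chunks that exceed the limit get silently truncated during encoding, which degrades embedding quality. SentenceTransformersTokenTextSplitter prevents that by counting tokens with the model's own tokenizer.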

For more on open-source embedding models, the Hugging Face transformers tutorial covers the full model selection process.

9. SemanticChunker

This is the most computationally expensive splitter, and also the most intelligent. Instead of splitting on character counts or separators, SemanticChunker uses embedding similarity to find natural break points — places where the topic shifts.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95  # Split where the distance between adjacent sentences exceeds the 95th percentile
)

long_article = """
[Your long, multi-topic document here]
"""

chunks = splitter.split_text(long_article)

print(f"Created {len(chunks)} semantic chunks")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(chunk[:200])

Three breakpoint strategies (recent releases also accept "gradient"):

"percentile": Splits wherever the distance between neighboring sentences exceeds the Nth percentile of all such distances. Good for documents with clear topic shifts.
"standard_deviation": Splits where the distance is more than N standard deviations above the mean. Good for consistent documents.
"interquartile": Uses the IQR method. More robust to outliers.

SemanticChunker is slower because it embeds every sentence to compute similarities. For a 10,000-word document, expect 5–15 seconds and a small embedding cost. Worth it for long-form content where topic coherence matters more than processing speed.
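The percentile strategy can be illustrated without any embedding API. The sketch below swaps in a toy bag-of-words vector for the real embedding model and uses hypothetical helper names, so it shows the thresholding logic only, not the quality you would get from actual embeddings.

```python
import math
from collections import Counter
from typing import List


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences: List[str], pct: float = 95) -> List[str]:
    # Toy "embedding": word counts (a stand-in for a real embedding model)
    vectors = [Counter(s.lower().split()) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    if not distances:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, pct)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # unusually large jump: start a new chunk here
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


topic_sentences = [
    "cats purr softly",
    "cats purr loudly",
    "stocks fell sharply",
    "stocks fell again",
]
topic_chunks = semantic_chunks(topic_sentences, pct=95)
print(topic_chunks)
```

Only the jump from the cat sentences to the stock sentences clears the 95th-percentile threshold, so the text splits into two topic-coherent chunks. Lowering the percentile produces more, smaller chunks.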

10. LatexTextSplitter

For academic or scientific documents written in LaTeX, the dedicated splitter respects LaTeX structure:

from langchain_text_splitters import LatexTextSplitter

latex_text = r"""
\section{Introduction}
This paper presents a novel approach to ...

\subsection{Background}
Previous work on retrieval-augmented generation ...

\section{Methodology}
We propose the following architecture ...
"""

splitter = LatexTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(latex_text)

for chunk in chunks:
    print(chunk[:200])
    print("---")

Choosing the Right Splitter: Decision Guide

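Here's the practical decision flow I follow:

Markdown with headers: MarkdownHeaderTextSplitter, followed by RecursiveCharacterTextSplitter if sections run long.
Source code: RecursiveCharacterTextSplitter.from_language with the matching Language.
HTML: HTMLHeaderTextSplitter, or HTMLSectionSplitter for section-oriented pages.
LaTeX: LatexTextSplitter.
Strict token budgets for LLM prompts: TokenTextSplitter.
Embeddings from a sentence-transformers model: SentenceTransformersTokenTextSplitter.
Long unstructured text where topic coherence matters most: SemanticChunker.
Everything else: RecursiveCharacterTextSplitter.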

Production Pipeline: Multi-Format Document Ingestion

Real applications have to handle multiple document types simultaneously. Here's a routing pattern:

from pathlib import Path
from typing import List
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    HTMLHeaderTextSplitter,
    LatexTextSplitter,
    Language
)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

EXTENSION_TO_SPLITTER = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".html": "html",
    ".htm": "html",
    ".tex": "latex",
}

def get_splitter_for_file(
    file_path: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    use_semantic: bool = False,
    embeddings=None
):
    """Return the appropriate text splitter based on file extension."""
    ext = Path(file_path).suffix.lower()
    splitter_type = EXTENSION_TO_SPLITTER.get(ext, "default")

    if splitter_type == "markdown":
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "h1"), ("##", "h2"), ("###", "h3")
            ]
        )

    elif splitter_type == "python":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    elif splitter_type == "javascript":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.JS,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    elif splitter_type == "typescript":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.TS,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    # HTML and LaTeX are mapped above, so route them to their own splitters
    elif splitter_type == "html":
        return HTMLHeaderTextSplitter(
            headers_to_split_on=[
                ("h1", "h1"), ("h2", "h2"), ("h3", "h3")
            ]
        )

    elif splitter_type == "latex":
        return LatexTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    elif use_semantic and embeddings:
        return SemanticChunker(
            embeddings=embeddings,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=95
        )

    else:
        return RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )


def chunk_documents(
    documents: List[Document],
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    use_semantic_for_long: bool = False
) -> List[Document]:
    """
    Chunk a list of documents using the appropriate splitter
    for each file type.
    """
    embeddings = OpenAIEmbeddings() if use_semantic_for_long else None
    all_chunks = []
    
    for doc in documents:
        source = doc.metadata.get("source", "")
        
        # For very short documents, skip chunking
        if len(doc.page_content) < 200:
            all_chunks.append(doc)
            continue
        
        # Use semantic chunking for long unstructured documents
        use_semantic = (
            use_semantic_for_long
            and len(doc.page_content) > 5000
            # Lowercase the suffix so ".MD" or ".HTML" still counts as a known type
            and Path(source).suffix.lower() not in EXTENSION_TO_SPLITTER
        )
        
        splitter = get_splitter_for_file(
            file_path=source,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            use_semantic=use_semantic,
            embeddings=embeddings
        )
        
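        try:
            if hasattr(splitter, 'split_documents'):
                chunks = splitter.split_documents([doc])
            else:
                # Header splitters only expose split_text() and return Document
                # objects rather than strings, so handle both result types
                pieces = splitter.split_text(doc.page_content)
                chunks = [
                    Document(
                        page_content=getattr(piece, "page_content", piece),
                        metadata={
                            **doc.metadata,
                            **getattr(piece, "metadata", {}),
                            "chunk_index": i,
                        }
                    )
                    for i, piece in enumerate(pieces)
                ]
            all_chunks.extend(chunks)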
            
        except Exception as e:
            print(f"Warning: Chunking failed for {source}: {e}")
            # Fall back to default splitter
            fallback = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size, chunk_overlap=chunk_overlap
            )
            chunks = fallback.split_documents([doc])
            all_chunks.extend(chunks)
    
    print(f"Chunked {len(documents)} documents into {len(all_chunks)} chunks")
    return all_chunks


# Usage example
if __name__ == "__main__":
    from langchain_community.document_loaders import DirectoryLoader
    
    loader = DirectoryLoader(
        "./knowledge_base/",
        glob="**/*",
        show_progress=True
    )
    documents = loader.load()
    
    chunks = chunk_documents(
        documents=documents,
        chunk_size=400,
        chunk_overlap=40,
        use_semantic_for_long=False  # Set True for better quality, slower
    )
    
    # Group chunks by source type
    from collections import Counter
    source_types = Counter(
        Path(c.metadata.get("source", "")).suffix 
        for c in chunks
    )
    print("Chunks by file type:", dict(source_types))

This production pipeline automatically routes each document to the best splitter for its type, with a fallback for unknown formats.

For how these chunks feed into a larger agent system, the post on AI agent memory and planning covers the retrieval-memory connection.

Evaluating Chunk Quality

You can actually measure chunk quality before spending money on embeddings. A few quick checks:

from typing import List
import statistics

def analyze_chunks(chunks: List[str]) -> dict:
    """Quick quality analysis of chunk output."""
    if not chunks:
        return {"total_chunks": 0}  # Nothing to measure; avoids dividing by zero

    lengths = [len(c) for c in chunks]

    # Check for very short chunks (likely noise)
    short_chunks = [c for c in chunks if len(c) < 50]

    # Check for chunks that look cut off (end mid-sentence)
    def ends_cleanly(text: str) -> bool:
        # Trailing whitespace is stripped first, so only visible endings count
        return text.rstrip().endswith(('.', '?', '!', '```'))

    clean_endings = sum(1 for c in chunks if ends_cleanly(c))

    return {
        "total_chunks": len(chunks),
        "avg_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "std_dev": statistics.stdev(lengths) if len(lengths) > 1 else 0,
        "short_chunks_pct": len(short_chunks) / len(chunks) * 100,
        "clean_endings_pct": clean_endings / len(chunks) * 100,
        "min_length": min(lengths),
        "max_length": max(lengths)
    }

# Use it
stats = analyze_chunks([c.page_content for c in chunks])
print(f"Quality stats: {stats}")
# Target: clean_endings_pct > 70%, short_chunks_pct < 5%

Target metrics for a healthy chunk set: clean_endings_pct above 70%, short_chunks_pct below 5%, and std_dev below 200 characters (consistent chunk sizes).

For the embedding and storage step after chunking, see the semantic search tutorial for how chunk quality translates to retrieval performance.

Conclusion

Text splitting is the foundation everything else in your RAG pipeline sits on. Picking the wrong splitter means your embeddings are noisy, your retrieval is imprecise, and your LLM gets confused context — problems that are hard to debug because they're downstream from the root cause.

The choice comes down to document type: Markdown headers for structured docs, the language-aware code splitter (RecursiveCharacterTextSplitter.from_language) for source code, SemanticChunker for long unstructured content, and RecursiveCharacterTextSplitter as the reliable default for everything else. The two-stage pattern, header splitting followed by character splitting for oversized sections, handles the most common real-world case (documentation) particularly well.

Use the analyze_chunks() function to validate your output before ingesting into a vector database. A 15-minute audit of your chunking output can save hours of debugging downstream retrieval failures.

For the next step after splitting, the LangChain tutorial 2025 walks through embedding and storing your chunks for retrieval.

FAQs

Which LangChain text splitter should I use for general documents?

RecursiveCharacterTextSplitter is the best default choice for most documents. It tries paragraph breaks first, then line breaks, then word breaks (add ". " to separators if you also want sentence breaks), producing coherent chunks that rarely cut mid-sentence. Start with chunk_size=500 and chunk_overlap=50 and adjust from there.

What chunk size gives the best RAG retrieval performance?

Research from Pinecone and LlamaIndex suggests that 300–600 tokens per chunk works best for most retrieval tasks. Smaller chunks (100–200 tokens) improve precision but lose context. Larger chunks (1000+ tokens) preserve context but dilute relevance scores. For code, the optimal unit is usually one function or class method.

Does chunk overlap actually improve retrieval quality?

Yes, but the improvement has diminishing returns. An overlap of 10–15% of chunk size (e.g., 50 tokens for a 400-token chunk) typically prevents information loss at chunk boundaries. Going above 20% overlap wastes storage and embedding cost without proportional quality gains.
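The cost side of that trade-off is easy to estimate. The sketch below assumes plain sliding-window chunking (every chunk except possibly the last is full) and uses hypothetical helper names; real splitters that respect separators will land near these numbers rather than exactly on them.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)


def overlap_overhead(total_tokens: int, chunk_size: int, chunk_overlap: int) -> float:
    """Fraction of extra tokens embedded because neighboring chunks share text."""
    if total_tokens <= 0:
        return 0.0
    boundaries = max(chunk_count(total_tokens, chunk_size, chunk_overlap) - 1, 0)
    return boundaries * chunk_overlap / total_tokens


for overlap in (0, 50, 100):
    n = chunk_count(10_000, 400, overlap)
    extra = overlap_overhead(10_000, 400, overlap)
    print(f"overlap={overlap}: {n} chunks, {extra:.0%} extra tokens")
```

For a 10,000-token document with 400-token chunks, moving from a 50-token to a 100-token overlap more than doubles the duplicated tokens you pay to embed and store, which is why the returns diminish quickly.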

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

RecursiveCharacterTextSplitter is the best default choice for most documents. It tries paragraph breaks first, then sentence breaks, then word breaks, producing coherent chunks that don't cut mid-sentence. Start with chunk_size=500 and chunk_overlap=50 and adjust from there.

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

search relevance ranking showing scores — LangChain advanced RAG retrieval strategies

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

AI agent architecture with memory and tool connections — LangChain agent memory tools

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

developer coding AI agent decision loop — LangChain agent types ZeroShot ReAct Conversational

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

FastAPI server running LangChain endpoint — deploy LangChain FastAPI REST streaming

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

NotesRAG: Retrieval-Augmented Generation Guide NotesAI Agent Development Notes BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizRAG Systems

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Langchain

10 LangChain Text Splitters: Recursive, Markdown, Code (2026)

⚡ Quick Answer

A practical guide to all 10 LangChain text splitters — Recursive, Markdown, Code, HTML, Semantic, Token — with comparison table and chunking best practices.

AiTechWorlds Team May 31, 2026 16 min read

#LangChain #text splitters #chunking #RAG #document processing

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Before diving in, if you're building a full retrieval pipeline around these chunks, the RAG system tutorial covers the storage and retrieval side in detail.

Why Chunking Strategy Matters More Than You Think

Coherent chunks produce embeddings that cluster reliably in the vector space. Incoherent chunks produce noise.

Comparison Table: All 10 Splitters

Splitter	Chunk Coherence	Overlap Handling	Code Awareness	Speed	Best For
RecursiveCharacterTextSplitter	High	Yes	None	Fast	General prose, mixed documents
CharacterTextSplitter	Medium	Yes	None	Fastest	Simple text, quick prototyping
MarkdownHeaderTextSplitter	Very High	No (header-based)	None	Fast	Markdown docs, wikis
MarkdownTextSplitter	High	Yes	None	Fast	Long Markdown without strict headers
CodeTextSplitter	Very High	Yes	Full	Fast	Source code (15 languages)
HTMLHeaderTextSplitter	Very High	No (tag-based)	None	Fast	HTML docs, web content
HTMLSectionSplitter	High	No	None	Fast	HTML with section tags
TokenTextSplitter	Medium	Yes	None	Medium	LLM context window management
SentenceTransformersTokenTextSplitter	High	Yes	None	Slow	Semantic model alignment
SemanticChunker	Very High	Auto	None	Slowest	Long-form unstructured documents

1. RecursiveCharacterTextSplitter

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len  # Can also use tiktoken for token-based sizing
)

text = """
LangChain is a framework for building LLM applications.
It provides abstractions for chains, agents, and memory.

The retrieval module supports multiple vector stores including
FAISS, Chroma, Pinecone, and LanceDB. Each has different 
performance characteristics depending on dataset size.

Agent frameworks in LangChain support tool use, planning,
and multi-step reasoning patterns.
"""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

# With documents
from langchain.schema import Document
docs = [Document(page_content=text, metadata={"source": "intro.txt"})]
doc_chunks = splitter.split_documents(docs)
print(f"Created {len(doc_chunks)} chunks")

For token-based sizing (better for LLM context management):

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,      # tokens, not characters
    chunk_overlap=40,
    length_function=tiktoken_len
)

Using tiktoken_len instead of len makes chunk_size and chunk_overlap count tokens rather than characters, which helps avoid the silent truncation that can happen when a chunk holds more tokens than your embedding model accepts.
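If the separator cascade feels abstract, here is a stripped-down pure-Python sketch of the idea. It is my own simplification for illustration, not LangChain's implementation: the name recursive_split and the sample text are made up, it ignores chunk_overlap, and it hard-cuts by characters once it runs out of separators.

```python
from typing import Callable, List


def recursive_split(
    text: str,
    chunk_size: int,
    separators: List[str],
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Try the coarsest separator first; recurse with finer ones on oversized pieces."""
    if length_function(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a plain character cut (sketch-only behavior)
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = f"{current}{sep}{piece}" if current else piece
        if length_function(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer, length_function))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Tiny."
)

# Character budget: 40 characters per chunk
chunks = recursive_split(sample, 40, ["\n\n", " "])

# Same text, but budgeted by a crude whitespace word count instead of len
word_chunks = recursive_split(
    sample, 5, ["\n\n", " "], length_function=lambda s: len(s.split())
)

for c in chunks:
    print(len(c), repr(c))
```

Swapping the length_function changes where the cuts land without touching the splitting logic, which is the same reason passing tiktoken_len to the real splitter works.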

2. CharacterTextSplitter

The simplest splitter: it splits on a single separator string (one character or several) and nothing else. Useful when you know your document has a predictable structure.

from langchain.text_splitter import CharacterTextSplitter

# Split on double newlines (paragraph boundaries)
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=100
)

chunks = splitter.split_text(text)

# Split on custom delimiter (e.g., HR tags in formatted docs)
hr_splitter = CharacterTextSplitter(
    separator="---",
    chunk_size=800,
    chunk_overlap=0  # No overlap when sections are self-contained
)

I use CharacterTextSplitter mostly for preprocessed documents where sections are already clearly delimited. For anything organic, RecursiveCharacterTextSplitter handles edge cases better.
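One edge case worth seeing in miniature: with only one separator there is nothing finer to fall back on, so a section longer than chunk_size stays oversized. The sketch below is a simplified stand-in I wrote for illustration (split_on_separator is a hypothetical helper, and it ignores overlap); it is not the library's code.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on one separator and merge neighbours up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # No finer separator exists, so an oversized piece is kept whole
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_doc = "short one\n\n" + "x" * 120 + "\n\nshort two"
single_sep_chunks = split_on_separator(sample_doc, "\n\n", chunk_size=50)

for c in single_sep_chunks:
    print(len(c))
```

The middle chunk comes out at 120 characters despite the 50-character budget, which is the kind of case where the recursive splitter is the safer default.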

3. MarkdownHeaderTextSplitter

This is the right choice for any Markdown-structured content — documentation, wikis, README files. It splits on header levels and preserves the header hierarchy in chunk metadata.

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = """
# Introduction

This guide covers LangChain fundamentals.

## Installation

Install with pip:

```bash
pip install langchain
```

## Core Concepts

### Chains

Chains connect multiple components together.

### Agents

Agents use LLMs to decide which tools to call.

## Configuration

Set your API key as an environment variable.
"""

# Define which headers to split on
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # Keep header text in chunk content
)

chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print("Content:", chunk.page_content[:100])
    print("Metadata:", chunk.metadata)
    print("---")


Each chunk's metadata includes the header path, so you know exactly where in the document hierarchy each chunk came from. This is invaluable for citation and source attribution in RAG responses.
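To show what "header path in metadata" means mechanically, here is a small pure-Python sketch. It is an illustration under my own assumptions (the function name split_markdown_by_headers, the h1/h2/h3 keys, and the sample document are made up), not the library's implementation.

```python
import re
from typing import Dict, List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(
    markdown: str, max_level: int = 3
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (content, metadata) pairs, where metadata holds the active header path."""
    sections: List[Tuple[str, Dict[str, str]]] = []
    path: Dict[int, str] = {}  # header level -> current title at that level
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            metadata = {f"h{lvl}": title for lvl, title in sorted(path.items())}
            sections.append((content, metadata))
        body.clear()

    for line in markdown.splitlines():
        # Track fenced code so "# comment" lines inside code are not read as headers
        if line.lstrip().startswith("`" * 3):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces same-level and deeper entries in the path
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = (
    "# Guide\n\nIntro text.\n\n"
    "## Setup\n\nInstall it.\n\n"
    "### Linux\n\nUse apt.\n\n"
    "## Usage\n\nRun it."
)
sections = split_markdown_by_headers(doc)
for content, metadata in sections:
    print(metadata, "->", content)
```

Note how the "Usage" section drops the stale "Linux" entry: the path always reflects the headers that are actually in scope for that chunk.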

For longer Markdown documents where header-based chunks are still too large, chain it with RecursiveCharacterTextSplitter:

from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter
)

# First split by headers
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
header_chunks = header_splitter.split_text(long_markdown)

# Then split large sections further
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
final_chunks = char_splitter.split_documents(header_chunks)
# Metadata from header splitting is preserved!

This two-stage approach preserves structural metadata while keeping chunk sizes manageable.

4. CodeTextSplitter

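Code is fundamentally different from prose. The natural unit is a function, method, or class, not a paragraph. LangChain has no class literally named CodeTextSplitter; code-aware splitting is exposed through RecursiveCharacterTextSplitter.from_language, which loads language-specific separators to respect code structure.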

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Python code splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

python_code = '''
import os
from typing import List

def load_documents(path: str) -> List[str]:
    """Load all text files from a directory."""
    documents = []
    for filename in os.listdir(path):
        if filename.endswith(".txt"):
            with open(os.path.join(path, filename)) as f:
                documents.append(f.read())
    return documents

class DocumentProcessor:
    def __init__(self, chunk_size: int = 500):
        self.chunk_size = chunk_size
    
    def process(self, documents: List[str]) -> List[str]:
        """Process and chunk documents."""
        return [doc[:self.chunk_size] for doc in documents]

def main():
    docs = load_documents("./data")
    processor = DocumentProcessor(chunk_size=400)
    chunks = processor.process(docs)
    print(f"Processed {len(chunks)} chunks")
'''

chunks = python_splitter.split_text(python_code)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk[:200])
    print("---")

# Supported languages
print("Supported:", [lang.value for lang in Language])
# Includes python, js, ts, markdown, latex, html, sol, rust, go, cpp, c, scala, ruby, cobol, lua,
# among others; the exact list depends on your installed version

The Python separators try to split on class definitions, function definitions, and method definitions first, so you rarely get a chunk that cuts through a function body.
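As a rough picture of what "split on definitions first" buys you, the sketch below cuts a source string only at top-level class and def lines. The regex and the helper name split_python_top_level are my own illustration, not the separator list LangChain actually ships.

```python
import re
from typing import List


def split_python_top_level(source: str) -> List[str]:
    """Cut before each unindented 'class' or 'def' so every block stays intact."""
    # The lookahead keeps the keyword attached to the block that follows it
    parts = re.split(r"(?m)^(?=(?:class|def)\s)", source)
    return [part.strip("\n") for part in parts if part.strip()]


sample_source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
    "\n"
    "def c():\n"
    "    return 3\n"
)

blocks = split_python_top_level(sample_source)
for block in blocks:
    print(block)
    print("---")
```

The indented method m stays inside class B because only unindented definitions count as boundaries here; the real splitter goes further by falling back to finer separators when a block is still too large.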

For JavaScript/TypeScript (common in full-stack codebases):

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=800,
    chunk_overlap=80
)

ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS,
    chunk_size=800,
    chunk_overlap=80
)

For a code search application built on this pattern, the build AI agent with LangChain post shows how code chunks can feed a tool-using agent.

5. HTMLHeaderTextSplitter

For HTML documents — scraped web pages, exported documentation, HTML reports — this splitter preserves the document's heading hierarchy in metadata, similar to MarkdownHeaderTextSplitter.

from langchain.text_splitter import HTMLHeaderTextSplitter

html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>LangChain Documentation</h1>
<p>LangChain is a framework for building LLM applications.</p>

<h2>Getting Started</h2>
<p>Install LangChain with pip install langchain.</p>

<h2>Core Components</h2>
<h3>Chains</h3>
<p>Chains connect LLM calls with other components.</p>
<h3>Agents</h3>
<p>Agents use LLMs to make decisions about tool use.</p>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(html_content)

for chunk in chunks:
    print("Content:", chunk.page_content)
    print("Metadata:", chunk.metadata)
    print("---")

This splitter is especially useful for web scraping pipelines where you've collected HTML pages and want to preserve navigational context in each chunk's metadata.
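The same header-path idea can be sketched with nothing but the standard library's html.parser. This is a toy version under my own assumptions (the HeadingTracker class is hypothetical, it only looks at h1 to h3 and p tags, and it ignores text inside nested inline tags), so treat it as a mental model rather than a replacement for the splitter.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class HeadingTracker(HTMLParser):
    """Collect paragraph text together with the headings currently in scope."""

    def __init__(self) -> None:
        super().__init__()
        self.path: Dict[int, str] = {}  # heading level -> heading text
        self.current_tag: Optional[str] = None
        self.chunks: List[Tuple[str, Dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in ("h1", "h2", "h3"):
            level = int(self.current_tag[1])
            # Entering a heading drops same-level and deeper headings from the path
            self.path = {lvl: t for lvl, t in self.path.items() if lvl < level}
            self.path[level] = text
        elif self.current_tag == "p":
            metadata = {f"Header {lvl}": t for lvl, t in sorted(self.path.items())}
            self.chunks.append((text, metadata))


sample_html = (
    "<h1>Docs</h1><p>Intro.</p>"
    "<h2>Start</h2><p>Install.</p>"
    "<h2>Core</h2><h3>Chains</h3><p>Chains link calls.</p>"
)

tracker = HeadingTracker()
tracker.feed(sample_html)
for text, metadata in tracker.chunks:
    print(metadata, "->", text)
```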

6. HTMLSectionSplitter

Similar to HTMLHeaderTextSplitter, but splits on <section>, <article>, and <div> tags in addition to headings. Better for modern HTML where semantic sectioning elements are used.

from langchain_text_splitters import HTMLSectionSplitter

html_with_sections = """
<article>
  <section>
    <h2>Introduction</h2>
    <p>This covers the basics of LangChain.</p>
  </section>
  <section>
    <h2>Advanced Usage</h2>
    <p>Advanced patterns include custom chains and agents.</p>
  </section>
</article>
"""

splitter = HTMLSectionSplitter(
    headers_to_split_on=[("h2", "section_title")]
)
chunks = splitter.split_text(html_with_sections)

7. TokenTextSplitter

from langchain.text_splitter import TokenTextSplitter

# Uses tiktoken under the hood
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # OpenAI's encoding
    chunk_size=512,    # tokens
    chunk_overlap=50   # tokens
)

long_text = "..." * 5000  # Your long document here

chunks = splitter.split_text(long_text)
print(f"Number of chunks: {len(chunks)}")

# Verify token counts
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
for i, chunk in enumerate(chunks[:3]):
    token_count = len(enc.encode(chunk))
    print(f"Chunk {i+1}: {token_count} tokens")
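
The chunk_size and chunk_overlap arguments above describe a sliding window over the token sequence. A minimal sketch of that windowing, using plain list items as stand-in tokens instead of real tiktoken IDs (an assumption to keep it dependency-free; window_tokens is a hypothetical helper):

```python
from typing import List, Sequence


def window_tokens(
    tokens: Sequence[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Slide a fixed-size window over tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the sequence
    return windows


fake_tokens = list(range(10))  # stand-in for encoded token IDs
windows = window_tokens(fake_tokens, chunk_size=4, chunk_overlap=1)
print(windows)
```

Each window repeats the last token of the previous one, which is all chunk_overlap means at the token level. Because the cut positions ignore sentence boundaries, this style of splitting trades coherence for exact size control.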

8. SentenceTransformersTokenTextSplitter

For projects using sentence-transformers models (BERT-based, not OpenAI), this splitter aligns chunk sizes with the model's tokenizer rather than tiktoken.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Sized for all-MiniLM-L6-v2 (inputs beyond 256 word pieces are truncated by default)
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    tokens_per_chunk=256,
    chunk_overlap=25
)

chunks = splitter.split_text(text)

# Each chunk will fit within the model's context window
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([c for c in chunks])
print(f"Embedded {len(chunks)} chunks")

For more on open-source embedding models, the Hugging Face transformers tutorial covers the full model selection process.

9. SemanticChunker

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95  # Split at the 95th percentile of similarity drops
)

long_article = """
[Your long, multi-topic document here]
"""

chunks = splitter.split_text(long_article)

print(f"Created {len(chunks)} semantic chunks")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(chunk[:200])

Three breakpoint strategies:

"percentile": Splits where the similarity drop is in the top N percentile. Good for documents with clear topic shifts.
"standard_deviation": Splits where similarity drops more than N standard deviations below the mean. Good for consistent documents.
"interquartile": Uses the IQR method. More robust to outliers.

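The percentile strategy is easier to reason about with numbers in front of you. The sketch below assumes the embedding distances between consecutive sentences are already computed (the values are invented for illustration) and shows only the thresholding step; percentile and split_at_breakpoints are hypothetical helpers, not SemanticChunker internals.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Linear-interpolation percentile over a small list of floats."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_at_breakpoints(
    sentences: List[str], distances: List[float], pct: float = 95
) -> List[str]:
    """distances[i] is the gap between sentences[i] and sentences[i + 1]."""
    if not distances:
        return [" ".join(sentences)]
    threshold = percentile(distances, pct)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["A1.", "A2.", "A3.", "B1.", "B2."]
distances = [0.10, 0.12, 0.80, 0.11]  # made-up distances; the spike marks a topic shift

semantic_chunks = split_at_breakpoints(sentences, distances, pct=75)
print(semantic_chunks)
```

Raising the percentile makes splits rarer (fewer, larger chunks); lowering it makes the chunker more eager to cut.
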
10. LatexTextSplitter

For academic or scientific documents written in LaTeX, the dedicated splitter respects LaTeX structure:

from langchain.text_splitter import LatexTextSplitter

latex_text = r"""
\section{Introduction}
This paper presents a novel approach to ...

\subsection{Background}
Previous work on retrieval-augmented generation ...

\section{Methodology}
We propose the following architecture ...
"""

splitter = LatexTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(latex_text)

for chunk in chunks:
    print(chunk[:200])
    print("---")

Choosing the Right Splitter: Decision Guide

Here's the practical decision flow I follow:

Production Pipeline: Multi-Format Document Ingestion

Real applications have to handle multiple document types simultaneously. Here's a routing pattern:

from pathlib import Path
from typing import List
from langchain.schema import Document
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    TokenTextSplitter,
    Language
)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

EXTENSION_TO_SPLITTER = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".html": "html",
    ".htm": "html",
    ".tex": "latex",
}

def get_splitter_for_file(
    file_path: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    use_semantic: bool = False,
    embeddings=None
):
    """Return the appropriate text splitter based on file extension."""
    ext = Path(file_path).suffix.lower()
    splitter_type = EXTENSION_TO_SPLITTER.get(ext, "default")
    
    if splitter_type == "markdown":
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "h1"), ("##", "h2"), ("###", "h3")
            ]
        )
    
    elif splitter_type == "python":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
    
    elif splitter_type == "javascript":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.JS,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
    
    elif splitter_type == "typescript":
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.TS,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
    
    elif use_semantic and embeddings:
        return SemanticChunker(
            embeddings=embeddings,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=95
        )
    
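    else:
        # Default route. The "html" and "latex" entries in EXTENSION_TO_SPLITTER
        # also land here, because no dedicated branch is defined for them above.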
        return RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )


def chunk_documents(
    documents: List[Document],
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    use_semantic_for_long: bool = False
) -> List[Document]:
    """
    Chunk a list of documents using the appropriate splitter
    for each file type.
    """
    embeddings = OpenAIEmbeddings() if use_semantic_for_long else None
    all_chunks = []
    
    for doc in documents:
        source = doc.metadata.get("source", "")
        
        # For very short documents, skip chunking
        if len(doc.page_content) < 200:
            all_chunks.append(doc)
            continue
        
        # Use semantic chunking for long unstructured documents
        use_semantic = (
            use_semantic_for_long
            and len(doc.page_content) > 5000
            and Path(source).suffix.lower() not in EXTENSION_TO_SPLITTER
        )
        
        splitter = get_splitter_for_file(
            file_path=source,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            use_semantic=use_semantic,
            embeddings=embeddings
        )
        
        try:
            if hasattr(splitter, 'split_documents'):
                chunks = splitter.split_documents([doc])
            else:
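                pieces = splitter.split_text(doc.page_content)
                chunks = []
                for i, piece in enumerate(pieces):
                    # MarkdownHeaderTextSplitter returns Documents (with header
                    # metadata); plain text splitters return strings
                    if isinstance(piece, Document):
                        content, extra = piece.page_content, piece.metadata
                    else:
                        content, extra = piece, {}
                    chunks.append(Document(
                        page_content=content,
                        metadata={**doc.metadata, **extra, "chunk_index": i}
                    ))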
            all_chunks.extend(chunks)
            
        except Exception as e:
            print(f"Warning: Chunking failed for {source}: {e}")
            # Fall back to default splitter
            fallback = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size, chunk_overlap=chunk_overlap
            )
            chunks = fallback.split_documents([doc])
            all_chunks.extend(chunks)
    
    print(f"Chunked {len(documents)} documents into {len(all_chunks)} chunks")
    return all_chunks


# Usage example
if __name__ == "__main__":
    from langchain_community.document_loaders import DirectoryLoader
    
    loader = DirectoryLoader(
        "./knowledge_base/",
        glob="**/*",
        show_progress=True
    )
    documents = loader.load()
    
    chunks = chunk_documents(
        documents=documents,
        chunk_size=400,
        chunk_overlap=40,
        use_semantic_for_long=False  # Set True for better quality, slower
    )
    
    # Group chunks by source type
    from collections import Counter
    source_types = Counter(
        Path(c.metadata.get("source", "")).suffix 
        for c in chunks
    )
    print("Chunks by file type:", dict(source_types))

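This production pipeline routes Markdown, Python, JavaScript, and TypeScript files to a dedicated splitter and falls back to RecursiveCharacterTextSplitter for everything else. That fallback currently includes the HTML and LaTeX extensions, which the mapping lists but the router does not handle separately.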

For how these chunks feed into a larger agent system, the post on AI agent memory and planning covers the retrieval-memory connection.

Evaluating Chunk Quality

You can actually measure chunk quality before spending money on embeddings. A few quick checks:

from typing import List
import statistics

def analyze_chunks(chunks: List[str]) -> dict:
    """Quick quality analysis of chunk output."""
    lengths = [len(c) for c in chunks]
    
    # Check for very short chunks (likely noise)
    short_chunks = [c for c in chunks if len(c) < 50]
    
    # Check for chunks that look cut off (end mid-sentence)
    def ends_cleanly(text: str) -> bool:
        # A trailing newline counts as a clean break; test it before stripping,
        # because rstrip() would remove it
        if text.endswith('\n'):
            return True
        stripped = text.rstrip()
        return stripped.endswith(('.', '?', '!', '```'))
    
    clean_endings = sum(1 for c in chunks if ends_cleanly(c))
    
    return {
        "total_chunks": len(chunks),
        "avg_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "std_dev": statistics.stdev(lengths) if len(lengths) > 1 else 0,
        "short_chunks_pct": len(short_chunks) / len(chunks) * 100,
        "clean_endings_pct": clean_endings / len(chunks) * 100,
        "min_length": min(lengths),
        "max_length": max(lengths)
    }

# Use it
stats = analyze_chunks([c.page_content for c in chunks])
print(f"Quality stats: {stats}")
# Target: clean_endings_pct > 70%, short_chunks_pct < 5%

Target metrics for a healthy chunk set: clean_endings_pct above 70%, short_chunks_pct below 5%, and std_dev below 200 characters (consistent chunk sizes).

For the embedding and storage step after chunking, see the semantic search tutorial for how chunk quality translates to retrieval performance.

Conclusion

For the next step after splitting, the LangChain tutorial 2025 walks through embedding and storing your chunks for retrieval.
