Build a LangChain Document Q&A System (PDF, HTML, DOCX)

⚡ Quick Answer

Build a complete LangChain document Q&A system that loads PDF, HTML, and DOCX files with PyPDFLoader, RecursiveCharacterTextSplitter, and a full retrieval pipeline.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #Document QA #PDF #RAG #Vector Search

Most enterprise AI projects I hear about are fundamentally document Q&A problems. The company has thousands of PDFs, Word docs, HTML pages, and internal wikis. Nobody can find anything. They want to ask questions and get answers from those documents, with citations. It's a solved problem in principle; getting it to work reliably across messy real-world documents is where things get interesting.

This guide builds a complete document Q&A system from scratch. We'll load PDFs, HTML, and DOCX files, split them intelligently, store them in a vector index, and build a retrieval chain that answers questions with source attribution. I'll also cover the loading decisions that trip people up — table extraction, scanned PDFs, large files — because the loading layer is where most document Q&A systems fail in practice.

For background on the RAG architecture we're building, the RAG system tutorial is the conceptual foundation. For vector storage details, the vector database guide covers the options. And the Build AI agent with LangChain guide shows how to turn this Q&A system into a full agent.

The Document Q&A Architecture

The pipeline has four stages:

Load: Read documents from various formats into LangChain Document objects
Split: Break large documents into chunks that fit in the LLM context
Index: Embed chunks and store in a vector database
Query: Embed the question, retrieve relevant chunks, generate an answer

Each stage has choices to make. We'll go through them all.
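
To make the four stages concrete before bringing in LangChain, here is a toy sketch in plain Python. It is not how LangChain implements any stage: the word-count "embedding", the `Chunk` class, and every function name are assumptions made for illustration, and the final step returns the best-matching chunk instead of calling an LLM.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    text: str
    source: str


def load(raw_files: dict[str, str]) -> list[Chunk]:
    # Stage 1 (Load): one Chunk per "file"; here the files are in-memory strings.
    return [Chunk(text=body, source=name) for name, body in raw_files.items()]


def split(docs: list[Chunk]) -> list[Chunk]:
    # Stage 2 (Split): naive split on blank lines; real splitters are smarter.
    return [
        Chunk(text=para.strip(), source=doc.source)
        for doc in docs
        for para in doc.text.split("\n\n")
        if para.strip()
    ]


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts instead of a learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def index(chunks: list[Chunk]) -> list[tuple[Counter, Chunk]]:
    # Stage 3 (Index): store each chunk next to its vector.
    return [(embed(c.text), c) for c in chunks]


def query(question: str, idx: list[tuple[Counter, Chunk]], k: int = 1) -> list[Chunk]:
    # Stage 4 (Query): embed the question and rank chunks by similarity.
    q = embed(question)
    ranked = sorted(idx, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


files = {
    "report.txt": "Revenue grew in the third quarter.\n\nHeadcount stayed flat.",
    "notes.txt": "The office moves in spring.",
}
store = index(split(load(files)))
top = query("How did revenue change?", store)[0]
print(top.source, "->", top.text)
```

The rest of this guide replaces each toy function with a real component: loaders for `load`, a text splitter for `split`, an embedding model plus vector store for `index`, and a retrieval chain for `query`.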

Installation

pip install langchain langchain-openai langchain-community
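pip install pypdf unstructured python-docx docx2txt  # docx2txt is required by Docx2txtLoader
pip install faiss-cpu chromadb  # FAISS for the main index; chromadb for the Chroma examples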
pip install beautifulsoup4 lxml

# For table extraction from PDFs
pip install pdfplumber

# For better unstructured document processing
pip install "unstructured[all-docs]"

Stage 1: Document Loaders

PyPDFLoader (Fast, Standard PDFs)

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("documents/annual_report_2025.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(f"First page metadata: {pages[0].metadata}")
# {'source': 'documents/annual_report_2025.pdf', 'page': 0}

print(f"First 500 chars: {pages[0].page_content[:500]}")

# PyPDFLoader loads per-page — each Document is one page
# Good for: standard text PDFs, preserves page numbers in metadata
# Bad for: scanned PDFs (no OCR), complex tables
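
The metadata shown above has `'page': 0` for the first page, so the page index is zero-based. If you show citations to users, you will probably want the number printed on the page instead. A minimal sketch: `cite` is a hypothetical helper, and the metadata dicts are hand-written to mimic the shape shown above, not produced by a loader.

```python
from pathlib import Path


def cite(metadata: dict) -> str:
    """Build a human-readable citation from loader-style metadata."""
    name = Path(metadata.get("source", "unknown")).name
    page = metadata.get("page")
    if page is None:
        return name  # formats without pages (HTML, DOCX) get the filename only
    return f"{name}, p. {page + 1}"  # zero-based index -> one-based page number


print(cite({"source": "documents/annual_report_2025.pdf", "page": 0}))
print(cite({"source": "documents/product_manual.html"}))
```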

PDFPlumberLoader (Best for Tables)

from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("documents/financial_report.pdf")
docs = loader.load()

# PDFPlumber preserves table structure better
# Tables are extracted as text with spacing preserved
print(docs[0].page_content[:1000])

# For explicit table extraction:
import pdfplumber

def extract_tables_from_pdf(pdf_path: str) -> list:
    """Extract tables as list of lists."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    "page": page_num + 1,
                    "data": table
                })
    return tables

tables = extract_tables_from_pdf("documents/financial_report.pdf")
for t in tables:
    print(f"Page {t['page']}: {len(t['data'])} rows x {len(t['data'][0])} cols")
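
The extracted tables are plain lists of rows, which are awkward to embed directly. One option is to render each table as Markdown text before it goes into a chunk. The sketch below assumes the first row is the header and that empty cells may come back as `None`; `table_to_markdown` is a hypothetical helper, not part of pdfplumber or LangChain.

```python
def table_to_markdown(rows: list[list]) -> str:
    """Render a list-of-lists table as Markdown; the first row is the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell) -> str:
        # Treat None as empty; newlines and pipes would break the row layout.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def line(cells) -> str:
        # Pad short rows so every row has the same number of columns.
        padded = list(cells) + [None] * (width - len(cells))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    header, *body = rows
    out = [line(header), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(r) for r in body)
    return "\n".join(out)


sample = [["Quarter", "Revenue"], ["Q3", "1,200"], ["Q4", None]]
print(table_to_markdown(sample))
```

The resulting string can be wrapped in a Document together with the page number, so that table content is retrievable alongside the prose.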

UnstructuredPDFLoader (Handles Complex Layouts)

from langchain_community.document_loaders import UnstructuredPDFLoader

# Basic mode
loader = UnstructuredPDFLoader(
    "documents/complex_layout.pdf",
    mode="elements"  # "single" or "elements" or "paged"
)
elements = loader.load()

# "elements" mode gives you granular control — each text element is a separate Document
for elem in elements[:5]:
    print(f"Type: {elem.metadata.get('category', 'unknown')}")
    print(f"Content: {elem.page_content[:100]}")
    print("---")

# For scanned PDFs: needs the Tesseract OCR engine installed on the system,
# in addition to: pip install pytesseract pillow
loader_ocr = UnstructuredPDFLoader(
    "documents/scanned_contract.pdf",
    strategy="hi_res",  # layout-aware parsing; runs OCR on scanned pages
    languages=["eng"]
)
docs_ocr = loader_ocr.load()

HTML Loaders (Web Pages and HTML Files)

from langchain_community.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader

# BSHTMLLoader — faster, uses BeautifulSoup
loader = BSHTMLLoader("documents/product_manual.html")
docs = loader.load()
print(f"Loaded {len(docs)} documents from HTML")
print(docs[0].page_content[:500])

# UnstructuredHTMLLoader — better structure preservation
unstructured_loader = UnstructuredHTMLLoader("documents/complex_page.html")
docs = unstructured_loader.load()

# Loading from URL
from langchain_community.document_loaders import WebBaseLoader

web_loader = WebBaseLoader(
    web_paths=["https://docs.langchain.com/docs/"],
    bs_kwargs={
        "parse_only": None,  # parse everything
    }
)
web_docs = web_loader.load()
print(f"Loaded {len(web_docs)} web pages")

# Multiple URLs
multi_loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2"
])

UnstructuredWordDocumentLoader (DOCX)

from langchain_community.document_loaders import UnstructuredWordDocumentLoader, Docx2txtLoader

# Docx2txt — simple and reliable
loader = Docx2txtLoader("documents/meeting_notes.docx")
docs = loader.load()
print(docs[0].page_content[:500])

# Unstructured — preserves more structure (tables, headers, etc.)
unstructured_loader = UnstructuredWordDocumentLoader(
    "documents/contract.docx",
    mode="elements"
)
elements = unstructured_loader.load()

# Check element types
categories = set(e.metadata.get("category") for e in elements)
print("Element types found:", categories)
# {'Title', 'NarrativeText', 'Table', 'ListItem', 'Header'}

# Process different element types differently
tables = [e for e in elements if e.metadata.get("category") == "Table"]
headings = [e for e in elements if e.metadata.get("category") == "Title"]
print(f"Found {len(tables)} tables, {len(headings)} headings")
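
Once elements carry a category, a common next step is to group body text under the heading it belongs to, so each chunk keeps its section context. The sketch below shows one way to do that. `Element` is a hypothetical stand-in with the same `page_content` and `metadata` attributes as a Document, and `group_by_title` is an illustrative helper, not a LangChain or Unstructured function.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a Document produced in "elements" mode."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_title(elements):
    """Start a new section at each Title and attach the elements that follow it."""
    sections = []
    current = {"title": None, "body": []}
    for elem in elements:
        if elem.metadata.get("category") == "Title":
            # Close the previous section unless it is the empty initial one.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": elem.page_content, "body": []}
        else:
            current["body"].append(elem.page_content)
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


sample = [
    Element("Preamble text.", {"category": "NarrativeText"}),
    Element("Scope", {"category": "Title"}),
    Element("Applies to all staff.", {"category": "NarrativeText"}),
    Element("Contractors included", {"category": "ListItem"}),
    Element("Term", {"category": "Title"}),
    Element("Twelve months.", {"category": "NarrativeText"}),
]
for section in group_by_title(sample):
    print(section["title"], "->", section["body"])
```

Text that appears before the first heading ends up in a section whose title is `None`, which makes it easy to spot cover pages and preambles.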

Loading Multiple Document Types at Once

from pathlib import Path
from typing import List
from langchain_core.documents import Document

def load_documents_from_directory(directory: str) -> List[Document]:
    """Load all supported document types from a directory."""
    from langchain_community.document_loaders import (
        PyPDFLoader, Docx2txtLoader, BSHTMLLoader,
        TextLoader, CSVLoader
    )
    
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": Docx2txtLoader,
        ".html": BSHTMLLoader,
        ".txt": TextLoader,
        ".csv": CSVLoader
    }
    
    all_docs = []
    directory_path = Path(directory)
    
    for file_path in directory_path.rglob("*"):
        if file_path.is_file():
            suffix = file_path.suffix.lower()
            if suffix in loaders:
                try:
                    loader_class = loaders[suffix]
                    loader = loader_class(str(file_path))
                    docs = loader.load()
                    
                    # Add filename to metadata
                    for doc in docs:
                        doc.metadata["filename"] = file_path.name
                        doc.metadata["file_type"] = suffix
                    
                    all_docs.extend(docs)
                    print(f"Loaded {len(docs)} docs from {file_path.name}")
                    
                except Exception as e:
                    print(f"Failed to load {file_path.name}: {e}")
    
    return all_docs

docs = load_documents_from_directory("./documents")
print(f"Total documents loaded: {len(docs)}")
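
The function above silently ignores files whose suffix is not in the loader map. Before a long ingestion run, it can help to do a dry run that lists what would be loaded and what would be skipped. This is a sketch using only the standard library; `plan_load` and the sample filenames are made up for illustration.

```python
import tempfile
from pathlib import Path

# Same suffixes as the loader map above
SUPPORTED = {".pdf", ".docx", ".html", ".txt", ".csv"}


def plan_load(directory: str, supported=SUPPORTED):
    """Return (to_load, skipped) lists of paths without opening any file."""
    to_load, skipped = [], []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        # Lowercase the suffix so "a.PDF" is treated like "a.pdf".
        (to_load if path.suffix.lower() in supported else skipped).append(path)
    return to_load, skipped


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.PDF", "b.docx", "sub/c.txt", "d.pptx", "README"]:
        (root / name).write_text("placeholder")
    to_load, skipped = plan_load(tmp)
    print("load:", sorted(p.name for p in to_load))
    print("skip:", sorted(p.name for p in skipped))
```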

Stage 2: Splitting Documents

Raw documents are too large for LLM context windows. We split them into chunks that preserve semantic coherence.

# Text splitters live in the langchain-text-splitters package (installed with langchain)
from langchain_text_splitters import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter is the right default for most use cases
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # characters per chunk
    chunk_overlap=150,    # overlap to preserve context at boundaries
    length_function=len,
    separators=[
        "\n\n",    # Try paragraph breaks first
        "\n",      # Then line breaks
        ". ",      # Then sentence boundaries
        " ",       # Then words
        ""         # Finally characters
    ]
)

chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

# Inspect the first chunk (a fixed index like 10 fails on small collections)
print("\nSample chunk:")
print(chunks[0].page_content)
print("\nMetadata:", chunks[0].metadata)
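
To see why the separator order matters, here is a simplified sketch of the recursive idea in plain Python: split on the coarsest separator first, pack the pieces into chunks, and fall back to finer separators only for pieces that are still too long. This is not LangChain's implementation. It omits overlap, drops the separator at chunk boundaries, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)  # close the chunk that was being built
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the limit allows. It has two sentences.\n\n"
    "End."
)
pieces = recursive_split(text, 40, ["\n\n", "\n", ". ", " ", ""])
for piece in pieces:
    print(len(piece), repr(piece))
```

Short paragraphs survive intact, and only the oversized paragraph is broken at sentence and then word boundaries, which is the behavior the separator list above is designed to produce.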

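For Markdown documentation or Python source files, use a format-aware splitter:

from langchain_text_splitters import MarkdownTextSplitter, PythonCodeTextSplitter

# Placeholder inputs; in practice, read these from your own files
markdown_content = "# Setup\n\nInstall the package.\n\n## Usage\n\nRun the CLI."
python_code = "def add(a, b):\n    return a + b\n\n\nclass Greeter:\n    def hello(self):\n        return 'hi'\n"

# Markdown-aware splitting (prefers to break at headings and paragraphs)
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
md_chunks = md_splitter.create_documents([markdown_content])

# Python code splitting (prefers to break at class and function definitions)
code_splitter = PythonCodeTextSplitter(chunk_size=500, chunk_overlap=50)
code_chunks = code_splitter.create_documents([python_code])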

Stage 3: Indexing with Embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
import os

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# FAISS — good for local development and medium-sized collections
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

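# Save and reload
vectorstore.save_local("faiss_index")
# Later (loading uses pickle, so only enable this for index files you created yourself):
# vectorstore = FAISS.load_local(
#     "faiss_index", embeddings, allow_dangerous_deserialization=True
# )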

# Chroma — persistent, good for larger collections
vectorstore_chroma = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="chroma_db",
    collection_name="document_qa"
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance — reduces redundancy
    search_kwargs={
        "k": 6,                # number of chunks to retrieve
        "fetch_k": 20,         # candidates before MMR filtering
        "lambda_mult": 0.5     # diversity vs relevance balance
    }
)

# Test retrieval
test_docs = retriever.invoke("What was the revenue in Q3 2025?")
for doc in test_docs:
    print(f"Source: {doc.metadata.get('source', 'unknown')}, Page: {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")
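
To build intuition for what `search_type="mmr"` and `lambda_mult` do, here is a small sketch of the standard Maximal Marginal Relevance selection rule in plain Python. It is written from the textbook formulation; LangChain's own implementation may differ in details. The vectors are made-up two-dimensional examples, and `mmr` is a hypothetical function name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.3]
docs = [
    [1.0, 0.0],   # index 0: near-duplicate of index 1
    [1.0, 0.02],  # index 1: closest to the query
    [0.7, 0.7],   # index 2: less similar, but different content
]
print("balanced:", mmr(query, docs, k=2, lambda_mult=0.5))
print("relevance only:", mmr(query, docs, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns both near-duplicates. At 0.5, the second slot goes to the less similar but non-redundant document.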

Stage 4: The Q&A Chain

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Prompt with explicit instructions for source-grounded answers
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert at answering questions based on provided documents.
    
Answer the question using ONLY the information from the context below.
If the answer isn't in the context, say "I don't have enough information in the provided documents to answer this."

Always cite your sources by mentioning the document name and page number when available.
Be precise and factual."""),
    ("human", """Context:
{context}

Question: {question}""")
])

def format_docs(docs):
    """Format retrieved documents with source attribution."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("filename", doc.metadata.get("source", "unknown"))
        page = doc.metadata.get("page", "N/A")
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)

# Build the complete RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | qa_prompt
    | llm
    | StrOutputParser()
)

# Query the system
answer = rag_chain.invoke("What were the main findings of the Q3 2025 report?")
print(answer)
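
If the pipe syntax feels opaque, it may help to see the same data flow written as ordinary function calls. The sketch below mirrors the shape of the chain above with stand-ins: `fake_retriever` and `fake_llm` are hypothetical stubs, the sample documents are invented, and nothing here calls LangChain or a model.

```python
def fake_retriever(question: str) -> list[dict]:
    # Hypothetical stand-in: returns fixed, made-up "documents" for any question.
    return [
        {"source": "annual_report_2025.pdf", "page": 3, "text": "Sample passage one."},
        {"source": "annual_report_2025.pdf", "page": 7, "text": "Sample passage two."},
    ]


def format_context(docs: list[dict]) -> str:
    # Same layout as format_docs above: a source header, then the chunk text.
    return "\n\n".join(
        f"[Source {i}: {d['source']}, Page {d['page']}]\n{d['text']}"
        for i, d in enumerate(docs, 1)
    )


def build_prompt(inputs: dict) -> str:
    # Plays the role of the prompt template's human message.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model call: reports what it received.
    return f"(model saw {prompt.count('[Source')} sources)"


def run_chain(question: str) -> str:
    # Step 1: the dict fans the same input out to both branches.
    inputs = {
        "context": format_context(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                  # RunnablePassthrough()
    }
    # Steps 2 and 3: fill the prompt, then call the model.
    return fake_llm(build_prompt(inputs))


print(run_chain("What changed in Q3?"))
```

The key point is that the question string is used twice: once to drive retrieval and once, unchanged, as the question in the prompt.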

Adding Conversation History

from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Contextualize the question given chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", """Given a chat history and the latest user question which might reference 
    context in the chat history, formulate a standalone question which can be understood 
    without the chat history. Do NOT answer the question, just reformulate it if needed."""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# QA prompt with history
qa_with_history_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question based only on the context provided.
    If you don't know the answer, say so.
    
Context: {context}"""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

question_answer_chain = create_stuff_documents_chain(llm, qa_with_history_prompt)
rag_chain_with_history = create_retrieval_chain(
    history_aware_retriever,
    question_answer_chain
)

# Use with conversation history
chat_history = []

def chat(question: str) -> str:
    result = rag_chain_with_history.invoke({
        "input": question,
        "chat_history": chat_history
    })
    
    chat_history.extend([
        HumanMessage(content=question),
        AIMessage(content=result["answer"])
    ])
    
    return result["answer"]

# Multi-turn conversation
print(chat("What is the total revenue mentioned in the documents?"))
print(chat("What about the expenses?"))  # "expenses" understood in context of revenue
print(chat("Calculate the profit margin based on those numbers."))
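
One thing to watch with this pattern: `chat_history` grows with every turn, and the whole list is sent to the model each time. A simple mitigation, which the code above does not do, is to keep only the most recent turns. The sketch uses plain `(role, text)` tuples as stand-ins for `HumanMessage` and `AIMessage`, and `trim_history` is a hypothetical helper.

```python
def trim_history(history: list, max_turns: int) -> list:
    """Keep only the most recent max_turns question/answer pairs."""
    if max_turns <= 0:
        return []
    # Each turn is two messages: the question and the answer.
    return history[-2 * max_turns:]


history = [
    ("human", "What is the total revenue?"),
    ("ai", "Answer 1"),
    ("human", "What about the expenses?"),
    ("ai", "Answer 2"),
    ("human", "And the profit margin?"),
    ("ai", "Answer 3"),
]
print(trim_history(history, max_turns=2))
```

Applied inside `chat`, this would mean passing `trim_history(chat_history, n)` instead of the full list. Older context is lost, so choose the window based on how far back your users' follow-up questions tend to reach.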

Comparison Table: PDF Loaders

Loader	Speed	Tables	Scanned PDF (OCR)	Complex Layout	Setup
PyPDFLoader	Fast	Poor	No	Poor	Easy
PDFPlumberLoader	Medium	Good	No	Good	Easy
UnstructuredPDFLoader	Slow	Excellent	Yes (hi_res)	Excellent	Complex
PDFMinerLoader	Medium	Medium	No	Medium	Easy
Amazon Textract	Medium	Excellent	Yes	Excellent	Cloud setup

For standard text PDFs in a business context, PyPDFLoader is fastest and most reliable. If your documents have important tables, switch to PDFPlumber. If you're dealing with scanned documents or complex layouts (multi-column academic papers, legal documents with annotations), UnstructuredPDFLoader with strategy="hi_res" is worth the extra setup time.
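
The guidance above can be written down as a small decision function, which is handy when a pipeline needs to pick a loader per file. This is only a sketch of the rule of thumb in this section: `choose_pdf_loader` is a hypothetical helper that returns loader names as strings, and how you detect scanned or table-heavy files is left open.

```python
def choose_pdf_loader(
    scanned: bool = False,
    complex_layout: bool = False,
    has_tables: bool = False,
) -> str:
    """Map document traits to the loader recommended in this section."""
    if scanned or complex_layout:
        # OCR and layout analysis are worth the slower, heavier setup here.
        return "UnstructuredPDFLoader(strategy='hi_res')"
    if has_tables:
        return "PDFPlumberLoader"
    return "PyPDFLoader"  # fast default for standard text PDFs


print(choose_pdf_loader())
print(choose_pdf_loader(has_tables=True))
print(choose_pdf_loader(scanned=True, has_tables=True))
```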

According to Unstructured's benchmarks, their hi_res strategy achieves 95%+ accuracy on complex PDFs compared to ~70% for basic text extraction approaches.

Production Considerations

# Metadata filtering for large document collections
from langchain_community.vectorstores import Chroma

# Store with rich metadata
chunks_with_metadata = []
for chunk in chunks:
    chunk.metadata.update({
        "department": "finance",
        "year": 2025,
        "document_type": "annual_report"
    })
    chunks_with_metadata.append(chunk)

vectorstore = Chroma.from_documents(chunks_with_metadata, embeddings)

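# Filter retrieval by metadata
# Chroma expects a single top-level operator, so multiple conditions are combined with $and
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {
            "$and": [
                {"department": "finance"},
                {"year": 2025}
            ]
        }
    }
)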
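
Conceptually, an equality filter keeps only the chunks whose metadata matches every requested field, and similarity ranking then works within that subset. The sketch below shows that matching logic on plain dicts. It says nothing about how a particular vector store applies filters internally, and `matches` is a hypothetical helper.

```python
def matches(metadata: dict, wanted: dict) -> bool:
    """True when every wanted key is present with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in wanted.items())


chunk_metadata = [
    {"department": "finance", "year": 2025, "document_type": "annual_report"},
    {"department": "finance", "year": 2024, "document_type": "annual_report"},
    {"department": "legal", "year": 2025, "document_type": "contract"},
]
wanted = {"department": "finance", "year": 2025}
kept = [m for m in chunk_metadata if matches(m, wanted)]
print(kept)
```

This is also why consistent metadata matters at ingestion time: a chunk that is missing the `year` field can never match a filter on `year`.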

For more on vector storage patterns that scale, see the vector database guide. The semantic search tutorial covers embedding and retrieval strategies in detail.

If you're integrating this Q&A system into a larger agent that also uses web tools, the AI research agent build guide shows the architecture. The LangChain tutorial 2025 also has a section on document Q&A that complements this guide.

Conclusion

Building a document Q&A system that works reliably on enterprise documents is mostly a loading and chunking problem. Get those two right and the retrieval and generation steps are straightforward. Use PyPDFLoader for standard PDFs, PDFPlumber when tables matter, and UnstructuredPDFLoader for anything complex or scanned. Split with RecursiveCharacterTextSplitter at 800 characters with 150 overlap — that's a solid default for most documents.

The conversation history pattern with create_history_aware_retriever is what separates a basic Q&A system from a useful one. Users always ask follow-up questions. Make sure your system handles them.

Ready to take this further? The RAG system tutorial covers advanced retrieval patterns, and the Deploy AI model to production guide shows how to serve this system at scale.

Frequently Asked Questions

What is the best PDF loader for LangChain with table support?

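For tables in text-based PDFs, PDFPlumberLoader is the easiest upgrade over PyPDFLoader. It uses the positions of characters and lines on the page to keep columns aligned, and pdfplumber's extract_tables() returns each table as a list of rows that you can convert to Markdown or CSV yourself. UnstructuredPDFLoader with strategy="hi_res" handles tables inside complex layouts, at the cost of more setup. For scanned PDFs you need OCR, either through UnstructuredPDFLoader's hi_res strategy (which relies on Tesseract) or a service such as Amazon Textract.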

How large can documents be in a LangChain Q&A system?

Document size is limited by your vector store capacity, not LangChain itself. For large document collections (thousands of PDFs), use Pinecone or Weaviate rather than FAISS. The splitting strategy matters more than raw size — use RecursiveCharacterTextSplitter with a chunk size of 500-1000 chars and 10-20% overlap.

How do I handle multiple documents in LangChain Q&A?

Load all documents into a single vectorstore and add metadata (filename, page number, document type) to each chunk. Use metadata filtering in your retriever to search specific documents or document types. The MultiVectorRetriever pattern also works well for large collections.

Langchain

Build a LangChain Document Q&A System (PDF, HTML, DOCX)

⚡ Quick Answer

Build a complete LangChain document Q&A system that loads PDF, HTML, and DOCX files with PyPDFLoader, RecursiveCharacterTextSplitter, and a full retrieval pipeline.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #Document QA #PDF #RAG #Vector Search

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

The Document Q&A Architecture

The pipeline has four stages:

Load: Read documents from various formats into LangChain Document objects
Split: Break large documents into chunks that fit in the LLM context
Index: Embed chunks and store in a vector database
Query: Embed the question, retrieve relevant chunks, generate an answer

Each stage has choices to make. We'll go through them all.

Installation

pip install langchain langchain-openai langchain-community
pip install pypdf unstructured python-docx
pip install faiss-cpu  # or chromadb
pip install beautifulsoup4 lxml

# For table extraction from PDFs
pip install pdfplumber

# For better unstructured document processing
pip install "unstructured[all-docs]"

Stage 1: Document Loaders

PyPDFLoader (Fast, Standard PDFs)

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("documents/annual_report_2025.pdf")
pages = loader.load()

print(f"Loaded {len(pages)} pages")
print(f"First page metadata: {pages[0].metadata}")
# {'source': 'documents/annual_report_2025.pdf', 'page': 0}

print(f"First 500 chars: {pages[0].page_content[:500]}")

# PyPDFLoader loads per-page — each Document is one page
# Good for: standard text PDFs, preserves page numbers in metadata
# Bad for: scanned PDFs (no OCR), complex tables

PDFPlumberLoader (Best for Tables)

from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("documents/financial_report.pdf")
docs = loader.load()

# PDFPlumber preserves table structure better
# Tables are extracted as text with spacing preserved
print(docs[0].page_content[:1000])

# For explicit table extraction:
import pdfplumber

def extract_tables_from_pdf(pdf_path: str) -> list:
    """Extract tables as list of lists."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    "page": page_num + 1,
                    "data": table
                })
    return tables

tables = extract_tables_from_pdf("documents/financial_report.pdf")
for t in tables:
    print(f"Page {t['page']}: {len(t['data'])} rows x {len(t['data'][0])} cols")

UnstructuredPDFLoader (Handles Complex Layouts)

from langchain_community.document_loaders import UnstructuredPDFLoader

# Basic mode
loader = UnstructuredPDFLoader(
    "documents/complex_layout.pdf",
    mode="elements"  # "single" or "elements" or "paged"
)
elements = loader.load()

# "elements" mode gives you granular control — each text element is a separate Document
for elem in elements[:5]:
    print(f"Type: {elem.metadata.get('category', 'unknown')}")
    print(f"Content: {elem.page_content[:100]}")
    print("---")

# For scanned PDFs (requires tesseract)
# pip install pytesseract pillow
loader_ocr = UnstructuredPDFLoader(
    "documents/scanned_contract.pdf",
    strategy="hi_res",  # enables OCR
    languages=["eng"]
)
docs_ocr = loader_ocr.load()

HTMLLoader (Web Pages and HTML Files)

from langchain_community.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader

# BSHTMLLoader — faster, uses BeautifulSoup
loader = BSHTMLLoader("documents/product_manual.html")
docs = loader.load()
print(f"Loaded {len(docs)} documents from HTML")
print(docs[0].page_content[:500])

# UnstructuredHTMLLoader — better structure preservation
unstructured_loader = UnstructuredHTMLLoader("documents/complex_page.html")
docs = unstructured_loader.load()

# Loading from URL
from langchain_community.document_loaders import WebBaseLoader

web_loader = WebBaseLoader(
    web_paths=["https://docs.langchain.com/docs/"],
    bs_kwargs={
        "parse_only": None,  # parse everything
    }
)
web_docs = web_loader.load()
print(f"Loaded {len(web_docs)} web pages")

# Multiple URLs
multi_loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2"
])

UnstructuredWordDocumentLoader (DOCX)

from langchain_community.document_loaders import UnstructuredWordDocumentLoader, Docx2txtLoader

# Docx2txt — simple and reliable
loader = Docx2txtLoader("documents/meeting_notes.docx")
docs = loader.load()
print(docs[0].page_content[:500])

# Unstructured — preserves more structure (tables, headers, etc.)
unstructured_loader = UnstructuredWordDocumentLoader(
    "documents/contract.docx",
    mode="elements"
)
elements = unstructured_loader.load()

# Check element types
categories = set(e.metadata.get("category") for e in elements)
print("Element types found:", categories)
# {'Title', 'NarrativeText', 'Table', 'ListItem', 'Header'}

# Process different element types differently
tables = [e for e in elements if e.metadata.get("category") == "Table"]
headings = [e for e in elements if e.metadata.get("category") == "Title"]
print(f"Found {len(tables)} tables, {len(headings)} headings")

Loading Multiple Document Types at Once

import os
from pathlib import Path
from typing import List
from langchain_core.documents import Document

def load_documents_from_directory(directory: str) -> List[Document]:
    """Load all supported document types from a directory."""
    from langchain_community.document_loaders import (
        PyPDFLoader, Docx2txtLoader, BSHTMLLoader,
        TextLoader, CSVLoader
    )
    
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": Docx2txtLoader,
        ".html": BSHTMLLoader,
        ".txt": TextLoader,
        ".csv": CSVLoader
    }
    
    all_docs = []
    directory_path = Path(directory)
    
    for file_path in directory_path.rglob("*"):
        if file_path.is_file():
            suffix = file_path.suffix.lower()
            if suffix in loaders:
                try:
                    loader_class = loaders[suffix]
                    loader = loader_class(str(file_path))
                    docs = loader.load()
                    
                    # Add filename to metadata
                    for doc in docs:
                        doc.metadata["filename"] = file_path.name
                        doc.metadata["file_type"] = suffix
                    
                    all_docs.extend(docs)
                    print(f"Loaded {len(docs)} docs from {file_path.name}")
                    
                except Exception as e:
                    print(f"Failed to load {file_path.name}: {e}")
    
    return all_docs

docs = load_documents_from_directory("./documents")
print(f"Total documents loaded: {len(docs)}")

Stage 2: Splitting Documents

Raw documents are too large for LLM context windows. We split them into chunks that preserve semantic coherence.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter is the right default for most use cases
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # characters per chunk
    chunk_overlap=150,    # overlap to preserve context at boundaries
    length_function=len,
    separators=[
        "\n\n",    # Try paragraph breaks first
        "\n",      # Then line breaks
        ". ",      # Then sentence boundaries
        " ",       # Then words
        ""         # Finally characters
    ]
)

chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")

# Inspect a chunk
print("\nSample chunk:")
print(chunks[10].page_content)
print("\nMetadata:", chunks[10].metadata)

For code documentation or technical docs with headers:

from langchain.text_splitter import MarkdownTextSplitter, PythonCodeTextSplitter

# Markdown-aware splitting
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
md_chunks = md_splitter.create_documents([markdown_content])

# Python code splitting
code_splitter = PythonCodeTextSplitter(chunk_size=500, chunk_overlap=50)
code_chunks = code_splitter.create_documents([python_code])

Stage 3: Indexing with Embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
import os

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# FAISS — good for local development and medium-sized collections
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

# Save and reload
vectorstore.save_local("faiss_index")
# Later: vectorstore = FAISS.load_local("faiss_index", embeddings)

# Chroma — persistent, good for larger collections
vectorstore_chroma = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="chroma_db",
    collection_name="document_qa"
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance — reduces redundancy
    search_kwargs={
        "k": 6,                # number of chunks to retrieve
        "fetch_k": 20,         # candidates before MMR filtering
        "lambda_mult": 0.5     # diversity vs relevance balance
    }
)

# Test retrieval
test_docs = retriever.invoke("What was the revenue in Q3 2025?")
for doc in test_docs:
    print(f"Source: {doc.metadata.get('source', 'unknown')}, Page: {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")

Stage 4: The Q&A Chain

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Prompt with explicit instructions for source-grounded answers
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert at answering questions based on provided documents.
    
Answer the question using ONLY the information from the context below.
If the answer isn't in the context, say "I don't have enough information in the provided documents to answer this."

Always cite your sources by mentioning the document name and page number when available.
Be precise and factual."""),
    ("human", """Context:
{context}

Question: {question}""")
])

def format_docs(docs):
    """Format retrieved documents with source attribution."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("filename", doc.metadata.get("source", "unknown"))
        page = doc.metadata.get("page", "N/A")
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)

# Build the complete RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | qa_prompt
    | llm
    | StrOutputParser()
)

# Query the system
answer = rag_chain.invoke("What were the main findings of the Q3 2025 report?")
print(answer)
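
It helps to see exactly what the model receives as context. The sketch below runs the same format_docs logic on two stand-in documents so you can inspect the string without calling a retriever or an LLM. FakeDoc is a hypothetical placeholder for a LangChain Document, and the file names and text are invented.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Format retrieved documents with source attribution."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        # Prefer a short filename if the loader stored one, else the full source path
        source = doc.metadata.get("filename", doc.metadata.get("source", "unknown"))
        page = doc.metadata.get("page", "N/A")
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)

docs = [
    FakeDoc("Revenue grew in Q3.",
            {"source": "reports/q3.pdf", "filename": "q3.pdf", "page": 4}),
    FakeDoc("Headcount was flat.", {"source": "wiki/hr.html"}),  # no page metadata
]

context = format_docs(docs)
print(context)
```

The second document falls back to its source path and "N/A" because HTML loaders typically do not set a page number. As far as I know, PyPDFLoader stores page as a zero-based index, so you may want to add 1 before showing it to users.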

Adding Conversation History

from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Contextualize the question given chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", """Given a chat history and the latest user question which might reference 
    context in the chat history, formulate a standalone question which can be understood 
    without the chat history. Do NOT answer the question, just reformulate it if needed."""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)

# QA prompt with history
qa_with_history_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question based only on the context provided.
    If you don't know the answer, say so.
    
Context: {context}"""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])

question_answer_chain = create_stuff_documents_chain(llm, qa_with_history_prompt)
rag_chain_with_history = create_retrieval_chain(
    history_aware_retriever,
    question_answer_chain
)

# Use with conversation history
chat_history = []

def chat(question: str) -> str:
    result = rag_chain_with_history.invoke({
        "input": question,
        "chat_history": chat_history
    })
    
    chat_history.extend([
        HumanMessage(content=question),
        AIMessage(content=result["answer"])
    ])
    
    return result["answer"]

# Multi-turn conversation
print(chat("What is the total revenue mentioned in the documents?"))
print(chat("What about the expenses?"))  # "expenses" understood in context of revenue
print(chat("Calculate the profit margin based on those numbers."))
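
One thing the chat() helper above discards is the retrieved documents. The dictionary returned by create_retrieval_chain also carries a "context" key holding the documents that were passed to the LLM, so you can list sources next to each answer. Here is a small sketch of that idea. The collect_sources helper and the fake result are hypothetical, built with SimpleNamespace so the example runs without LangChain installed.

```python
from types import SimpleNamespace

def collect_sources(result):
    """Return unique (source, page) pairs from a retrieval-chain result, in order."""
    seen = []
    for doc in result.get("context", []):
        key = (doc.metadata.get("source", "unknown"), doc.metadata.get("page", "N/A"))
        if key not in seen:
            seen.append(key)
    return seen

# Hypothetical result shaped like the output of rag_chain_with_history.invoke(...)
fake_result = {
    "input": "What is the total revenue?",
    "answer": "Total revenue is stated on page 4 of q3.pdf.",
    "context": [
        SimpleNamespace(page_content="...", metadata={"source": "q3.pdf", "page": 4}),
        SimpleNamespace(page_content="...", metadata={"source": "q3.pdf", "page": 4}),
        SimpleNamespace(page_content="...", metadata={"source": "wiki/hr.html"}),
    ],
}

for source, page in collect_sources(fake_result):
    print(f"{source} (page {page})")
```

In a real app you would call collect_sources(result) inside chat() and return the list alongside result["answer"].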

Comparison Table: PDF Loaders

Loader	Speed	Tables	Scanned PDF (OCR)	Complex Layout	Setup
PyPDFLoader	Fast	Poor	No	Poor	Easy
PDFPlumberLoader	Medium	Good	No	Good	Easy
UnstructuredPDFLoader	Slow	Excellent	Yes (hi_res)	Excellent	Complex
PDFMinerLoader	Medium	Medium	No	Medium	Easy
Amazon Textract	Medium	Excellent	Yes	Excellent	Cloud setup

According to Unstructured's benchmarks, their hi_res strategy achieves 95%+ accuracy on complex PDFs compared to ~70% for basic text extraction approaches.
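
If you want the table as a decision rule, a rough sketch might look like this. The function name and flags are my own, the returned strings are just the loader class names from the table, and the ordering (OCR and layout needs first, then tables, then speed) is one reasonable reading of it rather than a fixed rule.

```python
def pick_pdf_loader(needs_tables=False, scanned=False, complex_layout=False):
    """Suggest a loader class name based on the comparison table above."""
    if scanned or complex_layout:
        # The only local option in the table with OCR (hi_res) and strong layout handling
        return "UnstructuredPDFLoader"
    if needs_tables:
        # Good table extraction with easy setup
        return "PDFPlumberLoader"
    # Default: fastest option for plain, text-based PDFs
    return "PyPDFLoader"

print(pick_pdf_loader())                    # PyPDFLoader
print(pick_pdf_loader(needs_tables=True))   # PDFPlumberLoader
print(pick_pdf_loader(scanned=True))        # UnstructuredPDFLoader
```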

Production Considerations

# Metadata filtering for large document collections
from langchain_community.vectorstores import Chroma

# Store with rich metadata
chunks_with_metadata = []
for chunk in chunks:
    chunk.metadata.update({
        "department": "finance",
        "year": 2025,
        "document_type": "annual_report"
    })
    chunks_with_metadata.append(chunk)

vectorstore = Chroma.from_documents(chunks_with_metadata, embeddings)

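# Filter retrieval by metadata
# Recent Chroma versions expect a single top-level operator in a filter,
# so multiple conditions are combined with "$and"
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {
            "$and": [
                {"department": "finance"},
                {"year": 2025}
            ]
        }
    }
)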
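
When filters come from user input, for example a department dropdown plus a year picker, it is convenient to build the filter dictionary in one place. The helper below is a hypothetical sketch. It assumes Chroma's filter syntax, where, to my understanding, a filter with several conditions must wrap them in a single "$and" operator, while a single condition can be passed as-is.

```python
def build_chroma_filter(conditions):
    """Turn {"field": value, ...} into a Chroma-style metadata filter.

    Returns None when there is nothing to filter on, so the caller can
    leave the "filter" key out of search_kwargs entirely.
    """
    clauses = [{key: value} for key, value in conditions.items()]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

print(build_chroma_filter({"department": "finance"}))
print(build_chroma_filter({"department": "finance", "year": 2025}))
```

The result can then be passed as search_kwargs={"k": 5, "filter": ...} when creating the retriever.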

For more on vector storage patterns that scale, see Vector database guide. The semantic search tutorial covers embedding and retrieval strategies in detail.

Conclusion

Ready to take this further? RAG system tutorial covers advanced retrieval patterns, and Deploy AI model to production shows how to serve this system at scale.

Frequently Asked Questions

What is the best PDF loader for LangChain with table support?

How large can documents be in a LangChain Q&A system?

How do I handle multiple documents in LangChain Q&A?



