Build a LangChain Document Q&A System (PDF, HTML, DOCX)
Build a complete LangChain document Q&A system that loads PDF, HTML, and DOCX files with PyPDFLoader, RecursiveCharacterTextSplitter, and a full retrieval pipeline.
Most enterprise AI projects I hear about are fundamentally document Q&A problems. The company has thousands of PDFs, Word docs, HTML pages, and internal wikis. Nobody can find anything. They want to ask questions and get answers from those documents, with citations. It's a solved problem in principle; getting it to work reliably across messy real-world documents is where things get interesting.
This guide builds a complete document Q&A system from scratch. We'll load PDFs, HTML, and DOCX files, split them intelligently, store them in a vector index, and build a retrieval chain that answers questions with source attribution. I'll also cover the loading decisions that trip people up — table extraction, scanned PDFs, large files — because the loading layer is where most document Q&A systems fail in practice.
For background on the RAG architecture we're building, RAG system tutorial is the conceptual foundation. For vector storage details, Vector database guide covers the options. And Build AI agent with LangChain shows how to turn this Q&A system into a full agent.
The Document Q&A Architecture
The pipeline has four stages:
- Load: Read documents from various formats into LangChain Document objects
- Split: Break large documents into chunks that fit in the LLM context
- Index: Embed chunks and store in a vector database
- Query: Embed the question, retrieve relevant chunks, generate an answer
Each stage has choices to make. We'll go through them all.
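Before bringing in any libraries, the four stages can be sketched in plain Python. This is a toy illustration only: the word-count "embedding", the in-memory list used as an index, and every function name here are assumptions made for the example, not LangChain APIs.

```python
import math
import re
from collections import Counter

def load(raw_files: dict) -> list:
    """Stage 1: wrap each raw text in a record that carries its source."""
    return [{"text": text, "source": name} for name, text in raw_files.items()]

def split(docs: list, chunk_size: int = 80) -> list:
    """Stage 2: cut each document into fixed-size character chunks."""
    chunks = []
    for doc in docs:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            chunks.append({"text": text[start:start + chunk_size], "source": doc["source"]})
    return chunks

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def index(chunks: list) -> list:
    """Stage 3: store each chunk next to its embedding."""
    return [(embed(chunk["text"]), chunk) for chunk in chunks]

def query(question: str, idx: list, k: int = 1) -> list:
    """Stage 4 (retrieval half): embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(idx, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

files = {
    "hr.txt": "Employees accrue vacation days every month.",
    "finance.txt": "Quarterly revenue is reported in the finance dashboard.",
}
toy_index = index(split(load(files)))
top = query("Where is revenue reported?", toy_index, k=1)
print(top[0]["source"])
```

The rest of the guide replaces each toy function with a real component: document loaders, a text splitter, an embedding model with a vector store, and an LLM that turns the retrieved chunks into an answer.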
Installation
pip install langchain langchain-openai langchain-community
pip install pypdf unstructured python-docx docx2txt
pip install faiss-cpu # or chromadb
pip install beautifulsoup4 lxml
# For table extraction from PDFs
pip install pdfplumber
# For better unstructured document processing
pip install "unstructured[all-docs]"
Stage 1: Document Loaders
PyPDFLoader (Fast, Standard PDFs)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("documents/annual_report_2025.pdf")
pages = loader.load()
print(f"Loaded {len(pages)} pages")
print(f"First page metadata: {pages[0].metadata}")
# {'source': 'documents/annual_report_2025.pdf', 'page': 0}
print(f"First 500 chars: {pages[0].page_content[:500]}")
# PyPDFLoader loads per-page — each Document is one page
# Good for: standard text PDFs, preserves page numbers in metadata
# Bad for: scanned PDFs (no OCR), complex tables
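Because PyPDFLoader does no OCR, a scanned PDF usually comes back as pages with little or no text. One way to catch this early is a simple length check on the extracted text. The sketch below is a heuristic, not a LangChain feature; the function names and both thresholds are assumptions you would tune on your own documents.

```python
def find_low_text_pages(page_texts: list, min_chars: int = 50) -> list:
    """Return zero-based indexes of pages whose extracted text is suspiciously short."""
    return [
        i for i, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]

def looks_scanned(page_texts: list, min_chars: int = 50, max_empty_ratio: float = 0.5) -> bool:
    """Guess that a PDF is scanned when most of its pages yield almost no text."""
    if not page_texts:
        return False
    low = find_low_text_pages(page_texts, min_chars)
    return len(low) / len(page_texts) > max_empty_ratio

# Two blank pages and one page with real text
sample_pages = ["", "  \n ", "A full page of extracted text. " * 5]
print(find_low_text_pages(sample_pages))
print(looks_scanned(sample_pages))
```

With the loader above you could pass `[p.page_content for p in pages]` and route flagged files to an OCR-capable loader instead.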
PDFPlumberLoader (Best for Tables)
from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader("documents/financial_report.pdf")
docs = loader.load()
# PDFPlumber preserves table structure better
# Tables are extracted as text with spacing preserved
print(docs[0].page_content[:1000])
# For explicit table extraction:
import pdfplumber
def extract_tables_from_pdf(pdf_path: str) -> list:
    """Extract every table as a dict with its 1-based page number and its rows (list of lists)."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    "page": page_num + 1,
                    "data": table
                })
    return tables
tables = extract_tables_from_pdf("documents/financial_report.pdf")
for t in tables:
    # Guard against empty tables before reading the first row
    cols = len(t["data"][0]) if t["data"] else 0
    print(f"Page {t['page']}: {len(t['data'])} rows x {cols} cols")
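Tables extracted this way are lists of rows, and empty cells can come back as None, so they need to be turned into text before they can be embedded. A minimal sketch of one option, rendering each table as a Markdown table, follows. The helper name `table_to_markdown` is hypothetical, and the code assumes the first row is the header and that cells contain no pipe characters.

```python
def table_to_markdown(rows: list) -> str:
    """Render a list-of-lists table as Markdown; the first row is treated as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell) -> str:
        # Empty cells may be None; newlines inside a cell would break the row
        return "" if cell is None else str(cell).replace("\n", " ").strip()

    # Pad short rows so every row has the same number of columns
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

sample_table = [["Quarter", "Revenue"], ["Q1", "10"], ["Q2", None]]
md = table_to_markdown(sample_table)
print(md)
```

Each rendered table can then be wrapped in a Document with the page number in its metadata, so table content is retrievable alongside the surrounding prose.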
UnstructuredPDFLoader (Handles Complex Layouts)
from langchain_community.document_loaders import UnstructuredPDFLoader
# Basic mode
loader = UnstructuredPDFLoader(
    "documents/complex_layout.pdf",
    mode="elements"  # "single" or "elements" or "paged"
)
elements = loader.load()
# "elements" mode gives you granular control: each text element is a separate Document
for elem in elements[:5]:
    print(f"Type: {elem.metadata.get('category', 'unknown')}")
    print(f"Content: {elem.page_content[:100]}")
    print("---")
# For scanned PDFs (requires tesseract)
# pip install pytesseract pillow
loader_ocr = UnstructuredPDFLoader(
    "documents/scanned_contract.pdf",
    strategy="hi_res",  # enables OCR
    languages=["eng"]
)
docs_ocr = loader_ocr.load()
HTMLLoader (Web Pages and HTML Files)
from langchain_community.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader
# BSHTMLLoader — faster, uses BeautifulSoup
loader = BSHTMLLoader("documents/product_manual.html")
docs = loader.load()
print(f"Loaded {len(docs)} documents from HTML")
print(docs[0].page_content[:500])
# UnstructuredHTMLLoader — better structure preservation
unstructured_loader = UnstructuredHTMLLoader("documents/complex_page.html")
docs = unstructured_loader.load()
# Loading from URL
from langchain_community.document_loaders import WebBaseLoader
web_loader = WebBaseLoader(
    web_paths=["https://docs.langchain.com/docs/"],
    bs_kwargs={
        "parse_only": None,  # parse everything
    }
)
web_docs = web_loader.load()
print(f"Loaded {len(web_docs)} web pages")
# Multiple URLs
multi_loader = WebBaseLoader([
    "https://example.com/page1",
    "https://example.com/page2"
])
UnstructuredWordDocumentLoader (DOCX)
from langchain_community.document_loaders import UnstructuredWordDocumentLoader, Docx2txtLoader
# Docx2txt — simple and reliable
loader = Docx2txtLoader("documents/meeting_notes.docx")
docs = loader.load()
print(docs[0].page_content[:500])
# Unstructured — preserves more structure (tables, headers, etc.)
unstructured_loader = UnstructuredWordDocumentLoader(
    "documents/contract.docx",
    mode="elements"
)
elements = unstructured_loader.load()
# Check element types
categories = set(e.metadata.get("category") for e in elements)
print("Element types found:", categories)
# {'Title', 'NarrativeText', 'Table', 'ListItem', 'Header'}
# Process different element types differently
tables = [e for e in elements if e.metadata.get("category") == "Table"]
headings = [e for e in elements if e.metadata.get("category") == "Title"]
print(f"Found {len(tables)} tables, {len(headings)} headings")
Loading Multiple Document Types at Once
import os
from pathlib import Path
from typing import List
from langchain_core.documents import Document
def load_documents_from_directory(directory: str) -> List[Document]:
    """Load all supported document types from a directory."""
    from langchain_community.document_loaders import (
        PyPDFLoader, Docx2txtLoader, BSHTMLLoader,
        TextLoader, CSVLoader
    )
    # Map each supported file extension to the loader class that handles it
    loaders = {
        ".pdf": PyPDFLoader,
        ".docx": Docx2txtLoader,
        ".html": BSHTMLLoader,
        ".txt": TextLoader,
        ".csv": CSVLoader
    }
    all_docs = []
    directory_path = Path(directory)
    for file_path in directory_path.rglob("*"):
        if file_path.is_file():
            suffix = file_path.suffix.lower()
            if suffix in loaders:
                try:
                    loader_class = loaders[suffix]
                    loader = loader_class(str(file_path))
                    docs = loader.load()
                    # Add filename to metadata
                    for doc in docs:
                        doc.metadata["filename"] = file_path.name
                        doc.metadata["file_type"] = suffix
                    all_docs.extend(docs)
                    print(f"Loaded {len(docs)} docs from {file_path.name}")
                except Exception as e:
                    # One unreadable file should not abort the whole batch
                    print(f"Failed to load {file_path.name}: {e}")
    return all_docs
docs = load_documents_from_directory("./documents")
print(f"Total documents loaded: {len(docs)}")
Stage 2: Splitting Documents
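Raw documents are usually too large to pass to an LLM whole, and embedding an entire document as one vector blurs its topics together, which hurts retrieval. We split them into chunks that preserve semantic coherence.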
# Splitters live in the langchain-text-splitters package, installed as a langchain dependency
from langchain_text_splitters import RecursiveCharacterTextSplitter
# RecursiveCharacterTextSplitter is the right default for most use cases
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # characters per chunk
    chunk_overlap=150,  # overlap to preserve context at boundaries
    length_function=len,
    separators=[
        "\n\n",  # Try paragraph breaks first
        "\n",    # Then line breaks
        ". ",    # Then sentence boundaries
        " ",     # Then words
        ""       # Finally characters
    ]
)
chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")
# Inspect a chunk
print("\nSample chunk:")
print(chunks[10].page_content)
print("\nMetadata:", chunks[10].metadata)
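To see what chunk_size and chunk_overlap do, it helps to strip away the separator logic and look at a plain sliding window. The function below is a simplified illustration with a hypothetical name; RecursiveCharacterTextSplitter additionally tries to cut at the separators listed above, so its chunk boundaries will differ.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts with the last two characters of the previous one. In a real document, that repeated tail is what keeps a sentence that straddles a boundary intact in at least one chunk.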
For code documentation or technical docs with headers:
from langchain_text_splitters import MarkdownTextSplitter, PythonCodeTextSplitter
# Placeholder inputs for illustration; replace with your own content
markdown_content = "# Title\n\nIntro paragraph.\n\n## Section\n\nDetails."
python_code = "def add(a, b):\n    return a + b\n"
# Markdown-aware splitting
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
md_chunks = md_splitter.create_documents([markdown_content])
# Python code splitting
code_splitter = PythonCodeTextSplitter(chunk_size=500, chunk_overlap=50)
code_chunks = code_splitter.create_documents([python_code])
Stage 3: Indexing with Embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
import os
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# FAISS — good for local development and medium-sized collections
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
# Save and reload
vectorstore.save_local("faiss_index")
# Later (recent versions require opting in to pickle loading; only load indexes you created):
# vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
# Chroma: persistent, good for larger collections
vectorstore_chroma = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="chroma_db",
    collection_name="document_qa"
)
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance, reduces redundancy
    search_kwargs={
        "k": 6,             # number of chunks to retrieve
        "fetch_k": 20,      # candidates before MMR filtering
        "lambda_mult": 0.5  # diversity vs relevance balance
    }
)
# Test retrieval
test_docs = retriever.invoke("What was the revenue in Q3 2025?")
for doc in test_docs:
    print(f"Source: {doc.metadata.get('source', 'unknown')}, Page: {doc.metadata.get('page', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}\n")
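The three MMR parameters are easier to reason about with the selection loop written out. The sketch below is a generic greedy MMR over toy vectors, not LangChain's internal code; the function names and vectors are made up for illustration. The candidate list plays the role of the fetch_k results, k is how many get picked, and lambda_mult weights relevance against similarity to chunks already picked.

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec: list, candidate_vecs: list, k: int, lambda_mult: float = 0.5) -> list:
    """Greedily pick k candidate indexes, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

query_vec = [1.0, 1.0, 0.0]
candidates = [
    [1.0, 0.1, 0.0],   # most relevant
    [1.0, 0.1, 0.05],  # near-duplicate of the first
    [0.0, 1.0, 0.0],   # less relevant but covers a different direction
]
picked = mmr_select(query_vec, candidates, k=2, lambda_mult=0.5)
print(picked)  # [0, 2]
```

With lambda_mult=1.0 the redundancy term vanishes and the same call returns [0, 1], which is plain similarity ranking. Lower values push harder toward diversity.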
Stage 4: The Q&A Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Prompt with explicit instructions for source-grounded answers
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert at answering questions based on provided documents.
Answer the question using ONLY the information from the context below.
If the answer isn't in the context, say "I don't have enough information in the provided documents to answer this."
Always cite your sources by mentioning the document name and page number when available.
Be precise and factual."""),
    ("human", """Context:
{context}
Question: {question}""")
])
def format_docs(docs):
    """Format retrieved documents with source attribution."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("filename", doc.metadata.get("source", "unknown"))
        page = doc.metadata.get("page", "N/A")
        formatted.append(f"[Source {i}: {source}, Page {page}]\n{doc.page_content}")
    return "\n\n".join(formatted)
# Build the complete RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | qa_prompt
    | llm
    | StrOutputParser()
)
# Query the system
answer = rag_chain.invoke("What were the main findings of the Q3 2025 report?")
print(answer)
Adding Conversation History
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Contextualize the question given chat history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", """Given a chat history and the latest user question which might reference
context in the chat history, formulate a standalone question which can be understood
without the chat history. Do NOT answer the question, just reformulate it if needed."""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
# QA prompt with history
qa_with_history_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question based only on the context provided.
If you don't know the answer, say so.
Context: {context}"""),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}")
])
question_answer_chain = create_stuff_documents_chain(llm, qa_with_history_prompt)
rag_chain_with_history = create_retrieval_chain(
    history_aware_retriever,
    question_answer_chain
)
# Use with conversation history
chat_history = []
def chat(question: str) -> str:
    """Answer a question and append both sides of the turn to the shared history."""
    result = rag_chain_with_history.invoke({
        "input": question,
        "chat_history": chat_history
    })
    chat_history.extend([
        HumanMessage(content=question),
        AIMessage(content=result["answer"])
    ])
    return result["answer"]
# Multi-turn conversation
print(chat("What is the total revenue mentioned in the documents?"))
print(chat("What about the expenses?")) # "expenses" understood in context of revenue
print(chat("Calculate the profit margin based on those numbers."))
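One thing the chat function above leaves open is that chat_history grows with every turn, so long sessions eventually send a very large prompt. A simple option is to keep only the most recent turns. The helper below is a hypothetical sketch with an arbitrary default; it works on any list and assumes each turn contributes exactly two entries, a question and an answer.

```python
def trim_history(history: list, max_turns: int = 5) -> list:
    """Keep only the last max_turns question/answer pairs (two messages per turn)."""
    if max_turns <= 0:
        return []
    return history[-2 * max_turns:]

# Stand-in strings instead of HumanMessage/AIMessage objects, to keep the example self-contained
demo_history = [f"message {i}" for i in range(12)]  # six turns
trimmed = trim_history(demo_history, max_turns=2)
print(trimmed)
```

In the chat function you could pass `trim_history(chat_history)` as the "chat_history" value instead of the full list. Summarizing older turns is another option when early context matters.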
Comparison Table: PDF Loaders
| Loader | Speed | Tables | Scanned PDF (OCR) | Complex Layout | Setup |
|---|---|---|---|---|---|
| PyPDFLoader | Fast | Poor | No | Poor | Easy |
| PDFPlumberLoader | Medium | Good | No | Good | Easy |
| UnstructuredPDFLoader | Slow | Excellent | Yes (hi_res) | Excellent | Complex |
| PDFMinerLoader | Medium | Medium | No | Medium | Easy |
| Amazon Textract | Medium | Excellent | Yes | Excellent | Cloud setup |
For standard text PDFs in a business context, PyPDFLoader is fastest and most reliable. If your documents have important tables, switch to PDFPlumber. If you're dealing with scanned documents or complex layouts (multi-column academic papers, legal documents with annotations), UnstructuredPDFLoader with strategy="hi_res" is worth the extra setup time.
According to Unstructured's benchmarks, their hi_res strategy achieves 95%+ accuracy on complex PDFs compared to ~70% for basic text extraction approaches.
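The selection guidance above can be written down as a small decision function. This is a sketch of the rule of thumb from this section only; the function name and the three boolean flags are assumptions, and it returns loader names as strings rather than instantiating anything.

```python
def choose_pdf_loader(is_scanned: bool, has_complex_layout: bool, has_tables: bool) -> str:
    """Map document traits to the loader this guide recommends for them."""
    if is_scanned or has_complex_layout:
        # OCR and layout analysis are only covered by the hi_res strategy
        return "UnstructuredPDFLoader(strategy='hi_res')"
    if has_tables:
        return "PDFPlumberLoader"
    # Plain text PDFs: the fastest option is good enough
    return "PyPDFLoader"

print(choose_pdf_loader(is_scanned=False, has_complex_layout=False, has_tables=True))
```

The order of the checks matters: a scanned document with tables still needs OCR first, so the scanned and complex-layout cases are tested before the table case.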
Production Considerations
# Metadata filtering for large document collections
from langchain_community.vectorstores import Chroma
# Store with rich metadata
chunks_with_metadata = []
for chunk in chunks:
    chunk.metadata.update({
        "department": "finance",
        "year": 2025,
        "document_type": "annual_report"
    })
    chunks_with_metadata.append(chunk)
vectorstore = Chroma.from_documents(chunks_with_metadata, embeddings)
# Filter retrieval by metadata
# Chroma expects multiple conditions to be combined with an explicit operator such as $and
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {
            "$and": [
                {"department": "finance"},
                {"year": 2025}
            ]
        }
    }
)
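Conceptually, a metadata filter like the one above narrows the candidate set before similarity ranking: only chunks whose metadata matches every condition are considered. The plain-Python sketch below illustrates that equality-and-AND behavior with hypothetical helper names. A real vector store applies the filter inside its own engine and may support richer operators.

```python
def matches_filter(metadata: dict, conditions: dict) -> bool:
    """True when every key in conditions is present in metadata with an equal value."""
    return all(metadata.get(key) == value for key, value in conditions.items())

def filter_chunks(chunk_metadatas: list, conditions: dict) -> list:
    """Return the indexes of chunks that satisfy all conditions."""
    return [i for i, meta in enumerate(chunk_metadatas) if matches_filter(meta, conditions)]

metadatas = [
    {"department": "finance", "year": 2025},
    {"department": "finance", "year": 2024},
    {"department": "legal", "year": 2025},
]
kept = filter_chunks(metadatas, {"department": "finance", "year": 2025})
print(kept)  # [0]
```

This is why consistent metadata at load time matters: a chunk that is missing the "year" key never matches a filter on year, no matter how relevant its text is.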
For more on vector storage patterns that scale, see Vector database guide. The semantic search tutorial covers embedding and retrieval strategies in detail.
If you're integrating this Q&A system into a larger agent that also uses web tools, AI research agent build shows the architecture. The LangChain tutorial 2025 also has a section on document Q&A that complements this guide.
Conclusion
Building a document Q&A system that works reliably on enterprise documents is mostly a loading and chunking problem. Get those two right and the retrieval and generation steps are straightforward. Use PyPDFLoader for standard PDFs, PDFPlumber when tables matter, and UnstructuredPDFLoader for anything complex or scanned. Split with RecursiveCharacterTextSplitter at 800 characters with 150 overlap — that's a solid default for most documents.
The conversation history pattern with create_history_aware_retriever is what separates a basic Q&A system from a useful one. Users always ask follow-up questions. Make sure your system handles them.
Ready to take this further? RAG system tutorial covers advanced retrieval patterns, and Deploy AI model to production shows how to serve this system at scale.
Frequently Asked Questions
What is the best PDF loader for LangChain with table support?
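For text-based PDFs with tables, PDFPlumber is a strong default. It analyzes the positions of characters and lines on the page, and its extract_tables() method returns each table as rows of cells that you can convert to Markdown or CSV. For tables inside complex layouts, UnstructuredPDFLoader with strategy="hi_res" is worth comparing, as the comparison table above suggests. Scanned PDFs need OCR, either through UnstructuredPDFLoader's hi_res strategy (which relies on Tesseract) or through a service such as Amazon Textract.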
How large can documents be in a LangChain Q&A system?
Document size is limited by your vector store capacity, not LangChain itself. For large document collections (thousands of PDFs), use Pinecone or Weaviate rather than FAISS. The splitting strategy matters more than raw size — use RecursiveCharacterTextSplitter with a chunk size of 500-1000 chars and 10-20% overlap.
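To get a feel for how chunk size and overlap translate into index size, a back-of-the-envelope estimate helps. The formula below assumes fixed-size windows that each advance by chunk_size minus chunk_overlap characters. RecursiveCharacterTextSplitter cuts at separators, so treat the result as a rough estimate; the function name is hypothetical.

```python
import math

def estimate_chunk_count(doc_chars: int, chunk_size: int = 800, chunk_overlap: int = 150) -> int:
    """Estimate how many fixed-size overlapping chunks a document of doc_chars characters yields."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_chars <= 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    # Every chunk after the first contributes (chunk_size - chunk_overlap) new characters
    return math.ceil((doc_chars - chunk_overlap) / (chunk_size - chunk_overlap))

print(estimate_chunk_count(100_000))  # 154
```

With the defaults used in this guide, a 100,000-character document comes to roughly 154 chunks, so a collection of a thousand similar documents lands in the low hundreds of thousands of vectors.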
How do I handle multiple documents in LangChain Q&A?
Load all documents into a single vectorstore and add metadata (filename, page number, document type) to each chunk. Use metadata filtering in your retriever to search specific documents or document types. The MultiVectorRetriever pattern also works well for large collections.
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.