5 LangChain Document Loaders: S3, Notion, YouTube, Twitter
A practical guide to LangChain's S3FileLoader, NotionDirectoryLoader, YoutubeLoader, TwitterTweetLoader, and building custom API loaders with real code examples.
Every RAG pipeline starts with the same question: how do you get your data into a form the LLM can actually work with? If your data lives in a tidy folder of PDFs, you are set. But most real projects pull from S3 buckets, Notion workspaces, YouTube transcripts, Twitter threads, and a dozen internal APIs. LangChain's document loader ecosystem covers an impressive range of these sources, and this guide walks through five of the most practically useful ones with working code for each.
A quick note before we start: document loaders are only one piece of the pipeline. Once you have documents loaded, you will want to split them, embed them, and store them in a vector database. The RAG system tutorial covers that full pipeline end to end. The LangChain tutorial 2025 is also worth reading first if you are new to the framework.
Setup
pip install langchain langchain-community langchain-openai langchain-text-splitters \
    boto3 "unstructured[pdf]" \
    notion-client \
    youtube-transcript-api pytube \
    tweepy \
    chromadb requests aiohttp \
    python-dotenv
Create a .env file:
OPENAI_API_KEY=your_openai_key
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_DEFAULT_REGION=us-east-1
NOTION_API_KEY=secret_your_notion_key
TWITTER_BEARER_TOKEN=your_bearer_token
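Missing credentials tend to show up later as confusing errors inside a loader, so it can help to check the environment up front. The following is a minimal sketch, assuming the variable names from the .env file above. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names for illustration, not part of LangChain or python-dotenv.

```python
import os
from typing import Iterable, Mapping, Optional

# Variable names taken from the .env file above (assumed to be the full set you need).
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "NOTION_API_KEY",
    "TWITTER_BEARER_TOKEN",
)


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` right after `load_dotenv()` returns an empty list when everything is configured, and otherwise the names you still need to fill in.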
Loader 1: S3FileLoader and S3DirectoryLoader
S3 is where a lot of enterprise document storage lives — policy PDFs, contract files, training materials, exported reports. LangChain gives you two loaders: one for a single file, one for an entire bucket prefix.
Loading a Single File
from dotenv import load_dotenv
from langchain_community.document_loaders import S3FileLoader
load_dotenv()
# Load a single PDF from S3
loader = S3FileLoader(
    bucket="my-company-docs",
    key="policies/employee-handbook-2026.pdf",
)
documents = loader.load()
print(f"Loaded {len(documents)} document(s)")
print(f"Content preview: {documents[0].page_content[:300]}")
print(f"Metadata: {documents[0].metadata}")
Loading an Entire Prefix
from langchain_community.document_loaders import S3DirectoryLoader
# Load all files under a specific prefix
loader = S3DirectoryLoader(
    bucket="my-company-docs",
    prefix="contracts/2026/",
)
documents = loader.load()
print(f"Loaded {len(documents)} documents from S3 prefix")
# Group by source file
from collections import defaultdict
by_source = defaultdict(list)
for doc in documents:
    by_source[doc.metadata.get("source", "unknown")].append(doc)
print(f"Unique files loaded: {len(by_source)}")
Error Handling for S3
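S3 loaders can look like they fail silently: a mistyped prefix simply returns zero documents. Real failures surface as botocore exceptions whose messages are not always descriptive, so catch them and map the common error codes to readable messages: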
import botocore.exceptions
from langchain_community.document_loaders import S3DirectoryLoader
def safe_load_s3_directory(bucket: str, prefix: str) -> list:
    """Load every object under a prefix, returning [] on any failure."""
    loader = S3DirectoryLoader(bucket=bucket, prefix=prefix)
    try:
        docs = loader.load()
        return docs
    except botocore.exceptions.ClientError as e:
        # ClientError carries the S3 error code in its response payload
        error_code = e.response["Error"]["Code"]
        if error_code == "NoSuchBucket":
            print(f"Bucket '{bucket}' does not exist")
        elif error_code == "AccessDenied":
            print(f"Access denied to bucket '{bucket}'")
        else:
            print(f"S3 error: {error_code} ({e})")
        return []
    except Exception as e:
        # Anything else: missing credentials, parsing errors, network issues
        print(f"Unexpected error loading S3: {e}")
        return []
docs = safe_load_s3_directory("my-company-docs", "policies/")
Loader 2: NotionDirectoryLoader
The Notion loader works with exported Notion data rather than the live API. You export your workspace or specific pages, and the loader reads the Markdown files from the extracted export. This is more reliable than API-based access for large workspaces.
Exporting From Notion
In Notion, go to Settings > Export content > Export all workspace content. Choose Markdown & CSV format and tick "Include subpages." This creates a ZIP file — extract it to a local folder.
from langchain_community.document_loaders import NotionDirectoryLoader
# Point to the extracted export folder
loader = NotionDirectoryLoader(path="./notion-export/")
documents = loader.load()
print(f"Loaded {len(documents)} Notion pages")
# Inspect the first few documents
for doc in documents[:3]:
    print(f"\n--- {doc.metadata.get('source', 'unknown')} ---")
    print(doc.page_content[:400])
Using the Notion API Directly
If you want live data from Notion (not exports), use NotionDBLoader with the official API:
import os
from langchain_community.document_loaders import NotionDBLoader
# You need a Notion integration token and the database ID
loader = NotionDBLoader(
    integration_token=os.getenv("NOTION_API_KEY"),
    database_id="your-notion-database-id-here",
    request_timeout_sec=30,
)
documents = loader.load()
print(f"Loaded {len(documents)} pages from Notion database")
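for doc in documents:
    # Metadata keys come from the database's property names, lowercased;
    # adjust "title" if your title column has a different name (for example "name")
    title = doc.metadata.get("title", "Untitled")
    print(f"Page: {title} | Length: {len(doc.page_content)} chars")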
To find your database ID, open the database in Notion and copy the ID from the URL: https://notion.so/workspace/{DATABASE_ID}?v=...
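Copying the ID by hand is error-prone, so here is a small sketch that extracts it from a pasted URL. It assumes the ID is the trailing 32 hexadecimal characters of the last path segment, optionally written with dashes, which matches the URL shape shown above. `extract_notion_database_id` is a hypothetical helper, not a LangChain or Notion SDK function.

```python
import re
from urllib.parse import urlparse

# Assumption: a Notion ID is 32 hex characters at the end of the last path segment.
_ID_PATTERN = re.compile(r"([0-9a-f]{32})$")


def extract_notion_database_id(url: str) -> str:
    """Pull the 32-character database ID out of a Notion database URL."""
    # urlparse drops the ?v=... query string; keep only the final path segment
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    # Some URLs show the ID in dashed UUID form or after a page-title slug
    candidate = last_segment.replace("-", "").lower()
    match = _ID_PATTERN.search(candidate)
    if match is None:
        raise ValueError(f"No database ID found in URL: {url}")
    return match.group(1)
```

The returned string can be passed as `database_id` to NotionDBLoader. If your workspace URLs look different, treat the pattern as a starting point rather than a rule.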
Filtering and Processing Notion Documents
# Filter out empty or near-empty pages (100 characters or fewer);
# Notion exports often include empty template pages
non_empty_docs = [
    doc for doc in documents
    if len(doc.page_content.strip()) > 100
]
print(f"Non-empty pages: {len(non_empty_docs)} / {len(documents)}")
# Add custom metadata
for doc in non_empty_docs:
    doc.metadata["source_type"] = "notion"
    doc.metadata["char_count"] = len(doc.page_content)
Loader 3: YoutubeLoader
YouTube videos contain a huge amount of useful content locked in audio form. The YoutubeLoader extracts transcripts (auto-generated or manually added) and gives you text you can index and search.
from langchain_community.document_loaders import YoutubeLoader
# Load a single video's transcript
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    add_video_info=True,  # includes title, author, views, publish_date
    language=["en"],  # prefer English transcripts
)
documents = loader.load()
print(f"Video title: {documents[0].metadata.get('title')}")
print(f"Author: {documents[0].metadata.get('author')}")
print(f"Transcript length: {len(documents[0].page_content)} chars")
print(f"\nFirst 500 chars:\n{documents[0].page_content[:500]}")
Loading Multiple Videos
from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import time
video_urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
    "https://www.youtube.com/watch?v=VIDEO_ID_3",
]
all_docs = []
failed = []
for url in video_urls:
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=True,
            language=["en", "en-US"],
        )
        docs = loader.load()
        all_docs.extend(docs)
        print(f"Loaded: {docs[0].metadata.get('title', url)}")
        time.sleep(1)  # be polite to the API
    except Exception as e:
        print(f"Failed to load {url}: {e}")
        failed.append(url)
print(f"\nLoaded {len(all_docs)} videos, {len(failed)} failed")
# Split transcripts into chunks for indexing
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")
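To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally: that splitter breaks on separators such as paragraphs and sentences, so its real chunk boundaries and counts will differ. `sliding_window_chunks` is a hypothetical function for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used above (1000 and 200), each window advances 800 characters, so a 5,000-character transcript yields 6 windows under this simplified model. The overlap costs extra embeddings but keeps sentences that straddle a boundary retrievable from either side.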
Transcripts in Other Languages
# Load with language fallback
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=SOME_VIDEO",
    add_video_info=True,
    language=["en", "en-GB", "en-US"],  # try in order
    translation="en",  # translate if needed
)
Loader 4: TwitterTweetLoader
Twitter/X thread content is genuinely useful for tracking discussions, sentiment, and domain knowledge. The TwitterTweetLoader uses the Tweepy API to pull tweets.
import os
from langchain_community.document_loaders import TwitterTweetLoader
# Get your Bearer Token from developer.twitter.com.
# from_bearer_token builds its own Tweepy auth handler, so no separate client is needed.
# Load recent tweets from each listed user
loader = TwitterTweetLoader.from_bearer_token(
    oauth2_bearer_token=os.getenv("TWITTER_BEARER_TOKEN"),
    twitter_users=["LangChainAI", "OpenAI"],
    number_tweets=50,  # tweets to load per user
)
documents = loader.load()
print(f"Loaded {len(documents)} tweets")
for doc in documents[:3]:
    print("\n--- Tweet ---")
    print(doc.page_content[:200])
    print(f"Metadata: {doc.metadata}")
Searching Tweets by Keyword
The default loader pulls from user timelines. For keyword search, combine Tweepy directly with manual document creation:
import os
from typing import Optional
import tweepy
from langchain_core.documents import Document
def search_tweets_as_documents(
    query: str,
    max_results: int = 100,
    bearer_token: Optional[str] = None,
) -> list[Document]:
    """Search recent tweets and convert them to LangChain Documents."""
    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(
        query=f"{query} -is:retweet lang:en",
        # the recent-search endpoint accepts 10 to 100 results per request
        max_results=max(10, min(max_results, 100)),
        tweet_fields=["created_at", "author_id", "public_metrics"],
    )
    if not response.data:
        return []
    documents = []
    for tweet in response.data:
        doc = Document(
            page_content=tweet.text,
            metadata={
                "tweet_id": str(tweet.id),
                "author_id": str(tweet.author_id),
                "created_at": str(tweet.created_at),
                "likes": tweet.public_metrics.get("like_count", 0),
                "retweets": tweet.public_metrics.get("retweet_count", 0),
                "source": "twitter",
            },
        )
        documents.append(doc)
    return documents
# Usage
ai_docs = search_tweets_as_documents(
    query="LangChain RAG",
    max_results=50,
    bearer_token=os.getenv("TWITTER_BEARER_TOKEN"),
)
print(f"Found {len(ai_docs)} relevant tweets")
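Search calls like this are rate limited, so a loop over many queries will eventually hit a limit error. One common way to cope is retrying with exponential backoff. The sketch below is generic Python, not a Tweepy or LangChain feature; `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception wait base_delay, then 2x, 4x, ...
    and try again, re-raising after the final attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call might look like `retry_with_backoff(lambda: search_tweets_as_documents("LangChain RAG", 50, token), retry_on=(tweepy.TooManyRequests,))`. Tweepy's `Client` also accepts `wait_on_rate_limit=True`, which may be simpler if blocking until the limit resets is acceptable for your job.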
Loader 5: Custom API Loader
Sometimes you need to load from an internal API, a SaaS tool, or a data source LangChain does not have a built-in loader for. Building a custom loader is straightforward — subclass BaseLoader.
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
import requests
from typing import Iterator
class HackerNewsLoader(BaseLoader):
    """Loads top stories from Hacker News API."""
    def __init__(self, num_stories: int = 10):
        self.num_stories = num_stories
        self.base_url = "https://hacker-news.firebaseio.com/v0"
    def lazy_load(self) -> Iterator[Document]:
        """Yields documents one at a time (memory efficient)."""
        # Get top story IDs
        response = requests.get(f"{self.base_url}/topstories.json", timeout=10)
        response.raise_for_status()
        story_ids = response.json()[:self.num_stories]
        for story_id in story_ids:
            story_url = f"{self.base_url}/item/{story_id}.json"
            story_resp = requests.get(story_url, timeout=10)
            if story_resp.status_code != 200:
                continue
            story = story_resp.json()
            # Skip non-story items (jobs, polls, etc.)
            if story.get("type") != "story":
                continue
            content = f"Title: {story.get('title', '')}\n"
            content += f"URL: {story.get('url', 'no URL')}\n"
            content += f"Score: {story.get('score', 0)}\n"
            content += f"Comments: {story.get('descendants', 0)}\n"
            yield Document(
                page_content=content,
                metadata={
                    "story_id": story_id,
                    "source": story.get("url", f"hn_{story_id}"),
                    "author": story.get("by", "unknown"),
                    "score": story.get("score", 0),
                    "time": story.get("time", 0),
                    "type": "hackernews",
                },
            )
    def load(self) -> list[Document]:
        return list(self.lazy_load())
# Usage
hn_loader = HackerNewsLoader(num_stories=20)
hn_docs = hn_loader.load()
print(f"Loaded {len(hn_docs)} HN stories")
for doc in hn_docs[:3]:
    print(f"\n{doc.page_content}")
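The reason to put the work in `lazy_load` and have `load` wrap it is that a generator only does work for the items you actually consume. The toy sketch below shows that effect without any network calls. `Doc` and `CountingLoader` are hypothetical stand-ins (assuming a document is just `page_content` plus `metadata`), not LangChain classes.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Iterator


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class CountingLoader:
    """Toy loader that counts how many items it has actually produced."""

    def __init__(self, total: int):
        self.total = total
        self.fetched = 0

    def lazy_load(self) -> Iterator[Doc]:
        for i in range(self.total):
            self.fetched += 1  # stands in for one API request
            yield Doc(page_content=f"item {i}", metadata={"index": i})

    def load(self) -> list[Doc]:
        return list(self.lazy_load())


loader = CountingLoader(total=1000)
first_three = list(islice(loader.lazy_load(), 3))
print(loader.fetched)  # 3: only the consumed items were produced
```

With a real API behind it, the same pattern means a preview of the first few documents costs a few requests instead of a thousand.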
Async Custom Loader
For production workloads that call high-latency APIs, async loading keeps slow requests from blocking the rest of your program:
import asyncio
import aiohttp
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
class AsyncNewsAPILoader(BaseLoader):
    """Loads news articles from NewsAPI asynchronously."""
    def __init__(self, api_key: str, query: str, num_articles: int = 20):
        self.api_key = api_key
        self.query = query
        self.num_articles = num_articles
    async def _fetch_articles(self) -> list[dict]:
        url = "https://newsapi.org/v2/everything"
        params = {
            "q": self.query,
            "pageSize": self.num_articles,
            "sortBy": "publishedAt",
            "apiKey": self.api_key,
        }
        async with aiohttp.ClientSession() as session:
            async with session.get(url, params=params) as resp:
                data = await resp.json()
        return data.get("articles", [])
    def lazy_load(self):
        # asyncio.run starts its own event loop, so call this from synchronous code;
        # inside an already running loop (e.g. a notebook) it raises RuntimeError
        articles = asyncio.run(self._fetch_articles())
        for article in articles:
            # Fields may come back as null, so fall back to empty strings
            title = article.get("title") or ""
            description = article.get("description") or ""
            body = article.get("content") or ""
            yield Document(
                page_content=f"{title}\n\n{description}\n\n{body}",
                metadata={
                    "source": article.get("url", ""),
                    "author": article.get("author") or "unknown",
                    "published_at": article.get("publishedAt", ""),
                    "source_name": (article.get("source") or {}).get("name", ""),
                    "type": "news_article",
                },
            )
    def load(self) -> list[Document]:
        return list(self.lazy_load())
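Async pays off most when a loader has to make many requests, such as one per item, because they can then overlap instead of running back to back. Here is a minimal sketch of that pattern using only the standard library. `fetch_all` is a hypothetical helper and `fake_fetch` is a stand-in for a real HTTP call such as an aiohttp `session.get(...)`.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    item_ids: list[int],
    fetch_one: Callable[[int], Awaitable[dict]],
    max_concurrency: int = 5,
) -> list[dict]:
    """Fetch every item concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item_id: int) -> dict:
        async with semaphore:  # wait here if too many requests are running
            return await fetch_one(item_id)

    # gather returns results in the same order as item_ids
    return await asyncio.gather(*(bounded(i) for i in item_ids))


async def fake_fetch(item_id: int) -> dict:
    """Hypothetical stand-in for one HTTP request."""
    await asyncio.sleep(0.01)
    return {"id": item_id, "title": f"story {item_id}"}


results = asyncio.run(fetch_all([3, 1, 2], fake_fetch, max_concurrency=2))
print([r["id"] for r in results])  # [3, 1, 2]
```

The semaphore is the part worth keeping: it caps concurrency so a large batch does not overwhelm the API or trip its rate limits.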
Loader Comparison Table
| Loader | Auth Required | Output Format | Rate Limits | Best For |
|---|---|---|---|---|
| S3FileLoader | AWS IAM credentials | Raw file content | None (S3 billing applies) | Enterprise docs, PDFs, files |
| S3DirectoryLoader | AWS IAM credentials | Multiple raw files | None | Batch loading from S3 prefixes |
| NotionDirectoryLoader | None (local export) | Markdown | None | Offline processing of Notion exports |
| NotionDBLoader | Notion integration token | Structured text | 3 req/sec | Live Notion database queries |
| YoutubeLoader | None (public videos) | Transcript text | Soft limits | Video content indexing |
| TwitterTweetLoader | Bearer token | Tweet text | Varies by X API plan (check your tier's current limits) | Social media monitoring |
| Custom BaseLoader | Depends on API | Whatever you build | Depends on API | Any custom source |
Putting It All Together: Multi-Source Ingestion Pipeline
Here is a pipeline that loads from multiple sources and prepares everything for indexing:
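from dotenv import load_dotenv
from langchain_community.document_loaders import NotionDirectoryLoader, YoutubeLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
load_dotenv()
# safe_load_s3_directory is the helper defined in the S3 error-handling section above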
def load_all_sources() -> list:
    """Load documents from S3, Notion, and YouTube."""
    all_docs = []
    # Load from S3
    print("Loading from S3...")
    s3_docs = safe_load_s3_directory("my-docs", "public/")
    for doc in s3_docs:
        doc.metadata["source_platform"] = "s3"
    all_docs.extend(s3_docs)
    # Load from Notion export
    print("Loading from Notion...")
    notion_loader = NotionDirectoryLoader("./notion-export/")
    notion_docs = notion_loader.load()
    for doc in notion_docs:
        doc.metadata["source_platform"] = "notion"
    all_docs.extend(notion_docs)
    # Load YouTube transcripts
    print("Loading YouTube transcripts...")
    video_urls = [
        "https://www.youtube.com/watch?v=VIDEO_ID_1",
    ]
    for url in video_urls:
        try:
            yt_loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
            yt_docs = yt_loader.load()
            for doc in yt_docs:
                doc.metadata["source_platform"] = "youtube"
            all_docs.extend(yt_docs)
        except Exception as e:
            print(f"YouTube load failed: {e}")
    print(f"\nTotal documents loaded: {len(all_docs)}")
    return all_docs
def build_vector_store(documents: list):
    """Split, embed, and store documents in Chroma."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
        collection_name="multi_source_docs",
    )
    # With chromadb 0.4 and later, setting persist_directory saves to disk
    # automatically; the older explicit vectorstore.persist() call is deprecated
    print(f"Stored {len(chunks)} chunks in Chroma")
    return vectorstore
if __name__ == "__main__":
    docs = load_all_sources()
    vs = build_vector_store(docs)
    print("Vector store ready for querying")
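When several sources feed one index, the same text often arrives more than once, for example a policy that lives both in S3 and in a Notion page. A cheap guard is to drop exact duplicates before embedding. The sketch below is one possible approach, assuming that "duplicate" means identical text after collapsing whitespace and lowercasing. `dedupe_by_content` is a hypothetical helper, and `Doc` is a stand-in for a LangChain Document.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs: list) -> list:
    """Keep the first document for each distinct normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalize so whitespace and case differences do not hide duplicates
        normalized = " ".join(doc.page_content.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


sample = [Doc("Hello  World"), Doc("hello world"), Doc("Other text")]
print(len(dedupe_by_content(sample)))  # 2
```

Because it only reads `page_content`, the same function should work on real Document objects, for instance between `load_all_sources()` and `build_vector_store()`. It will not catch near-duplicates with small edits; that needs a similarity-based approach.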
What to Do After Loading
Loading is step one. After you have documents in a vector store, you need good retrieval. The vector database guide covers choosing between Chroma, Pinecone, and Weaviate for different scale requirements. For building a full Q&A system on top of loaded documents, the semantic search tutorial shows how to wire retrieval into a chat interface.
If your data pipeline involves custom sources with complex auth flows, the OpenAI API integration guide has relevant patterns for API key management and retry logic.
Conclusion
Document loaders are the entry point of any RAG or document-processing pipeline. The five loaders in this guide — S3, Notion, YouTube, Twitter, and custom API — cover the sources I run into most often in real projects. Each one has quirks: S3 needs proper IAM permissions, Notion export is better than the API for bulk loads, YouTube only works for public videos, and Twitter's rate limits require batching.
The custom loader pattern is worth learning even if you do not need it today. Real projects almost always have at least one data source that does not have a built-in loader, and having a clean pattern for building your own makes that a 30-minute task instead of a research project.
Start with whatever source your actual data lives in, get it loading cleanly, and then layer on the splitting and embedding pipeline. Questions about a specific loader or data source not covered here? Drop a comment below.
FAQs
Can I load private YouTube videos with LangChain's YoutubeLoader? No. YoutubeLoader uses the youtube-transcript-api library which only accesses publicly available transcripts. For private videos you would need to download the transcript manually or use the YouTube Data API with OAuth authentication and then load the text as a plain document.
What is the best way to handle large S3 buckets with thousands of files? Use S3DirectoryLoader with a prefix to narrow the scope, then process files in batches using Python's concurrent.futures or asyncio. Avoid loading all documents into memory at once — stream them through your processing pipeline and index incrementally.
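As a rough illustration of the batching idea in the answer above, the sketch below groups any iterable of documents into fixed-size lists so each batch can be indexed and released before the next one is read. `batched` here is a hypothetical helper written with the standard library, not a LangChain function.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

A typical use would be `for batch in batched(loader.lazy_load(), 100): vectorstore.add_documents(batch)`. How much memory this saves depends on whether the loader's `lazy_load` really streams; some loaders build the full list first, in which case narrowing the prefix does more of the work.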
How do I handle Notion pages that have nested child pages? NotionDirectoryLoader only reads files that exist in the export folder, so child pages must be part of the export itself. Tick "Include subpages" when exporting (Settings > Export content), which produces a nested folder structure, then point NotionDirectoryLoader at the root of the extracted folder so it picks up the Markdown files inside it.
AiTechWorlds Team