5 LangChain Document Loaders: S3, Notion, YouTube, Twitter

⚡ Quick Answer

A practical guide to LangChain's S3FileLoader, NotionDirectoryLoader, YoutubeLoader, TwitterTweetLoader, and building custom API loaders with real code examples.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #document loaders #S3 #Notion #YouTube #RAG


Every RAG pipeline starts with the same question: how do you get your data into a form the LLM can actually work with? If your data lives in a tidy folder of PDFs, you are set. But most real projects pull from S3 buckets, Notion workspaces, YouTube transcripts, Twitter threads, and a dozen internal APIs. LangChain's document loader ecosystem covers an impressive range of these sources, and this guide walks through five of the most practically useful ones with working code for each.

A quick note before we start: document loaders are only one piece of the pipeline. Once you have documents loaded, you will want to split them, embed them, and store them in a vector database. The RAG system tutorial covers that full pipeline end to end. The LangChain tutorial 2025 is also worth reading first if you are new to the framework.

Setup

pip install langchain langchain-community langchain-openai \
    boto3 "unstructured[pdf]" \
    notion-client \
    youtube-transcript-api pytube \
    tweepy \
    chromadb requests aiohttp \
    python-dotenv

Create a .env file:

OPENAI_API_KEY=your_openai_key
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_DEFAULT_REGION=us-east-1
NOTION_API_KEY=secret_your_notion_key
TWITTER_BEARER_TOKEN=your_bearer_token

Loader 1: S3FileLoader and S3DirectoryLoader

S3 is where a lot of enterprise document storage lives — policy PDFs, contract files, training materials, exported reports. LangChain gives you two loaders: one for a single file, one for an entire bucket prefix.

Loading a Single File

from dotenv import load_dotenv
from langchain_community.document_loaders import S3FileLoader

load_dotenv()

# Load a single PDF from S3
loader = S3FileLoader(
    bucket="my-company-docs",
    key="policies/employee-handbook-2026.pdf"
)

documents = loader.load()
print(f"Loaded {len(documents)} document(s)")
print(f"Content preview: {documents[0].page_content[:300]}")
print(f"Metadata: {documents[0].metadata}")

Loading an Entire Prefix

from langchain_community.document_loaders import S3DirectoryLoader

# Load all files under a specific prefix
loader = S3DirectoryLoader(
    bucket="my-company-docs",
    prefix="contracts/2026/",
)

documents = loader.load()
print(f"Loaded {len(documents)} documents from S3 prefix")

# Group by source file
from collections import defaultdict
by_source = defaultdict(list)
for doc in documents:
    by_source[doc.metadata.get("source", "unknown")].append(doc)

print(f"Unique files loaded: {len(by_source)}")

Error Handling for S3

S3 loaders fail silently or throw botocore exceptions that are not always descriptive. Wrap them properly:

import botocore.exceptions
from langchain_community.document_loaders import S3DirectoryLoader

def safe_load_s3_directory(bucket: str, prefix: str) -> list:
    loader = S3DirectoryLoader(bucket=bucket, prefix=prefix)
    try:
        docs = loader.load()
        return docs
    except botocore.exceptions.ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "NoSuchBucket":
            print(f"Bucket '{bucket}' does not exist")
        elif error_code == "AccessDenied":
            print(f"Access denied to bucket '{bucket}'")
        else:
            print(f"S3 error: {error_code} — {e}")
        return []
    except Exception as e:
        print(f"Unexpected error loading S3: {e}")
        return []

docs = safe_load_s3_directory("my-company-docs", "policies/")
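
One limitation of the wrapper above is that it is all-or-nothing: if a single object under the prefix fails to load, the whole call returns an empty list. A minimal sketch of per-key loading that isolates failures follows. It assumes you already have a list of keys, and `load_one` is a hypothetical callable (not part of LangChain) that loads a single key, for example by wrapping S3FileLoader. The demo uses a stand-in loader so it runs without AWS access.

```python
from typing import Callable, Iterable


def load_keys_individually(
    keys: Iterable[str],
    load_one: Callable[[str], list],
) -> tuple[list, dict[str, str]]:
    """Load each key separately so one bad file does not sink the batch.

    `load_one` is a hypothetical callable returning the documents for one key,
    for example:
        lambda key: S3FileLoader(bucket="my-company-docs", key=key).load()
    """
    loaded: list = []
    failures: dict[str, str] = {}
    for key in keys:
        try:
            loaded.extend(load_one(key))
        except Exception as exc:  # record the failure and keep going
            failures[key] = f"{type(exc).__name__}: {exc}"
    return loaded, failures


# Demo with a stand-in loader (no AWS access needed)
def fake_load(key: str) -> list:
    if key.endswith(".bad"):
        raise ValueError("unsupported file type")
    return [f"contents of {key}"]


loaded_docs, failures = load_keys_individually(
    ["a.pdf", "b.bad", "c.pdf"], fake_load
)
print(f"Loaded {len(loaded_docs)} documents, {len(failures)} failed: {failures}")
```

Returning the failures alongside the documents lets you retry or report the bad keys later instead of losing the whole batch.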

Loader 2: NotionDirectoryLoader

The Notion loader works with exported Notion data rather than the live API. You export your workspace or specific pages, and the loader reads the exported Markdown files from the local folder. This is more reliable than API-based access for large workspaces.

Exporting From Notion

In Notion, go to Settings > Export content > Export all workspace content. Choose Markdown & CSV format and tick "Include subpages." This creates a ZIP file — extract it to a local folder.

from langchain_community.document_loaders import NotionDirectoryLoader

# Point to the extracted export folder
loader = NotionDirectoryLoader(path="./notion-export/")

documents = loader.load()
print(f"Loaded {len(documents)} Notion pages")

# Inspect a document
for doc in documents[:3]:
    print(f"\n--- {doc.metadata.get('source', 'unknown')} ---")
    print(doc.page_content[:400])
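
The `source` metadata printed above is the path of the exported file. Notion export file names usually end with a 32-character hex ID, which makes them noisy as titles. The sketch below strips that suffix. The file-name pattern is an assumption about the export format, so check it against your own export, and `title_from_export_path` is a hypothetical helper, not part of LangChain.

```python
import re
from pathlib import Path

# Assumption: exported file names look like "<Page Title> <32 hex chars>.md"
_NOTION_ID_SUFFIX = re.compile(r"\s+[0-9a-f]{32}$")


def title_from_export_path(source: str) -> str:
    """Return the page title with the trailing Notion ID removed, if present."""
    stem = Path(source).stem  # file name without directory or ".md"
    return _NOTION_ID_SUFFIX.sub("", stem)


print(title_from_export_path(
    "notion-export/Team Wiki/Onboarding 0123456789abcdef0123456789abcdef.md"
))
```

To use it, set `doc.metadata["title"] = title_from_export_path(doc.metadata["source"])` on each loaded document.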

Using the Notion API Directly

If you want live data from Notion (not exports), use NotionDBLoader with the official API:

import os
from langchain_community.document_loaders import NotionDBLoader

# You need a Notion integration token and the database ID
loader = NotionDBLoader(
    integration_token=os.getenv("NOTION_API_KEY"),
    database_id="your-notion-database-id-here",
    request_timeout_sec=30,
)

documents = loader.load()
print(f"Loaded {len(documents)} pages from Notion database")

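for doc in documents:
    # Metadata keys come from the database's property names (lower-cased),
    # so the title column may appear as "name" rather than "title"
    title = doc.metadata.get("title") or doc.metadata.get("name", "Untitled")
    print(f"Page: {title} | Length: {len(doc.page_content)} chars")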

To find your database ID, open the database in Notion and copy the ID from the URL: https://notion.so/workspace/{DATABASE_ID}?v=...
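
If you would rather not copy the ID by hand, a small helper can pull it out of the URL. This is a sketch that assumes the ID is the last 32 hex characters of the URL path (hyphens ignored). Notion URL shapes vary, so treat `extract_database_id` as a hypothetical convenience and verify the result.

```python
import re
from urllib.parse import urlparse


def extract_database_id(url: str) -> str:
    """Pull the 32-character hex ID out of a Notion database URL."""
    path = urlparse(url).path  # drops the "?v=..." query string
    # Remove hyphens so both plain and UUID-style IDs reduce to 32 hex chars
    match = re.search(r"([0-9a-f]{32})$", path.replace("-", ""))
    if match is None:
        raise ValueError(f"No 32-character database ID found in {url!r}")
    return match.group(1)


example_url = "https://notion.so/workspace/8935f9d140a04f95a872520c4f123456?v=abc"
print(extract_database_id(example_url))
```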

Filtering and Processing Notion Documents

# Filter out empty pages (Notion exports often include empty template pages)
non_empty_docs = [
    doc for doc in documents
    if len(doc.page_content.strip()) > 100
]

print(f"Non-empty pages: {len(non_empty_docs)} / {len(documents)}")

# Add custom metadata
for doc in non_empty_docs:
    doc.metadata["source_type"] = "notion"
    doc.metadata["char_count"] = len(doc.page_content)

Loader 3: YoutubeLoader

YouTube videos contain a huge amount of useful content locked in audio form. The YoutubeLoader extracts transcripts (auto-generated or manually added) and gives you text you can index and search.

from langchain_community.document_loaders import YoutubeLoader

# Load a single video's transcript
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    add_video_info=True,    # includes title, author, views, publish_date
    language=["en"],        # prefer English transcripts
)

documents = loader.load()

print(f"Video title: {documents[0].metadata.get('title')}")
print(f"Author: {documents[0].metadata.get('author')}")
print(f"Transcript length: {len(documents[0].page_content)} chars")
print(f"\nFirst 500 chars:\n{documents[0].page_content[:500]}")

Loading Multiple Videos

from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import time

video_urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
    "https://www.youtube.com/watch?v=VIDEO_ID_3",
]

all_docs = []
failed = []

for url in video_urls:
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=True,
            language=["en", "en-US"],
        )
        docs = loader.load()
        all_docs.extend(docs)
        print(f"Loaded: {docs[0].metadata.get('title', url)}")
        time.sleep(1)   # be polite to the API
    except Exception as e:
        print(f"Failed to load {url}: {e}")
        failed.append(url)

print(f"\nLoaded {len(all_docs)} videos, {len(failed)} failed")

# Split transcripts into chunks for indexing
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")
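
When the URL list comes from user input or a scraped page, the same video often shows up in several URL shapes, which means loading and paying to embed the same transcript twice. The sketch below reduces URLs to video IDs and drops duplicates before the loop runs. It assumes only the common watch, youtu.be, shorts, and embed forms. Both helpers are hypothetical and separate from YoutubeLoader's own URL parsing.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> Optional[str]:
    """Return the video ID for common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com":
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] in {"shorts", "embed"}:
            return parts[1]
    return None


def dedupe_video_urls(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each video, as a canonical watch URL.

    URLs that are not recognized are dropped.
    """
    seen: set[str] = set()
    unique = []
    for url in urls:
        video_id = video_id_from_url(url)
        if video_id is None or video_id in seen:
            continue
        seen.add(video_id)
        unique.append(f"https://www.youtube.com/watch?v={video_id}")
    return unique


candidate_urls = [
    "https://www.youtube.com/watch?v=abc123&t=10s",
    "https://youtu.be/abc123",
    "https://www.youtube.com/shorts/xyz789",
    "https://example.com/video",
]
print(dedupe_video_urls(candidate_urls))
```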

Transcripts in Other Languages

# Load with language fallback
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=SOME_VIDEO",
    add_video_info=True,
    language=["en", "en-GB", "en-US"],   # try in order
    translation="en",                     # translate if needed
)

Loader 4: TwitterTweetLoader

Twitter/X thread content is genuinely useful for tracking discussions, sentiment, and domain knowledge. The TwitterTweetLoader uses the Tweepy library to pull tweets.

import os
from langchain_community.document_loaders import TwitterTweetLoader

# Get your Bearer Token from developer.twitter.com.
# The loader builds its own Tweepy client from the token,
# so there is no need to create one yourself.

# Load tweets from a specific user
loader = TwitterTweetLoader.from_bearer_token(
    oauth2_bearer_token=os.getenv("TWITTER_BEARER_TOKEN"),
    twitter_users=["LangChainAI", "OpenAI"],
    number_tweets=50,       # tweets to load per user
)

documents = loader.load()
print(f"Loaded {len(documents)} tweets")

for doc in documents[:3]:
    print(f"\n--- Tweet ---")
    print(doc.page_content[:200])
    print(f"Metadata: {doc.metadata}")

Searching Tweets by Keyword

The default loader pulls from user timelines. For keyword search, combine Tweepy directly with manual document creation:

import os
from typing import Optional

import tweepy
from langchain_core.documents import Document

def search_tweets_as_documents(
    query: str,
    max_results: int = 100,
    bearer_token: Optional[str] = None
) -> list[Document]:
    """Search tweets and convert to LangChain Documents."""
    client = tweepy.Client(bearer_token=bearer_token)

    response = client.search_recent_tweets(
        query=f"{query} -is:retweet lang:en",
        # The recent-search endpoint accepts max_results between 10 and 100
        max_results=max(10, min(max_results, 100)),
        tweet_fields=["created_at", "author_id", "public_metrics"],
    )

    if not response.data:
        return []

    documents = []
    for tweet in response.data:
        doc = Document(
            page_content=tweet.text,
            metadata={
                "tweet_id": str(tweet.id),
                "author_id": str(tweet.author_id),
                "created_at": str(tweet.created_at),
                "likes": tweet.public_metrics.get("like_count", 0),
                "retweets": tweet.public_metrics.get("retweet_count", 0),
                "source": "twitter",
            }
        )
        documents.append(doc)

    return documents

# Usage
ai_docs = search_tweets_as_documents(
    query="LangChain RAG",
    max_results=50,
    bearer_token=os.getenv("TWITTER_BEARER_TOKEN")
)
print(f"Found {len(ai_docs)} relevant tweets")

Loader 5: Custom API Loader

Sometimes you need to load from an internal API, a SaaS tool, or a data source LangChain does not have a built-in loader for. Building a custom loader is straightforward — subclass BaseLoader.

from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
import requests
from typing import Iterator

class HackerNewsLoader(BaseLoader):
    """Loads top stories from Hacker News API."""

    def __init__(self, num_stories: int = 10):
        self.num_stories = num_stories
        self.base_url = "https://hacker-news.firebaseio.com/v0"

    def lazy_load(self) -> Iterator[Document]:
        """Yields documents one at a time (memory efficient)."""
        # Get top story IDs
        response = requests.get(f"{self.base_url}/topstories.json", timeout=10)
        response.raise_for_status()
        story_ids = response.json()[:self.num_stories]

        for story_id in story_ids:
            story_url = f"{self.base_url}/item/{story_id}.json"
            story_resp = requests.get(story_url, timeout=10)

            if story_resp.status_code != 200:
                continue

            story = story_resp.json()

            # The API can return null for a missing item; skip those along
            # with non-story items (jobs, polls, comments)
            if not story or story.get("type") != "story":
                continue

            content = f"Title: {story.get('title', '')}\n"
            content += f"URL: {story.get('url', 'no URL')}\n"
            content += f"Score: {story.get('score', 0)}\n"
            content += f"Comments: {story.get('descendants', 0)}\n"

            yield Document(
                page_content=content,
                metadata={
                    "story_id": story_id,
                    "source": story.get("url", f"hn_{story_id}"),
                    "author": story.get("by", "unknown"),
                    "score": story.get("score", 0),
                    "time": story.get("time", 0),
                    "type": "hackernews",
                }
            )

    def load(self) -> list[Document]:
        return list(self.lazy_load())


# Usage
hn_loader = HackerNewsLoader(num_stories=20)
hn_docs = hn_loader.load()
print(f"Loaded {len(hn_docs)} HN stories")

for doc in hn_docs[:3]:
    print(f"\n{doc.page_content}")

Async Custom Loader

For production workloads that call high-latency APIs, fetch the data asynchronously:

import asyncio
import aiohttp
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document

class AsyncNewsAPILoader(BaseLoader):
    """Loads news articles from NewsAPI asynchronously."""

    def __init__(self, api_key: str, query: str, num_articles: int = 20):
        self.api_key = api_key
        self.query = query
        self.num_articles = num_articles

    async def _fetch_articles(self) -> list[dict]:
        url = "https://newsapi.org/v2/everything"
        params = {
            "q": self.query,
            "pageSize": self.num_articles,
            "sortBy": "publishedAt",
            "apiKey": self.api_key,
        }

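        async with aiohttp.ClientSession() as session:
            async with session.get(url, params=params) as resp:
                resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
                data = await resp.json()
                return data.get("articles", [])

    def lazy_load(self):
        # asyncio.run() raises RuntimeError if an event loop is already running
        # (for example inside Jupyter), so call this from synchronous code.
        articles = asyncio.run(self._fetch_articles())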
        for article in articles:
            content = f"{article.get('title', '')}\n\n{article.get('description', '')}\n\n{article.get('content', '')}"
            yield Document(
                page_content=content,
                metadata={
                    "source": article.get("url", ""),
                    "author": article.get("author", "unknown"),
                    "published_at": article.get("publishedAt", ""),
                    "source_name": article.get("source", {}).get("name", ""),
                    "type": "news_article",
                }
            )

    def load(self) -> list[Document]:
        return list(self.lazy_load())
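
The loader above issues a single request, so on its own the async wrapper gains little. Async pays off when many requests are in flight at once, as with the per-story fetches in the Hacker News loader. The sketch below shows that pattern with `asyncio.gather` and a semaphore to cap concurrency. `fetch_item` is a stand-in that simulates latency with `asyncio.sleep`. In a real loader you would replace it with an aiohttp request.

```python
import asyncio


async def fetch_item(item_id: int) -> dict:
    """Stand-in for one slow API call; swap in a real aiohttp request."""
    await asyncio.sleep(0.1)  # simulated network latency
    return {"id": item_id, "title": f"Item {item_id}"}


async def fetch_all(item_ids: list[int], max_concurrency: int = 5) -> list[dict]:
    """Fetch many items concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(item_id: int) -> dict:
        async with semaphore:
            return await fetch_item(item_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded_fetch(i) for i in item_ids))


items = asyncio.run(fetch_all(list(range(10))))
print(f"Fetched {len(items)} items")
```

The semaphore keeps the loader from opening an unbounded number of connections, which matters when the upstream API enforces rate limits.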

Loader Comparison Table

Loader	Auth Required	Output Format	Rate Limits	Best For
S3FileLoader	AWS IAM credentials	Raw file content	None (S3 billing applies)	Enterprise docs, PDFs, files
S3DirectoryLoader	AWS IAM credentials	Multiple raw files	None	Batch loading from S3 prefixes
NotionDirectoryLoader	None (local export)	Markdown	None	Offline processing of Notion exports
NotionDBLoader	Notion integration token	Structured text	3 req/sec	Live Notion database queries
YoutubeLoader	None (public videos)	Transcript text	Soft limits	Video content indexing
TwitterTweetLoader	Bearer token	Tweet text	Monthly read cap varies by X API tier	Social media monitoring
Custom BaseLoader	Depends on API	Whatever you build	Depends on API	Any custom source

Putting It All Together: Multi-Source Ingestion Pipeline

Here is a pipeline that loads from multiple sources and prepares everything for indexing:

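from dotenv import load_dotenv
from langchain_community.document_loaders import NotionDirectoryLoader, YoutubeLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# safe_load_s3_directory is the helper defined in the S3 error-handling section above

load_dotenv()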

def load_all_sources() -> list:
    """Load documents from S3, Notion, and YouTube."""
    all_docs = []

    # Load from S3
    print("Loading from S3...")
    s3_docs = safe_load_s3_directory("my-docs", "public/")
    for doc in s3_docs:
        doc.metadata["source_platform"] = "s3"
    all_docs.extend(s3_docs)

    # Load from Notion export
    print("Loading from Notion...")
    notion_loader = NotionDirectoryLoader("./notion-export/")
    notion_docs = notion_loader.load()
    for doc in notion_docs:
        doc.metadata["source_platform"] = "notion"
    all_docs.extend(notion_docs)

    # Load YouTube transcripts
    print("Loading YouTube transcripts...")
    video_urls = [
        "https://www.youtube.com/watch?v=VIDEO_ID_1",
    ]
    for url in video_urls:
        try:
            yt_loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
            yt_docs = yt_loader.load()
            for doc in yt_docs:
                doc.metadata["source_platform"] = "youtube"
            all_docs.extend(yt_docs)
        except Exception as e:
            print(f"YouTube load failed: {e}")

    print(f"\nTotal documents loaded: {len(all_docs)}")
    return all_docs

def build_vector_store(documents: list):
    """Split, embed, and store documents in Chroma."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
        collection_name="multi_source_docs",
    )
    # With persist_directory set, Chroma 0.4+ writes to disk automatically,
    # so no explicit persist() call is needed
    print(f"Stored {len(chunks)} chunks in Chroma")
    return vectorstore

if __name__ == "__main__":
    docs = load_all_sources()
    vs = build_vector_store(docs)
    print("Vector store ready for querying")
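
When several sources feed one index, the same text can arrive twice, for example a policy that lives both in S3 and in a Notion page. A minimal sketch of content-hash deduplication that could run before `build_vector_store` follows. `Doc` is a stand-in with the same two attributes as a LangChain Document so the example runs without LangChain, and `dedupe_by_content` is a hypothetical helper. The normalization (whitespace collapse plus lower-casing) is one simple choice, not the only one.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(documents: list) -> list:
    """Keep the first document for each distinct normalized text."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        # Collapse whitespace and lower-case so trivial differences match
        normalized = " ".join(doc.page_content.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(doc)
    return unique


sample = [
    Doc("Refund policy: 30 days.", {"source_platform": "s3"}),
    Doc("refund  policy: 30 days.", {"source_platform": "notion"}),
    Doc("Shipping takes 5 days.", {"source_platform": "notion"}),
]
unique_docs = dedupe_by_content(sample)
print(f"{len(sample)} documents -> {len(unique_docs)} after dedup")
```

Exact hashing only catches identical text after normalization. Near-duplicates with small edits would need a fuzzier comparison.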

What to Do After Loading

Loading is step one. After you have documents in a vector store, you need good retrieval. The vector database guide covers choosing between Chroma, Pinecone, and Weaviate for different scale requirements. For building a full Q&A system on top of loaded documents, the semantic search tutorial shows how to wire retrieval into a chat interface.

If your data pipeline involves custom sources with complex auth flows, the OpenAI API integration guide has relevant patterns for API key management and retry logic.

Conclusion

Document loaders are the entry point of any RAG or document-processing pipeline. The five loaders in this guide — S3, Notion, YouTube, Twitter, and custom API — cover the sources I run into most often in real projects. Each one has quirks: S3 needs proper IAM permissions, Notion export is better than the API for bulk loads, YouTube only works for public videos, and Twitter's rate limits require batching.

The custom loader pattern is worth learning even if you do not need it today. Real projects almost always have at least one data source that does not have a built-in loader, and having a clean pattern for building your own makes that a 30-minute task instead of a research project.

Start with whatever source your actual data lives in, get it loading cleanly, and then layer on the splitting and embedding pipeline. Questions about a specific loader or data source not covered here? Drop a comment below.

FAQs

Can I load private YouTube videos with LangChain's YoutubeLoader? No. YoutubeLoader uses the youtube-transcript-api library which only accesses publicly available transcripts. For private videos you would need to download the transcript manually or use the YouTube Data API with OAuth authentication and then load the text as a plain document.

What is the best way to handle large S3 buckets with thousands of files? Use S3DirectoryLoader with a prefix to narrow the scope, then process files in batches using Python's concurrent.futures or asyncio. Avoid loading all documents into memory at once — stream them through your processing pipeline and index incrementally.

How do I handle Notion pages that have nested child pages? NotionDirectoryLoader only reads what is in the export, so child pages must be included when you export. Tick "Include subpages" in Notion's export dialog (Settings > Export content), which produces a nested folder structure. Point NotionDirectoryLoader at the root of that export folder; it picks up Markdown files in subfolders as well.
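
The batching advice for large S3 buckets can be sketched with a small generator that groups any stream of documents into fixed-size batches. It works with any iterator, such as a custom loader's `lazy_load()`. Whether a built-in loader truly streams depends on its implementation, so treat that part as an assumption to verify. `batched` and `fake_lazy_load` are hypothetical names used for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size items without materializing the input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    """Stand-in for a loader that yields documents one at a time."""
    for i in range(7):
        yield f"document {i}"


batch_sizes = []
for batch in batched(fake_lazy_load(), batch_size=3):
    # In a real pipeline: split, embed, and index this batch, then move on
    batch_sizes.append(len(batch))
print(batch_sizes)  # prints [3, 3, 1]
```

Only one batch is held in memory at a time, so peak memory depends on the batch size rather than the size of the bucket.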


Langchain

5 LangChain Document Loaders: S3, Notion, YouTube, Twitter

⚡ Quick Answer

A practical guide to LangChain's S3FileLoader, NotionDirectoryLoader, YoutubeLoader, TwitterTweetLoader, and building custom API loaders with real code examples.

AiTechWorlds Team May 31, 2026 11 min read

#LangChain #document loaders #S3 #Notion #YouTube #RAG

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Setup

pip install langchain langchain-community langchain-openai \
    boto3 \
    notion-client \
    youtube-transcript-api pytube \
    tweepy \
    python-dotenv

Create a .env file:

OPENAI_API_KEY=your_openai_key
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
AWS_DEFAULT_REGION=us-east-1
NOTION_API_KEY=secret_your_notion_key
TWITTER_BEARER_TOKEN=your_bearer_token

Loader 1: S3FileLoader and S3DirectoryLoader

Loading a Single File

from dotenv import load_dotenv
from langchain_community.document_loaders import S3FileLoader

load_dotenv()

# Load a single PDF from S3
loader = S3FileLoader(
    bucket="my-company-docs",
    key="policies/employee-handbook-2026.pdf"
)

documents = loader.load()
print(f"Loaded {len(documents)} document(s)")
print(f"Content preview: {documents[0].page_content[:300]}")
print(f"Metadata: {documents[0].metadata}")

Loading an Entire Prefix

from langchain_community.document_loaders import S3DirectoryLoader

# Load all files under a specific prefix
loader = S3DirectoryLoader(
    bucket="my-company-docs",
    prefix="contracts/2026/",
)

documents = loader.load()
print(f"Loaded {len(documents)} documents from S3 prefix")

# Group by source file
from collections import defaultdict
by_source = defaultdict(list)
for doc in documents:
    by_source[doc.metadata.get("source", "unknown")].append(doc)

print(f"Unique files loaded: {len(by_source)}")

Error Handling for S3

S3 loaders fail silently or throw botocore exceptions that are not always descriptive. Wrap them properly:

import botocore.exceptions

def safe_load_s3_directory(bucket: str, prefix: str) -> list:
    loader = S3DirectoryLoader(bucket=bucket, prefix=prefix)
    try:
        docs = loader.load()
        return docs
    except botocore.exceptions.ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "NoSuchBucket":
            print(f"Bucket '{bucket}' does not exist")
        elif error_code == "AccessDenied":
            print(f"Access denied to bucket '{bucket}'")
        else:
            print(f"S3 error: {error_code} — {e}")
        return []
    except Exception as e:
        print(f"Unexpected error loading S3: {e}")
        return []

docs = safe_load_s3_directory("my-company-docs", "policies/")

Loader 2: NotionDirectoryLoader

Exporting From Notion

In Notion, go to Settings > Export content > Export all workspace content. Choose Markdown & CSV format and tick "Include subpages." This creates a ZIP file — extract it to a local folder.

from langchain_community.document_loaders import NotionDirectoryLoader

# Point to the extracted export folder
loader = NotionDirectoryLoader(path="./notion-export/")

documents = loader.load()
print(f"Loaded {len(documents)} Notion pages")

# Inspect a document
for doc in documents[:3]:
    print(f"\n--- {doc.metadata.get('source', 'unknown')} ---")
    print(doc.page_content[:400])

Using the Notion API Directly

If you want live data from Notion (not exports), use NotionDBLoader with the official API:

import os
from langchain_community.document_loaders import NotionDBLoader

# You need a Notion integration token and the database ID
loader = NotionDBLoader(
    integration_token=os.getenv("NOTION_API_KEY"),
    database_id="your-notion-database-id-here",
    request_timeout_sec=30,
)

documents = loader.load()
print(f"Loaded {len(documents)} pages from Notion database")

for doc in documents:
    title = doc.metadata.get("title", "Untitled")
    print(f"Page: {title} | Length: {len(doc.page_content)} chars")

To find your database ID, open the database in Notion and copy the ID from the URL: https://notion.so/workspace/{DATABASE_ID}?v=...

Filtering and Processing Notion Documents

# Filter out empty pages (Notion exports often include empty template pages)
non_empty_docs = [
    doc for doc in documents
    if len(doc.page_content.strip()) > 100
]

print(f"Non-empty pages: {len(non_empty_docs)} / {len(documents)}")

# Add custom metadata
for doc in non_empty_docs:
    doc.metadata["source_type"] = "notion"
    doc.metadata["char_count"] = len(doc.page_content)

Loader 3: YoutubeLoader

YouTube videos contain a huge amount of useful content locked in audio form. The YoutubeLoader extracts transcripts (auto-generated or manually added) and gives you text you can index and search.

from langchain_community.document_loaders import YoutubeLoader

# Load a single video's transcript
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    add_video_info=True,    # includes title, author, views, publish_date
    language=["en"],        # prefer English transcripts
)

documents = loader.load()

print(f"Video title: {documents[0].metadata.get('title')}")
print(f"Author: {documents[0].metadata.get('author')}")
print(f"Transcript length: {len(documents[0].page_content)} chars")
print(f"\nFirst 500 chars:\n{documents[0].page_content[:500]}")

Loading Multiple Videos

from langchain_community.document_loaders import YoutubeLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import time

video_urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
    "https://www.youtube.com/watch?v=VIDEO_ID_3",
]

all_docs = []
failed = []

for url in video_urls:
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=True,
            language=["en", "en-US"],
        )
        docs = loader.load()
        all_docs.extend(docs)
        print(f"Loaded: {docs[0].metadata.get('title', url)}")
        time.sleep(1)   # be polite to the API
    except Exception as e:
        print(f"Failed to load {url}: {e}")
        failed.append(url)

print(f"\nLoaded {len(all_docs)} videos, {len(failed)} failed")

# Split transcripts into chunks for indexing
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")

Transcripts in Other Languages

# Load with language fallback
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=SOME_VIDEO",
    add_video_info=True,
    language=["en", "en-GB", "en-US"],   # try in order
    translation="en",                     # translate if needed
)

Loader 4: TwitterTweetLoader

Twitter/X thread content is genuinely useful for tracking discussions, sentiment, and domain knowledge. The TwitterTweetLoader uses the Tweepy API to pull tweets.

import os
import tweepy
from langchain_community.document_loaders import TwitterTweetLoader

# Create a Tweepy client
# Get your Bearer Token from developer.twitter.com
client = tweepy.Client(bearer_token=os.getenv("TWITTER_BEARER_TOKEN"))

# Load tweets from a specific user
loader = TwitterTweetLoader.from_bearer_token(
    oauth2_bearer_token=os.getenv("TWITTER_BEARER_TOKEN"),
    twitter_users=["LangChainAI", "OpenAI"],
    number_tweets=50,       # tweets to load per user
)

documents = loader.load()
print(f"Loaded {len(documents)} tweets")

for doc in documents[:3]:
    print(f"\n--- Tweet ---")
    print(doc.page_content[:200])
    print(f"Metadata: {doc.metadata}")

Searching Tweets by Keyword

The default loader pulls from user timelines. For keyword search, combine Tweepy directly with manual document creation:

from langchain_core.documents import Document
import tweepy

def search_tweets_as_documents(
    query: str,
    max_results: int = 100,
    bearer_token: str = None
) -> list[Document]:
    """Search tweets and convert to LangChain Documents."""
    client = tweepy.Client(bearer_token=bearer_token)

    response = client.search_recent_tweets(
        query=f"{query} -is:retweet lang:en",
        max_results=min(max_results, 100),
        tweet_fields=["created_at", "author_id", "public_metrics"],
    )

    if not response.data:
        return []

    documents = []
    for tweet in response.data:
        doc = Document(
            page_content=tweet.text,
            metadata={
                "tweet_id": str(tweet.id),
                "author_id": str(tweet.author_id),
                "created_at": str(tweet.created_at),
                "likes": tweet.public_metrics.get("like_count", 0),
                "retweets": tweet.public_metrics.get("retweet_count", 0),
                "source": "twitter",
            }
        )
        documents.append(doc)

    return documents

# Usage
ai_docs = search_tweets_as_documents(
    query="LangChain RAG",
    max_results=50,
    bearer_token=os.getenv("TWITTER_BEARER_TOKEN")
)
print(f"Found {len(ai_docs)} relevant tweets")
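
The like and retweet counts stored in metadata can be used to drop low-signal tweets before indexing. The following is a minimal sketch, not part of LangChain: filter_by_engagement and its thresholds are hypothetical, and SimpleNamespace objects stand in for Documents so the snippet runs without API access.

```python
from types import SimpleNamespace


def filter_by_engagement(documents, min_likes=5, min_retweets=0):
    """Keep documents whose metadata meets both engagement thresholds."""
    kept = []
    for doc in documents:
        likes = doc.metadata.get("likes", 0)
        retweets = doc.metadata.get("retweets", 0)
        if likes >= min_likes and retweets >= min_retweets:
            kept.append(doc)
    return kept


# Stand-in objects exposing the same attributes as LangChain Documents
sample = [
    SimpleNamespace(page_content="RAG tip", metadata={"likes": 12, "retweets": 3}),
    SimpleNamespace(page_content="low signal", metadata={"likes": 0, "retweets": 0}),
]

popular = filter_by_engagement(sample, min_likes=5)
print([doc.page_content for doc in popular])
```

The same function should work on the list returned by search_tweets_as_documents, since it only reads the metadata dictionary.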

Loader 5: Custom API Loader

Sometimes you need to load from an internal API, a SaaS tool, or another source that LangChain has no built-in loader for. Building a custom loader is straightforward: subclass BaseLoader and implement lazy_load.

from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
import requests
from typing import Iterator

class HackerNewsLoader(BaseLoader):
    """Loads top stories from Hacker News API."""

    def __init__(self, num_stories: int = 10):
        self.num_stories = num_stories
        self.base_url = "https://hacker-news.firebaseio.com/v0"

    def lazy_load(self) -> Iterator[Document]:
        """Yields documents one at a time (memory efficient)."""
        # Get top story IDs
        response = requests.get(f"{self.base_url}/topstories.json", timeout=10)
        response.raise_for_status()
        story_ids = response.json()[:self.num_stories]

        for story_id in story_ids:
            story_url = f"{self.base_url}/item/{story_id}.json"
            story_resp = requests.get(story_url, timeout=10)

            if story_resp.status_code != 200:
                continue

            story = story_resp.json()

            # The API returns null for missing items; also skip non-story
            # items such as jobs and polls (Ask HN posts are type "story")
            if not story or story.get("type") != "story":
                continue

            content = f"Title: {story.get('title', '')}\n"
            content += f"URL: {story.get('url', 'no URL')}\n"
            content += f"Score: {story.get('score', 0)}\n"
            content += f"Comments: {story.get('descendants', 0)}\n"

            yield Document(
                page_content=content,
                metadata={
                    "story_id": story_id,
                    "source": story.get("url", f"hn_{story_id}"),
                    "author": story.get("by", "unknown"),
                    "score": story.get("score", 0),
                    "time": story.get("time", 0),
                    "type": "hackernews",
                }
            )

    def load(self) -> list[Document]:
        return list(self.lazy_load())


# Usage
hn_loader = HackerNewsLoader(num_stories=20)
hn_docs = hn_loader.load()
print(f"Loaded {len(hn_docs)} HN stories")

for doc in hn_docs[:3]:
    print(f"\n{doc.page_content}")
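
Because lazy_load yields one document at a time, downstream steps can consume it in fixed-size batches instead of building the full list first. A minimal sketch using only the standard library follows; in_batches is a hypothetical helper, and fake_lazy_load stands in for a real loader's lazy_load().

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    """Stand-in generator; a real call would be hn_loader.lazy_load()."""
    for i in range(5):
        yield f"doc-{i}"


for batch in in_batches(fake_lazy_load(), batch_size=2):
    print(batch)  # each batch could be embedded and stored before the next is fetched
```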

Async Custom Loader

For production workloads against high-latency APIs, fetch with async I/O. This example needs aiohttp, which the setup step above does not install:

import asyncio
import aiohttp
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document

class AsyncNewsAPILoader(BaseLoader):
    """Loads news articles from NewsAPI asynchronously."""

    def __init__(self, api_key: str, query: str, num_articles: int = 20):
        self.api_key = api_key
        self.query = query
        self.num_articles = num_articles

    async def _fetch_articles(self) -> list[dict]:
        url = "https://newsapi.org/v2/everything"
        params = {
            "q": self.query,
            "pageSize": self.num_articles,
            "sortBy": "publishedAt",
            "apiKey": self.api_key,
        }

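        async with aiohttp.ClientSession() as session:
            async with session.get(url, params=params) as resp:
                resp.raise_for_status()  # surface HTTP errors such as 401 or 429
                data = await resp.json()
                return data.get("articles", [])

    def lazy_load(self):
        # asyncio.run() starts a new event loop, so it raises RuntimeError when
        # called from code that is already running one (for example, a notebook)
        articles = asyncio.run(self._fetch_articles())
        for article in articles:
            # NewsAPI can return null for missing fields, so fall back to ""
            parts = [article.get(key) or "" for key in ("title", "description", "content")]
            content = "\n\n".join(parts)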
            yield Document(
                page_content=content,
                metadata={
                    "source": article.get("url", ""),
                    "author": article.get("author", "unknown"),
                    "published_at": article.get("publishedAt", ""),
                    "source_name": article.get("source", {}).get("name", ""),
                    "type": "news_article",
                }
            )

    def load(self) -> list[Document]:
        return list(self.lazy_load())
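
Async loading pays off most when many requests can be in flight at once. The sketch below shows one common pattern, bounded concurrency with asyncio.gather and a semaphore. It is illustrative only: fetch_item is a hypothetical stand-in that simulates latency with asyncio.sleep rather than calling a real API.

```python
import asyncio


async def fetch_item(item_id: int) -> dict:
    """Hypothetical stand-in for one HTTP request."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"id": item_id, "title": f"Item {item_id}"}


async def fetch_all(item_ids, max_concurrency: int = 5) -> list:
    """Fetch all items concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(item_id):
        async with semaphore:
            return await fetch_item(item_id)

    # gather returns results in the same order as the input ids
    return await asyncio.gather(*(bounded_fetch(i) for i in item_ids))


results = asyncio.run(fetch_all(range(10), max_concurrency=3))
print(len(results))
```

In a real loader, the body of fetch_item would be an aiohttp request, and the semaphore limit would be chosen to stay under the API's rate limit.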

Loader Comparison Table

Loader	Auth Required	Output Format	Rate Limits	Best For
S3FileLoader	AWS IAM credentials	Raw file content	None (S3 billing applies)	Enterprise docs, PDFs, files
S3DirectoryLoader	AWS IAM credentials	Multiple raw files	None	Batch loading from S3 prefixes
NotionDirectoryLoader	None (local export)	Markdown	None	Offline processing of Notion exports
NotionDBLoader	Notion integration token	Structured text	3 req/sec	Live Notion database queries
YoutubeLoader	None (public videos)	Transcript text	Soft limits	Video content indexing
TwitterTweetLoader	Bearer token	Tweet text	Monthly read cap set by your X API tier	Social media monitoring
Custom BaseLoader	Depends on API	Whatever you build	Depends on API	Any custom source
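
For sources with a per-second limit, such as the 3 req/sec listed for NotionDBLoader, client-side throttling keeps a loader under the cap. The class below is a minimal sketch under that assumption. RateLimiter is hypothetical and not part of LangChain; the clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
import time


class RateLimiter:
    """Spaces calls evenly so at most max_per_second start each second."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot one interval after this call's start time
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


limiter = RateLimiter(max_per_second=3)
for page in range(3):
    limiter.wait()
    # the API request for this page would go here
```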

Putting It All Together: Multi-Source Ingestion Pipeline

Here is a pipeline that loads from multiple sources and prepares everything for indexing:

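from dotenv import load_dotenv
from langchain_community.document_loaders import NotionDirectoryLoader, YoutubeLoader
from langchain_community.vectorstores import Chroma  # requires: pip install chromadb
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

# safe_load_s3_directory is assumed to be the helper from the S3 error-handling section above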

def load_all_sources() -> list:
    """Load documents from S3, Notion, and YouTube."""
    all_docs = []

    # Load from S3
    print("Loading from S3...")
    s3_docs = safe_load_s3_directory("my-docs", "public/")
    for doc in s3_docs:
        doc.metadata["source_platform"] = "s3"
    all_docs.extend(s3_docs)

    # Load from Notion export
    print("Loading from Notion...")
    notion_loader = NotionDirectoryLoader("./notion-export/")
    notion_docs = notion_loader.load()
    for doc in notion_docs:
        doc.metadata["source_platform"] = "notion"
    all_docs.extend(notion_docs)

    # Load YouTube transcripts
    print("Loading YouTube transcripts...")
    video_urls = [
        "https://www.youtube.com/watch?v=VIDEO_ID_1",
    ]
    for url in video_urls:
        try:
            yt_loader = YoutubeLoader.from_youtube_url(url, add_video_info=True)
            yt_docs = yt_loader.load()
            for doc in yt_docs:
                doc.metadata["source_platform"] = "youtube"
            all_docs.extend(yt_docs)
        except Exception as e:
            print(f"YouTube load failed: {e}")

    print(f"\nTotal documents loaded: {len(all_docs)}")
    return all_docs

def build_vector_store(documents: list):
    """Split, embed, and store documents in Chroma."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        separators=["\n\n", "\n", ". ", " "],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db",
        collection_name="multi_source_docs",
    )
    # With chromadb 0.4+, persist_directory saves automatically; the older persist() call is deprecated
    print(f"Stored {len(chunks)} chunks in Chroma")
    return vectorstore

if __name__ == "__main__":
    docs = load_all_sources()
    vs = build_vector_store(docs)
    print("Vector store ready for querying")
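
The pipeline above wraps only the YouTube step in try/except, so an S3 or Notion failure would abort the whole run. One possible refinement, sketched here with hypothetical names, is to treat every source as a callable and collect errors per source. SimpleNamespace objects stand in for Documents so the snippet runs on its own.

```python
from types import SimpleNamespace


def load_from_sources(sources):
    """Run each loader callable; return (documents, errors keyed by source name)."""
    all_docs, errors = [], {}
    for name, load_fn in sources.items():
        try:
            docs = load_fn()
        except Exception as exc:  # broad on purpose: one bad source should not stop the rest
            errors[name] = str(exc)
            continue
        for doc in docs:
            doc.metadata["source_platform"] = name
        all_docs.extend(docs)
    return all_docs, errors


def load_notion():
    """Stand-in for NotionDirectoryLoader(...).load()."""
    return [SimpleNamespace(page_content="Meeting notes", metadata={})]


def load_youtube():
    """Stand-in that simulates a failed transcript fetch."""
    raise RuntimeError("transcript unavailable")


docs, errors = load_from_sources({"notion": load_notion, "youtube": load_youtube})
print(f"Loaded {len(docs)} documents, {len(errors)} sources failed")
```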

What to Do After Loading

If your data pipeline involves custom sources with complex auth flows, the OpenAI API integration guide has relevant patterns for API key management and retry logic.
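
As a rough illustration of the retry logic mentioned here, the sketch below retries a callable with exponential backoff. It is a generic pattern rather than code from that guide: retry_with_backoff, its defaults, and the flaky_fetch stand-in are all hypothetical.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * (2 ** attempt))


calls = {"count": 0}
delays = []


def flaky_fetch():
    """Stand-in for an API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of sleeping so the demo finishes instantly
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In production code you would usually catch only the exceptions that signal a transient failure, such as timeouts or HTTP 429 responses, rather than every Exception.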

Conclusion

FAQs
