Build a LangChain Agent That Summarizes YouTube Videos
Build a full LangChain agent that loads YouTube transcripts, falls back to Whisper, and summarizes long videos with MapReduceDocumentsChain and GPT-4o.
I was building a research assistant for a podcast host who needed to process two to three hours of video content per day. The manual workflow — watch the video, take notes, write a brief — was taking four to five hours. I built a LangChain agent to handle it, and now the whole pipeline runs in under two minutes.
This post builds that agent from scratch. We'll go from transcript loading to a complete deployable system with a Whisper fallback for videos without captions, MapReduceDocumentsChain for handling long videos, and a tool-using agent that can accept either a URL or a video ID and return a structured summary.
The full agent at the end is production-ready. You can drop it into a FastAPI endpoint, a Slack bot, or a scheduled job.
For background on the broader agent framework this fits into, the build AI agent with LangChain post covers the foundational patterns.
The Architecture Before Any Code
The summarization pipeline has four stages:
- Transcript acquisition: Try YouTube's built-in caption API via YoutubeLoader. If that fails (no captions, language unsupported, age restriction), fall back to downloading the audio and running Whisper.
- Chunking: Split the transcript into overlapping chunks that fit within the LLM's context window.
- Summarization: Use MapReduceDocumentsChain to summarize each chunk independently (map), then combine chunk summaries into a final output (reduce).
- Agent orchestration: Wrap the pipeline as a tool that a LangChain agent can call based on user input.
This design handles videos of any length. A 10-minute tutorial and a 3-hour conference keynote go through the same pipeline.
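The fallback logic in the first stage can be sketched independently of any library. This is a minimal illustration under stated assumptions, not the loader built later in the post: `acquire_transcript`, `fetch_captions`, and `transcribe_audio` are hypothetical names, and the two callables stand in for YoutubeLoader and Whisper.

```python
from typing import Callable, Tuple


def acquire_transcript(
    url: str,
    fetch_captions: Callable[[str], str],
    transcribe_audio: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (transcript, source), preferring captions over transcription."""
    try:
        text = fetch_captions(url)
        # Treat a blank caption track the same as a missing one.
        if text.strip():
            return text, "youtube_captions"
    except Exception as exc:
        print(f"Captions unavailable ({exc}); falling back to transcription")
    # Hypothetical fallback: any audio-to-text function can be plugged in here.
    return transcribe_audio(url), "whisper"
```

Passing the two steps in as callables keeps the control flow testable without network access or a Whisper install.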
Comparison: Summarization Strategies
| Strategy | Speed | Quality | Max Length | Cost |
|---|---|---|---|---|
| stuff chain | Fastest | Excellent | ~16k tokens | Low |
| map_reduce chain | Fast (parallel map) | Very Good | Unlimited | Medium |
| refine chain | Slow (sequential) | Excellent | Unlimited | Medium-High |
| map_rerank chain | Medium | Good (ranks chunks) | Unlimited | Medium |
| Rolling window (custom) | Medium | Good | Unlimited | Medium |
For YouTube videos, map_reduce wins on most tradeoffs: it parallelizes the per-chunk summarization, handles unlimited transcript length, and produces quality summaries. Use stuff only for short videos (under 20 minutes) where the full transcript fits in one context window.
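One way to make that choice automatic is to estimate the transcript's token count and compare it with a budget. The sketch below is an approximation: `estimate_tokens` and `pick_chain_type` are hypothetical helpers, the 4-characters-per-token ratio is a rough heuristic for English text, and the 16,000-token default simply mirrors the figure in the table above.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real tokenizers vary with language and content."""
    return int(len(text) / chars_per_token)


def pick_chain_type(transcript: str, stuff_budget_tokens: int = 16_000) -> str:
    """Choose 'stuff' when the whole transcript fits the budget, else 'map_reduce'."""
    if estimate_tokens(transcript) <= stuff_budget_tokens:
        return "stuff"
    return "map_reduce"
```

Leave headroom in the budget for the prompt template and the generated summary, since both count against the same context window.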
Installation
pip install langchain langchain-community langchain-openai
pip install youtube-transcript-api pytube
pip install openai-whisper # For Whisper fallback
pip install yt-dlp # Better than pytube for audio download
Optional for the full agent deployment:
pip install fastapi uvicorn python-dotenv
Step 1: Loading YouTube Transcripts with YoutubeLoader
YoutubeLoader from langchain_community wraps the youtube-transcript-api library. It fetches the auto-generated or manual captions from a YouTube video and returns them as a LangChain Document.
from langchain_community.document_loaders import YoutubeLoader
def load_youtube_transcript(
url: str,
language: str = "en",
translation: str = None
) -> list:
"""
Load transcript from a YouTube URL.
Args:
url: Full YouTube URL (a bare video ID is not accepted by from_youtube_url)
language: Transcript language code (e.g., 'en', 'es', 'fr')
translation: Translate to this language if original not available
Returns:
List of Document objects with transcript text
"""
loader = YoutubeLoader.from_youtube_url(
url,
add_video_info=True, # Includes title, author, length in metadata
language=[language],
translation=translation
)
documents = loader.load()
if not documents:
raise ValueError(f"No transcript found for: {url}")
print(f"Loaded transcript: {len(documents[0].page_content)} characters")
print(f"Video: {documents[0].metadata.get('title', 'Unknown')}")
print(f"Author: {documents[0].metadata.get('author', 'Unknown')}")
print(f"Length: {documents[0].metadata.get('length', 0)} seconds")
return documents
# Try it
try:
docs = load_youtube_transcript(
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
language="en"
)
print("Transcript loaded successfully")
except Exception as e:
print(f"Transcript loading failed: {e}")
The add_video_info=True flag pulls in the video title, author, length, and thumbnail URL. YoutubeLoader fetches these through pytube, which is why pytube appears in the install list. This metadata is useful for generating attribution in summaries.
Step 2: Whisper Fallback for Videos Without Captions
Many YouTube videos — especially technical talks, short clips, and older content — don't have captions. When YoutubeLoader fails, we need to download the audio and transcribe it locally with Whisper.
import os
import tempfile
import whisper
import yt_dlp
def download_audio(url: str, output_dir: str = None) -> str:
"""
Download audio from a YouTube URL using yt-dlp.
Returns the path to the downloaded audio file.
"""
if output_dir is None:
output_dir = tempfile.mkdtemp()
output_template = os.path.join(output_dir, "%(id)s.%(ext)s")
ydl_opts = {
"format": "bestaudio/best",
"outtmpl": output_template,
"postprocessors": [{
"key": "FFmpegExtractAudio",
"preferredcodec": "mp3",
"preferredquality": "192",
}],
"quiet": True,
"no_warnings": True,
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(url, download=True)
video_id = info.get("id", "audio")
audio_path = os.path.join(output_dir, f"{video_id}.mp3")
print(f"Audio downloaded: {audio_path}")
return audio_path
def transcribe_with_whisper(
audio_path: str,
model_size: str = "base",
language: str = None
) -> str:
"""
Transcribe audio file using OpenAI Whisper.
model_size options: tiny, base, small, medium, large
Larger models are more accurate but slower and use more memory.
"""
print(f"Loading Whisper model: {model_size}")
model = whisper.load_model(model_size)
print("Transcribing audio...")
transcribe_opts = {"verbose": False}
if language:
transcribe_opts["language"] = language
result = model.transcribe(audio_path, **transcribe_opts)
transcript = result["text"]
detected_language = result.get("language", "unknown")
print(f"Transcription complete: {len(transcript)} characters")
print(f"Detected language: {detected_language}")
return transcript
def load_transcript_with_fallback(
url: str,
language: str = "en",
whisper_model: str = "base",
cleanup_audio: bool = True
) -> tuple:
"""
Load transcript via YoutubeLoader with Whisper fallback.
Returns: (transcript_text, metadata_dict, source)
source is either 'youtube_captions' or 'whisper'
"""
# Try YouTube captions first
try:
loader = YoutubeLoader.from_youtube_url(
url,
add_video_info=True,
language=[language]
)
docs = loader.load()
if docs and docs[0].page_content.strip():
return (
docs[0].page_content,
docs[0].metadata,
"youtube_captions"
)
raise ValueError("Empty transcript returned")
except Exception as e:
print(f"YouTube captions unavailable ({e}). Falling back to Whisper...")
# Whisper fallback
audio_path = None
try:
audio_path = download_audio(url)
transcript_text = transcribe_with_whisper(
audio_path,
model_size=whisper_model,
language=language if language != "en" else None
)
# Extract basic metadata via yt-dlp (no download)
with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
info = ydl.extract_info(url, download=False)
metadata = {
"title": info.get("title", "Unknown"),
"author": info.get("uploader", "Unknown"),
"length": info.get("duration", 0),
"source": url
}
return transcript_text, metadata, "whisper"
finally:
if cleanup_audio and audio_path and os.path.exists(audio_path):
os.remove(audio_path)
print("Audio file cleaned up")
The Whisper base model runs on CPU in reasonable time (2–4× real-time for most hardware). On a GPU machine, medium or large becomes fast enough to be practical and gives better accuracy on technical content with jargon.
Step 3: Chunking the Transcript
Transcripts are one long string with no paragraph breaks, headings, or natural document structure. RecursiveCharacterTextSplitter handles this well when given sentence-ending separators, so chunks break at sentence boundaries wherever possible.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
def chunk_transcript(
transcript_text: str,
metadata: dict,
chunk_size: int = 4000,
chunk_overlap: int = 200
) -> list:
"""
Split a transcript into overlapping chunks suitable for map-reduce.
chunk_size: characters per chunk (4000 chars ≈ 1000 tokens)
chunk_overlap: characters of overlap between adjacent chunks
"""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=[". ", "? ", "! ", "\n", " ", ""]
)
texts = splitter.split_text(transcript_text)
# Wrap in Document objects with metadata
chunks = [
Document(
page_content=text,
metadata={
**metadata,
"chunk_index": i,
"total_chunks": len(texts)
}
)
for i, text in enumerate(texts)
]
print(f"Split transcript into {len(chunks)} chunks")
print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
return chunks
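For a 1-hour video, a transcript typically runs 8,000–12,000 words, or roughly 48,000–72,000 characters at about six characters per word including the space. With chunk_size=4000 and chunk_overlap=200, that produces roughly 13–19 chunks, a comfortable workload for the map step.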
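The chunk count can be estimated before running the splitter. This is a back-of-the-envelope sketch: `estimate_chunk_count` is a hypothetical helper, and the six-characters-per-word average (including the trailing space) is an assumption. The real splitter breaks at separators, so actual chunks are slightly smaller and the true count can be a little higher.

```python
import math


def estimate_chunk_count(
    word_count: int,
    chunk_size: int = 4000,
    chunk_overlap: int = 200,
    chars_per_word: float = 6.0,
) -> int:
    """Estimate how many overlapping chunks a transcript will produce."""
    if word_count <= 0:
        return 0
    total_chars = word_count * chars_per_word
    if total_chars <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - chunk_overlap).
    stride = chunk_size - chunk_overlap
    return math.ceil((total_chars - chunk_overlap) / stride)


print(estimate_chunk_count(8_000))   # 13
print(estimate_chunk_count(12_000))  # 19
```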
Step 4: MapReduceDocumentsChain Summarization
MapReduceDocumentsChain is the right tool for long transcripts. The map step summarizes each chunk independently, and the reduce step combines those partial summaries into the final output.
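Stripped of prompts and models, the pattern is small. The following is a plain-Python sketch of the idea, not LangChain's implementation: `map_reduce_summarize`, `map_fn`, and `reduce_fn` are hypothetical names, and the two callables stand in for the per-chunk and combine LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[str], str],
    max_workers: int = 4,
) -> str:
    """Summarize chunks independently, then combine the partial summaries."""
    if not chunks:
        return ""
    # Map: chunks do not depend on each other, so they can run concurrently.
    # pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_fn, chunks))
    # Reduce: a single call over the joined partial summaries.
    return reduce_fn("\n\n".join(partials))
```

Because the map calls are independent, wall-clock time is bounded by the slowest chunk rather than the sum of all chunks. The LangChain version below does the same thing with prompts and an LLM in place of the callables.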
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
def build_summarization_chain(
model: str = "gpt-4o-mini",
temperature: float = 0,
chain_type: str = "map_reduce",
output_format: str = "standard"
):
"""
Build a summarization chain for YouTube transcripts.
chain_type: 'map_reduce' (long videos), 'stuff' (short videos), 'refine'
output_format: 'standard', 'bullets', 'structured'
"""
llm = ChatOpenAI(model=model, temperature=temperature)
if chain_type == "stuff":
# For short videos where the full transcript fits in context
prompt = PromptTemplate(
template="""You are summarizing a YouTube video transcript.
Transcript:
{text}
Write a clear, informative summary that covers:
1. The main topic and purpose of the video
2. Key points and takeaways
3. Any specific techniques, tools, or conclusions mentioned
Summary:""",
input_variables=["text"]
)
return load_summarize_chain(
llm,
chain_type="stuff",
prompt=prompt
)
# Map-reduce for long videos
map_prompt = PromptTemplate(
template="""Summarize this section of a YouTube video transcript.
Be concise but capture all important points, facts, and conclusions.
Transcript section:
{text}
Section summary:""",
input_variables=["text"]
)
if output_format == "bullets":
combine_prompt = PromptTemplate(
template="""You are combining section summaries from a YouTube video into a final summary.
Section summaries:
{text}
Write a final summary as bullet points grouped by topic.
Format: Start each group with a bold topic heading, followed by 2-4 bullet points.
Final summary:""",
input_variables=["text"]
)
elif output_format == "structured":
combine_prompt = PromptTemplate(
template="""You are combining section summaries from a YouTube video into a structured summary.
Section summaries:
{text}
Write a final summary with these sections:
**Overview**: 2-3 sentences on what the video is about
**Key Points**: numbered list of the 5-7 most important takeaways
**Conclusion**: what the video recommends or concludes
Final summary:""",
input_variables=["text"]
)
else:
combine_prompt = PromptTemplate(
template="""You are combining section summaries from a YouTube video into a final summary.
Section summaries:
{text}
Write a cohesive, well-structured summary covering the video's main points.
Aim for 3-5 paragraphs. Be specific about techniques, tools, or data mentioned.
Final summary:""",
input_variables=["text"]
)
return load_summarize_chain(
llm,
chain_type="map_reduce",
map_prompt=map_prompt,
combine_prompt=combine_prompt,
verbose=False
)
def summarize_transcript(
chunks: list,
model: str = "gpt-4o-mini",
chain_type: str = "map_reduce",
output_format: str = "structured"
) -> str:
"""
Summarize transcript chunks using the specified chain type.
"""
# For very short transcripts, use 'stuff' directly
total_chars = sum(len(c.page_content) for c in chunks)
if total_chars < 6000 and chain_type == "map_reduce":
print("Short transcript detected, using 'stuff' chain")
chain_type = "stuff"
chain = build_summarization_chain(
model=model,
chain_type=chain_type,
output_format=output_format
)
print(f"Summarizing {len(chunks)} chunks with {chain_type} chain...")
result = chain.invoke({"input_documents": chunks})
summary = result.get("output_text", result.get("text", ""))
print(f"Summary generated: {len(summary)} characters")
return summary
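Using gpt-4o-mini for the map step is intentional: it's much cheaper, and the per-chunk summaries don't need the full power of GPT-4o. The combine step is where nuance matters, so I upgrade to gpt-4o there in production. Note that the listing above passes a single model to load_summarize_chain, so both steps run on the same model as written. To split them, pass a second ChatOpenAI instance through the reduce_llm argument of load_summarize_chain. The combine_model parameter in Step 5 is the natural hook for this, but it is not wired through in the listing as shown.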
For more on how chain types work in LangChain, the LangChain tutorial 2025 has a dedicated section on chain architectures.
Step 5: Full Pipeline Function
Before building the agent, let's wrap everything into a single pipeline function:
from typing import Optional
import time
def summarize_youtube_video(
url: str,
language: str = "en",
whisper_model: str = "base",
summarization_model: str = "gpt-4o-mini",
combine_model: str = "gpt-4o",
output_format: str = "structured",
chunk_size: int = 4000,
chunk_overlap: int = 200
) -> dict:
"""
Full pipeline: URL → transcript → chunks → summary.
Returns dict with summary, metadata, timing, and source info.
"""
start_time = time.time()
# Step 1: Load transcript
print(f"\n{'='*50}")
print(f"Processing: {url}")
print(f"{'='*50}")
transcript_text, metadata, source = load_transcript_with_fallback(
url,
language=language,
whisper_model=whisper_model
)
load_time = time.time() - start_time
print(f"Transcript loaded in {load_time:.1f}s (source: {source})")
# Step 2: Chunk transcript
chunks = chunk_transcript(
transcript_text,
metadata,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)
# Step 3: Determine chain type
total_chars = sum(len(c.page_content) for c in chunks)
chain_type = "stuff" if total_chars < 6000 else "map_reduce"
# Step 4: Summarize
summary = summarize_transcript(
chunks,
model=summarization_model,
chain_type=chain_type,
output_format=output_format
)
total_time = time.time() - start_time
return {
"url": url,
"title": metadata.get("title", "Unknown"),
"author": metadata.get("author", "Unknown"),
"duration_seconds": metadata.get("length", 0),
"transcript_source": source,
"num_chunks": len(chunks),
"chain_type": chain_type,
"summary": summary,
"processing_time_seconds": round(total_time, 1)
}
# Test run
if __name__ == "__main__":
result = summarize_youtube_video(
"https://www.youtube.com/watch?v=YOUR_VIDEO_ID",
output_format="structured"
)
print(f"\nVideo: {result['title']}")
print(f"Author: {result['author']}")
print(f"Duration: {result['duration_seconds'] // 60} minutes")
print(f"Processed in: {result['processing_time_seconds']}s")
print(f"\n{'='*50}")
print("SUMMARY:")
print('='*50)
print(result['summary'])
Step 6: Building the Agent
The pipeline is useful as a standalone function, but wrapping it as a LangChain agent tool lets users interact with it naturally — they can ask follow-up questions, request different formats, or ask for specific sections.
from langchain.tools import tool
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from typing import Optional
import re
@tool
def summarize_video(url: str, output_format: str = "structured") -> str:
"""
Summarize a YouTube video given its URL.
Use this tool when the user provides a YouTube URL and asks for a summary,
key points, or overview of the video content.
Args:
url: Full YouTube URL (e.g., https://www.youtube.com/watch?v=...)
output_format: 'standard', 'bullets', or 'structured' (default: structured)
Returns:
Structured summary of the video content
"""
try:
result = summarize_youtube_video(
url=url,
output_format=output_format
)
output = f"""**Video**: {result['title']}
**Author**: {result['author']}
**Duration**: {result['duration_seconds'] // 60} minutes
**Transcript source**: {result['transcript_source']}
{result['summary']}
*Processed in {result['processing_time_seconds']}s using {result['chain_type']} summarization*"""
return output
except Exception as e:
return f"Failed to summarize video: {str(e)}"
@tool
def extract_video_id(url: str) -> str:
"""
Extract the video ID from a YouTube URL.
Use this when the user provides a YouTube URL and you need the video ID.
"""
patterns = [
r"youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})",
r"youtu\.be/([a-zA-Z0-9_-]{11})",
r"youtube\.com/embed/([a-zA-Z0-9_-]{11})",
r"youtube\.com/shorts/([a-zA-Z0-9_-]{11})"
]
for pattern in patterns:
match = re.search(pattern, url)
if match:
return f"Video ID: {match.group(1)}"
return "Could not extract video ID from URL"
def build_youtube_agent(
model: str = "gpt-4o",
enable_memory: bool = True
) -> AgentExecutor:
"""
Build a YouTube summarization agent with tool use.
"""
llm = ChatOpenAI(model=model, temperature=0)
tools = [summarize_video, extract_video_id]
system_message = """You are a helpful assistant that specializes in summarizing
YouTube videos. When given a YouTube URL, you use the summarize_video tool to
generate a comprehensive summary.
You can also:
- Explain specific parts of a video when asked
- Compare key points from multiple videos
- Extract the video ID from a URL
- Generate summaries in different formats (standard, bullets, structured)
Always use the summarize_video tool when the user provides a YouTube URL.
Be helpful and informative in your responses."""
if enable_memory:
prompt = ChatPromptTemplate.from_messages([
("system", system_message),
MessagesPlaceholder("chat_history", optional=True),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad")
])
memory = ConversationBufferMemory(
memory_key="chat_history",
return_messages=True
)
else:
prompt = ChatPromptTemplate.from_messages([
("system", system_message),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad")
])
memory = None
agent = create_openai_tools_agent(llm, tools, prompt)
return AgentExecutor(
agent=agent,
tools=tools,
memory=memory,
verbose=True,
max_iterations=3,
handle_parsing_errors=True
)
# Run the agent
if __name__ == "__main__":
agent = build_youtube_agent(model="gpt-4o")
# Single video summary
response = agent.invoke({
"input": "Please summarize this video: https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
})
print(response["output"])
# Follow-up question (uses memory)
response2 = agent.invoke({
"input": "Can you give me just the bullet points from that video?"
})
print(response2["output"])
The agent uses ConversationBufferMemory so follow-up questions work naturally — the user can say "give me bullet points" after getting a summary, and the agent knows which video they're referring to.
For more on building agents with memory and tool use, the AI agent memory and planning post covers advanced memory patterns.
Step 7: Handling Long Videos with Custom Map-Reduce
For very long videos (3+ hours, 30,000+ word transcripts), the standard MapReduceDocumentsChain produces too many chunk summaries for a good final combine step. A hierarchical map-reduce handles this better:
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.schema import Document
def hierarchical_summarize(
chunks: list,
map_model: str = "gpt-4o-mini",
combine_model: str = "gpt-4o",
intermediate_batch_size: int = 5
) -> str:
"""
Hierarchical map-reduce for very long videos (3+ hours).
Stage 1: Summarize each chunk individually (map)
Stage 2: Combine chunk summaries in batches of intermediate_batch_size into section summaries
Stage 3: Combine section summaries into final summary
"""
fast_llm = ChatOpenAI(model=map_model, temperature=0)
smart_llm = ChatOpenAI(model=combine_model, temperature=0)
map_prompt = PromptTemplate(
template="Summarize this transcript segment in 3-5 sentences:\n\n{text}\n\nSummary:",
input_variables=["text"]
)
batch_combine_prompt = PromptTemplate(
template="""Combine these segment summaries into a coherent section summary (5-8 sentences):
{text}
Section summary:""",
input_variables=["text"]
)
final_combine_prompt = PromptTemplate(
template="""You are writing the final summary of a YouTube video from section summaries.
Section summaries:
{text}
Write a comprehensive, well-structured summary with:
**Overview**: What this video covers and why it matters
**Key Points**: The 5-7 most important takeaways
**Tools/Techniques Mentioned**: Specific tools, frameworks, or methods discussed
**Conclusion**: Main recommendation or conclusion
Final summary:""",
input_variables=["text"]
)
print(f"Stage 1: Summarizing {len(chunks)} chunks individually...")
# Stage 1: Map each chunk
chunk_summaries = []
for chunk in chunks:
chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=map_prompt)
result = chain.invoke({"input_documents": [chunk]})
chunk_summaries.append(result["output_text"])
# Stage 2: Combine chunks into batches
# Ceiling division: 30 summaries in batches of 5 is 6 sections, not 7
print(f"Stage 2: Combining into {(len(chunk_summaries) + intermediate_batch_size - 1) // intermediate_batch_size} sections...")
section_summaries = []
for i in range(0, len(chunk_summaries), intermediate_batch_size):
batch = chunk_summaries[i:i + intermediate_batch_size]
batch_text = "\n\n".join(batch)
batch_doc = Document(page_content=batch_text)
chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=batch_combine_prompt)
result = chain.invoke({"input_documents": [batch_doc]})
section_summaries.append(result["output_text"])
# Stage 3: Final combine
print("Stage 3: Generating final summary...")
final_text = "\n\n".join(section_summaries)
final_doc = Document(page_content=final_text)
chain = load_summarize_chain(smart_llm, chain_type="stuff", prompt=final_combine_prompt)
result = chain.invoke({"input_documents": [final_doc]})
return result["output_text"]
This three-stage approach keeps each LLM call within a manageable context window regardless of video length. With 30 chunks and the default batch size of 5, for example, the pipeline produces 30 chunk summaries → 6 section summaries → 1 final summary.
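The batching arithmetic is easy to check in isolation. In this sketch, `batch` and `stage_sizes` are hypothetical helpers that mirror the slicing used in stage two; they make no LLM calls.

```python
import math
from typing import Dict, List


def batch(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def stage_sizes(num_chunks: int, batch_size: int = 5) -> Dict[str, int]:
    """Number of summaries produced at each stage of the hierarchy."""
    return {
        "chunk_summaries": num_chunks,
        "section_summaries": math.ceil(num_chunks / batch_size),
        "final_summaries": 1 if num_chunks else 0,
    }


print(stage_sizes(30))
# {'chunk_summaries': 30, 'section_summaries': 6, 'final_summaries': 1}
```

The final combine call still grows with the number of sections, so for extremely long inputs a larger batch size or a fourth stage may be needed.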
Step 8: Complete Deployable FastAPI Endpoint
Here's the full production deployment as a FastAPI application:
import os
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl
from typing import Optional, Literal
import uvicorn
from datetime import datetime
import uuid
app = FastAPI(
title="YouTube Summarizer API",
description="Summarize YouTube videos using LangChain and GPT-4o",
version="1.0.0"
)
# In-memory job store (use Redis in production)
jobs = {}
class SummarizeRequest(BaseModel):
url: str
language: str = "en"
output_format: Literal["standard", "bullets", "structured"] = "structured"
whisper_model: str = "base"
class SummarizeResponse(BaseModel):
job_id: str
status: str
created_at: str
class JobResult(BaseModel):
job_id: str
status: str
result: Optional[dict] = None
error: Optional[str] = None
def run_summarization_job(job_id: str, request: SummarizeRequest):
"""Background task that runs the summarization pipeline."""
try:
jobs[job_id]["status"] = "processing"
result = summarize_youtube_video(
url=request.url,
language=request.language,
output_format=request.output_format,
whisper_model=request.whisper_model
)
jobs[job_id]["status"] = "completed"
jobs[job_id]["result"] = result
except Exception as e:
jobs[job_id]["status"] = "failed"
jobs[job_id]["error"] = str(e)
@app.post("/summarize", response_model=SummarizeResponse)
async def create_summary_job(
request: SummarizeRequest,
background_tasks: BackgroundTasks
):
"""
Submit a YouTube video for summarization.
Returns a job ID to poll for results.
"""
job_id = str(uuid.uuid4())[:8]
jobs[job_id] = {
"status": "queued",
"created_at": datetime.utcnow().isoformat(),
"result": None,
"error": None
}
background_tasks.add_task(run_summarization_job, job_id, request)
return SummarizeResponse(
job_id=job_id,
status="queued",
created_at=jobs[job_id]["created_at"]
)
@app.get("/jobs/{job_id}", response_model=JobResult)
async def get_job_status(job_id: str):
"""
Get the status and result of a summarization job.
Poll this endpoint until status is 'completed' or 'failed'.
"""
if job_id not in jobs:
raise HTTPException(status_code=404, detail=f"Job {job_id} not found")
job = jobs[job_id]
return JobResult(
job_id=job_id,
status=job["status"],
result=job.get("result"),
error=job.get("error")
)
@app.post("/summarize/sync")
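# Plain def (not async def): FastAPI runs it in a threadpool, so the blocking pipeline does not stall the event loop
def summarize_sync(request: SummarizeRequest):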
"""
Synchronous summarization endpoint.
Blocks until complete — use for videos under 30 minutes.
"""
try:
result = summarize_youtube_video(
url=request.url,
language=request.language,
output_format=request.output_format,
whisper_model=request.whisper_model
)
return {"status": "success", "data": result}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health_check():
return {"status": "ok", "active_jobs": sum(1 for job in jobs.values() if job["status"] in ("queued", "processing"))}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
Run it with:
uvicorn app:app --reload --port 8000
Then call it:
# Submit a job
curl -X POST http://localhost:8000/summarize \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=YOUR_ID", "output_format": "structured"}'
# Poll for result
curl http://localhost:8000/jobs/YOUR_JOB_ID
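From Python, the polling loop might look like the sketch below. `poll_job` is a hypothetical client-side helper, and `fetch_status` is an injected callable that returns the JSON body of GET /jobs/{job_id} as a dict, so the loop itself has no HTTP dependency.

```python
import time
from typing import Any, Callable, Dict


def poll_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reports 'completed' or 'failed', or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

With only the standard library, `fetch_status` could be `lambda jid: json.load(urllib.request.urlopen(f"http://localhost:8000/jobs/{jid}"))`, assuming the server from this section is running locally.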
For deploying this to production, the deploy AI model to production post covers containerization with Docker and hosting on cloud providers.
Optimizing for Cost and Speed
Running this at scale means managing token cost carefully. Here's what works in production:
Use gpt-4o-mini for map steps. The per-chunk summaries are mechanical — compress this section into 3–5 sentences. GPT-4o-mini handles this at a fraction of the cost.
# Rough cost comparison for a 1-hour video (~8,000 words, about 13 chunks at chunk_size=4000)
# gpt-4o map + combine: ~$0.08 per video
# gpt-4o-mini map + gpt-4o combine: ~$0.02 per video
# gpt-4o-mini both: ~$0.005 per video
# For high volume (1000 videos/day):
# All gpt-4o: $80/day
# Mini map + 4o combine: $20/day
# All mini: $5/day
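Figures like these depend on current pricing, so it helps to compute them from your own numbers. The sketch below is illustrative only: `estimate_cost` and `pipeline_cost` are hypothetical helpers, prices are passed in as (input, output) dollars per million tokens rather than hard-coded, and prompt-template overhead is ignored.

```python
from typing import Tuple


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one batch of calls at the given per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


def pipeline_cost(
    transcript_tokens: int,
    num_chunks: int,
    summary_tokens_per_chunk: int,
    final_summary_tokens: int,
    map_prices: Tuple[float, float],
    combine_prices: Tuple[float, float],
) -> float:
    """Map step reads the transcript; combine step reads the chunk summaries."""
    partial_tokens = num_chunks * summary_tokens_per_chunk
    map_cost = estimate_cost(transcript_tokens, partial_tokens, *map_prices)
    combine_cost = estimate_cost(partial_tokens, final_summary_tokens, *combine_prices)
    return map_cost + combine_cost
```

The map step dominates input tokens because it reads the whole transcript, which is why moving it to the cheaper model saves the most.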
Cache transcripts, not summaries. The transcript itself is stable — once you've loaded it, store it. Different users might want different summary formats from the same video, so caching the raw transcript and re-running only the LLM step is much cheaper than running the full pipeline twice.
import hashlib
import json
from pathlib import Path
from typing import Optional
TRANSCRIPT_CACHE_DIR = Path("./transcript_cache")
TRANSCRIPT_CACHE_DIR.mkdir(exist_ok=True)
def get_cached_transcript(url: str) -> Optional[tuple]:
"""Return cached transcript if available."""
url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
if cache_path.exists():
with open(cache_path) as f:
data = json.load(f)
print(f"Transcript cache hit: {url_hash}")
return data["text"], data["metadata"], data["source"]
return None
def cache_transcript(url: str, text: str, metadata: dict, source: str):
"""Save transcript to cache."""
url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
with open(cache_path, "w") as f:
json.dump({"text": text, "metadata": metadata, "source": source}, f)
print(f"Transcript cached: {url_hash}")
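Hashing the raw URL means the same video reached through two different links gets two cache entries. One possible refinement, sketched here with hypothetical helper names, is to key the cache on the video ID when it can be extracted, reusing the URL patterns from the agent's extract_video_id tool.

```python
import hashlib
import re
from typing import Optional

# Same URL shapes as the extract_video_id tool earlier in the post.
_ID_PATTERNS = [
    r"youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})",
    r"youtu\.be/([a-zA-Z0-9_-]{11})",
    r"youtube\.com/embed/([a-zA-Z0-9_-]{11})",
    r"youtube\.com/shorts/([a-zA-Z0-9_-]{11})",
]


def video_id_from_url(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if no pattern matches."""
    for pattern in _ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None


def transcript_cache_key(url: str) -> str:
    """Cache key based on the video ID, falling back to the raw URL."""
    basis = video_id_from_url(url) or url
    return hashlib.md5(basis.encode()).hexdigest()[:12]
```

URLs where `v=` is not the first query parameter do not match these patterns and fall back to hashing the full URL.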
For the broader picture on AI agent cost management, the AI API cost management post covers token budgeting across multi-step pipelines.
Extending the Agent: Q&A Over Video Content
Once you have the transcript as Documents, you can do more than summarize — you can run a full RAG pipeline over the video content:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
def build_video_qa_system(chunks: list) -> RetrievalQA:
"""
Build a Q&A system over a video's transcript.
Lets users ask specific questions about video content.
"""
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o", temperature=0)
chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True
)
return chain
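# Use it (reusing the loader and chunker from Steps 2 and 3)
transcript_text, metadata, source = load_transcript_with_fallback(
    "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
)
chunks = chunk_transcript(transcript_text, metadata)
qa_chain = build_video_qa_system(chunks)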
# Ask specific questions
response = qa_chain.invoke({"query": "What library did the presenter use for authentication?"})
print("Answer:", response["result"])
This Q&A capability is the bridge between video summarization and a full RAG-style knowledge base. You can ingest hundreds of videos, store all their transcripts in a persistent vector database, and let users query across all of them.
For more on building this kind of multi-document search system, the OpenAI API integration post covers embedding optimization for large content libraries.
Handling Edge Cases
A few problems you'll hit in production and how to handle them:
Transcript language mismatch. Auto-generated captions are sometimes in the wrong language. Check the language of the loaded transcript (for example, the language Whisper reports, or a language field if your metadata carries one) and retry with translation="en" if needed.
Very short videos (under 1 minute). These sometimes have no transcript at all. Add a minimum duration check:
if metadata.get("length", 0) < 60:
return {"error": "Video too short to summarize (under 1 minute)"}
Live streams and premieres. YouTube live streams use a different transcript API endpoint. YoutubeLoader handles this, but Whisper fallback won't work on a live stream since there's no complete audio file to download.
Rate limits from transcript API. The youtube-transcript-api can hit rate limits for batch processing. Add exponential backoff:
import time
from typing import Callable, Any
def retry_with_backoff(
func: Callable,
max_retries: int = 3,
base_delay: float = 1.0
) -> Any:
"""Retry a function with exponential backoff."""
for attempt in range(max_retries):
try:
return func()
except Exception as e:
if attempt == max_retries - 1:
raise
delay = base_delay * (2 ** attempt)
print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
time.sleep(delay)
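To see what the retry loop costs in waiting time, the delay schedule can be computed on its own. `backoff_delays` is a hypothetical helper that mirrors the formula above: the loop sleeps after every failed attempt except the last, so there is one fewer delay than attempts.

```python
from typing import List


def backoff_delays(max_retries: int = 3, base_delay: float = 1.0) -> List[float]:
    """Delays slept between attempts, doubling each time."""
    return [base_delay * (2 ** attempt) for attempt in range(max(max_retries - 1, 0))]


print(backoff_delays())            # [1.0, 2.0]
print(sum(backoff_delays(5, 0.5)))  # 7.5
```

Since retry_with_backoff takes a zero-argument callable, wrap calls that need arguments in a lambda, for example `retry_with_backoff(lambda: load_youtube_transcript(url))`.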
For agent patterns that handle failures gracefully in multi-step pipelines, the AI research agent build post shows robust error handling across multiple tool calls.
Conclusion
A YouTube summarization agent covers a lot of LangChain ground in a single project: document loaders, text splitters, chain types, tool definition, agent orchestration, and async deployment. The combination of YoutubeLoader for fast caption access and a Whisper fallback for uncaptioned videos makes the system work reliably across the full range of YouTube content.
The most important architectural decision is using MapReduceDocumentsChain instead of naive stuff chaining. Long videos produce transcripts that simply don't fit in a single LLM context window. Map-reduce handles this cleanly without any special-casing.
The FastAPI deployment pattern gives you a working API in about 100 lines. Add Redis for job storage, Docker for containerization, and the deployment checklist from deploy AI model to production, and you have a scalable summarization service.
The extension to full Q&A over video transcripts — a RAG pipeline using the same chunks — turns this from a summarization tool into a knowledge base. If you're building anything research-related, that extension is worth the extra 20 lines of code.
FAQs
Does YoutubeLoader work on private or age-restricted YouTube videos? No. YoutubeLoader uses the youtube-transcript-api library, which can only access transcripts on publicly available videos that have captions enabled. For private or age-restricted videos, you need to download the audio file separately and run it through Whisper or another transcription service.
How long does it take to summarize a one-hour YouTube video? With YoutubeLoader (transcript already available), loading takes 2–5 seconds. MapReduceDocumentsChain summarization depends on the number of chunks and LLM speed. For a 1-hour video with roughly 8,000 words of transcript, expect 20–40 seconds using gpt-4o-mini on the map step. Using gpt-4o for both steps takes 40–90 seconds.
What's the difference between map-reduce and refine summarization strategies? Map-reduce summarizes each chunk independently (map step) then combines those summaries into a final output (reduce step). It parallelizes well and handles very long documents. Refine passes the running summary through each chunk sequentially, updating it at each step — this produces more coherent summaries but can't be parallelized and is slower for long documents.
AiTechWorlds Team