Build a LangChain Agent That Summarizes YouTube Videos

⚡ Quick Answer

Build a full LangChain agent that loads YouTube transcripts, falls back to Whisper, and summarizes long videos with MapReduceDocumentsChain and GPT-4o.

AiTechWorlds Team May 31, 2026 21 min read

#LangChain #YouTube #summarization #Whisper #agent

I was building a research assistant for a podcast host who needed to process two to three hours of video content per day. The manual workflow — watch the video, take notes, write a brief — was taking four to five hours. I built a LangChain agent to handle it, and now the whole pipeline runs in under two minutes.

This post builds that agent from scratch. We'll go from transcript loading to a complete deployable system with a Whisper fallback for videos without captions, MapReduceDocumentsChain for handling long videos, and a tool-using agent that can accept either a URL or a video ID and return a structured summary.

The full agent at the end is production-ready. You can drop it into a FastAPI endpoint, a Slack bot, or a scheduled job.

For background on the broader agent framework this fits into, the build AI agent with LangChain post covers the foundational patterns.

The Architecture Before Any Code

The summarization pipeline has four stages:

1. Transcript acquisition: Try YouTube's built-in caption API via YoutubeLoader. If that fails (no captions, language unsupported, age restriction), fall back to downloading the audio and running Whisper.
2. Chunking: Split the transcript into overlapping chunks that fit within the LLM's context window.
3. Summarization: Use MapReduceDocumentsChain to summarize each chunk independently (map), then combine the chunk summaries into a final output (reduce).
4. Agent orchestration: Wrap the pipeline as a tool that a LangChain agent can call based on user input.

This design handles videos of any length. A 10-minute tutorial and a 3-hour conference keynote go through the same pipeline.
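
The transcript-acquisition stage is essentially an ordered list of sources, tried until one works. Here is a minimal sketch of that fallback pattern using only the standard library. The name `acquire_transcript` and the loader callables are hypothetical, chosen for illustration; they are not part of LangChain.

```python
from typing import Callable, Sequence, Tuple

Loader = Callable[[str], str]  # takes a URL, returns transcript text


def acquire_transcript(
    url: str, loaders: Sequence[Tuple[str, Loader]]
) -> Tuple[str, str]:
    """Try each (name, loader) in order.

    Returns (transcript_text, source_name) from the first loader that
    produces non-empty text. Raises RuntimeError if every loader fails.
    """
    errors = []
    for name, loader in loaders:
        try:
            text = loader(url)
        except Exception as exc:  # any failure means "try the next source"
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return text, name
        errors.append(f"{name}: empty transcript")
    raise RuntimeError("All transcript sources failed: " + "; ".join(errors))
```

In the real pipeline, the two loaders would wrap the caption loader and the Whisper path built in Steps 1 and 2.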

Comparison: Summarization Strategies

Strategy	Speed	Quality	Max Length	Cost
`stuff` chain	Fastest	Excellent	~16k tokens	Low
`map_reduce` chain	Fast (parallel map)	Very Good	Unlimited	Medium
`refine` chain	Slow (sequential)	Excellent	Unlimited	Medium-High
`map_rerank` chain	Medium	Good (ranks chunks)	Unlimited	Medium
Rolling window (custom)	Medium	Good	Unlimited	Medium

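For YouTube videos, map_reduce wins on most tradeoffs: it parallelizes the per-chunk summarization, handles transcripts of any length, and produces good summaries. Use stuff only for short videos (under 20 minutes) where the full transcript fits in one context window. Note that load_summarize_chain accepts only stuff, map_reduce, and refine; map_rerank is a question-answering chain type and appears in the table for comparison only.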
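
The choice between stuff and map_reduce comes down to whether the transcript fits in the context budget. A rough sketch of that decision follows, assuming about 4 characters per token and the ~16k-token figure from the table. Both numbers and the function names are assumptions to adjust for your model.

```python
def estimate_tokens(text_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from a character count."""
    return int(text_chars / chars_per_token)


def choose_chain_type(
    transcript_chars: int,
    context_tokens: int = 16_000,  # assumed budget, taken from the table above
    reserve_tokens: int = 2_000,   # room for the prompt and the generated summary
) -> str:
    """Return 'stuff' if the whole transcript fits in one call, else 'map_reduce'."""
    budget = context_tokens - reserve_tokens
    if estimate_tokens(transcript_chars) <= budget:
        return "stuff"
    return "map_reduce"
```

A tokenizer-based count would be more precise. The character heuristic is usually enough for picking a strategy.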

Installation

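pip install langchain langchain-community langchain-openai
pip install youtube-transcript-api pytube   # pytube is needed for add_video_info=True
pip install openai-whisper  # For Whisper fallback
pip install yt-dlp           # Better than pytube for audio download
# Whisper and yt-dlp's audio extraction also need the ffmpeg binary on your PATH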

Optional for the full agent deployment:

pip install fastapi uvicorn python-dotenv
pip install faiss-cpu        # Only for the Q&A extension near the end

Step 1: Loading YouTube Transcripts with YoutubeLoader

YoutubeLoader from langchain_community wraps the youtube-transcript-api library. It fetches the auto-generated or manual captions from a YouTube video and returns them as a LangChain Document.

from langchain_community.document_loaders import YoutubeLoader

def load_youtube_transcript(
    url: str,
    language: str = "en",
    translation: str = None
) -> list:
    """
    Load transcript from a YouTube URL.
    
    Args:
        url: Full YouTube URL (from_youtube_url does not accept a bare video ID)
        language: Transcript language code (e.g., 'en', 'es', 'fr')
        translation: Translate to this language if original not available
    
    Returns:
        List of Document objects with transcript text
    """
    loader = YoutubeLoader.from_youtube_url(
        url,
        add_video_info=True,     # Includes title, author, length in metadata
        language=[language],
        translation=translation
    )
    
    documents = loader.load()
    
    if not documents:
        raise ValueError(f"No transcript found for: {url}")
    
    print(f"Loaded transcript: {len(documents[0].page_content)} characters")
    print(f"Video: {documents[0].metadata.get('title', 'Unknown')}")
    print(f"Author: {documents[0].metadata.get('author', 'Unknown')}")
    print(f"Length: {documents[0].metadata.get('length', 0)} seconds")
    
    return documents


# Try it
try:
    docs = load_youtube_transcript(
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        language="en"
    )
    print("Transcript loaded successfully")
except Exception as e:
    print(f"Transcript loading failed: {e}")

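The add_video_info=True flag pulls in the video title, author, length, and thumbnail URL using the pytube library. This metadata is useful for generating attribution in summaries. Because pytube scrapes the watch page, this lookup can break independently of caption loading; if it does, retry with add_video_info=False.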

Step 2: Whisper Fallback for Videos Without Captions

Many YouTube videos — especially technical talks, short clips, and older content — don't have captions. When YoutubeLoader fails, we need to download the audio and transcribe it locally with Whisper.

import os
import tempfile
import whisper
import yt_dlp


def download_audio(url: str, output_dir: str = None) -> str:
    """
    Download audio from a YouTube URL using yt-dlp.
    Returns the path to the downloaded audio file.
    """
    if output_dir is None:
        output_dir = tempfile.mkdtemp()
    
    output_template = os.path.join(output_dir, "%(id)s.%(ext)s")
    
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": output_template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }],
        "quiet": True,
        "no_warnings": True,
    }
    
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        video_id = info.get("id", "audio")
        audio_path = os.path.join(output_dir, f"{video_id}.mp3")
    
    print(f"Audio downloaded: {audio_path}")
    return audio_path


def transcribe_with_whisper(
    audio_path: str,
    model_size: str = "base",
    language: str = None
) -> str:
    """
    Transcribe audio file using OpenAI Whisper.
    
    model_size options: tiny, base, small, medium, large
    Larger models are more accurate but slower and use more memory.
    """
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)
    
    print("Transcribing audio...")
    transcribe_opts = {"verbose": False}
    if language:
        transcribe_opts["language"] = language
    
    result = model.transcribe(audio_path, **transcribe_opts)
    
    transcript = result["text"]
    detected_language = result.get("language", "unknown")
    
    print(f"Transcription complete: {len(transcript)} characters")
    print(f"Detected language: {detected_language}")
    
    return transcript


def load_transcript_with_fallback(
    url: str,
    language: str = "en",
    whisper_model: str = "base",
    cleanup_audio: bool = True
) -> tuple:
    """
    Load transcript via YoutubeLoader with Whisper fallback.
    
    Returns: (transcript_text, metadata_dict, source)
    source is either 'youtube_captions' or 'whisper'
    """
    # Try YouTube captions first
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=True,
            language=[language]
        )
        docs = loader.load()
        
        if docs and docs[0].page_content.strip():
            return (
                docs[0].page_content,
                docs[0].metadata,
                "youtube_captions"
            )
        
        raise ValueError("Empty transcript returned")
        
    except Exception as e:
        print(f"YouTube captions unavailable ({e}). Falling back to Whisper...")
    
    # Whisper fallback
    audio_path = None
    try:
        audio_path = download_audio(url)
        transcript_text = transcribe_with_whisper(
            audio_path,
            model_size=whisper_model,
            language=language if language != "en" else None
        )
        
        # Extract basic metadata via yt-dlp (no download)
        with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
            info = ydl.extract_info(url, download=False)
            metadata = {
                "title": info.get("title", "Unknown"),
                "author": info.get("uploader", "Unknown"),
                "length": info.get("duration", 0),
                "source": url
            }
        
        return transcript_text, metadata, "whisper"
        
    finally:
        if cleanup_audio and audio_path and os.path.exists(audio_path):
            os.remove(audio_path)
            print("Audio file cleaned up")

The Whisper base model runs on CPU in reasonable time (2–4× real-time on most hardware). On a GPU machine, the medium or large models become practical and give better accuracy on technical content with jargon.

Step 3: Chunking the Transcript

Transcripts are one long string with no paragraph breaks, headings, or natural document structure. RecursiveCharacterTextSplitter handles this well when given sentence-ending separators, as in the code below, so chunks break at sentence boundaries where possible.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document


def chunk_transcript(
    transcript_text: str,
    metadata: dict,
    chunk_size: int = 4000,
    chunk_overlap: int = 200
) -> list:
    """
    Split a transcript into overlapping chunks suitable for map-reduce.
    
    chunk_size: characters per chunk (4000 chars ≈ 1000 tokens)
    chunk_overlap: characters of overlap between adjacent chunks
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=[". ", "? ", "! ", "\n", " ", ""]
    )
    
    texts = splitter.split_text(transcript_text)
    
    # Wrap in Document objects with metadata
    chunks = [
        Document(
            page_content=text,
            metadata={
                **metadata,
                "chunk_index": i,
                "total_chunks": len(texts)
            }
        )
        for i, text in enumerate(texts)
    ]
    
    print(f"Split transcript into {len(chunks)} chunks")
    if chunks:  # avoid ZeroDivisionError on an empty transcript
        avg_size = sum(len(c.page_content) for c in chunks) // len(chunks)
        print(f"Avg chunk size: {avg_size} chars")
    
    return chunks

For a 1-hour video, a transcript typically runs 8,000–12,000 words, or roughly 45,000–65,000 characters. With chunk_size=4000, that produces about 12–18 chunks, a comfortable workload for the map step.
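
To sanity-check chunk counts for your own videos, the arithmetic can be written down directly. This sketch assumes about 5.5 characters per word including spaces (an assumption; measure your own transcripts) and treats every chunk as exactly chunk_size long. The real splitter breaks at separators, so it tends to produce slightly more chunks. `estimate_chunk_count` is a hypothetical helper name.

```python
import math


def estimate_chunk_count(
    word_count: int,
    chars_per_word: float = 5.5,  # assumed average, spaces included
    chunk_size: int = 4000,
    chunk_overlap: int = 200,
) -> int:
    """Lower-bound estimate of how many overlapping chunks a transcript yields."""
    if word_count <= 0:
        return 0
    total_chars = word_count * chars_per_word
    if total_chars <= chunk_size:
        return 1
    # Each chunk after the first advances by chunk_size - chunk_overlap characters
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_chars - chunk_size) / stride)
```

Calling it with a transcript's word count before running the pipeline gives a quick sense of how many map calls to expect.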

Step 4: MapReduceDocumentsChain Summarization

MapReduceDocumentsChain is the right tool for long transcripts. The map step summarizes each chunk independently, and the reduce step combines those partial summaries into the final output.

from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate


def build_summarization_chain(
    model: str = "gpt-4o-mini",
    temperature: float = 0,
    chain_type: str = "map_reduce",
    output_format: str = "standard"
):
    """
    Build a summarization chain for YouTube transcripts.
    
    chain_type: 'map_reduce' (long videos) or 'stuff' (short videos);
        any other value falls through to map_reduce
    output_format: 'standard', 'bullets', 'structured'
    """
    llm = ChatOpenAI(model=model, temperature=temperature)
    
    if chain_type == "stuff":
        # For short videos where the full transcript fits in context
        prompt = PromptTemplate(
            template="""You are summarizing a YouTube video transcript.

Transcript:
{text}

Write a clear, informative summary that covers:
1. The main topic and purpose of the video
2. Key points and takeaways
3. Any specific techniques, tools, or conclusions mentioned

Summary:""",
            input_variables=["text"]
        )
        return load_summarize_chain(
            llm,
            chain_type="stuff",
            prompt=prompt
        )
    
    # Map-reduce for long videos
    map_prompt = PromptTemplate(
        template="""Summarize this section of a YouTube video transcript.
Be concise but capture all important points, facts, and conclusions.

Transcript section:
{text}

Section summary:""",
        input_variables=["text"]
    )
    
    if output_format == "bullets":
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a final summary.

Section summaries:
{text}

Write a final summary as bullet points grouped by topic.
Format: Start each group with a bold topic heading, followed by 2-4 bullet points.

Final summary:""",
            input_variables=["text"]
        )
    
    elif output_format == "structured":
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a structured summary.

Section summaries:
{text}

Write a final summary with these sections:
**Overview**: 2-3 sentences on what the video is about
**Key Points**: numbered list of the 5-7 most important takeaways
**Conclusion**: what the video recommends or concludes

Final summary:""",
            input_variables=["text"]
        )
    
    else:
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a final summary.

Section summaries:
{text}

Write a cohesive, well-structured summary covering the video's main points.
Aim for 3-5 paragraphs. Be specific about techniques, tools, or data mentioned.

Final summary:""",
            input_variables=["text"]
        )
    
    return load_summarize_chain(
        llm,
        chain_type="map_reduce",
        map_prompt=map_prompt,
        combine_prompt=combine_prompt,
        verbose=False
    )


def summarize_transcript(
    chunks: list,
    model: str = "gpt-4o-mini",
    chain_type: str = "map_reduce",
    output_format: str = "structured"
) -> str:
    """
    Summarize transcript chunks using the specified chain type.
    """
    # For very short transcripts, use 'stuff' directly
    total_chars = sum(len(c.page_content) for c in chunks)
    if total_chars < 6000 and chain_type == "map_reduce":
        print("Short transcript detected, using 'stuff' chain")
        chain_type = "stuff"
    
    chain = build_summarization_chain(
        model=model,
        chain_type=chain_type,
        output_format=output_format
    )
    
    print(f"Summarizing {len(chunks)} chunks with {chain_type} chain...")
    result = chain.invoke({"input_documents": chunks})
    
    summary = result.get("output_text", result.get("text", ""))
    print(f"Summary generated: {len(summary)} characters")
    
    return summary

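Using gpt-4o-mini for the map step is intentional: it's much cheaper, and the per-chunk summaries don't need the full power of GPT-4o. The combine step is where nuance matters, so I upgrade to gpt-4o there in production. The function above uses one model for both steps; to split them, pass a second ChatOpenAI instance as reduce_llm to load_summarize_chain.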
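
Stripped of the LangChain wrappers, the pattern is a parallel map over chunks followed by a single reduce call, and each step can be backed by a different model. Below is a library-free sketch in which `map_fn` and `reduce_fn` are hypothetical stand-ins for the cheap and the stronger model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    map_fn: Callable[[str], str],     # e.g. a call to the cheaper model
    reduce_fn: Callable[[str], str],  # e.g. a call to the stronger model
    max_workers: int = 4,
) -> str:
    """Summarize each chunk in parallel, then combine the partial summaries."""
    if not chunks:
        return ""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the summaries stay chronological
        partials = list(pool.map(map_fn, chunks))
    return reduce_fn("\n\n".join(partials))
```

Threads suit this workload because each map call spends its time waiting on a network response, not on computation.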

For more on how chain types work in LangChain, the LangChain tutorial 2025 has a dedicated section on chain architectures.

Step 5: Full Pipeline Function

Before building the agent, let's wrap everything into a single pipeline function:

from typing import Optional
import time


def summarize_youtube_video(
    url: str,
    language: str = "en",
    whisper_model: str = "base",
    summarization_model: str = "gpt-4o-mini",
    combine_model: str = "gpt-4o",
    output_format: str = "structured",
    chunk_size: int = 4000,
    chunk_overlap: int = 200
) -> dict:
    """
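    Full pipeline: URL → transcript → chunks → summary.
    
    Returns dict with summary, metadata, timing, and source info.
    
    Note: combine_model is currently unused. summarize_transcript runs both
    the map and combine steps with summarization_model.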
    """
    start_time = time.time()
    
    # Step 1: Load transcript
    print(f"\n{'='*50}")
    print(f"Processing: {url}")
    print(f"{'='*50}")
    
    transcript_text, metadata, source = load_transcript_with_fallback(
        url,
        language=language,
        whisper_model=whisper_model
    )
    
    load_time = time.time() - start_time
    print(f"Transcript loaded in {load_time:.1f}s (source: {source})")
    
    # Step 2: Chunk transcript
    chunks = chunk_transcript(
        transcript_text,
        metadata,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    # Step 3: Determine chain type
    total_chars = sum(len(c.page_content) for c in chunks)
    chain_type = "stuff" if total_chars < 6000 else "map_reduce"
    
    # Step 4: Summarize
    summary = summarize_transcript(
        chunks,
        model=summarization_model,
        chain_type=chain_type,
        output_format=output_format
    )
    
    total_time = time.time() - start_time
    
    return {
        "url": url,
        "title": metadata.get("title", "Unknown"),
        "author": metadata.get("author", "Unknown"),
        "duration_seconds": metadata.get("length", 0),
        "transcript_source": source,
        "num_chunks": len(chunks),
        "chain_type": chain_type,
        "summary": summary,
        "processing_time_seconds": round(total_time, 1)
    }


# Test run
if __name__ == "__main__":
    result = summarize_youtube_video(
        "https://www.youtube.com/watch?v=YOUR_VIDEO_ID",
        output_format="structured"
    )
    
    print(f"\nVideo: {result['title']}")
    print(f"Author: {result['author']}")
    print(f"Duration: {result['duration_seconds'] // 60} minutes")
    print(f"Processed in: {result['processing_time_seconds']}s")
    print(f"\n{'='*50}")
    print("SUMMARY:")
    print('='*50)
    print(result['summary'])

Step 6: Building the Agent

The pipeline is useful as a standalone function, but wrapping it as a LangChain agent tool lets users interact with it naturally — they can ask follow-up questions, request different formats, or ask for specific sections.

from langchain.tools import tool
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from typing import Optional
import re


@tool
def summarize_video(url: str, output_format: str = "structured") -> str:
    """
    Summarize a YouTube video given its URL.
    
    Use this tool when the user provides a YouTube URL and asks for a summary,
    key points, or overview of the video content.
    
    Args:
        url: Full YouTube URL (e.g., https://www.youtube.com/watch?v=...)
        output_format: 'standard', 'bullets', or 'structured' (default: structured)
    
    Returns:
        Structured summary of the video content
    """
    try:
        result = summarize_youtube_video(
            url=url,
            output_format=output_format
        )
        
        output = f"""**Video**: {result['title']}
**Author**: {result['author']}
**Duration**: {result['duration_seconds'] // 60} minutes
**Transcript source**: {result['transcript_source']}

{result['summary']}

*Processed in {result['processing_time_seconds']}s using {result['chain_type']} summarization*"""
        
        return output
        
    except Exception as e:
        return f"Failed to summarize video: {str(e)}"


@tool
def extract_video_id(url: str) -> str:
    """
    Extract the video ID from a YouTube URL.
    
    Use this when the user provides a YouTube URL and you need the video ID.
    """
    patterns = [
        r"youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})",
        r"youtu\.be/([a-zA-Z0-9_-]{11})",
        r"youtube\.com/embed/([a-zA-Z0-9_-]{11})",
        r"youtube\.com/shorts/([a-zA-Z0-9_-]{11})"
    ]
    
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return f"Video ID: {match.group(1)}"
    
    return "Could not extract video ID from URL"


def build_youtube_agent(
    model: str = "gpt-4o",
    enable_memory: bool = True
) -> AgentExecutor:
    """
    Build a YouTube summarization agent with tool use.
    """
    llm = ChatOpenAI(model=model, temperature=0)
    
    tools = [summarize_video, extract_video_id]
    
    system_message = """You are a helpful assistant that specializes in summarizing 
YouTube videos. When given a YouTube URL, you use the summarize_video tool to 
generate a comprehensive summary. 

You can also:
- Explain specific parts of a video when asked
- Compare key points from multiple videos
- Extract the video ID from a URL
- Generate summaries in different formats (standard, bullets, structured)

Always use the summarize_video tool when the user provides a YouTube URL.
Be helpful and informative in your responses."""

    if enable_memory:
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_message),
            MessagesPlaceholder("chat_history", optional=True),
            ("human", "{input}"),
            MessagesPlaceholder("agent_scratchpad")
        ])
        
        memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )
    else:
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_message),
            ("human", "{input}"),
            MessagesPlaceholder("agent_scratchpad")
        ])
        memory = None
    
    agent = create_openai_tools_agent(llm, tools, prompt)
    
    return AgentExecutor(
        agent=agent,
        tools=tools,
        memory=memory,
        verbose=True,
        max_iterations=3,
        handle_parsing_errors=True
    )


# Run the agent
if __name__ == "__main__":
    agent = build_youtube_agent(model="gpt-4o")
    
    # Single video summary
    response = agent.invoke({
        "input": "Please summarize this video: https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
    })
    print(response["output"])
    
    # Follow-up question (uses memory)
    response2 = agent.invoke({
        "input": "Can you give me just the bullet points from that video?"
    })
    print(response2["output"])

The agent uses ConversationBufferMemory so follow-up questions work naturally — the user can say "give me bullet points" after getting a summary, and the agent knows which video they're referring to.

For more on building agents with memory and tool use, the AI agent memory and planning post covers advanced memory patterns.
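
The extract_video_id tool above covers four URL shapes. A bare 11-character ID, or a watch URL whose v parameter is not first, needs a little more parsing. The sketch below uses urllib.parse; `parse_video_id` and `to_watch_url` are hypothetical helper names, and the 11-character ID format is the same assumption the regexes above already make.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def parse_video_id(url_or_id: str) -> Optional[str]:
    """Return the video ID from a YouTube URL or bare ID, or None if not found."""
    candidate = url_or_id.strip()
    if _ID_RE.fullmatch(candidate):
        return candidate  # already a bare ID
    # Allow scheme-less input such as "youtu.be/<id>"
    parsed = urlparse(candidate if "://" in candidate else "https://" + candidate)
    host = (parsed.hostname or "").lower()
    video_id = None
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            video_id = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = [p for p in parsed.path.split("/") if p]
            if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
                video_id = parts[1]
    if video_id and _ID_RE.fullmatch(video_id):
        return video_id
    return None


def to_watch_url(url_or_id: str) -> str:
    """Normalize any supported input to a canonical watch URL."""
    video_id = parse_video_id(url_or_id)
    if video_id is None:
        raise ValueError(f"Not a recognizable YouTube URL or ID: {url_or_id!r}")
    return f"https://www.youtube.com/watch?v={video_id}"
```

Normalizing input this way before it reaches the pipeline lets the summarization tool take either form.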

Step 7: Handling Long Videos with Custom Map-Reduce

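For very long videos (3+ hours, 30,000+ word transcripts), the map step produces dozens of chunk summaries. The standard chain collapses them automatically once they exceed its token_max limit, but you get little control over how they are grouped or which model does the grouping. An explicit hierarchical map-reduce handles this better: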

from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.schema import Document


def hierarchical_summarize(
    chunks: list,
    map_model: str = "gpt-4o-mini",
    combine_model: str = "gpt-4o",
    intermediate_batch_size: int = 5
) -> str:
    """
    Hierarchical map-reduce for very long videos (3+ hours).
    
    Stage 1: Summarize each chunk individually
    Stage 2: Combine chunk summaries, in batches of intermediate_batch_size,
             into section summaries
    Stage 3: Combine section summaries into final summary
    """
    fast_llm = ChatOpenAI(model=map_model, temperature=0)
    smart_llm = ChatOpenAI(model=combine_model, temperature=0)
    
    map_prompt = PromptTemplate(
        template="Summarize this transcript segment in 3-5 sentences:\n\n{text}\n\nSummary:",
        input_variables=["text"]
    )
    
    batch_combine_prompt = PromptTemplate(
        template="""Combine these segment summaries into a coherent section summary (5-8 sentences):

{text}

Section summary:""",
        input_variables=["text"]
    )
    
    final_combine_prompt = PromptTemplate(
        template="""You are writing the final summary of a YouTube video from section summaries.

Section summaries:
{text}

Write a comprehensive, well-structured summary with:
**Overview**: What this video covers and why it matters
**Key Points**: The 5-7 most important takeaways
**Tools/Techniques Mentioned**: Specific tools, frameworks, or methods discussed
**Conclusion**: Main recommendation or conclusion

Final summary:""",
        input_variables=["text"]
    )
    
    # Build each chain once instead of once per loop iteration
    map_chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=map_prompt)
    batch_chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=batch_combine_prompt)
    
    # Stage 1: Map each chunk to its own short summary
    print(f"Stage 1: Summarizing {len(chunks)} chunks...")
    chunk_summaries = []
    for chunk in chunks:
        result = map_chain.invoke({"input_documents": [chunk]})
        chunk_summaries.append(result["output_text"])
    
    # Stage 2: Combine chunk summaries in batches
    # Ceiling division, so an exact multiple does not report an extra section
    num_sections = -(-len(chunk_summaries) // intermediate_batch_size)
    print(f"Stage 2: Combining into {num_sections} sections...")
    
    section_summaries = []
    for i in range(0, len(chunk_summaries), intermediate_batch_size):
        batch = chunk_summaries[i:i + intermediate_batch_size]
        batch_text = "\n\n".join(batch)
        batch_doc = Document(page_content=batch_text)
        
        result = batch_chain.invoke({"input_documents": [batch_doc]})
        section_summaries.append(result["output_text"])
    
    # Stage 3: Final combine
    print("Stage 3: Generating final summary...")
    final_text = "\n\n".join(section_summaries)
    final_doc = Document(page_content=final_text)
    
    chain = load_summarize_chain(smart_llm, chain_type="stuff", prompt=final_combine_prompt)
    result = chain.invoke({"input_documents": [final_doc]})
    
    return result["output_text"]

This three-stage approach keeps each LLM call within a manageable context window even for very long videos. A 3-hour keynote with about 45 chunks becomes 45 chunk summaries → 9 section summaries → 1 final summary.
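
If three fixed stages are still not enough, the batching step can be repeated until a single summary remains. This generalization is a sketch and differs from the function above: the number of levels grows with the chunk count, and `combine` is a hypothetical callable standing in for an LLM call.

```python
from typing import Callable, List


def hierarchical_reduce(
    summaries: List[str],
    combine: Callable[[List[str]], str],
    batch_size: int = 5,
) -> str:
    """Repeatedly combine summaries in batches until only one is left."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 or the loop cannot shrink")
    if not summaries:
        return ""
    level = list(summaries)
    while len(level) > 1:
        # Each pass shrinks the list by roughly a factor of batch_size
        level = [
            combine(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return level[0]
```

With batch_size=5, thirty chunk summaries shrink to 6, then 2, then 1, so no single call ever sees more than five inputs.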

Step 8: Complete Deployable FastAPI Endpoint

Here's the full production deployment as a FastAPI application:

import os
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, HttpUrl
from typing import Optional, Literal
import uvicorn
from datetime import datetime
import uuid


app = FastAPI(
    title="YouTube Summarizer API",
    description="Summarize YouTube videos using LangChain and GPT-4o",
    version="1.0.0"
)

# In-memory job store (use Redis in production)
jobs = {}


class SummarizeRequest(BaseModel):
    url: str
    language: str = "en"
    output_format: Literal["standard", "bullets", "structured"] = "structured"
    whisper_model: str = "base"


class SummarizeResponse(BaseModel):
    job_id: str
    status: str
    created_at: str


class JobResult(BaseModel):
    job_id: str
    status: str
    result: Optional[dict] = None
    error: Optional[str] = None


def run_summarization_job(job_id: str, request: SummarizeRequest):
    """Background task that runs the summarization pipeline."""
    try:
        jobs[job_id]["status"] = "processing"
        
        result = summarize_youtube_video(
            url=request.url,
            language=request.language,
            output_format=request.output_format,
            whisper_model=request.whisper_model
        )
        
        jobs[job_id]["status"] = "completed"
        jobs[job_id]["result"] = result
        
    except Exception as e:
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(e)


@app.post("/summarize", response_model=SummarizeResponse)
async def create_summary_job(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    """
    Submit a YouTube video for summarization.
    Returns a job ID to poll for results.
    """
    job_id = str(uuid.uuid4())[:8]
    
    jobs[job_id] = {
        "status": "queued",
        "created_at": datetime.utcnow().isoformat(),
        "result": None,
        "error": None
    }
    
    background_tasks.add_task(run_summarization_job, job_id, request)
    
    return SummarizeResponse(
        job_id=job_id,
        status="queued",
        created_at=jobs[job_id]["created_at"]
    )


@app.get("/jobs/{job_id}", response_model=JobResult)
async def get_job_status(job_id: str):
    """
    Get the status and result of a summarization job.
    Poll this endpoint until status is 'completed' or 'failed'.
    """
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail=f"Job {job_id} not found")
    
    job = jobs[job_id]
    
    return JobResult(
        job_id=job_id,
        status=job["status"],
        result=job.get("result"),
        error=job.get("error")
    )


@app.post("/summarize/sync")
def summarize_sync(request: SummarizeRequest):
    """
    Synchronous summarization endpoint.
    Blocks until complete; use for videos under 30 minutes.
    Declared with plain def so FastAPI runs it in a worker thread
    instead of blocking the event loop.
    """
    try:
        result = summarize_youtube_video(
            url=request.url,
            language=request.language,
            output_format=request.output_format,
            whisper_model=request.whisper_model
        )
        return {"status": "success", "data": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health_check():
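    # Count only jobs that have not finished yet
    active = sum(1 for job in jobs.values() if job["status"] in ("queued", "processing"))
    return {"status": "ok", "active_jobs": active}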


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run it with (assuming the file is saved as app.py):

uvicorn app:app --reload --port 8000

Then call it:

# Submit a job
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=YOUR_ID", "output_format": "structured"}'

# Poll for result
curl http://localhost:8000/jobs/YOUR_JOB_ID

For deploying this to production, the deploy AI model to production post covers containerization with Docker and hosting on cloud providers.
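
Clients of the /summarize endpoint have to poll /jobs/{job_id} until the status is completed or failed. One possible polling loop is sketched below. `poll_job` is a hypothetical helper, and the HTTP call is injected as `fetch_status` so the loop can be exercised without a running server.

```python
import time
from typing import Callable

TERMINAL_STATES = ("completed", "failed")


def poll_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state or times out."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"Job still {job.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

In practice, fetch_status would issue a GET to the jobs endpoint and decode the JSON body, for example with urllib.request.urlopen and json.load.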

Optimizing for Cost and Speed

Running this at scale means managing token cost carefully. Here's what works in production:

Use gpt-4o-mini for map steps. The per-chunk summaries are mechanical: the whole task is "compress this section into 3–5 sentences." GPT-4o-mini handles this at a fraction of the cost.

# Approximate cost comparison for a 1-hour video (~8,000 words, ~12 chunks)
# gpt-4o map + combine:      ~$0.08 per video
# gpt-4o-mini map + gpt-4o combine: ~$0.02 per video
# gpt-4o-mini both:          ~$0.005 per video

# For high volume (1000 videos/day):
# All gpt-4o:     $80/day
# Mini map + 4o combine: $20/day
# All mini:       $5/day
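
A similar estimate can be computed for your own transcript sizes and prices. The sketch below is rough: the function name and the default summary lengths are assumptions, prices are passed in per million tokens and should come from your provider's current price list, and prompt-template and chunk-overlap tokens are ignored.

```python
def estimate_video_cost(
    transcript_tokens: int,
    num_chunks: int,
    map_in_price: float,       # USD per 1M input tokens, map model
    map_out_price: float,      # USD per 1M output tokens, map model
    combine_in_price: float,   # USD per 1M input tokens, combine model
    combine_out_price: float,  # USD per 1M output tokens, combine model
    map_out_tokens_per_chunk: int = 150,  # assumed length of a chunk summary
    final_out_tokens: int = 500,          # assumed length of the final summary
) -> float:
    """Estimate the USD cost of one map-reduce summarization run."""
    map_out_total = num_chunks * map_out_tokens_per_chunk
    map_cost = (
        transcript_tokens * map_in_price + map_out_total * map_out_price
    ) / 1_000_000
    # The combine step reads every chunk summary and writes the final summary
    combine_cost = (
        map_out_total * combine_in_price + final_out_tokens * combine_out_price
    ) / 1_000_000
    return map_cost + combine_cost
```

Multiplying the result by your daily video count gives a daily budget figure comparable to the ones above.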

Cache transcripts, not summaries. The transcript itself is stable — once you've loaded it, store it. Different users might want different summary formats from the same video, so caching the raw transcript and re-running only the LLM step is much cheaper than running the full pipeline twice.

import hashlib
import json
from pathlib import Path
from typing import Optional

TRANSCRIPT_CACHE_DIR = Path("./transcript_cache")
TRANSCRIPT_CACHE_DIR.mkdir(exist_ok=True)

def get_cached_transcript(url: str) -> Optional[tuple]:
    """Return cached transcript if available."""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
    
    if cache_path.exists():
        with open(cache_path) as f:
            data = json.load(f)
        print(f"Transcript cache hit: {url_hash}")
        return data["text"], data["metadata"], data["source"]
    
    return None

def cache_transcript(url: str, text: str, metadata: dict, source: str):
    """Save transcript to cache."""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
    
    with open(cache_path, "w") as f:
        json.dump({"text": text, "metadata": metadata, "source": source}, f)
    
    print(f"Transcript cached: {url_hash}")
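
The two helpers above still need to be wired into the pipeline. One way to do that is a wrapper that checks the cache before calling the loader. The sketch below is self-contained: `load_with_cache` is a hypothetical name, the loader is any callable returning (text, metadata, source), and it hashes the URL with SHA-256 where the helpers above use MD5.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Tuple

Transcript = Tuple[str, dict, str]  # (text, metadata, source)


def load_with_cache(
    url: str,
    loader: Callable[[str], Transcript],
    cache_dir: Path,
) -> Transcript:
    """Return the cached transcript for url, calling loader only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    cache_path = cache_dir / f"{key}.json"

    if cache_path.exists():
        data = json.loads(cache_path.read_text(encoding="utf-8"))
        return data["text"], data["metadata"], data["source"]

    text, metadata, source = loader(url)
    payload = {"text": text, "metadata": metadata, "source": source}
    cache_path.write_text(json.dumps(payload), encoding="utf-8")
    return text, metadata, source
```

Passing load_transcript_with_fallback as the loader would make repeat requests for the same URL skip both the caption fetch and the Whisper run. Metadata must be JSON-serializable for this to work.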

For the broader picture on AI agent cost management, the AI API cost management post covers token budgeting across multi-step pipelines.

Extending the Agent: Q&A Over Video Content

Once you have the transcript as Documents, you can do more than summarize — you can run a full RAG pipeline over the video content:

# FAISS needs an extra package that is not part of the Installation step: pip install faiss-cpu
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


def build_video_qa_system(chunks: list) -> RetrievalQA:
    """
    Build a Q&A system over a video's transcript.
    Lets users ask specific questions about video content.
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    
    return chain


# Use it
chunks = chunk_transcript(transcript_text, metadata)
qa_chain = build_video_qa_system(chunks)

# Ask specific questions
response = qa_chain.invoke({"query": "What library did the presenter use for authentication?"})
print("Answer:", response["result"])
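If the retriever step feels like a black box, it helps to see the idea in miniature. The sketch below is not what FAISS or OpenAIEmbeddings do internally; it swaps in a toy bag-of-words "embedding" and plain cosine similarity so the ranking logic behind `search_kwargs={"k": 4}` is visible without any API calls. All function names here are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 4) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A real embedding model matches on meaning rather than exact words, which is why the production version finds "auth library" when the speaker said "login package". The retrieve-then-answer shape is the same either way: only the top-ranked chunks are stuffed into the prompt.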

This Q&A capability is the bridge between video summarization and a full knowledge base of the kind covered in the RAG system tutorial. You can ingest hundreds of videos, store all their transcripts in a persistent vector database, and let users query across all of them.

For more on building this kind of multi-document search system, the OpenAI API integration post covers embedding optimization for large content libraries.

Handling Edge Cases

A few problems you'll hit in production and how to handle them:

Transcript language mismatch. Auto-generated captions are sometimes in the wrong language. Check the transcript language after loading (use metadata["language"] if your loader version populates it; the key may not be present) and retry with translation="en" if needed.

Very short videos (under a minute or two). These sometimes have no transcript at all. Add a minimum duration check; the example below uses a 60-second threshold and treats a missing length value as 0:

if metadata.get("length", 0) < 60:
    return {"error": "Video too short to summarize (under 1 minute)"}

Live streams and premieres. YouTube live streams use a different transcript API endpoint. YoutubeLoader handles this, but Whisper fallback won't work on a live stream since there's no complete audio file to download.

Rate limits from transcript API. The youtube-transcript-api can hit rate limits for batch processing. Add exponential backoff:

import time
from typing import Callable, Any

def retry_with_backoff(
    func: Callable,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> Any:
    """Retry a function with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
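One caveat with the helper above: it retries on every exception, including ones that will never succeed on a second try (a video with captions disabled, say). A variant that retries only the error types you consider transient, with a little random jitter so parallel workers do not retry in lockstep, might look like the sketch below. The default exception types are an assumption; substitute whatever your transcript library actually raises when it is rate limited.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_transient(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Retry func() on the listed exception types only, with jittered backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            # Exponential delay (1x, 2x, 4x, ...) plus up to 50% random jitter.
            delay = base_delay * (2 ** attempt) * (1 + random.random() / 2)
            sleep(delay)
    raise RuntimeError("max_retries must be at least 1")
```

Any exception not listed in `retry_on` propagates immediately, so permanent failures surface fast and can go straight to the Whisper fallback.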

For agent patterns that handle failures gracefully in multi-step pipelines, the AI research agent build post shows robust error handling across multiple tool calls.

Conclusion

A YouTube summarization agent covers a lot of LangChain ground in a single project: document loaders, text splitters, chain types, tool definition, agent orchestration, and async deployment. The combination of YoutubeLoader for fast caption access and a Whisper fallback for uncaptioned videos makes the system work reliably across the full range of YouTube content.

The most important architectural decision is using MapReduceDocumentsChain instead of naive stuff chaining. Long videos produce transcripts that simply don't fit in a single LLM context window. Map-reduce handles this cleanly without any special-casing.

The FastAPI deployment pattern gives you a production-ready API in about 50 lines. Add Redis for job storage, Docker for containerization, and the deployment checklist from deploy AI model to production, and you have a scalable summarization service.

The extension to full Q&A over video transcripts — a RAG pipeline using the same chunks — turns this from a summarization tool into a knowledge base. If you're building anything research-related, that extension is worth the extra 20 lines of code.

FAQs

Does YoutubeLoader work on private or age-restricted YouTube videos? No. YoutubeLoader uses the youtube-transcript-api library, which can only access transcripts on publicly available videos that have captions enabled. For private or age-restricted videos, you need to download the audio file separately and run it through Whisper or another transcription service.

How long does it take to summarize a one-hour YouTube video? With YoutubeLoader (transcript already available), loading takes 2–5 seconds. MapReduceDocumentsChain summarization depends on the number of chunks and LLM speed. For a 1-hour video with roughly 8,000 words of transcript, expect 20–40 seconds using gpt-4o-mini on the map step. Using gpt-4o for both steps takes 40–90 seconds.

What's the difference between map-reduce and refine summarization strategies? Map-reduce summarizes each chunk independently (map step) then combines those summaries into a final output (reduce step). It parallelizes well and handles very long documents. Refine passes the running summary through each chunk sequentially, updating it at each step — this produces more coherent summaries but can't be parallelized and is slower for long documents.
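To make the contrast concrete, here is the refine loop reduced to plain Python. It is a sketch of the control flow only: `initial` and `refine` are hypothetical stand-ins for the two LLM calls, passed in as functions.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> str:
    """Sequential refine: each step sees the running summary plus the next chunk."""
    if not chunks:
        return ""
    summary = initial(chunks[0])
    for chunk in chunks[1:]:
        # Each call depends on the previous call's output, so the loop cannot run in parallel.
        summary = refine(summary, chunk)
    return summary
```

The data dependency inside the loop is the whole story: map-reduce has no such dependency between chunks, which is why its map step can fan out.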


The full agent at the end is production-ready. You can drop it into a FastAPI endpoint, a Slack bot, or a scheduled job.

For background on the broader agent framework this fits into, the build AI agent with LangChain post covers the foundational patterns.

The Architecture Before Any Code

The summarization pipeline has four stages:

Transcript acquisition — Try YouTube's built-in caption API via YoutubeLoader. If that fails (no captions, language unsupported, age restriction), fall back to downloading the audio and running Whisper.
Chunking — Split the transcript into overlapping chunks that fit within the LLM's context window.
Summarization — Use MapReduceDocumentsChain: summarize each chunk independently (map), then combine chunk summaries into a final output (reduce).
Agent orchestration — Wrap the pipeline as a tool that a LangChain agent can call based on user input.

This design handles videos of any length. A 10-minute tutorial and a 3-hour conference keynote go through the same pipeline.

Comparison: Summarization Strategies

Strategy	Speed	Quality	Max Length	Cost
`stuff` chain	Fastest	Excellent	~16k tokens	Low
`map_reduce` chain	Fast (parallel map)	Very Good	Unlimited	Medium
`refine` chain	Slow (sequential)	Excellent	Unlimited	Medium-High
`map_rerank` chain	Medium	Good (ranks chunks)	Unlimited	Medium
Rolling window (custom)	Medium	Good	Unlimited	Medium

Installation

pip install langchain langchain-community langchain-openai
pip install youtube-transcript-api pytube
pip install openai-whisper  # For Whisper fallback
pip install yt-dlp           # Better than pytube for audio download
# Whisper and yt-dlp's audio extraction both also need the ffmpeg binary installed on the system.

Optional for the full agent deployment:

pip install fastapi uvicorn python-dotenv

Step 1: Loading YouTube Transcripts with YoutubeLoader

from typing import Optional

from langchain_community.document_loaders import YoutubeLoader

def load_youtube_transcript(
    url: str,
    language: str = "en",
    translation: Optional[str] = None
) -> list:
    """
    Load transcript from a YouTube URL.
    
    Args:
        url: Full YouTube URL (from_youtube_url parses the video ID out of it)
        language: Transcript language code (e.g., 'en', 'es', 'fr')
        translation: Translate to this language if original not available
    
    Returns:
        List of Document objects with transcript text
    """
    loader = YoutubeLoader.from_youtube_url(
        url,
        add_video_info=True,     # Includes title, author, length in metadata
        language=[language],
        translation=translation
    )
    
    documents = loader.load()
    
    if not documents:
        raise ValueError(f"No transcript found for: {url}")
    
    print(f"Loaded transcript: {len(documents[0].page_content)} characters")
    print(f"Video: {documents[0].metadata.get('title', 'Unknown')}")
    print(f"Author: {documents[0].metadata.get('author', 'Unknown')}")
    print(f"Length: {documents[0].metadata.get('length', 0)} seconds")
    
    return documents


# Try it
try:
    docs = load_youtube_transcript(
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        language="en"
    )
    print("Transcript loaded successfully")
except Exception as e:
    print(f"Transcript loading failed: {e}")

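The add_video_info=True flag pulls in video details such as the title, author, length, and thumbnail URL. In langchain-community this lookup goes through the pytube package, which is why pytube is in the install list. This metadata is useful for generating attribution in summaries.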
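As a small illustration of the attribution use case, the helper below turns that metadata into a one-line credit. It assumes the `title`, `author`, and `length` keys printed in the loader function above (with `length` in seconds) and falls back gracefully when any of them is missing. `format_attribution` is a name invented for this sketch.

```python
def format_attribution(metadata: dict) -> str:
    """Build a one-line attribution from loader metadata, tolerating missing keys."""
    title = metadata.get("title") or "Unknown title"
    author = metadata.get("author") or "Unknown author"
    seconds = int(metadata.get("length") or 0)
    minutes, secs = divmod(seconds, 60)
    return f"{title} by {author} ({minutes}:{secs:02d})"
```

Prepending this line to the final summary keeps the source visible when summaries get pasted into docs or chat threads.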

Step 2: Whisper Fallback for Videos Without Captions

import os
import tempfile
import whisper
import yt_dlp


def download_audio(url: str, output_dir: str = None) -> str:
    """
    Download audio from a YouTube URL using yt-dlp.
    Returns the path to the downloaded audio file.
    """
    if output_dir is None:
        output_dir = tempfile.mkdtemp()
    
    output_template = os.path.join(output_dir, "%(id)s.%(ext)s")
    
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": output_template,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }],
        "quiet": True,
        "no_warnings": True,
    }
    
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        video_id = info.get("id", "audio")
        audio_path = os.path.join(output_dir, f"{video_id}.mp3")
    
    print(f"Audio downloaded: {audio_path}")
    return audio_path


def transcribe_with_whisper(
    audio_path: str,
    model_size: str = "base",
    language: str = None
) -> str:
    """
    Transcribe audio file using OpenAI Whisper.
    
    model_size options: tiny, base, small, medium, large
    Larger models are more accurate but slower and use more memory.
    """
    print(f"Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)
    
    print("Transcribing audio...")
    transcribe_opts = {"verbose": False}
    if language:
        transcribe_opts["language"] = language
    
    result = model.transcribe(audio_path, **transcribe_opts)
    
    transcript = result["text"]
    detected_language = result.get("language", "unknown")
    
    print(f"Transcription complete: {len(transcript)} characters")
    print(f"Detected language: {detected_language}")
    
    return transcript


def load_transcript_with_fallback(
    url: str,
    language: str = "en",
    whisper_model: str = "base",
    cleanup_audio: bool = True
) -> tuple:
    """
    Load transcript via YoutubeLoader with Whisper fallback.
    
    Returns: (transcript_text, metadata_dict, source)
    source is either 'youtube_captions' or 'whisper'
    """
    # Try YouTube captions first
    try:
        loader = YoutubeLoader.from_youtube_url(
            url,
            add_video_info=True,
            language=[language]
        )
        docs = loader.load()
        
        if docs and docs[0].page_content.strip():
            return (
                docs[0].page_content,
                docs[0].metadata,
                "youtube_captions"
            )
        
        raise ValueError("Empty transcript returned")
        
    except Exception as e:
        print(f"YouTube captions unavailable ({e}). Falling back to Whisper...")
    
    # Whisper fallback
    audio_path = None
    try:
        audio_path = download_audio(url)
        transcript_text = transcribe_with_whisper(
            audio_path,
            model_size=whisper_model,
            language=language if language != "en" else None
        )
        
        # Extract basic metadata via yt-dlp (no download)
        with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
            info = ydl.extract_info(url, download=False)
            metadata = {
                "title": info.get("title", "Unknown"),
                "author": info.get("uploader", "Unknown"),
                "length": info.get("duration", 0),
                "source": url
            }
        
        return transcript_text, metadata, "whisper"
        
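    finally:
        if cleanup_audio and audio_path and os.path.exists(audio_path):
            os.remove(audio_path)
            # download_audio() created a temp directory for this file; remove it
            # too if it is now empty so temp directories do not accumulate.
            try:
                os.rmdir(os.path.dirname(audio_path))
            except OSError:
                pass
            print("Audio file cleaned up")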
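Stripped of the YouTube and Whisper specifics, the function above is an ordered-fallback pattern: try the cheap source, and move to the expensive one only on failure. A generic version of that pattern is sketched below with hypothetical loader callables. It also collects the error from every source, which makes debugging easier than seeing only the last failure.

```python
from typing import Callable, List, Tuple


def first_successful(
    url: str,
    sources: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, loader) in order; return (transcript, name) from the first that works."""
    errors = []
    for name, loader in sources:
        try:
            text = loader(url)
            if text and text.strip():
                return text, name
            errors.append(f"{name}: empty transcript")
        except Exception as exc:  # any failure moves on to the next source
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All transcript sources failed: " + "; ".join(errors))
```

With this shape, adding a third source (a paid transcription API, for instance) is one more tuple in the list rather than another nested try block.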

Step 3: Chunking the Transcript

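Transcripts are one long string with no paragraph breaks, headings, or natural document structure. RecursiveCharacterTextSplitter handles this well when given sentence-ending separators: chunks break at sentence boundaries where punctuation exists and fall back to spaces where it does not (auto-generated captions are often unpunctuated).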

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document


def chunk_transcript(
    transcript_text: str,
    metadata: dict,
    chunk_size: int = 4000,
    chunk_overlap: int = 200
) -> list:
    """
    Split a transcript into overlapping chunks suitable for map-reduce.
    
    chunk_size: characters per chunk (4000 chars ≈ 1000 tokens)
    chunk_overlap: characters of overlap between adjacent chunks
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=[". ", "? ", "! ", "\n", " ", ""]
    )
    
    texts = splitter.split_text(transcript_text)
    
    # Wrap in Document objects with metadata
    chunks = [
        Document(
            page_content=text,
            metadata={
                **metadata,
                "chunk_index": i,
                "total_chunks": len(texts)
            }
        )
        for i, text in enumerate(texts)
    ]
    
    print(f"Split transcript into {len(chunks)} chunks")
    if chunks:  # an empty transcript yields no chunks; skip the average to avoid dividing by zero
        print(f"Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
    
    return chunks

For a 1-hour video, a transcript typically runs 8,000–12,000 words. At roughly 6 characters per word, that is about 48,000–72,000 characters, so chunk_size=4000 produces on the order of 12–20 chunks, which is still a comfortable workload for the map step.
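You can estimate the chunk count for any video length with simple sliding-window arithmetic. This is an approximation under stated assumptions: it treats the splitter as a fixed-stride window (the real splitter cuts at separators, so chunks come out a bit shorter and the count a bit higher) and assumes about 6 characters per word including the space.

```python
import math


def estimate_chunk_count(
    words: int,
    chunk_size: int = 4000,
    chunk_overlap: int = 200,
    chars_per_word: float = 6.0,  # assumption: ~5 letters plus a space
) -> int:
    """Approximate number of chunks for a transcript of the given word count."""
    total_chars = int(words * chars_per_word)
    if total_chars <= chunk_size:
        return 1 if total_chars > 0 else 0
    # Each chunk after the first advances by (chunk_size - chunk_overlap) characters.
    stride = chunk_size - chunk_overlap
    return math.ceil((total_chars - chunk_overlap) / stride)
```

Multiplying the result by your per-call latency gives a quick upper bound on map-step time when the calls run sequentially.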

Step 4: MapReduceDocumentsChain Summarization

MapReduceDocumentsChain is the right tool for long transcripts. The map step summarizes each chunk independently, and the reduce step combines those partial summaries into the final output.

from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate


def build_summarization_chain(
    model: str = "gpt-4o-mini",
    temperature: float = 0,
    chain_type: str = "map_reduce",
    output_format: str = "standard"
):
    """
    Build a summarization chain for YouTube transcripts.
    
    chain_type: 'stuff' (short videos) or 'map_reduce' (long videos). Any value other
    than 'stuff', including 'refine', currently builds the map_reduce chain.
    output_format: 'standard', 'bullets', 'structured'
    """
    llm = ChatOpenAI(model=model, temperature=temperature)
    
    if chain_type == "stuff":
        # For short videos where the full transcript fits in context
        prompt = PromptTemplate(
            template="""You are summarizing a YouTube video transcript.

Transcript:
{text}

Write a clear, informative summary that covers:
1. The main topic and purpose of the video
2. Key points and takeaways
3. Any specific techniques, tools, or conclusions mentioned

Summary:""",
            input_variables=["text"]
        )
        return load_summarize_chain(
            llm,
            chain_type="stuff",
            prompt=prompt
        )
    
    # Map-reduce for long videos
    map_prompt = PromptTemplate(
        template="""Summarize this section of a YouTube video transcript.
Be concise but capture all important points, facts, and conclusions.

Transcript section:
{text}

Section summary:""",
        input_variables=["text"]
    )
    
    if output_format == "bullets":
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a final summary.

Section summaries:
{text}

Write a final summary as bullet points grouped by topic.
Format: Start each group with a bold topic heading, followed by 2-4 bullet points.

Final summary:""",
            input_variables=["text"]
        )
    
    elif output_format == "structured":
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a structured summary.

Section summaries:
{text}

Write a final summary with these sections:
**Overview**: 2-3 sentences on what the video is about
**Key Points**: numbered list of the 5-7 most important takeaways
**Conclusion**: what the video recommends or concludes

Final summary:""",
            input_variables=["text"]
        )
    
    else:
        combine_prompt = PromptTemplate(
            template="""You are combining section summaries from a YouTube video into a final summary.

Section summaries:
{text}

Write a cohesive, well-structured summary covering the video's main points.
Aim for 3-5 paragraphs. Be specific about techniques, tools, or data mentioned.

Final summary:""",
            input_variables=["text"]
        )
    
    return load_summarize_chain(
        llm,
        chain_type="map_reduce",
        map_prompt=map_prompt,
        combine_prompt=combine_prompt,
        verbose=False
    )


def summarize_transcript(
    chunks: list,
    model: str = "gpt-4o-mini",
    chain_type: str = "map_reduce",
    output_format: str = "structured"
) -> str:
    """
    Summarize transcript chunks using the specified chain type.
    """
    # For very short transcripts, use 'stuff' directly
    total_chars = sum(len(c.page_content) for c in chunks)
    if total_chars < 6000 and chain_type == "map_reduce":
        print("Short transcript detected, using 'stuff' chain")
        chain_type = "stuff"
    
    chain = build_summarization_chain(
        model=model,
        chain_type=chain_type,
        output_format=output_format
    )
    
    print(f"Summarizing {len(chunks)} chunks with {chain_type} chain...")
    result = chain.invoke({"input_documents": chunks})
    
    summary = result.get("output_text", result.get("text", ""))
    print(f"Summary generated: {len(summary)} characters")
    
    return summary
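If you want to see what the chain is doing without LangChain in the way, the map-reduce flow fits in a dozen lines of plain Python. This is a conceptual sketch, not LangChain's implementation: `summarize_chunk` and `combine` are hypothetical stand-ins for the map and combine LLM calls, and the thread pool is one assumed way to run the independent map calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    summarize_chunk: Callable[[str], str],
    combine: Callable[[List[str]], str],
    max_workers: int = 4,
) -> str:
    """Summarize chunks independently (map), then merge the partial summaries (reduce)."""
    if not chunks:
        return ""
    # Map: chunks do not depend on each other, so they can run concurrently.
    # pool.map preserves input order, keeping the partial summaries in video order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    # Reduce: a single call over all partial summaries.
    return combine(partials)
```

Threads are a reasonable fit here because LLM calls are I/O-bound; the wall-clock time of the map step approaches that of the slowest batch of calls rather than the sum of all of them.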

For more on how chain types work in LangChain, the LangChain tutorial 2025 has a dedicated section on chain architectures.

Step 5: Full Pipeline Function

Before building the agent, let's wrap everything into a single pipeline function:

from typing import Optional
import time


def summarize_youtube_video(
    url: str,
    language: str = "en",
    whisper_model: str = "base",
    summarization_model: str = "gpt-4o-mini",
    combine_model: str = "gpt-4o",
    output_format: str = "structured",
    chunk_size: int = 4000,
    chunk_overlap: int = 200
) -> dict:
    """
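    Full pipeline: URL → transcript → chunks → summary.
    
    Returns dict with summary, metadata, timing, and source info.
    
    Note: combine_model is accepted but not used below; summarize_transcript
    runs both the map and combine steps on summarization_model. To split models,
    load_summarize_chain takes a reduce_llm argument for the map_reduce chain type.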
    """
    start_time = time.time()
    
    # Step 1: Load transcript
    print(f"\n{'='*50}")
    print(f"Processing: {url}")
    print(f"{'='*50}")
    
    transcript_text, metadata, source = load_transcript_with_fallback(
        url,
        language=language,
        whisper_model=whisper_model
    )
    
    load_time = time.time() - start_time
    print(f"Transcript loaded in {load_time:.1f}s (source: {source})")
    
    # Step 2: Chunk transcript
    chunks = chunk_transcript(
        transcript_text,
        metadata,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    # Step 3: Determine chain type
    total_chars = sum(len(c.page_content) for c in chunks)
    chain_type = "stuff" if total_chars < 6000 else "map_reduce"
    
    # Step 4: Summarize
    summary = summarize_transcript(
        chunks,
        model=summarization_model,
        chain_type=chain_type,
        output_format=output_format
    )
    
    total_time = time.time() - start_time
    
    return {
        "url": url,
        "title": metadata.get("title", "Unknown"),
        "author": metadata.get("author", "Unknown"),
        "duration_seconds": metadata.get("length", 0),
        "transcript_source": source,
        "num_chunks": len(chunks),
        "chain_type": chain_type,
        "summary": summary,
        "processing_time_seconds": round(total_time, 1)
    }


# Test run
if __name__ == "__main__":
    result = summarize_youtube_video(
        "https://www.youtube.com/watch?v=YOUR_VIDEO_ID",
        output_format="structured"
    )
    
    print(f"\nVideo: {result['title']}")
    print(f"Author: {result['author']}")
    print(f"Duration: {result['duration_seconds'] // 60} minutes")
    print(f"Processed in: {result['processing_time_seconds']}s")
    print(f"\n{'='*50}")
    print("SUMMARY:")
    print('='*50)
    print(result['summary'])

Step 6: Building the Agent

from langchain.tools import tool
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from typing import Optional
import re


@tool
def summarize_video(url: str, output_format: str = "structured") -> str:
    """
    Summarize a YouTube video given its URL.
    
    Use this tool when the user provides a YouTube URL and asks for a summary,
    key points, or overview of the video content.
    
    Args:
        url: Full YouTube URL (e.g., https://www.youtube.com/watch?v=...)
        output_format: 'standard', 'bullets', or 'structured' (default: structured)
    
    Returns:
        Structured summary of the video content
    """
    try:
        result = summarize_youtube_video(
            url=url,
            output_format=output_format
        )
        
        output = f"""**Video**: {result['title']}
**Author**: {result['author']}
**Duration**: {result['duration_seconds'] // 60} minutes
**Transcript source**: {result['transcript_source']}

{result['summary']}

*Processed in {result['processing_time_seconds']}s using {result['chain_type']} summarization*"""
        
        return output
        
    except Exception as e:
        return f"Failed to summarize video: {str(e)}"


@tool
def extract_video_id(url: str) -> str:
    """
    Extract the video ID from a YouTube URL.
    
    Use this when the user provides a YouTube URL and you need the video ID.
    """
    patterns = [
        r"youtube\.com/watch\?v=([a-zA-Z0-9_-]{11})",
        r"youtu\.be/([a-zA-Z0-9_-]{11})",
        r"youtube\.com/embed/([a-zA-Z0-9_-]{11})",
        r"youtube\.com/shorts/([a-zA-Z0-9_-]{11})"
    ]
    
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return f"Video ID: {match.group(1)}"
    
    return "Could not extract video ID from URL"


def build_youtube_agent(
    model: str = "gpt-4o",
    enable_memory: bool = True
) -> AgentExecutor:
    """
    Build a YouTube summarization agent with tool use.
    """
    llm = ChatOpenAI(model=model, temperature=0)
    
    tools = [summarize_video, extract_video_id]
    
    system_message = """You are a helpful assistant that specializes in summarizing 
YouTube videos. When given a YouTube URL, you use the summarize_video tool to 
generate a comprehensive summary. 

You can also:
- Explain specific parts of a video when asked
- Compare key points from multiple videos
- Extract the video ID from a URL
- Generate summaries in different formats (standard, bullets, structured)

Always use the summarize_video tool when the user provides a YouTube URL.
Be helpful and informative in your responses."""

    if enable_memory:
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_message),
            MessagesPlaceholder("chat_history", optional=True),
            ("human", "{input}"),
            MessagesPlaceholder("agent_scratchpad")
        ])
        
        memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )
    else:
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_message),
            ("human", "{input}"),
            MessagesPlaceholder("agent_scratchpad")
        ])
        memory = None
    
    agent = create_openai_tools_agent(llm, tools, prompt)
    
    return AgentExecutor(
        agent=agent,
        tools=tools,
        memory=memory,
        verbose=True,
        max_iterations=3,
        handle_parsing_errors=True
    )


# Run the agent
if __name__ == "__main__":
    agent = build_youtube_agent(model="gpt-4o")
    
    # Single video summary
    response = agent.invoke({
        "input": "Please summarize this video: https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
    })
    print(response["output"])
    
    # Follow-up question (uses memory)
    response2 = agent.invoke({
        "input": "Can you give me just the bullet points from that video?"
    })
    print(response2["output"])
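The `extract_video_id` tool covers URLs, but users also paste bare video IDs, and the pipeline functions expect a full URL. A small normalizer that accepts either form and returns one canonical watch URL is sketched below. It reuses the four URL shapes from the tool above, adds handling for extra query parameters before `v=`, and uses a hypothetical function name. Normalizing first also gives a cache one stable key per video.

```python
import re
from typing import Optional

_ID = r"[A-Za-z0-9_-]{11}"
_URL_PATTERNS = [
    rf"youtube\.com/watch\?(?:.*&)?v=({_ID})",  # v= may follow other query params
    rf"youtu\.be/({_ID})",
    rf"youtube\.com/embed/({_ID})",
    rf"youtube\.com/shorts/({_ID})",
]


def normalize_video_ref(ref: str) -> Optional[str]:
    """Return a canonical watch URL for a YouTube URL or bare video ID, else None."""
    ref = ref.strip()
    video_id = None
    if re.fullmatch(_ID, ref):
        # The whole input is an 11-character ID.
        video_id = ref
    else:
        for pattern in _URL_PATTERNS:
            match = re.search(pattern, ref)
            if match:
                video_id = match.group(1)
                break
    if video_id is None:
        return None
    return f"https://www.youtube.com/watch?v={video_id}"
```

Calling this at the top of the `summarize_video` tool would let the agent pass through whatever the user typed, and a `None` result gives a clean place to return a "that doesn't look like a YouTube link" message.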

For more on building agents with memory and tool use, the AI agent memory and planning post covers advanced memory patterns.

Step 7: Handling Long Videos with Custom Map-Reduce

from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.schema import Document


def hierarchical_summarize(
    chunks: list,
    map_model: str = "gpt-4o-mini",
    combine_model: str = "gpt-4o",
    intermediate_batch_size: int = 5
) -> str:
    """
    Hierarchical map-reduce for very long videos (3+ hours).
    
    Stage 1: Summarize each chunk individually (map)
    Stage 2: Combine chunk summaries in batches of intermediate_batch_size into section summaries
    Stage 3: Combine section summaries into final summary
    """
    fast_llm = ChatOpenAI(model=map_model, temperature=0)
    smart_llm = ChatOpenAI(model=combine_model, temperature=0)
    
    map_prompt = PromptTemplate(
        template="Summarize this transcript segment in 3-5 sentences:\n\n{text}\n\nSummary:",
        input_variables=["text"]
    )
    
    batch_combine_prompt = PromptTemplate(
        template="""Combine these segment summaries into a coherent section summary (5-8 sentences):

{text}

Section summary:""",
        input_variables=["text"]
    )
    
    final_combine_prompt = PromptTemplate(
        template="""You are writing the final summary of a YouTube video from section summaries.

Section summaries:
{text}

Write a comprehensive, well-structured summary with:
**Overview**: What this video covers and why it matters
**Key Points**: The 5-7 most important takeaways
**Tools/Techniques Mentioned**: Specific tools, frameworks, or methods discussed
**Conclusion**: Main recommendation or conclusion

Final summary:""",
        input_variables=["text"]
    )
    
    print(f"Stage 1: Summarizing {len(chunks)} chunks individually...")
    
    # Stage 1: Map each chunk (build the chain once and reuse it)
    map_chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=map_prompt)
    chunk_summaries = []
    for chunk in chunks:
        result = map_chain.invoke({"input_documents": [chunk]})
        chunk_summaries.append(result["output_text"])
    
    # Stage 2: Combine chunk summaries in batches
    num_sections = -(-len(chunk_summaries) // intermediate_batch_size)  # ceiling division
    print(f"Stage 2: Combining into {num_sections} sections...")
    
    section_summaries = []
    for i in range(0, len(chunk_summaries), intermediate_batch_size):
        batch = chunk_summaries[i:i + intermediate_batch_size]
        batch_text = "\n\n".join(batch)
        batch_doc = Document(page_content=batch_text)
        
        chain = load_summarize_chain(fast_llm, chain_type="stuff", prompt=batch_combine_prompt)
        result = chain.invoke({"input_documents": [batch_doc]})
        section_summaries.append(result["output_text"])
    
    # Stage 3: Final combine
    print("Stage 3: Generating final summary...")
    final_text = "\n\n".join(section_summaries)
    final_doc = Document(page_content=final_text)
    
    chain = load_summarize_chain(smart_llm, chain_type="stuff", prompt=final_combine_prompt)
    result = chain.invoke({"input_documents": [final_doc]})
    
    return result["output_text"]

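This three-stage approach keeps each LLM call within a manageable context window for videos of several hours. With the default batch size of 5, a 3-hour keynote split into 30 chunks becomes 30 chunk summaries → 6 section summaries → 1 final summary. The final call still grows with the number of sections, so extremely long inputs would need another combining level.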
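A fixed number of stages works for the video lengths discussed here. If you would rather not hard-code the depth, the same idea generalizes to a loop that keeps merging batches until the remaining summaries fit a size budget. The sketch below shows that loop with a hypothetical `combine` callable standing in for the LLM call and a character budget standing in for a token budget.

```python
from typing import Callable, List


def collapse_until_fits(
    summaries: List[str],
    combine: Callable[[List[str]], str],
    max_chars: int,
    batch_size: int = 5,
) -> List[str]:
    """Repeatedly merge batches of summaries until their total length fits max_chars."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each pass shrinks the list")
    while len(summaries) > 1 and sum(len(s) for s in summaries) > max_chars:
        # One pass: every group of batch_size summaries becomes a single summary.
        summaries = [
            combine(summaries[i:i + batch_size])
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries
```

Whatever list comes back is small enough to hand to the final combine prompt in one call. Each pass shrinks the list by roughly a factor of `batch_size`, so even very long inputs need only a few passes.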

Step 8: Complete Deployable FastAPI Endpoint

Here's the full production deployment as a FastAPI application:

import uuid
from datetime import datetime
from typing import Optional, Literal

import uvicorn
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel

# summarize_youtube_video is the pipeline function from Step 5;
# import it here from wherever you defined it.


app = FastAPI(
    title="YouTube Summarizer API",
    description="Summarize YouTube videos using LangChain and GPT-4o",
    version="1.0.0"
)

# In-memory job store (use Redis in production)
jobs = {}


class SummarizeRequest(BaseModel):
    url: str
    language: str = "en"
    output_format: Literal["standard", "bullets", "structured"] = "structured"
    whisper_model: str = "base"


class SummarizeResponse(BaseModel):
    job_id: str
    status: str
    created_at: str


class JobResult(BaseModel):
    job_id: str
    status: str
    result: Optional[dict] = None
    error: Optional[str] = None


def run_summarization_job(job_id: str, request: SummarizeRequest):
    """Background task that runs the summarization pipeline."""
    try:
        jobs[job_id]["status"] = "processing"
        
        result = summarize_youtube_video(
            url=request.url,
            language=request.language,
            output_format=request.output_format,
            whisper_model=request.whisper_model
        )
        
        jobs[job_id]["status"] = "completed"
        jobs[job_id]["result"] = result
        
    except Exception as e:
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(e)


@app.post("/summarize", response_model=SummarizeResponse)
async def create_summary_job(
    request: SummarizeRequest,
    background_tasks: BackgroundTasks
):
    """
    Submit a YouTube video for summarization.
    Returns a job ID to poll for results.
    """
    job_id = str(uuid.uuid4())[:8]
    
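    jobs[job_id] = {
        "status": "queued",
        # Note: utcnow() is deprecated since Python 3.12; prefer
        # datetime.now(timezone.utc) once timezone is imported.
        "created_at": datetime.utcnow().isoformat(),
        "result": None,
        "error": None
    }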
    
    background_tasks.add_task(run_summarization_job, job_id, request)
    
    return SummarizeResponse(
        job_id=job_id,
        status="queued",
        created_at=jobs[job_id]["created_at"]
    )


@app.get("/jobs/{job_id}", response_model=JobResult)
async def get_job_status(job_id: str):
    """
    Get the status and result of a summarization job.
    Poll this endpoint until status is 'completed' or 'failed'.
    """
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail=f"Job {job_id} not found")
    
    job = jobs[job_id]
    
    return JobResult(
        job_id=job_id,
        status=job["status"],
        result=job.get("result"),
        error=job.get("error")
    )


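@app.post("/summarize/sync")
def summarize_sync(request: SummarizeRequest):
    """
    Synchronous summarization endpoint.
    Blocks until complete, so use it only for videos under 30 minutes.
    Declared with plain `def` so FastAPI runs it in a worker thread
    instead of blocking the event loop.
    """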
    try:
        result = summarize_youtube_video(
            url=request.url,
            language=request.language,
            output_format=request.output_format,
            whisper_model=request.whisper_model
        )
        return {"status": "success", "data": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


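@app.get("/health")
async def health_check():
    # Count only jobs that are still queued or running, not finished ones
    active = sum(1 for job in jobs.values() if job["status"] in ("queued", "processing"))
    return {"status": "ok", "active_jobs": active}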


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run it with (assuming the file is saved as app.py):

uvicorn app:app --reload --port 8000

Then call it:

# Submit a job
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.youtube.com/watch?v=YOUR_ID", "output_format": "structured"}'

# Poll for result
curl http://localhost:8000/jobs/YOUR_JOB_ID
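
The same submit-then-poll flow can be scripted. Below is a minimal sketch of a polling client using only the standard library; `fetch_job` and `wait_for_job` are hypothetical helper names, and the 2-second interval and 10-minute timeout are assumed defaults rather than values from the API above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_job(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 2.0,
    timeout: float = 600.0,
) -> dict:
    """Poll until the job reports 'completed' or 'failed', or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {job['status']} after {timeout}s")
        time.sleep(interval)


# Usage (requires the API to be running):
# job = wait_for_job("YOUR_JOB_ID")
# print(job["status"], job.get("result") or job.get("error"))
```

Passing `fetch` as a parameter keeps the loop testable without a running server.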

For deploying this to production, the deploy AI model to production post covers containerization with Docker and hosting on cloud providers.

Optimizing for Cost and Speed

Running this at scale means managing token cost carefully. Here's what works in production:

Use gpt-4o-mini for map steps. The per-chunk summaries are mechanical: each one just compresses a section into 3–5 sentences. GPT-4o-mini handles this at a fraction of the cost.

# Cost comparison for a 1-hour video (~8,000 words, 8 chunks)
# gpt-4o map + combine:      ~$0.08 per video
# gpt-4o-mini map + gpt-4o combine: ~$0.02 per video
# gpt-4o-mini both:          ~$0.005 per video

# For high volume (1000 videos/day):
# All gpt-4o:     $80/day
# Mini map + 4o combine: $20/day
# All mini:       $5/day

Cache transcripts. Fetching a transcript, and especially running the Whisper fallback, is work you don't need to repeat for a video you've already processed. A small on-disk cache keyed by URL avoids it:

import hashlib
import json
from pathlib import Path
from typing import Optional

TRANSCRIPT_CACHE_DIR = Path("./transcript_cache")
TRANSCRIPT_CACHE_DIR.mkdir(exist_ok=True)

def get_cached_transcript(url: str) -> Optional[tuple]:
    """Return cached transcript if available."""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
    
    if cache_path.exists():
        with open(cache_path) as f:
            data = json.load(f)
        print(f"Transcript cache hit: {url_hash}")
        return data["text"], data["metadata"], data["source"]
    
    return None

def cache_transcript(url: str, text: str, metadata: dict, source: str):
    """Save transcript to cache."""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    cache_path = TRANSCRIPT_CACHE_DIR / f"{url_hash}.json"
    
    with open(cache_path, "w") as f:
        json.dump({"text": text, "metadata": metadata, "source": source}, f)
    
    print(f"Transcript cached: {url_hash}")
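
One limitation of keying the cache on the raw URL is that the same video can arrive as a `watch?v=` link, a `youtu.be` short link, or with extra query parameters, and each form hashes differently. A possible refinement is to normalize to the video ID first. This is a best-effort sketch; `extract_video_id` and `cache_key` are hypothetical helpers, and the list of URL shapes handled is an assumption, not an exhaustive one.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url_or_id: str) -> Optional[str]:
    """Best-effort extraction of the video ID from common YouTube URL shapes."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2 and parts[0] in ("embed", "shorts", "live"):
            return parts[1]
        return None
    if not parsed.scheme and not parsed.netloc and "/" not in url_or_id:
        # No URL structure at all: treat the input as a bare video ID
        return url_or_id or None
    return None


def cache_key(url_or_id: str) -> str:
    """Use the video ID as the cache key, falling back to the raw input."""
    return extract_video_id(url_or_id) or url_or_id
```

Hashing `cache_key(url)` instead of `url` would let both link styles hit the same cache file.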

For the broader picture on AI agent cost management, the AI API cost management post covers token budgeting across multi-step pipelines.

Extending the Agent: Q&A Over Video Content

Once you have the transcript as Documents, you can do more than summarize — you can run a full RAG pipeline over the video content:

from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# RetrievalQA still works but is marked deprecated in newer LangChain
# releases in favor of create_retrieval_chain.
from langchain.chains import RetrievalQA


def build_video_qa_system(chunks: list) -> RetrievalQA:
    """
    Build a Q&A system over a video's transcript.
    Lets users ask specific questions about video content.
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    
    return chain


# Use it
chunks = chunk_transcript(transcript_text, metadata)
qa_chain = build_video_qa_system(chunks)

# Ask specific questions
response = qa_chain.invoke({"query": "What library did the presenter use for authentication?"})
print("Answer:", response["result"])
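
Because the chain is built with `return_source_documents=True`, the response also carries the retrieved chunks under `"source_documents"`, which is useful for showing where an answer came from. The following is a small sketch of one way to display them; `format_sources` is a hypothetical helper and the 120-character excerpt length is an arbitrary choice.

```python
from typing import Any, Iterable, List


def format_sources(source_documents: Iterable[Any], max_chars: int = 120) -> List[str]:
    """Turn retrieved chunks into short, numbered excerpts for display."""
    lines = []
    for i, doc in enumerate(source_documents, start=1):
        text = " ".join(doc.page_content.split())  # collapse whitespace and newlines
        excerpt = text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."
        lines.append(f"[{i}] {excerpt}")
    return lines


# Usage with the response from qa_chain.invoke(...):
# for line in format_sources(response["source_documents"]):
#     print(line)
```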

For more on building this kind of multi-document search system, the OpenAI API integration post covers embedding optimization for large content libraries.

Handling Edge Cases

A few problems you'll hit in production and how to handle them:

Transcript language mismatch. Auto-generated captions are sometimes in the wrong language. Check metadata["language"] after loading and retry with translation="en" if needed.

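Very short videos (under 2 minutes). These sometimes have no transcript at all. Add a minimum duration check; the one below rejects anything under 60 seconds, so raise the threshold if you want to skip everything under 2 minutes:

length = metadata.get("length")  # duration in seconds, if the loader provided it
if length is not None and length < 60:
    return {"error": "Video too short to summarize (under 1 minute)"}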

Rate limits from transcript API. The youtube-transcript-api can hit rate limits for batch processing. Add exponential backoff:

import time
from typing import Callable, Any

def retry_with_backoff(
    func: Callable,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> Any:
    """Retry a function with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
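
When many workers retry at once, identical delays can make them hit the API again in lockstep, and retrying on every exception can hide real bugs. The variant below sketches two common refinements: random jitter on the delay and a restricted set of retryable exception types. Both the `retry_on` parameter and the 50–100% jitter range are assumptions added for illustration, not part of the function above.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Retry func on the listed exception types with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Sleep between 50% and 100% of the exponential delay
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            time.sleep(delay)
    raise ValueError("max_retries must be at least 1")


# Usage: wrap any zero-argument callable, e.g. a lambda around your loader
# transcript = retry_with_jitter(lambda: load_transcript(url), retry_on=(ConnectionError,))
```

Exceptions outside `retry_on` propagate immediately, so a bad video ID fails fast instead of waiting through every backoff step. `load_transcript` in the usage comment is a placeholder for whatever loader function you use.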

For agent patterns that handle failures gracefully in multi-step pipelines, the AI research agent build post shows robust error handling across multiple tool calls.

Conclusion

FAQs
