
Build a Research Agent: End-to-End Autonomous Research Tool in Python

Build a complete AI research agent in Python — web search, source validation, synthesis, and report generation. Production patterns with LangGraph and real code.

AiTechWorlds Team
May 27, 2026 12 min read

I needed to research "the current state of edge AI inference" for a technical report. My options: spend 4 hours reading papers and articles myself, or spend an afternoon building a research agent that could do it in 3 minutes.

I built the agent. Three months later, it's handled hundreds of research tasks, and the version I'll show you is cleaner than my first attempt — which hallucinated citations, looped endlessly on unclear queries, and produced reports that mixed facts from 2022 with facts from 2025 without distinguishing them.

This is the version that actually works: a research agent with real search, real source tracking, real synthesis, and no hallucinated citations.


Architecture Overview

Research Agent Architecture:

User Query
    ↓
[Planning Node]
  → Generate diverse search queries
  → Identify research dimensions
    ↓
[Search Execution Node]
  → Run queries in parallel
  → Extract content from URLs
  → Store sources in graph state
    ↓
[Gap Analysis Node]
  → What's covered?
  → What's missing?
  → Generate follow-up queries (if needed)
    ↓
[Synthesis Node]
  → Analyze all retrieved content
  → Extract key findings
  → Identify conflicts/contradictions
    ↓
[Report Generation Node]
  → Write structured report
  → Cite only retrieved sources
  → Format with sections
    ↓
Final Report with Citations

This five-node structure prevents the main failure modes: unbounded search loops (gap analysis has a max), citation hallucination (only retrieved URLs allowed), and shallow coverage (planning generates diverse queries upfront).


Setup and Dependencies

pip install langchain langchain-openai langgraph tavily-python
pip install beautifulsoup4 requests pydantic tiktoken

import os
from typing import TypedDict, Annotated, List, Optional
import operator

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, END
from pydantic import BaseModel, Field
from tavily import TavilyClient
import requests
from bs4 import BeautifulSoup
import tiktoken

# Models
llm_fast = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_smart = ChatOpenAI(model="gpt-4o", temperature=0.1)

# Search client
tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

# Token counter for context management
enc = tiktoken.encoding_for_model("gpt-4o")

State Definition

The state machine needs to track everything from initial query through final report:

class Source(BaseModel):
    url: str
    title: str
    content: str
    search_query: str
    relevance_score: float = 0.0

class ResearchState(TypedDict):
    # Input
    query: str
    research_depth: str  # "quick", "standard", "deep"
    
    # Planning
    search_queries: List[str]
    research_dimensions: List[str]
    
    # Execution
    sources: Annotated[List[Source], operator.add]
    search_iterations: int
    
    # Analysis
    key_findings: str
    coverage_gaps: List[str]
    needs_more_research: bool
    
    # Output
    report: str
    citations: List[str]
    
    # Control
    max_iterations: int
    error: Optional[str]
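
The `Annotated[List[Source], operator.add]` annotation is what lets each node return only its new sources: for a key with a reducer, LangGraph combines the old value and the update instead of overwriting. The sketch below imitates that merge rule in plain Python so the behavior is easy to see. `apply_update` is a hypothetical helper written for illustration, not part of the LangGraph API.

```python
import operator
from typing import Any, Callable, Dict

def apply_update(
    state: Dict[str, Any],
    update: Dict[str, Any],
    reducers: Dict[str, Callable[[Any, Any], Any]],
) -> Dict[str, Any]:
    """Merge a node's partial update into the state (illustrative only)."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            # Keys with a reducer are combined, e.g. list + list
            merged[key] = reducers[key](merged[key], value)
        else:
            # Keys without a reducer are simply overwritten
            merged[key] = value
    return merged

reducers = {"sources": operator.add}
state = {"sources": ["a"], "search_iterations": 1}
state = apply_update(
    state, {"sources": ["b", "c"], "search_iterations": 2}, reducers
)
print(state)
```

Under this rule, a node that returns `"sources": []` leaves the accumulated list unchanged, and a node that returns the full list again would duplicate every entry.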

Node 1: Research Planning

The planning node generates diverse, specific queries rather than one broad search:

class ResearchPlan(BaseModel):
    search_queries: List[str] = Field(
        description="5-7 specific search queries that cover different angles of the topic"
    )
    research_dimensions: List[str] = Field(
        description="Key dimensions to cover: technical, historical, practical, comparative, etc."
    )
    
structured_planner = llm_fast.with_structured_output(ResearchPlan)

def planning_node(state: ResearchState) -> ResearchState:
    depth_instructions = {
        "quick": "Generate 3 focused queries for a quick overview.",
        "standard": "Generate 5 queries covering technical details, examples, and comparisons.",
        "deep": "Generate 7 queries including edge cases, criticisms, and recent developments."
    }
    
    depth_note = depth_instructions.get(state["research_depth"], depth_instructions["standard"])
    
    plan = structured_planner.invoke([
        SystemMessage(content=f"""You are a research planning expert. 
        {depth_note}
        Each query should target a different aspect of the topic.
        Make queries specific enough to return useful results (not just the topic name).
        Include queries for: current state, comparisons, practical examples, limitations."""),
        HumanMessage(content=f"Create a research plan for: {state['query']}")
    ])
    
    return {
        "search_queries": plan.search_queries,
        "research_dimensions": plan.research_dimensions,
        "search_iterations": 0,
        "sources": [],
        "needs_more_research": False,
        "coverage_gaps": []
    }
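
The prompt asks for a fixed number of queries, but the model may still return near-duplicates or more queries than requested. A small guard before searching keeps the plan within budget. This is a minimal sketch: `normalize_queries` and `MAX_QUERIES` are hypothetical names, and the caps simply mirror the numbers in the depth instructions above.

```python
from typing import List

# Hypothetical caps that mirror the depth instructions in the planning prompt
MAX_QUERIES = {"quick": 3, "standard": 5, "deep": 7}

def normalize_queries(queries: List[str], depth: str) -> List[str]:
    """Collapse whitespace, drop blanks and case-insensitive duplicates, cap by depth."""
    seen = set()
    cleaned = []
    for q in queries:
        q = " ".join(q.split())  # collapse runs of whitespace
        key = q.lower()
        if not q or key in seen:
            continue
        seen.add(key)
        cleaned.append(q)
    return cleaned[: MAX_QUERIES.get(depth, MAX_QUERIES["standard"])]

raw = ["Edge AI  chips", "edge ai chips", "", "NPU benchmarks", "a", "b", "c"]
print(normalize_queries(raw, "quick"))
```

In the planning node this would wrap `plan.search_queries` before it is returned.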

Node 2: Search and Content Extraction

This node executes searches and extracts actual content from pages — not just snippets:

def extract_page_content(url: str, max_tokens: int = 1500) -> str:
    """Extract and clean content from a URL."""
    try:
        response = requests.get(url, timeout=10, headers={
            "User-Agent": "Mozilla/5.0 (compatible; ResearchBot/1.0)"
        })
        response.raise_for_status()  # Treat 4xx/5xx responses as failures
        
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Remove navigation, ads, scripts
        for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()
        
        # Extract main content
        main = soup.find("main") or soup.find("article") or soup.find("body")
        text = main.get_text(separator="\n", strip=True) if main else ""
        
        # Truncate to token limit
        tokens = enc.encode(text)
        if len(tokens) > max_tokens:
            text = enc.decode(tokens[:max_tokens])
        
        return text
        
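    except Exception as e:
        # Return an empty string so the caller's length check discards this page;
        # a long error message could otherwise be stored as source content
        print(f"Could not retrieve {url}: {e}")
        return ""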

def search_execution_node(state: ResearchState) -> ResearchState:
    new_sources = []
    
    for query in state["search_queries"]:
        try:
            # Tavily returns structured results with content
            results = tavily.search(
                query=query,
                max_results=3,
                search_depth="advanced",  # More thorough than basic
                include_raw_content=True
            )
            
            for result in results["results"]:
                # Use Tavily's extracted content or fall back to scraping
                content = result.get("raw_content") or extract_page_content(result["url"])
                
                if len(content) < 100:
                    continue
                
                source = Source(
                    url=result["url"],
                    title=result.get("title", ""),
                    content=content[:2000],  # Cap per-source content
                    search_query=query,
                    relevance_score=result.get("score", 0.0)
                )
                new_sources.append(source)
                
        except Exception as e:
            print(f"Search failed for '{query}': {e}")
    
    # Deduplicate by URL, against earlier iterations and within this batch
    seen_urls = {s.url for s in state["sources"]}
    unique_sources = []
    for s in new_sources:
        if s.url not in seen_urls:
            seen_urls.add(s.url)
            unique_sources.append(s)
    
    print(f"Found {len(unique_sources)} new sources (iteration {state['search_iterations'] + 1})")
    
    return {
        "sources": unique_sources,
        "search_iterations": state["search_iterations"] + 1
    }
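
The node above runs its queries one after another. Since each search is an independent network call, they can be fanned out with a thread pool, which is what the architecture diagram means by running queries in parallel. The sketch below assumes a `search_fn` callable that takes a query and returns a dict with a `"results"` list; in the real node that would be a thin wrapper around `tavily.search`. `run_searches_in_parallel` and `fake_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def run_searches_in_parallel(
    queries: List[str],
    search_fn: Callable[[str], dict],
    max_workers: int = 4,
) -> Dict[str, dict]:
    """Run search_fn over all queries concurrently.

    A query that raises maps to an empty result set, so one failure
    does not abort the whole batch.
    """
    def safe(query: str) -> dict:
        try:
            return search_fn(query)
        except Exception as e:
            print(f"Search failed for '{query}': {e}")
            return {"results": []}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input queries
        results = list(pool.map(safe, queries))
    return dict(zip(queries, results))

# Stub standing in for the real search client
def fake_search(query: str) -> dict:
    if query == "bad":
        raise RuntimeError("boom")
    return {"results": [{"url": f"https://example.com/{query}"}]}

out = run_searches_in_parallel(["a", "bad", "b"], fake_search)
print({q: len(r["results"]) for q, r in out.items()})
```

Keep `max_workers` small if the search API enforces a per-minute quota, since parallel calls reach that limit faster than a sequential loop.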

Node 3: Gap Analysis

After searching, the agent evaluates coverage and decides whether to search more:

class CoverageAnalysis(BaseModel):
    covered_dimensions: List[str]
    missing_dimensions: List[str]
    follow_up_queries: List[str] = Field(
        description="2-3 specific queries to fill the most important gaps"
    )
    needs_more_research: bool

structured_gap_analyzer = llm_fast.with_structured_output(CoverageAnalysis)

def gap_analysis_node(state: ResearchState) -> ResearchState:
    # Don't search more than max_iterations
    if state["search_iterations"] >= state["max_iterations"]:
        return {
            "needs_more_research": False,
            "coverage_gaps": []
        }
    
    # Summarize what we have
    source_summary = "\n".join([
        f"- [{s.title}]({s.url}): {s.content[:200]}..."
        for s in state["sources"][:10]
    ])
    
    analysis = structured_gap_analyzer.invoke([
        SystemMessage(content="""Analyze research coverage. 
        Identify what dimensions are well-covered and what's missing.
        Only request more research if there are significant gaps that matter for the query.
        Be conservative — 10+ sources is usually sufficient."""),
        HumanMessage(content=f"""Query: {state['query']}
        
Intended dimensions: {', '.join(state['research_dimensions'])}
        
Sources retrieved ({len(state['sources'])} total):
{source_summary}

Is more research needed?""")
    ])
    
    # Update queries if follow-up needed
    if analysis.needs_more_research and analysis.follow_up_queries:
        return {
            "search_queries": analysis.follow_up_queries,
            "coverage_gaps": analysis.missing_dimensions,
            "needs_more_research": True
        }
    
    return {
        "needs_more_research": False,
        "coverage_gaps": analysis.missing_dimensions
    }

def should_search_more(state: ResearchState) -> str:
    if state["needs_more_research"] and state["search_iterations"] < state["max_iterations"]:
        return "search_more"
    return "synthesize"
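
To convince yourself the loop terminates, it helps to trace the two checks above with a gap analyzer that is never satisfied. The sketch below is a plain-Python simulation of the search and gap analysis cycle, not LangGraph code; `count_search_rounds` is a hypothetical helper.

```python
def count_search_rounds(max_iterations: int, always_wants_more: bool = True) -> int:
    """Simulate the search -> gap_analysis loop and return how many searches ran."""
    search_iterations = 0
    while True:
        # Search node: one more round of searching
        search_iterations += 1

        # Gap analysis node: returns early once the cap is reached
        if search_iterations >= max_iterations:
            needs_more = False
        else:
            needs_more = always_wants_more

        # Router: mirrors should_search_more
        if not (needs_more and search_iterations < max_iterations):
            return search_iterations

for cap in (1, 2, 3):
    print(cap, count_search_rounds(cap))
```

Even in the worst case the number of search rounds equals `max_iterations`, and the first search always runs because the cap is only checked after it.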

Node 4: Synthesis

This node analyzes all retrieved content to extract structured findings:

def synthesis_node(state: ResearchState) -> ResearchState:
    # Build context from all sources
    source_context = []
    for i, source in enumerate(state["sources"]):
        source_context.append(
            f"Source {i+1}: {source.title}\nURL: {source.url}\n\n{source.content}\n"
        )
    
    # gpt-4o accepts 128k tokens of context; a 50k budget leaves room for the prompt and response
    full_context = "\n---\n".join(source_context)
    tokens = enc.encode(full_context)
    if len(tokens) > 50000:
        # Keep the 15 most relevant sources, but preserve each source's original
        # number so "Source N" still matches citation [N] in the report node
        ranked = sorted(
            enumerate(state["sources"]),
            key=lambda pair: pair[1].relevance_score,
            reverse=True
        )[:15]
        source_context = [
            f"Source {i+1}: {s.title}\nURL: {s.url}\n\n{s.content}\n"
            for i, s in sorted(ranked, key=lambda pair: pair[0])
        ]
        full_context = "\n---\n".join(source_context)
    
    analysis = llm_smart.invoke([
        SystemMessage(content="""You are an expert research analyst.
        Analyze the provided sources and extract key findings.
        Focus on: main themes, conflicting information, data points, expert opinions.
        Note any outdated information (check dates where visible).
        Be specific — include numbers, names, and facts from the sources."""),
        HumanMessage(content=f"""Research query: {state['query']}

Sources:
{full_context}

Provide a detailed analysis of key findings across all sources.""")
    ])
    
    return {"key_findings": analysis.content}
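
Cutting to a fixed number of sources does not guarantee the context fits the budget. An alternative is to pack sources greedily by relevance until a token budget is used up. The sketch below shows that idea under simplifying assumptions: sources are `(content, relevance_score)` tuples, and `count_tokens` is a whitespace-based stand-in for `len(enc.encode(text))`. Both helper names are hypothetical.

```python
from typing import List, Tuple

def count_tokens(text: str) -> int:
    # Stand-in for len(enc.encode(text)); word count is only a rough proxy
    return len(text.split())

def pack_sources(sources: List[Tuple[str, float]], budget: int) -> List[int]:
    """Return the original indices of the sources that fit the budget.

    Sources are considered from most to least relevant; one that does not
    fit is skipped, and a later, smaller one may still be added.
    """
    by_relevance = sorted(
        range(len(sources)), key=lambda i: sources[i][1], reverse=True
    )
    kept, used = [], 0
    for i in by_relevance:
        cost = count_tokens(sources[i][0])
        if used + cost <= budget:
            kept.append(i)
            used += cost
    # Original order, so source numbers stay aligned with the citation index
    return sorted(kept)

sources = [("a b c", 0.2), ("d e", 0.9), ("f g h i", 0.5)]
print(pack_sources(sources, budget=6))
print(pack_sources(sources, budget=5))
```

Returning indices rather than a renumbered list matters here: the report node numbers citations by position in `state["sources"]`, so the synthesis prompt should label each source with that same number.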

Node 5: Report Generation

The final node writes the report, strictly citing only retrieved sources:

def report_generation_node(state: ResearchState) -> ResearchState:
    # Build citation index
    citation_map = {
        i + 1: source
        for i, source in enumerate(state["sources"])
    }
    
    citation_list = "\n".join([
        f"[{i}] {source.title} — {source.url}"
        for i, source in citation_map.items()
    ])
    
    report = llm_smart.invoke([
        SystemMessage(content=f"""You are a research report writer.
Write a comprehensive, well-structured research report.

CRITICAL CITATION RULES:
1. Only cite sources from the provided citation list below
2. Use [N] format for inline citations
3. Never cite URLs not in this list
4. If information isn't from a source, don't add a citation

Report structure:
- Executive Summary (2-3 sentences)
- Key Findings (3-5 bullet points with citations)
- Detailed Analysis (3-4 H2 sections, ~200 words each, with citations)
- Limitations and Gaps
- Sources

Available citations:
{citation_list}"""),
        HumanMessage(content=f"""Query: {state['query']}

Research findings:
{state['key_findings']}

Write the full research report now.""")
    ])
    
    citations = [
        f"[{i}] {source.title} — {source.url}"
        for i, source in citation_map.items()
    ]
    
    return {
        "report": report.content,
        "citations": citations
    }
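
The citation rules in the prompt are instructions, not guarantees, so it is worth checking the finished report before returning it. A minimal post-processing check, assuming citations appear as `[N]` exactly as the prompt requests, could look like this. `find_invalid_citations` is a hypothetical helper.

```python
import re
from typing import List

def find_invalid_citations(report: str, num_sources: int) -> List[int]:
    """Return citation numbers used in the report that fall outside 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", report)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

sample_report = "Edge NPUs are improving [1][3]. Latency fell sharply [7]."
print(find_invalid_citations(sample_report, num_sources=3))
```

If the list is non-empty, the report node can strip those markers or regenerate the report. Note that this only verifies that each number maps to a retrieved source; it does not verify that the source supports the claim.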

Assembling the LangGraph Workflow

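def build_research_agent(max_iterations: int = 2):
    # Note: max_iterations is not read inside this function. The cap the graph
    # enforces comes from state["max_iterations"], which research() sets per depth.
    workflow = StateGraph(ResearchState)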
    
    workflow.add_node("planning", planning_node)
    workflow.add_node("search", search_execution_node)
    workflow.add_node("gap_analysis", gap_analysis_node)
    workflow.add_node("synthesis", synthesis_node)
    workflow.add_node("report", report_generation_node)
    
    workflow.set_entry_point("planning")
    workflow.add_edge("planning", "search")
    workflow.add_edge("search", "gap_analysis")
    
    workflow.add_conditional_edges(
        "gap_analysis",
        should_search_more,
        {
            "search_more": "search",
            "synthesize": "synthesis"
        }
    )
    
    workflow.add_edge("synthesis", "report")
    workflow.add_edge("report", END)
    
    return workflow.compile()

agent = build_research_agent(max_iterations=2)

def research(query: str, depth: str = "standard") -> dict:
    """Run the research agent."""
    depth_to_iterations = {"quick": 1, "standard": 2, "deep": 3}
    
    initial_state = {
        "query": query,
        "research_depth": depth,
        "search_queries": [],
        "research_dimensions": [],
        "sources": [],
        "search_iterations": 0,
        "key_findings": "",
        "coverage_gaps": [],
        "needs_more_research": False,
        "report": "",
        "citations": [],
        "max_iterations": depth_to_iterations.get(depth, 2),
        "error": None
    }
    
    result = agent.invoke(initial_state)
    
    return {
        "report": result["report"],
        "sources_used": len(result["sources"]),
        "citations": result["citations"],
        "coverage_gaps": result["coverage_gaps"]
    }

# Run it
result = research(
    "What are the current limitations of AI coding agents in 2025?",
    depth="standard"
)

print(result["report"])
print(f"\n{result['sources_used']} sources used")

Adding Streaming Progress Updates

For production use, stream progress updates to the user:

def research_with_streaming(query: str, depth: str = "standard"):
    """Stream research progress."""
    
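    initial_state = {
        "query": query,
        "research_depth": depth,
        "search_queries": [],
        "research_dimensions": [],
        "sources": [],
        "search_iterations": 0,
        "key_findings": "",
        "coverage_gaps": [],
        "needs_more_research": False,
        "report": "",
        "citations": [],
        # Same depth-to-iterations mapping as research(), so "quick" and "deep" are honored
        "max_iterations": {"quick": 1, "standard": 2, "deep": 3}.get(depth, 2),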
        "error": None
    }
    
    for event in agent.stream(initial_state, stream_mode="updates"):
        for node, updates in event.items():
            if node == "planning":
                queries = updates.get("search_queries", [])
                print(f"Planning complete: {len(queries)} queries generated")
                for q in queries:
                    print(f"  → {q}")
            
            elif node == "search":
                new_sources = updates.get("sources", [])
                iteration = updates.get("search_iterations", 0)
                print(f"\nSearch iteration {iteration}: {len(new_sources)} new sources")
            
            elif node == "gap_analysis":
                gaps = updates.get("coverage_gaps", [])
                needs_more = updates.get("needs_more_research", False)
                if needs_more:
                    print(f"Gaps found: {gaps}. Searching more...")
                else:
                    print("Coverage sufficient. Moving to synthesis.")
            
            elif node == "synthesis":
                print("\nSynthesizing findings...")
            
            elif node == "report":
                report = updates.get("report", "")
                print(f"\nReport complete ({len(report)} characters)")
                print("\n" + "="*60)
                print(report[:500] + "...")  # Preview

Production Considerations

Rate Limiting and Cost Control

import time
from functools import wraps

def rate_limited(calls_per_minute: int):
    """Simple rate limiter for API calls."""
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

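@rate_limited(calls_per_minute=30)
def safe_search(query: str, **kwargs) -> dict:
    """Rate-limited wrapper; call it in place of tavily.search() in search_execution_node."""
    kwargs.setdefault("max_results", 3)
    return tavily.search(query=query, **kwargs)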

Caching Repeated Research

import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("research_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_research(query: str, depth: str = "standard", max_age_hours: int = 24) -> dict:
    """Cache research results to avoid duplicate API calls."""
    
    cache_key = hashlib.md5(f"{query}:{depth}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    
    # Check cache
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        age_hours = (time.time() - cached["timestamp"]) / 3600
        if age_hours < max_age_hours:
            print(f"Cache hit (age: {age_hours:.1f}h)")
            return cached["result"]
    
    # Run research
    result = research(query, depth)
    
    # Save to cache
    cache_file.write_text(json.dumps({
        "query": query,
        "depth": depth,
        "timestamp": time.time(),
        "result": result
    }))
    
    return result
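
One weakness of hashing the raw query is that "Edge AI inference" and "edge ai  inference" produce different keys and trigger two paid runs. Normalizing the query before hashing avoids that. The sketch below is one possible approach; `make_cache_key` is a hypothetical helper, and it uses SHA-256 in place of MD5, which is an arbitrary choice since either works for a cache key.

```python
import hashlib

def make_cache_key(query: str, depth: str) -> str:
    """Build a cache key that ignores case and extra whitespace in the query."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{normalized}:{depth}".encode("utf-8")).hexdigest()

key_a = make_cache_key("Edge  AI inference", "standard")
key_b = make_cache_key("edge ai inference ", "standard")
key_c = make_cache_key("edge ai inference", "deep")
print(key_a == key_b, key_a == key_c)
```

Depth stays part of the key on purpose: a "quick" report should not be served when a "deep" one is requested.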

Conclusion

A working research agent requires five things that AutoGPT-style agents lacked: structured planning, bounded search iterations, content extraction (not just snippets), gap analysis with a termination condition, and citation tracking that prevents hallucination. LangGraph's explicit state machine makes each of these constraints enforceable rather than hoped-for.

The cost is roughly $0.15-$0.45 per research task with the configuration above (the FAQ below breaks this down), which is competitive with paying a research assistant, even at scale.

For the broader agent frameworks that power this pattern, see our LangGraph tutorial. For agent memory systems that let research agents learn across tasks, see our agent memory and planning guide.


Frequently Asked Questions

What tools does a research agent need to be useful?

At minimum: a web search API, a URL content extractor, and an LLM for synthesis. Tavily is a good fit for the search API because it returns structured JSON results with relevance scores. For production: add a vector store to deduplicate and cache content across searches, a citation tracker, and a rate limiter. Tavily's search_depth="advanced" mode is more thorough than basic search and, in my experience, returns noticeably better results.

How do I prevent a research agent from hallucinating citations?

Track every source URL retrieved from tools. Build a numbered citation index before the report generation step. Instruct the LLM to cite only numbers from that index. Verify in post-processing that every [N] reference in the report maps to a real retrieved URL. Never let the model generate a URL it didn't receive from a tool call — this single rule eliminates 90% of citation hallucinations.

What is the difference between a research agent and a RAG system?

RAG retrieves from a static pre-indexed knowledge base. A research agent dynamically searches at query time, which means it handles current events, novel topics, and queries outside any pre-built index. Research agents are slower and more expensive per query; RAG is faster for known domains. Production systems often combine both: the agent searches the web and caches results in a vector store that RAG queries for follow-ups.

How many search iterations should a research agent do?

Two iterations (initial plan + gap fill) covers 90% of research tasks well. The first iteration covers the main topic dimensions; the second fills specific gaps identified in gap analysis. Three iterations is occasionally needed for deep technical topics. Beyond three, additional searches rarely improve report quality and significantly increase cost. Cap max_iterations at 2-3 and rely on query diversity in planning rather than iteration count.

How much does running a research agent cost?

A standard research task (5 queries, 10-15 pages, one 2000-word report): Tavily API ~$0.10-0.25, GPT-4o-mini for planning ~$0.01-0.03, GPT-4o for synthesis ~$0.05-0.15. Total: ~$0.15-0.45 per task. Using GPT-4o-mini for all intermediate steps (planning, gap analysis, synthesis) and reserving GPT-4o for final report writing reduces the LLM share of that cost by ~70% with minimal quality impact; the Tavily charge is unchanged.
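
The total is just the sum of the component ranges, which is easy to check and to adapt to your own prices. The figures below are the estimates quoted above, not measured values, and the variable names are illustrative.

```python
# Per-task cost estimates in USD as (low, high), taken from the answer above
components = {
    "Tavily API": (0.10, 0.25),
    "GPT-4o-mini planning": (0.01, 0.03),
    "GPT-4o synthesis": (0.05, 0.15),
}

# Sum the low ends and the high ends separately
low = round(sum(lo for lo, _ in components.values()), 2)
high = round(sum(hi for _, hi in components.values()), 2)
print(f"${low:.2f} to ${high:.2f} per task")
```

The sum lands at $0.16 to $0.43, which is where the rounded $0.15-0.45 range comes from. Search is the largest single line item, so trimming the number of queries moves the total more than swapping models does.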
