Build a Research Agent with AutoGPT (Web Search + Summarize)
Build an autonomous research agent with AutoGPT that searches the web, extracts key information, and produces structured summaries with configurable output formats.
A research agent that actually works — not one that loops for 40 cycles and produces a garbled paragraph — requires careful goal design, the right search configuration, and a clear output format. This guide builds one end-to-end.
The target: an autonomous agent that takes a research topic, searches the web, reads relevant pages, extracts key findings, and writes a structured report. No human intervention after setup.
For context on what makes autonomous agents different from regular LLM calls, AI agents explained is a good primer. And if you want to compare this AutoGPT approach to what you can build with LangChain, AI research agent build covers the LangChain side.
What You Need Before Starting
- AutoGPT installed (version 0.5+)
- OpenAI API key (GPT-4 or GPT-4o recommended)
- Search API key: SerpAPI, Google Custom Search, or Bing Search API
- Python 3.10+
- Optional: Docker (for safer execution)
# Install AutoGPT
git clone https://github.com/Significant-Gravitas/AutoGPT
cd AutoGPT/autogpts/autogpt
# Install dependencies
poetry install
# Copy and configure environment
cp .env.template .env
Configuring the Search Plugin
AutoGPT supports multiple search backends. SerpAPI is the most reliable for research tasks:
# .env file — search configuration
OPENAI_API_KEY=sk-your-openai-key
# Search backend — choose one
GOOGLE_API_KEY=your-google-api-key
CUSTOM_SEARCH_ENGINE_ID=your-search-engine-id
# OR use SerpAPI
SERPAPI_API_KEY=your-serpapi-key
# OR use DuckDuckGo (free but rate-limited)
# No key needed, just set:
SEARCH_BACKEND=duckduckgo
# Browser for reading web pages (the Selenium driver to use: chrome, firefox, safari, or edge)
USE_WEB_BROWSER=chrome
HEADLESS_BROWSER=True
# Model configuration
SMART_LLM_MODEL=gpt-4o
FAST_LLM_MODEL=gpt-4o-mini
# Output and memory
MEMORY_BACKEND=local
WORKSPACE_BACKEND=local
RESTRICT_TO_WORKSPACE=True
# Cost control
CONTINUOUS_LIMIT=25 # max actions per run
SerpAPI costs roughly $0.005 per search and gives you structured results. DuckDuckGo is free but has rate limits that can stall research runs. For serious research tasks, SerpAPI is worth the cost.
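Before launching a run, it can help to confirm that the .env file actually enables at least one search backend. The following is a minimal sketch, assuming the variable names used in the configuration above. The `parse_env` and `search_backends` helpers are hypothetical and are not part of AutoGPT.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "25 # max actions per run"
        value = value.split(" #", 1)[0].strip()
        env[key.strip()] = value
    return env


def search_backends(env: dict[str, str]) -> list[str]:
    """Return the search backends this configuration appears to enable."""
    found = []
    if env.get("SERPAPI_API_KEY"):
        found.append("serpapi")
    # Google Custom Search needs both the API key and the engine ID
    if env.get("GOOGLE_API_KEY") and env.get("CUSTOM_SEARCH_ENGINE_ID"):
        found.append("google")
    if env.get("SEARCH_BACKEND", "").lower() == "duckduckgo":
        found.append("duckduckgo")
    return found


sample = """
OPENAI_API_KEY=sk-your-openai-key
SERPAPI_API_KEY=your-serpapi-key
CONTINUOUS_LIMIT=25 # max actions per run
"""
env = parse_env(sample)
backends = search_backends(env)
print(backends or "No search backend configured")
```

To check a real file, pass `open(".env").read()` to `parse_env` instead of the sample string.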
Designing the Research Goal
The goal structure makes or breaks the agent. Here is the template:
# research_agent.yaml
ai_name: ResearchAgent
ai_role: >
  An autonomous research assistant that searches the web, reads sources,
  and produces structured reports with citations.
ai_goals:
  - >
    Search for "{TOPIC}" using web search. Find at least 5 credible sources
    published in {YEAR_RANGE}. Prefer: academic papers, official documentation,
    major tech publications (TechCrunch, Wired, MIT Technology Review).
  - >
    For each source found, browse the URL and extract:
    (1) Key claim or finding, (2) Supporting evidence or data,
    (3) Publication date, (4) Author or organization.
  - >
    Write a structured research report to research_report.md containing:
    (1) Executive summary (3-5 sentences),
    (2) Key findings as bullet points with citations,
    (3) Comparison table if multiple items are being compared,
    (4) Conclusion with 2-3 actionable insights.
  - >
    Verify that research_report.md exists and contains at least 500 words
    with citations for at least 3 sources.
  - >
    Task is COMPLETE when research_report.md is written and verified.
    Do not search for more sources after the file is written.
api_budget: 3.00
Notice the pattern: each goal is specific, the output format is defined, and there is an explicit termination condition. This prevents the most common failure mode — infinite refinement loops.
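The template uses `{TOPIC}` and `{YEAR_RANGE}` placeholders, which have to be filled in before the goals reach the agent. One way to do that is sketched below, with a crude check that the goal list still contains a termination condition. The `build_goals` helper and the shortened goal texts are illustrative assumptions, not AutoGPT features.

```python
import re

# Shortened versions of the goals from the template, for illustration only
GOAL_TEMPLATES = [
    'Search for "{TOPIC}" using web search. Find at least 5 credible sources '
    "published in {YEAR_RANGE}.",
    "Write a structured research report to research_report.md.",
    "Task is COMPLETE when research_report.md is written and verified. "
    "Do not search for more sources after the file is written.",
]

PLACEHOLDER = re.compile(r"\{[A-Z_]+\}")


def build_goals(topic: str, year_range: str) -> list[str]:
    """Fill the placeholders and reject goal lists without a stop rule."""
    values = {"{TOPIC}": topic, "{YEAR_RANGE}": year_range}
    goals = []
    for template in GOAL_TEMPLATES:
        unknown = [p for p in PLACEHOLDER.findall(template) if p not in values]
        if unknown:
            raise ValueError(f"No value supplied for {unknown[0]}")
        goals.append(PLACEHOLDER.sub(lambda m: values[m.group(0)], template))
    # Heuristic only: look for the explicit completion marker used above
    if not any("COMPLETE" in goal for goal in goals):
        raise ValueError("No explicit termination goal found")
    return goals


goals = build_goals("vector databases in enterprise AI", "2024-2025")
for goal in goals:
    print("-", goal)
```

The filled-in strings can then be written into the `ai_goals` list of the YAML file.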
The Research Agent in Action
Here is a concrete run using LangChain's AutoGPT implementation, which gives you more control than the CLI:
import json

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_community.tools.file_management import ReadFileTool, WriteFileTool
from langchain_community.vectorstores import FAISS
from langchain_core.tools import Tool
from langchain_experimental.autonomous_agents import AutoGPT
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize components
llm = ChatOpenAI(model="gpt-4o", temperature=0.2, max_tokens=4000)
embeddings = OpenAIEmbeddings()
# FAISS needs at least one text to build the index, hence the placeholder entry
vectorstore = FAISS.from_texts(["initial"], embeddings)
# Search tool
search = DuckDuckGoSearchRun()
# Web reader tool — fetches and summarizes web pages
def read_webpage(url: str) -> str:
    """Read and extract text content from a URL."""
    try:
        loader = WebBaseLoader(url)
        docs = loader.load()
        if docs:
            # Return first 3000 characters to avoid token overflow
            content = docs[0].page_content[:3000]
            return f"Content from {url}:\n{content}"
        return f"Could not load content from {url}"
    except Exception as e:
        # Return the error as text so the agent can move on to another source
        return f"Error reading {url}: {e}"


webpage_tool = Tool(
    name="read_webpage",
    func=read_webpage,
    description=(
        "Read the full content of a webpage.\n"
        "Input: a complete URL starting with http:// or https://\n"
        "Output: extracted text content from the page."
    ),
)
# Citation tracker
citations = []
def add_citation(citation_json: str) -> str:
    """Save a citation. Input: JSON with 'source', 'title', 'finding' keys."""
    try:
        citation = json.loads(citation_json)
    except json.JSONDecodeError:
        # Tell the agent exactly what shape is expected so it can retry
        return "Citation format error. Use JSON: {\"source\": \"url\", \"title\": \"title\", \"finding\": \"key finding\"}"
    citations.append(citation)
    return f"Citation saved. Total citations: {len(citations)}"
citation_tool = Tool(
    name="save_citation",
    func=add_citation,
    description="Save a citation from a source. Use after reading each relevant webpage.",
)

tools = [
    search,
    webpage_tool,
    citation_tool,
    WriteFileTool(),
    ReadFileTool(),
]

# Create the AutoGPT agent
agent = AutoGPT.from_llm_and_tools(
    ai_name="ResearchBot",
    ai_role=(
        "An autonomous research assistant that produces well-cited reports.\n"
        "Always search first, then read sources, then write the report.\n"
        "Never make up statistics or claims — only use information from sources you have read."
    ),
    tools=tools,
    llm=llm,
    memory=vectorstore.as_retriever(),
)
agent.chain.verbose = True
# Run the research agent
research_topic = "the impact of vector databases on enterprise AI applications in 2024-2025"
agent.run([
    f"Search for: {research_topic}",
    "Read the top 4-5 most relevant results. Use save_citation for each key finding.",
    "Write a structured research report to vector_db_report.md with: summary, key findings with citations, and conclusion.",
    "Stop after the report is written. Do not continue searching.",
])

print(f"\nCollected {len(citations)} citations")
for i, c in enumerate(citations, 1):
    print(f"{i}. {c.get('title', 'Unknown')} - {c.get('source', '')}")
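The citation tool above accepts any valid JSON, so an agent can save entries that lack a URL or a finding. A stricter parser might look like the sketch below. The `parse_citation` name and the rule that `source` must be a full URL are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = ("source", "title", "finding")


def parse_citation(citation_json: str) -> dict:
    """Parse a citation, raising ValueError with a specific reason on failure."""
    try:
        data = json.loads(citation_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("citation must be a JSON object")
    # Treat absent and blank values the same way
    missing = [k for k in REQUIRED_KEYS if not str(data.get(k, "")).strip()]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    if not str(data["source"]).startswith(("http://", "https://")):
        raise ValueError("source must be a full URL")
    return data


good = '{"source": "https://example.com/a", "title": "Example", "finding": "X beats Y"}'
print(parse_citation(good)["title"])

try:
    parse_citation('{"title": "No source"}')
except ValueError as err:
    print("Rejected:", err)
```

Inside `add_citation`, the `ValueError` message could be returned to the agent as the tool output so that it can correct the entry and retry.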
Adding a Summarization Chain
The raw content from web pages is often messy. Adding an explicit summarization step improves report quality:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
summarize_prompt = PromptTemplate(
input_variables=["content", "topic"],
template="""
You are a research assistant. Summarize the following content in the context of: {topic}
Content:
{content}
Provide a summary with:
1. Main claim or finding (1-2 sentences)
2. Key supporting data or evidence
3. Relevance to the topic
Summary:"""
)
summarize_chain = LLMChain(
    # A cheaper model is enough for summarization
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    prompt=summarize_prompt,
)


def summarize_webpage(url_and_topic: str) -> str:
    """Summarize a webpage in the context of the research topic.

    Input format: 'URL|||RESEARCH_TOPIC'
    """
    parts = url_and_topic.split("|||")
    if len(parts) != 2:
        return "Format error. Use: URL|||research topic"
    url, topic = parts[0].strip(), parts[1].strip()
    # Fetch the page text, then summarize it against the topic
    content = read_webpage(url)
    return summarize_chain.run(content=content, topic=topic)


summarize_tool = Tool(
    name="summarize_webpage",
    func=summarize_webpage,
    description=(
        "Fetch and summarize a webpage in context of your research topic.\n"
        "Input format: 'https://url.com|||your research topic'\n"
        "Better than read_webpage for extracting relevant information."
    ),
)
This two-step pattern (fetch, then summarize) costs more tokens but produces much cleaner source material for the final report. To make it available to the agent, add summarize_tool to the tools list before creating the agent.
Output Format Configuration
The structure of the final report is determined by what you put in the goal. Here are templates for different output formats:
Technical Brief Format:
report_format_goal = """
Write research_report.md with exactly this structure:
# [Topic] Research Brief
*Date: [today's date]*
## Executive Summary
[3-5 sentences summarizing the key findings]
## Key Findings
### [Finding 1 Title]
[2-3 sentences with evidence]
*Source: [URL]*
### [Finding 2 Title]
[2-3 sentences with evidence]
*Source: [URL]*
[Continue for all findings]
## Comparison Table
| Aspect | Option A | Option B | Option C |
|--------|----------|----------|----------|
| [row] | [value] | [value] | [value] |
## Recommendations
1. [Actionable recommendation]
2. [Actionable recommendation]
3. [Actionable recommendation]
## Sources
- [Title] — [URL] — [Accessed date]
"""
JSON Format (for programmatic use):
json_format_goal = """
Save research results to research_results.json with this structure:
{
"topic": "string",
"date": "YYYY-MM-DD",
"summary": "string",
"findings": [
{
"title": "string",
"content": "string",
"source_url": "string",
"relevance_score": 1-5
}
],
"recommendations": ["string"],
"sources": [{"title": "string", "url": "string"}]
}
"""
Running a Multi-Topic Research Session
For research across multiple related topics, use sequential AutoGPT runs or a loop:
research_topics = [
"Pinecone vector database pricing and performance 2025",
"Weaviate open source deployment options 2025",
"Qdrant vs Pinecone performance benchmarks 2025",
]
results = {}
for topic in research_topics:
    print(f"\nResearching: {topic}")
    # Fresh agent for each topic to avoid context contamination
    topic_agent = AutoGPT.from_llm_and_tools(
        ai_name="ResearchBot",
        ai_role="Precise research assistant. Only report verified facts from sources.",
        tools=tools,
        llm=llm,
        memory=FAISS.from_texts(["initial"], embeddings).as_retriever(),
    )
    safe_filename = topic.replace(" ", "_")[:50]
    topic_agent.run([
        f"Search for: {topic}",
        "Read the top 3 results. Extract key facts.",
        f"Save findings to {safe_filename}.json as a JSON object with 'topic', 'findings', 'sources' keys.",
        "Stop after saving the file.",
    ])
    results[topic] = safe_filename

print("\nAll topics researched. Files saved:")
for topic, filename in results.items():
    print(f"  {filename}.json - {topic}")
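Replacing spaces with underscores works for the three topics above, but a topic containing slashes, colons, or quotes would produce an awkward or invalid file name. A more defensive slug function is sketched here. `topic_to_filename` is a hypothetical helper, and the 50-character limit mirrors the slice used in the loop.

```python
import re


def topic_to_filename(topic: str, max_len: int = 50) -> str:
    """Lowercase the topic and collapse anything non-alphanumeric into '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    # Trim to length, then drop any underscore left dangling at the cut
    return slug[:max_len].rstrip("_") or "untitled"


print(topic_to_filename("Qdrant vs Pinecone performance benchmarks 2025"))
# qdrant_vs_pinecone_performance_benchmarks_2025
print(topic_to_filename("CI/CD: what's new?"))
# ci_cd_what_s_new
```

Swapping this in for the `topic.replace(" ", "_")[:50]` expression keeps the rest of the loop unchanged.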
Combining with a Summarization Agent
The most powerful pattern pairs AutoGPT's web research with a separate summarization agent:
import autogen
# AutoGPT handles raw research and saves to files
# AutoGen handles synthesis and final report generation
config_list = [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]
llm_config = {"config_list": config_list}
synthesizer = autogen.AssistantAgent(
    name="Synthesizer",
    system_message=(
        "You synthesize research from multiple files into a cohesive report.\n"
        "When given a list of JSON research files, read each one and combine the findings.\n"
        "Produce a final report that avoids duplication and highlights consensus findings."
    ),
    llm_config=llm_config,
)

editor = autogen.AssistantAgent(
    name="Editor",
    system_message=(
        "You edit and polish research reports.\n"
        "Check for: logical flow, repetition, unsupported claims, and readability.\n"
        "Return a polished version ready for publication."
    ),
    llm_config=llm_config,
)

user = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "workspace"},
)

# After the AutoGPT research phase. The file names below are placeholders:
# substitute the .json files your research runs actually saved.
user.initiate_chat(
    synthesizer,
    message=(
        "Read the research files: pinecone_research.json, weaviate_research.json,\n"
        "qdrant_research.json. Synthesize a comprehensive comparison report."
    ),
)
This hybrid approach uses each tool for what it does best: AutoGPT for autonomous web research, AutoGen for structured synthesis.
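Before handing the saved files to the synthesizer, it may be worth checking that each one parses and carries the 'topic', 'findings', and 'sources' keys requested in the research loop. The sketch below assumes that `findings` and `sources` are lists, which the goal text does not guarantee. `check_research_json` is a hypothetical helper.

```python
import json

# Expected keys and the types assumed for them in this sketch
REQUIRED = {"topic": str, "findings": list, "sources": list}


def check_research_json(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key, expected in REQUIRED.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"{key} should be a {expected.__name__}")
    if isinstance(data.get("findings"), list) and not data["findings"]:
        problems.append("findings is empty")
    return problems


ok = json.dumps({"topic": "t", "findings": ["a"], "sources": ["https://x.example"]})
print(check_research_json(ok))
print(check_research_json('{"topic": "t"}'))
```

Files that come back with problems can be re-researched or left out of the synthesis prompt.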
For more on AutoGen orchestration patterns, AutoGen group chat patterns shows how to coordinate multiple synthesis agents. And for a LangChain-only approach to the same research pattern, Build AI agent with LangChain is worth comparing.
Common Failure Modes and Fixes
Agent loops on search: Add to goals — "After finding 5 sources, stop searching and move to writing."
Off-topic research: Add to goals — "Only research [SPECIFIC TOPIC]. Do not follow links about unrelated subjects."
Empty or thin reports: Increase CONTINUOUS_LIMIT and add a goal — "The report must contain at least 600 words and 3 citations."
Browser errors: Try a different Selenium browser driver in .env (for example, USE_WEB_BROWSER=firefox instead of chrome). For headless environments, set HEADLESS_BROWSER=True.
Token cost overruns: Use FAST_LLM_MODEL=gpt-4o-mini for simple browsing and summarization steps, reserving SMART_LLM_MODEL=gpt-4o for planning and report writing.
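The word-count and citation requirements in these goals can also be checked outside the agent once a run finishes. This is a rough sketch that counts whitespace-separated words and distinct URLs as a stand-in for citations. The thresholds and the `check_report` name are assumptions.

```python
import re


def check_report(text: str, min_words: int = 500, min_sources: int = 3) -> list[str]:
    """Return a list of problems with the report; empty means it passes."""
    problems = []
    words = len(text.split())
    if words < min_words:
        problems.append(f"only {words} words (need {min_words})")
    # Distinct URLs serve as a proxy for cited sources
    urls = {u.rstrip(".,;") for u in re.findall(r"https?://[^\s)\]>]+", text)}
    if len(urls) < min_sources:
        problems.append(f"only {len(urls)} distinct source URLs (need {min_sources})")
    return problems


sample = "Too short. See https://a.example/1."
for problem in check_report(sample):
    print("FAIL:", problem)
```

Running it on the contents of research_report.md after each run gives a quick signal for whether to raise the action limit or tighten the goals.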
Cost Estimation
A typical research run producing a 500-word report with 5 sources costs:
| Component | Approx. Cost |
|---|---|
| 5 search queries (SerpAPI) | $0.025 |
| 5 web page reads (tokens) | $0.08 |
| Planning and reasoning steps | $0.15 |
| Report generation | $0.06 |
| Total | ~$0.32 |
This scales roughly linearly with the number of sources. A deep research project covering 20 sources with a full 2000-word report typically runs $1.00–$1.50. Set api_budget accordingly.
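The linear scaling can be turned into a quick estimate for choosing api_budget. The per-source figures below are derived by dividing the table rows by five. The 1.5x headroom factor is an assumption added here, not a figure from the table.

```python
# Per-source costs in USD, derived from the 5-source table above
PER_SOURCE_COST = {
    "search_query": 0.025 / 5,
    "page_read": 0.08 / 5,
    "planning": 0.15 / 5,
    "report": 0.06 / 5,
}


def estimate_cost(num_sources: int) -> float:
    """Estimate run cost, assuming cost grows linearly with source count."""
    return num_sources * sum(PER_SOURCE_COST.values())


def suggested_budget(num_sources: int, headroom: float = 1.5) -> float:
    """Add headroom for retries and longer pages (headroom is an assumption)."""
    return round(estimate_cost(num_sources) * headroom, 2)


for n in (5, 20):
    print(f"{n} sources: ~${estimate_cost(n):.3f}, budget ${suggested_budget(n):.2f}")
```

For 20 sources the estimate comes to about $1.26, which falls inside the $1.00–$1.50 range quoted above.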
For agents that need to work with existing document collections rather than live web search, Vector database guide explains how to build retrieval systems that let your agent search pre-indexed content instead of browsing the web each time.
Frequently Asked Questions
How does AutoGPT search the web? AutoGPT uses search APIs (Google Custom Search, SerpAPI, or DuckDuckGo) to retrieve search results, then uses a browser command (Selenium or Playwright) to fetch and read the full content of relevant URLs. You configure the search provider in your .env file.
Can AutoGPT summarize PDFs and documents? AutoGPT can summarize web pages directly. For PDFs, it depends on whether the PDF is accessible via URL — it can fetch and extract text from web-hosted PDFs. For local PDFs, you need to combine AutoGPT with a document processing tool or use LangChain's document loaders instead.
How do I prevent the research agent from going off-topic? Use specific, bounded goals with explicit constraints. Specify the exact topics, time ranges, and number of sources. Include a termination condition (e.g., "stop after finding 5 sources"). You can also add a constraint in the goal: "Only research [TOPIC]. Do not follow unrelated links."
AiTechWorlds Team
The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.