Build a Data Analysis Agent with AutoGPT (CSV, SQL, Plots)
Build a data analysis agent using AutoGPT that reads CSVs, queries SQL databases, and generates plots automatically. Full code with pandas and matplotlib.
Get more content like this on Telegram!
Daily AI tips, notes & resources — free
Most data analysis workflows follow the same pattern: load data, clean it, compute some statistics, make a chart, write a summary. A human analyst does this hundreds of times. The steps are well-defined. The judgment required for routine analysis is modest.
That's exactly the kind of task an AutoGPT-style data agent handles well. You describe what you want to know about your data — the agent writes the code, runs it, interprets results, and delivers an answer.
This guide builds a complete analytics automation agent from scratch: CSV loading, SQL querying, plot generation, and a written summary — all from a goal string.
What We're Building
The agent takes a goal like "analyze sales data from q4_sales.csv and identify the top performing product categories with a monthly trend chart" and produces:
- Summary statistics table
- A matplotlib bar chart saved as PNG
- Trend line charts for time-series data
- A written summary in plain English
No manual pandas code. No chart configuration. Just a goal and results.
Setting Up the Data Analysis Environment
pip install openai pandas matplotlib seaborn sqlalchemy
Create a workspace directory for the agent:
mkdir agent_workspace
mkdir agent_workspace/data
mkdir agent_workspace/output
Place your CSV and SQLite database files in agent_workspace/data/.
The Core Agent Architecture
Instead of the full AutoGPT stack, this builds the essential pattern: an LLM that generates Python code, a safe executor that runs it, and a loop that feeds results back.
import openai
import json
import subprocess
import os
import sys
from pathlib import Path
client = openai.OpenAI()
WORKSPACE = Path("agent_workspace")
SYSTEM_PROMPT = """You are a data analysis agent. Your job is to analyze data and answer questions.
You write Python code that uses pandas, matplotlib, seaborn, and sqlalchemy.
Always respond with valid JSON in this format:
{
"thought": "What I'm planning to do and why",
"action": "execute_python" or "write_file" or "read_file" or "done",
"code": "Python code to execute (if action is execute_python)",
"filename": "file path (if action is write_file or read_file)",
"content": "file content (if action is write_file)",
"result_summary": "summary of findings so far (if action is done)",
"is_complete": false
}
Rules:
- Save all charts to workspace/output/ as PNG files
- Save summaries to workspace/output/summary.txt
- Use pandas for data manipulation
- Always print() results so I can see them
- Handle errors gracefully — if code fails, try a different approach
- End with is_complete: true when analysis is done
"""
def execute_python_safely(code: str) -> str:
"""Execute Python code in isolated subprocess."""
# Write code to temp file
temp_file = WORKSPACE / "temp_analysis.py"
full_code = f"""
import sys
sys.path.insert(0, '{WORKSPACE}')
import pandas as pd
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import sqlite3
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')
WORKSPACE = '{WORKSPACE}'
DATA_DIR = '{WORKSPACE / "data"}'
OUTPUT_DIR = '{WORKSPACE / "output"}'
{code}
plt.close('all') # Clean up matplotlib state
"""
with open(temp_file, "w") as f:
f.write(full_code)
try:
result = subprocess.run(
[sys.executable, str(temp_file)],
capture_output=True,
text=True,
timeout=60,
cwd=str(WORKSPACE)
)
output = result.stdout + (f"\nERROR: {result.stderr}" if result.stderr else "")
return output[:3000] if len(output) > 3000 else output
except subprocess.TimeoutExpired:
return "ERROR: Code execution timed out (60s limit)"
except Exception as e:
return f"ERROR: {str(e)}"
def run_analysis_agent(goal: str, max_steps: int = 15) -> str:
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"GOAL: {goal}\n\nData files are in {WORKSPACE}/data/"}
]
final_summary = ""
for step in range(max_steps):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
response_format={"type": "json_object"},
temperature=0.1
)
output = json.loads(response.choices[0].message.content)
print(f"\n--- Step {step + 1} ---")
print(f"Thought: {output.get('thought', '')[:100]}")
print(f"Action: {output.get('action', '')}")
# Handle actions
if output.get("action") == "execute_python":
code = output.get("code", "")
print(f"Executing {len(code.split(chr(10)))} lines of Python...")
result = execute_python_safely(code)
print(f"Result preview: {result[:200]}")
messages.append({"role": "assistant", "content": json.dumps(output)})
messages.append({"role": "user", "content": f"Code execution result:\n{result}"})
elif output.get("action") == "write_file":
filepath = WORKSPACE / output.get("filename", "output/result.txt")
filepath.parent.mkdir(parents=True, exist_ok=True)
with open(filepath, "w") as f:
f.write(output.get("content", ""))
result = f"File written to {filepath}"
messages.append({"role": "assistant", "content": json.dumps(output)})
messages.append({"role": "user", "content": result})
elif output.get("action") == "read_file":
filepath = WORKSPACE / output.get("filename", "")
try:
with open(filepath, "r") as f:
content = f.read()[:2000]
result = f"File contents:\n{content}"
except Exception as e:
result = f"Error reading file: {e}"
messages.append({"role": "assistant", "content": json.dumps(output)})
messages.append({"role": "user", "content": result})
if output.get("is_complete"):
final_summary = output.get("result_summary", "Analysis complete.")
print(f"\n\nFINAL SUMMARY:\n{final_summary}")
break
return final_summary
Creating Sample Data for Testing
import pandas as pd
import numpy as np
from pathlib import Path
# Generate realistic sales data
np.random.seed(42)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
products = ["Electronics", "Clothing", "Food", "Books", "Sports"]
records = []
for date in dates:
for product in products:
base_sales = {"Electronics": 5000, "Clothing": 3000,
"Food": 8000, "Books": 1500, "Sports": 2500}
seasonal_factor = 1 + 0.3 * np.sin(2 * np.pi * date.month / 12)
noise = np.random.normal(1, 0.1)
records.append({
"date": date.strftime("%Y-%m-%d"),
"category": product,
"revenue": round(base_sales[product] * seasonal_factor * noise, 2),
"units_sold": int(np.random.randint(10, 200) * seasonal_factor),
"customer_count": int(np.random.randint(5, 100))
})
df = pd.DataFrame(records)
Path("agent_workspace/data").mkdir(parents=True, exist_ok=True)
df.to_csv("agent_workspace/data/sales_2024.csv", index=False)
print(f"Created {len(df)} rows of sales data")
Running Analysis Goals
# Goal 1: Basic category analysis
run_analysis_agent("""
Analyze sales_2024.csv and produce:
1. Total revenue by product category (bar chart)
2. Monthly revenue trend for top 2 categories (line chart)
3. Summary statistics table
4. Written findings saved to output/sales_analysis.txt
""")
# Goal 2: Seasonal patterns
run_analysis_agent("""
Identify seasonal patterns in sales_2024.csv:
- Which categories peak in which months?
- Calculate month-over-month growth rates
- Create a heatmap of category performance by month
- Save chart as output/seasonal_heatmap.png
""")
The Plot Generation in Practice
Here's what the agent typically generates for the bar chart goal:
# Agent-generated code (from thought → execute_python action)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df = pd.read_csv(f'{DATA_DIR}/sales_2024.csv')
df['date'] = pd.to_datetime(df['date'])
# Total revenue by category
category_totals = df.groupby('category')['revenue'].sum().sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(category_totals.index, category_totals.values,
color=['#2196F3', '#4CAF50', '#FF9800', '#9C27B0', '#F44336'])
ax.set_title('2024 Total Revenue by Category', fontsize=16, fontweight='bold')
ax.set_xlabel('Product Category', fontsize=12)
ax.set_ylabel('Revenue ($)', fontsize=12)
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, p: f'${x:,.0f}'))
# Add value labels on bars
for bar, val in zip(bars, category_totals.values):
ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 10000,
f'${val:,.0f}', ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.savefig(f'{OUTPUT_DIR}/category_revenue.png', dpi=150, bbox_inches='tight')
print(f"Chart saved. Top category: {category_totals.index[0]} (${category_totals.values[0]:,.0f})")
SQL Database Analysis
For SQL-based analysis, the agent handles query generation automatically:
import sqlite3
import pandas as pd
from pathlib import Path
# Create a SQLite database from the CSV
conn = sqlite3.connect("agent_workspace/data/sales.db")
df = pd.read_csv("agent_workspace/data/sales_2024.csv")
df.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
# Now run the SQL-aware analysis goal
run_analysis_agent("""
Connect to sales.db (SQLite). Using SQL queries:
1. Find the top 5 days by total revenue
2. Calculate average revenue per category per quarter
3. Find categories where Q4 revenue exceeded Q1 by more than 20%
4. Export results to output/sql_analysis.txt
""")
The agent generates SQL like this automatically:
# Agent-generated SQL analysis code
import sqlite3
import pandas as pd
conn = sqlite3.connect(f'{DATA_DIR}/sales.db')
# Q1: Top 5 days by revenue
query_top_days = """
SELECT date, SUM(revenue) as total_revenue
FROM sales
GROUP BY date
ORDER BY total_revenue DESC
LIMIT 5
"""
top_days = pd.read_sql(query_top_days, conn)
print("Top 5 Revenue Days:")
print(top_days.to_string(index=False))
# Q2: Quarterly category averages
query_quarterly = """
SELECT category,
CASE
WHEN CAST(SUBSTR(date, 6, 2) AS INTEGER) BETWEEN 1 AND 3 THEN 'Q1'
WHEN CAST(SUBSTR(date, 6, 2) AS INTEGER) BETWEEN 4 AND 6 THEN 'Q2'
WHEN CAST(SUBSTR(date, 6, 2) AS INTEGER) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END as quarter,
AVG(revenue) as avg_daily_revenue,
SUM(revenue) as total_revenue
FROM sales
GROUP BY category, quarter
ORDER BY category, quarter
"""
quarterly = pd.read_sql(query_quarterly, conn)
print("\nQuarterly Performance by Category:")
print(quarterly.to_string(index=False))
conn.close()
Output Examples
After a full analysis run, your agent_workspace/output/ directory contains:
output/
category_revenue.png # Bar chart of total revenue
monthly_trends.png # Line chart of top categories
seasonal_heatmap.png # Category × month performance heatmap
sql_analysis.txt # SQL query results
sales_analysis.txt # Written summary
A typical written summary looks like:
2024 Sales Analysis Summary
===========================
Total Revenue: $12,847,320
Top Category: Food ($3,641,200 — 28.3% of total)
Fastest Growth: Electronics (+34% YoY estimated)
Peak Month: December (seasonal factor 1.28x average)
Key Findings:
- Food maintains consistent volume throughout the year
- Electronics shows strong Q4 seasonality (holiday effect)
- Books revenue declined 12% in H2 vs H1
- Sports equipment has the highest revenue per transaction
Agent Output Comparison: Manual vs Automated
| Task | Manual Analyst | Data Analysis Agent |
|---|---|---|
| Load and inspect CSV | 5 minutes | 30 seconds |
| Summary statistics | 10 minutes | 1 minute |
| Category breakdown chart | 20 minutes | 2 minutes |
| Trend analysis | 30 minutes | 3 minutes |
| SQL queries + formatting | 45 minutes | 3 minutes |
| Written summary | 20 minutes | 1 minute |
| Total | ~2 hours | ~10 minutes |
The agent isn't replacing the analyst — it's eliminating the mechanical parts so the analyst can focus on interpretation and decision-making.
For broader context on agent capabilities, the AI research agent build guide shows a similar pattern applied to information retrieval. The Build AI agent with LangChain tutorial demonstrates how to wrap these analysis capabilities in a production API.
The AutoGPT vs BabyAGI comparison shows how different agent architectures approach task decomposition, which matters when your analysis goals become more complex and multi-step.
Analytics automation is one of the strongest use cases for AutoGPT-style agents precisely because the task space is well-defined, the tools are stable, and the outputs are easy to verify. The agent can be wrong — always validate results — but it gets you to 80% of the answer in 10% of the time.
Frequently Asked Questions
Can AutoGPT analyze CSV files automatically?
Yes. AutoGPT can read CSV files, generate pandas code to analyze them, execute that code in its workspace, and produce summary statistics, trend analysis, and visualizations — all from a plain-text goal description without manual coding.
How does AutoGPT handle SQL databases for analysis?
AutoGPT can generate SQL queries based on a goal, execute them against a connected database using Python's sqlalchemy or sqlite3, and interpret the results. You provide the database connection details and goal; the agent handles query construction and result formatting.
What visualization libraries work with AutoGPT data analysis?
Matplotlib and seaborn work best with AutoGPT's code execution environment. The agent generates Python code that uses these libraries directly and saves plots as PNG files in the workspace. Plotly works too but requires additional configuration for headless rendering.
How do I give AutoGPT access to my data files?
Place data files in AutoGPT's workspace directory (configured in the .env file as WORKSPACE_DIRECTORY). The agent can read, write, and analyze files in this directory. For security, keep sensitive data in this isolated workspace and restrict the agent's file access to this path only.
Can the analytics automation agent handle large datasets?
AutoGPT-style agents work well with datasets up to a few hundred MB that fit in memory. For larger datasets, the generated pandas code can use chunked reading (pd.read_csv with chunksize) or connect to a database and use SQL aggregations before pulling results into Python.
Frequently Asked Questions
AiTechWorlds Team
✓ Verified WriterThe AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.
Related Articles
10 AutoGPT Command Line Arguments (Continuous Mode, Speak)
Complete reference for AutoGPT's 10 most powerful CLI arguments. Master continuous mode, headless operation, and CI/CD integration for automated agent workflows.
10 AutoGPT Configuration Tweaks for Better Performance
10 proven AutoGPT configuration tweaks to improve speed, cut costs, and boost task success. Model selection, temperature, token limits, and workspace settings.
Build a Content Research Agent with AutoGPT (Trends, Outlines)
Build an AutoGPT content research agent that finds trending topics, analyzes SERPs, and generates SEO-ready outlines automatically — full workflow inside.
10 AutoGPT Environment Variables You Need to Configure
Master AutoGPT configuration with these 10 essential environment variables. Set API keys, select models, control costs, and tune performance.