Build a LangChain Agent for Data Analysis with Pandas
Build a LangChain pandas agent that answers natural language questions about your data, generates charts, and writes Python code — full working tutorial.
Get more content like this on Telegram!
Daily AI tips, notes & resources — free
Most data analysis workflows involve the same repetitive steps: load a CSV, run groupby aggregations, filter rows, compute statistics, generate charts. The LangChain pandas agent lets you describe what you want in plain English and handles the pandas code automatically.
This is not a toy. The pandas agent can answer complex questions like "What is the month-over-month revenue growth rate for the top 5 customers by total spend?" with a single natural language query. It writes, executes, and iterates on pandas code until it gets the right answer.
This tutorial builds a complete pandas data analysis agent: CSV loading, multi-step queries, chart generation, and a REST API wrapper. By the end you will have something your non-technical teammates can actually use.
What the Pandas Agent Does
The pandas agent uses the ReAct framework — it thinks, acts, and observes in a loop:
- Think — reads your question and the dataframe schema
- Act — writes Python/pandas code to answer the question
- Observe — reads the code output
- Repeat — if the output does not answer the question, tries again
This loop continues until the agent produces a final answer or hits the maximum iterations limit. It handles errors gracefully — if the first code attempt raises an exception, the agent reads the error and rewrites the code.
According to LangChain's internal benchmarks, GPT-4o achieves about 85% accuracy on pandas operations that require 2–4 steps, compared to 62% for GPT-3.5-turbo on the same tasks.
Setup
pip install langchain langchain-openai langchain-experimental pandas matplotlib seaborn tabulate
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
Step 1: Basic Pandas Agent
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
# Load your data
df = pd.read_csv("sales_data.csv")
# Create the agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_pandas_dataframe_agent(
llm=llm,
df=df,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=True, # shows the agent's thinking steps
allow_dangerous_code=True, # required — agent executes Python code
max_iterations=10,
max_execution_time=60,
)
# Ask questions in plain English
result = agent.invoke({"input": "What is the total revenue by product category?"})
print(result["output"])
For demonstration, let us create a realistic sample dataset:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
np.random.seed(42)
n = 1000
categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports"]
regions = ["North", "South", "East", "West"]
products = {
"Electronics": ["Laptop", "Phone", "Tablet", "Headphones", "Camera"],
"Clothing": ["T-Shirt", "Jeans", "Dress", "Jacket", "Shoes"],
"Books": ["Fiction", "Non-Fiction", "Textbook", "Comic", "Cookbook"],
"Home & Garden": ["Chair", "Lamp", "Plant", "Rug", "Curtains"],
"Sports": ["Yoga Mat", "Dumbbells", "Running Shoes", "Bike Helmet", "Water Bottle"],
}
# Generate sales data
rows = []
start_date = datetime(2025, 1, 1)
for _ in range(n):
category = np.random.choice(categories)
product = np.random.choice(products[category])
region = np.random.choice(regions)
date = start_date + timedelta(days=np.random.randint(0, 365))
quantity = np.random.randint(1, 20)
unit_price = round(np.random.uniform(10, 500), 2)
discount = round(np.random.choice([0, 0.05, 0.10, 0.15, 0.20]), 2)
revenue = round(quantity * unit_price * (1 - discount), 2)
customer_id = f"CUST-{np.random.randint(1, 201):04d}"
rows.append({
"date": date.strftime("%Y-%m-%d"),
"customer_id": customer_id,
"product": product,
"category": category,
"region": region,
"quantity": quantity,
"unit_price": unit_price,
"discount": discount,
"revenue": revenue,
})
df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"])
df.to_csv("sales_data.csv", index=False)
print(f"Created sample data: {df.shape}")
print(df.head())
Step 2: Advanced Queries
The pandas agent handles complex multi-step analyses:
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o", temperature=0) # GPT-4o for complex queries
agent = create_pandas_dataframe_agent(
llm=llm,
df=df,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=True,
allow_dangerous_code=True,
max_iterations=15,
)
# Complex multi-step queries
queries = [
"What is the month-over-month revenue growth rate for each quarter?",
"Which 5 customers have the highest total revenue, and what is their average discount?",
"Is there a correlation between discount percentage and order quantity?",
"What is the revenue breakdown by region and category? Show as a percentage.",
"Identify any products with declining revenue trend over the last 3 months.",
]
for query in queries:
print(f"\n{'='*60}")
print(f"Query: {query}")
print('='*60)
result = agent.invoke({"input": query})
print(f"Answer: {result['output']}")
Step 3: Chart Generation
The pandas agent can generate and save matplotlib visualizations:
import matplotlib
matplotlib.use("Agg") # non-interactive backend for server use
import matplotlib.pyplot as plt
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_pandas_dataframe_agent(
llm=llm,
df=df,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=True,
allow_dangerous_code=True,
)
# Chart generation query
chart_query = """
Create a bar chart showing total revenue by product category.
Use different colors for each bar.
Add a title "Revenue by Category", label the axes, and save the chart to 'revenue_by_category.png'.
Print the filename when done.
"""
result = agent.invoke({"input": chart_query})
print(result["output"])
# More complex visualization
trend_query = """
Create a line chart showing monthly revenue trends for each region (North, South, East, West).
Each region should be a different colored line.
Add a legend, title "Monthly Revenue by Region", and save to 'monthly_revenue_trends.png'.
"""
result = agent.invoke({"input": trend_query})
print(result["output"])
Step 4: Multiple DataFrames
The pandas agent can work with multiple dataframes at once, enabling join-based analyses:
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
# Create related dataframes
sales_df = pd.read_csv("sales_data.csv", parse_dates=["date"])
# Customer profile data
customer_data = {
"customer_id": [f"CUST-{i:04d}" for i in range(1, 201)],
"customer_segment": np.random.choice(["Enterprise", "SMB", "Consumer"], 200),
"acquisition_channel": np.random.choice(["Organic", "Paid", "Referral", "Email"], 200),
"country": np.random.choice(["USA", "Canada", "UK", "Australia"], 200),
"account_age_days": np.random.randint(30, 1000, 200),
}
customers_df = pd.DataFrame(customer_data)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Pass multiple dataframes as a list
multi_agent = create_pandas_dataframe_agent(
llm=llm,
df=[sales_df, customers_df], # list of dataframes
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=True,
allow_dangerous_code=True,
number_of_head_rows=5, # how many rows to show in schema description
)
cross_df_query = """
Join the sales data with customer data on customer_id.
What is the average revenue per order for Enterprise customers vs SMB customers?
Also show the total number of orders for each segment.
"""
result = multi_agent.invoke({"input": cross_df_query})
print(result["output"])
Step 5: Custom System Prompt
Add domain-specific context to improve the agent's accuracy:
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
custom_prefix = """
You are a senior data analyst working with sales transaction data.
The dataframe contains the following columns:
- date: transaction date (datetime)
- customer_id: unique customer identifier (string, format CUST-XXXX)
- product: product name (string)
- category: product category — one of: Electronics, Clothing, Books, Home & Garden, Sports
- region: sales region — one of: North, South, East, West
- quantity: units sold (integer)
- unit_price: price per unit in USD (float)
- discount: discount rate applied (float, 0.0 to 0.20)
- revenue: actual revenue after discount (float)
Important notes:
- Revenue = quantity * unit_price * (1 - discount)
- Fiscal quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, Q3 = Jul-Sep, Q4 = Oct-Dec
- Always format currency values with 2 decimal places and a dollar sign
- For percentage calculations, round to 1 decimal place
"""
agent = create_pandas_dataframe_agent(
llm=llm,
df=df,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=True,
allow_dangerous_code=True,
prefix=custom_prefix,
)
result = agent.invoke({"input": "What is the Q1 vs Q2 revenue comparison by category?"})
print(result["output"])
Step 6: FastAPI Wrapper
Expose the pandas agent as a REST endpoint for use by frontend applications:
from fastapi import FastAPI, HTTPException, UploadFile, File
from pydantic import BaseModel
import pandas as pd
import io
import tempfile
import os
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
app = FastAPI(title="Pandas Data Analysis API")
# In-memory dataframe store (use Redis + pickle in production)
dataframe_store: dict[str, pd.DataFrame] = {}
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
class QueryRequest(BaseModel):
session_id: str
question: str
@app.post("/upload-csv")
async def upload_csv(session_id: str, file: UploadFile = File(...)):
"""Upload a CSV file and create an analysis session."""
if not file.filename.endswith(".csv"):
raise HTTPException(status_code=400, detail="Only CSV files are supported")
content = await file.read()
df = pd.read_csv(io.StringIO(content.decode("utf-8")))
dataframe_store[session_id] = df
return {
"session_id": session_id,
"rows": len(df),
"columns": list(df.columns),
"dtypes": df.dtypes.astype(str).to_dict(),
"sample": df.head(3).to_dict(orient="records"),
}
@app.post("/analyze")
async def analyze_data(request: QueryRequest):
"""Run a natural language query against the uploaded dataframe."""
if request.session_id not in dataframe_store:
raise HTTPException(status_code=404, detail="Session not found. Upload a CSV first.")
df = dataframe_store[request.session_id]
try:
agent = create_pandas_dataframe_agent(
llm=llm,
df=df,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=False,
allow_dangerous_code=True,
max_iterations=10,
max_execution_time=30,
)
result = agent.invoke({"input": request.question})
return {
"question": request.question,
"answer": result["output"],
"session_id": request.session_id,
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Analysis failed: {str(e)}")
@app.get("/sessions/{session_id}/info")
async def get_dataframe_info(session_id: str):
"""Get information about a stored dataframe."""
if session_id not in dataframe_store:
raise HTTPException(status_code=404, detail="Session not found")
df = dataframe_store[session_id]
return {
"shape": df.shape,
"columns": list(df.columns),
"dtypes": df.dtypes.astype(str).to_dict(),
"null_counts": df.isnull().sum().to_dict(),
"numeric_stats": df.describe().to_dict(),
}
@app.delete("/sessions/{session_id}")
async def clear_session(session_id: str):
"""Remove a stored dataframe."""
if session_id in dataframe_store:
del dataframe_store[session_id]
return {"message": "Session cleared"}
raise HTTPException(status_code=404, detail="Session not found")
Run the API:
uvicorn api:app --reload --port 8000
Comparison Table: Data Analysis Agent Options
| Tool | Setup | Natural Language | Code Execution | Multi-DataFrame | Charts |
|---|---|---|---|---|---|
| LangChain Pandas Agent | Easy | Yes | Yes (local) | Yes | Yes |
| LangChain CSV Agent | Easy | Yes | Yes (local) | No | Limited |
| OpenAI Code Interpreter | Very easy | Yes | Sandboxed | Limited | Yes |
| Jupyter + LLM | Manual | Via plugins | Yes | Yes | Yes |
| Custom Tool Agent | Complex | Custom | Custom | Custom | Custom |
Error Handling and Safety
The pandas agent executes Python code in your Python environment. This is powerful but requires some precautions:
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd
import contextlib
import io
def safe_agent_query(df: pd.DataFrame, question: str, timeout: int = 30) -> dict:
"""
Run a pandas agent query with safety measures.
Returns the answer and any error information.
"""
# Create a copy to prevent modifications to original data
df_copy = df.copy()
# Remove sensitive columns before passing to agent
sensitive_columns = ["password", "ssn", "credit_card", "api_key", "secret"]
columns_to_remove = [
col for col in df_copy.columns
if any(s in col.lower() for s in sensitive_columns)
]
if columns_to_remove:
df_copy = df_copy.drop(columns=columns_to_remove)
print(f"Removed sensitive columns: {columns_to_remove}")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = create_pandas_dataframe_agent(
llm=llm,
df=df_copy,
agent_type=AgentType.OPENAI_FUNCTIONS,
verbose=False,
allow_dangerous_code=True,
max_iterations=8,
max_execution_time=timeout,
handle_parsing_errors=True,
)
try:
result = agent.invoke({"input": question})
return {
"success": True,
"answer": result["output"],
"question": question,
}
except Exception as e:
return {
"success": False,
"error": str(e),
"question": question,
"answer": "I was unable to answer this question. Please rephrase or simplify your query.",
}
# Test the safe wrapper
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
result = safe_agent_query(df, "What is the average revenue per category?")
print(result["answer"])
Prompting Tips for Better Results
The pandas agent performs much better with a few prompting adjustments:
# Be specific about output format
good_query = "Show total revenue by category. Return the results as a sorted table with Category and Revenue columns."
vague_query = "Show revenue by category." # agent may return code instead of a table
# Specify the metric you want
good_query = "Calculate the median (not mean) order value for each region."
ambiguous_query = "What is the average order value by region?" # could mean mean or median
# Reference column names directly if you know them
good_query = "What is the sum of the 'revenue' column grouped by 'category'?"
generic_query = "What is total sales by product type?" # agent has to guess column names
# For charts, specify file format and dimensions
good_query = "Create a 10x6 inch bar chart of revenue by category. Save as 'chart.png' with 150 DPI."
vague_query = "Make a chart of revenue by category."
Integration with AI Agents
The pandas agent is itself a LangChain agent. You can use it as a tool inside a larger agent that also has web search, file I/O, or API access. See Build AI agent with LangChain for agent composition patterns.
For automating data reports that combine web data with local CSV analysis, AI research agent build shows how to connect multiple agents. The patterns there work directly with the pandas agent as one of the sub-agents.
For deployment patterns for the FastAPI wrapper above, see Deploy AI model to production for infrastructure guidance specific to data analysis APIs.
Frequently Asked Questions
Is the LangChain pandas agent safe to use with sensitive data?
The pandas agent executes generated Python code locally, which means it can read and modify your dataframe. Do not pass it credentials, PII, or dataframes with sensitive columns unless you have reviewed the generated code. For production use, run the agent in a sandboxed environment with restricted filesystem and network access. At minimum, always pass a copy of the dataframe (not the original) and strip sensitive columns before creating the agent.
Can the LangChain pandas agent generate matplotlib charts?
Yes. The agent can generate matplotlib and seaborn charts when instructed to visualize data. Set allow_dangerous_code=True and ensure matplotlib is installed. The agent writes and executes the plotting code in your Python environment. For server use, set matplotlib.use("Agg") before importing pyplot to prevent GUI window creation.
What LLM works best for the LangChain pandas agent?
GPT-4o and GPT-4o-mini both perform well for most analyses. GPT-4o produces more accurate pandas code for complex multi-step analyses involving joins, window functions, or custom aggregations. For simple queries on small dataframes, GPT-4o-mini is sufficient and significantly cheaper. Avoid models weaker than GPT-3.5-turbo for anything beyond basic aggregations — they make frequent pandas syntax errors that cause the agent to hit its iteration limit.
Frequently Asked Questions
AiTechWorlds Team
✓ Verified WriterThe AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.
Related Articles
7 AutoGen Termination Conditions (Max Rounds, Human Approval)
Master all 7 AutoGen termination conditions including is_termination_msg, max_turns, and human approval patterns to stop agent loops reliably and safely.
AutoGen Tutorial: Microsoft's Multi-Agent Framework (2026)
Learn Microsoft AutoGen from scratch in 2026 — install, first agent conversation, GroupChat, and a full comparison of AutoGen 0.2 vs 0.4 features.
AutoGen vs LangChain: Which for Multi-Agent Systems in 2026?
AutoGen vs LangChain for multi-agent systems in 2026 — feature comparison, same use case in both frameworks, and an honest verdict on when each wins.
How to Use AutoGen with Tools (Web Scraper, Calculator, File)
Learn how to equip AutoGen agents with custom tools like web scrapers, calculators, and file handlers using register_for_llm and register_for_execution.