AI Tips Prompting Python AI Tools Web Dev ChatGPT LLM Agent Dev Reviews Notes Free Books

AiTechWorlds

data analysis charts from pandas dataframe — LangChain pandas agent data analysis

Build a LangChain Agent for Data Analysis with Pandas

Q: What LLM works best for the LangChain pandas agent?

GPT-4o and GPT-4o-mini both perform well. GPT-4o produces more accurate pandas code for complex multi-step analyses. For simple queries on small dataframes, GPT-4o-mini is sufficient and significantly cheaper. Avoid GPT-3.5 for anything beyond basic aggregations — it makes more pandas syntax errors.

⚡ Quick Answer

Build a LangChain pandas agent that answers natural language questions about your data, generates charts, and writes Python code — full working tutorial.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #pandas agent #data analysis #Python #AI data tools

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Most data analysis workflows involve the same repetitive steps: load a CSV, run groupby aggregations, filter rows, compute statistics, generate charts. The LangChain pandas agent lets you describe what you want in plain English and handles the pandas code automatically.

This is not a toy. The pandas agent can answer complex questions like "What is the month-over-month revenue growth rate for the top 5 customers by total spend?" with a single natural language query. It writes, executes, and iterates on pandas code until it gets the right answer.

This tutorial builds a complete pandas data analysis agent: CSV loading, multi-step queries, chart generation, and a REST API wrapper. By the end you will have something your non-technical teammates can actually use.

What the Pandas Agent Does

The pandas agent uses the ReAct framework — it thinks, acts, and observes in a loop:

Think — reads your question and the dataframe schema
Act — writes Python/pandas code to answer the question
Observe — reads the code output
Repeat — if the output does not answer the question, tries again

This loop continues until the agent produces a final answer or hits the maximum iterations limit. It handles errors gracefully — if the first code attempt raises an exception, the agent reads the error and rewrites the code.

According to LangChain's internal benchmarks, GPT-4o achieves about 85% accuracy on pandas operations that require 2–4 steps, compared to 62% for GPT-3.5-turbo on the same tasks.

Setup

pip install langchain langchain-openai langchain-experimental pandas matplotlib seaborn tabulate

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Step 1: Basic Pandas Agent

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

# Load your data
df = pd.read_csv("sales_data.csv")

# Create the agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,           # shows the agent's thinking steps
    allow_dangerous_code=True,  # required — agent executes Python code
    max_iterations=10,
    max_execution_time=60,
)

# Ask questions in plain English
result = agent.invoke({"input": "What is the total revenue by product category?"})
print(result["output"])

For demonstration, let us create a realistic sample dataset:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)
n = 1000

categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports"]
regions = ["North", "South", "East", "West"]
products = {
    "Electronics": ["Laptop", "Phone", "Tablet", "Headphones", "Camera"],
    "Clothing": ["T-Shirt", "Jeans", "Dress", "Jacket", "Shoes"],
    "Books": ["Fiction", "Non-Fiction", "Textbook", "Comic", "Cookbook"],
    "Home & Garden": ["Chair", "Lamp", "Plant", "Rug", "Curtains"],
    "Sports": ["Yoga Mat", "Dumbbells", "Running Shoes", "Bike Helmet", "Water Bottle"],
}

# Generate sales data
rows = []
start_date = datetime(2025, 1, 1)

for _ in range(n):
    category = np.random.choice(categories)
    product = np.random.choice(products[category])
    region = np.random.choice(regions)
    date = start_date + timedelta(days=np.random.randint(0, 365))
    quantity = np.random.randint(1, 20)
    unit_price = round(np.random.uniform(10, 500), 2)
    discount = round(np.random.choice([0, 0.05, 0.10, 0.15, 0.20]), 2)
    revenue = round(quantity * unit_price * (1 - discount), 2)
    customer_id = f"CUST-{np.random.randint(1, 201):04d}"

    rows.append({
        "date": date.strftime("%Y-%m-%d"),
        "customer_id": customer_id,
        "product": product,
        "category": category,
        "region": region,
        "quantity": quantity,
        "unit_price": unit_price,
        "discount": discount,
        "revenue": revenue,
    })

df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"])
df.to_csv("sales_data.csv", index=False)
print(f"Created sample data: {df.shape}")
print(df.head())

Step 2: Advanced Queries

The pandas agent handles complex multi-step analyses:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd

df = pd.read_csv("sales_data.csv", parse_dates=["date"])

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # GPT-4o for complex queries

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    max_iterations=15,
)

# Complex multi-step queries
queries = [
    "What is the month-over-month revenue growth rate for each quarter?",
    "Which 5 customers have the highest total revenue, and what is their average discount?",
    "Is there a correlation between discount percentage and order quantity?",
    "What is the revenue breakdown by region and category? Show as a percentage.",
    "Identify any products with declining revenue trend over the last 3 months.",
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print('='*60)
    result = agent.invoke({"input": query})
    print(f"Answer: {result['output']}")

Step 3: Chart Generation

The pandas agent can generate and save matplotlib visualizations:

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for server use
import matplotlib.pyplot as plt
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
)

# Chart generation query
chart_query = """
Create a bar chart showing total revenue by product category.
Use different colors for each bar.
Add a title "Revenue by Category", label the axes, and save the chart to 'revenue_by_category.png'.
Print the filename when done.
"""

result = agent.invoke({"input": chart_query})
print(result["output"])

# More complex visualization
trend_query = """
Create a line chart showing monthly revenue trends for each region (North, South, East, West).
Each region should be a different colored line.
Add a legend, title "Monthly Revenue by Region", and save to 'monthly_revenue_trends.png'.
"""

result = agent.invoke({"input": trend_query})
print(result["output"])

Step 4: Multiple DataFrames

The pandas agent can work with multiple dataframes at once, enabling join-based analyses:

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

# Create related dataframes
sales_df = pd.read_csv("sales_data.csv", parse_dates=["date"])

# Customer profile data
customer_data = {
    "customer_id": [f"CUST-{i:04d}" for i in range(1, 201)],
    "customer_segment": np.random.choice(["Enterprise", "SMB", "Consumer"], 200),
    "acquisition_channel": np.random.choice(["Organic", "Paid", "Referral", "Email"], 200),
    "country": np.random.choice(["USA", "Canada", "UK", "Australia"], 200),
    "account_age_days": np.random.randint(30, 1000, 200),
}
customers_df = pd.DataFrame(customer_data)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Pass multiple dataframes as a list
multi_agent = create_pandas_dataframe_agent(
    llm=llm,
    df=[sales_df, customers_df],  # list of dataframes
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    number_of_head_rows=5,  # how many rows to show in schema description
)

cross_df_query = """
Join the sales data with customer data on customer_id.
What is the average revenue per order for Enterprise customers vs SMB customers?
Also show the total number of orders for each segment.
"""

result = multi_agent.invoke({"input": cross_df_query})
print(result["output"])

Step 5: Custom System Prompt

Add domain-specific context to improve the agent's accuracy:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

custom_prefix = """
You are a senior data analyst working with sales transaction data.
The dataframe contains the following columns:
- date: transaction date (datetime)
- customer_id: unique customer identifier (string, format CUST-XXXX)
- product: product name (string)
- category: product category — one of: Electronics, Clothing, Books, Home & Garden, Sports
- region: sales region — one of: North, South, East, West
- quantity: units sold (integer)
- unit_price: price per unit in USD (float)
- discount: discount rate applied (float, 0.0 to 0.20)
- revenue: actual revenue after discount (float)

Important notes:
- Revenue = quantity * unit_price * (1 - discount)
- Fiscal quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, Q3 = Jul-Sep, Q4 = Oct-Dec
- Always format currency values with 2 decimal places and a dollar sign
- For percentage calculations, round to 1 decimal place
"""

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    prefix=custom_prefix,
)

result = agent.invoke({"input": "What is the Q1 vs Q2 revenue comparison by category?"})
print(result["output"])

Step 6: FastAPI Wrapper

Expose the pandas agent as a REST endpoint for use by frontend applications:

from fastapi import FastAPI, HTTPException, UploadFile, File
from pydantic import BaseModel
import pandas as pd
import io
import tempfile
import os
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

app = FastAPI(title="Pandas Data Analysis API")

# In-memory dataframe store (use Redis + pickle in production)
dataframe_store: dict[str, pd.DataFrame] = {}
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


class QueryRequest(BaseModel):
    session_id: str
    question: str


@app.post("/upload-csv")
async def upload_csv(session_id: str, file: UploadFile = File(...)):
    """Upload a CSV file and create an analysis session."""
    if not file.filename.endswith(".csv"):
        raise HTTPException(status_code=400, detail="Only CSV files are supported")

    content = await file.read()
    df = pd.read_csv(io.StringIO(content.decode("utf-8")))

    dataframe_store[session_id] = df

    return {
        "session_id": session_id,
        "rows": len(df),
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "sample": df.head(3).to_dict(orient="records"),
    }


@app.post("/analyze")
async def analyze_data(request: QueryRequest):
    """Run a natural language query against the uploaded dataframe."""
    if request.session_id not in dataframe_store:
        raise HTTPException(status_code=404, detail="Session not found. Upload a CSV first.")

    df = dataframe_store[request.session_id]

    try:
        agent = create_pandas_dataframe_agent(
            llm=llm,
            df=df,
            agent_type=AgentType.OPENAI_FUNCTIONS,
            verbose=False,
            allow_dangerous_code=True,
            max_iterations=10,
            max_execution_time=30,
        )

        result = agent.invoke({"input": request.question})

        return {
            "question": request.question,
            "answer": result["output"],
            "session_id": request.session_id,
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Analysis failed: {str(e)}")


@app.get("/sessions/{session_id}/info")
async def get_dataframe_info(session_id: str):
    """Get information about a stored dataframe."""
    if session_id not in dataframe_store:
        raise HTTPException(status_code=404, detail="Session not found")

    df = dataframe_store[session_id]
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isnull().sum().to_dict(),
        "numeric_stats": df.describe().to_dict(),
    }


@app.delete("/sessions/{session_id}")
async def clear_session(session_id: str):
    """Remove a stored dataframe."""
    if session_id in dataframe_store:
        del dataframe_store[session_id]
        return {"message": "Session cleared"}
    raise HTTPException(status_code=404, detail="Session not found")

Run the API:

uvicorn api:app --reload --port 8000

Comparison Table: Data Analysis Agent Options

Tool	Setup	Natural Language	Code Execution	Multi-DataFrame	Charts
LangChain Pandas Agent	Easy	Yes	Yes (local)	Yes	Yes
LangChain CSV Agent	Easy	Yes	Yes (local)	No	Limited
OpenAI Code Interpreter	Very easy	Yes	Sandboxed	Limited	Yes
Jupyter + LLM	Manual	Via plugins	Yes	Yes	Yes
Custom Tool Agent	Complex	Custom	Custom	Custom	Custom

Error Handling and Safety

The pandas agent executes Python code in your Python environment. This is powerful but requires some precautions:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd
import contextlib
import io


def safe_agent_query(df: pd.DataFrame, question: str, timeout: int = 30) -> dict:
    """
    Run a pandas agent query with safety measures.
    Returns the answer and any error information.
    """
    # Create a copy to prevent modifications to original data
    df_copy = df.copy()

    # Remove sensitive columns before passing to agent
    sensitive_columns = ["password", "ssn", "credit_card", "api_key", "secret"]
    columns_to_remove = [
        col for col in df_copy.columns
        if any(s in col.lower() for s in sensitive_columns)
    ]
    if columns_to_remove:
        df_copy = df_copy.drop(columns=columns_to_remove)
        print(f"Removed sensitive columns: {columns_to_remove}")

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df_copy,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        verbose=False,
        allow_dangerous_code=True,
        max_iterations=8,
        max_execution_time=timeout,
        handle_parsing_errors=True,
    )

    try:
        result = agent.invoke({"input": question})
        return {
            "success": True,
            "answer": result["output"],
            "question": question,
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "question": question,
            "answer": "I was unable to answer this question. Please rephrase or simplify your query.",
        }


# Test the safe wrapper
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
result = safe_agent_query(df, "What is the average revenue per category?")
print(result["answer"])

Prompting Tips for Better Results

The pandas agent performs much better with a few prompting adjustments:

# Be specific about output format
good_query = "Show total revenue by category. Return the results as a sorted table with Category and Revenue columns."
vague_query = "Show revenue by category."  # agent may return code instead of a table

# Specify the metric you want
good_query = "Calculate the median (not mean) order value for each region."
ambiguous_query = "What is the average order value by region?"  # could mean mean or median

# Reference column names directly if you know them
good_query = "What is the sum of the 'revenue' column grouped by 'category'?"
generic_query = "What is total sales by product type?"  # agent has to guess column names

# For charts, specify file format and dimensions
good_query = "Create a 10x6 inch bar chart of revenue by category. Save as 'chart.png' with 150 DPI."
vague_query = "Make a chart of revenue by category."

Integration with AI Agents

The pandas agent is itself a LangChain agent. You can use it as a tool inside a larger agent that also has web search, file I/O, or API access. See Build AI agent with LangChain for agent composition patterns.

For automating data reports that combine web data with local CSV analysis, AI research agent build shows how to connect multiple agents. The patterns there work directly with the pandas agent as one of the sub-agents.

For deployment patterns for the FastAPI wrapper above, see Deploy AI model to production for infrastructure guidance specific to data analysis APIs.

Frequently Asked Questions

Is the LangChain pandas agent safe to use with sensitive data?

The pandas agent executes generated Python code locally, which means it can read and modify your dataframe. Do not pass it credentials, PII, or dataframes with sensitive columns unless you have reviewed the generated code. For production use, run the agent in a sandboxed environment with restricted filesystem and network access. At minimum, always pass a copy of the dataframe (not the original) and strip sensitive columns before creating the agent.

Can the LangChain pandas agent generate matplotlib charts?

Yes. The agent can generate matplotlib and seaborn charts when instructed to visualize data. Set allow_dangerous_code=True and ensure matplotlib is installed. The agent writes and executes the plotting code in your Python environment. For server use, set matplotlib.use("Agg") before importing pyplot to prevent GUI window creation.

What LLM works best for the LangChain pandas agent?

GPT-4o and GPT-4o-mini both perform well for most analyses. GPT-4o produces more accurate pandas code for complex multi-step analyses involving joins, window functions, or custom aggregations. For simple queries on small dataframes, GPT-4o-mini is sufficient and significantly cheaper. Avoid models weaker than GPT-3.5-turbo for anything beyond basic aggregations — they make frequent pandas syntax errors that cause the agent to hit its iteration limit.

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

search relevance ranking showing scores — LangChain advanced RAG retrieval strategies

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

AI agent architecture with memory and tool connections — LangChain agent memory tools

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

developer coding AI agent decision loop — LangChain agent types ZeroShot ReAct Conversational

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

FastAPI server running LangChain endpoint — deploy LangChain FastAPI REST streaming

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

InterviewPython BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizPython Basics QuizPython OOP Concepts

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Langchain

Build a LangChain Agent for Data Analysis with Pandas

⚡ Quick Answer

Build a LangChain pandas agent that answers natural language questions about your data, generates charts, and writes Python code — full working tutorial.

AiTechWorlds Team May 31, 2026 12 min read

#LangChain #pandas agent #data analysis #Python #AI data tools

📚Part of the Langchain guide — explore all Langchain articles→

Share:Facebook Twitter/X LinkedIn Telegram WhatsApp

📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

What the Pandas Agent Does

The pandas agent uses the ReAct framework — it thinks, acts, and observes in a loop:

Think — reads your question and the dataframe schema
Act — writes Python/pandas code to answer the question
Observe — reads the code output
Repeat — if the output does not answer the question, tries again

According to LangChain's internal benchmarks, GPT-4o achieves about 85% accuracy on pandas operations that require 2–4 steps, compared to 62% for GPT-3.5-turbo on the same tasks.

Setup

pip install langchain langchain-openai langchain-experimental pandas matplotlib seaborn tabulate

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"

Step 1: Basic Pandas Agent

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

# Load your data
df = pd.read_csv("sales_data.csv")

# Create the agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,           # shows the agent's thinking steps
    allow_dangerous_code=True,  # required — agent executes Python code
    max_iterations=10,
    max_execution_time=60,
)

# Ask questions in plain English
result = agent.invoke({"input": "What is the total revenue by product category?"})
print(result["output"])

For demonstration, let us create a realistic sample dataset:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)
n = 1000

categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports"]
regions = ["North", "South", "East", "West"]
products = {
    "Electronics": ["Laptop", "Phone", "Tablet", "Headphones", "Camera"],
    "Clothing": ["T-Shirt", "Jeans", "Dress", "Jacket", "Shoes"],
    "Books": ["Fiction", "Non-Fiction", "Textbook", "Comic", "Cookbook"],
    "Home & Garden": ["Chair", "Lamp", "Plant", "Rug", "Curtains"],
    "Sports": ["Yoga Mat", "Dumbbells", "Running Shoes", "Bike Helmet", "Water Bottle"],
}

# Generate sales data
rows = []
start_date = datetime(2025, 1, 1)

for _ in range(n):
    category = np.random.choice(categories)
    product = np.random.choice(products[category])
    region = np.random.choice(regions)
    date = start_date + timedelta(days=np.random.randint(0, 365))
    quantity = np.random.randint(1, 20)
    unit_price = round(np.random.uniform(10, 500), 2)
    discount = round(np.random.choice([0, 0.05, 0.10, 0.15, 0.20]), 2)
    revenue = round(quantity * unit_price * (1 - discount), 2)
    customer_id = f"CUST-{np.random.randint(1, 201):04d}"

    rows.append({
        "date": date.strftime("%Y-%m-%d"),
        "customer_id": customer_id,
        "product": product,
        "category": category,
        "region": region,
        "quantity": quantity,
        "unit_price": unit_price,
        "discount": discount,
        "revenue": revenue,
    })

df = pd.DataFrame(rows)
df["date"] = pd.to_datetime(df["date"])
df.to_csv("sales_data.csv", index=False)
print(f"Created sample data: {df.shape}")
print(df.head())

Step 2: Advanced Queries

The pandas agent handles complex multi-step analyses:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd

df = pd.read_csv("sales_data.csv", parse_dates=["date"])

llm = ChatOpenAI(model="gpt-4o", temperature=0)  # GPT-4o for complex queries

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    max_iterations=15,
)

# Complex multi-step queries
queries = [
    "What is the month-over-month revenue growth rate for each quarter?",
    "Which 5 customers have the highest total revenue, and what is their average discount?",
    "Is there a correlation between discount percentage and order quantity?",
    "What is the revenue breakdown by region and category? Show as a percentage.",
    "Identify any products with declining revenue trend over the last 3 months.",
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print('='*60)
    result = agent.invoke({"input": query})
    print(f"Answer: {result['output']}")

Step 3: Chart Generation

The pandas agent can generate and save matplotlib visualizations:

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for server use
import matplotlib.pyplot as plt
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
)

# Chart generation query
chart_query = """
Create a bar chart showing total revenue by product category.
Use different colors for each bar.
Add a title "Revenue by Category", label the axes, and save the chart to 'revenue_by_category.png'.
Print the filename when done.
"""

result = agent.invoke({"input": chart_query})
print(result["output"])

# More complex visualization
trend_query = """
Create a line chart showing monthly revenue trends for each region (North, South, East, West).
Each region should be a different colored line.
Add a legend, title "Monthly Revenue by Region", and save to 'monthly_revenue_trends.png'.
"""

result = agent.invoke({"input": trend_query})
print(result["output"])

Step 4: Multiple DataFrames

The pandas agent can work with multiple dataframes at once, enabling join-based analyses:

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

# Create related dataframes
sales_df = pd.read_csv("sales_data.csv", parse_dates=["date"])

# Customer profile data
customer_data = {
    "customer_id": [f"CUST-{i:04d}" for i in range(1, 201)],
    "customer_segment": np.random.choice(["Enterprise", "SMB", "Consumer"], 200),
    "acquisition_channel": np.random.choice(["Organic", "Paid", "Referral", "Email"], 200),
    "country": np.random.choice(["USA", "Canada", "UK", "Australia"], 200),
    "account_age_days": np.random.randint(30, 1000, 200),
}
customers_df = pd.DataFrame(customer_data)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Pass multiple dataframes as a list
multi_agent = create_pandas_dataframe_agent(
    llm=llm,
    df=[sales_df, customers_df],  # list of dataframes
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    number_of_head_rows=5,  # how many rows to show in schema description
)

cross_df_query = """
Join the sales data with customer data on customer_id.
What is the average revenue per order for Enterprise customers vs SMB customers?
Also show the total number of orders for each segment.
"""

result = multi_agent.invoke({"input": cross_df_query})
print(result["output"])

Step 5: Custom System Prompt

Add domain-specific context to improve the agent's accuracy:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

df = pd.read_csv("sales_data.csv", parse_dates=["date"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

custom_prefix = """
You are a senior data analyst working with sales transaction data.
The dataframe contains the following columns:
- date: transaction date (datetime)
- customer_id: unique customer identifier (string, format CUST-XXXX)
- product: product name (string)
- category: product category — one of: Electronics, Clothing, Books, Home & Garden, Sports
- region: sales region — one of: North, South, East, West
- quantity: units sold (integer)
- unit_price: price per unit in USD (float)
- discount: discount rate applied (float, 0.0 to 0.20)
- revenue: actual revenue after discount (float)

Important notes:
- Revenue = quantity * unit_price * (1 - discount)
- Fiscal quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, Q3 = Jul-Sep, Q4 = Oct-Dec
- Always format currency values with 2 decimal places and a dollar sign
- For percentage calculations, round to 1 decimal place
"""

agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    agent_type=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    allow_dangerous_code=True,
    prefix=custom_prefix,
)

result = agent.invoke({"input": "What is the Q1 vs Q2 revenue comparison by category?"})
print(result["output"])

Step 6: FastAPI Wrapper

Expose the pandas agent as a REST endpoint for use by frontend applications:

from fastapi import FastAPI, HTTPException, UploadFile, File
from pydantic import BaseModel
import pandas as pd
import io
import tempfile
import os
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

app = FastAPI(title="Pandas Data Analysis API")

# In-memory dataframe store (use Redis + pickle in production)
dataframe_store: dict[str, pd.DataFrame] = {}
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


class QueryRequest(BaseModel):
    session_id: str
    question: str


@app.post("/upload-csv")
async def upload_csv(session_id: str, file: UploadFile = File(...)):
    """Upload a CSV file and create an analysis session."""
    if not file.filename.endswith(".csv"):
        raise HTTPException(status_code=400, detail="Only CSV files are supported")

    content = await file.read()
    df = pd.read_csv(io.StringIO(content.decode("utf-8")))

    dataframe_store[session_id] = df

    return {
        "session_id": session_id,
        "rows": len(df),
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "sample": df.head(3).to_dict(orient="records"),
    }


@app.post("/analyze")
async def analyze_data(request: QueryRequest):
    """Run a natural language query against the uploaded dataframe."""
    if request.session_id not in dataframe_store:
        raise HTTPException(status_code=404, detail="Session not found. Upload a CSV first.")

    df = dataframe_store[request.session_id]

    try:
        agent = create_pandas_dataframe_agent(
            llm=llm,
            df=df,
            agent_type=AgentType.OPENAI_FUNCTIONS,
            verbose=False,
            allow_dangerous_code=True,
            max_iterations=10,
            max_execution_time=30,
        )

        result = agent.invoke({"input": request.question})

        return {
            "question": request.question,
            "answer": result["output"],
            "session_id": request.session_id,
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Analysis failed: {str(e)}")


@app.get("/sessions/{session_id}/info")
async def get_dataframe_info(session_id: str):
    """Get information about a stored dataframe."""
    if session_id not in dataframe_store:
        raise HTTPException(status_code=404, detail="Session not found")

    df = dataframe_store[session_id]
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isnull().sum().to_dict(),
        "numeric_stats": df.describe().to_dict(),
    }


@app.delete("/sessions/{session_id}")
async def clear_session(session_id: str):
    """Remove a stored dataframe."""
    if session_id in dataframe_store:
        del dataframe_store[session_id]
        return {"message": "Session cleared"}
    raise HTTPException(status_code=404, detail="Session not found")

Run the API:

uvicorn api:app --reload --port 8000

Comparison Table: Data Analysis Agent Options

Tool	Setup	Natural Language	Code Execution	Multi-DataFrame	Charts
LangChain Pandas Agent	Easy	Yes	Yes (local)	Yes	Yes
LangChain CSV Agent	Easy	Yes	Yes (local)	No	Limited
OpenAI Code Interpreter	Very easy	Yes	Sandboxed	Limited	Yes
Jupyter + LLM	Manual	Via plugins	Yes	Yes	Yes
Custom Tool Agent	Complex	Custom	Custom	Custom	Custom

Error Handling and Safety

The pandas agent executes Python code in your Python environment. This is powerful but requires some precautions:

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
import pandas as pd
import contextlib
import io


def safe_agent_query(df: pd.DataFrame, question: str, timeout: int = 30) -> dict:
    """
    Run a pandas agent query with safety measures.
    Returns the answer and any error information.
    """
    # Create a copy to prevent modifications to original data
    df_copy = df.copy()

    # Remove sensitive columns before passing to agent
    sensitive_columns = ["password", "ssn", "credit_card", "api_key", "secret"]
    columns_to_remove = [
        col for col in df_copy.columns
        if any(s in col.lower() for s in sensitive_columns)
    ]
    if columns_to_remove:
        df_copy = df_copy.drop(columns=columns_to_remove)
        print(f"Removed sensitive columns: {columns_to_remove}")

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df_copy,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        verbose=False,
        allow_dangerous_code=True,
        max_iterations=8,
        max_execution_time=timeout,
        handle_parsing_errors=True,
    )

    try:
        result = agent.invoke({"input": question})
        return {
            "success": True,
            "answer": result["output"],
            "question": question,
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "question": question,
            "answer": "I was unable to answer this question. Please rephrase or simplify your query.",
        }


# Test the safe wrapper
df = pd.read_csv("sales_data.csv", parse_dates=["date"])
result = safe_agent_query(df, "What is the average revenue per category?")
print(result["answer"])

Prompting Tips for Better Results

The pandas agent performs much better with a few prompting adjustments:

# Be specific about output format
good_query = "Show total revenue by category. Return the results as a sorted table with Category and Revenue columns."
vague_query = "Show revenue by category."  # agent may return code instead of a table

# Specify the metric you want
good_query = "Calculate the median (not mean) order value for each region."
ambiguous_query = "What is the average order value by region?"  # could mean mean or median

# Reference column names directly if you know them
good_query = "What is the sum of the 'revenue' column grouped by 'category'?"
generic_query = "What is total sales by product type?"  # agent has to guess column names

# For charts, specify file format and dimensions
good_query = "Create a 10x6 inch bar chart of revenue by category. Save as 'chart.png' with 150 DPI."
vague_query = "Make a chart of revenue by category."

Integration with AI Agents

For deployment patterns for the FastAPI wrapper above, see Deploy AI model to production for infrastructure guidance specific to data analysis APIs.

Frequently Asked Questions

Is the LangChain pandas agent safe to use with sensitive data?

Can the LangChain pandas agent generate matplotlib charts?

What LLM works best for the LangChain pandas agent?

Share this article:Facebook Twitter/X LinkedIn Telegram WhatsApp

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

📱 Follow on Telegram 🐦 Follow on X Learn More →

Agent Development

10 LangChain Retrieval Strategies for Better RAG Results

Go beyond basic similarity search with ParentDocumentRetriever, MultiQueryRetriever, EnsembleRetriever, HyDE, and 6 more LangChain retrieval strategies — with code for each.

May 31, 2026 13 min read

Agent Development

Build a LangChain Agent with Memory and Tools (Full Example)

Build a complete LangChain conversational agent with persistent memory, multiple tools, and step-by-step trace — from setup to a production-ready implementation with code.

May 31, 2026 14 min read

Agent Development

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

Understand every major LangChain agent type — ZeroShotAgent, ReAct, ConversationalAgent, and more — with Python code and agent trace walkthroughs.

May 31, 2026 13 min read

Agent Development

How to Deploy a LangChain App as a FastAPI REST Endpoint

Serve a LangChain app as a production FastAPI REST endpoint with streaming, async chains, error handling, and Docker deployment — full Python code included.

May 31, 2026 11 min read

Go deeper on this topic

InterviewPython BookAI Agent Development Guide BookBuilding AI Apps: Developer's Guide CourseAI Agent Development Course QuizPython Basics QuizPython OOP Concepts

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources

Join Free Channel

No spam. Leave anytime.

Build a LangChain Agent for Data Analysis with Pandas

What the Pandas Agent Does

Setup

Step 1: Basic Pandas Agent

Step 2: Advanced Queries

Step 3: Chart Generation

Step 4: Multiple DataFrames

Step 5: Custom System Prompt

Step 6: FastAPI Wrapper

Comparison Table: Data Analysis Agent Options

Error Handling and Safety

Prompting Tips for Better Results

Integration with AI Agents

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

10 LangChain Retrieval Strategies for Better RAG Results

Build a LangChain Agent with Memory and Tools (Full Example)

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

How to Deploy a LangChain App as a FastAPI REST Endpoint

Go deeper on this topic

Get Free AI Notes Daily

Build a LangChain Agent for Data Analysis with Pandas

What the Pandas Agent Does

Setup

Step 1: Basic Pandas Agent

Step 2: Advanced Queries

Step 3: Chart Generation

Step 4: Multiple DataFrames

Step 5: Custom System Prompt

Step 6: FastAPI Wrapper

Comparison Table: Data Analysis Agent Options

Error Handling and Safety

Prompting Tips for Better Results

Integration with AI Agents

Frequently Asked Questions

💬 DiscussionPowered by GitHub Discussions

Frequently Asked Questions

AiTechWorlds Team

Related Articles

10 LangChain Retrieval Strategies for Better RAG Results

Build a LangChain Agent with Memory and Tools (Full Example)

5 LangChain Agent Types Explained (ZeroShot, ReAct, and More)

How to Deploy a LangChain App as a FastAPI REST Endpoint

Go deeper on this topic

Get Free AI Notes Daily