Do I need math to learn Pandas?

Basic math helps (mean, median, percentage) but you don't need advanced math. Pandas handles the calculations — you just need to know what to ask for.

What is Pandas used for?

Pandas is used for loading, cleaning, transforming, and analyzing tabular data (like spreadsheets). It's the foundation of almost all Python data science work.

Is Pandas better than Excel?

For large datasets (100K+ rows), complex transformations, reproducible analysis, and automation, Python Pandas is far superior. Excel is better for quick, one-off analysis by non-programmers.

Python Pandas Tutorial 2026 — Data Analysis for Beginners to Pro

Python Pandas Tutorial 2026 — Turn Raw Data Into Real Insights

Every data analyst has a turning point. For most, it happens the first time they open a messy 50,000-row CSV in Excel, watch it grind slowly, and realize there has to be a better way.

That better way is Pandas.

Pandas is the Python library that turns raw, messy data into clean, actionable insights. It is the backbone of data science, used by data analysts at every company from startups to Fortune 500s. And it is surprisingly enjoyable to learn.

This tutorial teaches you Pandas with real datasets and real analysis techniques, from loading your first CSV to building a complete data analysis project.

Why Pandas?

Before writing code, here is what makes Pandas worth learning:

Speed: Handles millions of rows without breaking a sweat
Flexibility: Works with CSV, Excel, JSON, SQL databases, APIs
Power: Filter, group, pivot, merge, reshape data with single lines
Ecosystem: Connects seamlessly with NumPy, Matplotlib, scikit-learn, and Plotly
Automation: Run the same analysis on new data every week automatically

Setup

pip install pandas numpy matplotlib openpyxl

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print(pd.__version__)  # Should be 2.x+

The Two Core Data Structures

Series — A Single Column

# A Series is like a labeled array
sales = pd.Series([1200, 1800, 950, 2300, 1650],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri"])

print(sales)
print(f"Total: {sales.sum()}")
print(f"Average: {sales.mean():.0f}")
print(f"Best day: {sales.idxmax()} (${sales.max()})")

DataFrame — A Table

# A DataFrame is a table with labeled rows and columns
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"],
    "salary": [95000, 72000, 88000, 65000, 78000],
    "years": [5, 3, 7, 2, 4],
    "remote": [True, False, True, True, False],
}

df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")  # (5, 5)
print(f"Columns: {list(df.columns)}")

Loading Data

From CSV

# Load a CSV file
df = pd.read_csv("sales_data.csv")

# Load with options
df = pd.read_csv(
    "sales_data.csv",
    parse_dates=["order_date"],     # Parse date columns automatically
    dtype={"product_id": str},     # Force column types
    na_values=["N/A", "unknown"],  # Treat these as NaN
)

print(df.head())        # First 5 rows
print(df.tail(3))       # Last 3 rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # Data type of each column
print(df.info())        # Summary with memory usage
print(df.describe())    # Statistical summary for numeric columns

From Excel

df = pd.read_excel("report.xlsx", sheet_name="Sales Q1")

From a SQL Database

import sqlite3

conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM orders WHERE year = 2026", conn)
conn.close()

Selecting and Filtering Data

# Select columns
names = df["name"]               # Single column → Series
subset = df[["name", "salary"]]  # Multiple columns → DataFrame

# Select rows by position
first_row = df.iloc[0]
first_three = df.iloc[:3]
last_two = df.iloc[-2:]

# Select rows by label (index)
row = df.loc[0]

# Filter rows with conditions
engineers = df[df["department"] == "Engineering"]
high_earners = df[df["salary"] > 80000]
senior_engineers = df[(df["department"] == "Engineering") & (df["years"] > 4)]
remote_or_senior = df[(df["remote"] == True) | (df["years"] > 5)]

# Query method — cleaner syntax
result = df.query("salary > 80000 and department == 'Engineering'")

Data Cleaning

Real-world data is messy. Cleaning it is 80% of the job.

# Check for missing values
print(df.isnull().sum())
print(f"Missing %: {df.isnull().mean() * 100}")

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing values
df["age"].fillna(df["age"].median(), inplace=True)         # Fill with median
df["category"].fillna("Unknown", inplace=True)             # Fill with constant
df["sales"].fillna(method="ffill", inplace=True)          # Forward fill

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"])  # Based on specific column

# Fix data types
df["order_date"] = pd.to_datetime(df["order_date"])
df["price"] = df["price"].str.replace("$", "").astype(float)
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5)  # Pad to 5 digits

# Rename columns
df.rename(columns={"qty": "quantity", "amt": "amount"}, inplace=True)

# Strip whitespace from strings
df["name"] = df["name"].str.strip()
df["email"] = df["email"].str.lower().str.strip()

Transforming Data

# Add new columns
df["annual_bonus"] = df["salary"] * 0.10
df["total_comp"] = df["salary"] + df["annual_bonus"]
df["experience_level"] = df["years"].apply(
    lambda y: "Junior" if y < 3 else ("Mid" if y < 7 else "Senior")
)

# Apply a function to each row
def categorize_salary(row):
    if row["salary"] > 90000:
        return "High"
    elif row["salary"] > 70000:
        return "Medium"
    return "Low"

df["salary_band"] = df.apply(categorize_salary, axis=1)

# Map values
dept_codes = {"Engineering": "ENG", "Marketing": "MKT", "HR": "HR"}
df["dept_code"] = df["department"].map(dept_codes)

# String operations
df["first_name"] = df["name"].str.split(" ").str[0]
df["email_domain"] = df["email"].str.split("@").str[1]

Grouping and Aggregation

This is where Pandas truly shines — answering business questions instantly.

# Group by one column
dept_summary = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_summary)

# Multiple aggregations
summary = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    total_employees=("name", "count"),
    avg_years=("years", "mean"),
    max_salary=("salary", "max"),
)

# Group by multiple columns
pivot = df.groupby(["department", "remote"])["salary"].mean().unstack()

# Value counts — great for categorical analysis
print(df["department"].value_counts())
print(df["experience_level"].value_counts(normalize=True))  # Percentages

Merging DataFrames

# Join two DataFrames (like SQL JOIN)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 101, 103],
    "amount": [250, 180, 320, 95]
})

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"],
    "city": ["NYC", "LA", "Chicago"]
})

# Inner join (only matching rows)
merged = pd.merge(orders, customers, on="customer_id", how="inner")

# Left join (all orders, even without customer match)
merged_left = pd.merge(orders, customers, on="customer_id", how="left")

# Concatenate (stack DataFrames vertically)
q1 = pd.read_csv("q1_sales.csv")
q2 = pd.read_csv("q2_sales.csv")
full_year = pd.concat([q1, q2], ignore_index=True)

Working with Dates

df["order_date"] = pd.to_datetime(df["order_date"])

# Extract date components
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()
df["quarter"] = df["order_date"].dt.quarter

# Filter by date range
jan_orders = df[df["order_date"].dt.month == 1]
recent = df[df["order_date"] >= "2026-01-01"]

# Resample time series
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
weekly = df.set_index("order_date").resample("W")["amount"].mean()

Visualization with Matplotlib

import matplotlib.pyplot as plt

# Bar chart
dept_salary = df.groupby("department")["salary"].mean()
dept_salary.plot(kind="bar", color="#4f46e5", figsize=(10, 6))
plt.title("Average Salary by Department")
plt.ylabel("Salary ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("salary_chart.png", dpi=150)
plt.show()

# Line chart for time series
monthly_sales = df.set_index("order_date").resample("ME")["amount"].sum()
monthly_sales.plot(kind="line", marker="o", figsize=(12, 5))
plt.title("Monthly Sales 2026")
plt.ylabel("Revenue ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Histogram
df["salary"].plot(kind="hist", bins=20, color="#0ea5e9", edgecolor="white")
plt.title("Salary Distribution")
plt.xlabel("Salary ($)")
plt.show()

Complete Mini-Project: Sales Analysis

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("ecommerce_sales.csv", parse_dates=["order_date"])

# Clean
df = df.dropna(subset=["amount", "category"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df[df["amount"] > 0]

# Analyze
print("=== Sales Summary ===")
print(f"Total Revenue: ${df['amount'].sum():,.2f}")
print(f"Total Orders: {len(df):,}")
print(f"Average Order: ${df['amount'].mean():.2f}")

print("\n=== Top Categories ===")
print(df.groupby("category")["amount"].sum().sort_values(ascending=False).head(5))

print("\n=== Monthly Trend ===")
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
print(monthly)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
df.groupby("category")["amount"].sum().sort_values().plot(
    kind="barh", ax=axes[0], color="#4f46e5"
)
axes[0].set_title("Revenue by Category")

# Line chart
monthly.plot(kind="line", ax=axes[1], marker="o", color="#0ea5e9")
axes[1].set_title("Monthly Revenue")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("sales_analysis.png", dpi=150)
plt.show()

Where to Go Next

Once you can analyze data with Pandas, the natural next step is machine learning. Our Python for Machine Learning guide builds on everything here — you will use Pandas to prepare data for scikit-learn models.

For collecting your own data to analyze, check out the Python web scraping guide. And if you are still building your Python foundation, start with the Python beginners roadmap.

Pandas Cheat Sheet

Task	Code
Load CSV	`pd.read_csv("file.csv")`
First rows	`df.head()`
Shape	`df.shape`
Missing values	`df.isnull().sum()`
Filter rows	`df[df["col"] > 100]`
Group + aggregate	`df.groupby("col")["val"].mean()`
Sort	`df.sort_values("col", ascending=False)`
Add column	`df["new"] = df["a"] + df["b"]`
Merge DataFrames	`pd.merge(df1, df2, on="id")`
Save to CSV	`df.to_csv("output.csv", index=False)`

Pandas takes practice. The best approach: find a dataset you genuinely care about, load it, and start asking questions. Every analysis you do teaches you something new.

Get Pandas cheat sheets and practice datasets in the AiTechWorlds Telegram channel — free!