Follow AiTechWorlds on LinkedIn for professional AI content!Follow Now →

Python Pandas Tutorial 2026 — Data Analysis for Beginners to Pro

Master Python Pandas for data analysis. Learn to load, clean, transform, and visualize data with hands-on examples. From CSV files to actionable insights — complete 2026 guide.

A
AiTechWorlds Team
May 3, 2026 7 min readUpdated May 15, 2026
📱

Get more content like this on Telegram!

Daily AI tips, notes & resources — free

Join Free →

Python Pandas Tutorial 2026 — Turn Raw Data Into Real Insights

Every data analyst has a turning point. For most, it happens the first time they open a messy 50,000-row CSV in Excel, watch it grind slowly, and realize there has to be a better way.

That better way is Pandas.

Pandas is the Python library that turns raw, messy data into clean, actionable insights. It is the backbone of data science, used by data analysts at every company from startups to Fortune 500s. And it is surprisingly enjoyable to learn.

This tutorial teaches you Pandas with real datasets and real analysis techniques, from loading your first CSV to building a complete data analysis project.


Why Pandas?

Before writing code, here is what makes Pandas worth learning:

  • Speed: Handles millions of rows without breaking a sweat
  • Flexibility: Works with CSV, Excel, JSON, SQL databases, APIs
  • Power: Filter, group, pivot, merge, reshape data with single lines
  • Ecosystem: Connects seamlessly with NumPy, Matplotlib, scikit-learn, and Plotly
  • Automation: Run the same analysis on new data every week automatically

Setup

pip install pandas numpy matplotlib openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print(pd.__version__)  # Should be 2.x+

The Two Core Data Structures

Series — A Single Column

# A Series is like a labeled array
sales = pd.Series([1200, 1800, 950, 2300, 1650],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri"])

print(sales)
print(f"Total: {sales.sum()}")
print(f"Average: {sales.mean():.0f}")
print(f"Best day: {sales.idxmax()} (${sales.max()})")

DataFrame — A Table

# A DataFrame is a table with labeled rows and columns
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"],
    "salary": [95000, 72000, 88000, 65000, 78000],
    "years": [5, 3, 7, 2, 4],
    "remote": [True, False, True, True, False],
}

df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")  # (5, 5)
print(f"Columns: {list(df.columns)}")

Loading Data

From CSV

# Load a CSV file
df = pd.read_csv("sales_data.csv")

# Load with options
df = pd.read_csv(
    "sales_data.csv",
    parse_dates=["order_date"],     # Parse date columns automatically
    dtype={"product_id": str},     # Force column types
    na_values=["N/A", "unknown"],  # Treat these as NaN
)

print(df.head())        # First 5 rows
print(df.tail(3))       # Last 3 rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # Data type of each column
print(df.info())        # Summary with memory usage
print(df.describe())    # Statistical summary for numeric columns

From Excel

df = pd.read_excel("report.xlsx", sheet_name="Sales Q1")

From a SQL Database

import sqlite3

conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM orders WHERE year = 2026", conn)
conn.close()

Selecting and Filtering Data

# Select columns
names = df["name"]               # Single column → Series
subset = df[["name", "salary"]]  # Multiple columns → DataFrame

# Select rows by position
first_row = df.iloc[0]
first_three = df.iloc[:3]
last_two = df.iloc[-2:]

# Select rows by label (index)
row = df.loc[0]

# Filter rows with conditions
engineers = df[df["department"] == "Engineering"]
high_earners = df[df["salary"] > 80000]
senior_engineers = df[(df["department"] == "Engineering") & (df["years"] > 4)]
remote_or_senior = df[(df["remote"] == True) | (df["years"] > 5)]

# Query method — cleaner syntax
result = df.query("salary > 80000 and department == 'Engineering'")

Data Cleaning

Real-world data is messy. Cleaning it is 80% of the job.

# Check for missing values
print(df.isnull().sum())
print(f"Missing %: {df.isnull().mean() * 100}")

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing values
df["age"].fillna(df["age"].median(), inplace=True)         # Fill with median
df["category"].fillna("Unknown", inplace=True)             # Fill with constant
df["sales"].fillna(method="ffill", inplace=True)          # Forward fill

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"])  # Based on specific column

# Fix data types
df["order_date"] = pd.to_datetime(df["order_date"])
df["price"] = df["price"].str.replace("$", "").astype(float)
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5)  # Pad to 5 digits

# Rename columns
df.rename(columns={"qty": "quantity", "amt": "amount"}, inplace=True)

# Strip whitespace from strings
df["name"] = df["name"].str.strip()
df["email"] = df["email"].str.lower().str.strip()

Transforming Data

# Add new columns
df["annual_bonus"] = df["salary"] * 0.10
df["total_comp"] = df["salary"] + df["annual_bonus"]
df["experience_level"] = df["years"].apply(
    lambda y: "Junior" if y < 3 else ("Mid" if y < 7 else "Senior")
)

# Apply a function to each row
def categorize_salary(row):
    if row["salary"] > 90000:
        return "High"
    elif row["salary"] > 70000:
        return "Medium"
    return "Low"

df["salary_band"] = df.apply(categorize_salary, axis=1)

# Map values
dept_codes = {"Engineering": "ENG", "Marketing": "MKT", "HR": "HR"}
df["dept_code"] = df["department"].map(dept_codes)

# String operations
df["first_name"] = df["name"].str.split(" ").str[0]
df["email_domain"] = df["email"].str.split("@").str[1]

Grouping and Aggregation

This is where Pandas truly shines — answering business questions instantly.

# Group by one column
dept_summary = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_summary)

# Multiple aggregations
summary = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    total_employees=("name", "count"),
    avg_years=("years", "mean"),
    max_salary=("salary", "max"),
)

# Group by multiple columns
pivot = df.groupby(["department", "remote"])["salary"].mean().unstack()

# Value counts — great for categorical analysis
print(df["department"].value_counts())
print(df["experience_level"].value_counts(normalize=True))  # Percentages

Merging DataFrames

# Join two DataFrames (like SQL JOIN)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 101, 103],
    "amount": [250, 180, 320, 95]
})

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", "Bob", "Charlie"],
    "city": ["NYC", "LA", "Chicago"]
})

# Inner join (only matching rows)
merged = pd.merge(orders, customers, on="customer_id", how="inner")

# Left join (all orders, even without customer match)
merged_left = pd.merge(orders, customers, on="customer_id", how="left")

# Concatenate (stack DataFrames vertically)
q1 = pd.read_csv("q1_sales.csv")
q2 = pd.read_csv("q2_sales.csv")
full_year = pd.concat([q1, q2], ignore_index=True)

Working with Dates

df["order_date"] = pd.to_datetime(df["order_date"])

# Extract date components
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()
df["quarter"] = df["order_date"].dt.quarter

# Filter by date range
jan_orders = df[df["order_date"].dt.month == 1]
recent = df[df["order_date"] >= "2026-01-01"]

# Resample time series
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
weekly = df.set_index("order_date").resample("W")["amount"].mean()

Visualization with Matplotlib

import matplotlib.pyplot as plt

# Bar chart
dept_salary = df.groupby("department")["salary"].mean()
dept_salary.plot(kind="bar", color="#4f46e5", figsize=(10, 6))
plt.title("Average Salary by Department")
plt.ylabel("Salary ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("salary_chart.png", dpi=150)
plt.show()

# Line chart for time series
monthly_sales = df.set_index("order_date").resample("ME")["amount"].sum()
monthly_sales.plot(kind="line", marker="o", figsize=(12, 5))
plt.title("Monthly Sales 2026")
plt.ylabel("Revenue ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Histogram
df["salary"].plot(kind="hist", bins=20, color="#0ea5e9", edgecolor="white")
plt.title("Salary Distribution")
plt.xlabel("Salary ($)")
plt.show()

Complete Mini-Project: Sales Analysis

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("ecommerce_sales.csv", parse_dates=["order_date"])

# Clean
df = df.dropna(subset=["amount", "category"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df[df["amount"] > 0]

# Analyze
print("=== Sales Summary ===")
print(f"Total Revenue: ${df['amount'].sum():,.2f}")
print(f"Total Orders: {len(df):,}")
print(f"Average Order: ${df['amount'].mean():.2f}")

print("\n=== Top Categories ===")
print(df.groupby("category")["amount"].sum().sort_values(ascending=False).head(5))

print("\n=== Monthly Trend ===")
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
print(monthly)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
df.groupby("category")["amount"].sum().sort_values().plot(
    kind="barh", ax=axes[0], color="#4f46e5"
)
axes[0].set_title("Revenue by Category")

# Line chart
monthly.plot(kind="line", ax=axes[1], marker="o", color="#0ea5e9")
axes[1].set_title("Monthly Revenue")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("sales_analysis.png", dpi=150)
plt.show()

Where to Go Next

Once you can analyze data with Pandas, the natural next step is machine learning. Our Python for Machine Learning guide builds on everything here — you will use Pandas to prepare data for scikit-learn models.

For collecting your own data to analyze, check out the Python web scraping guide. And if you are still building your Python foundation, start with the Python beginners roadmap.


Pandas Cheat Sheet

TaskCode
Load CSVpd.read_csv("file.csv")
First rowsdf.head()
Shapedf.shape
Missing valuesdf.isnull().sum()
Filter rowsdf[df["col"] > 100]
Group + aggregatedf.groupby("col")["val"].mean()
Sortdf.sort_values("col", ascending=False)
Add columndf["new"] = df["a"] + df["b"]
Merge DataFramespd.merge(df1, df2, on="id")
Save to CSVdf.to_csv("output.csv", index=False)

Pandas takes practice. The best approach: find a dataset you genuinely care about, load it, and start asking questions. Every analysis you do teaches you something new.

Get Pandas cheat sheets and practice datasets in the AiTechWorlds Telegram channel — free!

Share this article:

Frequently Asked Questions

Basic math helps (mean, median, percentage) but you don't need advanced math. Pandas handles the calculations — you just need to know what to ask for.
A

AiTechWorlds Team

✓ Verified Writer

The AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.

Related Articles

10K+ Members Growing Daily

Get Free AI Notes Daily

Join AiTechWorlds on Telegram and get daily AI tips, prompt engineering templates, coding resources, and exclusive content — 100% free!

📚 Free Study Notes🤖 AI Tips Daily⚡ Prompt Templates💻 Coding Resources
Join Free Channel

No spam. Leave anytime.

!