Python Pandas Tutorial 2026 — Data Analysis for Beginners to Pro
Master Python Pandas for data analysis. Learn to load, clean, transform, and visualize data with hands-on examples. From CSV files to actionable insights — complete 2026 guide.
Get more content like this on Telegram!
Daily AI tips, notes & resources — free
Python Pandas Tutorial 2026 — Turn Raw Data Into Real Insights
Every data analyst has a turning point. For most, it happens the first time they open a messy 50,000-row CSV in Excel, watch it grind slowly, and realize there has to be a better way.
That better way is Pandas.
Pandas is the Python library that turns raw, messy data into clean, actionable insights. It is the backbone of data science, used by data analysts at every company from startups to Fortune 500s. And it is surprisingly enjoyable to learn.
This tutorial teaches you Pandas with real datasets and real analysis techniques, from loading your first CSV to building a complete data analysis project.
Why Pandas?
Before writing code, here is what makes Pandas worth learning:
- Speed: Handles millions of rows without breaking a sweat
- Flexibility: Works with CSV, Excel, JSON, SQL databases, APIs
- Power: Filter, group, pivot, merge, reshape data with single lines
- Ecosystem: Connects seamlessly with NumPy, Matplotlib, scikit-learn, and Plotly
- Automation: Run the same analysis on new data every week automatically
Setup
pip install pandas numpy matplotlib openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print(pd.__version__) # Should be 2.x+
The Two Core Data Structures
Series — A Single Column
# A Series is like a labeled array
sales = pd.Series([1200, 1800, 950, 2300, 1650],
index=["Mon", "Tue", "Wed", "Thu", "Fri"])
print(sales)
print(f"Total: {sales.sum()}")
print(f"Average: {sales.mean():.0f}")
print(f"Best day: {sales.idxmax()} (${sales.max()})")
DataFrame — A Table
# A DataFrame is a table with labeled rows and columns
data = {
"name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"],
"salary": [95000, 72000, 88000, 65000, 78000],
"years": [5, 3, 7, 2, 4],
"remote": [True, False, True, True, False],
}
df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}") # (5, 5)
print(f"Columns: {list(df.columns)}")
Loading Data
From CSV
# Load a CSV file
df = pd.read_csv("sales_data.csv")
# Load with options
df = pd.read_csv(
"sales_data.csv",
parse_dates=["order_date"], # Parse date columns automatically
dtype={"product_id": str}, # Force column types
na_values=["N/A", "unknown"], # Treat these as NaN
)
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (rows, columns)
print(df.dtypes) # Data type of each column
print(df.info()) # Summary with memory usage
print(df.describe()) # Statistical summary for numeric columns
From Excel
df = pd.read_excel("report.xlsx", sheet_name="Sales Q1")
From a SQL Database
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM orders WHERE year = 2026", conn)
conn.close()
Selecting and Filtering Data
# Select columns
names = df["name"] # Single column → Series
subset = df[["name", "salary"]] # Multiple columns → DataFrame
# Select rows by position
first_row = df.iloc[0]
first_three = df.iloc[:3]
last_two = df.iloc[-2:]
# Select rows by label (index)
row = df.loc[0]
# Filter rows with conditions
engineers = df[df["department"] == "Engineering"]
high_earners = df[df["salary"] > 80000]
senior_engineers = df[(df["department"] == "Engineering") & (df["years"] > 4)]
remote_or_senior = df[(df["remote"] == True) | (df["years"] > 5)]
# Query method — cleaner syntax
result = df.query("salary > 80000 and department == 'Engineering'")
Data Cleaning
Real-world data is messy. Cleaning it is 80% of the job.
# Check for missing values
print(df.isnull().sum())
print(f"Missing %: {df.isnull().mean() * 100}")
# Drop rows with any missing value
df_clean = df.dropna()
# Fill missing values
df["age"].fillna(df["age"].median(), inplace=True) # Fill with median
df["category"].fillna("Unknown", inplace=True) # Fill with constant
df["sales"].fillna(method="ffill", inplace=True) # Forward fill
# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["email"]) # Based on specific column
# Fix data types
df["order_date"] = pd.to_datetime(df["order_date"])
df["price"] = df["price"].str.replace("$", "").astype(float)
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5) # Pad to 5 digits
# Rename columns
df.rename(columns={"qty": "quantity", "amt": "amount"}, inplace=True)
# Strip whitespace from strings
df["name"] = df["name"].str.strip()
df["email"] = df["email"].str.lower().str.strip()
Transforming Data
# Add new columns
df["annual_bonus"] = df["salary"] * 0.10
df["total_comp"] = df["salary"] + df["annual_bonus"]
df["experience_level"] = df["years"].apply(
lambda y: "Junior" if y < 3 else ("Mid" if y < 7 else "Senior")
)
# Apply a function to each row
def categorize_salary(row):
if row["salary"] > 90000:
return "High"
elif row["salary"] > 70000:
return "Medium"
return "Low"
df["salary_band"] = df.apply(categorize_salary, axis=1)
# Map values
dept_codes = {"Engineering": "ENG", "Marketing": "MKT", "HR": "HR"}
df["dept_code"] = df["department"].map(dept_codes)
# String operations
df["first_name"] = df["name"].str.split(" ").str[0]
df["email_domain"] = df["email"].str.split("@").str[1]
Grouping and Aggregation
This is where Pandas truly shines — answering business questions instantly.
# Group by one column
dept_summary = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_summary)
# Multiple aggregations
summary = df.groupby("department").agg(
avg_salary=("salary", "mean"),
total_employees=("name", "count"),
avg_years=("years", "mean"),
max_salary=("salary", "max"),
)
# Group by multiple columns
pivot = df.groupby(["department", "remote"])["salary"].mean().unstack()
# Value counts — great for categorical analysis
print(df["department"].value_counts())
print(df["experience_level"].value_counts(normalize=True)) # Percentages
Merging DataFrames
# Join two DataFrames (like SQL JOIN)
orders = pd.DataFrame({
"order_id": [1, 2, 3, 4],
"customer_id": [101, 102, 101, 103],
"amount": [250, 180, 320, 95]
})
customers = pd.DataFrame({
"customer_id": [101, 102, 103],
"name": ["Alice", "Bob", "Charlie"],
"city": ["NYC", "LA", "Chicago"]
})
# Inner join (only matching rows)
merged = pd.merge(orders, customers, on="customer_id", how="inner")
# Left join (all orders, even without customer match)
merged_left = pd.merge(orders, customers, on="customer_id", how="left")
# Concatenate (stack DataFrames vertically)
q1 = pd.read_csv("q1_sales.csv")
q2 = pd.read_csv("q2_sales.csv")
full_year = pd.concat([q1, q2], ignore_index=True)
Working with Dates
df["order_date"] = pd.to_datetime(df["order_date"])
# Extract date components
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.day_name()
df["quarter"] = df["order_date"].dt.quarter
# Filter by date range
jan_orders = df[df["order_date"].dt.month == 1]
recent = df[df["order_date"] >= "2026-01-01"]
# Resample time series
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
weekly = df.set_index("order_date").resample("W")["amount"].mean()
Visualization with Matplotlib
import matplotlib.pyplot as plt
# Bar chart
dept_salary = df.groupby("department")["salary"].mean()
dept_salary.plot(kind="bar", color="#4f46e5", figsize=(10, 6))
plt.title("Average Salary by Department")
plt.ylabel("Salary ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("salary_chart.png", dpi=150)
plt.show()
# Line chart for time series
monthly_sales = df.set_index("order_date").resample("ME")["amount"].sum()
monthly_sales.plot(kind="line", marker="o", figsize=(12, 5))
plt.title("Monthly Sales 2026")
plt.ylabel("Revenue ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Histogram
df["salary"].plot(kind="hist", bins=20, color="#0ea5e9", edgecolor="white")
plt.title("Salary Distribution")
plt.xlabel("Salary ($)")
plt.show()
Complete Mini-Project: Sales Analysis
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv("ecommerce_sales.csv", parse_dates=["order_date"])
# Clean
df = df.dropna(subset=["amount", "category"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df[df["amount"] > 0]
# Analyze
print("=== Sales Summary ===")
print(f"Total Revenue: ${df['amount'].sum():,.2f}")
print(f"Total Orders: {len(df):,}")
print(f"Average Order: ${df['amount'].mean():.2f}")
print("\n=== Top Categories ===")
print(df.groupby("category")["amount"].sum().sort_values(ascending=False).head(5))
print("\n=== Monthly Trend ===")
monthly = df.set_index("order_date").resample("ME")["amount"].sum()
print(monthly)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar chart
df.groupby("category")["amount"].sum().sort_values().plot(
kind="barh", ax=axes[0], color="#4f46e5"
)
axes[0].set_title("Revenue by Category")
# Line chart
monthly.plot(kind="line", ax=axes[1], marker="o", color="#0ea5e9")
axes[1].set_title("Monthly Revenue")
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("sales_analysis.png", dpi=150)
plt.show()
Where to Go Next
Once you can analyze data with Pandas, the natural next step is machine learning. Our Python for Machine Learning guide builds on everything here — you will use Pandas to prepare data for scikit-learn models.
For collecting your own data to analyze, check out the Python web scraping guide. And if you are still building your Python foundation, start with the Python beginners roadmap.
Pandas Cheat Sheet
| Task | Code |
|---|---|
| Load CSV | pd.read_csv("file.csv") |
| First rows | df.head() |
| Shape | df.shape |
| Missing values | df.isnull().sum() |
| Filter rows | df[df["col"] > 100] |
| Group + aggregate | df.groupby("col")["val"].mean() |
| Sort | df.sort_values("col", ascending=False) |
| Add column | df["new"] = df["a"] + df["b"] |
| Merge DataFrames | pd.merge(df1, df2, on="id") |
| Save to CSV | df.to_csv("output.csv", index=False) |
Pandas takes practice. The best approach: find a dataset you genuinely care about, load it, and start asking questions. Every analysis you do teaches you something new.
Get Pandas cheat sheets and practice datasets in the AiTechWorlds Telegram channel — free!
Frequently Asked Questions
AiTechWorlds Team
✓ Verified WriterThe AiTechWorlds team is passionate about AI, technology, and education. We create high-quality, research-backed content to help you learn, grow, and succeed in the modern digital world.
Related Articles
Python Async Programming Guide 2026 — asyncio, aiohttp & Concurrency
Master async programming in Python with asyncio. Learn concurrent programming, aiohttp for async HTTP, async database operations, and build high-performance Python applications.
Python OOP Complete Guide 2026 — Object-Oriented Programming Mastery
Master Python object-oriented programming from basics to advanced. Classes, inheritance, polymorphism, SOLID principles, dataclasses — everything you need to write professional Python.
Python Error Handling & Debugging 2026 — Write Bulletproof Code
Master Python error handling and debugging techniques. Learn try/except, custom exceptions, logging, pdb, and professional debugging strategies to write robust Python code.
Python Decorators and Generators — Advanced Python Made Simple 2026
Master Python decorators and generators — two of Python's most powerful features. Clear explanations, real-world examples, and practical patterns you'll actually use.