NumPy & Pandas for ML | Machine Learning Fundamentals | AiTechWorlds

NumPy & Pandas for Machine Learning

Every machine learning project starts with data, and in Python that usually means NumPy and Pandas. These two libraries underpin most of the ML ecosystem: scikit-learn accepts NumPy arrays and Pandas DataFrames directly, while TensorFlow and PyTorch use their own tensor types that convert to and from NumPy arrays.

This lesson teaches you exactly what you need to know — no more, no less.

NumPy: Fast Arrays for ML

NumPy's core is the ndarray — a fast, memory-efficient multi-dimensional array.

import numpy as np

# Create arrays
arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2d.shape)    # (2, 3) — 2 rows, 3 columns
print(arr2d.dtype)    # int64 on most platforms (int32 on Windows before NumPy 2.0)
print(arr2d.ndim)     # 2 — two-dimensional

Why NumPy Arrays Beat Python Lists

import time

# Python list: slow (one interpreted multiplication per element)
py_list = list(range(1_000_000))
start = time.perf_counter()
result = [x * 2 for x in py_list]
print(f"List: {time.perf_counter() - start:.3f}s")  # ~0.08s (varies by machine)

# NumPy array: fast (vectorized C operations)
np_arr = np.arange(1_000_000)
start = time.perf_counter()
result = np_arr * 2
print(f"NumPy: {time.perf_counter() - start:.3f}s")  # ~0.001s (varies by machine)

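In this example NumPy is roughly 50-100x faster because the multiplication runs as a single vectorized operation in compiled C code instead of a Python-level loop. Exact timings vary by machine.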
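
The same idea extends to whole feature matrices. As a minimal sketch (the values and the names X_demo and X_scaled are invented for illustration), standardizing every column with broadcasting replaces a nested loop with one expression:

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features (illustrative values)
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0],
                   [4.0, 500.0]])

# Column-wise mean and standard deviation, each of shape (2,)
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)

# Broadcasting stretches the (2,) vectors across all 4 rows
X_scaled = (X_demo - col_mean) / col_std

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

After scaling, both columns are on the same footing even though the raw values differ by two orders of magnitude.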

Essential NumPy Operations for ML

data = np.array([2.1, 3.5, 1.8, 4.2, 3.9, 2.7, 5.1])

# Statistics
print(np.mean(data))    # ≈ 3.329
print(np.std(data))     # ≈ 1.102 (population std, ddof=0)
print(np.min(data))     # 1.8
print(np.max(data))     # 5.1
print(np.median(data))  # 3.5

# Reshaping — critical for ML model inputs
X = np.arange(12)
X_reshaped = X.reshape(4, 3)  # 4 samples, 3 features
print(X_reshaped.shape)        # (4, 3)

# Transpose — rows become columns
print(X_reshaped.T.shape)      # (3, 4)

# Array slicing
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[0, :])     # First row: [1, 2, 3]
print(arr[:, 1])     # Second column: [2, 5, 8]
print(arr[1:, 1:])   # Bottom-right 2x2
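
One reshape comes up constantly: scikit-learn estimators expect a 2-D feature matrix, so a single feature stored as a 1-D array has to become a column first. A small sketch with made-up values (the names sizes and sizes_2d are illustrative), where -1 asks NumPy to infer that dimension:

```python
import numpy as np

sizes = np.array([850, 1200, 1500, 2100])   # 1-D, shape (4,)

# -1 lets NumPy infer the row count; 1 means a single feature column
sizes_2d = sizes.reshape(-1, 1)
print(sizes_2d.shape)    # (4, 1)

# Boolean mask: keep only the values above a threshold
large_sizes = sizes[sizes > 1000]
print(large_sizes)       # [1200 1500 2100]
```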

Creating Common Arrays

np.zeros((3, 4))          # All zeros, shape (3, 4)
np.ones((2, 5))           # All ones
np.eye(4)                 # 4x4 identity matrix
np.random.seed(42)
np.random.randn(100, 5)   # 100 samples, 5 features, standard normal distribution
np.linspace(0, 1, 100)    # 100 evenly spaced values from 0 to 1
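
Newer NumPy code often uses a local Generator object instead of the global seed. A short sketch, assuming NumPy 1.17 or later (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)           # seeded generator, no global state
X_rand = rng.standard_normal((100, 5))    # 100 samples, 5 features

# The same seed reproduces the same numbers
X_again = np.random.default_rng(42).standard_normal((100, 5))
print(np.array_equal(X_rand, X_again))    # True
```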

Pandas: DataFrames for Real-World Data

Real ML data comes in CSV files, databases, and APIs — messy, mixed types, with missing values. Pandas handles all of this.

import pandas as pd

# Load a dataset
df = pd.read_csv('housing.csv')

# First look
print(df.shape)        # (1000, 8) — 1000 rows, 8 columns
print(df.head())       # First 5 rows
print(df.tail(3))      # Last 3 rows
df.info()              # Column names, types, non-null counts (prints directly, returns None)
print(df.describe())   # Statistical summary for numeric columns
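
The examples here assume a housing.csv file that is not included with the lesson. To follow along without it, a small stand-in can be built in memory. The column names match the ones used in this lesson, but every value below is invented for illustration, and df_demo is a hypothetical name:

```python
import io
import pandas as pd

# Invented sample rows; a blank field becomes NaN when parsed
csv_text = """size,bedrooms,bathrooms,year_built,city,garage,price
1400,3,2,1995,NYC,Attached,520000
2100,4,3,2008,Boston,,610000
900,,1,1978,NYC,Detached,350000
2500,4,3,2015,Austin,Attached,480000
"""

# read_csv accepts any file-like object, not only a path
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)            # (4, 7)
print(df_demo.isnull().sum())   # bedrooms: 1, garage: 1, all others: 0
```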

Selecting Data

# Single column → Series
prices = df['price']

# Multiple columns → DataFrame
subset = df[['size', 'bedrooms', 'price']]

# Filter rows
expensive = df[df['price'] > 500000]
large = df[(df['size'] > 2000) & (df['bedrooms'] >= 3)]

# Select by position
df.iloc[0]           # First row
df.iloc[0:5, 0:3]   # First 5 rows, first 3 columns

# Select by label
df.loc[df['city'] == 'NYC', ['size', 'price']]

Handling Missing Values

# Check missing
print(df.isnull().sum())        # Count nulls per column
print(df.isnull().sum() / len(df) * 100)  # As percentages

# Drop rows with any missing value
df_clean = df.dropna()

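# Fill missing values (assign the result back; calling fillna with
# inplace=True on a selected column is deprecated in recent pandas)
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
df['garage'] = df['garage'].fillna('None')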

# Drop columns with more than 40% missing values
# (thresh is the minimum number of non-null values a column needs to be kept)
threshold = int(len(df) * 0.6)
df_clean = df.dropna(thresh=threshold, axis=1)
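
The thresh argument is easy to misread: it is the minimum number of non-null values a column needs in order to be kept, not the number of nulls allowed. A toy sketch with invented data (the column names a, b, c are placeholders):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],              # 0% missing
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],        # 40% missing
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],     # 60% missing
})

# Keep columns with at least 60% non-null values (3 of 5 rows)
min_non_null = int(len(toy) * 0.6)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))   # ['a', 'b']

# fillna returns a new Series; median() skips NaN by default
b_filled = kept["b"].fillna(kept["b"].median())
print(b_filled.tolist())    # [1.0, 3.0, 3.0, 3.0, 5.0]
```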

Feature Engineering with Pandas

# Create new features
df['price_per_sqft'] = df['price'] / df['size']
df['age'] = 2026 - df['year_built']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']

# Encode categorical variables
df['has_garage'] = (df['garage'] != 'None').astype(int)
df['city_encoded'] = pd.factorize(df['city'])[0]

# One-hot encoding
df = pd.get_dummies(df, columns=['city'], drop_first=True)
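
To see what the two encodings produce, here is a toy sketch with an invented city column. pd.factorize assigns integer codes in order of first appearance, while pd.get_dummies with drop_first=True creates one indicator column per category except the first in sorted order:

```python
import pandas as pd

cities = pd.DataFrame({"city": ["NYC", "Boston", "NYC", "Austin"]})

# Integer codes in order of first appearance
codes, uniques = pd.factorize(cities["city"])
print(codes.tolist())     # [0, 1, 0, 2]
print(list(uniques))      # ['NYC', 'Boston', 'Austin']

# One-hot columns; 'Austin' (first alphabetically) is dropped
dummies = pd.get_dummies(cities, columns=["city"], drop_first=True)
print(list(dummies.columns))   # ['city_Boston', 'city_NYC']
```

Integer codes imply an ordering that a linear model will treat as meaningful, so one-hot encoding is usually the safer choice for unordered categories.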

From Pandas to NumPy (The Bridge to Scikit-Learn)

from sklearn.model_selection import train_test_split

# Select features and target
features = ['size', 'bedrooms', 'bathrooms', 'age']
X = df[features].to_numpy()   # .to_numpy() converts to a NumPy array
y = df['price'].to_numpy()

print(type(X))   # <class 'numpy.ndarray'>
print(X.shape)   # (1000, 4)

# Split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (800, 4)
print(X_test.shape)   # (200, 4)
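
Conceptually, a random split shuffles the row indices and then cuts them in two. The following NumPy-only sketch illustrates that idea with invented data. It is not scikit-learn's actual implementation, and all names in it are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
X_all = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_all = np.arange(10)                  # one target per sample

idx = rng.permutation(len(X_all))      # shuffled row indices
n_test = int(len(X_all) * 0.2)         # 20% held out

test_idx, train_idx = idx[:n_test], idx[n_test:]

# Index X and y with the same arrays so rows stay aligned
X_tr, X_te = X_all[train_idx], X_all[test_idx]
y_tr, y_te = y_all[train_idx], y_all[test_idx]

print(X_tr.shape, X_te.shape)          # (8, 2) (2, 2)
```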

Mini Project: Real Data Pipeline

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load data
df = pd.read_csv('housing.csv')

# Clean
df.dropna(inplace=True)
df['age'] = 2026 - df['year_built']

# Prepare features
X = df[['size', 'bedrooms', 'age']].values
y = df['price'].values

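# Split: positional 80/20 cut with no shuffling
# (if the rows are sorted in any way, use train_test_split instead)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]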

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
print(f"R² Score: {r2_score(y_test, model.predict(X_test)):.3f}")
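
For reference, the R² value reported by r2_score for a single target is 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean of the true values. A NumPy sketch with invented targets and predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # invented targets
y_pred = np.array([2.5, 5.5, 7.0, 9.5])   # invented predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"{r2:.4f}")   # 0.9625
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean.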

Key Takeaways

NumPy gives you fast vectorized operations: prefer array operations over Python loops
Pandas handles messy real-world data: missing values, mixed types, filtering
The typical pipeline is: CSV → DataFrame (Pandas) → NumPy arrays → ML model
Master .shape, .dtype, .describe(), .isnull(), .fillna(), and indexing, since you'll use them in every project

Next lesson: Exploratory Data Analysis — understanding your data before modeling.