NumPy & Pandas for Machine Learning
Every machine learning project starts with data, and in Python that usually means NumPy and Pandas. These two libraries underpin much of the ML ecosystem: scikit-learn works directly with NumPy arrays and Pandas DataFrames, while TensorFlow and PyTorch can convert NumPy arrays into their own tensor types.
This lesson teaches you exactly what you need to know — no more, no less.
NumPy: Fast Arrays for ML
NumPy's core is the ndarray — a fast, memory-efficient multi-dimensional array.
import numpy as np
# Create arrays
arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d.shape) # (2, 3) — 2 rows, 3 columns
print(arr2d.dtype) # int64 on most platforms (int32 on Windows with NumPy 1.x)
print(arr2d.ndim) # 2 — two-dimensional
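To make the "memory-efficient" claim concrete, here is a minimal sketch that compares how many bytes the same number of values occupies under two dtypes. The array size and the choice of float32 are illustrative assumptions, not a recommendation from this lesson.

```python
import numpy as np

# One million values stored as 64-bit floats (NumPy's default float type)
a64 = np.zeros(1_000_000, dtype=np.float64)
# The same values converted to 32-bit floats
a32 = a64.astype(np.float32)

print(a64.itemsize, a64.nbytes)  # 8 bytes per element, 8000000 bytes in total
print(a32.itemsize, a32.nbytes)  # 4 bytes per element, 4000000 bytes in total
```

Halving the element size halves the memory footprint, at the cost of reduced precision, so whether float32 is acceptable depends on the model and the data.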
Why NumPy Arrays Beat Python Lists
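import time
# Python list: slow (interpreted loop over a million elements)
py_list = list(range(1_000_000))
start = time.perf_counter() # perf_counter is the clock intended for timing short code
result = [x * 2 for x in py_list]
print(f"List: {time.perf_counter() - start:.3f}s") # ~0.08s (varies by machine)
# NumPy array: fast (vectorized C operations)
np_arr = np.arange(1_000_000)
start = time.perf_counter()
result = np_arr * 2
print(f"NumPy: {time.perf_counter() - start:.3f}s") # ~0.001s (varies by machine)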
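In this example NumPy is roughly 50-100x faster, because the multiplication runs in compiled C code over a contiguous block of memory instead of a Python-level loop. Exact timings depend on your machine and on the array size.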
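The same idea extends beyond simple arithmetic. As a minimal sketch, here is feature standardization written once with a loop and once with broadcasting. The matrix values are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 3.0],
              [2500.0, 4.0]])

# Loop version: standardize one column at a time
X_loop = np.empty_like(X)
for j in range(X.shape[1]):
    col = X[:, j]
    X_loop[:, j] = (col - col.mean()) / col.std()

# Vectorized version: broadcasting applies the per-column stats to every row
X_vec = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_loop, X_vec))  # True
```

Both versions produce the same result. The vectorized one is shorter and pushes the work into compiled code.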
Essential NumPy Operations for ML
data = np.array([2.1, 3.5, 1.8, 4.2, 3.9, 2.7, 5.1])
# Statistics
print(np.mean(data)) # ~3.329
print(np.std(data)) # ~1.102 (population standard deviation, ddof=0)
print(np.min(data)) # 1.8
print(np.max(data)) # 5.1
print(np.median(data)) # 3.5
# Reshaping — critical for ML model inputs
X = np.arange(12)
X_reshaped = X.reshape(4, 3) # 4 samples, 3 features
print(X_reshaped.shape) # (4, 3)
# Transpose — rows become columns
print(X_reshaped.T.shape) # (3, 4)
# Array slicing
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[0, :]) # First row: [1, 2, 3]
print(arr[:, 1]) # Second column: [2, 5, 8]
print(arr[1:, 1:]) # Bottom-right 2x2
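Two details of these operations come up constantly in ML code: the axis argument, which controls whether a statistic is computed per feature or per sample, and reshape(-1, 1), which turns a single feature into the 2D layout many estimators expect. A short sketch with made-up numbers:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)  # 4 samples, 3 features

col_means = X.mean(axis=0)  # collapse the rows: one mean per feature
row_sums = X.sum(axis=1)    # collapse the columns: one sum per sample

print(col_means)  # [4.5 5.5 6.5]
print(row_sums)   # [ 3 12 21 30]

# A 1D array holding one feature, reshaped to (n_samples, 1);
# -1 tells NumPy to infer that dimension from the array length
x = np.array([10, 20, 30])
x_col = x.reshape(-1, 1)
print(x_col.shape)  # (3, 1)
```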
Creating Common Arrays
np.zeros((3, 4)) # All zeros, shape (3, 4)
np.ones((2, 5)) # All ones
np.eye(4) # 4x4 identity matrix
np.random.seed(42)
np.random.randn(100, 5) # 100 samples, 5 features, normal distribution
np.linspace(0, 1, 100) # 100 evenly spaced values from 0 to 1
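np.random.seed sets a global seed that every caller shares. NumPy's newer Generator API keeps the random state in a local object instead, which tends to be easier to reason about. A minimal sketch, reusing the seed value 42 from above:

```python
import numpy as np

rng = np.random.default_rng(42)    # local, seeded generator
X = rng.standard_normal((100, 5))  # 100 samples, 5 features, normal distribution
idx = rng.permutation(100)         # shuffled row indices 0..99

# A fresh generator with the same seed reproduces the same first draw
X_again = np.random.default_rng(42).standard_normal((100, 5))
print(np.array_equal(X, X_again))  # True
```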
Pandas: DataFrames for Real-World Data
Real ML data comes in CSV files, databases, and APIs — messy, mixed types, with missing values. Pandas handles all of this.
import pandas as pd
# Load a dataset
df = pd.read_csv('housing.csv')
# First look
print(df.shape) # (1000, 8) — 1000 rows, 8 columns
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
df.info() # Column names, types, non-null counts (prints directly; returns None)
print(df.describe()) # Statistical summary for numeric columns
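If you do not have a housing.csv at hand, you can follow along with a small in-memory DataFrame. The sketch below is a hypothetical stand-in: the column names mirror the ones used in this lesson, and every value is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for housing.csv; values are invented for practice
df = pd.DataFrame({
    "size": [1400, 2100, 1750, 3000, 2400],
    "bedrooms": [3, 4, np.nan, 5, 3],
    "bathrooms": [2, 3, 2, 4, 2],
    "year_built": [1995, 2008, 1987, 2015, 2001],
    "city": ["NYC", "Boston", "NYC", "Austin", "Boston"],
    "garage": ["Attached", None, "Detached", "Attached", None],
    "price": [450_000, 620_000, 510_000, 890_000, 575_000],
})

print(df.shape)   # (5, 7)
print(df.dtypes)  # bedrooms is float64 because it contains a missing value
df.info()         # prints its report directly
```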
Selecting Data
# Single column → Series
prices = df['price']
# Multiple columns → DataFrame
subset = df[['size', 'bedrooms', 'price']]
# Filter rows
expensive = df[df['price'] > 500000]
large = df[(df['size'] > 2000) & (df['bedrooms'] >= 3)]
# Select by position
df.iloc[0] # First row
df.iloc[0:5, 0:3] # First 5 rows, first 3 columns
# Select by label
df.loc[df['city'] == 'NYC', ['size', 'price']]
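One common stumbling block is that filtering keeps the original index labels, so iloc (position) and loc (label) can point at different rows afterwards. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1400, 2100, 1750, 3000],
    "price": [450_000, 620_000, 510_000, 890_000],
})

large = df[df["size"] > 1500]  # keeps the original index labels 1, 2, 3

print(large.iloc[0]["size"])   # 2100: first row by position
print(large.loc[1, "size"])    # 2100: the row whose index label is 1

# reset_index(drop=True) renumbers rows from 0 so labels match positions again
large = large.reset_index(drop=True)
print(large.loc[0, "size"])    # 2100
```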
Handling Missing Values
# Check missing
print(df.isnull().sum()) # Count nulls per column
print(df.isnull().sum() / len(df) * 100) # As percentages
# Drop rows with any missing value
df_clean = df.dropna()
# Fill missing values
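# Assign the result back; calling fillna(..., inplace=True) on a selected column
# triggers chained-assignment warnings in recent pandas and may not update df
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
df['garage'] = df['garage'].fillna('None')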
# Drop columns with too many missing values (>40% missing)
threshold = int(len(df) * 0.6) # thresh = minimum number of non-null values to keep a column
df_clean = df.dropna(thresh=threshold, axis=1)
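The thresh argument is easy to misread: it is the minimum count of non-null values a column needs in order to be kept, not the number of nulls allowed. A worked sketch on a tiny made-up DataFrame (the pool column is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "size": [1400, 2100, 1750, 3000, 2400],         # 0% missing
    "bedrooms": [3, np.nan, 4, 5, 3],                 # 20% missing
    "pool": [np.nan, np.nan, np.nan, 1.0, np.nan],    # 80% missing
})

missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Keep columns with at least 60% non-null values (3 of the 5 rows here)
min_non_null = int(len(df) * 0.6)
df_clean = df.dropna(thresh=min_non_null, axis=1)
print(list(df_clean.columns))  # ['size', 'bedrooms']
```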
Feature Engineering with Pandas
# Create new features
df['price_per_sqft'] = df['price'] / df['size']
df['age'] = 2026 - df['year_built']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
# Encode categorical variables
df['has_garage'] = (df['garage'] != 'None').astype(int)
df['city_encoded'] = pd.factorize(df['city'])[0]
# One-hot encoding
df = pd.get_dummies(df, columns=['city'], drop_first=True)
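To see what the two encodings actually produce, here is a sketch on a four-row example with invented cities. Integer codes from factorize imply an ordering that some models, linear ones in particular, may treat as meaningful, which is one reason one-hot encoding is often preferred for unordered categories.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "Boston", "Austin", "NYC"],
    "price": [450_000, 620_000, 890_000, 510_000],
})

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(df["city"])
print(codes)          # [0 1 2 0]
print(list(uniques))  # ['NYC', 'Boston', 'Austin']

# drop_first=True drops the first category in sorted order ('Austin' here);
# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)
print(list(encoded.columns))  # ['price', 'city_Boston', 'city_NYC']
```

A row with zeros in both city_Boston and city_NYC represents the dropped category, Austin.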
From Pandas to NumPy (The Bridge to Scikit-Learn)
from sklearn.model_selection import train_test_split
# Select features and target
features = ['size', 'bedrooms', 'bathrooms', 'age']
X = df[features].to_numpy() # to_numpy() is the recommended way to get a NumPy array (.values also works)
y = df['price'].to_numpy()
print(type(X)) # <class 'numpy.ndarray'>
print(X.shape) # (1000, 4)
# Split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(X_train.shape) # (800, 4)
print(X_test.shape) # (200, 4)
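Before converting, it is worth checking that every selected column is numeric. The sketch below, on a made-up two-row frame, shows how a single string column changes the dtype of the resulting array.

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1400, 2100],
    "bedrooms": [3.0, 4.0],
    "city": ["NYC", "Boston"],
})

X_num = df[["size", "bedrooms"]].to_numpy()
print(X_num.dtype)    # float64: int and float columns are upcast to a common type

X_mixed = df[["size", "city"]].to_numpy()
print(X_mixed.dtype)  # object: strings force a generic Python-object array

# select_dtypes is one way to list the numeric columns up front
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)   # ['size', 'bedrooms']
```

An object array usually signals that a categorical column still needs encoding before the data reaches a model.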
Mini Project: Real Data Pipeline
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# Load data
df = pd.read_csv('housing.csv')
# Clean
df.dropna(inplace=True)
df['age'] = 2026 - df['year_built']
# Prepare features
X = df[['size', 'bedrooms', 'age']].to_numpy()
y = df['price'].to_numpy()
# Split (shuffled, so the row order of the CSV does not bias the test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate on the held-out 20%
print(f"R² Score: {r2_score(y_test, model.predict(X_test)):.3f}")
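To demystify the metric, R² can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal NumPy sketch with made-up targets and predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2 = 1 - ss_res / ss_tot

print(f"{r2:.3f}")  # 0.975
```

A score of 1.0 means perfect predictions, 0.0 means the model does no better than always predicting the mean, and negative values are possible when it does worse.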
Key Takeaways
- NumPy gives you fast vectorized operations — always prefer array operations over loops
- Pandas handles real-world messy data — missing values, mixed types, filtering
- The typical pipeline is: CSV → DataFrame (Pandas) → NumPy arrays → ML model
- Master .shape, .dtype, .describe(), .isnull(), .fillna(), and indexing; you'll use them in every project
Next lesson: Exploratory Data Analysis — understanding your data before modeling.