Python + AI: How to Build Your First Machine Learning Model
A beginner's guide to building your first machine learning model with Python and scikit-learn: train a real model, evaluate it, and understand what you're doing.
The first machine learning model I built was a lie detector. Not a real one — a classifier trained on the Titanic dataset that predicted survival based on passenger features. It sounds silly in retrospect, but the moment model.score(X_test, y_test) returned 0.82 — 82% accuracy on data the model had never seen — something clicked.
This guide takes you from zero to a trained, evaluated machine learning model. We'll use scikit-learn and the Titanic dataset (public on Kaggle). By the end, you'll understand what training a model actually means, not just how to call the functions.
What Machine Learning Actually Is
Strip away the hype, and machine learning is pattern recognition at scale.
You show a model thousands of examples (houses with known prices, emails labeled spam or not, passengers with known survival outcomes). The model learns statistical patterns in those examples. Then it applies those patterns to new data it's never seen.
What it is not: The model doesn't understand anything. It doesn't know what a house is or why a passenger survived. It finds correlations in numbers. "Feature X tends to correlate with outcome Y" — that's the entirety of what a model learns.
This sounds deflating. In practice, pattern recognition at scale is extraordinarily useful.
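As a minimal sketch of that loop (show labeled examples, learn a pattern, apply it to unseen inputs), here is a toy classifier trained on made-up numbers. The data and the "hours studied" framing are invented for illustration; only the fit and predict calls are scikit-learn's.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied -> passed the exam (1) or not (0)
hours = [[1], [2], [3], [4], [6], [7], [8], [9]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

toy_model = LogisticRegression()
toy_model.fit(hours, passed)  # learn the pattern from labeled examples

# Apply the learned pattern to inputs the model has never seen
new_hours = [[0.5], [10]]
toy_predictions = toy_model.predict(new_hours)
print(toy_predictions)
```

The model never learns what an exam is. It only finds that larger numbers in the input tend to go with label 1.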
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
Download the Titanic dataset from Kaggle — you want train.csv.
For a complete data science environment setup, see our Python data science roadmap.
Step 1: Load and Explore the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("train.csv")
print(df.shape) # (891, 12)
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())
The columns we care about:
- Survived: target variable (0 = died, 1 = survived)
- Pclass: passenger class (1, 2, 3)
- Sex: male/female
- Age: passenger age
- SibSp: siblings/spouses aboard
- Parch: parents/children aboard
- Fare: ticket price
# Survival rate
print(df["Survived"].value_counts())
print(df["Survived"].mean()) # ~38% survived
# Survival by class
print(df.groupby("Pclass")["Survived"].mean())
# Class 1: 63%, Class 2: 47%, Class 3: 24%
# Survival by gender
print(df.groupby("Sex")["Survived"].mean())
# Female: 74%, Male: 19%
Even before machine learning, we can see that class and gender are strong predictors.
Step 2: Data Preprocessing
Most scikit-learn estimators expect a purely numeric feature table: no missing values, no text strings.
# Check missing values
print(df.isnull().sum())
# Age: 177 missing, Cabin: 687 missing, Embarked: 2 missing
# Fill missing Age with median
# (assign the result back; recent pandas versions warn about inplace=True on a column selection)
df["Age"] = df["Age"].fillna(df["Age"].median())
# Fill missing Embarked with mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
# Convert Sex to numbers (0/1)
df["Sex_encoded"] = (df["Sex"] == "female").astype(int)
# Convert Embarked to dummy variables
embarked_dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
df = pd.concat([df, embarked_dummies], axis=1)
# Select features
features = ["Pclass", "Sex_encoded", "Age", "SibSp", "Parch", "Fare",
            "Embarked_C", "Embarked_Q", "Embarked_S"]
X = df[features]
y = df["Survived"]
print(f"Features shape: {X.shape}")
print(f"Any missing values: {X.isnull().any().any()}")
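One caveat about the steps above: the median age is computed from all 891 rows, including rows that later end up in the test set. A common alternative, sketched here on a small made-up frame rather than train.csv, is to put imputation and encoding inside a scikit-learn Pipeline so those statistics are learned only from the rows passed to fit(). The toy values and the names preprocess and pipe are illustrative assumptions, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that mimic a few Titanic columns (not the real data)
toy = pd.DataFrame({
    "Age": [22, np.nan, 38, 35, np.nan, 54, 2, 27],
    "Fare": [7.25, 8.05, 71.28, 53.10, 8.46, 51.86, 21.08, 11.13],
    "Sex": ["male", "male", "female", "female", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
})

preprocess = ColumnTransformer([
    # Fill missing numbers with the median learned during fit()
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    # Turn text categories into 0/1 columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_toy = toy[["Age", "Fare", "Sex"]]
y_toy = toy["Survived"]
pipe.fit(X_toy, y_toy)

# The imputer stored the median of the non-missing ages it saw in fit()
learned_median = pipe.named_steps["preprocess"].named_transformers_["num"].statistics_[0]
print(learned_median)
```

With a pipeline like this you would split first and call fit on the training rows only. The manual approach in this step is fine for a first model, and the effect on a dataset this size is usually small.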
Step 3: Split the Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
Why split the data? If you train and test on the same data, you're testing whether the model memorized your data — not whether it learned anything generalizable. The test set simulates "new data the model has never seen."
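To see memorization in isolation, here is a small sketch on synthetic data where the labels are coin flips, so there is nothing real to learn. The data and variable names are made up for this illustration and have nothing to do with the Titanic file.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 random features and labels that are pure coin flips
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

# An unconstrained decision tree can memorize every training row
tree = DecisionTreeClassifier(random_state=0)
tree.fit(Xn_train, yn_train)

train_acc = tree.score(Xn_train, yn_train)
test_acc = tree.score(Xn_test, yn_test)
print(f"Train accuracy: {train_acc:.2f}")  # scored on rows the tree has already seen
print(f"Test accuracy:  {test_acc:.2f}")   # scored on rows it has never seen
```

The training score looks perfect even though the labels are random. Only the held-out score reveals that the model learned nothing that generalizes.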
Step 4: Train Your First Model
We'll start with Logistic Regression — a simple, interpretable model that's an excellent baseline.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Evaluate
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
You'll likely see ~80% test accuracy. Not bad for a simple model with minimal feature engineering.
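"Interpretable" here means you can read the fitted coefficients directly. The sketch below uses synthetic data with two hypothetical features, one that drives the label and one that does not, to show what that looks like. The feature names and data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500
# Hypothetical features: "signal" drives the label, "noise" does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
label = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

X_demo = pd.DataFrame({"signal": signal, "noise": noise})
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, label)

# One coefficient per feature: the sign gives the direction of the effect.
# Magnitudes are comparable here only because both features share a scale.
coefs = pd.Series(demo_model.coef_[0], index=X_demo.columns)
print(coefs.sort_values(ascending=False))
```

On the Titanic model, pd.Series(model.coef_[0], index=features) gives the same kind of view. Age and Fare are on very different scales from the 0/1 columns, so compare magnitudes there with caution.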
Step 5: Understand the Evaluation
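Accuracy alone can be misleading. What if 90% of passengers had died? A model that always predicts "died" would reach 90% accuracy without learning anything useful. In this dataset about 62% of passengers died, so always predicting "died" already scores roughly 62%. That is the baseline to compare the ~80% against.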
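The always-predict-the-majority scenario can be reproduced with scikit-learn's DummyClassifier. This sketch uses hypothetical labels with a 90/10 split, not the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 90 "died" (0) and 10 "survived" (1)
y_imbalanced = np.array([0] * 90 + [1] * 10)
X_placeholder = np.zeros((100, 1))  # this baseline ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_imbalanced)

baseline_acc = baseline.score(X_placeholder, y_imbalanced)
print(f"Always-predict-majority accuracy: {baseline_acc:.2f}")
```

Fitting the same baseline on X_train and y_train gives the number your real model has to beat.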
from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Died", "Survived"]))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Pred Died", "Pred Survived"],
            yticklabels=["Actual Died", "Actual Survived"])
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()
Understanding the classification report:
- Precision: Of the passengers predicted to survive, what fraction actually did?
- Recall: Of the passengers who actually survived, what fraction did we correctly predict?
- F1-score: Harmonic mean of precision and recall
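The three definitions above can be worked through by hand from the four cells of a confusion matrix. The labels and predictions below are hypothetical, chosen so the counts are easy to check.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = survived, 0 = died)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many we caught
f1 = 2 * precision * recall / (precision + recall)
print(f"by hand: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# scikit-learn's own functions, for comparison with the hand calculation
print(
    f"sklearn: precision={precision_score(y_true_demo, y_pred_demo):.2f} "
    f"recall={recall_score(y_true_demo, y_pred_demo):.2f} "
    f"f1={f1_score(y_true_demo, y_pred_demo):.2f}"
)
```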
Step 6: Try Different Algorithms
One of scikit-learn's best features: consistent API across all algorithms.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
results = {}
# A separate loop variable keeps the fitted logistic regression in "model" intact
for name, candidate in models.items():
    candidate.fit(X_train, y_train)
    results[name] = candidate.score(X_test, y_test)
# Print the models from best to worst test accuracy
for name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{name:25}: {score:.3f}")
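You'll typically see Random Forest and Gradient Boosting outperform Logistic Regression on this dataset. SVM and KNN usually trail here because both are sensitive to feature scale and the features above are unscaled.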
Step 7: Cross-Validation
Instead of one train/test split (which can be lucky or unlucky), cross-validation evaluates across multiple splits.
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(f"CV scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
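The +/- is the standard deviation of accuracy across the five folds, so it tells you how much the score depends on which rows land in the test fold. A small spread means the estimate is stable, not necessarily that the model is good. A large spread suggests the model is sensitive to the particular split, which is common on a dataset this small.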
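To show what happens inside cross_val_score, here is a sketch that runs the folds by hand. It uses synthetic data from make_classification as a stand-in for the Titanic features, and it assumes scikit-learn's documented default that an integer cv with a classifier means stratified folds without shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic features (not the real data)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)

# For classifiers, cv=5 corresponds to 5 stratified folds without shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_syn, y_syn):
    fold_model = RandomForestClassifier(n_estimators=50, random_state=42)
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])  # train on 4 folds
    manual_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))  # score on the 5th

manual_scores = np.array(manual_scores)
print(f"Manual loop:     {manual_scores.mean():.3f} (+/- {manual_scores.std():.3f})")

# The one-line helper should report matching numbers
auto_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X_syn, y_syn, cv=5
)
print(f"cross_val_score: {auto_scores.mean():.3f} (+/- {auto_scores.std():.3f})")
```

Every row is used for testing exactly once, and a fresh model is trained for each fold.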
Step 8: Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=features)
importance.sort_values().plot(kind="barh")
plt.title("Feature Importance")
plt.tight_layout()
plt.show()
This tells you which features the model finds most informative. For the Titanic dataset, you'll typically see Sex, Fare, and Age at the top.
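The feature_importances_ attribute is impurity-based and computed from the training data. A different technique worth knowing is permutation importance, which shuffles one column at a time on held-out data and measures how much the score drops. The sketch below uses synthetic data with hypothetical feature names, where only one feature actually determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
# Hypothetical features: only "useful" determines the label
X_imp = pd.DataFrame({
    "useful": rng.normal(size=n),
    "random_a": rng.normal(size=n),
    "random_b": rng.normal(size=n),
})
y_imp = (X_imp["useful"] > 0).astype(int)

Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imp, y_imp, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xi_train, yi_train)

# Shuffle one column at a time on held-out data and measure the accuracy drop
perm = permutation_importance(forest, Xi_test, yi_test, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=X_imp.columns)
print(perm_importance.sort_values(ascending=False))
```

Either way, importance describes what the model relies on, not what caused the outcome.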
Making Predictions on New Data
# Predict for a new passenger
new_passenger = pd.DataFrame({
    "Pclass": [1],
    "Sex_encoded": [1],  # female
    "Age": [25],
    "SibSp": [0],
    "Parch": [0],
    "Fare": [100],
    "Embarked_C": [1],
    "Embarked_Q": [0],
    "Embarked_S": [0],
})
prediction = rf.predict(new_passenger)[0]
probability = rf.predict_proba(new_passenger)[0]
print(f"Prediction: {'Survived' if prediction == 1 else 'Died'}")
print(f"Survival probability: {probability[1]:.1%}")
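Indexing probability[1] works because predict_proba returns one column per class, in the order given by the model's classes_ attribute. The sketch below checks that explicitly on synthetic stand-in data. The names clf and survived_col are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in data: the label is 1 when the first feature is positive
X_small = rng.normal(size=(200, 3))
y_small = (X_small[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_small, y_small)

# predict_proba columns follow the order of clf.classes_
print(clf.classes_)
survived_col = list(clf.classes_).index(1)  # look up the column for label 1

new_row = np.array([[2.0, 0.0, 0.0]])  # clearly on the "1" side
proba = clf.predict_proba(new_row)[0]
print(f"P(label=1) = {proba[survived_col]:.1%}")
```

When predicting from a DataFrame, selecting new_passenger[features] before calling predict keeps the columns in the same order the model was trained on.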
Frequently Asked Questions
What library is used for ML in Python?
scikit-learn for traditional ML. PyTorch and TensorFlow for deep learning.
How much Python do I need first?
Python fundamentals + basic pandas. You don't need advanced Python.
What is overfitting?
Overfitting is when a model fits the training data so closely, noise included, that it performs worse on new data. A training score far above the score on a held-out test set is the usual sign.
Supervised vs unsupervised learning?
Supervised: labeled data, predict labels. Unsupervised: unlabeled data, find patterns.
Final Thoughts
You've now trained, evaluated, and compared machine learning models. The workflow — load data, clean, split, train, evaluate — is the same for almost every ML project, regardless of complexity.
The real skill in machine learning isn't knowing which algorithm to call. It's understanding your data well enough to build useful features (feature engineering), choosing the right evaluation metric for your problem, and interpreting model outputs critically.
For the data manipulation skills that make feature engineering possible, see our Python data science roadmap. For building an API that serves your trained model, our FastAPI tutorial covers wrapping models in web endpoints. And for the Python libraries this entire stack depends on, our best Python libraries guide covers scikit-learn and the full data science toolkit.
AiTechWorlds Team