Scikit-Learn Tutorial: Build Your First ML Model in 30 Minutes
Scikit-learn tutorial for beginners — build your first machine learning model in 30 minutes with the complete workflow: data loading, preprocessing, training, evaluation, and tuning.
The first time I built a machine learning model that actually worked on real data, the surprise wasn't the result — it was how simple the code was.
Training a Random Forest classifier, evaluating it with proper cross-validation, tuning its hyperparameters, and generating a full evaluation report takes about 40 lines of scikit-learn code. The complexity isn't in the code; it's in understanding what each step does and making good decisions about data preparation.
This tutorial walks you through the complete scikit-learn workflow from data loading to model evaluation, using a real classification dataset. By the end, you'll have built, evaluated, and improved a working ML model — and you'll understand every line of the code.
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
Required versions for this tutorial:
import sklearn
print(sklearn.__version__) # Should be 1.3+ for all examples to work
Dataset: Breast Cancer Classification
We'll use scikit-learn's built-in breast cancer dataset — a real medical dataset with 30 features computed from breast mass cell nuclei, predicting whether the mass is malignant or benign.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
# Convert to DataFrame for easier exploration
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target # 0 = malignant, 1 = benign
print(f"Dataset shape: {df.shape}")
print(f"\nTarget distribution:")
print(df['target'].value_counts())
print(f"\nMalignant: {(df['target'] == 0).sum()} ({(df['target'] == 0).mean()*100:.1f}%)")
print(f"Benign: {(df['target'] == 1).sum()} ({(df['target'] == 1).mean()*100:.1f}%)")
Output:
Dataset shape: (569, 31)
Target distribution:
1 357
0 212
Malignant: 212 (37.3%)
Benign: 357 (62.7%)
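Because the two classes are not evenly split, it helps to know what a model that always predicts the majority class would score. The following is a minimal sketch using scikit-learn's DummyClassifier; the variable names (baseline, baseline_accuracy) are illustrative and not part of the tutorial's own code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

# Load features and labels (0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

# A "model" that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Its accuracy equals the share of the majority class: 357 / 569
baseline_accuracy = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")  # 0.627
```

Any real model should comfortably beat this number before its accuracy means anything.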
Step 1: Exploratory Data Analysis
Before modeling, understand your data:
# Basic statistics
print(df.describe())
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum().sum()) # Should be 0 for this dataset
# Feature correlations with target
correlation_with_target = df.corr()['target'].sort_values()
print("\nTop 5 features most correlated with malignancy (negative = malignant):")
print(correlation_with_target.head(5))
Key findings from exploration:
- No missing values in this dataset
- Features like worst radius, worst perimeter, and mean concave points correlate most strongly with malignancy
- Features vary widely in scale (mean radius ~ 7-28, mean area ~ 143-2501), so standardization will be important
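The scale difference noted above can be checked directly. This is a small sketch, assuming the same built-in dataset; scale_summary and std_ratio are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Compare the raw ranges of two features measured on very different scales
scale_summary = df[["mean radius", "mean area"]].agg(["min", "max", "std"])
print(scale_summary.round(2))

# The ratio of standard deviations hints at how strongly "mean area" would
# dominate any distance-based computation if the features were left unscaled
std_ratio = df["mean area"].std() / df["mean radius"].std()
print(f"std(mean area) / std(mean radius) = {std_ratio:.1f}")
```

Models that rely on distances or gradient-based optimization (KNN, SVM, Logistic Regression) are the ones most affected by this kind of mismatch.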
# Visualize feature distributions by class
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
top_features = ['mean radius', 'mean texture', 'mean perimeter',
                'mean area', 'mean smoothness', 'mean compactness']
for i, feature in enumerate(top_features):
    ax = axes[i//3, i%3]
    df[df['target'] == 0][feature].hist(alpha=0.7, label='Malignant', ax=ax, bins=20)
    df[df['target'] == 1][feature].hist(alpha=0.7, label='Benign', ax=ax, bins=20)
    ax.set_title(feature)
    ax.legend()
plt.tight_layout()
plt.show()
Step 2: Data Preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']
# Split into train and test sets
# random_state ensures reproducibility
# stratify=y ensures same class distribution in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out for final evaluation
    random_state=42,
    stratify=y          # Maintain class balance
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTrain class distribution: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test class distribution: {y_test.value_counts(normalize=True).round(3).to_dict()}")
# Scale features
# IMPORTANT: Fit ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform training data
X_test_scaled = scaler.transform(X_test) # Only transform test data (using train statistics)
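To see what "fit only on training data" means in practice, the sketch below checks the statistics of the scaled arrays. It repeats the split and scaling from above so that it runs on its own; train_means, train_stds, and test_means are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics are learned here
X_test_scaled = scaler.transform(X_test)        # statistics are reused here

# Training columns are centered at 0 with unit variance by construction
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Max |train mean|:", np.abs(train_means).max())
print("Train std range:", train_stds.min(), train_stds.max())

# Test columns are close to, but not exactly, 0 and 1, because they were
# scaled with the training set's mean and standard deviation
test_means = X_test_scaled.mean(axis=0)
print("Max |test mean|:", np.abs(test_means).max())
```

The small offsets in the test set are expected: they show that no information from the test data went into the scaler.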
Step 3: Training Your First Model
Let's start with Logistic Regression as our baseline:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Train the model
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)
# Evaluate
print("=== Logistic Regression Results ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr,
                            target_names=['Malignant', 'Benign']))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.title('Confusion Matrix — Logistic Regression')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
Output:
=== Logistic Regression Results ===
Accuracy: 0.9561
Classification Report:
precision recall f1-score support
Malignant 0.93 0.95 0.94 42
Benign 0.97 0.96 0.96 72
accuracy 0.96 114
macro avg 0.95 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
Understanding the Metrics
- Precision: Of all cases predicted as malignant, how many were actually malignant? (0.93 = 93%)
- Recall: Of all actual malignant cases, how many did we catch? (0.95 = 95%)
- F1-score: Harmonic mean of precision and recall — useful when both matter
- In medical contexts: Recall for malignant is critical — missing a cancer (false negative) is worse than a false alarm
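The definitions above reduce to simple arithmetic on confusion-matrix counts. The counts in this sketch are not printed anywhere in the tutorial; they are assumed values inferred from the report (supports of 42 and 72, plus the rounded precision and recall), so treat them as an illustration rather than the actual output.

```python
# Confusion-matrix counts, treating "malignant" as the positive class.
# Assumed values, chosen to be consistent with the classification report above.
tp = 40  # malignant cases predicted malignant
fn = 2   # malignant cases predicted benign (missed cancers)
fp = 3   # benign cases predicted malignant (false alarms)
tn = 69  # benign cases predicted benign

precision = tp / (tp + fp)                          # 40 / 43
recall = tp / (tp + fn)                             # 40 / 42
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 109 / 114

print(f"Precision: {precision:.2f}")  # 0.93
print(f"Recall:    {recall:.2f}")     # 0.95
print(f"F1-score:  {f1:.2f}")         # 0.94
print(f"Accuracy:  {accuracy:.4f}")   # 0.9561
```

Under these counts, the two false negatives are the errors that matter most in a medical setting: they are malignant cases the model labeled benign.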
Step 4: Trying Multiple Algorithms
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
}
# Evaluate each model with 5-fold cross-validation
results = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    results[name] = {
        'mean': cv_scores.mean(),
        'std': cv_scores.std(),
        'scores': cv_scores
    }
    print(f"{name}: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
Output:
Logistic Regression: 0.9736 (+/- 0.0244)
Random Forest: 0.9604 (+/- 0.0228)
Gradient Boosting: 0.9670 (+/- 0.0196)
SVM: 0.9779 (+/- 0.0173)
K-Nearest Neighbors: 0.9604 (+/- 0.0357)
Note: Cross-validation scores are computed on folds of the training data, so they will differ somewhat from the test set evaluation. This is expected and normal.
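One detail worth knowing: in the comparison loop, the scaler had already been fit on the full training set before cross-validation, so each validation fold slightly influenced the scaling statistics. A possible way to avoid that, sketched here with make_pipeline and Logistic Regression as an example, is to let the scaler be refit inside every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is part of the model, so it is refit on the training part of every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pass the raw (unscaled) training data; scaling happens inside each fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
```

On a small, clean dataset like this one the difference is usually minor, but the habit matters more once preprocessing involves imputation or feature selection.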
Step 5: Hyperparameter Tuning
Let's tune the Random Forest, which has several important hyperparameters:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer, recall_score
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
# Grid Search (exhaustive: tries all combinations)
# Note: Can be slow for large grids. Use RandomizedSearchCV for large grids.
rf = RandomForestClassifier(random_state=42)
# scoring='recall' would measure recall for label 1 (benign), the default
# positive class, so build a scorer that targets label 0 (malignant) instead
malignant_recall = make_scorer(recall_score, pos_label=0)
grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,
    scoring=malignant_recall,  # Optimize for recall of malignant class
    n_jobs=-1,                 # Use all CPU cores
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV recall (malignant): {grid_search.best_score_:.4f}")
# Evaluate best model on test set
best_rf = grid_search.best_estimator_
y_pred_rf = best_rf.predict(X_test_scaled)
print("\n=== Tuned Random Forest Results ===")
print(classification_report(y_test, y_pred_rf, target_names=['Malignant', 'Benign']))
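The grid above has 3 x 4 x 3 x 3 = 108 combinations, each fit 5 times. When that becomes too slow, RandomizedSearchCV samples a fixed number of combinations instead. The sketch below is one way to set it up; the choice of n_iter=10 and accuracy as the metric are assumptions made for brevity, and raw features are used because tree-based models do not depend on feature scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 10 of the 108 possible combinations instead of trying them all
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

The trade-off is coverage for speed: the search may miss the single best combination, but it often lands close to it at a fraction of the cost.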
Step 6: Understanding Feature Importance
One advantage of tree-based models: interpretable feature importance.
# Feature importance from Random Forest
feature_importance = pd.Series(
    best_rf.feature_importances_,
    index=data.feature_names
).sort_values(ascending=False)
# Plot top 15 features
plt.figure(figsize=(10, 6))
feature_importance.head(15).plot(kind='barh')
plt.title('Top 15 Most Important Features')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()
print("\nTop 5 most important features:")
print(feature_importance.head(5))
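The feature_importances_ attribute is impurity-based and computed from the training data, and it tends to split credit among correlated features (this dataset has many). As a complementary check, scikit-learn offers permutation importance, which measures how much a score drops when one feature is shuffled on held-out data. A minimal sketch, assuming an untuned Random Forest rather than the best_rf from the grid search:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the accuracy drop
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

perm_importance = pd.Series(
    result.importances_mean, index=data.feature_names
).sort_values(ascending=False)
print("Top 5 features by permutation importance:")
print(perm_importance.head(5))
```

If the two rankings disagree, that is usually a sign of correlated features rather than a bug, and neither ranking should be read as causal.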
Step 7: Building a Pipeline
In production and real projects, it's important to bundle preprocessing and modeling into a Pipeline. This prevents data leakage and makes deployment cleaner:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Create a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# The pipeline applies transformations and trains together
pipeline.fit(X_train, y_train) # Note: using raw X_train, not X_train_scaled
# Evaluate
y_pred_pipeline = pipeline.predict(X_test) # Raw X_test — scaling happens inside pipeline
print("Pipeline accuracy:", accuracy_score(y_test, y_pred_pipeline))
# Save the entire pipeline (preprocessing + model)
import joblib
joblib.dump(pipeline, 'breast_cancer_classifier.pkl')
# Load and use later
loaded_pipeline = joblib.load('breast_cancer_classifier.pkl')
new_prediction = loaded_pipeline.predict(X_test[:5]) # Predicts on raw features
Pipelines are essential for production ML. They ensure:
- Test data is preprocessed identically to training data
- Preprocessing and model are saved/deployed together
- No risk of forgetting to apply scaling when making predictions
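A pipeline can also be tuned as a single estimator. The sketch below shows one way to combine the Pipeline from this step with GridSearchCV; the small grid is an assumption chosen to keep the run short. Parameters of a step are addressed with the step name, a double underscore, and the parameter name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a step are addressed as "<step name>__<parameter name>"
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # raw features; the scaler is refit inside each fold

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

After fitting, search.best_estimator_ is itself a fitted pipeline, so it can be saved with joblib in the same way as shown above.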
Complete Workflow Summary
# Complete end-to-end workflow
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib
# 1. Load data
data = load_breast_cancer()
X, y = data.data, data.target
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
# 4. Evaluate with cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# 5. Train on full training set
pipeline.fit(X_train, y_train)
# 6. Final evaluation on test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))
# 7. Save
joblib.dump(pipeline, 'model.pkl')
Next Steps
With this foundation, you're ready to apply the same workflow to any tabular dataset:
- Load your data and explore it
- Handle missing values and encode categoricals
- Split train/test with stratification
- Try multiple algorithms with cross-validation
- Tune the best performer
- Wrap in a Pipeline and evaluate on the held-out test set
For more advanced scikit-learn techniques including feature engineering and handling imbalanced data, see the scikit-learn official documentation. For the theoretical foundation behind these algorithms, see our neural networks explained guide.
Frequently Asked Questions
What is scikit-learn and what is it used for?
Scikit-learn is Python's standard library for traditional machine learning: classification, regression, clustering, and preprocessing. It provides a consistent API across dozens of algorithms and is used heavily in industry for structured/tabular data. It is the first ML library most practitioners learn, and it remains in wide use for everything from fraud detection to demand forecasting.
What's the difference between fit(), transform(), and fit_transform()?
fit() computes parameters from training data. transform() applies them. fit_transform() does both. Critical rule: only call fit() on training data. Fitting on test data leaks test information into preprocessing, giving optimistically biased evaluation results.
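A tiny example makes the difference concrete. This sketch uses StandardScaler on a hand-made array (the train and test values are made up for illustration) so the numbers can be checked by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Two steps: learn the mean and standard deviation, then apply them
scaler_a = StandardScaler()
scaler_a.fit(train)
two_step = scaler_a.transform(train)

# One step: same result on the training data
scaler_b = StandardScaler()
one_step = scaler_b.fit_transform(train)

print(np.allclose(two_step, one_step))  # True
print(scaler_a.mean_, scaler_a.scale_)  # mean 2.0, std about 0.816

# The test point is scaled with the training mean and std, not its own
test_scaled = scaler_a.transform(test)
print(test_scaled)  # about 2.449, i.e. (4 - 2) / 0.816
```

Calling fit_transform on the test array instead would recompute the statistics from a single point, which is exactly the mistake the rule above guards against.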
How do I choose between different scikit-learn algorithms?
For classification: start with Logistic Regression (baseline), then Random Forest, then Gradient Boosting. For regression: Linear Regression, then Random Forest Regressor, then a gradient boosting model such as scikit-learn's GradientBoostingRegressor or XGBoost (a separate library). The algorithm usually matters less than data quality and feature engineering. Try multiple algorithms with cross-validation to find the best one for your data.
What is cross-validation and why should I use it?
Cross-validation evaluates model performance more reliably than a single train/test split by averaging over multiple splits (folds). 5-fold or 10-fold cross-validation is standard practice for model evaluation. Reserve the held-out test set for one purpose only: reporting the final model's performance.
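What cross_val_score does internally can be written out as a loop. The sketch below is an approximation for illustration, using StratifiedKFold with shuffling and Logistic Regression as assumed choices; fold_scores is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Keep the test set aside; cross-validation only ever sees the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
    # A fresh model per fold, trained on 4/5 of the training data
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[train_idx], y_train[train_idx])
    # Scored on the remaining 1/5, which this model has not seen
    score = model.score(X_train[val_idx], y_train[val_idx])
    fold_scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.4f}")

print(f"Mean accuracy: {np.mean(fold_scores):.4f}")
```

Every training sample is used for validation exactly once, which is why the averaged score is more stable than a single split.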
How do I improve a poorly performing scikit-learn model?
In order: feature engineering (adding informative features), more data, try different algorithms, hyperparameter tuning (GridSearchCV/RandomizedSearchCV), handle class imbalance, and ensemble methods. Check learning curves to distinguish underfitting (both scores low) from overfitting (train score much higher than validation).
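Learning curves can be computed with scikit-learn's learning_curve function. The following is a minimal sketch on the breast cancer dataset; the Random Forest and the five training sizes are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Train on growing fractions of the data, with 5-fold cross-validation at each size
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)

# Average over the 5 folds for each training-set size
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for size, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"{size:4d} samples: train = {tr:.3f}, validation = {va:.3f}")
```

A large, persistent gap between the two columns suggests overfitting; two low scores that stay close together suggest underfitting.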
AiTechWorlds Team