
Scikit-Learn Tutorial: Build Your First ML Model in 30 Minutes

Scikit-learn tutorial for beginners — build your first machine learning model in 30 minutes with the complete workflow: data loading, preprocessing, training, evaluation, and tuning.

AiTechWorlds Team
May 27, 2026 9 min read

The first time I built a machine learning model that actually worked on real data, the surprise wasn't the result — it was how simple the code was.

Training a Random Forest classifier, evaluating it with proper cross-validation, tuning its hyperparameters, and generating a full evaluation report takes about 40 lines of scikit-learn code. The complexity isn't in the code; it's in understanding what each step does and making good decisions about data preparation.

This tutorial walks you through the complete scikit-learn workflow from data loading to model evaluation, using a real classification dataset. By the end, you'll have built, evaluated, and improved a working ML model — and you'll understand every line of the code.


Setup

pip install scikit-learn pandas numpy matplotlib seaborn

Check your installed version; the examples assume scikit-learn 1.3 or later:

import sklearn
print(sklearn.__version__)  # Should be 1.3+ for all examples to work

Dataset: Breast Cancer Classification

We'll use scikit-learn's built-in breast cancer dataset — a real medical dataset with 30 features computed from breast mass cell nuclei, predicting whether the mass is malignant or benign.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()

# Convert to DataFrame for easier exploration
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # 0 = malignant, 1 = benign

print(f"Dataset shape: {df.shape}")
print(f"\nTarget distribution:")
print(df['target'].value_counts())
print(f"\nMalignant: {(df['target'] == 0).sum()} ({(df['target'] == 0).mean()*100:.1f}%)")
print(f"Benign: {(df['target'] == 1).sum()} ({(df['target'] == 1).mean()*100:.1f}%)")

Output:

Dataset shape: (569, 31)

Target distribution:
1    357
0    212

Malignant: 212 (37.3%)
Benign: 357 (62.7%)

Step 1: Exploratory Data Analysis

Before modeling, understand your data:

# Basic statistics
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum().sum())  # Should be 0 for this dataset

# Feature correlations with target
correlation_with_target = df.corr()['target'].sort_values()
print("\nTop 5 features most correlated with malignancy (negative = malignant):")
print(correlation_with_target.head(5))

Key findings from exploration:

  • No missing values in this dataset
  • Features like worst radius, worst perimeter, and mean concave points correlate most strongly with malignancy
  • Features vary widely in scale (mean radius ~7-28, mean area ~143-2501), so standardization will be important

# Visualize feature distributions by class for the first six "mean" features
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
top_features = ['mean radius', 'mean texture', 'mean perimeter', 
                'mean area', 'mean smoothness', 'mean compactness']

for i, feature in enumerate(top_features):
    ax = axes[i//3, i%3]
    df[df['target'] == 0][feature].hist(alpha=0.7, label='Malignant', ax=ax, bins=20)
    df[df['target'] == 1][feature].hist(alpha=0.7, label='Benign', ax=ax, bins=20)
    ax.set_title(feature)
    ax.legend()

plt.tight_layout()
plt.show()

Step 2: Data Preparation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets
# random_state ensures reproducibility
# stratify=y ensures same class distribution in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,     # 20% held out for final evaluation
    random_state=42,
    stratify=y         # Maintain class balance
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTrain class distribution: {y_train.value_counts(normalize=True).round(3).to_dict()}")
print(f"Test class distribution:  {y_test.value_counts(normalize=True).round(3).to_dict()}")

# Scale features
# IMPORTANT: Fit ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit and transform training data
X_test_scaled = scaler.transform(X_test)        # Only transform test data (using train statistics)
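To see what "fit only on training data" means in practice, the sketch below repeats the split and scaling on the raw arrays and inspects the result. The variable names `train_means`, `train_stds`, and `test_means` are introduced here for illustration. The training features come out with mean 0 and standard deviation 1, while the test features come out close to that but not exactly, because they are scaled with the training statistics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same split settings as in the tutorial, on the raw NumPy arrays
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train mean/std

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("Largest |train mean|:", np.abs(train_means).max())
print("Train std range:", train_stds.min(), train_stds.max())
print("Largest |test mean|:", np.abs(test_means).max())
```

A small nonzero test mean is normal and is not something to correct; refitting the scaler on the test set would be the mistake.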

Step 3: Training Your First Model

Let's start with Logistic Regression as our baseline:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the model
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate
print("=== Logistic Regression Results ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, 
                            target_names=['Malignant', 'Benign']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', 
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.title('Confusion Matrix — Logistic Regression')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

Output:

=== Logistic Regression Results ===
Accuracy: 0.9561

Classification Report:
              precision    recall  f1-score   support

   Malignant       0.93      0.95      0.94        42
      Benign       0.97      0.96      0.96        72

    accuracy                           0.96       114
   macro avg       0.95      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Understanding the Metrics

  • Precision: Of all cases predicted as malignant, how many were actually malignant? (0.93 = 93%)
  • Recall: Of all actual malignant cases, how many did we catch? (0.95 = 95%)
  • F1-score: Harmonic mean of precision and recall — useful when both matter
  • In medical contexts: Recall for malignant is critical — missing a cancer (false negative) is worse than a false alarm
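
As a worked example of these definitions, the sketch below computes precision, recall, and F1 for the malignant class by hand from a confusion matrix and checks them against scikit-learn. The ten labels are made up for illustration (they are not the tutorial's predictions), with 0 = malignant and 1 = benign as in the dataset.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for illustration: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tp = cm[0, 0]  # malignant correctly flagged as malignant
fn = cm[0, 1]  # malignant missed (predicted benign)
fp = cm[1, 0]  # benign wrongly flagged as malignant

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"By hand: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# pos_label=0 tells scikit-learn to treat malignant as the class of interest
print("sklearn precision:", precision_score(y_true, y_pred, pos_label=0))
print("sklearn recall:   ", recall_score(y_true, y_pred, pos_label=0))
print("sklearn f1:       ", f1_score(y_true, y_pred, pos_label=0))
```

Note the `pos_label=0` argument: without it, these functions score label 1 (benign) on this dataset.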

Step 4: Trying Multiple Algorithms

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
}

# Evaluate each model with 5-fold cross-validation
results = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    results[name] = {
        'mean': cv_scores.mean(),
        'std': cv_scores.std(),
        'scores': cv_scores
    }
    print(f"{name}: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

Output:

Logistic Regression: 0.9736 (+/- 0.0244)
Random Forest:       0.9604 (+/- 0.0228)
Gradient Boosting:   0.9670 (+/- 0.0196)
SVM:                 0.9779 (+/- 0.0173)
K-Nearest Neighbors: 0.9604 (+/- 0.0357)

Note: These cross-validation scores are averages over five validation folds carved out of the training set, so they will not match the single test-set accuracy from Step 3. That is expected and normal.
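
One subtlety: the loop above cross-validates on features that were scaled once using the whole training set, so each validation fold has already influenced the scaler. On this dataset the effect is likely small, but a stricter variant passes a pipeline to `cross_val_score` so the scaler is refit inside every fold. The sketch below shows that variant for Logistic Regression only; `make_pipeline` is a scikit-learn helper, and the variable name `model` is just illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler + classifier as one estimator: in each fold, the scaler is fit
# on that fold's training part only, then applied to the validation part
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)

# Raw (unscaled) X_train goes in; scaling happens inside each fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"Logistic Regression: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
```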


Step 5: Hyperparameter Tuning

Let's tune the Random Forest, which has several important hyperparameters:

from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# scoring='recall' would score label 1 (benign), the default positive class.
# Malignant is label 0 here, so build a recall scorer with pos_label=0.
malignant_recall = make_scorer(recall_score, pos_label=0)

# Grid Search (exhaustive: tries all combinations)
# Note: Can be slow for large grids. Use RandomizedSearchCV for large grids.
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,
    scoring=malignant_recall,   # Optimize for recall of the malignant class
    n_jobs=-1,                  # Use all CPU cores
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV recall: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_rf = grid_search.best_estimator_
y_pred_rf = best_rf.predict(X_test_scaled)
print("\n=== Tuned Random Forest Results ===")
print(classification_report(y_test, y_pred_rf, target_names=['Malignant', 'Benign']))
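
The grid above has 108 combinations, each fit five times. For larger search spaces, `RandomizedSearchCV` samples a fixed number of combinations instead. The sketch below is one possible setup, not a tuned recommendation: the `randint` ranges roughly mirror the grid, `n_iter=10` is an arbitrary budget, and the scorer uses `pos_label=0` so that recall refers to the malignant class. It runs on unscaled features, which is fine for tree-based models.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Distributions to sample from (randint's upper bound is exclusive)
param_distributions = {
    "n_estimators": randint(50, 201),     # integers 50..200
    "max_depth": [None, 10, 20, 30],      # lists are sampled uniformly
    "min_samples_split": randint(2, 11),  # integers 2..10
    "min_samples_leaf": randint(1, 5),    # integers 1..4
}

# Recall for the malignant class (label 0 in this dataset)
malignant_recall = make_scorer(recall_score, pos_label=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,             # number of sampled combinations (assumed budget)
    cv=5,
    scoring=malignant_recall,
    random_state=42,       # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV recall (malignant): {search.best_score_:.4f}")
```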

Step 6: Understanding Feature Importance

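One advantage of tree-based models: built-in feature importance scores. In a Random Forest these are impurity-based, so treat them as a rough ranking rather than a precise measure of each feature's effect.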

# Feature importance from Random Forest
feature_importance = pd.Series(
    best_rf.feature_importances_,
    index=data.feature_names
).sort_values(ascending=False)

# Plot top 15 features
plt.figure(figsize=(10, 6))
feature_importance.head(15).plot(kind='barh')
plt.title('Top 15 Most Important Features')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

print("\nTop 5 most important features:")
print(feature_importance.head(5))

Step 7: Building a Pipeline

In production and real projects, it's important to bundle preprocessing and modeling into a Pipeline. This prevents data leakage and makes deployment cleaner:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# The pipeline applies transformations and trains together
pipeline.fit(X_train, y_train)  # Note: using raw X_train, not X_train_scaled

# Evaluate
y_pred_pipeline = pipeline.predict(X_test)  # Raw X_test — scaling happens inside pipeline
print("Pipeline accuracy:", accuracy_score(y_test, y_pred_pipeline))

# Save the entire pipeline (preprocessing + model)
import joblib
joblib.dump(pipeline, 'breast_cancer_classifier.pkl')

# Load and use later
loaded_pipeline = joblib.load('breast_cancer_classifier.pkl')
new_prediction = loaded_pipeline.predict(X_test.iloc[:5])  # Predicts on the first five rows of raw features

Pipelines are essential for production ML. They ensure:

  • Test data is preprocessed identically to training data
  • Preprocessing and model are saved/deployed together
  • No risk of forgetting to apply scaling when making predictions
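
To make the first point concrete, the sketch below checks that a pipeline produces the same predictions as scaling and fitting by hand. It rebuilds the split on the raw arrays so it can run on its own; `manual_model`, `pipeline_pred`, and `manual_pred` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Option A: pipeline handles scaling internally
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

# Option B: the same two steps done by hand
scaler = StandardScaler().fit(X_train)
manual_model = RandomForestClassifier(n_estimators=100, random_state=42)
manual_model.fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))

# Both routes use train-only statistics, so the results should agree
print("Same scaler means:", np.allclose(pipeline.named_steps["scaler"].mean_, scaler.mean_))
print("Same predictions: ", np.array_equal(pipeline_pred, manual_pred))
```

The pipeline does nothing the manual version cannot do; its value is that the scaling step can no longer be skipped or applied with the wrong statistics.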

Complete Workflow Summary

# Complete end-to-end workflow
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib

# 1. Load data
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# 4. Evaluate with cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# 5. Train on full training set
pipeline.fit(X_train, y_train)

# 6. Final evaluation on test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))

# 7. Save
joblib.dump(pipeline, 'model.pkl')

Next Steps

With this foundation, you're ready to apply the same workflow to any tabular dataset:

  1. Load your data and explore it
  2. Handle missing values and encode categoricals
  3. Split train/test with stratification
  4. Try multiple algorithms with cross-validation
  5. Tune the best performer
  6. Wrap in a Pipeline and evaluate on the held-out test set

For more advanced scikit-learn techniques including feature engineering and handling imbalanced data, see the scikit-learn official documentation. For the theoretical foundation behind these algorithms, see our neural networks explained guide.


Frequently Asked Questions

What is scikit-learn and what is it used for?

Scikit-learn is Python's standard library for traditional machine learning: classification, regression, clustering, and preprocessing. It provides a consistent API across dozens of algorithms and is used heavily in industry for structured/tabular data. It is the first ML library most practitioners learn, and it remains in wide use for everything from fraud detection to demand forecasting.

What's the difference between fit(), transform(), and fit_transform()?

fit() computes parameters from training data. transform() applies them. fit_transform() does both. Critical rule: only call fit() on training data. Fitting on test data leaks test information into preprocessing, giving optimistically biased evaluation results.
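
A tiny example makes the three methods concrete. The arrays below are made up for illustration; the point is that `fit()` stores the training mean and standard deviation, and `transform()` reuses them on any later data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 training rows, 1 test row, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# fit(): learn per-feature mean and standard deviation from training data
scaler = StandardScaler()
scaler.fit(X_train)
print("Learned means:", scaler.mean_)   # stored on the fitted scaler
print("Learned stds: ", scaler.scale_)

# transform(): apply the stored statistics
train_two_step = scaler.transform(X_train)

# fit_transform(): both steps in one call, same result on training data
train_one_step = StandardScaler().fit_transform(X_train)
print("Identical:", np.allclose(train_two_step, train_one_step))

# Test data is only transformed, using the training mean and std
test_scaled = scaler.transform(X_test)
print("Scaled test row:", test_scaled)
```

The second test feature (40) lies outside the training range, so it scales to a value above 2. That is the correct behavior: the test row is measured against the training distribution.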

How do I choose between different scikit-learn algorithms?

For classification: start with Logistic Regression (baseline), then Random Forest, then Gradient Boosting. For regression: Linear Regression, then Random Forest Regressor, then gradient boosting (scikit-learn's GradientBoostingRegressor, or XGBoost, which is a separate library). The algorithm usually matters less than data quality and feature engineering. Try multiple algorithms with cross-validation to find the best for your data.

What is cross-validation and why should I use it?

Cross-validation evaluates model performance more reliably than a single train/test split by averaging over multiple splits (folds). 5-fold or 10-fold cross-validation is standard practice for model evaluation. Reserve the held-out test set for reporting the final model's performance.
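
To show what happens inside `cross_val_score`, the sketch below writes the fold loop by hand and compares the result with the library call. It assumes a scaler plus Logistic Regression pipeline as the model; for classifiers, an integer `cv` corresponds to unshuffled stratified folds, which is what `StratifiedKFold(n_splits=5)` builds here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=5)

# Manual loop: train on 4 folds, score on the remaining fold, repeat 5 times
manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    fold_pred = fold_model.predict(X[val_idx])
    manual_scores.append(accuracy_score(y[val_idx], fold_pred))
manual_scores = np.array(manual_scores)

# Library call using the same folds
library_scores = cross_val_score(model, X, y, cv=folds)

print("Manual: ", manual_scores.round(4))
print("Library:", library_scores.round(4))
print(f"Mean accuracy: {manual_scores.mean():.4f}")
```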

How do I improve a poorly performing scikit-learn model?

In order: feature engineering (adding informative features), more data, try different algorithms, hyperparameter tuning (GridSearchCV/RandomizedSearchCV), handle class imbalance, and ensemble methods. Check learning curves to distinguish underfitting (both scores low) from overfitting (train score much higher than validation).
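
The learning-curve check can be done with scikit-learn's `learning_curve` function. The sketch below is a minimal version on the breast cancer data, assuming a scaler plus Logistic Regression pipeline; the five training sizes and the shuffle seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fit the model on growing subsets of the training folds and
# record train and validation accuracy at each size
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="accuracy",
    shuffle=True,
    random_state=42,
)

# Average over the 5 folds for each training size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

print("size   train   validation")
for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"{size:4d}   {tr:.3f}   {va:.3f}")
```

Reading the table: if both columns stay low, the model is underfitting; if the train column is much higher than the validation column, it is overfitting, and more data or stronger regularization may help.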
