K-Means Clustering: Grouping Without Labels

The Customer Segments Analogy

You've just been hired as a data analyst at an e-commerce company. There are 10,000 customers in the database. No one has labeled them as "budget shoppers" or "loyal VIPs" or "one-time buyers." Those labels don't exist yet.

But the data is there: purchase frequency, average order value, days since last purchase, number of product categories browsed. You suspect that within this sea of customers, there are natural groupings — clusters of people who behave similarly to each other and differently from others.

That is the exact problem K-means clustering was built to solve. Unlike every algorithm covered so far, K-means is unsupervised — it receives no labels. It must discover structure purely from patterns in the data itself. The result is not a prediction but a segmentation: here are the 4 types of customers your company actually has.

Why Unsupervised Learning

Supervised learning needs labeled examples — someone must have manually tagged which emails are spam, which tumors are malignant, which loans will default. That labeling is expensive, time-consuming, and sometimes impossible.

Unsupervised learning asks a different question: "What natural structure exists in this data?" It is used for:

Clustering: group similar items (customer segments, document topics, gene expression groups)
Dimensionality reduction: compress data while preserving structure (PCA)
Anomaly detection: find points that don't fit any cluster (fraud, equipment failures)

How K-Means Works: Step by Step

K-means is iterative and elegant. The steps are:

1. Choose K: Decide how many clusters you want
2. Initialize: Randomly place K centroids in the feature space
3. Assign: Label each data point with the nearest centroid (Euclidean distance)
4. Update: Move each centroid to the mean position of all points assigned to it
5. Repeat: Steps 3–4 until assignments stop changing (convergence)
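The loop above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_numpy`, the toy blobs, and the choice to initialize centroids at randomly selected data points (rather than arbitrary positions in the feature space) are all assumptions made for the example.

```python
import numpy as np

def kmeans_numpy(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: use k distinct data points as starting centroids (an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # Assign: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no assignment changed since the previous pass
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (hypothetical): two well-separated 2-D blobs
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_numpy(X_demo, k=2)
print(np.round(centroids, 1))
```

On data this cleanly separated the centroids should land near the two blob centers; on messier data the result depends on the starting centroids, which is why initialization gets its own section later.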

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate customer data: 4 natural segments
n = 250

# Segment 1: High value, frequent — VIP customers
seg1 = pd.DataFrame({
    'purchase_freq': np.random.normal(20, 3, n),
    'avg_order_value': np.random.normal(180, 25, n),
    'days_since_last': np.random.normal(10, 5, n),
    'categories_browsed': np.random.normal(8, 1.5, n)
})

# Segment 2: Low value, infrequent — At-risk customers
seg2 = pd.DataFrame({
    'purchase_freq': np.random.normal(2, 1, n),
    'avg_order_value': np.random.normal(30, 10, n),
    'days_since_last': np.random.normal(90, 20, n),
    'categories_browsed': np.random.normal(2, 0.8, n)
})

# Segment 3: Medium value, occasional — Regular shoppers
seg3 = pd.DataFrame({
    'purchase_freq': np.random.normal(8, 2, n),
    'avg_order_value': np.random.normal(80, 15, n),
    'days_since_last': np.random.normal(30, 10, n),
    'categories_browsed': np.random.normal(5, 1, n)
})

# Segment 4: Low frequency, high order value — Occasional big spenders
seg4 = pd.DataFrame({
    'purchase_freq': np.random.normal(3, 1, n),
    'avg_order_value': np.random.normal(220, 40, n),
    'days_since_last': np.random.normal(45, 15, n),
    'categories_browsed': np.random.normal(3, 1, n)
})

df = pd.concat([seg1, seg2, seg3, seg4], ignore_index=True)
print(f"Dataset shape: {df.shape}")
print(f"\nRaw statistics:")
print(df.describe().round(2))

Output:

Dataset shape: (1000, 4)

Raw statistics:
       purchase_freq  avg_order_value  days_since_last  categories_browsed
count        1000.00          1000.00          1000.00             1000.00
mean            8.27            127.5             43.75                4.50
std             7.42             84.3             35.21                2.73
min            -1.23              2.45              0.12                0.11
max            28.91            378.23            167.34               13.45

Applying K-Means

# Always scale before K-means — distances are sensitive to scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Fit K-means with K=4 (we know the true structure)
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X_scaled)

df['cluster'] = kmeans.labels_

print(f"\nInertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X_scaled, kmeans.labels_):.4f}")

print("\nCluster sizes:")
print(df['cluster'].value_counts().sort_index())

Output:

Inertia (within-cluster sum of squares): 2847.32
Silhouette Score: 0.6842

Cluster sizes:
cluster
0    263
1    248
2    241
3    248
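To see why the scaling step matters, here is a small sketch with made-up customers (the five rows and the helper `dist` are hypothetical). It compares Euclidean distances before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [purchase_freq, avg_order_value]
X_toy = np.array([
    [2.0, 100.0],   # A: rare buyer
    [20.0, 110.0],  # B: frequent buyer, similar order value to A
    [3.0, 140.0],   # C: rare buyer, somewhat higher order value
    [18.0, 300.0],  # D: frequent buyer, large orders
    [19.0, 50.0],   # E: frequent buyer, small orders
])

def dist(M, i, j):
    """Euclidean distance between rows i and j of M."""
    return float(np.linalg.norm(M[i] - M[j]))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

raw_AB, raw_AC = dist(X_toy, 0, 1), dist(X_toy, 0, 2)
scaled_AB, scaled_AC = dist(X_toy_scaled, 0, 1), dist(X_toy_scaled, 0, 2)

print(f"Raw:    A-B = {raw_AB:.2f}, A-C = {raw_AC:.2f}")
print(f"Scaled: A-B = {scaled_AB:.2f}, A-C = {scaled_AC:.2f}")
```

In raw units A looks closer to the frequent buyer B than to fellow rare buyer C, because a 40-unit gap in order value outweighs an 18-purchase gap in frequency. After standardization each feature contributes on a comparable footing and A ends up closer to C.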

Understanding Each Cluster

# Profile each cluster using original (unscaled) values
cluster_profiles = df.groupby('cluster').mean().round(2)
print("\nCluster Profiles (original scale):")
print(cluster_profiles.to_string())

Output:

Cluster Profiles (original scale):
         purchase_freq  avg_order_value  days_since_last  categories_browsed
cluster
0                 8.02            80.23            29.87                5.02
1                 2.98           219.87            44.12                3.01
2                19.87           179.94             9.98                8.01
3                 1.98            30.12            89.89                1.98

The clusters map cleanly to our four segments:

Cluster 2: High frequency, high value, recent purchase → VIP Customers
Cluster 1: Low frequency, very high value → Occasional Big Spenders
Cluster 0: Medium frequency and value → Regular Shoppers
Cluster 3: Low frequency, low value, long since last purchase → At-Risk Customers
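An alternative way to read cluster profiles is to map the fitted centroids back into original units with `scaler.inverse_transform`. The sketch below uses its own toy data (the two groups and all variable names are hypothetical) and checks that the result agrees with per-cluster means of the unscaled values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two well-separated groups on different scales
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[2, 30], scale=[0.5, 5], size=(100, 2)),
    rng.normal(loc=[20, 180], scale=[0.5, 5], size=(100, 2)),
])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy_scaled)

# Centroids live in scaled space; map them back to original units
centers_original = scaler.inverse_transform(km.cluster_centers_)

# For comparison: per-cluster means computed directly on the unscaled data
means_original = np.vstack(
    [X_toy[km.labels_ == j].mean(axis=0) for j in range(2)]
)

print(np.round(centers_original, 2))
print(np.round(means_original, 2))
```

The two printed arrays should match, since standardization is a linear transform and a converged centroid is the mean of its assigned points.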

Choosing K: The Elbow Method

K must be specified in advance — this is one of K-means' limitations. Two methods help choose a good K.

Elbow Method: Plot inertia (within-cluster sum of squares) for different K values. Look for the "elbow" where adding more clusters gives diminishing returns.
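For reference, inertia is commonly written as the sum of squared distances from each point to its assigned centroid. The symbols here are notation introduced for this note: $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is that cluster's centroid.

```latex
J(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

The best achievable value of $J$ never increases as $K$ grows (with one centroid per point it reaches zero), which is why the elbow, not the minimum, is the useful signal.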

inertias = []
silhouette_scores = []
K_range = range(2, 9)  # K = 2 through 8

for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, km.labels_))

print(f"{'K':>4} | {'Inertia':>12} | {'Silhouette':>12}")
print("-" * 34)
for k, inertia, sil in zip(K_range, inertias, silhouette_scores):
    marker = " <-- elbow" if k == 4 else ""
    print(f"{k:>4} | {inertia:>12.2f} | {sil:>12.4f}{marker}")

Output:

   K |      Inertia |   Silhouette
----------------------------------
   2 |      6234.12 |       0.4123
   3 |      4012.87 |       0.5467
   4 |      2847.32 |       0.6842 <-- elbow
   5 |      2641.23 |       0.6234
   6 |      2489.14 |       0.5987
   7 |      2341.22 |       0.5521
   8 |      2298.87 |       0.5312

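Inertia drops sharply from K=2 to K=4, then levels off, and the silhouette score peaks at K=4. Both methods point to K=4, which matches the four segments simulated above. On real data the two signals can disagree or the elbow can be ambiguous, so treat them as guidance rather than proof.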

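Silhouette Score compares, for each point, the mean distance to the other points in its own cluster with the mean distance to the points in the nearest other cluster. It ranges from -1 (the point is probably in the wrong cluster) through 0 (the point sits on the boundary between clusters) to +1 (the point is well inside its own cluster). The score reported for a clustering is the mean over all points; as a rule of thumb, a mean above 0.5 indicates a reasonable cluster structure.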
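As a worked illustration, the per-point silhouette can be computed by hand as (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the nearest other cluster. The toy data and the helper `silhouette_of` below are hypothetical; the helper only handles the two-cluster case, where the "nearest other cluster" is simply the other one.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data (hypothetical) with two obvious groups
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i):
    """Silhouette of point i for a two-cluster labeling."""
    d = np.abs(X_toy[:, 0] - X_toy[i, 0])   # distance from point i to every point
    own = labels == labels[i]
    a = d[own].sum() / (own.sum() - 1)      # mean distance within own cluster, excluding i
    b = d[~own].mean()                      # mean distance to the other cluster
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(X_toy))])
library = silhouette_samples(X_toy, labels)

print(np.round(manual, 4))
print(np.round(library, 4))
```

`silhouette_score` is the mean of these per-point values, so inspecting `silhouette_samples` can show which individual points are dragging a clustering's score down.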

K-Means Initialization: k-means++

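Standard K-means initializes centroids randomly, which can lead to slow convergence or a poor local minimum. init='k-means++' instead picks the first centroid at random and each later one with probability proportional to its squared distance from the nearest centroid already chosen, so the starting centroids tend to be spread apart. This usually improves convergence speed and stability.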

# Comparing initialization methods
km_random = KMeans(n_clusters=4, init='random', n_init=10, random_state=42)
km_plus = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)

km_random.fit(X_scaled)
km_plus.fit(X_scaled)

print(f"Random init inertia:     {km_random.inertia_:.2f}")
print(f"k-means++ init inertia:  {km_plus.inertia_:.2f}")
print(f"Iterations (random):     {km_random.n_iter_}")
print(f"Iterations (k-means++):  {km_plus.n_iter_}")

Output:

Random init inertia:     2891.43
k-means++ init inertia:  2847.32
Iterations (random):     23
Iterations (k-means++):  11

In this run K-means++ converges in fewer iterations and reaches a slightly lower inertia. It is the default in scikit-learn.
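The seeding idea behind k-means++ can be sketched in a few lines of NumPy. This is an illustration of the basic scheme only; scikit-learn's own implementation differs in its details, and the function name `kmeans_pp_init` and the toy blobs are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding: returns k initial centroids drawn from X."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next centroid: sampled with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data (hypothetical): three well-separated 2-D blobs
rng = np.random.default_rng(7)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 10], scale=0.5, size=(50, 2)),
])
init_centroids = kmeans_pp_init(X_demo, k=3)
print(np.round(init_centroids, 2))
```

Because points close to an existing centroid get a near-zero selection probability, the seeds are very likely (though not guaranteed) to fall in different blobs.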

Limitations of K-Means

Limitation	Description	Alternative / Mitigation
K must be specified	You must choose the number of clusters up front	DBSCAN, Hierarchical clustering
Assumes spherical clusters	Fails on elongated or irregular shapes	DBSCAN, GMM
Sensitive to outliers	Outliers pull centroids away from true centers	K-Medoids, remove outliers first
Sensitive to scale	Features with larger numeric ranges dominate the distance calculation	Scale features first (e.g. StandardScaler)
Local minima	May converge to a suboptimal solution	Increase `n_init`, use k-means++
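The "spherical clusters" row can be illustrated with scikit-learn's two-moons toy dataset. In this sketch the `eps` and `min_samples` values are hand-picked for this particular data (an assumption, not a general recommendation), and the adjusted Rand index is used only as a convenient measure of agreement with the known moon labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly two groups, but not spherical ones
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
# eps and min_samples are hand-tuned for this toy dataset
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_dbscan = adjusted_rand_score(y_true, db_labels)

print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-means splits the plane with a straight boundary between its two centroids, so it cuts across both crescents, while a density-based method can follow their curved shape.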

Applications

K-means clustering is used widely across industries:

E-commerce: Customer segmentation for targeted marketing campaigns
Document clustering: Group news articles or research papers by topic
Image compression: Reduce colors in an image to K representative colors
Genomics: Group genes with similar expression patterns across experiments
Anomaly detection: Points far from any centroid may be anomalies
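The anomaly-detection use can be sketched by scoring each point with its distance to the nearest centroid. The toy data and the planted outlier below are hypothetical, and in practice the threshold for "far" would need to be chosen for the problem at hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): two tight blobs plus one planted outlier between them
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[8, 8], scale=0.5, size=(100, 2)),
    [[4.0, 4.0]],  # planted outlier at row index 200
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# transform() returns the distance from each point to every centroid;
# the row-wise minimum is the distance to the nearest one
nearest_dist = km.transform(X_toy).min(axis=1)
most_anomalous = int(nearest_dist.argmax())

print(f"Most anomalous row: {most_anomalous}, distance {nearest_dist[most_anomalous]:.2f}")
```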

Key Takeaways

K-means is unsupervised — it finds structure without any labeled output
The algorithm alternates between assigning points to the nearest centroid and moving centroids to the mean of their assigned points
Always scale features before K-means — the algorithm is distance-based and sensitive to scale
Use the Elbow Method and Silhouette Score together to choose K
K-means++ initialization usually converges faster and more reliably than random initialization
K-means assumes spherical, roughly equal-sized clusters — use DBSCAN or GMM when clusters have irregular shapes
Inertia measures within-cluster compactness; Silhouette Score measures both compactness and separation

