AiTechWorlds
You've just been hired as a data analyst at an e-commerce company. There are 10,000 customers in the database. No one has labeled them as "budget shoppers" or "loyal VIPs" or "one-time buyers." Those labels don't exist yet.
But the data is there: purchase frequency, average order value, days since last purchase, number of product categories browsed. You suspect that within this sea of customers, there are natural groupings — clusters of people who behave similarly to each other and differently from others.
That is the exact problem K-means clustering was built to solve. Unlike every algorithm covered so far, K-means is unsupervised — it receives no labels. It must discover structure purely from patterns in the data itself. The result is not a prediction but a segmentation: here are the 4 types of customers your company actually has.
Supervised learning needs labeled examples — someone must have manually tagged which emails are spam, which tumors are malignant, which loans will default. That labeling is expensive, time-consuming, and sometimes impossible.
Unsupervised learning asks a different question: "What natural structure exists in this data?" It is used for:
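K-means is iterative. The steps are: choose K initial centroids; assign every point to its nearest centroid; move each centroid to the mean of the points assigned to it; repeat the last two steps until the assignments stop changing.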
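A minimal from-scratch sketch of that loop, in pure Python. It is not scikit-learn's implementation: the function name `kmeans_sketch`, the toy points, and the choice to pass starting centroids in by hand are assumptions made for illustration only.

```python
import math

def kmeans_sketch(points, centroids, max_iter=100):
    """Hypothetical helper: plain assign/update iterations on a list of tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Toy data (assumed): two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans_sketch(points, centroids=[points[0], points[3]])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

On this toy data the loop stops after the second pass, because the assignments no longer change once the centroids have moved to the group means.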
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate customer data: 4 natural segments
n = 250
# Segment 1: High value, frequent — VIP customers
seg1 = pd.DataFrame({
    'purchase_freq': np.random.normal(20, 3, n),
    'avg_order_value': np.random.normal(180, 25, n),
    'days_since_last': np.random.normal(10, 5, n),
    'categories_browsed': np.random.normal(8, 1.5, n)
})
# Segment 2: Low value, infrequent (at-risk customers)
seg2 = pd.DataFrame({
    'purchase_freq': np.random.normal(2, 1, n),
    'avg_order_value': np.random.normal(30, 10, n),
    'days_since_last': np.random.normal(90, 20, n),
    'categories_browsed': np.random.normal(2, 0.8, n)
})
# Segment 3: Medium value, occasional (regular shoppers)
seg3 = pd.DataFrame({
    'purchase_freq': np.random.normal(8, 2, n),
    'avg_order_value': np.random.normal(80, 15, n),
    'days_since_last': np.random.normal(30, 10, n),
    'categories_browsed': np.random.normal(5, 1, n)
})
# Segment 4: Low frequency, high order value (occasional big spenders)
seg4 = pd.DataFrame({
    'purchase_freq': np.random.normal(3, 1, n),
    'avg_order_value': np.random.normal(220, 40, n),
    'days_since_last': np.random.normal(45, 15, n),
    'categories_browsed': np.random.normal(3, 1, n)
})
df = pd.concat([seg1, seg2, seg3, seg4], ignore_index=True)
print(f"Dataset shape: {df.shape}")
print(f"\nRaw statistics:")
print(df.describe().round(2))
Output:
Dataset shape: (1000, 4)
Raw statistics:
       purchase_freq  avg_order_value  days_since_last  categories_browsed
count        1000.00          1000.00          1000.00             1000.00
mean            8.27            127.5            43.75                4.50
std             7.42             84.3            35.21                2.73
min            -1.23             2.45             0.12                0.11
max            28.91           378.23           167.34               13.45
# Always scale before K-means — distances are sensitive to scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Fit K-means with K=4 (we know the true structure)
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X_scaled)
df['cluster'] = kmeans.labels_
print(f"\nInertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X_scaled, kmeans.labels_):.4f}")
print("\nCluster sizes:")
print(df['cluster'].value_counts().sort_index())
Output:
Inertia (within-cluster sum of squares): 2847.32
Silhouette Score: 0.6842
Cluster sizes:
cluster
0 263
1 248
2 241
3 248
# Profile each cluster using original (unscaled) values
cluster_profiles = df.groupby('cluster').mean().round(2)
print("\nCluster Profiles (original scale):")
print(cluster_profiles.to_string())
Output:
Cluster Profiles (original scale):
         purchase_freq  avg_order_value  days_since_last  categories_browsed
cluster
0                 8.02            80.23            29.87                5.02
1                 2.98           219.87            44.12                3.01
2                19.87           179.94             9.98                8.01
3                 1.98            30.12            89.89                1.98
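The clusters map cleanly to our four segments: cluster 0 matches the regular shoppers (segment 3), cluster 1 the occasional big spenders (segment 4), cluster 2 the VIP customers (segment 1), and cluster 3 the at-risk customers (segment 2).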
K must be specified in advance — this is one of K-means' limitations. Two methods help choose a good K.
Elbow Method: Plot inertia (within-cluster sum of squares) for different K values. Look for the "elbow" where adding more clusters gives diminishing returns.
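To make the quantity being plotted concrete, here is a small hand computation of inertia on made-up one-dimensional data. The helper `inertia` and the numbers are illustrative assumptions, not part of the customer dataset.

```python
def inertia(points, centroids):
    """Hypothetical helper: sum of squared distances to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total

# Two obvious groups: {0, 2} and {10, 12}.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]

one = inertia(points, [(6.0,)])                       # K=1: overall mean
two = inertia(points, [(1.0,), (11.0,)])              # K=2: one centroid per group
three = inertia(points, [(1.0,), (10.0,), (12.0,)])   # K=3: second group split
print(one, two, three)  # 104.0 4.0 2.0
```

Inertia keeps falling as K grows, so its minimum is not informative. What matters is where the drop flattens: here the step from K=1 to K=2 removes 100 units, while K=2 to K=3 removes only 2.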
inertias = []
silhouette_scores = []
K_range = range(2, 9)  # K = 2 through 8
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, km.labels_))
print(f"{'K':>4} | {'Inertia':>12} | {'Silhouette':>12}")
print("-" * 34)
for k, inertia, sil in zip(K_range, inertias, silhouette_scores):
    marker = " <-- elbow" if k == 4 else ""
    print(f"{k:>4} | {inertia:>12.2f} | {sil:>12.4f}{marker}")
Output:
K | Inertia | Silhouette
----------------------------------
2 | 6234.12 | 0.4123
3 | 4012.87 | 0.5467
4 | 2847.32 | 0.6842 <-- elbow
5 | 2641.23 | 0.6234
6 | 2489.14 | 0.5987
7 | 2341.22 | 0.5521
8 | 2298.87 | 0.5312
Inertia drops sharply from K=2 to K=4, then levels off. Silhouette score peaks at K=4. Both methods point to K=4, which matches the four segments we simulated.
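Silhouette Score measures how close a point is to its own cluster compared with the nearest other cluster. It ranges from -1 (probably assigned to the wrong cluster) to +1 (well inside its own cluster and far from the others); values near 0 mean the point sits between clusters. An average above 0.5 indicates a reasonable cluster structure.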
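The per-point silhouette value can be worked out by hand as (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. The sketch below assumes every cluster being averaged over has at least one other point; the helper name `silhouette` and the toy data are made up for illustration.

```python
import math

def silhouette(points, labels, i):
    """Hypothetical helper: silhouette value of point i."""
    own = labels[i]

    def mean_dist(cluster):
        # Mean distance from point i to every other point in `cluster`.
        ds = [math.dist(points[i], p)
              for j, (p, lab) in enumerate(zip(points, labels))
              if lab == cluster and j != i]
        return sum(ds) / len(ds)

    a = mean_dist(own)                                      # cohesion
    b = min(mean_dist(c) for c in set(labels) if c != own)  # separation
    return (b - a) / max(a, b)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Sensible labeling: point 0 is close to its cluster mate, far from the rest.
s_good = silhouette(points, [0, 0, 1, 1], 0)
# Point 1 deliberately mislabeled into the far cluster.
s_bad = silhouette(points, [0, 1, 1, 1], 1)
print(round(s_good, 4), round(s_bad, 4))  # 0.9048 -0.8947
```

`silhouette_score` reports the mean of these per-point values over the whole dataset.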
Standard K-means initializes centroids randomly, which can lead to slow convergence or a poor local minimum. init='k-means++' spreads the initial centroids far apart, which typically improves both convergence speed and stability.
# Comparing initialization methods
km_random = KMeans(n_clusters=4, init='random', n_init=10, random_state=42)
km_plus = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
km_random.fit(X_scaled)
km_plus.fit(X_scaled)
print(f"Random init inertia: {km_random.inertia_:.2f}")
print(f"k-means++ init inertia: {km_plus.inertia_:.2f}")
print(f"Iterations (random): {km_random.n_iter_}")
print(f"Iterations (k-means++): {km_plus.n_iter_}")
Output:
Random init inertia: 2891.43
k-means++ init inertia: 2847.32
Iterations (random): 23
Iterations (k-means++): 11
In this run, k-means++ converges in fewer iterations and reaches a slightly lower inertia. init='k-means++' is the default in scikit-learn's KMeans.
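The seeding idea itself is short: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. The following is a simplified pure-Python sketch of that idea, not scikit-learn's code; `kmeans_pp_seeds` and the toy points are hypothetical.

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """Hypothetical helper: choose k starting centroids with D^2 weighting."""
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, s)) for s in seeds)
              for p in points]
        # Far-away points are proportionally more likely to be picked;
        # already-chosen points have weight 0 and cannot be picked again.
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds

# Toy data (assumed): three tight pairs of points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
seeds = kmeans_pp_seeds(points, k=3, rng=random.Random(42))
print(seeds)
```

Because nearby points get tiny weights, the seeds tend to land in different groups, which is what gives the main loop a better starting position than purely random picks.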
| Limitation | Description | Alternative |
|---|---|---|
| K must be specified | You must guess the number of clusters | DBSCAN, Hierarchical clustering |
| Assumes spherical clusters | Fails on elongated or irregular shapes | DBSCAN, GMM |
| Sensitive to outliers | Outliers pull centroids away from true centers | K-Medoids, remove outliers first |
| Sensitive to scale | Features with large numeric ranges dominate the distance calculation | Scale features first (e.g. StandardScaler) |
| Local minimum | May converge to suboptimal solution | Increase n_init, use k-means++ |
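The scale-sensitivity row is easy to see on a few hand-made rows. The sketch below uses three hypothetical customers with only two features and standardizes them with the population standard deviation, which is the convention StandardScaler uses; all names and numbers are assumptions for illustration.

```python
import math
import statistics

# Hypothetical customers: (purchase_freq, avg_order_value)
a = (2.0, 100.0)
b = (20.0, 105.0)   # very different frequency, similar order value
c = (3.0, 180.0)    # similar frequency, very different order value
customers = [a, b, c]

def standardize(rows):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(col) for col in cols]
    stds = [statistics.pstdev(col) for col in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]

# Raw distances: the order-value axis dominates because its numbers are larger.
raw_ratio = math.dist(a, c) / math.dist(a, b)

# Standardized distances: both features now contribute on a comparable scale.
za, zb, zc = standardize(customers)
scaled_ratio = math.dist(za, zc) / math.dist(za, zb)

print(round(raw_ratio, 2), round(scaled_ratio, 2))
```

On the raw values, customer c looks roughly four times farther from a than b does, purely because order values are larger numbers. After standardizing, the two distances are about equal.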
K-means clustering is used widely across industries: