K-Nearest Neighbors
K-Nearest Neighbors: Learning by Example
KNN is the most intuitive ML algorithm: to classify a new point, find the k most similar training examples and take a vote. No training phase, no assumptions about data distribution — just memory and distance.
The Algorithm
New point arrives: [height=175cm, weight=70kg]
│
Find 5 nearest neighbors (k=5):
┌────────────────────────────────┐
│ Person 1: [174, 72] → Class A │
│ Person 2: [177, 68] → Class A │
│ Person 3: [173, 71] → Class A │
│ Person 4: [176, 80] → Class B │
│ Person 5: [172, 65] → Class A │
└────────────────────────────────┘
│
Majority vote: A=4, B=1
│
Predict: Class A
KNN is a lazy learner — it doesn't build a model during training. Instead, it memorizes the training set and does all computation at prediction time.
Implementing KNN
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import numpy as np
import matplotlib.pyplot as plt
# Classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# KNN requires scaled features — distances are meaningless otherwise
knn_pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean'))
])
knn_pipeline.fit(X_train, y_train)
print(f"Accuracy: {knn_pipeline.score(X_test, y_test):.3f}")
print(classification_report(y_test, knn_pipeline.predict(X_test),
target_names=cancer.target_names))
Choosing K: The Critical Hyperparameter
train_scores = []
test_scores = []
k_values = range(1, 50)
for k in k_values:
knn = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=k))
])
knn.fit(X_train, y_train)
train_scores.append(knn.score(X_train, y_train))
test_scores.append(knn.score(X_test, y_test))
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, label='Train')
plt.plot(k_values, test_scores, label='Test')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Effect of K on Performance')
plt.legend()
plt.show()
best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Best Test Accuracy: {max(test_scores):.3f}")
k=1: Train=1.00, Test=0.91 ← Overfit (memorizes training data)
k=5: Train=0.97, Test=0.94 ← Good balance
k=21: Train=0.95, Test=0.95 ← Smoother, slightly better
k=100: Train=0.89, Test=0.87 ← Underfit (too smooth)
The pattern:
- Small k → complex decision boundary → overfit → high variance
- Large k → smooth decision boundary → underfit → high bias
- Optimal k is usually between 3 and 20; cross-validate to find it
Distance Metrics
The choice of distance metric dramatically affects results.
from sklearn.model_selection import cross_val_score
metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine', 'chebyshev']
for metric in metrics:
knn = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, metric=metric))
])
scores = cross_val_score(knn, X_train, y_train, cv=5)
print(f"{metric:12s}: {scores.mean():.3f} ± {scores.std():.3f}")
Distance metrics guide:
| Metric | Formula | Best For |
|---|---|---|
| Euclidean | √Σ(xᵢ-yᵢ)² | Continuous features, isotropic data |
| Manhattan | Σ | xᵢ-yᵢ |
| Minkowski | (Σ | xᵢ-yᵢ |
| Cosine | 1 - (x·y)/(‖x‖‖y‖) | Text data, angles matter more than magnitude |
| Hamming | Fraction of different positions | Categorical data |
Weighted KNN
Instead of giving equal weight to all k neighbors, weight by distance — closer neighbors have more influence.
# Uniform weights (all neighbors equal)
knn_uniform = KNeighborsClassifier(n_neighbors=7, weights='uniform')
# Distance weights (closer = more influence)
knn_distance = KNeighborsClassifier(n_neighbors=7, weights='distance')
# Custom weight function
def custom_weights(distances):
return 1 / (distances ** 2 + 1e-10) # Inverse squared distance
knn_custom = KNeighborsClassifier(n_neighbors=7, weights=custom_weights)
Distance weighting typically improves performance, especially with moderate k values.
KNN for Regression
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_reg = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsRegressor(n_neighbors=10, weights='distance'))
])
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
For regression, KNN averages the y-values of the k nearest neighbors.
The Curse of Dimensionality
KNN's biggest weakness: it breaks down in high-dimensional spaces.
2D: A point's "nearest neighbors" are genuinely nearby
10D: Nearest neighbors start getting farther away
50D: Everything is approximately equidistant from everything else
100D+: The concept of "nearest neighbor" becomes nearly meaningless
# Demonstrate curse of dimensionality
from scipy.spatial.distance import pdist
import pandas as pd
n_samples = 1000
results = []
for n_dims in [2, 5, 10, 50, 100, 500]:
data = np.random.randn(n_samples, n_dims)
distances = pdist(data[:100], metric='euclidean') # Use subset for speed
results.append({
'dimensions': n_dims,
'min_dist': distances.min(),
'max_dist': distances.max(),
'ratio': distances.max() / (distances.min() + 1e-10)
})
df = pd.DataFrame(results)
print(df)
# As dimensions increase, max/min ratio approaches 1
# All points become equidistant → KNN breaks down
Solutions:
- Use dimensionality reduction (PCA) before KNN
- Apply feature selection to remove irrelevant dimensions
- Use a different algorithm for high-dimensional data
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
param_grid = {
'knn__n_neighbors': [3, 5, 7, 11, 15, 21],
'knn__weights': ['uniform', 'distance'],
'knn__metric': ['euclidean', 'manhattan']
}
knn_pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier())
])
grid_search = GridSearchCV(
knn_pipeline, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print("Best params:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
print("Test accuracy:", grid_search.best_estimator_.score(X_test, y_test))
KNN vs Other Algorithms
| Aspect | KNN | Random Forest | SVM |
|---|---|---|---|
| Training time | O(1) | O(n log n) | O(n²-n³) |
| Prediction time | O(n) | O(log n) | O(support vectors) |
| Memory | O(n) — stores all data | O(n trees) | O(support vectors) |
| Accuracy | Good for small data | Excellent | Excellent |
| Interpretability | Medium | Medium | Low |
| Scalability | Poor (large n) | Good | Poor (large n) |
When to Use KNN
Good choice when:
- Dataset is small to medium (under 50K samples)
- You need a quick, interpretable baseline
- Recommendation systems (collaborative filtering)
- Anomaly detection (unusual if no nearby neighbors)
- Data has natural cluster structure
Avoid when:
- Large datasets (prediction is slow — O(n) per query)
- High-dimensional features (curse of dimensionality)
- Features on very different scales without scaling pipeline
- You need fast real-time predictions
KNN is excellent as a baseline and for specific use cases like recommendation systems, but for general ML tasks, Random Forest or gradient boosting will usually outperform it.
Next lesson: K-Means Clustering — discovering hidden structure in unlabeled data.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises