Follow AiTechWorlds on LinkedIn for professional AI content!Follow Now →
14 minLesson 15 of 31
Core Algorithms

K-Nearest Neighbors

K-Nearest Neighbors: Learning by Example

KNN is the most intuitive ML algorithm: to classify a new point, find the k most similar training examples and take a vote. No training phase, no assumptions about data distribution — just memory and distance.

The Algorithm

New point arrives: [height=175cm, weight=70kg]
                           │
           Find 5 nearest neighbors (k=5):
           ┌────────────────────────────────┐
           │ Person 1: [174, 72] → Class A  │
           │ Person 2: [177, 68] → Class A  │
           │ Person 3: [173, 71] → Class A  │
           │ Person 4: [176, 80] → Class B  │
           │ Person 5: [172, 65] → Class A  │
           └────────────────────────────────┘
                           │
           Majority vote: A=4, B=1
                           │
                    Predict: Class A

KNN is a lazy learner — it doesn't build a model during training. Instead, it memorizes the training set and does all computation at prediction time.

Implementing KNN

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import numpy as np
import matplotlib.pyplot as plt

# Classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN requires scaled features — distances are meaningless otherwise
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean'))
])

knn_pipeline.fit(X_train, y_train)
print(f"Accuracy: {knn_pipeline.score(X_test, y_test):.3f}")
print(classification_report(y_test, knn_pipeline.predict(X_test), 
                             target_names=cancer.target_names))

Choosing K: The Critical Hyperparameter

train_scores = []
test_scores = []
k_values = range(1, 50)

for k in k_values:
    knn = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, label='Train')
plt.plot(k_values, test_scores, label='Test')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Effect of K on Performance')
plt.legend()
plt.show()

best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Best Test Accuracy: {max(test_scores):.3f}")
k=1:   Train=1.00, Test=0.91  ← Overfit (memorizes training data)
k=5:   Train=0.97, Test=0.94  ← Good balance
k=21:  Train=0.95, Test=0.95  ← Smoother, slightly better
k=100: Train=0.89, Test=0.87  ← Underfit (too smooth)

The pattern:

  • Small k → complex decision boundary → overfit → high variance
  • Large k → smooth decision boundary → underfit → high bias
  • Optimal k is usually between 3 and 20; cross-validate to find it

Distance Metrics

The choice of distance metric dramatically affects results.

from sklearn.model_selection import cross_val_score

metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine', 'chebyshev']

for metric in metrics:
    knn = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=5, metric=metric))
    ])
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(f"{metric:12s}: {scores.mean():.3f} ± {scores.std():.3f}")

Distance metrics guide:

MetricFormulaBest For
Euclidean√Σ(xᵢ-yᵢ)²Continuous features, isotropic data
ManhattanΣxᵢ-yᵢ
Minkowskixᵢ-yᵢ
Cosine1 - (x·y)/(‖x‖‖y‖)Text data, angles matter more than magnitude
HammingFraction of different positionsCategorical data

Weighted KNN

Instead of giving equal weight to all k neighbors, weight by distance — closer neighbors have more influence.

# Uniform weights (all neighbors equal)
knn_uniform = KNeighborsClassifier(n_neighbors=7, weights='uniform')

# Distance weights (closer = more influence)
knn_distance = KNeighborsClassifier(n_neighbors=7, weights='distance')

# Custom weight function
def custom_weights(distances):
    return 1 / (distances ** 2 + 1e-10)  # Inverse squared distance

knn_custom = KNeighborsClassifier(n_neighbors=7, weights=custom_weights)

Distance weighting typically improves performance, especially with moderate k values.

KNN for Regression

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_reg = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=10, weights='distance'))
])

knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

For regression, KNN averages the y-values of the k nearest neighbors.

The Curse of Dimensionality

KNN's biggest weakness: it breaks down in high-dimensional spaces.

2D: A point's "nearest neighbors" are genuinely nearby
10D: Nearest neighbors start getting farther away
50D: Everything is approximately equidistant from everything else
100D+: The concept of "nearest neighbor" becomes nearly meaningless
# Demonstrate curse of dimensionality
from scipy.spatial.distance import pdist
import pandas as pd

n_samples = 1000

results = []
for n_dims in [2, 5, 10, 50, 100, 500]:
    data = np.random.randn(n_samples, n_dims)
    distances = pdist(data[:100], metric='euclidean')  # Use subset for speed
    results.append({
        'dimensions': n_dims,
        'min_dist': distances.min(),
        'max_dist': distances.max(),
        'ratio': distances.max() / (distances.min() + 1e-10)
    })
    
df = pd.DataFrame(results)
print(df)
# As dimensions increase, max/min ratio approaches 1
# All points become equidistant → KNN breaks down

Solutions:

  • Use dimensionality reduction (PCA) before KNN
  • Apply feature selection to remove irrelevant dimensions
  • Use a different algorithm for high-dimensional data

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 11, 15, 21],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

grid_search = GridSearchCV(
    knn_pipeline, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
print("Test accuracy:", grid_search.best_estimator_.score(X_test, y_test))

KNN vs Other Algorithms

AspectKNNRandom ForestSVM
Training timeO(1)O(n log n)O(n²-n³)
Prediction timeO(n)O(log n)O(support vectors)
MemoryO(n) — stores all dataO(n trees)O(support vectors)
AccuracyGood for small dataExcellentExcellent
InterpretabilityMediumMediumLow
ScalabilityPoor (large n)GoodPoor (large n)

When to Use KNN

Good choice when:

  • Dataset is small to medium (under 50K samples)
  • You need a quick, interpretable baseline
  • Recommendation systems (collaborative filtering)
  • Anomaly detection (unusual if no nearby neighbors)
  • Data has natural cluster structure

Avoid when:

  • Large datasets (prediction is slow — O(n) per query)
  • High-dimensional features (curse of dimensionality)
  • Features on very different scales without scaling pipeline
  • You need fast real-time predictions

KNN is excellent as a baseline and for specific use cases like recommendation systems, but for general ML tasks, Random Forest or gradient boosting will usually outperform it.

Next lesson: K-Means Clustering — discovering hidden structure in unlabeled data.

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →
!