K-Nearest Neighbors | Machine Learning Fundamentals | AiTechWorlds

K-Nearest Neighbors: Learning by Example

KNN is the most intuitive ML algorithm: to classify a new point, find the k most similar training examples and take a vote. No training phase, no assumptions about data distribution — just memory and distance.

The Algorithm

New point arrives: [height=175cm, weight=70kg]
                           │
           Find 5 nearest neighbors (k=5):
           ┌────────────────────────────────┐
           │ Person 1: [174, 72] → Class A  │
           │ Person 2: [177, 68] → Class A  │
           │ Person 3: [173, 71] → Class A  │
           │ Person 4: [176, 80] → Class B  │
           │ Person 5: [172, 65] → Class A  │
           └────────────────────────────────┘
                           │
           Majority vote: A=4, B=1
                           │
                    Predict: Class A

KNN is a lazy learner — it doesn't build a model during training. Instead, it memorizes the training set and does all computation at prediction time.

Implementing KNN

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import numpy as np
import matplotlib.pyplot as plt

# Classification
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN requires scaled features — distances are meaningless otherwise
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean'))
])

knn_pipeline.fit(X_train, y_train)
print(f"Accuracy: {knn_pipeline.score(X_test, y_test):.3f}")
print(classification_report(y_test, knn_pipeline.predict(X_test), 
                             target_names=cancer.target_names))

Choosing K: The Critical Hyperparameter

train_scores = []
test_scores = []
k_values = range(1, 50)

for k in k_values:
    knn = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(k_values, train_scores, label='Train')
plt.plot(k_values, test_scores, label='Test')
plt.xlabel('K (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('KNN: Effect of K on Performance')
plt.legend()
plt.show()

best_k = k_values[np.argmax(test_scores)]
print(f"Best K: {best_k}, Best Test Accuracy: {max(test_scores):.3f}")

k=1:   Train=1.00, Test=0.91  ← Overfit (memorizes training data)
k=5:   Train=0.97, Test=0.94  ← Good balance
k=21:  Train=0.95, Test=0.95  ← Smoother, slightly better
k=100: Train=0.89, Test=0.87  ← Underfit (too smooth)

The pattern:

Small k → complex decision boundary → overfit → high variance
Large k → smooth decision boundary → underfit → high bias
Optimal k is usually between 3 and 20; cross-validate to find it

Distance Metrics

The choice of distance metric dramatically affects results.

from sklearn.model_selection import cross_val_score

metrics = ['euclidean', 'manhattan', 'minkowski', 'cosine', 'chebyshev']

for metric in metrics:
    knn = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=5, metric=metric))
    ])
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(f"{metric:12s}: {scores.mean():.3f} ± {scores.std():.3f}")

Distance metrics guide:

Metric	Formula	Best For
Euclidean	√Σ(xᵢ-yᵢ)²	Continuous features, isotropic data
Manhattan	Σ	xᵢ-yᵢ
Minkowski	(Σ	xᵢ-yᵢ
Cosine	1 - (x·y)/(‖x‖‖y‖)	Text data, angles matter more than magnitude
Hamming	Fraction of different positions	Categorical data

Weighted KNN

Instead of giving equal weight to all k neighbors, weight by distance — closer neighbors have more influence.

# Uniform weights (all neighbors equal)
knn_uniform = KNeighborsClassifier(n_neighbors=7, weights='uniform')

# Distance weights (closer = more influence)
knn_distance = KNeighborsClassifier(n_neighbors=7, weights='distance')

# Custom weight function
def custom_weights(distances):
    return 1 / (distances ** 2 + 1e-10)  # Inverse squared distance

knn_custom = KNeighborsClassifier(n_neighbors=7, weights=custom_weights)

Distance weighting typically improves performance, especially with moderate k values.

KNN for Regression

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score, mean_squared_error

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_reg = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=10, weights='distance'))
])

knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

For regression, KNN averages the y-values of the k nearest neighbors.

The Curse of Dimensionality

KNN's biggest weakness: it breaks down in high-dimensional spaces.

2D: A point's "nearest neighbors" are genuinely nearby
10D: Nearest neighbors start getting farther away
50D: Everything is approximately equidistant from everything else
100D+: The concept of "nearest neighbor" becomes nearly meaningless

# Demonstrate curse of dimensionality
from scipy.spatial.distance import pdist
import pandas as pd

n_samples = 1000

results = []
for n_dims in [2, 5, 10, 50, 100, 500]:
    data = np.random.randn(n_samples, n_dims)
    distances = pdist(data[:100], metric='euclidean')  # Use subset for speed
    results.append({
        'dimensions': n_dims,
        'min_dist': distances.min(),
        'max_dist': distances.max(),
        'ratio': distances.max() / (distances.min() + 1e-10)
    })
    
df = pd.DataFrame(results)
print(df)
# As dimensions increase, max/min ratio approaches 1
# All points become equidistant → KNN breaks down

Solutions:

Use dimensionality reduction (PCA) before KNN
Apply feature selection to remove irrelevant dimensions
Use a different algorithm for high-dimensional data

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 11, 15, 21],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

grid_search = GridSearchCV(
    knn_pipeline, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print("Best params:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
print("Test accuracy:", grid_search.best_estimator_.score(X_test, y_test))

KNN vs Other Algorithms

Aspect	KNN	Random Forest	SVM
Training time	O(1)	O(n log n)	O(n²-n³)
Prediction time	O(n)	O(log n)	O(support vectors)
Memory	O(n) — stores all data	O(n trees)	O(support vectors)
Accuracy	Good for small data	Excellent	Excellent
Interpretability	Medium	Medium	Low
Scalability	Poor (large n)	Good	Poor (large n)

When to Use KNN

Good choice when:

Dataset is small to medium (under 50K samples)
You need a quick, interpretable baseline
Recommendation systems (collaborative filtering)
Anomaly detection (unusual if no nearby neighbors)
Data has natural cluster structure

Avoid when:

Large datasets (prediction is slow — O(n) per query)
High-dimensional features (curse of dimensionality)
Features on very different scales without scaling pipeline
You need fast real-time predictions

KNN is excellent as a baseline and for specific use cases like recommendation systems, but for general ML tasks, Random Forest or gradient boosting will usually outperform it.

Next lesson: K-Means Clustering — discovering hidden structure in unlabeled data.