An Introduction to K-Nearest Neighbors (KNN) Classification
Introduction:
K-nearest neighbors (KNN) is a well-known supervised machine learning technique used for classification. It is a non-parametric method that predicts the class of a new example from its closeness to labeled samples in the training data. This blog covers how KNN works, illustrated with a worked example, the role of K, and some of KNN's limitations.
Working of KNN:
KNN first calculates the distance between the new sample and every training sample, then keeps only the k training samples with the smallest distances. The new sample is assigned the class held by the majority of these k nearest neighbors. When two training samples from different classes lie at exactly the same distance from the new point and compete for the last neighbor slot, a tie rule is applied: the sample whose class matches the majority of the neighbors already selected is kept. The example below shows how this tie rule works. Euclidean distance is one of the most widely used distance metrics for KNN:
d((x1,y1),(x2,y2)) = √((x2 − x1)² + (y2 − y1)²), where d is the distance, (x1, y1) is a training sample, and (x2, y2) is the new sample.
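To make the procedure concrete, here is a minimal from-scratch sketch of the prediction step using NumPy and a simple majority vote. The function name knn_predict is illustrative rather than a library API, the data reuses the training samples from the example later in this post, and ties for the last neighbor slot are broken by sort order here rather than by the tie rule described above.

import numpy as np
from collections import Counter

def knn_predict(new_sample, train_samples, train_labels, k=3):
    # Euclidean distance from the new sample to every training sample
    distances = np.sqrt(((train_samples - new_sample) ** 2).sum(axis=1))
    # Indices of the k training samples closest to the new sample
    nearest = np.argsort(distances)[:k]
    # Majority vote over the classes of those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Training samples and labels from the worked example below
train_samples = np.array([[1, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
train_labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(np.array([2.5, 3.5]), train_samples, train_labels))  # prints A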
Role of K:
The value of K determines how many of the closest neighbors are considered when making a prediction for new data. Choosing a suitable K is essential because it directly affects the algorithm's performance: larger K values reduce the risk of overfitting but blur the decision boundary between classes, whereas smaller K values produce more flexible decision boundaries but raise the risk of overfitting.
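As a quick illustration of this effect, the sketch below classifies the same made-up query point, (3, 3.4), with different values of k using scikit-learn. With the training data from the example later in this post, the prediction flips from A at k=1 to B at k=5.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data from the worked example below
X_train = np.array([[1, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

# The same borderline point is classified differently as k grows:
# a small k follows its single closest neighbor, while a larger k
# averages over a wider neighborhood.
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.predict([[3, 3.4]]))  # k=1 -> A, k=5 -> B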
Limitations of KNN:
KNN can suffer from the “curse of dimensionality,” which makes distance estimates less meaningful in high-dimensional spaces. Data becomes sparser as the number of dimensions increases, which degrades KNN's performance.
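One quick way to see this effect is to measure how similar the nearest and farthest distances become as dimensionality grows. The sketch below uses synthetic uniformly random points; the sample size, dimensions, and seed are arbitrary choices for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# As the number of dimensions grows, the nearest and farthest points end up
# at nearly the same distance from the query, so the ratio climbs toward 1
# and "closeness" carries less and less information.
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(dim, round(dists.min() / dists.max(), 3))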
As discussed above, the value of k can substantially affect algorithm performance, so the ideal k must be chosen carefully. One common approach is cross-validation, which helps find the value of k that best balances bias and variance.
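A minimal sketch of this idea with scikit-learn's cross_val_score is shown below. The bundled Iris dataset and the range of k values scanned are stand-ins for demonstration, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Estimate accuracy for each candidate k with 5-fold cross-validation
# and keep the k that scores best on the held-out folds.
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k)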
Example:
Assume the following training samples with their classes:
Sample    Class
(1,2)     A
(2,3)     A
(3,3)     A
(3,4)     B
(4,4)     B
(4,5)     B
Based on the given training samples and classes, use KNN with k=3 to classify the new samples (2.5,3.5), (4,3.5), (5,3.5), and (1,1).
Solution:
Using Euclidean distance:
Sample 1: (2.5,3.5)
d((1,2),(2.5,3.5)) = √((2.5 − 1)² + (3.5 − 2)²) = √4.5 ≈ 2.121
d((2,3),(2.5,3.5)) ≈ 0.707
d((3,3),(2.5,3.5)) ≈ 0.707
d((3,4),(2.5,3.5)) ≈ 0.707
d((4,4),(2.5,3.5)) ≈ 1.581
d((4,5),(2.5,3.5)) ≈ 2.121
For k=3, the nearest neighbors of (2.5,3.5) are (2,3), (3,3), and (3,4). The majority belongs to category A, hence (2.5,3.5) is classified as A.
Sample 2: (4,3.5)
d((1,2),(4,3.5)) ≈ 3.354
d((2,3),(4,3.5)) ≈ 2.062
d((3,3),(4,3.5)) ≈ 1.118
d((3,4),(4,3.5)) ≈ 1.118
d((4,4),(4,3.5)) = 0.5
d((4,5),(4,3.5)) = 1.5
For k=3, the nearest neighbors of (4,3.5) are (4,4), (3,3), and (3,4). The majority belongs to category B, hence (4,3.5) is classified as B.
Sample 3: (5,3.5)
d((1,2),(5,3.5)) ≈ 4.272
d((2,3),(5,3.5)) ≈ 3.041
d((3,3),(5,3.5)) ≈ 2.062
d((3,4),(5,3.5)) ≈ 2.062
d((4,4),(5,3.5)) ≈ 1.118
d((4,5),(5,3.5)) ≈ 1.803
For k=3, the two nearest neighbors of (5,3.5) are (4,4) and (4,5), and (3,3) and (3,4) are tied for the third spot. Applying the tie rule (the majority of the neighbors already selected belongs to class B), (3,4) is chosen. The majority of the three neighbors belongs to category B, hence (5,3.5) is classified as B.
Sample 4: (1,1)
d((1,2),(1,1)) = 1.0
d((2,3),(1,1)) ≈ 2.236
d((3,3),(1,1)) ≈ 2.828
d((3,4),(1,1)) ≈ 3.606
d((4,4),(1,1)) ≈ 4.243
d((4,5),(1,1)) = 5.0
For k=3, the nearest neighbors of (1,1) are (1,2), (2,3), and (3,3). The majority belongs to category A, hence (1,1) is classified as A.
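Before turning to the full scikit-learn implementation, the distance tables and votes above can be double-checked with a short NumPy/SciPy snippet; cdist simply builds the entire distance matrix in one call.

import numpy as np
from scipy.spatial.distance import cdist

X_train = np.array([[1, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
X_new = np.array([[2.5, 3.5], [4, 3.5], [5, 3.5], [1, 1]])

# Row i of the matrix holds the distances from new sample i to every
# training sample, matching the hand-computed tables above.
distances = cdist(X_new, X_train)
print(np.round(distances, 3))

# Majority vote over the 3 smallest distances in each row
for row in distances:
    nearest = np.argsort(row)[:3]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    print(labels[np.argmax(counts)])  # prints A, B, B, A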
Python implementation:
KNeighborsClassifier is imported from scikit-learn, a widely used machine learning library. The classifier is initialized with the chosen number of neighbors, k=3, fit on the training samples, and then used to predict the classes of the new samples, which are finally plotted alongside the training data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# Training samples and their class labels
X_train = np.array([[1, 2], [2, 3], [3, 3], [3, 4], [4, 4], [4, 5]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
# Initialize the KNN classifier with k=3 and fit it on the training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict the classes of the new samples
X_new = np.array([[2.5, 3.5], [4, 3.5], [5, 3.5], [1, 1]])
predictions = knn.predict(X_new)
print(predictions)
category_colors = {'A': 'blue', 'B': 'red'}
# Plot training samples
for category in np.unique(y_train):
    indices = np.where(y_train == category)
    plt.scatter(X_train[indices, 0], X_train[indices, 1], color=category_colors[category], label='Category ' + category)
# Plot new samples
for category in np.unique(predictions):
    indices = np.where(predictions == category)
    plt.scatter(X_new[indices, 0], X_new[indices, 1], color=category_colors[category], marker='s', label='Predicted Category ' + category)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('KNN Classification')
plt.legend()
plt.show()
Python Output:
['A' 'B' 'B' 'A']
The printed predictions match the hand calculations above, and the script displays a scatter plot of the training points (circles) and the newly classified points (squares).
Conclusion:
This blog examined the principle of KNN and its use in classification tasks, along with the importance of the K value and how it affects the decision boundary. KNN assigns a new sample to the class held by the majority of its K nearest neighbors. Understanding the technique and experimenting with different K values can help you classify new samples efficiently based on their closeness to already labeled data.