K-Nearest Neighbors (KNN)
Algorithm*
Motivation
The k-nearest neighbor algorithm, commonly known as the KNN algorithm, is a simple yet effective supervised machine learning algorithm for classification and regression.
Classification is intuitive to human learning: since we were kids, our parents made sure we recognized objects, letters, and people correctly. This method is formally known as supervised learning. In terms of notation, supervised learning is a function that maps an input to an output based on example input-output pairs. As a parent myself, when I tell my then 2-year-old that a round object that bounces is a ball, ‘Round’ and ‘Bounce’ are the features (input) and ‘Ball’ is the label (output). This entire process can be called model training. Similarly, when the dependent variable/output is a continuous value rather than a category, it is known as a regression problem.
What is the K-Nearest Neighbors
(KNN) Algorithm?
The KNN
algorithm is a classical machine learning algorithm that focuses on the
distance from new unclassified/unlabeled data points to existing classified/labeled data points.
We already have labeled data points from our dataset. These data points belong to three categories, represented by red, green, and yellow colors as shown above. Along comes a new ‘alien’ data point, represented by the black cross mark, and we need to determine which of the three categories this new data point belongs to.
First, we choose a value for ‘k’. The k-value tells us the number of nearest points to look for from the new unlabelled data point; suppose we set k = 5. Next, we calculate the distance from the unlabelled data point to every data point on the graph and select the 5 shortest distances. Amongst those 5 nearest data points, it is evident that the new data point belongs to category red, as most of its nearest neighbors are from category red.
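A minimal sketch of this majority-vote step in Python (the labeled points and the k value here are made up purely for illustration):

import numpy as np
from collections import Counter

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5], [5.5, 8.5]])  # labeled points
y = np.array(['red', 'red', 'green', 'green', 'red', 'yellow'])                          # their categories
new_point = np.array([1.4, 1.6])                                                         # the 'alien' point
k = 5

distances = np.linalg.norm(X - new_point, axis=1)      # distance to every labeled point
nearest = np.argsort(distances)[:k]                    # indices of the k shortest distances
prediction = Counter(y[nearest]).most_common(1)[0][0]  # majority vote among the k neighbors
print(prediction)                                      # 'red'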
Similarly, in a regression problem, the aim is to predict a new data point’s value. On an x-y graph, each data point has two features: the x-axis represents feature-1 and the y-axis represents feature-2. We introduce a new data point for which only the feature-1 value is known, and we need to predict its feature-2 value. We start with a k-value of 5 and find the 5 nearest neighboring points to the new data point. The predicted feature-2 value for the new data point is then the mean of the feature-2 values of those 5 nearest neighbors.
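For the regression case, the only change from the sketch above is that the prediction is the mean of the neighbors’ values rather than a majority vote (again, the data here is invented for illustration):

import numpy as np

feature_1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # known feature-1 values
feature_2 = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8])   # known feature-2 values
new_f1 = 3.5                                                 # feature-1 of the new point
k = 5

distances = np.abs(feature_1 - new_f1)    # distance along the known feature
nearest = np.argsort(distances)[:k]       # k closest points
print(feature_2[nearest].mean())          # predicted feature-2 value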
When to use KNN?
1. For applications that require high accuracy.
2. When the dataset is small.
3. When the data is labeled correctly, and the predicted value will be among the given labels.
4. To solve regression, classification, or search problems.
Pros and Cons of Using KNN
Pros
· It is straightforward to implement since it requires only two parameters: the k-value and a distance function.
· A good value of k makes the algorithm robust to noise.
· It learns a nonlinear decision boundary.
· There are almost no assumptions on the given data.
· It is a non-parametric approach. No model fitting/training is required.
Cons
· Inefficient for large datasets, since the distance to every point needs to be calculated.
· The model is susceptible to outliers.
· It cannot handle imbalanced data on its own; this needs to be handled explicitly.
· If the dataset requires a large k value, the computational cost of the algorithm increases.
Math Behind KNN
The algorithm calculates the distance from the new data point to each existing data point. There are two common methods.
Minkowski Distance:
a) The Minkowski Distance is a generalized distance function in a normed vector space.

b) Special cases:
1. When p = 1, this is the Manhattan Distance.
2. When p = 2, this is the Euclidean Distance.
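In symbols, for two points x = (x1, …, xn) and y = (y1, …, yn), the Minkowski distance of order p is D(x, y) = ( Σi |xi − yi|^p )^(1/p).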
Euclidean distance
The Euclidean distance between two points in Euclidean space is the length of the line segment connecting them. We use the following formula:
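d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )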
Manhattan distance
The Manhattan distance calculation is similar to that of the Euclidean distance, with the difference being that we sum the absolute values of the differences instead of taking the square root of the sum of squared differences:
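d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|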
Fundamentally, the Euclidean distance represents “flying from one point to another,” while the Manhattan distance represents “traveling from one point to another” in a city, following the pathways or roads. The more common choice is the Euclidean distance, one reason being that it measures the straight-line distance between points in any number of dimensions, whereas the Manhattan distance measures movement only along horizontal or vertical directions.
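A quick illustration of the two metrics in NumPy (the two points are arbitrary examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line ("flying") distance
manhattan = np.sum(np.abs(a - b))           # axis-aligned ("city block") distance
print(euclidean, manhattan)                 # 7.07..., 12.0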
How to Choose the Right Value for K?
There is no standard statistical method to compute the optimal K value. We want to choose a K value that reduces errors. As we increase K, our predictions become more stable due to averaging or majority voting, but beyond a point the error rate rises again because the model underfits. A small K yields low bias and high variance (higher complexity), while a large K yields high bias and low variance. There are a few different methods to try: domain knowledge, cross-validation, or the square-root rule (see the sketch below).
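As a quick illustration, the square-root rule (a rough heuristic, not a guarantee) simply starts with k near the square root of the number of training samples:

import numpy as np

n_samples = 150                      # e.g., the size of the Iris dataset used below
k = int(round(np.sqrt(n_samples)))   # ~12; rounding to an odd k (11 or 13) helps avoid tied votes
print(k)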
Implementation of KNN in Python
We use the Iris dataset, which contains Iris flowers of three related species: Setosa, Versicolor, and Virginica. The observed features are sepal length, sepal width, petal length, and petal width.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn import metrics

# Load the data and build a DataFrame with the four features plus a 'target' column
iris_data = load_iris()
iris = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                    columns=iris_data['feature_names'] + ['target'])
The table below
shows the first 5 rows of data in a Pandas DataFrame. The target values are
0.0, 1.0, and 2.0, representing Setosa, Versicolor, and Virginica,
respectively.
Since KNN is sensitive to outliers and imbalanced data, we plot a count plot for the target variable. There are 50 samples of each flower type, so thankfully the data is perfectly balanced.
sns.countplot(x='target', data=iris)
plt.show()

for feature in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    sns.boxplot(x='target', y=feature, data=iris)
    plt.show()
Next, we split the data into training and testing sets to measure how accurate the model is. The model will be trained on the training set, a randomly selected 60% of the original data, and evaluated on the remaining 40%. Before splitting, it is essential to separate the features (independent variables) from the target (dependent variable).
X = iris.drop(['target'], axis=1)
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
We build the initial model with a k-value of 1, meaning only 1 nearest neighbor will be considered when classifying a new data point. Internally, the distance from the new data point to all the data points is calculated and sorted in ascending order, along with their respective classes. Since the k-value is 1, the class (target value) of the first instance in the sorted array determines the new data point’s class. As we can see, we obtain a decent accuracy score of 91.6%. However, the optimal k-value still needs to be selected.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

Output: 0.9166666666666666
To find the optimal k-value, we calculate the accuracy for k-values ranging from 1 to 25 and choose the one with the best score.
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

plt.plot(k_range, scores)
plt.xlabel('Value of k')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for different values of k')
plt.show()
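Note that the loop above scores each k on the single held-out test set. For a cross-validated estimate, one option (a sketch using the cross_val_score import from earlier, with an assumed 5-fold split) is:

scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 folds of the training data
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    scores.append(cv_scores.mean())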
Finally, we introduce a new unlabeled data point whose class (flower type) we need to predict. We build the model with a k-value of 11.
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(iris.drop(['target'], axis=1), iris['target'])

X_new = np.array([[1, 2.9, 10, 0.2]])
prediction = knn.predict(X_new)
print(prediction)

if prediction[0] == 0.0:
    print('Setosa')
elif prediction[0] == 1.0:
    print('Versicolor')
else:
    print('Virginica')

Output: [2.] Virginica
KNN Applications
From forecasting epidemics and the economy to information retrieval, recommender systems, data compression, and healthcare, the k-nearest neighbors (KNN) algorithm has become fundamental in such applications. KNN is primarily used for regression and classification tasks.
Conclusion
KNN is a highly
effective and easy-to-implement supervised machine learning algorithm that can
be used for classification and regression problems. The model works by calculating the distances between the point being predicted and a selected number, K, of its nearest examples.
For a
classification problem, the label becomes the majority vote of the nearest K
points. For a regression problem, the label becomes the average of the nearest
K points.
* = Source Credit : TowardsAI.net