Tuesday, March 2, 2021
Supervised Learning: KNN Algorithm
K-Nearest Neighbors (KNN) Algorithm*
Motivation
The k-nearest neighbor algorithm, commonly known as the KNN algorithm, is a simple yet effective supervised machine learning algorithm for classification and regression.
Classification comes naturally to human learning. Since we were kids, our parents made sure we recognized objects, letters, and people correctly. This method is formally known as supervised learning: a function that maps an input to an output based on example input-output pairs.
As a parent myself, when I tell my then two-year-old that a round object that bounces is a ball, 'Round' and 'Bounce' are the features (input) and 'Ball' is the target variable (output). This entire process can be called model training. Similarly, when the dependent variable/output is a continuous value, it is known as a regression problem.
What is the K-Nearest Neighbors (KNN) Algorithm?
The KNN algorithm is a classical machine learning algorithm that focuses on the distance from new unclassified/unlabeled data points to existing classified/labeled data points.
We already have labeled data points from our dataset. These data points belong to three categories represented by red, green, and yellow colors as shown above
Along comes a new ‘alien’ data point represented by the black cross mark and we need to determine which category this new data point belongs to from the three colors.
First, we choose a value for 'k'. The k-value specifies the number of nearest neighbors to consider for the new unlabeled data point. Next, we calculate the distance from the unlabeled data point to every labeled data point on the graph and, with k = 5, select the five shortest distances.
Among these five nearest data points, it is evident that the new data point belongs to category red, as most of its nearest neighbors are from category red.
Similarly, in a regression problem, the aim is to predict a new data point's value. On an x-y graph, each data point has two features: the x-axis represents feature-1, and the y-axis represents feature-2.
We introduce a new data point for which only the feature-1 value is known, and we need to predict the feature-2 value. We start with a k-value of 5 and find the 5 nearest neighboring points to the new data point. The predicted feature-2 value for the new data point is the mean of the feature-2 values of its 5 nearest neighbors.
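To make the idea concrete, here is a minimal NumPy sketch of the prediction step described above, using made-up toy data and a hypothetical helper name (knn_predict); it only illustrates the logic and is not the scikit-learn implementation used later in this post.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5, task='classification'):
    # Euclidean distance from the new point to every labeled point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    if task == 'classification':
        # Majority vote among the k nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: mean of the k nearest target values
    return y_train[nearest].mean()

# Toy example: 6 labeled points with 2 features; classify a new point
X_train = np.array([[1, 2], [2, 1], [2, 3], [8, 8], [9, 7], [8, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0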
When to use KNN?
1. When the application requires high accuracy.
2. When the dataset is small.
3. When the data is labeled correctly, and the predicted value will be among the given labels.
4. When the task is a regression, classification, or search problem.
Pros and Cons of Using KNN
Pros
· It is straightforward to implement since it requires only two parameters: the k-value and a distance function.
· A good value of k will make the algorithm robust to noise.
· It learns a nonlinear decision boundary.
· There are almost no assumptions on the given data.
· It is a non-parametric approach. No model fitting/training is required.
Cons
· Inefficient for large datasets, since the distance to every data point needs to be calculated.
· The model is susceptible to outliers.
· It cannot handle imbalanced data well; this needs to be handled explicitly.
· If the dataset requires a large k-value, the computational cost of the algorithm increases.
Math Behind KNN
The algorithm calculates the distance from the new data point to each existing data point. There are two common methods.
Minkowski Distance:
a) The Minkowski distance is a generalized distance function in a normed vector space.
b) Special cases:
1. When p=1, it is the Manhattan distance.
2. When p=2, it is the Euclidean distance.
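For two points x and y with n features, the Minkowski distance can be written (in LaTeX notation) as:
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}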
Euclidean distance
The Euclidean distance between two points in Euclidean space is the length of the line segment between the two points. We use the following formula:
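In the same notation as above:
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}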
Manhattan distance
The Manhattan distance calculation is similar to that of the Euclidean distance, except that we sum the absolute differences instead of taking the square root of the sum of squared differences.
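In the same notation, the Manhattan distance is:
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|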
Fundamentally, the Euclidean distance represents "flying from one point to another," while the Manhattan distance represents "traveling from one point to another" in a city, following the pathways or roads.
The more common choice is the Euclidean distance. One reason is that Euclidean distance measures the straight-line separation in any direction, whereas Manhattan distance only measures movement along axis-aligned (vertical or horizontal) paths.
How to Choose the Right Value for K?
There is no standard statistical method to compute the optimal k-value. We want to choose a k-value that reduces errors. As we increase k, predictions become more stable due to averaging or majority voting, but beyond a point the error rate rises again because the model underfits. A small k yields low bias and high variance (higher complexity), while a large k yields high bias and low variance. There are a few different methods to try: domain knowledge, cross-validation, or the square root of the number of samples.
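As a sketch of the cross-validation approach (an illustration, not code from the original post), one could score candidate k-values with scikit-learn's cross_val_score and keep the best:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv_scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this k
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])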
Implementation of KNN in Python
We use the Iris dataset of Iris flowers of three related species: Setosa, Versicolor, and Virginica. The observed features are sepal length, sepal width, petal length, and petal width.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.datasets import load_iris

# Load the Iris dataset as a Pandas DataFrame (features plus a 'target' column)
iris = load_iris(as_frame=True).frame
The table below shows the first 5 rows of data in a Pandas DataFrame. The target values are 0.0, 1.0, and 2.0, representing Setosa, Versicolor, and Virginica, respectively.
Since KNN is sensitive to outliers and imbalanced data, we plot a count plot for the target variable: there are 50 samples of each flower type, so thankfully the data is perfectly balanced.
sns.countplot(x='target', data=iris)
plt.show()

for feature in ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']:
    sns.boxplot(x='target', y=feature, data=iris)
    plt.show()
Next, we split the data into training and testing sets to measure how accurate the model is. The model is trained on the training set, a randomly selected 60% of the original data (a 60:40 split). Before splitting, it is essential to separate the feature/independent variables from the target/dependent variable.
X = iris.drop(['target'], axis=1)
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
We build the initial model with a k-value of 1, meaning only 1 nearest neighbor will be considered for classifying a new data point. Internally, the distances from the new data point to all the data points are calculated and sorted in ascending order, along with their respective classes. Since the k-value is 1, the class (target value) of the first instance in the sorted array determines the new data point's class. As we can see, we obtain a decent accuracy score of 91.6%. However, the optimal k-value still needs to be selected.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
Output: 0.9166666666666666
To find the optimal k-value, we calculate the accuracy for k-values ranging from 1 to 25 (here measured on the held-out test set) and choose the best-performing value.
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

plt.plot(k_range, scores)
plt.xlabel('Value of k')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for different values of k')
plt.show()
Finally, we introduce a new unlabeled data point and predict its class, i.e., which flower type it belongs to. We build the model with a k-value of 11.
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(iris.drop(['target'], axis=1), iris['target'])
X_new = np.array([[1, 2.9, 10, 0.2]])
prediction = knn.predict(X_new)
print(prediction)
if prediction[0] == 0.0:
    print('Setosa')
elif prediction[0] == 1.0:
    print('Versicolor')
else:
    print('Virginica')
Output: [2.] Virginica
KNN Applications
From forecasting epidemics and the economy to information retrieval, recommender systems, data compression, and healthcare, the k-nearest neighbors (KNN) algorithm has become fundamental to many applications. KNN is primarily used for regression and classification tasks.
Conclusion
KNN is a highly effective and easy-to-implement supervised machine learning algorithm that can be used for classification and regression problems. The model works by calculating the distances to a selected number, K, of the examples nearest to the point being predicted.
For a classification problem, the label becomes the majority vote of the nearest K points. For a regression problem, the label becomes the average of the nearest K points.
* Source Credit: TowardsAI.net