// shared: regression & classification

K Nearest Neighbour

K Nearest Neighbour, commonly known as KNN, is an instance-based learning algorithm, and unlike linear or logistic regression where mathematical equations are used to predict the values, KNN is based on instances and doesn’t have a mathematical equation. As there is no mathematical equation, it doesn’t have to presume anything, such as the distribution of the data being normal, etc., and thus is a non-parametric way of predicting. It is a supervised non-linear method used to solve classification as well as regression problems.

Idea Behind KNN

Let us understand KNN with an example. Suppose we have a dataset where the Y variable has two classes - Squares and Rounds. There is an additional unknown point (black triangle) and we want to know which class it belongs to.

Dataset with two classes, red squares and blue rounds, and an unknown black triangle point near the round cluster — An unknown point (black triangle) needs to be classified as either a square or a round.

If the same dataset is shown to a child, then unlike computing a prior as done by Naive Bayes, a hyperplane as done by logistic regression, or computing information gain as done by decision trees, the child will simply look at the points that are close to the unknown point and will predict the point to be of that class. This is something that KNN does, where it classifies a data point based on its proximity to the training examples.

Voronoi cell

If we have a dataset that looks something like shown below, where the data points of different classes are in more proximity than what was shown earlier.

Denser dataset with red squares and blue rounds in closer proximity to each other — A denser dataset where points of different classes are closer together.

If we set k=1 for our K’s nearest neighbour algorithm, then for each new data point, the nearest training example is found and its class is considered for prediction. To understand this properly we have to first understand what Voronoi cells mean. See below how a blue data point has a region around it, designated as blue, so any point that lies in that area/neighbourhood will be classified as blue. Thus an entire cell is coloured ‘blue’ on the basis of one data point. This blue region forms a Voronoi cell.

A single blue polygonal Voronoi cell formed around one blue data point — The shaded polygon is the Voronoi cell belonging to the highlighted blue point.

Role of the value of K

In K’s nearest neighbour, each of these data points will have their Voronoi cell, which is collectively called Voronoi Tesselation. We now can come up with a decision boundary which passes in such a way that it differentiates one class of data points from the other. However, K’s nearest neighbour, even being a very simple algorithm, comes up with a decision boundary that is more complex than what Naive Bayes and Logistic Regression come up with, and is almost at par with decision trees.

Voronoi tessellation of all data points with the resulting decision boundary highlighted in red — The Voronoi Tessellation with the resulting decision boundary highlighted in red.

To understand why we use k=n and don’t just stick with k=1 for prediction, we have to first go through some of the limitations of KNN, which is that it is very sensitive to outliers and noise, as it can end up shifting the decision boundary very dramatically. For example, if we have a blue data point on the extreme left which is there due to noise/outlier, then the decision boundary will shift drastically, and the whole area around that data point will be labelled as blue while it is clearly in a region dominated by red data points.

The same Voronoi tessellation with the decision boundary dramatically shifted to claim a region around an isolated outlier point — A single noisy/outlier point can dramatically shift the decision boundary.

Also, unlike Naive Bayes, it doesn’t take into account the prominence of a class and doesn’t have any parallel to a ‘prior’ to see the frequency of one class compared to another. Thus it generates no confidence value. However, if there are data points belonging to a particular class which are overwhelmingly in dominance, then it will affect the majority voting, and to counter this, the value of k has to be tweaked, whereby increasing the number of k we are able to add more stability to our model.

Thus we now look at the way the value of k affects the whole game of KNN. As k=1 will make the decision boundary very sensitive to noise and will end up making the model very complex and unstable. Any small change in the training data will greatly alter the decision boundary.

We can increase the number of k, for example, to 20. Now what we mean by k=20 is that for a given unknown data point, it will look at more than 1 nearest neighbour, i.e. now it will look at 20 neighbours to predict its class. Here the dominant class will be considered for prediction, while the methods of considering a class as dominant can vary and will be discussed further.

Increasing k, as discussed earlier, will make the model less complex; however, if we take a large enough value of k, say k = no. of samples, then it will simply classify on the basis of the class that occurs most commonly, converting the model into a prior classifier. Thus a large value of k will make the model underfit. To find the right value of k we use grid search along with resampling measures such as cross-validation.

If we take a look at the figure below, we can see how, with a lower value of k, the model becomes a victim of overfitting and is very complex to effectively generalise, while increasing the k smoothens the decision boundary and makes the model more stable.

Side by side decision boundary plots for K=1 and K=20, with K=1 producing a highly complex jagged boundary and K=20 producing a much smoother boundary — The decision boundary is much less complex for a higher value of k.

Working of KNN for a Classification Problem v/s Regression Problem

Classification Problem

We have the below data with two classes - Red and Green - and we have to find the class of the Yellow Box.

Red and green class data points with an unknown orange point marked with a question mark — The unknown point (orange) needs to be classified as red or green.

In a classification problem, we take the point that we want to classify and start computing the distance from the unknown data point to every other data point of the training dataset. Now we pick k number of data points that are closest to the unknown data point, and based on the most frequent class, assign the label to the unknown data point. Thus here we use a majority voting system to decide the class of the data point.

Four panels showing the nearest neighbours considered for K=1, K=3, K=7, and K=13 around the unknown orange point — The number of data points considered is based on the value of k.

We can see in the above figure that when k was equal to 1, the nearest data point to the yellow square was red, and thus it would have been labelled as a red data point. Similarly, when we increased the value of k to 3, then it got the 3 data points nearest to it, out of which 2 were red and 1 was green. When k was 7, we got 4 red and 3 green, while for k=13 we had 9 red and 4 green data points.

Regression Problem

If the labels that are to be predicted are continuous, then KNN Regression is used.

A smooth true function curve with five discrete observed data points plotted along it — A 1-dimensional regression problem with the true function (unknown to us) and the observations we have.

Here also we take the unknown point and compute distances to all the data points and find k nearest data points, and this time, rather than taking a majority vote, we average the value of the Y variable that the nearest data points denote, thus computing the mean of the labels of the k nearest neighbours, and this mean is used as the prediction. Above, we have the 1-dimensional representation of a regression problem, and we have to predict the true function, which is unknown to us. We now try to predict this function from the observations we have.

On the x-axis, we have the values that we want to predict, and on the Y axis we have the various numbers of the Y variable. Now for a given input of 4, if we have to predict the Y, then we look on the x-axis at the point nearest to it (we don’t look at the y-axis, as it represents the Y variable, which is not the basis for prediction; we are predicting Y on the basis of a value of x). We find that the nearest point to 4 is at 3.2, so if we are using k=1 in the algorithm, then this point will be considered and its corresponding Y label will be used to predict the value. Thus for a given value of 4, the Y will be predicted as 6.

However, if we use k=2 in the algorithm, then we consider the two closest values of x, which are 3.2 and 5, and we take an average of their corresponding Y variable, which comes out to be 4.5 ((6 + 3) ÷ 2 = 4.5). Thus for a value of x=4 with k=2, the value of Y will be 4.5. We can do the same with various values of k and settle with the value of k that gives us the prediction which is closest to the true function.

Three panels showing the predicted value at x=4 for K=1, K=2, and K=3 — Predicted values based on different values of k.

Line chart of accuracy against number of neighbours, peaking sharply at K=1 and declining thereafter — Accuracy v/s Number of Neighbours.

We can see that k=2 gives us the best result. Thus we can see that finding the right value of k is very important, as with a very low value of k, the model becomes complex. On the other hand, a very high value of k, say where k = total number of samples, then the output will be the arithmetic mean of the Y variables.

Three panels comparing the true function against the predicted function for increasing values of k, from overfit to smoother — True function v/s Predicted function for different values of k.

As mentioned earlier, we can use grid search to test various values of k and find the k that provides the best accuracy.

Line chart comparing accuracy on the training dataset versus accuracy on the test dataset across different values of k — Accuracy on Training Dataset v/s Accuracy on Test Dataset for different values of k.

Methods of Improvement

Till now we have only discussed the value of k as a hyper-parameter that can be tweaked to give us better results. However, there is a lot of scope when dealing with KNN, as there are many parameters and other methods of improvement that can be used to provide us with better results. Some of these methods are: using different distance metrics, alternatives to majority voting, techniques to approximate nearest neighbour, resolving ties, rescaling data, missing value and outlier treatment, and dimensionality reduction.

Distance Metrics

Depending upon the type of features KNN is working with, adjusting the methods of calculating the distance can greatly improve the results produced by KNN. There are various distance metrics other than the usual Euclidean Distance used so far, such as Hamming Distance, Minkowski Distance, etc. You can refer to the blog on KNN Distance Metrics for more information.

Alternatives to Majority Voting

So far the method of voting was majority voting; however, this method can have its own set of drawbacks, because of which other alternatives to majority voting can be used, which can help in improving the results produced by the algorithm. For more information, refer to KNN Alternatives to Majority Voting.

Techniques to Approximate Nearest Neighbour

One of the biggest drawbacks of KNN is that it is very slow in the testing phase, which limits its usage. The main reason for this is that it considers all of the data points when approximating the nearest neighbours, which makes the whole process very slow. There are methods such as K-D Trees, LSH, and Inverted List that can be used to make the whole process faster without losing much accuracy and reducing reliability, and have been explored in KNN - Techniques to Approximate Nearest Neighbours.

Resolving Ties

If we have the same number of votes for different classes, then it becomes difficult to select one class for predicting the unknown data point. When the classes are binary, then an even value of k should be avoided, as an even value of k can lead to ties. So values of k should be odd (1, 3, 5, 7, 9 . . .). However, this technique cannot work with a multiple class problem, and for such cases, we should avoid the value of k from being a multiple of the number of classes. In the event of ties, we can either set the value of k to 1 or can pick a class randomly. A more robust way will be to use a prior, where the class that has an overall high frequency is considered for prediction.

Rescaling Data

Data, when being used for a distance-based model, must be rescaled. For example, we have two features: height of the person in metres and weight of a person in pounds. Here, as the absolute value of weight will be more, the distance metric will be skewed. Therefore, various rescaling methods should be used to rescale the features before using them for KNN. To know about the different scaling methods, refer to Feature Scaling.

Missing Value and Outlier Treatment

Missing and extreme values are very problematic for KNN. All the models that use distance metrics can get negatively affected by outliers, and therefore outlier treatment must be done on the data before using it for a KNN model. Similarly, missing value treatment must be done on the data so that KNN can be performed properly. Refer to Missing Value Treatment and Outlier Treatment for more information.

Dimensionality Reduction

If there are a lot of unnecessary variables (causing multicollinearity), then, unlike decision trees, where it ignores such variables and doesn’t get affected, KNN gets negatively affected, as it gives equal importance to each feature. Thus if the data is in very high dimensions, then KNN doesn’t work properly; therefore, feature reduction methods must be applied to make the distance metric more meaningful and the model more robust.

Pros and Cons

Like all the other algorithms, KNN has its own advantages and limitations.

Pros

It requires no assumption, i.e. it is a non-parametric method, thus we don’t fit any distribution. It can be used for regression as well as for binary and multiclass classification problems. The time taken by KNN in the training phase is much lesser when compared to other models, and is much faster for generating such complex results. And, obviously, it is very easy to understand and is simple to implement.

Cons

It doesn’t work properly if the data has missing values, outliers, too many useless features, or if it is not scaled. Also, the biggest problem of KNN is that, unlike ANN, which is slow at training and fast at testing, KNN is relatively fast at the training phase when compared to the testing phase, where it is extremely slow. This happens because, unlike other models where some function is learnt and is then applied to the testing set, KNN memorises all the instances of the training set. All these instances can be expressed as n. In the testing phase, then, each of these testing samples is compared with n. Thus this makes the process very computationally intensive. Also, as the size of the training data becomes larger, the training phase becomes slower. If using majority voting, then the results can become skewed, as the classes that are frequent in the data will dominate the majority voting process. Thus other methods, such as weighted voting, may be required.

K Nearest Neighbour is a very simple and powerful technique that provides us with a lot of options to generalise our model. It can be used for Classification as well as Regression problems. It faces the problem of being vulnerable to overfitting and being slow; however, there are various methods to go around it. All the hyperparameters discussed in this article must be played with to come up with a KNN model that suits your data and yields good results.