
K-Nearest Neighbors Algorithm in Python and Scikit-Learn - ScottWRobinson
http://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
======
0800
> As said earlier, it is a lazy learning algorithm and therefore requires no
> training prior to making real-time predictions. This makes the KNN algorithm
> much faster than other algorithms that require training, e.g. SVM, linear
> regression, etc.

Linear regression is way faster than KNN once the dataset grows beyond toy
size, for both training and especially for prediction. In practical
applications, prediction speed often trumps training speed.
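
A rough timing sketch of that point (synthetic data, logistic regression
standing in as the eager learner since this is classification): KNN's fit()
is nearly free, but it pays the cost at predict() time.

    # Sketch only: toy synthetic dataset, illustrative parameters.
    import time
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
    X_test = X[:1000]

    for model in (KNeighborsClassifier(n_neighbors=5),
                  LogisticRegression(max_iter=1000)):
        model.fit(X, y)                        # "training" is trivial for KNN
        start = time.perf_counter()
        model.predict(X_test)                  # KNN pays the cost here instead
        print(type(model).__name__, time.perf_counter() - start)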

> The KNN algorithm doesn't work well with high dimensional data because with
> a large number of dimensions, it becomes difficult for the algorithm to
> calculate distance in each dimension.

KNN works fine on high-dimensional text: from something as simple as Hamming
distance on binary token vectors, to Euclidean distance on TF-IDF, to cosine
distance on 900-dimensional word vector aggregates.
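
A minimal sketch of KNN on high-dimensional text, assuming a made-up toy
corpus: TF-IDF features with cosine distance via scikit-learn.

    # Sketch only: the corpus and labels below are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    docs = ["cheap pills buy now", "meeting agenda attached",
            "win a free prize", "quarterly report draft"]
    labels = ["spam", "ham", "spam", "ham"]

    clf = make_pipeline(
        TfidfVectorizer(),        # tens of thousands of dimensions on a real corpus
        KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    )
    clf.fit(docs, labels)
    print(clf.predict(["free pills prize"]))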

> There are only two parameters required to implement KNN i.e. the value of K
> and the distance function (e.g. Euclidean or Manhattan etc.)

Also implement distance weighting (you probably want to weight the 1st nearest
neighbor's label higher than the 5th nearest neighbor's label).
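
In scikit-learn that is the weights parameter; a quick sketch (iris used here
only as a stand-in dataset):

    # Sketch: weights="distance" makes closer neighbors count more in the vote.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
    print(cross_val_score(knn, X, y, cv=5).mean())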

> The KNN algorithm has a high prediction cost for large datasets. This is
> because in large datasets the cost of calculating the distance between a new
> point and each existing point becomes higher.

This is why you "fit" something like a K-D tree during training.
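
A sketch of what that looks like in scikit-learn (synthetic data, parameters
are illustrative only):

    # Sketch: build a K-D tree at fit() time so queries in low dimensions
    # don't have to scan every training point.
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=100000, n_features=10, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
    knn.fit(X, y)                  # tree construction happens here
    print(knn.predict(X[:5]))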

> Finally, the KNN algorithm doesn't work well with categorical features since
> it is difficult to find the distance between dimensions with categorical
> features.

Hamming distance works fine on one-hot encoded categorical features. If not,
embed the categories or reduce the dimensionality. If not, use feature
selection first and do KNN on the top 10-20% of features. Remember that you
don't have to use the same distance measure for every feature column.
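
A sketch of the one-hot + Hamming approach, with made-up categorical data:

    # Sketch: one-hot encode the categories, then Hamming distance simply
    # counts mismatching positions between two encoded rows.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import OneHotEncoder

    X_cat = np.array([["red", "small"], ["red", "large"],
                      ["blue", "small"], ["blue", "large"]])
    y = [0, 0, 1, 1]

    X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()
    knn = KNeighborsClassifier(n_neighbors=3, metric="hamming")
    knn.fit(X_onehot, y)
    print(knn.predict(X_onehot[:1]))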

> If one of the features has a broad range of values, the distance will be
> governed by this particular feature. Therefore, the range of all features
> should be normalized so that each feature contributes approximately
> proportionately to the final distance.

You can skip this step (and feature selection) by learning an additional
weight for each feature to multiply in before computing the distance. It is
rare for each feature to contribute proportionately to the target.
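
A sketch of scaling plus a crude per-feature weighting; the weight values
here are hypothetical placeholders, not learned:

    # Sketch: scale features so no single column dominates the distance,
    # then stretch distances along chosen columns by multiplying in weights.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Plain scaling via a pipeline
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    pipe.fit(X, y)

    # Hypothetical per-feature weights applied after scaling
    w = np.array([0.5, 0.5, 2.0, 2.0])
    X_weighted = StandardScaler().fit_transform(X) * w
    KNeighborsClassifier(n_neighbors=5).fit(X_weighted, y)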

~~~
srean
All of your claims call for citations. It's not that they are untrue, but they
are not as sweepingly true as they might seem to an uninitiated reader. For
example, you can try a K-D tree on a 50K-dimension dataset and judge for
yourself.
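
A sketch of that experiment, scaled down to 1,000 dimensions so it finishes
in reasonable time; the point is how little the K-D tree buys you over brute
force once the dimensionality is high:

    # Sketch only: random data, illustrative sizes.
    import time
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.random.rand(5000, 1000)        # many dimensions, modest row count

    for algo in ("kd_tree", "brute"):
        nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
        start = time.perf_counter()
        nn.kneighbors(X[:100])            # query 100 points against all 5000
        print(algo, time.perf_counter() - start)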

~~~
0800
I just tried this with 50k dimensions and 2 rows and it worked fine.

------
gcmac
Very informative and well written article about KNN classification. However,
as a data scientist it always pains me to see the iris data set being used. It
is linearly separable and gives no indication of whether or not you actually
want to use the given methodology on your problem since almost every technique
can achieve these results on this data. I'd recommend using something from
kaggle or even the UCI repository to make these types of articles even more
useful!

~~~
ScottWRobinson
Author here. You have a very good point. I tend to default to using Iris for
articles like these because of its simplicity, ease of setup, etc., but you're
right that it isn't as informative in showing readers the algorithm's
capability. I'll have to try out some different datasets for upcoming articles
:)

~~~
workhn
+1

Recommend the wine data set or the PIMA Indian diabetes dataset.
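
For reference, the wine dataset ships with scikit-learn, so it's a drop-in
replacement for iris in examples like this; a quick sketch:

    # Sketch: load the wine dataset and score a scaled KNN on it.
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    print(cross_val_score(knn, X, y, cv=5).mean())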

------
DrWumbo
A very informative write-up. I believe that KNN is a great intro project for
anyone getting into machine learning. I recently wrote an MNIST KNN classifier
with numpy and pandas; this post certainly would have been helpful.
[https://github.com/ShahZafrani/machineLearningPractice/blob/...](https://github.com/ShahZafrani/machineLearningPractice/blob/master/knn/knn_classifier.ipynb)
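
For anyone curious, a rough numpy-only KNN along those lines (not taken from
the linked notebook): brute-force distances plus a majority vote.

    # Sketch only: toy points invented for illustration.
    import numpy as np

    def knn_predict(X_train, y_train, X_test, k=5):
        preds = []
        for x in X_test:
            dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean to every row
            nearest = np.argsort(dists)[:k]               # indices of k closest
            preds.append(np.bincount(y_train[nearest]).argmax())  # majority vote
        return np.array(preds)

    X_train = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5, 5.5]]), k=3))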

------
poster123
Instead of finding the K nearest neighbors and averaging their y values, which
is effectively fitting a 0th order (constant) model, I wonder why fitting a
multiple linear regression using those neighbors is not done. Since a linear
model can approximate a function over a wider range than a constant can, one
can use a larger value of K.

I know that local linear regression is often used in one dimension, but I
wonder if it should be used more often when there are multiple predictors.
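
A sketch of that idea on synthetic data: find the K nearest neighbors, then
fit a small linear regression on just those points instead of averaging their
y values.

    # Sketch only: synthetic regression data, k chosen arbitrarily.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

    k = 30
    nn = NearestNeighbors(n_neighbors=k).fit(X)

    def local_linear_predict(x_query):
        # fit a linear model on only the k nearest training points
        _, idx = nn.kneighbors(x_query.reshape(1, -1))
        local = LinearRegression().fit(X[idx[0]], y[idx[0]])
        return local.predict(x_query.reshape(1, -1))[0]

    print(local_linear_predict(np.array([5.0, 5.0, 5.0])))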

------
skyisblue
I'm new to ML and currently looking at building a recommender system using KNN
on a site, based on what users have read. I'm struggling to select the optimum
feature set. Should I use the article ids or tags as the features? Also, how
do I deal with the curse of dimensionality given potentially thousands of
articles or tags?

