
ML Basics: K Nearest Neighbors in Ruby - foob
http://www.thagomizer.com/blog/2017/09/13/ml-basics-k-nearest-neighbors.html
======
apathy
Just a thought:

When Cover & Hart proved that the error for k-NN classification is no worse
than twice the Bayes (optimal) error, "machine learning" as a phrase had not
yet been observed in the wild.

[http://ieeexplore.ieee.org/document/1053964/](http://ieeexplore.ieee.org/document/1053964/)
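For reference, the two-class form of the bound as I remember it (see the paper for the general M-class statement), where R^* is the Bayes risk and R_NN is the asymptotic risk of the nearest-neighbor rule:

    R^* \le R_{\mathrm{NN}} \le 2R^*(1 - R^*) \le 2R^*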

EE, CS, stats -- these are your fundamentals...

~~~
taeric
This feels like a more significant result than the attention it gets would suggest. I'm guessing
the problem then comes down solely to defining a distance metric that you can
easily/quickly evaluate? Or is this merely an upper bound, and do many folks do
markedly better nowadays?

~~~
highd
The result holds in the large-sample limit, which is pretty much never reached
for high-dimensional datasets like the ones most popular in ML these days
(images, audio, text). It doesn't mean what the parent thinks it means.

~~~
apathy
You are proposing that a reduced-dimensional projection of a large dataset
cannot approach this limit?

I.e., expose the underlying low rank of nearly any huge sparse data matrix with
an SVD or NMF, enable fast recovery with a shitty (CS-wise) hash function, and
recover most of the information about an observation's neighbors in a fraction
of the time taken by many other approaches.

What's popular for ML benchmarking these days is not necessarily the same as
what's needed for a specific application. It's a useful proof to keep in mind
before prematurely optimizing with overly complicated approaches.
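A minimal sketch of that pipeline in plain Ruby, for the curious. Random sign-bit
projections stand in here for a proper SVD/NMF step, and the bucket table plays the
part of the "shitty (CS-wise) hash function"; all names and parameters are
illustrative, not from any library:

    class LshIndex
      def initialize(dim, bits: 8)
        # One random hyperplane (through the origin) per hash bit.
        @planes  = Array.new(bits) { Array.new(dim) { rand - 0.5 } }
        @buckets = Hash.new { |h, k| h[k] = [] }
      end

      # The sign of the dot product with each hyperplane gives one bit of the
      # key, so nearby vectors tend to land in the same bucket.
      def key(v)
        @planes.map { |p| p.zip(v).sum { |a, b| a * b } >= 0 ? "1" : "0" }.join
      end

      def add(v)
        @buckets[key(v)] << v
      end

      def candidates(q)
        @buckets[key(q)]
      end
    end

    def sq_dist(a, b)
      a.zip(b).sum { |x, y| (x - y)**2 }
    end

    # Only the query's bucket is ranked exactly; the rest of the data is never touched.
    def approx_knn(index, query, k)
      index.candidates(query).min_by(k) { |v| sq_dist(v, query) }
    end

With the projection done up front (SVD in the telling above, random planes here),
each lookup touches one bucket instead of the whole matrix.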

~~~
highd
You are of course free to try simple dimensionality reduction and nearest
neighbors, and if that works on your problem, that's fantastic. To the research
community, though, problems where approaches like that work were considered
"solved" decades ago. And of course, in industry, if there's a chance of that
working, it's tried. But no one's building self-driving cars with PCA and LSH.

------
aswanson
I love Ruby, but it's falling almost irretrievably behind Python in data
analysis/visualization/machine learning libraries.

~~~
dnc
Same here. I use Python at work for ML-related stuff on a daily basis, for
pretty much the same reason everyone else chooses Python over Ruby ("that's
what everyone else in my company uses, and it just has more mature ML libraries
and greater support"). But countless times I've caught myself thinking, while
writing Python code: "this could have been much cleaner and shorter had I used
Ruby and its blocks." It's a shame.

~~~
jonnytran
I agree. I love Ruby. Not just the language, but the values of the community.
Don't give up. Try SciRuby [1]. I also recently discovered pycall [2], for those
times when something really doesn't exist yet in Ruby. Of course, if it doesn't
exist, that's an opportunity to make it yourself!

1: [https://github.com/SciRuby](https://github.com/SciRuby)

2: [https://github.com/mrkn/pycall.rb](https://github.com/mrkn/pycall.rb)

------
wyc
A different kind of "expressiveness."

        NB. naive k-nearest neighbors in J

        dist =: [:%:[:|[:+/(*:@:-)    NB. dyad takes two vectors, returns euclidean distance

        data =: 1 2 3,2 2 3,2 3 3,1 2 3,:2 3 4
        query =: 1 2 3
        k =: 3

        k {."1 /:"1 query&dist"1 data    NB. indices of the k rows of data nearest to query

~~~
vidarh
Whenever I've seen K or J code and someone has actually explained how it
works, it's turned out to be viable to implement the necessary primitives to
get similar expressiveness without making it completely unreadable. I'd love
to learn more about them, but I'm starting to feel that the terse syntax is
mostly obscuring relatively simple/basic functionality that's easy to re-
implement rather than any major language features.
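For what it's worth, here is a rough Ruby rendering of the J snippet above; it's
a sketch from reading the J, not tested against it:

    # Rough Ruby equivalent of the J snippet above (a sketch, not from the post)
    dist  = ->(a, b) { Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 }) }

    data  = [[1, 2, 3], [2, 2, 3], [2, 3, 3], [1, 2, 3], [2, 3, 4]]
    query = [1, 2, 3]
    k     = 3

    # Indices of the k rows of data nearest to query, as in the J version
    p data.each_index.min_by(k) { |i| dist.(query, data[i]) }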

------
computerwizard
Ruby is so expressive it's a shame it isn't used more for ML and AI.

~~~
nerdponx
Just last night, I was thinking about what data and stats tooling would be
like if NumPy had been, say, NumRuby instead.

------
mck-
For the js crowd, here's an implementation of KNN in Node I did a few years
ago: [https://github.com/axiomzen/Alike](https://github.com/axiomzen/Alike)

And its cousin, a KD-tree: [https://github.com/axiomzen/look-alike](https://github.com/axiomzen/look-alike)

------
abhgh
K-NN is one of the more concise classifiers to implement. I did a Python
implementation a while ago that fits into a tweet [1]. Since maps/lambdas
are available in Ruby, this should be possible in Ruby too. Sorry for the bad
presentation - I am planning to migrate soon.

[1] [http://quipu-strands.blogspot.com/2014/08/knn-classifier-in-one-line-of-python.html](http://quipu-strands.blogspot.com/2014/08/knn-classifier-in-one-line-of-python.html)
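For anyone curious, here's a hypothetical tweet-sized Ruby version; the
[point, label] data layout is my own invention, not from the linked post:

    # Hypothetical tweet-sized k-NN classifier; train is an array of [point, label] pairs
    knn = ->(train, q, k) {
      train.min_by(k) { |pt, _| pt.zip(q).sum { |a, b| (a - b)**2 } }
           .map(&:last).tally.max_by(&:last).first
    }

    train = [[[1, 1], :red], [[1, 2], :red], [[9, 9], :blue]]
    p knn.(train, [0, 1], 2)  # => :red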

------
s17n
I get that this is supposed to be basic, but why wouldn't you at least observe
that the sqrt call is unnecessary?
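(For the record: since sqrt is monotonically increasing, ranking by squared
distance picks the same k neighbors, so the call can simply be dropped. A sketch:)

    # sqrt preserves ordering, so squared distance yields the same nearest neighbors
    sq_dist = ->(a, b) { a.zip(b).sum { |x, y| (x - y)**2 } }

    data  = [[1, 2], [2, 2], [5, 5]]
    query = [1, 1]
    p data.min_by(2) { |row| sq_dist.(query, row) }  # no Math.sqrt needed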

~~~
tw1010
Premature optimization is the root of all evil, etc etc.

------
xchip
KNN is quite a simple operation in ML. I might be underestimating the effort,
but... why is this an achievement?

~~~
s17n
I think it's a post aimed at ML and/or Ruby beginners. Why it's on the front
page of HN is a mystery.

~~~
pkd
It's front page because enough people found it interesting enough to upvote.

------
frugalmail
Who still thinks Ruby is a good idea for this?

~~~
pkd
Why is it a bad idea?

