Hacker News new | comments | show | ask | jobs | submit login
ML Basics: K Nearest Neighbors in Ruby (thagomizer.com)
83 points by foob 9 months ago | hide | past | web | favorite | 31 comments

Just a thought:

When Cover & Hart proved that the error for k-NN classification is no worse than twice the Bayes (optimal) error, "machine learning" as a phrase had not yet been observed in the wild.


EE, CS, stats -- these are your fundamentals...

Umm that's a beautiful theoretical result no doubt. But in practice the choice of the "space" that determines the neighborhood, matters more than anything else on the top of that space. Also as another commenter pointed out the result is applicable when infinite labeled samples are present, which is almost never.

IMHO the reason Machine Learning is NOT Stats or Information theory because these beautiful fundamental result were found to be futile in practice.

This feels like a more significant result than it appears to get. I'm guessing the problem then comes down solely to defining a distance metric that you can easily/quickly evaluate? Or is this merely the upper bound and many folks do markedly better nowdays?

The result is in the large sample limit, which is pretty much never the case for high dimensional datasets like the ones most popular for ML these days (images, audio, text). It doesn't mean what the parent thinks it means.

You are proposing that a reduced dimensional projection of a large dataset cannot approach this limit?

I.e. expose the underlying low rank of nearly any huge sparse data matrix with an SVD or NMF. Enable fast recovery with a shitty (CS-wise) hash function. Recover most of the information about an observation's neighbors in a fraction of the time taken by many other approaches.

What's popular for ML benchmarking these days is not necessarily the same as what's needed for a specific application. It's a useful proof to keep in mind before prematurely optimizing with overly complicated approaches.

You are of course free to try simple dimensionality reduction and nearest neighbors, and if that works on your problem that's fantastic. To the research community, though, problems where approaches like that work were considered "solved" decades ago. And of course, in industry, if there's a chance of that working it's tried. But no one's building self driving cars with PCA and LSH.

Ah, so that means it doesn't mean what I also took it to mean. :)

Know a good reading on this?

I love ruby, but its falling almost irretrievably behind python in data analysis/visualization/machine learning libraries.

Same here, I use python at work for ML related stuff on daily basis for the reason pretty much everyone else chooses python over ruby ('that's what everyone else in my company uses and it just has more mature ML libraries and greater support'), but countless times have I caught myself thinking while writing python code: 'this could have been much cleaner and shorter code had I used ruby and its blocks'. It's a shame.

I agree. I love Ruby. Not just the language, but the values of the community. Don't give up. Try SciRuby. I also recently discovered pycall, for those times when something really doesn't exist yet in Ruby. Of course, if it doesn't exist, that's an opportunity to make it yourself!

1: https://github.com/SciRuby

2: https://github.com/mrkn/pycall.rb

Would you like some code review to improve your Python?

I'm doing machine learning in Python ... this is because there is no choice. Given a choice, i'd use Ruby rather than Python.

because there is no choice

You could be using R or MATLAB certainly, there are still choices in this space

C++, R, Matlab, etc. There is plenty of choice.

All you mentioned is not better choice, C++ is not suited for ML / startups (I was a hardcore C++ programmer). As for R & Mathlab, i think they are not suited for ppl with CS background, they are more suited for people with maths background.

Actually I'd say it's catching up now because of http://sciruby.com/

SciRuby has been around for quite a while, and hasn't made significant inroads. Their development also seems to have stalled.

A different kind of "expressiveness."

    NB. naive k-nearest neighbors in J

    dist =: [:%:[:|[:+/(*:@:-)    NB. dyad takes two vectors, returns euclidean distance

    data =: 1 2 3,2 2 3,2 3 3,1 2 3,:2 3 4
    query =: 1 2 3
    k =: 3

    k {."1 /:"1 query&dist"1 data

Whenever I've seen K or J code and someone has actually explained how it works, it's turned out to be viable to implement the necessary primitives to get similar expressiveness without making it completely unreadable. I'd love to learn more about them, but I'm starting to feel that the terse syntax is mostly obscuring relatively simple/basic functionality that's easy to re-implement rather than any major language features.

My eyes are bleeding already

Ruby is so expressive it's a shame it isn't used more for ML and AI.

Just last night, I was thinking about what data and stats tooling would be like if NumPy had been, say, NumRuby instead.

For the js crowd, here's an implementation of KNN in Node I did a few years ago: https://github.com/axiomzen/Alike

And it's cousin KD-tree: https://github.com/axiomzen/look-alike

K-NN is one of the more concise classifiers to implement. I did a Python implementation a while ago that can fit into a tweet -[1]. Since maps/lambdas are available in Ruby, this should be possible in Ruby too. Sorry for the bad presentation - I am planning to migrate soon.

[1] http://quipu-strands.blogspot.com/2014/08/knn-classifier-in-...

I get that this is supposed to be basic but why wouldn't you at least observe that the sqrt call is unnecessary?

Premature optimization is the root of all evil, etc etc.

KNN is quite a simple operation in ML. I might be underestimating the effort but... Why is this an achievement?

I think it's a post aimed at ML and/or ruby beginners. Why it's front page on HN is a mystery.

It's front page because enough people found it interesting enough to upvote.

who still thinks Ruby is a good idea for this?

Why is it a bad idea?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact