

Calculating 316 Million Movie Correlations in 2 Minutes (Down From 2.5 Hours) - physcab
http://dmnewbie.blogspot.com/2009/06/calculating-316-million-movie.html

======
physcab
Hi HN, I submitted this article because it has sat in my bookmarks for a while
and I've referenced it quite a number of times over the past few months. I
thought it may be helpful to some of you who work on similar problems.

I like it because when most people think of "giant datasets" they think they
need some special tool to process them. In fact, there are such tools and I use
them every day (namely Hadoop), but this article is a reminder that with the
proper forethought and consideration for the data structures in question, it
is quite possible to wrestle a large dataset on a single machine.

Also, people always wonder how to better optimize their code. I think this is
one of the few examples I've seen where the author went through a series of
steps to obtain the optimization they had in mind and documented their
strategy well. It serves a practical purpose too.

If you want to try your hand out at this problem you can obtain the dataset
here: <http://archive.ics.uci.edu/ml/datasets/Netflix+Prize>

and follow the forums here: <http://www.netflixprize.com/community/>

------
blantonl
I hate to ask this, but I'm going to do it anyway.

Can someone explain to the rest of us (maybe just me) what the heck this
article is about?

This article started right out of the gate assuming the reader was well
informed of the context.

~~~
jey
k-nearest neighbors is a simple approach for prediction in machine learning.
The objective is to predict the value of a function at some point for which we
don't have an observation in the training set. In the Netflix Prize, this
means predicting the rating a user _U_ would give to some unrated movie _M_.
The kNN approach is:

1. Identify the k users most similar to _U_. This is called the
"neighborhood".

2. Have these k neighbors vote on the rating that _U_ should assign to _M_.
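The two steps can be sketched in a few lines of Python (a toy sketch, not the article's code: the names, the plain average in step 2, and the dict-of-dicts data layout are all my own assumptions):

```python
# Minimal kNN rating prediction: form a neighborhood, then let it vote.

def predict_rating(target_user, movie, ratings, similarity, k=20):
    """ratings: dict user -> {movie: rating}.
    similarity: function (user, user) -> float, higher = more similar."""
    # Step 1: the neighborhood -- the k users most similar to target_user.
    others = [u for u in ratings if u != target_user]
    neighbors = sorted(others, key=lambda u: similarity(target_user, u),
                       reverse=True)[:k]
    # Step 2: neighbors who actually rated the movie vote (plain average
    # here; a real system would weight votes by similarity).
    votes = [ratings[u][movie] for u in neighbors if movie in ratings[u]]
    return sum(votes) / len(votes) if votes else None
```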

The premise behind the above scheme is that similar users will assign
approximately the same rating to a particular movie. To actually implement the
kNN scheme requires a notion of "similarity" for step 1 and "voting" for step
2. The linked article is using Pearson Correlation as the similarity function
(aka the "distance metric"), and some kind of weighted average as the voting
function (as mentioned in
<http://dmnewbie.blogspot.com/2007/09/greater-collaborative-filtering.html> )
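For concreteness, Pearson correlation between two users, computed only over the movies both have rated, might look like this (a sketch; the overlap handling and variable names are my assumptions, not taken from the article):

```python
from math import sqrt

def pearson(ratings_u, ratings_v):
    """ratings_u, ratings_v: dict movie -> rating for one user each."""
    common = set(ratings_u) & set(ratings_v)  # co-rated movies only
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to correlate
    xs = [ratings_u[m] for m in common]
    ys = [ratings_v[m] for m in common]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # one user rated everything identically
    return cov / (sx * sy)
```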

I don't think this would work very well on the Netflix dataset because the
training set is super sparse. I deliberately glossed over this above, but
users in _U_ 's neighborhood who haven't rated the movie _M_ are useless when
voting on the value that _U_ should assign to _M_! So you have to make a call
about how you form the neighborhood: do you just find the k nearest users and
only average over the <= k users who actually rated _M_? Or do you find the k
nearest users who actually assigned a rating to _M_ (ignoring _U_ 's neighbors
who haven't rated _M_ )? Either way, with a dataset that's as sparse as the
Netflix data, you're going to have a hard time forming useful neighborhoods
since either you're going to have neighborhoods where there's very little
information to go off of, or the "k most similar" users are actually really
not very similar to _U_ at all, leading to inaccurate prediction.
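The two neighborhood choices can be put side by side (function names and the data layout are mine, not from the article):

```python
def vote_fixed_neighborhood(sims, rated_m, k):
    """Variant 1: take the k nearest users overall, then average over
    the <= k of them who actually rated M.
    sims: list of (similarity, user); rated_m: dict user -> rating of M."""
    top_k = sorted(sims, reverse=True)[:k]
    votes = [rated_m[u] for _, u in top_k if u in rated_m]
    return sum(votes) / len(votes) if votes else None

def vote_nearest_raters(sims, rated_m, k):
    """Variant 2: take the k nearest users among those who rated M,
    ignoring neighbors with no rating for M."""
    raters = [(s, u) for s, u in sims if u in rated_m]
    top_k = sorted(raters, reverse=True)[:k]
    votes = [rated_m[u] for _, u in top_k]
    return sum(votes) / len(votes) if votes else None
```

On sparse data the first variant can come up with no votes at all, while the second is forced to reach for ever-less-similar users, which is exactly the trade-off described above.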

More info: <http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm>

Chapter two of the excellent and free book "Elements of Statistical Learning"
has a better exposition of this idea.
<http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.html>

~~~
Xichekolas
(Not really on topic, but if I could nominate this as an example of the ideal
HN comment, I would. It'd be nice to have a gallery of things like this
attached to the guidelines. Thanks jey!)

~~~
iamelgringo
Seconded. Excellent work, Jey.

