
Netflix prize tribute: Recommendation algorithm in Python - fogus
http://blog.smellthedata.com/2009/06/netflix-prize-tribute-recommendation.html
======
physcab
Really cool blog. I'll add it to the other good ones I typically read:
flowingdata.com and datawrangling.com.

I wonder why he chose PMF. There are seemingly simpler algorithms
(PCA/SVD/LSA) that appear to do the same kind of thing.

It also looks like PMF uses expectation maximization, which is quite common
but not necessarily scalable (although I could be wrong). This would probably
take quite a bit of time to run on the whole dataset and unfortunately he
doesn't give any specs!

~~~
danger
SVD doesn't work when you have a sparse matrix as input. You could use some
sort of imputation method (like EM) to fill in all of the missing values and
then run SVD, but this would be extremely inefficient, since only a tiny
fraction of the users x movies ratings matrix is observed. It also probably
wouldn't work very well, because the true signal in the data would be
overwhelmed by all of the missing values you'd somewhat arbitrarily filled in.

PMF is a generalization of SVD that works directly on the sparse structure of
the input. It does gradient descent -- modifying the low-dimensional latent
vectors directly to minimize error -- and never imputes missing values. This
makes it very efficient, and it can easily
handle millions of ratings. See
<http://www.cs.toronto.edu/~amnih/papers/pmf.pdf> for an application of PMF to
the very large Netflix problem.
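The core idea -- SGD over only the observed (user, movie, rating) triples,
with no imputation -- can be sketched in a few lines of Python. This is a toy
illustration, not the paper's exact model (the paper adds Gaussian priors,
momentum, etc.); the function name and hyperparameters here are made up:

```python
import numpy as np

def pmf_sgd(ratings, n_users, n_items, k=5, lr=0.01, reg=0.1, epochs=200, seed=0):
    """Toy PMF sketch: stochastic gradient descent over observed ratings only.

    `ratings` is a list of (user, item, value) triples; missing entries of the
    users x items matrix are never touched, so cost scales with the number of
    observed ratings, not with n_users * n_items.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))  # user latent vectors
    V = 0.1 * rng.standard_normal((n_items, k))  # item latent vectors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                   # error on this one rating
            U[u] += lr * (err * V[i] - reg * U[u])  # gradient step, L2-regularized
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Toy data: 3 users, 3 movies, only 6 of the 9 cells observed.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V = pmf_sgd(ratings, n_users=3, n_items=3)
rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings]))
```

After training, `U[u] @ V[i]` predicts any cell, including the three that were
never observed -- that's the whole point of factorizing instead of imputing.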

~~~
bravura
Another reason he uses PMF is probably because Danny Tarlow (post author) is a
student at UToronto. Ruslan Salakhutdinov and Andriy Mnih, the authors of the
PMF method, are also students at UToronto.

------
lkozma
Here's the same algorithm and other variants implemented in Matlab/C++ (can be
run on Netflix-sized data): <http://www.cis.hut.fi/alexilin/software/>

