

My Python Code for the Netflix Prize - alexbw
https://github.com/alexbw/Netflix-Prize
I competed alone in the Netflix Prize in college under the team name "Hi!". I've never seen anybody release their code, and since I'm getting back into machine learning now, I realized that some folks might want to take a gander at a competitive machine learning codebase.

It's implemented mostly in Python, with Cython for the real speed-sensitive parts (everything in the file "svd.pyx" did the heavy lifting and got me up the leaderboard).

I hope that some folks will find this useful.
======
Nogwater
Here's mine if anyone is interested. I wrote it in D and haven't looked at it
in years. I'm sure it's not usable as-is, but it might be fun anyway.

<https://github.com/nogwater/NetflixPrizeD>

The algorithm is based on Simon Funk's blog post here:
<http://sifter.org/~simon/journal/20061211.html>

For me, the best part was squeezing the data and indexes into memory. :)
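For anyone who hasn't read Funk's post: the method is plain SGD on a low-rank factorization of the ratings matrix. A minimal sketch of the idea in Python (hyperparameters and the toy data here are illustrative, not taken from Funk's code or this repo):

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, n_factors=2,
             lr=0.02, reg=0.02, n_epochs=500):
    """Funk-style SGD factorization: ratings is (user, item, value) triples."""
    rng = np.random.default_rng(0)
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            pu = P[u].copy()  # use the pre-update user factors for both steps
            P[u] += lr * (err * Q[i] - reg * pu)   # gradient step + L2 shrinkage
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Toy data: two users, two items.
data = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = funk_svd(data, n_users=2, n_items=2)
# P[u] @ Q[i] now approximates the known ratings.
```

The real trick, as the comment says, is doing this over 100M ratings without blowing out memory, which is where packed indexes come in.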

------
alexbw
@tuananh I've got the dataset stored away, but I don't know if I'm legally
allowed to post it. I'd love it if someone could produce proof one way or the
other.

@viraj_shah I spent about 6 months working on the project before I had to stop
to concentrate on my schoolwork (I was a senior in college at the time). I
think it would have been impossible to do this by myself without Cython. If
it were to happen today, I would probably be writing it in PyCUDA, or with
Numba, and it would be much, much, MUCH more succinct.

~~~
viraj_shah
@alexbw Thanks for the info, great work!

------
richardlblair
You indent by 8 characters.... I wanted to read your code but this will make
my eyes bleed.

From pep 8: "Use 4 spaces per indentation level."

~~~
beaumartinez
Close but no cigar. He's indenting with tabs.

------
arekp
I wrote a 195-page monograph on the Netflix Prize, for people interested in
that sort of stuff: <http://arek-paterek.com/book>

~~~
rabidsnail
Comic Sans?

~~~
arekp
right on :)

the beauty of self-publishing

------
skystorm
Very nice. It might be helpful to (briefly) describe the actual techniques you
tried in the readme file. At least that's the first thing I looked for...

------
viraj_shah
This was incredibly kind of you to post up. It's great to see it made public,
as many can learn from it. The Cython code looks scary though -- 18k lines!
May I ask how long you spent on this?

------
andreasvc
Nitpick: binary blobs like .pyc and .so don't belong in a code repository.
Instead, you would include a Makefile or setup.py to compile the .pyx files.

------
jdleesmiller
I also worked on this at uni and had lots of fun -- those lessons certainly
look familiar! We were trying to mine Wikipedia for more information on the
movies. The code's here:

<http://code.google.com/p/wikipedia-netflix/wiki/WikipediaNetflix>

It includes Wikipedia parsing stuff and a fairly fast C++ implementation of
the very cool BellKor kNN algorithm.

~~~
textminer
Love each of the individual BellKor approaches
(<http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf>)
for finding recommendations in the space of movies or users -- an MDS
embedding, a PCA whitening, an NMF factorization by alternating least squares.
Each of those hunches seems like the true art in these problems. Blending
100 of them together is far less interesting to me, though.
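For concreteness, here's a toy sketch of the NMF-by-ALS ingredient -- a simple projected-ALS variant (solve each factor by least squares, then clip negatives), not the exact BellKor procedure:

```python
import numpy as np

def nmf_als(V, k, n_iters=200, seed=0):
    """Toy NMF: alternating least squares with projection onto nonnegatives."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iters):
        # Solve W @ H ~= V for H, then clip negatives (the "projection").
        H = np.linalg.lstsq(W, V, rcond=None)[0].clip(min=0)
        # Same for W, via the transposed system H.T @ W.T ~= V.T.
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)
    return W, H

V = np.array([[1., 2.], [2., 4.], [3., 6.]])  # rank-1 nonnegative matrix
W, H = nmf_als(V, k=1)
# W @ H reconstructs V (it is exactly representable at rank 1)
```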

Yet that seems to be exactly the sort of jockeying and tweaking these problems
(now seen in Kaggle contests) require. Is there an art or science, then, to
the subsequent blending? Does one develop a better intuition for the problem
at that point, or am I entirely missing the point of most ensemble methods
(predictiveness over parsimonious understanding)?

~~~
beagle3
> Is there an art or science then to the subsequent blending?

You could regard this as an application of the "smoothed expectation theorem"
(the tower property), which says E[X] = E[E[X|Y]]. That is, if you are trying
to compute the expectation of something, you can make it depend on anything
else and compute the inner expectation with respect to that. It might seem
trivial or useless, but it is widely applicable and often significantly
simplifies computations.
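A quick Monte Carlo sanity check of the identity (the setup -- a uniform coin choice Y and a flip X -- is just an illustration):

```python
import numpy as np

# Check E[X] = E[E[X|Y]] numerically.
# Y picks one of two biased coins uniformly; X is the resulting flip.
rng = np.random.default_rng(42)
n = 200_000
p_heads = np.array([0.2, 0.8])      # biases of coin 0 and coin 1
y = rng.integers(0, 2, size=n)      # Y ~ Uniform{0, 1}
x = (rng.random(n) < p_heads[y]).astype(float)  # X | Y=k ~ Bernoulli(p_heads[k])

direct = x.mean()                                      # E[X] estimated directly
inner = np.array([x[y == k].mean() for k in (0, 1)])   # E[X | Y=k] for each k
towered = inner.mean()                                 # average over uniform Y
# Both estimates sit near the true value 0.5.
```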

One of the practical implications is that if you're not sure about something
(underlying model, specific parameters), just apply some prior distribution
and compute the expectation over that -- it is essentially guaranteed* to
provide a better result than trying to pick the correct setup.

Although I'm not sure what the interpretation here would be.

* - so long as the entropy of your prior is not more wrong than the entropy of your hyper-parameters. This _is_ often the case.

~~~
textminer
Yeah, the tower property! That made my day. Thanks for cleanly giving
motivation and mathematical beauty to something that irked me up until now.
Which is probably the problem of having your aesthetics drive you in the first
place.

------
JacobiX
Thanks for posting this. Your code combines two successful approaches: a
latent factor model (SVD) and a neighborhood model :)

Here's my implementation of a recommender algorithm in C if anyone is
interested: <https://github.com/GHamrouni/Recommender>

------
surine
Mine is at <https://github.com/hbcdev/Netflix> in C++; it got me into the top
500. It runs just inside the 8GB of RAM on my PC :)

------
tlocke
I had a look through this and was confused by this line:

ratings[ratings<1.0] = 1.0

in

<https://github.com/alexbw/Netflix-Prize/blob/master/src/predict.py>

Is it specific to NumPy? Or perhaps a Python trick I haven't seen before?

~~~
malkarouri
Specific to NumPy. Boolean indexing.

<http://www.scipy.org/Cookbook/Indexing#head-86055279f6592d36b8956f3400e3716c6491ae95>
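A quick illustration with made-up values (the second clip is just the same trick applied to the other end of the scale):

```python
import numpy as np

# ratings < 1.0 produces a boolean mask; assigning through the mask
# overwrites only the matching entries -- here, clipping predictions
# that fall below the 1-star floor or above the 5-star ceiling.
ratings = np.array([0.3, 2.7, 4.9, 5.6, -0.1])
ratings[ratings < 1.0] = 1.0
ratings[ratings > 5.0] = 5.0
# ratings -> [1.0, 2.7, 4.9, 5.0, 1.0]
```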

~~~
beambot
Yep. The indexing and slicing are quite powerful. For those familiar with
MATLAB, this is a fantastic resource:
<http://www.scipy.org/NumPy_for_Matlab_Users>

------
tuananh
great for reference. however i can't find the dataset anywhere on the internet :(

~~~
alexholehouse
Netflix had to pull the dataset after researchers at UT Austin were able to
de-anonymize it, as shown in their paper "Robust De-anonymization of Large
Sparse Datasets": <http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf>

------
raheemm
Good readme file. Liked the lessons learned.

------
marklit
Does anyone have any thoughts on the lack of conformity to PEP 8? His code
works and it's valid Python, but I find it difficult to read.

