

Movie Recommendations with k-Nearest Neighbors and Cosine Similarity - mcphilip
http://gist.neo4j.org/?8173017

======
wavesum
The Netflix challenge forums are a treasure trove when it comes to this
particular ML problem. In the end the algorithms grew into huge monsters
blending results from tens of algorithms, but the most interesting discoveries
were made in the first year IMO.

[http://www.netflixprize.com/community/](http://www.netflixprize.com/community/)

~~~
thomasahle
Wow, I never thought of looking there. It's amazing to read tricks like that,
which people came up with after years of thinking and experimenting. And they
seem usable for quite a wide range of rating-related problems.

------
avaku
k-NN is not nearly as good as deeper probabilistic models. For example, if a
movie description mentions different subjects like "comedy" and "family", the
model would not be able to differentiate them. It would only have a cluster of
"family comedies", but it won't be able to tell that a person likes "comedies"
in general.

I've recently done similar research on finding "company peers" using company
descriptions. The descriptions are put through LDA
([http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation))
to find the topics expressed in each description. Then, similar companies are
identified using the K-L divergence
([http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence))
between their topic distributions. It worked much better than cosine
similarity in my tests.

My guess is that it would work better for movies too. If you're interested,
have a look at my results here:
[http://akuz.me/2014/03/finding-company-peers-using-lda/](http://akuz.me/2014/03/finding-company-peers-using-lda/)
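A minimal sketch of that pipeline, assuming scikit-learn's
LatentDirichletAllocation (the post doesn't name a library) and a hand-rolled
K-L divergence; the toy "descriptions" below are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy company descriptions (hypothetical, not the author's data).
docs = [
    "software cloud computing analytics data platform",
    "cloud data software services analytics",
    "oil gas drilling exploration energy",
    "energy oil pipeline gas exploration",
]

# Bag-of-words counts, then LDA to infer per-document topic distributions.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # each row sums to 1: P(topic | document)

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), with smoothing to avoid log(0)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Rank peers of document 0 by symmetrised K-L divergence; smaller = more similar.
peers = sorted(range(1, len(docs)),
               key=lambda i: kl(theta[0], theta[i]) + kl(theta[i], theta[0]))
print(peers)
```

Swapping the K-L ranking for cosine similarity over the same `theta` rows is
the comparison described above.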

------
lowglow
Would anyone be interested in more ML content coming out of #SFHN? We have a
new speaker in the pipeline who wants to talk about simulated annealing.

~~~
robert_tweed
Yes.

~~~
lowglow
Sounds great. I think we'll aim for a May deadline on the ML talk.

------
Radim
Related: benchmarks of available k-NN libs using cosine similarity, in Python.

[http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/](http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/)
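For context, the brute-force baseline such libraries try to beat - exact k-NN
by cosine similarity - is only a few lines of NumPy (a generic sketch, not code
from the benchmark):

```python
import numpy as np

def cosine_knn(query, vectors, k=3):
    """Indices of the k nearest rows of `vectors` to `query` by cosine
    similarity. Brute force: O(n*d) per query."""
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query)
    sims = vn @ qn                 # cosine similarity of every row vs. query
    return np.argsort(-sims)[:k]   # highest similarity first

rng = np.random.default_rng(0)
ratings = rng.random((100, 20))    # e.g. 100 users x 20 movie ratings
neighbours = cosine_knn(ratings[0], ratings, k=5)
print(neighbours)                  # row 0 is its own nearest neighbour
```

The approximate-NN libraries in the shootout trade a little accuracy for far
better scaling than this exact scan.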

------
sgt101
The ideas that are used here are heavily researched. As an example of current
work have a look at
[http://www.eecs.qmul.ac.uk/~laurissa/Laurissas_Pages/Publica...](http://www.eecs.qmul.ac.uk/~laurissa/Laurissas_Pages/Publications_files/5137a351.pdf)

Most of the challenge is in finding a way to assess the value of innovations
in the algorithms - how do you know how well it works? Difficult unless you
are running a large-scale recommender that users can't opt out of (because pop
goes your stats if they do!)

------
rpicard
I've been thinking about something like this for a while. My idea was to use
data from IMDB to create a graph where the distance between movies is some
ranking of similarity based on the people involved in the movie, the genre,
the setting and any other information you could get from the data set.

You could say, "I want to watch a movie like _The Wolf of Wall Street_ " and
it would find the closest 10 movies in the graph.

It's still something I'd like to play with if I find the time.
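One cheap way to prototype that idea before building a full graph: treat each
movie as a set of attributes (people, genre, setting) and rank by Jaccard
similarity. The attribute sets below are invented for illustration, not real
IMDB data:

```python
# Hypothetical attribute sets per movie (people, genres); not real IMDB data.
movies = {
    "The Wolf of Wall Street": {"scorsese", "dicaprio", "biography", "crime"},
    "The Departed": {"scorsese", "dicaprio", "crime", "thriller"},
    "Titanic": {"cameron", "dicaprio", "romance", "drama"},
    "Up": {"docter", "animation", "family"},
}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two attribute sets."""
    return len(a & b) / len(a | b)

def closest(title, n=2):
    """The n movies whose attribute sets overlap most with `title`'s."""
    others = [t for t in movies if t != title]
    return sorted(others, key=lambda t: -jaccard(movies[title], movies[t]))[:n]

print(closest("The Wolf of Wall Street"))  # → ['The Departed', 'Titanic']
```

In a real graph DB, edge weights derived from scores like this would let you
answer "10 movies like _The Wolf of Wall Street_" with a nearest-neighbour
query.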

~~~
no_gravity
I'm doing something similar at
[http://www.movie-map.com](http://www.movie-map.com)

It's not based on IMDB, but on
[http://www.gnovies.com](http://www.gnovies.com)

~~~
LanceH
I've contemplated the idea of taking soundtracks from movies and comparing
them to a user's favorite tracks.

It wouldn't accurately predict the best movie to watch, but it might surface
an indirect, quality pick that might otherwise never be seen.

~~~
rpicard
Wouldn't it be awesome if it turned out to be a great indicator of movies
they'd like though? That sounds like a fascinating experiment.

------
kenshiro_o
This looks very good. I have not delved into maths stuff for quite some time
so it is refreshing to read such an approachable article.

Which other relatively simple techniques could we apply to find out who is
similar to us in terms of movie taste, etc?

Also, it may be better to compute cosine similarity _based on movie type_, as
it may produce less noise.

Moreover, I have never used R but it seems like a very neat language...

------
ManyNames
I was wondering how this got so high on HN with such a basic method, but the
breakdown and explanation given are great!

~~~
UK-AL
Considering most stuff on HN is just web development/general programming, it's
not that simple for some people on here.

For anyone who's studied a bit of ML it's simple, though.

------
hitlin37
This looks good, I will give it a try. Also, GraphGists look like a good
format, quite similar to IPython. I have one question: why would I use Neo4j
when I can do k-NN in Python with so many machine learning toolkits available?
What benefit do I get from using Neo4j for movie recommendation?

~~~
PaulRobinson
Neo4j as a data persistence layer has many, many advantages over other data
persistence layers for graph-like problems (social, recommendation, etc.),
principally from the point of view of performance.

You can do it in Python but you're going to have to persist and query your
data somehow, and I think at that point you'll "get" why a graph DB might be
beneficial. It's not an ML-related thing, though; it's a data query
performance thing.

