Movie Recommendations with k-Nearest Neighbors and Cosine Similarity (neo4j.org)
139 points by mcphilip on March 31, 2014 | 23 comments



The Netflix challenge forums are a treasure trove when it comes to this particular ML problem. In the end the top entries grew into huge monsters blending results from tens of algorithms, but the most interesting discoveries were made in the first year, IMO.

http://www.netflixprize.com/community/


Wow, I never thought of looking there. It's amazing to read the tricks people have come up with after years of thinking and experimenting. And they seem usable for quite a wide range of rating-related problems.


KNN is not nearly as good as deeper probabilistic models. For example, if a movie description mentions different subjects like "comedy" and "family", the model would not be able to differentiate them. It would only have a cluster of "family comedies", but it would not be able to tell that a person likes "comedies".

I've recently done similar research on finding "company peers" using company descriptions. The descriptions are put through LDA (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to find the topics expressed in each description. Then, similar companies are identified using K-L divergence (http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...) between their topic distributions. It worked much better than cosine similarity in my tests.

My guess is that it would work better for movies too. If interested, have a look at my results here: http://akuz.me/2014/03/finding-company-peers-using-lda/
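
A minimal sketch of that pipeline, assuming gensim for the LDA step and SciPy for the divergence (the toy corpus, topic count, and symmetrised K-L are my own choices, not the setup from the linked experiment):

    # Sketch: rank documents by similarity of their LDA topic distributions,
    # compared with a symmetrised K-L divergence. Toy data, not real descriptions.
    import numpy as np
    from gensim import corpora, models
    from scipy.stats import entropy  # entropy(p, q) = KL(p || q)

    docs = [
        "family comedy about a talking dog".split(),
        "dark comedy set in a law firm".split(),
        "documentary about deep sea life".split(),
    ]

    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, passes=10)

    def topic_vector(bow, num_topics=2):
        # Dense topic distribution for one document (tiny epsilon avoids zeros).
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        return np.array([dist.get(t, 0.0) for t in range(num_topics)]) + 1e-12

    vectors = [topic_vector(b) for b in bows]
    query = vectors[0]
    # Lower symmetrised K-L divergence = more similar to the query document.
    scores = {i: 0.5 * (entropy(query, v) + entropy(v, query))
              for i, v in enumerate(vectors)}
    print(sorted(scores, key=scores.get))  # index 0 (the query itself) ranks first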


Would anyone be interested in more ML stuff coming out of #SFHN? We have a new speaker in the pipeline wanting to speak to Simulated Annealing.


Along that same note, you may want to use the words "genetic algorithms". It sounds cooler and they're in the same category.

Look into hill-climbing algorithms if you're ever curious how basic optimization algorithms (re: games) work.
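
For the curious, the core of a hill climber is only a few lines; here's a toy sketch (not tied to any particular game):

    import random

    def hill_climb(score, start, neighbours, iterations=1000):
        """Greedy hill climbing: keep the current state, move only to better neighbours."""
        current = start
        for _ in range(iterations):
            candidate = random.choice(neighbours(current))
            if score(candidate) > score(current):
                current = candidate
        return current

    # Toy example: maximise -(x - 3)^2 over the integers.
    best = hill_climb(score=lambda x: -(x - 3) ** 2,
                      start=-10,
                      neighbours=lambda x: [x - 1, x + 1])
    print(best)  # converges towards 3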


Yes.


Sounds great. I think we'll aim for a May deadline on the ML talk.


Related: benchmarks of available k-NN libs using cosine similarity, in Python.

http://radimrehurek.com/2013/12/performance-shootout-of-near...
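
For reference, the brute-force baseline those libraries are measured against is only a few lines with scikit-learn (a sketch on random data):

    # Brute-force k-NN with cosine distance via scikit-learn, i.e. the naive
    # baseline that approximate-NN libraries try to beat. Random toy data.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    vectors = rng.random((1000, 50))   # e.g. 1000 items with 50 features each
    query = rng.random((1, 50))

    knn = NearestNeighbors(n_neighbors=10, metric="cosine", algorithm="brute")
    knn.fit(vectors)
    distances, indices = knn.kneighbors(query)
    print(indices[0])  # indices of the 10 nearest items by cosine distance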


The ideas used here are heavily researched. As an example of current work, have a look at http://www.eecs.qmul.ac.uk/~laurissa/Laurissas_Pages/Publica...

Most of the challenge is in finding a way to assess the value of innovations in the algorithms - how do you know how well it works? Difficult unless you are running a large-scale recommender that users can't opt out of (because pop goes your stats if they do!)
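
Short of a live deployment, the usual offline proxy is to hold out some known ratings and score the predictions against them, e.g. with RMSE as in the Netflix Prize (a sketch with placeholder numbers):

    # Offline evaluation sketch: RMSE between predicted and held-out ratings,
    # the metric used for the Netflix Prize. The arrays are placeholders.
    import numpy as np

    held_out = np.array([4.0, 3.0, 5.0, 2.0])   # true ratings withheld from training
    predicted = np.array([3.5, 3.2, 4.6, 2.4])  # the recommender's predictions for them

    rmse = np.sqrt(np.mean((predicted - held_out) ** 2))
    print(f"RMSE: {rmse:.3f}")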


I've been thinking about something like this for a while. My idea was to use data from IMDB to create a graph where the distance between movies is some ranking of similarity based on the people involved in the movie, the genre, the setting and any other information you could get from the data set.

You could say, "I want to watch a movie like The Wolf of Wall Street" and it would find the closest 10 movies in the graph.
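
One simple way to get that ranking is a Jaccard-style overlap on the metadata sets; a toy sketch (the metadata below is made up, not pulled from IMDB):

    # Rank movies by Jaccard similarity of their metadata sets
    # (people involved, genre, setting). The data here is made up.
    movies = {
        "The Wolf of Wall Street": {"scorsese", "dicaprio", "biography", "finance"},
        "The Big Short":           {"mckay", "bale", "biography", "finance"},
        "Frozen":                  {"buck", "animation", "family", "musical"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    query = "The Wolf of Wall Street"
    ranked = sorted((t for t in movies if t != query),
                    key=lambda t: jaccard(movies[query], movies[t]),
                    reverse=True)
    print(ranked[:10])  # the ten most similar titles (only two candidates here)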

It's still something I'd like to play with if I find the time.


I started on something similar as a side project[1]. I decided to build a dataset from the AFI Top 100 Films list and persist it in neo4j. The goal was to find interesting questions to answer with this dataset that couldn't easily be googled.

Most of my time thus far has been spent gathering the dataset, but I do have a few example cypher queries answering the following simple questions [2]:

1) What actors have appeared in the most AFI Top 100 films?

2) What are the genres of the top ten films?

3) Have any actors appeared in 2 or more of the top 25 films?

I'm working on building a much larger data set using a combination of Freebase and IMDB so that I can have enough data to start exploring much more interesting questions (e.g. graph the frequencies of genres over the past 60 years; for a given film, find movies with the greatest overlap in genres, actors, and directors; generalize the n-degrees-to-bacon problem to work on any two actors; etc.).

[1]https://github.com/mcphilip/film-graph

[2]http://htmlpreview.github.io/?https://github.com/mcphilip/fi...
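
For illustration, the first of those questions might look something like this through the neo4j Python driver (a sketch only: the node labels, relationship type, and connection details are assumptions, not necessarily the schema used in [1]):

    # Hypothetical query for question 1: which actors appear in the most
    # AFI Top 100 films. Labels and relationship names are assumptions.
    from neo4j import GraphDatabase

    QUERY = """
    MATCH (a:Actor)-[:ACTED_IN]->(f:Film)
    RETURN a.name AS actor, count(f) AS appearances
    ORDER BY appearances DESC
    LIMIT 10
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for record in session.run(QUERY):
            print(record["actor"], record["appearances"])
    driver.close()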


Very cool. Thanks for posting this. The n-degrees-to-bacon problem is actually what made me think of this in the first place. It would be great to be able to plug in two actors and have it spit out an answer with the shortest path.


I'm doing something similar at http://www.movie-map.com

It's not based on IMDB, but on http://www.gnovies.com


I've contemplated the idea of taking soundtracks from movies and comparing them to a user's favorite tracks.

It wouldn't provide an accurate prediction of the best movie to watch, but it might come up with an indirect, quality pick that might otherwise never be seen.


Wouldn't it be awesome if it turned out to be a great indicator of movies they'd like though? That sounds like a fascinating experiment.


That's awesome! What is your algorithm like for determining the similarity? There are some good answers, but The Wolf of Wall Street is apparently pretty close to Frozen. ;)


"Films similar to X" is definitly a good use case. But sometimes you can just go to the Amazon page for a DVD, and look at "people who bought this also bought that" ;)


This looks very good. I have not delved into maths stuff for quite some time so it is refreshing to read such an approachable article.

Which other relatively simple techniques could we apply to find out who is similar to us in terms of movie taste, etc?

Also, it may be better to compute cosine similarity based on movie type, as it may introduce less noise.

Moreover, I have never used R but it seems like a very neat language...


I was wondering how this possibly got so high on HN with such a basic method, but the breakdown and explanation given is great!


Considering most stuff on HN is just web development/general programming, it's not that simple for some people on here.

For anyone who's studied a bit of ML it's simple, though.


This looks good, I will give it a try. Also, GraphGists looks like a good format, quite similar to IPython notebooks. I have one question: why would I use neo4j when I can do k-NN in Python with so many machine learning toolkits available? What benefit do I get from using neo4j for movie recommendation?


Neo4j as a data persistence layer has many, many advantages over other data persistence layers for graph-like problems (social, recommendation, etc.), principally from the point of view of performance.

You can do it in Python but you're going to have to persist and query your data somehow, and I think at that point you'll "get" why a graph DB might be beneficial. It's not an ML-related thing though, it's a data query performance thing.


This post was an entry in the Winter 2013 Neo4j GraphGist Contest. The author of this submission also has a blog post about doing k-NN in R [1]. In other words, this GraphGist is an example of how you could do basic ML in neo4j, but it's not meant to imply that neo4j should be chosen over python.

http://nicolemargaretwhite.blogspot.com/2013/12/movie-recomm...





