Netflix prize competitor: With the best algorithms, metadata becomes worthless (pragmatictheory.blogspot.com)
69 points by bdr on Aug 15, 2008 | hide | past | favorite | 31 comments


Metadata is useful when you need to interpret the results, and most people care about why something is recommended to them. The top Netflix algorithms are black boxes from the user's standpoint, which doesn't help when deciding whether to buy or rent based on that recommendation.

Compare that with Amazon's approach, where they sacrifice predictive power for a useful explanation ("customers who bought X also bought Y").


I wonder if, having used an impossible-to-explain algorithm to arrive at your solution, you could compare your solution against a few very simple algorithms, and offer up the best fit as a rationalization.
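A sketch of that idea in Python (all names and ratings data here are hypothetical): run a couple of simple, explainable heuristics alongside the black box, and present whichever one lands closest to its output as the rationale.

```python
# Hypothetical sketch: after a black-box model produces a prediction,
# find which simple, explainable heuristic lands closest to it and
# offer that heuristic's story as the rationalization.

def item_mean(ratings, item):
    vals = [r for (u, i), r in ratings.items() if i == item]
    return sum(vals) / len(vals)

def user_mean(ratings, user):
    vals = [r for (u, i), r in ratings.items() if u == user]
    return sum(vals) / len(vals)

def rationalize(ratings, user, item, black_box_pred):
    candidates = {
        "people who rated this item gave it about this score":
            item_mean(ratings, item),
        "this user tends to rate movies about this score":
            user_mean(ratings, user),
    }
    # Present the closest-fitting simple story as the "explanation".
    story, simple_pred = min(candidates.items(),
                             key=lambda kv: abs(kv[1] - black_box_pred))
    return story, simple_pred

ratings = {("alice", "heat"): 5, ("bob", "heat"): 4, ("alice", "brazil"): 2}
story, simple_pred = rationalize(ratings, "alice", "heat", black_box_pred=4.6)
```

The user never sees the black box's reasoning; they just see the simple story that happens to agree with it.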


Fine, but if anyone copyrights normal human reasoning, you're in trouble.


Legal nitpick: You meant patents. Patents are for processes, copyright is for instances.


You're right. Thanks for the correction.


I don't know that you can say Amazon is sacrificing any power. The Netflix Prize isn't about better recommendations per se; rather, it's about predicting users' movie ratings. It's not clear that predicting these ratings better will give better recommendations.

I think most of the component algorithms in the BellKor method provide fairly straightforward explanations - the best predictor is a nearest-neighbour algorithm.


Amazon isn't sacrificing predictive power because they want to give you the explanation. They're just including profit margin in the data to give recommendations that are valuable to them. Not that I can blame them.


But Netflix does say that. e.g.:

Twin Peaks: Season 1 (3-Disc Series)

Because you enjoyed: Mulholland Drive Brazil Videodrome

So right away I know that Twin Peaks is pretty freaky.


I'm talking about the current leaders in the Netflix Prize. Their algorithms are a combination of hundreds of individual recommenders blended together by another learning algorithm. There's no easy way to explain the result, but I guess one could cheat and run a simpler version on top of it to produce an "explanation".
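As a toy illustration of blending (not any team's actual pipeline, and with invented holdout data): learn least-squares weights for two base predictors on held-out ratings, then combine them.

```python
# Toy sketch of blending: fit weights w0, w1 minimizing
# ||w0*a + w1*b - y||^2 on a holdout set, via 2x2 normal equations.

def blend_weights(preds_a, preds_b, truth):
    aa = sum(x * x for x in preds_a)
    bb = sum(x * x for x in preds_b)
    ab = sum(x * z for x, z in zip(preds_a, preds_b))
    ay = sum(x * y for x, y in zip(preds_a, truth))
    by = sum(z * y for z, y in zip(preds_b, truth))
    det = aa * bb - ab * ab
    return (ay * bb - by * ab) / det, (by * aa - ay * ab) / det

def blend(w, a, b):
    return w[0] * a + w[1] * b

# Invented holdout: predictor A (say, a KNN) and predictor B (say, an SVD model).
a = [3.0, 4.0, 2.0, 5.0]
b = [3.5, 3.5, 2.5, 4.5]
y = [3.2, 3.9, 2.3, 4.8]
w = blend_weights(a, b, y)
```

Since weights (1, 0) and (0, 1) are in the search space, the blend can never do worse on the holdout than either predictor alone; the leading teams reportedly stack hundreds of predictors with a learned meta-model instead of just two.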


Correct. When using a latent factor method, providing an explanation for the prediction has been essentially intractable.

A common post-processing step for many teams (particularly BellKor/KorBell) is a KNN. The KNNs are used to provide a sort of confidence metric, and if anyone so decided, using the KNN-derived network for explanation is pretty simple and not really cheating.
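A hedged sketch of how such a KNN layer could supply explanations even when the rating itself comes from a black box (item vectors and names below are made up): rank the user's rated items by similarity to the target item and show the nearest ones.

```python
from math import sqrt

# Sketch: an item-item KNN supplies the "because you enjoyed ..." list
# by ranking the user's rated items by similarity to the target item.

def cosine(u, v):
    num = sum(u[k] * v[k] for k in u if k in v)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def explain(item_vectors, user_rated, target, k=2):
    # item_vectors maps item -> {user: rating}; rank by similarity to target.
    sims = [(cosine(item_vectors[target], item_vectors[i]), i)
            for i in user_rated if i != target]
    return [i for _, i in sorted(sims, reverse=True)[:k]]

vectors = {
    "twin peaks":    {"a": 5, "b": 4, "c": 1},
    "mulholland dr": {"a": 5, "b": 5, "c": 1},
    "wall-e":        {"a": 1, "b": 2, "c": 5},
}
reasons = explain(vectors, ["mulholland dr", "wall-e"], "twin peaks", k=1)
```

The prediction and the explanation come from different models, which is exactly the "not really cheating" point.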

Edit: Yehuda might have made some breakthrough - he'll be presenting a paper at KDD, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model"; the paper will probably be released after the conference.


It's already on his page: http://www.research.att.com/~yehuda/index_pubs.html

4th one down


It's interesting, because there's evidence that that's what we as humans do: the subconscious decides something, then we make up a story when asked why we did X.


This is fascinating. I was aware of one half of what the article talks about: my co-author and I broke the anonymity of the Netflix data (see http://www.cs.utexas.edu/~arvindn/ for paper/press links). Our main insight was that everyone's movie-watching behavior is different. The quote "User tastes are infinite shades of grey" in the article just about sums it up perfectly.

What's funny is that I keep arguing for using more metadata with my friends who are participating in the competition. I guess I didn't realize that data-mining algorithms actually capture the nuances of user tastes.


This reminds me of some of the spam filtering algorithms I've read about.

You'd think that categorizing spam based on keywords (or sender IP, etc) would be useful, but machine learning algorithms can pick up more subtle nuances of language patterns and act more effectively.

http://portal.acm.org/citation.cfm?id=1216017&jmp=cit...


In addition, it can pick up things you never thought of as properties/clues for classification. PG has written about this in his spam posts.


I made a simple spam filter in Erlang based on the Bayesian approach; it works really well already with just a few spam emails.
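For comparison, here's a minimal version of the Bayesian approach as a Python sketch (the commenter's was in Erlang; the training mails below are invented toy data): multinomial naive Bayes with Laplace smoothing.

```python
from collections import Counter
from math import log

# Minimal multinomial naive Bayes spam filter with Laplace smoothing.

class SpamFilter:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.docs[label] += 1
        self.words[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_docs = sum(self.docs.values())
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.words[label].values())
            score = log(self.docs[label] / total_docs)  # class prior
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a class.
                score += log((self.words[label][w] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

f = SpamFilter()
f.train("spam", "buy cheap pills now")
f.train("spam", "cheap pills online")
f.train("ham", "meeting schedule for monday")
f.train("ham", "lunch on monday")
```

Even with four training mails it separates the two toy classes, which matches the "works with just a few emails" observation.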


Metadata is in fact useful, though not the metadata that you might expect. One of the biggest wins many teams made was when they started ranking similarity based on edit distance of titles.


Levenshtein distance for predictions? Haha.

I'd be really curious to know which teams in particular use movie metadata. Yehuda Koren (the first commenter in the blog post) has explicitly stated many times that, in his humble opinion, movie titles and any other non-explicit info are useless.

BellKor, BigChaos, Gravity, and Gavin Potter (just a guy in a garage) are going to be presenting at Yehuda's KDD workshop next week. I'm sure other teams will also be represented. I'll ask them if they use movie metadata, and I'm pretty sure the answer will be no.


Edit distance of titles?! Do you have a source? I'm very curious about how and why that would help.


Indiana Jones and the _______________.


"Heat" and "WALL-E" have a shorter edit distance between them than any Indiana Jones movies.


Not if you normalize by, e.g., minimal string length.
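A quick sketch of that point (the titles are real, everything else is illustrative): raw Levenshtein distance makes short unrelated titles look close, while normalizing by length restores the franchise signal.

```python
# Raw edit distance favors short unrelated titles; dividing by the
# longer title's length reverses the ranking.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

t1 = "indiana jones and the temple of doom"
t2 = "indiana jones and the last crusade"
```

Raw distance indeed ranks "heat" vs "wall-e" as closer than the two franchise titles; the normalized distance flips that.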


Here's a paper on the BellKor solution, from one of the top teams:

http://research.att.com/~volinsky/netflix/ProgressPrize2007B...


Yehuda later wrote http://glinden.blogspot.com/2008/03/using-imdb-data-for-netf... and http://hunch.net/?p=331 that using movie metadata has produced no measurable improvement in RMSE.


The crux of the argument, though, is that if you have a strong CF model with many, many ratings, you don't seem to get much benefit from their approach (a linear combination of models). That doesn't mean metadata can't be useful with a different approach. It also doesn't mean metadata isn't useful for sparse data: in fact, it's incredibly useful there, because you don't have much of anything else.


I cannot dispute that metadata can be useful. But it appears, at least for prediction tasks similar to the prize, that an ounce of weak or strong explicit user input is worth a ton of rich implicit data (including item metadata).


This makes perfect sense; genres and other information about movies are just an approximation of user taste, while the actual taste of users themselves is clearly the best data to train your models on, since that's what you have to predict.

Any good model should be able to derive the relationships between films without knowing them beforehand, solely by using users' choices. And, likely, these relationships will be more useful than any from an external database.
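A toy latent factor sketch of that claim (data, hyperparameters, and names are invented; real systems add regularization and bias terms): learn user and item vectors from ratings alone, no genre labels anywhere.

```python
import random

# Learn user vectors P and item vectors Q by SGD on squared error,
# so P[u] . Q[i] approximates the observed rating r(u, i).

def factorize(ratings, users, items, k=2, epochs=2000, lr=0.05):
    random.seed(0)
    P = {u: [random.uniform(0.1, 0.5) for _ in range(k)] for u in users}
    Q = {i: [random.uniform(0.1, 0.5) for _ in range(k)] for i in items}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                P[u][f] += lr * err * q
                Q[i][f] += lr * err * p
    return P, Q

# Two "horror" items (h1, h2) and one "comedy" item (c1) -- but the
# model is never told that; it only sees the ratings.
ratings = {("a", "h1"): 5, ("a", "c1"): 1, ("b", "h1"): 5,
           ("b", "h2"): 5, ("c", "c1"): 5, ("c", "h1"): 1}
P, Q = factorize(ratings, ["a", "b", "c"], ["h1", "h2", "c1"])
```

After training, the dot products reproduce the observed ratings, and items rated alike tend to end up with similar vectors, i.e. the film relationships are derived solely from users' choices.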


I can see that metadata about the movies becomes worthless, because there is already a wealth of data about each movie entity in the dataset. However, metadata about the users should be fruitful since each user has fairly few data points to use for prediction.

Take, for example, two users who have each rated only WALL-E, and they both rated it a 5. Now, given Jet Li's "The One", what prediction do you give for each user? It is unlikely that two real people with this one data point on WALL-E would have the same outcome on "The One", so any additional data that can help statistically separate the people can only help your case. For example, is the person male or female? What are the person's favorite genres (something Netflix collects)? Even things like whether the person signed up for 6-at-a-time or 2-at-a-time might correlate slightly.
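A sketch of that fallback idea (profile fields like fav_genre are hypothetical, not actual Netflix data): when a user's own history is too thin, predict from the mean rating within their demographic segment.

```python
# Cold-start fallback: average ratings for the item within the new
# user's segment; fall back to a global default if the segment is empty.

def segment_mean(ratings, profiles, item, key, value, default=3.0):
    vals = [r for (u, i), r in ratings.items()
            if i == item and profiles[u].get(key) == value]
    return sum(vals) / len(vals) if vals else default

profiles = {"u1": {"fav_genre": "action"},
            "u2": {"fav_genre": "action"},
            "u3": {"fav_genre": "animation"}}
ratings = {("u1", "the one"): 4, ("u2", "the one"): 5, ("u3", "the one"): 2}

# A brand-new action fan who has only rated WALL-E gets the
# action-fans' mean for "The One" instead of a blind guess.
pred = segment_mean(ratings, profiles, "the one", "fav_genre", "action")
```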


The way I see it, these people have a set of data that one could describe as a line on an xy-axis. This line goes up, down, etc., with no apparent pattern. So they come up with a bunch of algorithms that get as near to the line as possible -> they approximate the line with an algorithm. From that, they can predict what the next step of the function is going to look like.

Metadata is like placing some dots on this line and saying "this spot is horror", "this spot is comedy". It becomes irrelevant, because you are already near enough to the line, and that dot does not help you any.

If I were dealing with this problem, I would break free of these constraints and concentrate on taking the data as an abstract blob of randomness, then splitting the individual data (i.e., moving data into separate 'dimensions') until I had hundreds of straight lines, and then using those for prediction. But I'm sure they must have tested this already :)

I'm rooting for the team with the two jewish guys and the black guy, afterwards they could get together and make a sitcom. Or a joke.


I could see how it's easier to learn users' simple preferences from their voting history, but it's shortsighted to say "all metadata is useless".

What about deriving statistical information from scripts, reviews, or online forums?


I would guess that an SVD/SVM feature extraction of the movie script could be of predictive value.
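A rough stdlib-only sketch of that guess (the script word counts below are invented): build a term-document matrix from script vocabularies and extract the dominant latent dimension by power iteration, a poor man's rank-1 SVD, giving one derived feature per movie.

```python
from math import sqrt

# Power iteration on A^T A yields the top right singular vector of the
# term-document matrix A; projecting each script onto it gives one
# latent feature per movie.

def top_singular_vector(matrix, iters=50):
    n = len(matrix[0])
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]  # A v
        v = [sum(matrix[d][j] * w[d] for d in range(len(matrix)))     # A^T w
             for j in range(n)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Columns: counts of "gun", "explosion", "love", "wedding" per script.
scripts = [[5, 4, 0, 1],   # action movie 1
           [4, 5, 1, 0],   # action movie 2
           [0, 1, 5, 4]]   # romance movie
v = top_singular_vector(scripts)
scores = [sum(row[j] * v[j] for j in range(len(v))) for row in scripts]
```

The two action scripts project to nearly identical scores while the romance lands far away, so even this single derived feature separates them; whether that would actually move RMSE on the Prize data is, per the thread, doubtful.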



