

Netflix prize competitor: With the best algorithms, metadata becomes worthless - bdr
http://pragmatictheory.blogspot.com/2008/08/you-want-truth-you-cant-handle-truth.html

======
gaika
Metadata is useful when you need to interpret the results, and most people
care about why something is recommended to them. The top Netflix algorithms
are black boxes from the user's standpoint, which doesn't help when deciding
whether to buy or rent based on that recommendation.

Compare that with Amazon's approach, where they sacrifice predictive power
for a useful explanation ("customers who bought X also bought Y").

~~~
andrewf
I wonder if, having used an impossible-to-explain algorithm to arrive at your
solution, you could compare your solution against a few very simple
algorithms, and offer up the best fit as a rationalization.
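
As a toy sketch of that idea (all the data and rule names here are made up):
score a few simple, explainable rules against the black box's output and
surface whichever one agrees best as the "explanation":

```python
import numpy as np

# Hypothetical predicted ratings from the opaque model for five movies.
blackbox = np.array([4.8, 4.5, 1.2, 4.6, 1.5])

# A few simple, human-readable candidate rules and their predictions.
simple_rules = {
    "you liked other sci-fi":        np.array([5.0, 4.0, 1.0, 5.0, 1.0]),
    "popular with most subscribers": np.array([3.0, 3.0, 3.0, 3.0, 3.0]),
}

def agreement(rule, target):
    """Negative mean squared error: higher means a closer fit."""
    return -np.mean((rule - target) ** 2)

# Pick the rule that best rationalizes the black box's predictions.
best = max(simple_rules, key=lambda name: agreement(simple_rules[name], blackbox))
```

Of course the rationalization is only as honest as the best simple rule's
actual fit, so you'd probably also want to report how well it agrees.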

~~~
byrneseyeview
Fine, but if anyone copyrights normal human reasoning, you're in trouble.

~~~
wheels
Legal nitpick: You meant patents. Patents are for processes, copyright is for
instances.

~~~
byrneseyeview
You're right. Thanks for the correction.

------
randomwalker
This is fascinating. I was aware of one half of what the article talks about:
my co-author and I broke the anonymity of the Netflix data (see
<http://www.cs.utexas.edu/~arvindn/> for paper/press links). Our main insight
was that everyone's movie watching behavior is different. The quote "User
tastes are infinite shades of grey" in the article just about sums it up
perfectly.

What's funny is that I keep arguing for using more metadata with my friends
who are participating in the competition. I guess I didn't realize that data
mining algorithms actually capture the nuances of user tastes.

------
martian
This reminds me of some of the spam filtering algorithms I've read about.

You'd think that categorizing spam based on keywords (or sender IP, etc) would
be useful, but machine learning algorithms can pick up more subtle nuances of
language patterns and act more effectively.

http://portal.acm.org/citation.cfm?id=1216017&jmp=cit&coll=GUIDE&dl=ACM

~~~
bluishgreen
In addition, they can pick up things you never thought of as
properties/clues for classification. PG has written about this in his spam
posts.
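
A minimal sketch in the spirit of PG's "A Plan for Spam" (toy corpus, made-up
messages; real filters use more careful smoothing and token selection), where
the word probabilities are learned rather than hand-picked:

```python
import math
from collections import Counter

# Tiny toy training corpus (hypothetical messages).
spam = ["win money now", "free money offer", "win free prize"]
ham = ["meeting at noon", "project status update", "lunch at noon"]

# Per-class word counts learned from the corpus.
spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts = Counter(w for msg in ham for w in msg.split())

def spam_score(message):
    """Sum of log-likelihood ratios with add-one smoothing; > 0 leans spam."""
    n_spam = sum(spam_counts.values())
    n_ham = sum(ham_counts.values())
    score = 0.0
    for w in message.split():
        p_spam = (spam_counts[w] + 1) / (n_spam + 1)
        p_ham = (ham_counts[w] + 1) / (n_ham + 1)
        score += math.log(p_spam / p_ham)
    return score
```

The point is that no one hand-labels "free" as spammy; the statistics find it,
along with any other clue that happens to separate the classes.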

------
DaniFong
Metadata is in fact useful, though not the metadata you might expect. One of
the biggest wins for many teams came when they started ranking similarity by
the edit distance between titles.

~~~
ambition
Edit distance of titles?! Do you have a source? I'm very curious about how and
why that would help.

~~~
trevelyan
Indiana Jones and the _______________.

~~~
akd
"Heat" and "WALL-E" have a shorter edit distance between them than any Indiana
Jones movies.

~~~
jsn
Not if you normalize by, e.g., the minimal string length.
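
A quick sketch of both points: raw Levenshtein distance does make
"Heat"/"WALL-E" look closer than two Indiana Jones titles, but dividing by
the shorter title's length flips the ordering:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized(a, b):
    """Edit distance divided by the minimal string length."""
    return levenshtein(a, b) / min(len(a), len(b))

indy1 = "Indiana Jones and the Last Crusade"
indy2 = "Indiana Jones and the Temple of Doom"
```

The long shared prefix "Indiana Jones and the" dominates once you account for
length, which is exactly the sequel/franchise signal you'd want.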

------
DarkShikari
This makes perfect sense; genres and other information about movies are just
an approximation of user taste, while the actual taste of users themselves is
clearly the best data to train your models on, since that's what you have to
predict.

Any good model should be able to derive the relationships between films
_without_ knowing them beforehand, solely by using users' choices. And,
likely, these relationships will be more useful than any from an external
database.
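
A minimal sketch of that claim (toy ratings, nothing like the actual Netflix
models): factor the rating matrix with an SVD and measure movie-movie
similarity purely from users' choices, with no genre labels anywhere:

```python
import numpy as np

# Rows = users, cols = movies (toy data; 0 stands in for a weak rating).
# Users 0-2 like movies 0,1; users 3-5 like movies 2,3.
R = np.array([
    [5, 4, 1, 0],
    [4, 5, 0, 1],
    [5, 5, 1, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [1, 1, 5, 5],
], dtype=float)

# Low-rank factorization: per-movie latent factors live in Vt.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
movie_factors = Vt[:k].T          # one k-dimensional vector per movie

def sim(i, j):
    """Cosine similarity between two movies' latent factors."""
    a, b = movie_factors[i], movie_factors[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Movies 0 and 1 come out far more alike than movies 0 and 2, even though the
model was never told they share a "genre" - the relationship falls out of the
ratings alone.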

------
ggrot
I can see that metadata about the movies becomes worthless, because there is
already a wealth of data about each movie entity in the dataset. However,
metadata about the users should be fruitful since each user has fairly few
data points to use for prediction.

Take for example two users who have each rated _only_ Wall-E, and they both
rated it a 5. Now, given Jet Li's "The One", what prediction do you give for
each user? It is unlikely that two real people with this one data point on
Wall-E would have the same outcome on "The One", so any additional data that
can help to statistically separate the people can only help your case. For
example, is the person male or female? What are the person's favorite genres
(something Netflix collects)? Even things like whether the person signed up
for 6-at-a-time or 2-at-a-time might correlate slightly.

------
maxklein
The way I see it, these people have a set of data that one could say is a
line on an xy axis. This line goes up, down, etc., and there does not seem to
be any pattern. So they come up with a bunch of algorithms that go as near to
the line as possible -> they approximate the line with an algorithm. From
that, they can predict what the next step of the function is going to look
like.

Metadata is like placing some dots on this line and saying "this spot is
horror", "this spot is comedy". It becomes irrelevant, because you are already
near enough to the line, and that dot does not help you any.

If I were dealing with this problem, what I would do is break free of these
constraints and concentrate on taking the data as an abstract blob of
randomness, then splitting the individual data (i.e., moving the data into
separate 'dimensions') till I had hundreds of straight lines, and then using
those for prediction. But I'm sure they must have tested this already :)

I'm rooting for the team with the two Jewish guys and the black guy.
Afterwards, they could get together and make a sitcom. Or a joke.

------
andreyf
I could see how it's easier to learn simple user preferences from their
voting history, but it's shortsighted to say "all metadata is useless".

What about deriving statistical information from scripts, reviews, or online
forums?

------
aswanson
I would guess that an SVD/SVM feature extraction of the movie script could be
of predictive value.
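
Something like LSA would be the simplest version of that (toy "scripts" and
vocabulary below are invented): build a term-document count matrix over the
scripts and take an SVD to get low-dimensional script features a rating model
could consume:

```python
import numpy as np

# Hypothetical mini-corpus of movie scripts, boiled down to a few words.
scripts = {
    "space_movie_1": "ship alien space laser ship",
    "space_movie_2": "alien space ship crew",
    "romance_movie": "love letter kiss love",
}

vocab = sorted({w for text in scripts.values() for w in text.split()})

# Term-document matrix: rows = movies, cols = vocabulary terms.
X = np.array([[text.split().count(w) for w in vocab]
              for text in scripts.values()], dtype=float)

# SVD compresses each script into k latent "topic" coordinates.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
features = U[:, :k] * s[:k]       # one k-dim feature vector per script

def cos(i, j):
    """Cosine similarity between two scripts' latent features."""
    a, b = features[i], features[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

The two space scripts end up near each other in the latent space while the
romance script lands elsewhere, so these features could plausibly feed a
rating predictor.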

