
More data usually beats better algorithms - toffer
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
======
gaika
Not true - from the current leaders of the Netflix Prize:

Our experiments clearly show that once you have strong CF models, such extra
data is redundant and cannot improve accuracy on the Netflix dataset.

<http://glinden.blogspot.com/2008/03/using-imdb-data-for-netflix-prize.html>

~~~
greendestiny
The students used a simple algorithm and got nearly the same results as the
BellKor team. So the extra data isn't redundant if it enables a simpler
algorithm to perform as well as a more complicated one, even if the
complicated algorithm gets no benefit from the extra data.
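A toy sketch of that trade-off (invented data, and not the students' actual method): once a genre table is available as side information, even a near-trivial predictor, the user's mean rating within the movie's genre, becomes workable.

```python
# Toy stand-ins for the Netflix ratings plus IMDB-style genre side data.
from collections import defaultdict

ratings = [  # (user, movie, stars)
    ("u1", "Alien", 5), ("u1", "Blade Runner", 4), ("u1", "Notting Hill", 2),
    ("u2", "Alien", 4), ("u2", "Notting Hill", 5),
]
genres = {  # the extra data: movie -> genre
    "Alien": "sci-fi", "Blade Runner": "sci-fi", "Notting Hill": "romance",
}

by_user_genre = defaultdict(list)  # (user, genre) -> that user's ratings there
by_user = defaultdict(list)        # user -> all of that user's ratings
for user, movie, stars in ratings:
    by_user_genre[(user, genres[movie])].append(stars)
    by_user[user].append(stars)

def predict(user, movie):
    # Mean rating the user gave this movie's genre; fall back to their overall mean.
    pool = by_user_genre.get((user, genres[movie])) or by_user[user]
    return sum(pool) / len(pool)

print(predict("u1", "Alien"))  # u1's sci-fi average: (5 + 4) / 2 = 4.5
```

The point is only that the side data carries the signal; the "algorithm" is a lookup and an average.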

~~~
gaika
Would you, as a human, prefer a model that gives you accurate predictions with
less data, or a simpler model that needs more data to be accurate?

How is an algorithm any different?

~~~
greendestiny
Well, put it this way: getting good predictions without a lot of data takes a
whole team of researchers at Bell Labs working on it for over a year; with the
extra data, it's a student project.

------
michaelneale
Well - that's what Google keeps saying (specifically, I think Peter Norvig has
said, over and over, that he's never had access to this much data before and
that it's fascinating to him).

Ah, Peter Norvig, responsible for hours of my time whiled away on his website
with all sorts of knowledge porn.

~~~
henning
One talk in particular in which he focuses on this idea, complete with facts
and figures from the research literature, is one he gave a while ago called
"Theorizing from Data": <http://www.youtube.com/watch?v=nU8DcBF-qo4> . You
should definitely watch it if you haven't seen it before.

Norvig states his opinion slightly differently: of course being smart about
algorithms is good. But until you have a lot of data, you often can't even
fairly evaluate different algorithms. It's only when you're no longer getting
significant gains from more data that you should start thinking about being an
algorithm smartypants.
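That discipline can be sketched as a learning-curve comparison: train on growing slices of the data and only trust the algorithm comparison once the curves flatten. Everything below is invented for illustration (synthetic ratings, and two deliberately dumb predictors: a single global mean vs. per-movie means).

```python
import math
import random
from collections import defaultdict

random.seed(0)
# Synthetic (movie, rating) pairs: each movie has a hidden true mean, plus noise.
true_mean = {m: random.uniform(2, 5) for m in range(20)}
data = [(m, true_mean[m] + random.gauss(0, 1)) for m in range(20) for _ in range(50)]
random.shuffle(data)
train, test = data[:800], data[800:]

def rmse(predict):
    return math.sqrt(sum((predict(m) - r) ** 2 for m, r in test) / len(test))

for n in (25, 100, 400, 800):  # growing training sets
    subset = train[:n]
    g = sum(r for _, r in subset) / n   # algorithm A: one global mean
    per = defaultdict(list)             # algorithm B: a mean per movie
    for m, r in subset:
        per[m].append(r)
    b = lambda m: sum(per[m]) / len(per[m]) if per[m] else g
    print(n, round(rmse(lambda _m: g), 3), round(rmse(b), 3))
```

On tiny slices the per-movie model has too few samples per movie to show its advantage; only as n grows does the comparison become fair, which is the shape of Norvig's argument.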

~~~
sammyo
Now that is just a marvelous restatement of the canonical optimization
principle!

------
jsomers
"Team B used a very simple algorithm, but they added in additional data beyond
the Netflix set: information about movie genres from the Internet Movie
Database (IMDB)."

I had no idea IMDB had so much genre data. E.g., a "keywords" page for every
movie [<http://www.imdb.com/title/tt0062622/keywords>] and, for every keyword,
maps of (a) related keywords and (b) movies that mention it [cf.
<http://www.imdb.com/keyword/metaphysical/>].

Very cool.
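Both maps can in principle be derived from nothing but movie-to-keyword tags: invert the tags for (b), and call two keywords related for (a) when they tag the same movie. A minimal sketch with invented tags (I don't know how IMDB actually builds these pages):

```python
from collections import defaultdict
from itertools import combinations

keywords = {  # hypothetical movie -> keyword tags
    "2001: A Space Odyssey": {"metaphysical", "space", "artificial-intelligence"},
    "Solaris": {"metaphysical", "space"},
    "Blade Runner": {"artificial-intelligence", "dystopia"},
}

movies_for = defaultdict(set)  # keyword -> movies that mention it
related = defaultdict(set)     # keyword -> co-occurring keywords
for movie, tags in keywords.items():
    for t in tags:
        movies_for[t].add(movie)
    for a, b in combinations(tags, 2):  # every pair sharing this movie
        related[a].add(b)
        related[b].add(a)

print(sorted(movies_for["metaphysical"]))  # ['2001: A Space Odyssey', 'Solaris']
print(sorted(related["space"]))  # ['artificial-intelligence', 'metaphysical']
```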

------
stcredzero
More like bad/insufficient data defeats even good algorithms. When we
recommend movies to friends, we are often using very different and more useful
information than what's in the Netflix database.

------
csmajorfive
I did the same thing last semester, inspired by a class at Cornell. We came up
with a very, very simple graph-based algorithm that gets above the
competition's baseline (not quite BellKor level, but there's lots of tweaking
left to be done).
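For a sense of what "very simple graph-based" can mean here (a generic sketch, not necessarily the poster's algorithm): rank a user's unseen movies by counting 2-hop paths through co-fans in the bipartite user-movie graph.

```python
from collections import Counter, defaultdict

likes = {  # toy bipartite graph: user -> liked movies
    "a": {"Alien", "Heat"},
    "b": {"Alien", "Blade Runner"},
    "c": {"Heat", "Ronin"},
}
fans = defaultdict(set)  # inverted index: movie -> its fans
for user, movies in likes.items():
    for m in movies:
        fans[m].add(user)

def recommend(user, k=2):
    scores = Counter()
    for movie in likes[user]:                      # user -> liked movie
        for fan in fans[movie] - {user}:           # -> co-fan
            for cand in likes[fan] - likes[user]:  # -> candidate movie
                scores[cand] += 1                  # one 2-hop path found
    return [m for m, _ in scores.most_common(k)]

print(sorted(recommend("a")))  # ['Blade Runner', 'Ronin']
```

Path counting like this is essentially item co-occurrence, so it scales with simple inverted indexes rather than any matrix factorization machinery.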

Now -- I was under the impression that using extra proprietary data (like
IMDB) is beyond the bounds of the competition. Can anyone shed some light on
this? Maybe I should pick up the project again!

~~~
msg
"Why provide rating dates and movie names?"

Other datasets provide it. Cinematch doesn’t currently use this data. Use it
if you want.

As it happens we provided years of release for all but a few movies in the
dataset; those seven movies have NULL as the "year" of release. Sorry about
that.

"Why not provide other data about the movies, like genres, directors, or
actors?"

We know others do. Again, Cinematch doesn’t currently use any of this data.
Use it if you want.

<http://www.netflixprize.com/faq>

