

“When you have enough data, sometimes, you don’t have to be too clever” - smarterchild
http://blog.strafenet.com/2011/12/17/when-you-have-enough-data-sometimes-you-dont-have-to-be-too-clever/

======
keithpeter
When data is easy to collect, someone will ask you to collect it and someone
else will query the data and compile a report with _percentages_ in it. Then
someone else will worry about some of the _percentages_ being above or below
some _benchmark_. Then your work life will become less happy.

Example 1: Some years ago, I had to sit through a meeting where a committee
worried about a _2% drop_ in satisfaction scores on a student questionnaire.
No-one checked how many replies were involved (around 400, so it worked out to
about 6 fewer people in the second year than the first, as the ratings were
something like 75%).
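
For scale, the arithmetic behind that worry, as a quick Python sketch (the
reply count is approximate, so the headcount lands somewhere in the single
digits either way):

```python
# A 2% swing on a few hundred questionnaire replies, counted in people.
for replies in (300, 400):                 # plausible range of usable responses
    print(replies, round(replies * 0.02))  # -> 6 and 8 respondents
```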

Example 2: I recently had to add comments in a record system about students
whose attendance _percentage_ had dropped below 90%. That was 8 weeks into the
course...

------
tambourine_man
Sometimes I get the feeling that when we had less data, we were forced to
think harder and more daringly. I feel we lack new groundbreaking theoretical
frameworks because of this.

I don't know if Newton's laws would jump out of the paper if you simply threw
a ball along one million different vectors.

~~~
apu
On the other hand, a lot of new research (including possibly ground-breaking
theoretical results) is only possible now that we have access to large data.

We might initially process the large data with relatively simple techniques,
but on the reduced data we can then run more sophisticated methods that
actually work, because the underlying data comes from a huge number of
samples.

As but one example, in computer vision, the concept of "attributes" --
automatically labeling objects using descriptive words instead of categorical
ones, i.e., "this thing is _like_..." rather than "this thing _is_..." -- has
opened the door to a number of exciting advances. One is the concept of
"zero-shot learning": automatically recognizing an object that you've _never
seen an instance of before_ simply via a description. For example, one could
recognize beavers as "small, four-legged furry rodents with big teeth and a
flat tail", without having ever seen a beaver before. The training data for
this classifier need not include beavers, but only images which match the
individual attributes -- small, four-legged, furry, rodent, big teeth, flat
tail -- not necessarily all in the same image.
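
A rough sketch of that matching step in Python, with made-up attribute scores
and class descriptions (not any particular paper's exact formulation):

```python
import numpy as np

# Hypothetical class descriptions over six attributes:
# [small, four-legged, furry, rodent, big teeth, flat tail]
class_attributes = {
    "beaver": np.array([1, 1, 1, 1, 1, 1]),
    "horse":  np.array([0, 1, 1, 0, 0, 0]),
    "shark":  np.array([0, 0, 0, 0, 1, 0]),
}

def zero_shot_classify(scores):
    """Pick the class whose description best agrees with the per-attribute
    scores (in [0, 1]) predicted for one image by the attribute classifiers."""
    def agreement(desc):
        # Reward attributes the description expects, penalize ones it doesn't.
        return scores @ desc - scores @ (1 - desc)
    return max(class_attributes, key=lambda name: agreement(class_attributes[name]))

# Attribute scores for an image of an animal never seen in training
# (made up for illustration).
print(zero_shot_classify(np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.9])))  # beaver
```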

This kind of thing was not really possible before, because there just wasn't
enough data to train reliable classifiers for each attribute in any kind of
automated way.

Finally, as I alluded to at the beginning, these individual attribute
classifiers are often relatively simple algorithms, such as Support Vector
Machines (SVMs). Yet, the 2nd-stage algorithms that use the attribute values
to do something useful, such as the zero-shot learning application described
above, are often much more involved/advanced techniques.
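
A minimal sketch of that two-stage layout, assuming scikit-learn and random
stand-in data (a real pipeline would use image features and human-annotated
attribute labels):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-ins: 1000 "images" as 64-d feature vectors, plus binary labels
# for each of six attributes (random here, human-annotated in practice).
X = rng.normal(size=(1000, 64))
attribute_labels = (rng.random((1000, 6)) > 0.5).astype(int)

# Stage 1: one simple linear SVM per attribute.
stage1 = [LinearSVC().fit(X, attribute_labels[:, a]) for a in range(6)]

def attribute_scores(x):
    """Stack the per-attribute SVM outputs for one image; this vector is
    what a second-stage method (e.g. zero-shot matching) consumes."""
    return np.array([clf.decision_function(x[None, :])[0] for clf in stage1])

print(attribute_scores(X[0]))  # six real-valued attribute scores
```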

------
DanielRibeiro
This is pretty much the same conclusion Ilya Grigorik (founder of postrank,
which was recently bought by Google) came to: <http://vimeo.com/22513786>

------
glimcat
Naive Bayes: the "good enough" classifier.
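
For the curious, a minimal sketch of what "good enough" looks like in
practice, using scikit-learn on toy text data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy spam/ham data; with enough real examples this baseline is hard to beat.
texts = ["cheap pills now", "meeting at noon", "win cash now", "lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["cash pills now"]))  # -> ['spam']
```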

------
joe_the_user
Looking at the video, you could interpret his statement in two ways. Either
the headline -- _“When you have enough data, sometimes, you don’t have to be
too clever”_ -- or the sort-of-opposite -- _"AI has made so little progress
that we don't have anything much better than naive Bayes"_.

~~~
apu
I'd say both are somewhat true.

A lot of early "progress" in AI was found not to survive contact with the real
world -- for example, most of computer vision. This was because collecting
data was so expensive/difficult that only a few images could be captured for
many experiments, and the methods they came up with often worked okay for
those examples, _but nothing else_! So a lot of clever-seeming algorithms
ended up being rather useless in the real world, and the progress was
illusory.

I find that in computer vision (my area of research), a fundamental component
of many disparate problems is that you are trying to interpolate or
extrapolate data in a very complicated underlying space, where linear
approximations are completely unusable and the optimization is badly
underconstrained. The key is to come up with suitable regularizers that can
use prior information to constrain the problem appropriately.

Getting more data thus helps in two ways:

1. It reduces the amount of interpolation you have to do, since you can get a
denser sampling of the space.

2. It allows you to build up these priors using real data, making the
interpolation much better (see the toy sketch below).
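
A toy illustration of both points, assuming a 1-D signal and a simple
smoothness (Tikhonov) regularizer; all names and numbers here are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_with_smoothness_prior(x_obs, y_obs, grid, lam=1.0):
    """Tikhonov-regularized least squares on a grid: minimize the squared
    data misfit plus lam times the squared second differences of f.
    The smoothness term is the prior that pins the solution down
    wherever there are no samples."""
    n = len(grid)
    # Map each observation onto the grid (nearest index is fine here).
    idx = np.clip(np.searchsorted(grid, x_obs), 0, n - 1)
    S = np.zeros((len(x_obs), n))
    S[np.arange(len(x_obs)), idx] = 1.0
    D = np.diff(np.eye(n), n=2, axis=0)  # discrete curvature operator
    return np.linalg.solve(S.T @ S + lam * D.T @ D, S.T @ y_obs)

grid = np.linspace(0, 1, 200)
truth = np.sin(2 * np.pi * grid)

for n_samples in (10, 1000):  # sparse vs. dense sampling of the space
    xs = rng.random(n_samples)
    ys = np.sin(2 * np.pi * xs) + 0.1 * rng.normal(size=n_samples)
    f = fit_with_smoothness_prior(xs, ys, grid)
    print(n_samples, "samples -> mean abs error:",
          round(float(np.abs(f - truth).mean()), 3))
```

With ten samples the prior does most of the work and the gaps get filled with
whatever the regularizer prefers; with a thousand, the data itself pins the
curve down and the error drops.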

