
Official Google Research Blog: The Unreasonable Effectiveness of Data - Anon84
http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
======
presty
This line of thought follows previous posts on the same subject.

It's very Norvig-esque (<http://www.youtube.com/watch?v=LNjJTgXujno>), but
there's also [http://anand.typepad.com/datawocky/2008/03/more-data-usual.h...](http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)
and also Chris Anderson and Wired's flame bait
[http://www.wired.com/science/discoveries/magazine/16-07/pb_t...](http://www.wired.com/science/discoveries/magazine/16-07/pb_theory)
(that month's Wired was dedicated to this subject).

And as someone said in the previous discussions, this is the basis of the
scientific method, not its death.

~~~
bd
Golden quote from Norvig's talk at Startup School:

 _Q: What's your opinion about semantic web?

A: Semantic web. Future of the web. And it always will be._

Also:

 _If I assigned engineers to (semantic web) formats based on the percentage of
pages that had those formats, then the correct number of engineers for
semantic web was zero._

------
Anon84
Unfortunately, this only shows one side of the equation. Namely, the internet
behemoth side.

If you're Google, Yahoo!, or one of their peers, you can get away with
relying solely on correlations extracted directly from data. After all, you
have all the data you could possibly want, and whatever you don't have, you
can measure in a straightforward way.

Everybody else, however, has to do a much better job of developing the right
algorithms and insights to get the upper hand. The best way to do this, of
course, is to use whatever data you manage to scrape together.

Luckily, they also seem to recognize that sometimes data just isn't enough,
and they ask for help. You've seen this in the Netflix Prize, the AOL search
log debacle, and more recently in Microsoft's release of search logs for
WSCD09.

~~~
Retric
The Netflix Prize is a contest to discover how much you can do with pure
data. I don't see how you can place it on the semantic web side of things
when they don't do any tagging, etc.

~~~
Anon84
I said nothing about the semantic web.

I just said that sometimes, all the data in the world isn't enough if you
don't have the right algorithms or insights.

------
mikepellon
I think an interesting related avenue of research would be investigating
analytically the "emotional" content of the Internet. Jonathan Harris over at
<http://www.number27.org> has made some great strides in the area, looking at
blog posts and global news content (see <http://www.wefeelfine.org> and
<http://www.tenbyten.org>). While Harris has some very impressive
visualizations of massive amounts of data, I believe we are at the point
where we can move beyond just looking at massive collections of data and
begin to say something mathematically about the patterns and characteristics
that emerge from those sources. With the advent of cheap cloud computing,
e.g. Amazon EC2, such detailed and massive undertakings are now possible for
ordinary developers.

------
jacoblyles
"Let large quantities of data solve your problems" might not be the best
advice if you are hardware constrained. Not everyone has the petabytes of
storage and the terabytes of RAM that Google has.

I guess cutting edge natural language apps are going to be the playground of
the big boys until PCs reach the scale necessary to do experiments.

~~~
ntoshev
Practical natural language apps might require much less, though: see
Norvig's spell checker, for example. You could probably fit the Google index
from 1998 on a single modern machine.
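For reference, the spell checker mentioned here fits on a page: train word
frequencies on a corpus, then pick the most frequent known word within one
edit of the input. This is a condensed sketch of that idea; the tiny corpus
string below is invented for illustration (Norvig trains on megabytes of
text).

```python
# Condensed sketch of a Norvig-style statistical spell corrector.
import re
from collections import Counter

def words(text):
    return re.findall(r'[a-z]+', text.lower())

# Tiny stand-in corpus; a real corrector needs far more text.
WORDS = Counter(words("the quick brown fox spelling is the art of "
                      "writing words correctly the corpus matters"))

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    return {w for w in candidates if w in WORDS}

def correct(word):
    """Prefer the word itself if known, else the most frequent 1-edit neighbor."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])
```

All the "intelligence" is in the corpus counts, which is exactly the point
being made: more data, simpler algorithm.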

------
jorgem
Anyone know of an API to access that trillion word google corpus mentioned in
the article?

~~~
snprbob86
No API, but you can buy it on 6 DVDs for $150:

[http://googleresearch.blogspot.com/2006/08/all-our-n-gram-ar...](http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html)

[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13)
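If you do get the DVDs, the release is plain text: roughly one n-gram per
line, tokens separated by spaces, then a tab and the corpus count. A minimal
loader might look like this (the exact file layout is an assumption, and the
sample lines and counts are made up for illustration):

```python
# Parse Web 1T-style n-gram lines ("token token token<TAB>count")
# into a dict keyed by the n-gram tuple.

def load_ngrams(lines):
    """Map each n-gram (tuple of tokens) to its corpus frequency."""
    counts = {}
    for line in lines:
        ngram, count = line.rstrip("\n").rsplit("\t", 1)
        counts[tuple(ngram.split())] = int(count)
    return counts

# Illustrative sample; real files hold millions of lines like these.
sample = [
    "unreasonable effectiveness of\t1234",
    "effectiveness of data\t567",
]
ngrams = load_ngrams(sample)
```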

------
nl
Not big fans of the Semantic Web, then...

~~~
mikedouglas
I wouldn't be so sure. It's just that they're proposing a very different
method (statistical analysis) to uncover the inherent meaning in the text.
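As a toy illustration of that statistical route (plain distributional
semantics, not anything Google has published): words that occur in similar
contexts end up with similar count vectors, so a rough notion of meaning
falls out of co-occurrence statistics alone. The mini-corpus below is
invented for the example.

```python
# Distributional-semantics sketch: build per-word context-count vectors,
# then compare them with cosine similarity.
from collections import Counter
from math import sqrt

sentences = [
    "cats chase mice", "dogs chase cats",
    "cats eat fish", "dogs eat meat",
    "stocks rise today", "markets rise today",
]

# For each word, count the other words appearing in the same sentence.
contexts = {}
for s in sentences:
    tokens = s.split()
    for w in tokens:
        ctx = contexts.setdefault(w, Counter())
        ctx.update(t for t in tokens if t != w)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "cats" and "dogs" share contexts (chase, eat); "cats" and "stocks" share none.
sim_animal = cosine(contexts["cats"], contexts["dogs"])
sim_cross = cosine(contexts["cats"], contexts["stocks"])
```

No tagging or ontology is involved; the similarity structure comes entirely
from the data, which is the contrast with the semantic web approach being
drawn in this thread.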

