Official Google Research Blog: The Unreasonable Effectiveness of Data (googleresearch.blogspot.com)
52 points by Anon84 on Mar 26, 2009 | 16 comments



This line of thought follows previous posts on the same subject.

It's very Norvig-esque (http://www.youtube.com/watch?v=LNjJTgXujno), but there's also http://anand.typepad.com/datawocky/2008/03/more-data-usual.h... and Chris Anderson and Wired's flame bait http://www.wired.com/science/discoveries/magazine/16-07/pb_t... (that month's Wired was dedicated to this subject)

And like someone in the previous discussions said, this is the basis of the scientific method, not its death


Golden quote from Norvig's talk at Startup School:

Q: What's your opinion about semantic web?

A: Semantic web. Future of the web. And it always will be.

Also:

If I assigned engineers to (semantic web) formats based on the percentage of pages that had those formats, then the correct number of engineers for semantic web was zero.


I would also add the "Theorizing from data" talk from Norvig:

http://www.youtube.com/watch?v=nU8DcBF-qo4


Norvig had a great response to that wired article: http://norvig.com/fact-check.html


Unfortunately, this only shows one side of the equation. Namely, the internet behemoth side.

If you're Google, Yahoo! or one of their friends, you can get away with relying just on correlations extracted directly from data. After all, you have all the data you could possibly want, and if you don't have it, you can easily measure it in a straightforward way.

Everybody else, however, has to do a much better job of developing the right algorithms and insights to get the upper hand. The best way to do this, of course, is to use whatever data you manage to scrape together.

Luckily, the big companies also seem to recognize that sometimes data just isn't enough and ask for help. You've seen this in the Netflix prize, the AOL search log debacle and more recently in Microsoft's release of search logs for WSCD09.


The Netflix prize is a contest to discover how much you can do with pure data. I don't see how you can place it on the semantic web side of things when they don't do any tagging etc.


I said nothing about the semantic web.

I just said that sometimes, all the data in the world isn't enough if you don't have the right algorithms or insights.


I think an interesting related avenue of research would be analytically investigating the "emotional" content of the Internet. Jonathan Harris over at http://www.number27.org has made some great strides in the area, looking at blog posts and global news content (see http://www.wefeelfine.org and http://www.tenbyten.org). While Harris has some very impressive visualizations of massive amounts of data, I believe we are at the point where we can move beyond just looking at massive collections of data and begin to say something mathematically about the patterns and characteristics that emerge from those sources. With the advent of cheap cloud computing, e.g. Amazon EC2, such detailed and massive undertakings are now possible for ordinary developers.


"Let large quantities of data solve your problems" might not be the best advice if you are hardware constrained. Not everyone has the petabytes of storage and the terabytes of RAM that Google has.

I guess cutting-edge natural language apps are going to be the playground of the big boys until PCs reach the scale necessary to do experiments.


Practical natural language apps might require much less, though: see Norvig's spell checker for example. You can probably fit the Google index from 1998 on a single modern machine.
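
To give a rough idea of how little machinery a frequency-based corrector needs, here is a minimal sketch of the general technique: count words in some text, generate every string one edit away from the input, and pick the most frequent known candidate. This is not Norvig's actual code; the toy corpus and the names `CORPUS`, `WORD_COUNTS`, `edits1` and `correct` are assumptions made up for illustration, and a usable corrector would need far more text.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration only; a real corrector
# would count words from a much larger body of text).
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks at the fox"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    # Break ties on the word itself so the result is deterministic.
    return max(candidates, key=lambda w: (WORD_COUNTS[w], w))
```

With the toy corpus above, `correct("teh")` picks "the" and an unknown string with no close neighbour is returned unchanged. The quality of the suggestions depends almost entirely on how much text went into the counts, which is the point of the article.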


Anyone know of an API to access that trillion-word Google corpus mentioned in the article?



Not big fans of the Semantic Web, then...


I wouldn't be so sure. It's just that they're proposing a very different method (statistical analysis) to uncover the inherent meaning in the text.


I thought one of their main points in the paper is that there will always be more data sets in "natural" form than in conveniently marked-up form, so a researcher has to develop tools that cope with natural data as it is. Then the next new-and-improved scheme for semantic mark-up can learn from what is observed in vast data sets.


I remember watching one of Norvig's Tech Talks where he was asked specifically about the semantic web and his response: "the semantic web is the future of the web... and always will be"



