

Google Is All About Large Amounts of Data - bootload
http://googlesystem.blogspot.com/2007/12/google-is-all-about-large-amounts-of.html

======
bootload
_"I have always believed (well, at least for the past 15 years) that the way
to get better understanding of text is through statistics rather than through
hand-crafted grammars and lexicons. The statistical approach is cheaper,
faster, more robust, easier to internationalize, and so far more effective."_

That's how Google is attacking the _"parsing text problem"_ to find meaning.
[0] Not with regular expressions, rules or clever AI hacks. Just plain old
math.

[0] Attributed to Peter Norvig. Here's an example of the approach he suggests:
a spell checker in about 25 lines of Python (old but good) ~
<http://norvig.com/spell-correct.html>
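
To give a rough idea of what "just plain old math" means here, below is a minimal sketch of a frequency-based corrector in the same spirit. It is not the code from the linked page: the toy `CORPUS` and the names `COUNTS`, `edits1`, `known` and `correct` are assumptions made up for this illustration, and a usable corrector would need word counts from a far larger body of text.

```python
import re
from collections import Counter

# Toy corpus, invented for this sketch; real word counts would come
# from millions of words of text.
CORPUS = """
the quick brown fox jumps over the lazy dog the dog barks
spelling correction works better with more data and more text
"""

def words(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies stand in for the probability of each word.
COUNTS = Counter(words(CORPUS))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """Keep only the candidates that were seen in the corpus."""
    return {w for w in candidates if w in COUNTS}

def correct(word):
    """Prefer the word itself if known, else the most frequent known
    word one edit away, else give the word back unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: COUNTS[w])
```

No grammar or lexicon is written by hand: the only "knowledge" is the word counts, so feeding it more text is what makes it better.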

~~~
andreyf
_A spell checker in about 25 lines of Python (old but good)_

21 lines of Python 2.5 code.

------
robg
What I find interesting is that it's much better to leverage quality data than
a larger quantity of crappy data. To me, it's the difference between running a
well-designed study on a small group versus a large study that's poorly
controlled. Computational gymnastics can only do so much to clean up a
multivariate mess. To improve data quality, you need to understand user
psychology (i.e., better design). But any engineer can build a massive
database. The problem is, how do you decide what's most important for the
problem at hand? Collecting more data, to figure it out later, only introduces
more noise into the analysis.

------
henning
For a more technical take on this idea, search YouTube for a Peter Norvig talk
called "theorizing from data".

~~~
dood
[<http://www.youtube.com/watch?v=nU8DcBF-qo4>]

It's a good talk, though maybe it should be called "theorizing from massive
amounts of data". Makes you wonder what Google are keeping under wraps for
now, and how the Powerset approach can compete, except maybe for
domain-specific stuff.

~~~
henning
Well, they do statistical learning of inflections and closely related terms.

Google "knows" (believes with high probability, I guess) that PWC is an
abbreviation of PricewaterhouseCoopers, for instance.

You can see this directly in how Google highlights terms in search results.
You can certainly find instances where they "should" have gotten something but
didn't, or made a mistake, but it works well enough most of the time.

------
cellis
Very interesting to me. But I know very little about Artificial Intelligence
or ML.

