
How Google understands language like a 10-year-old - gibsonf1
http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/10/18/BUJ61FTF9I.DTL
======
jakevoytko
I found this article wanting. It seems like they took an official Google Blog
post from January [0] and stripped out all of the interesting information.

I have a more specific question about Google Search that I'd like to see
answered. To what extent do they model specific languages, versus training
classifiers? Are they really grokking sentences or sentence fragments, or do
they have enough training data to fake it, like Bill Gates in "Petals Around
the Rose" [1]?

[0] <http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html>

[1] <http://www.borrett.id.au/computing/petals-bg.htm>

~~~
syllogism
Yeah, it's really a jumble of ideas. I'm a postdoc doing research on syntactic
parsing, and I'm very sure that Google doesn't currently use a syntactic
parser. Parsers are currently either too inefficient or too inaccurate to use
at web scale --- even for Google.

The problem is that natural languages are at least context-free, and no
algorithm exists (can one exist?) to parse arbitrary context-free languages in
less than polynomial time with respect to the length of the sentence. You can
approximate by parsing with probabilistic finite-state machines, but they get
led down blind alleys and can't backtrack, so they're inaccurate. Let me know
if you'd like me to elaborate on this with examples (or you can Wiki "garden
path sentence" and probably imagine the problem).
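To make the blind-alley point concrete, here's a toy sketch (lexicon and probabilities invented for illustration): a greedy, finite-state-style tagger that commits to the locally most likely tag and never backtracks gets garden-pathed by "the old man the boats", which needs "old" as a noun and "man" as a verb.

```python
# Toy greedy tagger: commits left-to-right, cannot backtrack.
# The lexical probabilities here are made up for illustration.
lex = {
    "the":   {"DET": 0.99, "NOUN": 0.01},
    "old":   {"ADJ": 0.9,  "NOUN": 0.1},
    "man":   {"NOUN": 0.9, "VERB": 0.1},   # "man" as a verb is rare
    "boats": {"NOUN": 1.0},
}

def greedy_tag(words):
    """Pick the locally best tag for each word, never revisiting a choice."""
    return [max(lex[w], key=lex[w].get) for w in words]

# The correct reading has NOUN("old") and VERB("man"), but the greedy
# pass commits to the locally likely tags and ends up with no verb at all:
print(greedy_tag("the old man the boats".split()))
# -> ['DET', 'ADJ', 'NOUN', 'DET', 'NOUN']
```

A chart parser would keep both readings alive and recover; a one-pass machine can't.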

I'm also sure they're not doing supervised word sense disambiguation. That's
in a poor state too, and imho isn't even a good idea. The whole concept of
having someone list out the "senses" of a word is misguided, because it's
totally unclear how fine-grained you should be. And then you need at least a
couple of hundred labelled examples for every word...

Most of the examples they give are best explained by dimensionality reduction
techniques, which have been popular in information retrieval for some time.
Google have undoubtedly invented some secret sauce, but they've also just got
orders of magnitude more data and processing power.
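For the curious, a minimal sketch of the kind of dimensionality reduction involved: latent semantic analysis via truncated SVD over a toy term-document count matrix (the vocabulary and counts here are invented). Terms that share contexts end up close together in the latent space, even with no "senses" listed anywhere.

```python
import numpy as np

# Toy term-document counts: rows = terms, columns = documents.
# "pancakes" and "butter" co-occur, so LSA places them close together.
terms = ["pancakes", "butter", "regex", "grep"]
X = np.array([
    [2, 3, 0, 0],   # pancakes
    [1, 2, 0, 0],   # butter
    [0, 0, 3, 1],   # regex
    [0, 0, 2, 2],   # grep
], dtype=float)

# Truncated SVD: keep the top k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # each row is a term in latent space

def sim(a, b):
    """Cosine similarity between two term vectors."""
    va, vb = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

# Terms that share contexts come out far more similar than terms that don't.
print(sim("pancakes", "butter") > sim("pancakes", "regex"))
```

At web scale the matrix is astronomically larger, but the idea is the same.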

~~~
nostrademons
Don't you mean context-sensitive? A context-free grammar can be parsed pretty
easily with a stack machine, and most programming languages are context-free.
C and C++ are even context-sensitive, but the context-sensitivity is mostly
limited to typedefs, and so doesn't tend to blow up parse times beyond reason.
(Well, many would consider C++'s compile time to be unreasonable, but this is
largely because of #include, which is another issue.)

~~~
nvoorhies
Most sane programming language syntaxes are confined to a subset of CFGs,
usually LALR, which is nice and easy to generate fast parsers for. To parse an
arbitrary CFG you have to fall back to strategies like CKY or GLR or what have
you, whose worst-case behavior is way worse than linear.
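A bare-bones CKY recognizer for a toy grammar (the grammar itself is invented) makes the cost visible: three nested loops over sentence positions, so O(n³) in sentence length before you even multiply in the grammar constant.

```python
from itertools import product

# Toy grammar in Chomsky normal form (illustrative, not a real treebank):
# S -> NP VP, VP -> V NP, NP -> "she" | "fish", V -> "eats"
binary = {("NP", "VP"): "S", ("V", "NP"): "VP"}
lexicon = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}

def cky_recognize(words):
    """CKY chart recognizer: O(n^3) in sentence length."""
    n = len(words)
    # chart[i][j] = set of nonterminals spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):          # span length
        for i in range(n - span + 1):     # start position
            j = i + span
            for k in range(i + 1, j):     # split point
                for left, right in product(chart[i][k], chart[k][j]):
                    if (left, right) in binary:
                        chart[i][j].add(binary[(left, right)])
    return "S" in chart[0][n]

print(cky_recognize("she eats fish".split()))  # True
```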

But yeah, context-sensitive stuff can be way worse.

Unless new evidence has surfaced in the years since I finished a degree in
linguistics, there are only a couple of pieces of evidence for constructions
in natural languages that can't be generated by a context-free grammar, like
an Adv1Adv2Adv3Adj1Adj2Adj3 construction in Zürich dialectal German (where
Adv = adverb and Adj = adjective, and the numbers indicate which adverb
modifies which adjective).

------
cryptoz
I remember laughing at people (in the late '90s) who tried to search the
Internet with English-language queries: "how do i cook pancakes without
butter" vs. pancakes +"no butter"

Anyway, these days I'd obviously be wrong. I'm so impressed with the strides
Google has made in NLP, and I fully expect them to beat everyone to Strong AI.
And why not? They know that the better they are, the better their advertising
revenue will be. And they know that once they get there, even if ads are no
longer profitable, having the world's only AI will be incomprehensibly
valuable.

My one problem with this article is the last line:

"They're still not approaching the conversations you'd have as a teenager."

Google hasn't yet approached "the conversations" you'd have with a 5-year-old.
While Google may _understand_ a 5-year-old's conversation, it certainly
couldn't participate in it and reply back to the kid.

~~~
donaq
_While Google may understand a 5-year-old's conversation, it certainly
couldn't participate in it and reply back to the kid_

I would argue that it sort of does. Only instead of a normal kid, it's a mute
kid that can only reply to you by passing you back documents it thinks you're
asking for.

~~~
nostrademons
It can actually answer simple factual queries now:

[http://www.google.com/search?q=what+is+the+height+of+the+emp...](http://www.google.com/search?q=what+is+the+height+of+the+empire+state+building)

[http://www.google.com/search?q=what+is+the+boiling+point+of+...](http://www.google.com/search?q=what+is+the+boiling+point+of+water)

~~~
cryptoz
The answer it gives for the height of the Empire State Building is totally
useless.

"1,250".

What?! 1,250...feet? inches? meters? centimeters?

------
jorleif
It's a bit strange to compare Google's language understanding to that of a
human, since Google does not really understand language so much as it is able
to return documents that are about the same thing the person writing a query
intends. Surely a two-year-old understands negation just fine, while Google
does not.

Google's understanding of language is similar to that of a savant who has been
imprisoned since birth, tied to a bench in front of a screen showing texts
from the web. He can't read in the sense of being able to pronounce the words,
but he recognizes familiar patterns of symbols. He does not know what
"pancakes" are, but he knows that the word is often seen with the word
"butter". It's amazing how much can be done this way, but it is quite
different from how humans understand.

~~~
moxiemk1
>> Google does not really understand language

I used to believe, like you do, that the ability to parse and make decisions
based on input was not the same thing as understanding.

Then I wrote a chess bot for a CS lab. The thing plays better than me, better
than its peers. My partner, who was good at chess (or at least very literate
in it), could identify what strategies it was going for. We had a
visualization of what moves it was considering, and you could see that it was
essentially playing chess by swinging a baseball bat around and seeing what
looked nice.

Does the chessbot "understand" chess? It sure seems like it. We like to think
humans are special, and have some kind of unique understanding that computers
can't, but I think it's only us lying to ourselves.

~~~
brosephius
I dunno, for me "understanding" something implies an ability to reason about
your reasoning. A rule-based AI can't do that, and even if interesting
patterns emerge, it's still deterministic.

~~~
sfphotoarts
This point crosses into the realm of metaphysics and philosophy, because it's
entirely possible that more than just Google's AI is deterministic. Besides,
as input for the AI they are using the behavior of millions of Google users.
It's likely as deterministic as you or I.

------
user24
Google probably use a tweaked version of the Viterbi Algorithm (
<http://en.wikipedia.org/wiki/Viterbi_algorithm> ) to perform POS tagging.
It's really not that hard if you have a tagged corpus.
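Roughly what that looks like, sketched with a toy hand-built HMM (the tag set, transition, and emission probabilities here are invented; a real tagger would estimate them from a tagged corpus):

```python
import math

# Toy HMM "estimated" from a tagged corpus -- all numbers illustrative.
states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.2,  "VERB": 0.7},
    "VERB": {"DET": 0.5,  "NOUN": 0.3,  "VERB": 0.2},
}
emit = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VERB": {"barks": 0.6, "walk": 0.4},
}

def viterbi(words):
    """Most likely tag sequence under the HMM, computed in log space."""
    # V[t][s] = (best log-prob ending in state s at time t, path so far)
    V = [{s: (math.log(start[s]) + math.log(emit[s].get(words[0], 1e-12)), [s])
          for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            best_prev, (score, path) = max(
                ((p, V[-1][p]) for p in states),
                key=lambda kv: kv[1][0] + math.log(trans[kv[0]][s]))
            row[s] = (score + math.log(trans[best_prev][s])
                      + math.log(emit[s].get(w, 1e-12)), path + [s])
        V.append(row)
    return max(V[-1].values(), key=lambda t: t[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Dynamic programming keeps it linear in sentence length (times the squared tag-set size), which is what makes it workable at scale.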

People always seem to look at Google as though they're doing some special
secret magic, but in fact they're really just implementing fairly well-known
CS algorithms. They just do it _exceptionally_ well.

~~~
plq
I don't think it's that they're good at implementing algorithms; rather,
they're good at running those algorithms at exceptionally high scale. (Once
implemented, Viterbi decoding is Viterbi decoding, after all.)

~~~
robryan
It's the scale of data available for training, too. There is a Google Research
paper called 'The Unreasonable Effectiveness of Data' talking about how, even
in a medium as noisy as the internet, there is a lot to be gained from that
level of data.

There's also a paper from someone at Google on brute-force paraphrase
acquisition using billions of sentences combined with some relatively simple
rules.

~~~
GFischer
Thanks for pointing out that paper ('The Unreasonable Effectiveness of Data'):

<http://www.scribd.com/doc/13863110/The-Unreasonable-Effectiveness-of-Data>

------
jcroberts
The article wasn't very useful compared to blog posts by Google itself, but
the discussions about what Google can "understand" always remind me of this:

<http://www.fun-images.com/thumbs/25-my-fucking-keysbig.jpg>

------
metageek
_The search engine has also begun to understand which words are synonyms for
others._

I was delighted the day I noticed it knew that "regexp" and "regex" were
synonyms for "regular expression".

------
mickdarling
So in the 15 years since they started using statistical methods for
understanding language, Google's ability to understand language has reached
about an 8-year-old's level. So it is learning about half as fast as a human
child. Not bad, and an excellent opportunity to predict its growth rate for
the future.

If this is a linear learning curve, in another 15 years it should just be
starting to 'understand' the nuances of Shakespeare and Ulysses, among
others.

