
On Chomsky and the Two Cultures of Statistical Learning - EdiX
http://norvig.com/chomsky.html
======
brockf
Chomsky's one-paragraph quote at the beginning of this article is clearer
and more thoughtful than the rest of it. I feel the author is missing the point.

In the case of language, observing and reporting statistical probabilities in
written/spoken language output does very little to explain the cognitive
systems used in acquiring and using language. Even one statistical anomaly
serves to show that statistical learning is NOT the entire picture when it
comes to language development.

There was another article on HN a while back that had another great quote from
Chomsky that does well to illustrate what I feel is his main point here:
"Fooling people into mistaking a submarine for a whale doesn't show that
submarines really swim; nor does it fail to establish the fact". Creating a
computer that can produce millions of grammatical utterances does little to
show that we understand language systems. Now, if a computer could - like
humans - _learn_ to produce infinite, novel, contextual, and meaningful
grammatical utterances, that's a different story. But that story will take a
lot more than statistical learning to write.

~~~
losvedir
>In the case of language, observing and reporting statistical probabilities in
written/spoken language output does very little to explain the cognitive
systems used in acquiring and using language.

Unless, of course, those cognitive systems are nothing more than some
statistical probabilistic mechanism. I don't know anything about the field,
but the article was interesting to me in that it seemed to at least partly
argue that. I know, for me at least, I'll frequently produce a sentence and
then repeat it to myself a few times to see if it "sounds right." Now, I don't
know what is happening to determine that, but perhaps I'm comparing it to some
statistical probabilistic model I have in my head?

> Even one statistical anomaly serves to show that statistical learning is NOT
> the entire picture when it comes to language development.

1) Does it? Maybe it shows the specific statistical probabilistic model in
question is wrong. Consider, as Chomsky did, a model which predicts zero
probability for a novel sentence. Clearly, as you say, one anomalous novel
sentence is all it takes to disprove such a model. But what about other models
which can handle them? The "anomaly" may not be an anomaly anymore.

2) Do you have some anomaly in mind which shows statistical probabilistic
models don't work?
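
A toy sketch of point 1): a bigram model with add-one (Laplace) smoothing, one of the simplest models that "can handle" novel sentences, assigns nonzero probability to a sentence it has never seen. The corpus and sentences below are invented for illustration.

```python
# A smoothed bigram language model: unlike a model that memorizes whole
# sentences, it never assigns zero probability to a novel sentence.
from collections import defaultdict

training = [["the", "dog", "runs"], ["the", "cat", "runs"]]

counts = defaultdict(lambda: defaultdict(int))
vocab = set()
for sent in training:
    for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
        counts[prev][word] += 1
        vocab.add(word)
vocab.add("</s>")

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing: unseen pairs get a small nonzero share.
    return (counts[prev][word] + 1) / (sum(counts[prev].values()) + len(vocab))

def sentence_prob(sent):
    p = 1.0
    for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
        p *= bigram_prob(prev, word)
    return p

novel = ["the", "dog", "sleeps"]  # never appears in training
print(sentence_prob(novel) > 0)   # True: novelty is not an anomaly here
```

A sentence-memorizing model would give `novel` probability zero and be falsified by a single new utterance; the smoothed model just ranks it as less likely than sentences closer to the training data.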

-----

The article was very interesting to me, but I don't know anything about the
field. I guess my main question boils down to: Is it possible that language
acquisition and production is nothing more inside our heads than a simple
statistical probabilistic model?

~~~
brockf
If that were true, then why did humans evolve to speak at all? Why, if speech
is simply a reaction to statistics we are tracking and behaviours that have
been rewarded, would the first utterances have been made? And how do we make
completely novel utterances that attempt to express our otherwise abstract
thoughts?

~~~
pygy_
_> Why, if speech is simply a reaction to statistics we are tracking and
behaviours that have been rewarded, would the first utterances have been
made?_

Why not? Look at it from the bottom up:

Communication is fundamental to life, from intra-cellular to inter-cellular to
inter-organism interactions (another fundamental is the ability to keep oneself
in a low-entropy state, at the expense of the rest of the world).

Human speech is an evolution of mammalian communication. It grew in
complexity, from grunts and other basic noises, along with our way of living,
up to what we have now.

 _> And how do we make completely novel utterances that attempt to express our
otherwise abstract thoughts?_

Speech is a big collage. Anything new is either the result of

* a recombination of the sub-parts of past speech, or

* the definition of a new word in terms of older words (or sometimes an arbitrary coinage, for proper nouns).

Nothing fancy AFAICT.

~~~
brockf
There's a big difference between "grunts and basic noises" and language. Or at
least, that's my opinion. In this same line, I don't believe
dogs/monkeys/birds/bees have language, despite the ability to communicate.

This view is just too simplistic to hold its weight when you really look at the
intricacies of language and its evolutionary history which, by the way, I
would suggest comes from manual gesture and not grunting.

~~~
pygy_
_> There's a big difference between "grunts and basic noises" and language. Or
at least, that's my opinion. In this same line, I don't believe
dogs/monkeys/birds/bees have language, despite the ability to communicate.
This view is just too simplistic to hold its weight when you really look at the
intricacies of language and its evolutionary history which, by the way, I
would suggest comes from manual gesture and not grunting._

Mu![1]

But you're probably right about gestures.

Wild chimps have a vocabulary of about 66 signs. We can also observe tribes
with languages more primitive than ours (no pronouns, for example). But
there's a missing link of several million years of evolution between the two.

What are the (known) intricacies of the evolution of our ability to
communicate?

There's no definitive proof for the statistical argument, but a growing amount
of (neuro)scientific evidence points to it[2]. What's (are) your alternative
hypothesis (or hypotheses)?

[1] <http://en.wikipedia.org/wiki/Mu_(negative)>

[2] <http://www.ncbi.nlm.nih.gov/pubmed/21533821>

~~~
brockf
This is where the debates really begin :)

I think that most people who believe in some form of the motor theory of
speech perception will also believe that speech evolved from manual gesture.

Others scoff at the motor theory. In fact, I'd say I'm in the minority by
bringing it up with any regularity.

If the question is what is "known" about the evolution of our ability to
communicate, I wouldn't have much to point you towards. Most is theory based
on modern evidence, somewhat like armchair psychology. Other people point to
our ability to integrate non-verbal gestures into our comprehension,
activation of our motor cortex prior to semantic/phonetic network activation
when disambiguating difficult speech sounds, our ability to synthesize
visual/auditory sources of information when the visual information relates to
speech gestures (mouth/tongue movements), etc.

~~~
pygy_
What's the link between the motor theory of speech perception and your
criticism of losvedir's post?

Aren't these issues completely orthogonal?

------
Jun8
This is not a new debate. Within Linguistics there has been a continuous push
against statistical NLP models. Read the introduction of Manning's book; even
he seems to be defensive about NLP.

Chomsky is a colossus, and his achievements are well-known. However, in many
disciplines it eventually comes to pass that the pioneers who paved the way
become the very impediment to new ideas. His emphasis on semantics has
warped the minds of _many_ generations of researchers (as have some of his
other ideas, on universal grammar, for example).

I experienced this first-hand: my advisor, Prof. Raskin, a great researcher on
semantics, nevertheless thought that statistical approaches were not the way
to go. Sadly, in many Linguistics departments people are just not equipped
with the statistical tools necessary to have a basic understanding of what's
being done in the NLP field. So NLP is generally taught under CS, EE, or
CompE.

~~~
adavies42
i saw someone once compare chomsky to freud, as a foundational figure whose
discipline can't/couldn't progress during his lifetime.

~~~
ordinary
Einstein would be another example.

------
christianpbrink
"If Chomsky had focused on the other side, interpretation, as Claude Shannon
did, he may have changed his tune. In interpretation (such as speech
recognition) the listener receives a noisy, ambiguous signal and needs to
decide which of many possible intended messages is most likely. Thus, it is
obvious that this is inherently a probabilistic problem, as was recognized
early on by all researchers in speech recognition..."

This is the money shot, especially since speakers are aware of the interpretive
activity of listeners, and effective speakers play constantly on the
ambiguities in their statements: structural (i.e. grammatical) ambiguities as
well as semantic ambiguities. Listeners in turn are aware of speakers'
awareness of this. There is, effectively, an infinity of mutual awarenesses
of structural ambiguities in any instance of communication.

I think most technologists and (especially) businesspeople see this
intuitively. I think many academics do not. Not sure how to articulate what I
mean but I think I am saying something non-trivial about academics and their
perspective on language.

~~~
cma
Freeman Dyson earlier this year on this type of ambiguity as expressed in the
drum language of the Democratic Republic of Congo:

<http://www.nybooks.com/articles/archives/2011/mar/10/how-we-know/>

------
CWuestefeld
Server's down. Here's a cached link:
[http://webcache.googleusercontent.com/search?q=cache:http%3A...](http://webcache.googleusercontent.com/search?q=cache:http%3A%2F%2Fnorvig.com%2Fchomsky.html)

EDIT: stop giving me upvotes. I've got 11 points now for nothing more than a
link. I don't deserve them. Stupid hidden points...

~~~
norvig
Sorry about the intermittent access. My hosting service provides me with
sufficient bandwidth, but only provides a version of Apache that forks a new
process for every GET, and thus runs out of processes and denies access to a
portion of visitors when I get slashdotted/redditted/hacker-newsified. If
anyone can suggest a more reasonable hosting service, let me know. -Peter
Norvig

~~~
alphamerik
[cough] I hear Google has pretty good bandwidth and scaling. Ever try App
Engine? [/cough]

------
PaulHoule
It's funny. Lately I've been working with NLP systems, and in the last few
years a few really good part-of-speech taggers have appeared that are about
99% accurate. All the ones I know of are based on hidden Markov models, which
would definitely disappoint Chomsky.

Part of the trouble w/ Chomsky is that real language doesn't draw a clear line
between syntax and semantics. Even though an HMM doesn't correctly model the
nested structures that are common in natural language, it makes up for it by
encoding semantic information.
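
As a toy illustration of how such a tagger works (the three-sentence corpus and tag set here are invented; real taggers train on large treebanks), transition and emission counts plus Viterbi decoding are enough to tag a short sentence:

```python
# Minimal HMM part-of-speech tagger: count tag transitions and word
# emissions from a tagged corpus, then Viterbi-decode new sentences.
from collections import defaultdict

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]

trans = defaultdict(lambda: defaultdict(int))  # tag -> next-tag counts
emit = defaultdict(lambda: defaultdict(int))   # tag -> word counts
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def prob(table, key, item):
    total = sum(table[key].values())
    return table[key][item] / total if total else 1e-6

def viterbi(words):
    tags = list(emit)
    # best[t] = (probability of the best tag path ending in t, that path)
    best = {t: (prob(trans, "<s>", t) * prob(emit, t, words[0]), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((best[pt][0] * prob(trans, pt, t) * prob(emit, t, w),
                        best[pt][1] + [t]) for pt in tags)
                for t in tags}
    return max(best.values())[1]

print(viterbi(["the", "dog", "sleeps"]))  # ['DET', 'NOUN', 'VERB']
```

A tagger like this encodes exactly the kind of local statistical regularity the comment describes; nested long-distance structure is what it cannot capture.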

~~~
sharmajai
Another trouble is that human beings are innately probabilistic when it comes
to language. A sentence written or spoken by humans does not have to be
grammatically correct to convey its meaning, and does not always follow the
strict rules that Chomsky talks about.

It's not the language that defines how we communicate; it's how we communicate
that defines the language.

But I also disagree with Peter when he says the why is not important. It is
this why, the understanding of the matter, that separates us from machines
like Watson, since our sole purpose in life is not to win at a game, but to
play and enjoy the game and, most importantly, to "reuse the understanding"
gained in some other facet of life, a feat that I believe no machine is
capable of.

------
wccrawford
"O'Reilly is correct that these questions can only be addressed by mythmaking,
religion or philosophy, not by science."

... My jaw is on the floor. It drives me nuts when people go from 'We can't
explain that yet' to 'The only explanation is God.'

The tides are incredibly complex when you insist on 'why' all the way back to
the beginning of the universe. Everything is!

~~~
torstein
>He doesn't care how the tides work, tell him why they work. Why is the moon
at the right distance to provide a gentle tide, and exert a stabilizing effect
on earth's axis of rotation, thus protecting life here? Why does gravity work
the way it does? Why does anything at all exist rather than not exist?
O'Reilly is correct that these questions can only be addressed by mythmaking,
religion or philosophy, not by science.

Science doesn't really aim to answer the 'why' questions, but rather the
'how' questions. The scientific method boils down to falsifying hypotheses,
and that's a lot easier with 'how does the tide work?' than with 'why does the
tide work (the way it does)?'.

Science can't say anything about 'Why does anything at all exist rather than
not exist?', because there is no way to test any of the answers. So it's left
to mythology, religion or philosophy to answer.

~~~
T-hawk
> Why is the moon at the right distance to provide a gentle tide, and exert a
> stabilizing effect on earth's axis of rotation, thus protecting life here?

A possible answer to this stems from the anthropic principle. We evolved in a
place with a moon because the moon helped us evolve; we never observe a
moonless sky because complex life such as ours would not have developed
without it. A stable rotation and gentle tide are conducive to the evolution
of complex organisms; tides were instrumental in getting life out of the seas
and onto land.

"Why is the sun the way it is?" can be answered similarly. A smaller star has
too small a habitable zone where liquid water can exist. A larger star would
have burned out sooner than the 4.5 billion years it took to develop sapient
life. A double star has a much smaller set of stable planetary orbits. That
the sun is an appropriate star for our life on earth is not divine providence
or an enormously unlikely coincidence; it's the result of a universe-wide
scenario of statistical multiple endpoints.

~~~
borism
_it's the result of a universe-wide scenario of statistical multiple
endpoints_

I totally agreed with you up to that point, which I have a hard time
understanding.

So you're saying the universe is kind of a fractal, and we happen to be in the
right place on that fractal, where all the ingredients come together?

~~~
T-hawk
Yes, but there's a causal relationship that I think you're not quite
expressing. We are where we are _because_ here is where all the ingredients
came together.

------
T_S_
The handshake example was illuminating. Three "equivalent" theories:

Theory A: Closed-form formula (a function).

Theory B: "Algorithm". Still a function.

Theory C: Memoized function (constant time!)
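
A sketch of the three theories in code (the "handshake problem": n people, each pair shakes hands once; the function names are mine, not the article's):

```python
def handshakes_a(n):
    """Theory A: closed-form formula."""
    return n * (n - 1) // 2

def handshakes_b(n):
    """Theory B: an algorithm that enumerates and counts each pair."""
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
    return total

# Theory C: a memoized table of precomputed answers; lookup is O(1).
handshakes_c = {n: handshakes_b(n) for n in range(100)}

assert handshakes_a(10) == handshakes_b(10) == handshakes_c[10] == 45
```

Converting C back to A is the hard direction the comment goes on to describe: recovering a short symbolic description from a table of values.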

According to the article "nobody" likes C, especially the article's Chomsky
straw man. If one had a procedure to convert C to A, then this whole issue
would become hairsplitting. Such a procedure would aim to convert a memoized
function back into a form that uses more symbols from a mathematical language.
A good criterion of success would be the description length of the resulting
procedure in the preferred language. One reason this could be useful to
science is that once you identify a value that is useful in many theories it
becomes part of the language. Making it available to the next problem may
speed up the search for a "good" description of the next phenomenon. Identical
procedures that appeared in various algorithms might acquire a special name.
One such value might be called "pi", another "foldr" and so on.

Of course there may be many good descriptions, just as there are many
languages. Also, the example could be extended to statistical modeling
situations by adding room for error terms in the suitability criteria.

So if you have a general procedure to convert a table into a definition, you
can make money and do science at the same time!

------
stcredzero
_My conclusion is that 100% of these articles are more about "accurately
modeling the world" than they are about "providing insight," although they all
have some theoretical insight component as well._

Before you can figure out _why_ , you have to make sure you can accurately
characterize the _what_. So there's a lot of science that is focused on coming
up with a descriptive tool like an ad hoc curve, before the underlying
principles are discovered.

I think Chomsky is afraid that statistical models will cause people to stop
looking for the underlying principles.

------
sethg
This essay made me think: Lojban (<http://www.lojban.org/tiki/la+lojban.+mo>),
among constructed languages, is the categorial language _par excellence_.
Every word has a well-defined range of meaning; the grammar can be parsed by
the same kinds of parsers used for programming languages; potential sources of
ambiguity, like plural references, associativity of modifiers, and negation,
have been rigorously (or tediously, depending on how you roll) nailed down.

Can there be such a thing as a conlang that demonstrates the ideal
_statistical_ grammar and semantics? (“All the words in this list are 60%
likely to be used as nouns and 40% likely to be used as verbs....” But in the
absence of a pre-existing linguistic community, how could you get students of
the language to use them in the right proportions?)

------
cma
Chomsky's April 8th lecture at Carleton University on language had several
thoughts on machine translation:

<http://www.youtube.com/watch?v=XbjVMq0k3uc>

(I think it even had the same bee-dance example)

------
double-z
The commentary has nothing to do with what Chomsky proposed. The author
defines success as "being successful at accomplishing a task". That has
nothing to do with science. Full stop.

------
macmac
Are Norvig's comments on "I before E except after C" really valid? Why
would one use a corpus for analysis of the rule, and not a dictionary? It
appears to me that "CIE" (P(CIE) = 0.0014) is more common than "CEI" (P(CEI) =
0.0005) because the words that contain the "exception" "CIE" are used more
frequently in the corpus than the words that follow the rule with "CEI". Once
you know the limited number of exceptions (in the dictionary sense), the rule
appears to preserve its relevance.

~~~
jimbokun
I suppose the most useful corpus for this rule would be spelling tests.

------
noahlt
Strangely appropriate is today's XKCD: <http://xkcd.com/904/>

~~~
kenjackson
Hmm... I never thought of it that way: that sports are a weighted random
number generator, but the various weights are unknown, and the commentators
are discussing theories as to what the weights are and how they are derived.
(Although the cartoon seems to be saying the narratives are just about the
numbers generated, which is more cynical and, frankly, less interesting.)

~~~
yourcelf
Actually, Larry Birnbaum over at Northwestern is doing exactly that:
<http://infolab.northwestern.edu/projects/stats-monkey/>

They take the coded sports results and automatically generate narratives
using statistical speech models. They have a startup that is doing it too;
I don't recall the name of it....

EDIT: I believe this is it: <http://narrativescience.com/>

------
_grrr
I've been monitoring the page this post points to with a bookmarking tool
we've just released in beta. Here is the latest set of changes:

[http://app.bookmarkerpro.com/changes?fmt=html&id=2573](http://app.bookmarkerpro.com/changes?fmt=html&id=2573)

Quite a few revisions since first posted to HN!

~~~
_grrr
More revisions... <http://tinyurl.com/3sabdc9>

------
niels_olson
This whole theory vs observation argument exists at the very pinnacle of human
thought, expressed in the Copenhagen interpretation. If you want to contribute
to the human understanding of this, you'll have to beat Bohr and the
uncertainty principle.

~~~
Create
Fourier was there first.

~~~
niels_olson
My claim wasn't first, it was top. Up-end the Copenhagen interpretation,
show the universe really is deterministic, and every other argument on this
subject, in every discipline, collapses. As it is, the arguments almost
certainly fail, but it's not quite a cinch, because probability admits
determinism as a special case, one of the deep points of Norvig's essay.

------
galactus
It is interesting that in a completely different debate, Chomsky takes
Norvig's position (he is accused of not looking for a "theory" and "whys",
and he replies that it is pragmatic results that matter):

<http://mindfulpleasures.blogspot.com/2011/01/noam-chomsky-on-derrida-foucault-lacan.html>

------
davidmathers
Chomsky called the Watson computer that won Jeopardy "a bigger bulldozer." He
goes into more detail about his AI opinions here:
<http://www.framingbusiness.net/archives/1366>

------
borism
_And while it may seem crass and anti-intellectual to consider a financial
measure of success_

Why are the other metrics Norvig provides, like articles published or
prevalence in practical applications, considered more intellectual?

And besides, I don't think "accurately modeling the world" is the end of it.
Classical Newtonian mechanics correctly describes 99% of our activities in the
real world and was considered the pinnacle of scientific achievement for
several centuries. Yet we know today that it is just a limiting case of
general relativity and quantum mechanics.

