
On Chomsky and the Two Cultures of Statistical Learning (2011) - kercker
http://norvig.com/chomsky.html
======
mrow84
I think that Norvig hits the nail on the head near the beginning of his piece:

"I believe that Chomsky has no objection to this kind of statistical model
[the Newtonian model of gravitational attraction]. Rather, he seems to reserve
his criticism for statistical models like Shannon's that have quadrillions of
parameters, not just one or two."

This is no more than an objection to problems of fitting your chosen model to
data. If you only have a small number of free parameters, then you can fit
your model with a reasonable amount of data. If you have a large number of
parameters then you have to introduce some extra assumptions, as Norvig (of
course) acknowledges slightly earlier (described as "smoothing", in context):

"For example, a decade before Chomsky, Claude Shannon proposed probabilistic
models of communication based on Markov chains of words. If you have a
vocabulary of 100,000 words and a second-order Markov model in which the
probability of a word depends on the previous two words, then you need a
quadrillion (10^15) probability values to specify the model. The only feasible
way to learn these 10^15 values is to gather statistics from data and
introduce some smoothing method for the many cases where there is no data."

Thus, although both models are statistical, it is much easier to have
confidence in Newton's law of gravitation than it is in a Markov model of some
communication channel, because the data paint a clear picture. The imprecision
of Newton's law in certain parts of the problem space (unobserved during his
time) is a moot point - any such objections apply equally well to models with
many parameters, and then you _still_ have to accept that you have made extra
assumptions "outside" the scope of your model.

If you can explore your entire problem space, then you can build a complete
"model". If not, then having more parameters than data _requires_ additional
assumptions. Chomsky's point stands.
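Norvig's arithmetic checks out: with a 100,000-word vocabulary, a second-order model conditions each word on the previous two, so there are 100,000^3 = 10^15 conditional probabilities to pin down. A toy sketch (corpus invented purely for illustration) shows why smoothing is the unavoidable "extra assumption" - almost every parameter corresponds to an event the data never exhibits:

```python
from collections import Counter

# Toy corpus standing in for training data; real models see billions of words.
corpus = "the dog chased the cat the cat chased the mouse".split()
vocab = sorted(set(corpus))
V = len(vocab)

bigrams = Counter(zip(corpus, corpus[1:]))   # observed word-pair counts
contexts = Counter(corpus[:-1])              # how often each context word occurs

# Even a first-order model has V**2 parameters, and most of the
# corresponding bigrams are never observed:
print(len(bigrams), "of", V ** 2, "possible bigrams observed")  # 7 of 25

def prob(w2, w1, alpha=1.0):
    """P(w2 | w1) with add-alpha (Laplace) smoothing -- the structural
    assumption that fills in the cells the data never touched."""
    return (bigrams[(w1, w2)] + alpha) / (contexts[w1] + alpha * V)

print(prob("cat", "the"))    # seen transition: 1/3
print(prob("dog", "mouse"))  # never seen, yet non-zero: 0.2
```

Add-one smoothing is the crudest choice; real systems use Kneser-Ney or similar, but every variant encodes some assumption about the cells the data left empty.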

~~~
foobarqux
No, Chomsky's problem with statistical models is that you can't learn much
about the underlying system from them, and so it isn't science, which has as a
fundamental aim to understand the world.

Statistical models can be useful but they generally aren't meaningful to
scientific progress.

~~~
mrow84
For the purposes of this discussion, Norvig has defined a statistical model
as:

"a mathematical model which is modified or trained by the input of data
points."

He then illustrates how what would be considered a "scientific" model,
Newton's law of gravitation, is a statistical model under his definition, but
a simple one with few parameters. He contrasts this with a Markov model of
a communication channel with a large vocabulary, which has many parameters.
His argument is, then, that Chomsky dislikes statistical models with large
numbers of parameters, as stated in the passage I quoted before.

My point was that Chomsky's concerns are, with reference to Norvig's argument,
equivalent to concerns about model fitting, namely that to fit models with
many more parameters than you have data, you require additional assumptions
about your model _structure_. It is difficult (though not necessarily
_impossible_) to learn about the system from your model, because in order to
construct your model you have had to assume things about reality that you will
not be verifying against observations.

In the opposite case, where you have many more data than parameters, you can
fit your model with confidence, given only assumptions about your _sampling_
(which you address by being a good experimentalist). This is what allows you
to "learn about the underlying system" - you have a model that describes
reality well _by itself_, without requiring additional assumptions about the
nature of reality, so the structure of your _model_ reflects something about
the structure of _reality_ , and you can explore your model as though you were
exploring reality. Of course, sometimes it turns out the equivalence wasn't as
good as we thought, but often it provides us with new directions of
investigation.

Hopefully that clarifies the equivalence between the two statements - I
apologise for not making it more obvious earlier.

~~~
xixi77
It's more than just overfitting: a neural net model with several hundred or
a few thousand parameters can be trained on a dataset of billions of
observations, and may even be OK at predicting behavior both in and out of
sample, but it still remains a black box, since we generally do not know what
phenomena particular parameters (or their combinations) represent.

On the other hand, IMO there is nothing wrong or unscientific with having
empirically estimated relationships as part of the model -- I just see them as
shortcuts whose purpose is to parcel the problem so as to allow other
analysis, and as something to potentially investigate further to see why the
relationship takes a particular form.

Some ML methods are more amenable to this type of analysis than others though.

~~~
mrow84
To your first point - as you increase the number of parameters in a model you
(quickly) begin to suffer from the curse of dimensionality. In a crude sense,
the data requirement for a similar level of confidence is exponential in the
number of parameters, so it can be difficult to use a high-dimensional model
to understand a problem, even with a billion (or trillion...) observations.
The best one can hope for is that, if there is a simple relationship hidden in
the data, your high-dimensional model captures that relationship in a way that
is amenable to extraction - I presume this is what you are saying in your
third paragraph.
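The "exponential in the number of parameters" claim is just the counting argument behind the curse of dimensionality, easy to put numbers on (bin and dimension counts here are arbitrary):

```python
# If each of d model dimensions is resolved into k bins, the number
# of cells you need data for grows as k**d -- exponential in d.
def cells(k: int, d: int) -> int:
    return k ** d

for d in (1, 2, 10, 20):
    print(f"{d:>2} dims, 10 bins each: {cells(10, d):.0e} cells")

# With 10 bins per axis, 10 dimensions already needs 10**10 cells;
# at 20 dimensions (10**20), even a billion observations averages
# far less than one sample per cell.
```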

To your second point - I agree. I too do not reject that there is utility in
constructing models that make no effort to match the form of the underlying
reality. However, the fact remains that in such cases it is very difficult to
use your model to gain deeper understanding, and as such these models simply
aren't useful for a lot of science in their current form, precisely because
they don't _tell_ you anything about reality. Now if someone were to devise a
way of extracting "intelligent", (meaning, sensible given existing
understanding) simplified relationships from high-dimensional models, that
might be a different matter...

------
fauigerzigerk
Let's consider the supreme court handshake problem, but say there is an
unwritten social law that forces judges to only ever initiate a handshake with
judges less senior than themselves. If the seniority of two particular judges
happened to be exactly equal, they would not shake hands at all.

Let's assume (I don't know if it is actually true or not) that in the history
of the supreme court, there have never been two judges of the exact same
seniority. In that case, a model learned from handshake data would not include
the slightest hint of this unwritten social law.

I think what Chomsky is saying is that if we do not understand the generative
principle behind any data, we cannot possibly know what circumstance might
completely invalidate our model. There may not be a way to smooth this out.

Language understanding, in contrast to things like speech recognition, does
not lend itself very well to smoothing.

~~~
YeGoblynQueenne
The way Norvig tells it, language lends itself perfectly well to smoothing and
even Chomsky's favourite absurd phrase "colorless green ideas sleep furiously"
is assigned a (very small but non-zero) probability in some statistical models
of English.

Truth is you can't do much with statistical models of language without some
sort of way to account for what's missing from your data, which is always most
of language. On the other hand, anything you might do is never going to be
enough when that's the case: that you're missing the majority of language from
your training data.
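The "very small but non-zero" probability is easy to reproduce. A toy add-one-smoothed bigram model (training text invented for illustration; none of the sentence's word pairs occur in it) still assigns Chomsky's sentence a positive probability:

```python
from collections import Counter

# Toy training text: contains every word of Chomsky's sentence,
# but never in his order.
corpus = ("colorless glass broke . green fields . new ideas . "
          "dogs sleep . run furiously .").split()
vocab = sorted(set(corpus))
V = len(vocab)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p(w2, w1, alpha=1.0):
    # add-alpha smoothing: unseen bigrams get a small non-zero mass
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)

sentence = "colorless green ideas sleep furiously".split()
prob = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    prob *= p(w2, w1)
print(prob)  # ~3.5e-05: tiny, but strictly greater than zero
```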

~~~
fauigerzigerk
Some NLP tasks do indeed lend themselves well to smoothing, but I was thinking
of language understanding tasks like question answering where changing or
misunderstanding a single word in an entire novel can easily make a correct
answer completely incorrect.

I agree with your second paragraph, and if I understand Chomsky correctly,
that is part of why he argues in favor of a generative grammar. I can't say
that I completely understand how such a grammar would be linked to semantics
and experience though.

~~~
YeGoblynQueenne
Ah, OK, sorry for misunderstanding.

~~~
mcguire
Was that a pun?

Out of curiosity, how does Chomsky's generative model account for language
understanding?

~~~
foobarqux
It demonstrates that human languages all have a certain structure, so it must
be an innate faculty of human beings (as opposed to acquired). A neural
network could not tell you that. It also suggests further avenues of inquiry
(Why that particular structure?)

~~~
mcguire
I think you misunderstood my question. Here's my understanding of Chomskyian
linguistics:

1. You have a certain concept you wish to express.

2. You apply a generative grammar to the concept, producing a linguistic
statement.

3. You express the statement in a linguistic performance.

4. I perceive the linguistic performance.

5. ¿I reverse the generative grammar to produce the concept?

6. I understand the concept.

My understanding is that Chomsky is only interested in steps 2 and 5 (and that
he is explicitly uninterested in 3 and 4). But how does step 5 work?

~~~
foobarqux
I don't think much is understood about how linear externalizations of language
are deserialized into symbolic structures, what those structures are, and how
those symbol structures are mapped into mental representations.

------
UhUhUhUh
Again, this boils down to the rationalist vs. empiricist stance. Hidden
variable vs. probabilistic approximation and so forth. The partisan error is
to consider these two stances as mutually exclusive. They are not, as
illustrated by decoherence for example. On the other hand, it is, I believe,
undeniable that the rationalist endeavor is much more complex than the
empiricist one and therefore also less linear. Less predictable and, please,
let's not forget, less financially profitable. Chomsky's position has always
been to advocate for the rationalist stance in a world conveniently inebriated
with its empiricist successes to the point of turning this one aspect of
thinking into a belief system. It is a waste of energy to argue for a monopoly
of the yin over the yang or the other way around. They are complementary but
not mutually exclusive. Privileging one over the other will result in an
increase of ideological/mystical bias and essentially miss the whole point: it
is their interaction, the transition from and to one another, that holds the
key to a global understanding.

------
atmosx
In this video[1] Varoufakis takes on modern economics and its models as a
way of understanding the real world and predicting what's going to happen.

On a higher level, I believe that he speaks for most (if not all) social
sciences, and for what happens when they embrace mathematical models that
blindly try to understand and predict _the real world_ through a flawed,
limited formalism.

[1] [https://youtu.be/L5AUAIzciLE?t=1355](https://youtu.be/L5AUAIzciLE?t=1355)

~~~
vezzy-fnord
Criticisms of mathematical modeling in economics were most definitely given by
Mises and Hayek. Historian of economic thought Mark Blaug was also very
critical of how neoclassical production theory (particularly Edgeworth's
formulation) distorted the meaning of "competition" from a dynamic process to
a quantity. Ironically, it is the same far-left heterodox demagogues like
Varoufakis (though he's hardly the worst) who nevertheless use it in this
sense to argue for mercantilist programs. Neoclassical when it suits them,
heterodox when it doesn't.

Anyone who uses the term "neoliberalism" is a charlatan. It is an absurd
conspiracy theory that has since blown into an amorphous meme out of Marxist
historiographers who struggle with the explanatory power (or lack thereof) of
their framework.

Mathematical models do have their uses in formalizing assumptions and
analyzing dependencies, so they're not all bad, even if the Cowles Commission
did go overboard.

(Also let's not forget that the modern methodology of economics came as a
result of the Keynesian research program, particularly since Hicks (1937)'s
introduction of IS-LM, first modeled by Keynes himself in 1933 as four
simultaneous equations.)

~~~
AimHere
What makes the use of the term 'neoliberalism' a conspiracy theory? I
thought it was a blanket term for the sorts of laissez-faire capitalist
ideologies that have sprung up since the 1980s, and whose proponents did have a big
thing for namechecking Adam Smith and other classical liberal economists of
the 19th century. You're surely not going to deny that there has been some
sort of ideological drive in favour of free market reforms, privatization and
"free trade" agreements of the WTO/TTIP ilk, are you?

I don't think people who use the term think it refers specifically to a shadowy
cabal of free marketeers conspiring with each other over some hidden agenda.
It's just a term for what a bunch of people, many of whom are in positions of
power, happen to openly think and do.

~~~
xixi77
One reason is that the term is exclusively used by the ideological opponents
of this "neoliberalism", and as a result it is necessarily ill-defined. A lot
of people describe themselves as libertarians of various types, or even as
classical liberals, but I've never met a person calling herself a neoliberal.
Outlining a difference between these ideologies might be a good start, and
would at least give the term some meaning, without it though it is nothing
more than handwaving at people the speaker doesn't like. Had that actually
been done, I might even call myself a neoliberal one day -- but as it stands,
I can't, because no one knows what it is :)

~~~
dragonwriter
> One reason is that the term is exclusively used by the ideological opponents
> of this "neoliberalism",

No, it's not. It's used by defenders of neoliberalism quite a bit.

[http://www.econlib.org/library/Columns/y2010/Sumnerneolibera...](http://www.econlib.org/library/Columns/y2010/Sumnerneoliberalism.html)

[https://cambridgedevelopmentstudies.wordpress.com/2011/04/12...](https://cambridgedevelopmentstudies.wordpress.com/2011/04/12/in-defense-of-neoliberalism-part-i/)

[http://www.themoneyillusion.com/?p=31603](http://www.themoneyillusion.com/?p=31603)

[http://www.independent.org/publications/tir/article.asp?a=94...](http://www.independent.org/publications/tir/article.asp?a=945)

> A lot of people describe themselves as libertarians of various types, or
> even as classical liberals, but I've never met a person calling herself a
> neoliberal.

Which says a lot more about who you do (and don't) know than it says about
anything else.

~~~
xixi77
Thanks for the references, these are good articles, and I am happy to see that
I was wrong and there are in fact some defenders of the term. Still, it is
very predominantly seen in a critical context -- quite unlike, for example,
"libertarianism" or "economic liberalism" or "classical liberalism".

But, even here it does not look well-defined at all: at best, authors simply
classify specific policies as "neoliberal".

------
faizshah
More info on Chomsky's argument here:
[http://www.theatlantic.com/technology/archive/2012/11/noam-c...](http://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/)

------
cschmidt
If two cultures isn't enough for you, there was an interesting blog post from
a year ago called "The Three Cultures of Machine Learning":

[http://cs.jhu.edu/~jason/tutorials/ml-simplex.html](http://cs.jhu.edu/~jason/tutorials/ml-simplex.html)

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=2591154](https://news.ycombinator.com/item?id=2591154).

------
mcguire
Chomsky's aversion to statistical techniques runs much deeper than most of
this discussion acknowledges.

Here's an enlightening quote from Chomsky:

" _Linguistic theory is concerned primarily with an ideal speaker-listener, in
a completely homogeneous speech community, who knows its (the speech
community's) language perfectly and is unaffected by such grammatically
irrelevant conditions as memory limitations, distractions, shifts of attention
and interest, and errors (random or characteristic) in applying his knowledge
of this language in actual performance. (Chomsky, 1965, p. 3)_"

Chomsky is uninterested in linguistic data of the kind used to build
statistical language models; those are "linguistic performances" and he is
only looking at "linguistic competence", the ability of an ideal speaker to
"produce and understand an infinite number of sentences in their language, and
to distinguish grammatical sentences from ungrammatical sentences."[1]

Now, I'm personally happy to criticise statistical techniques for their lack
of explanatory power. But I'm not willing to go further and say that data is
irrelevant. Chomsky is.

[1]
[https://en.wikipedia.org/wiki/Linguistic_competence](https://en.wikipedia.org/wiki/Linguistic_competence)

~~~
foldr
Chomsky is not saying that "data is irrelevant". _Aspects_ contains a detailed
discussion of lots of linguistic data points, as you'd know if you weren't
just cherry-picking quotations.

------
thanatropism
It seems to me that overarching theories of a Platonic bent miss the
embodied-ness, Dasein-ness of human activity.

Nevermind the debates about the ultimate nature of the human cognitive
process; the fact of the matter is that _as observed_, it's always-already
wrapped in emotional-social thinking. Enough that there's reason to question
the subject-object split altogether.

Now, maybe Chomsky is a kind of extreme social-cognitivist and his abstract
generative trees apply to societies as learning and meaning-producing wholes.
But on the face of facts, rather than metaphysical speculation as to the
nature of personality, intentionality and individuality, it would seem to me
that the statistical/machine learning approach already faces language as it
happens: as embodied in media, social context and so on.

In other words: I fail to see much value in an abstract account of "pure
language" as dissociated from the real communicative process as it happens
right now as you read me. Sure, "insights" -- but it remains to be shown that
"pure linguistics" is a worthwhile endeavor on the level of "pure quantum
mechanics" as formal model.

~~~
Ologn
> I fail to see much value in an abstract account of "pure language" as
> dissociated from the real communicative process

But this framing makes assumptions of purity as well. The communicative
process may just be a byproduct of a mental process which has little to do
with communication. A mutation happens tens of thousands of years ago (say
50,000 years ago), a change happens in the Broca (and/or Wernicke) area of the
brain, and suddenly a new mental process kicks off. This mental process can be
modeled as a state machine, and has the abilities and limitations of a state
machine. It also has known limitations of output, which Chomsky has talked
about.

You're assuming the mutations which gave rise to the brain changes which
created an internal language generator and parser have only one purpose -
communication. But that's an assumption on your part. The ability to
communicate may be just one byproduct of those changes which made things like
communication possible.

~~~
thanatropism
That's an interesting point.

Jacques Lacan insists on some ideas related to that. I've never been too fond
of psychoanalysis either, but I've been known to be wrong often.

------
pierrebai
One can try to divine what Chomsky really thinks, believes and means, but one
thing that has always annoyed me is his repeated absolute stances and the way
he expresses his viewpoint as being the only right one, often with disdain or
dismissal of the opposition. You can disagree with someone, but doing it
elegantly is of higher value to me.

~~~
wfo
This is generally the philosophical tradition: you present ideas, wholly and
completely, as an explanation or a solution to a problem. You do not pre-
suppose disagreements for your opponents (this will almost certainly end up
being a straw man, if you let them make their own arguments they will make
them more favorably than you could. It is a sign of respect). You express your
idea as a firm, unyielding truth and wait until it is appropriately refuted,
then reassess.

~~~
mcguire
And when it is refuted, you assert something akin to:

" _Linguistic theory is mentalistic, since it is concerned with discovering a
mental reality underlying actual behavior. Observed use of language ... may
provide evidence ... but surely cannot constitute the subject-matter of
linguistics, if this is to be a serious discipline._"

------
gcb0
it's important to know that chomsky used the same ideas in the 70s. he
defended biologicism (not sure about the english translation of the term),
which is basically machine learning in sociology. he fell on his face and is
probably now in a very good position to throw this criticism.

~~~
atdt
I'm not sure what you're talking about. Could you provide some references?
Chomsky was extremely critical of "the Bell Curve", if that's what you're
referring to.
([http://newlearningonline.com/new-learning/chapter-6/chomsky-...](http://newlearningonline.com/new-learning/chapter-6/chomsky-on-iq-and-inequality))

~~~
gcb0
Try to find any of his papers from before '84. he was not a b.f. skinner
fanboy, but pretty close, as was almost everyone in the field.

~~~
cma
He famously attacked Skinner's ideas in 1967, well before the 80s:

[https://chomsky.info/1967____/](https://chomsky.info/1967____/)

~~~
gcb0
That's not really my field. If I recall, I will ask the expert to give me a
link to post here.

------
ntoshev
I wish there were a follow-up to this; Peter Norvig has hinted a few times he
is going to write one.

------
fouc
This seems a little ironic. I feel there are some incredibly useful things in
AI/ML that rarely get applied to real-world problems; instead, academics just
spend all their time trying to unravel the black-box behavior and come up with
some model for it.

------
marmaduke
Same spirit as his critique of BF Skinner's theory of verbal behavior oh so
many years ago.

------
3pt14159
Chomsky is wrong that statistical models of language provide no insight. They
do.

Chomsky is right that language has _meaning_ and that many modern statistical
techniques essentially ignore this.

My take on the issue is that you can't separate linguistic command from true
intelligence / cognition. There's a long tail of tricks that we intelligent
people can use, but fundamentally they'll only be tricks. And if we truly get
something resembling a perfect linguistic-aware AI by this long tail of tricks
then we've probably accidentally created real cognition. Maybe after typing
this all out I finally understand what Turing meant.

~~~
foobarqux
What fundamental understanding have statistical models produced?

~~~
3pt14159
Here is a personal anecdote that will hopefully make clear what I'm talking
about. When I first used LDA I was blown away at how context could disarm
homonyms. This type of insight isn't exactly mind-blowing after the fact, but
it's still compelling enough to make me expect that statistical models produce
insights that are broadly applicable, without having to resort to proving it.
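LDA itself needs more machinery than fits in a comment, but the underlying intuition - that the contexts a homonym appears in pull its senses apart - can be sketched with plain co-occurrence vectors (this is a stand-in for LDA, not LDA itself; the corpus is invented):

```python
from collections import Counter
from math import sqrt

STOP = {"the"}  # crude stop-word list

sentences = [
    "deposit money at the bank".split(),
    "the bank raised interest rates".split(),
    "we fished from the river bank".split(),
    "the river bank was muddy".split(),
]

def context_vector(sents, target):
    """Bag of words co-occurring with `target`, minus stop words."""
    ctx = Counter()
    for s in sents:
        if target in s:
            ctx.update(w for w in s if w != target and w not in STOP)
    return ctx

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# The two senses of "bank" occupy disjoint regions of context space:
money_bank = context_vector(sentences[:2], "bank")
river_bank = context_vector(sentences[2:], "bank")
print(cosine(money_bank, river_bank))  # 0.0: no shared context words
```

LDA gets at the same separation via latent topics over documents rather than raw context counts, but the disambiguating signal is this one.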

~~~
foldr
Didn't we already know that homonyms can be disambiguated based on context?

------
rooundio
tl;dr "Classical" (natural) scientist: "X is only understood once I have found
first principles that X can be reduced to." "Modern" scientist: "X might be so
complicated that there are no first principles that X can be reduced to.
Rather, finding a neural network that can do X is the best I can do, and it
explains why X can be done."

~~~
foldr
I think you should delete the "first" in both cases. Chomsky merely wants a
principled model. He's not making any pretense that we're anywhere close to
understanding language at a fundamental level. Norvig isn't really interested
in principled models at all.

------
zump
What does this mean for the YC chatbot startups?!

------
dschiptsov
Language as a phenomenon is obviously neither purely functional nor purely
statistical. Purity is an abstract nonsense, an abstract category of
abstractions.

It is obvious from serious psychological studies of language acquisition that
the process is similar to training a neural network - some knowledge
representation grows up in the brain, but the process of training/learning is
possible only because the brain has the appropriate machinery.

It seems that we have more than two a priori notions - beyond time and space,
we perhaps have a priori notions of a thing (noun), a process (verb), an
attribute (adjective), and even a predicate at the very least, as reflections
of our perception of the physical universe through the sensory-processing
machinery we happened to evolve.

It is a mutually recursive process - we evolved our "inner representation" of
reality constrained by our senses, while nature selects, in some cases, those
with more correct representations.

How these a priori notions map to sounds - the details of phonology and
morphology - is rather irrelevant; we evolved machinery for that. This is why
there are no fundamental, principled differences between human languages. The
difference is one of degree, not of kind.

It also seems that we learn not rules (schools are very recent innovations)
but "weights", by being exposed to the medium of the local spoken language.
Children do it on their own, at least in remote areas like among the nomads of
the Himalayas, no worse than Americans do. This, by the way, is proof that we
have everything we need to be a Buddha or an Einstein.

How exactly the training occurs is absolutely unknown, but it has nothing to
do with probabilities. Nature knows nothing about probabilities, though it
obviously "knows" rates - how often something happens. Animals "know" how
often something happens.

Probability is an invention of the mind, and it leads to many errors in cases
where not all possible outcomes and their causes are known - which is almost
always the case. Nature could not rely on such a faulty tool.

So, like every naturally complex system, it has both "procedures" and
"weighted" data. Language capacity is hardwired, but grammar "grows" according
to exposure.

To speak about the hows, and especially the how-exactlys, in terms of either
pure procedures or pure statistics is misleading. It is both.

And Mr. Chomsky is right - mere data, let alone probabilistic models, describe
nothing about the principles behind what is going on. They do not even
describe what's going on correctly, only some approximation to an overview of
something unknown being partially observed.

A more or less correct model, as a philosophy, must be grounded in reality,
especially in that part of it which we call the mind. It has been pointed out
that mind itself is possible because of hardwired a priori notions (grounded
in the physical universe) of succession and distance, so models should be
augmented with these notions too. Pure statistics is nothing.

------
indubitably
This is totally off-topic, but man, Norvig writes some miserable HTML.

