
On Chomsky and the Two Cultures of Statistical Learning (2011) - gajju3588
http://norvig.com/chomsky.html
======
pesenti
I used to think the same way. But after spending the last few years getting
frustrated trying to create more complex linguistic systems, e.g., dialog
systems or scientific-article understanding, I am coming to the conclusion
that the current statistical approach is a dead end. It’s actually impeding
the field because it’s working so well for certain tasks that when people are
trying to build systems with real understanding, they can’t match the
performance obtained by gigantic language models. But the answer is not
Chomsky, it’s semantic grounding in the real world rather than looking at
language as a sequence of symbols.

~~~
cgearhart
I agree. I think Norvig is right that interpretation and understanding may be
inherently probabilistic problems, but the accuracy of statistical models sure
seems to just be a matter of "playing the odds"; correctly performing some
task doesn't imply any real "understanding" by the model. If the system had
"semantic grounding" as you call it, that would go a long way towards
disambiguating during interpretation & understanding tasks.

------
d_burfoot
Let me rephrase the debate, in a way that can hopefully clarify the main point
of contention:

C: Your statistical models are woefully inadequate at describing language.

N: That inadequacy is related particularly to Markov models and Ngram models.
More sophisticated statistical models will be adequate.

C: Then why haven't you built the more sophisticated models? Why are you still
using Markov models and Ngrams?

N: Those work well enough for engineering applications.

The attitude of "it works well enough for engineering" is what Chomsky is
actually criticizing. And that criticism is entirely valid: an empirical
scientist would never claim that a theory is true because it can be used in
engineering.

It's funny to me that Norvig holds up the PCFG as an example of a new and
improved statistical model of language. The PCFG is actually terrible in many
ways, the most obvious of which is that it doesn't take into account the Theta
Criterion [1], one of the most fundamental phenomena of language. An example
of this rule is that a noun phrase can only have one determiner. This
restriction is so strong that it will never be violated in any kind of
professionally composed text. But it is very awkward to try to encode this
rule in a PCFG (you essentially have to split the NP symbol into DetNP vs
UndetNP). I wrote a blog post describing the problems of the PCFG formalism:

[https://ozoraresearch.wordpress.com/2017/03/17/chuckling-a-bit-at-microsoft-and-the-pcfg-formalism/](https://ozoraresearch.wordpress.com/2017/03/17/chuckling-a-bit-at-microsoft-and-the-pcfg-formalism/)

[1]:
[https://en.wikipedia.org/wiki/Theta_criterion](https://en.wikipedia.org/wiki/Theta_criterion)
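
To make the DetNP/UndetNP split concrete, here is a minimal sketch (a hypothetical toy grammar, not the one from the blog post): the only rule that introduces a determiner rewrites DetNP as Det followed by UndetNP, so a doubled determiner is simply underivable.

```python
# Toy PCFG with the NP symbol split as described above (hypothetical grammar).
RULES = {
    "DetNP":   [(1.0, ("Det", "UndetNP"))],
    "UndetNP": [(0.3, ("Adj", "UndetNP")), (0.7, ("N",))],
    "Det":     [(1.0, ("the",))],
    "Adj":     [(1.0, ("big",))],
    "N":       [(0.6, ("dog",)), (0.4, ("cat",))],
}

def prob(symbol, words):
    """Total probability that `symbol` derives the word sequence `words`."""
    if symbol not in RULES:                      # terminal symbol
        return 1.0 if list(words) == [symbol] else 0.0
    total = 0.0
    for p, rhs in RULES[symbol]:
        if len(rhs) == 1:                        # unary rule
            total += p * prob(rhs[0], words)
        else:                                    # binary rule: try every split
            for i in range(1, len(words)):
                total += p * prob(rhs[0], words[:i]) * prob(rhs[1], words[i:])
    return total

print(prob("DetNP", ["the", "dog"]))         # 0.42: one determiner parses
print(prob("DetNP", ["the", "the", "dog"]))  # 0.0: double determiner is underivable
```

The price of the encoding is visible in the rule table: every NP-internal rule has to be stated for the UndetNP variant, and any further hard constraint would force yet another symbol split.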

------
seagullz
Further elaborations by Chomsky on what he meant:
[https://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/](https://www.theatlantic.com/technology/archive/2012/11/noam-chomsky-on-where-artificial-intelligence-went-wrong/261637/)

Some video clips of this interview by Yarden Katz:
[http://yarden.github.io/pages/chomsky/](http://yarden.github.io/pages/chomsky/)

------
eirikma
This is a fantastic challenge for computer engineers: create an interpreter or
compiler for a programming language, based on statistical modeling. How likely
is it that combining the token "foo" with "bar" using a mathematical operator
is a valid construct, based on experience from other programs? What will the
resulting value usually be, based on prior experience?
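
As a rough sketch of the first question, here is a toy bigram model over code tokens (the corpus is made up): it estimates how likely one token is to follow another, purely from "experience" with prior programs.

```python
from collections import Counter, defaultdict

# Hypothetical corpus of tokenized expressions from "other programs".
corpus = [
    ["foo", "+", "bar"],
    ["foo", "+", "1"],
    ["bar", "*", "2"],
    ["foo", "+", "bar"],
]

# Count bigram transitions: token -> next token.
bigrams = defaultdict(Counter)
for tokens in corpus:
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

def next_token_prob(prev, nxt):
    """P(next token | previous token), estimated from the corpus."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total if total else 0.0

print(next_token_prob("foo", "+"))   # 1.0: "foo" is always followed by "+"
print(next_token_prob("+", "bar"))   # 2/3
```

Of course, this only answers "how likely is the construct" with relative frequencies; it says nothing about what the expression should evaluate to, which is exactly the gap the comment is pointing at.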

------
wglb
I am also reminded of the "unreasonable effectiveness of data" article:
[https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf)

------
melling
What’s the difference between a probabilistic model and a statistical model?

~~~
cgearhart
> A statistical model is a mathematical model which is modified or trained by
> the input of data points.

> A probabilistic model specifies a probability distribution over possible
> values of random variables, e.g., P(x, y), rather than a strict
> deterministic relationship, e.g., y = f(x).

For example, by these definitions, we could use linear regression as a
statistical model that is not a probabilistic model; we could build a Bayes
network (using known distributions) as a probabilistic model that is not a
statistical model; and we could train a Hidden Markov Model on sample data
that would be both a statistical and a probabilistic model.
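
A minimal sketch of the first two cases, with made-up data:

```python
import random
import statistics

# 1) Statistical but not probabilistic: a least-squares line fit to data,
#    returning a deterministic prediction y = a*x + b (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
mx, my = statistics.mean(xs), statistics.mean(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict(x):
    """Strict deterministic relationship y = f(x); no distribution involved."""
    return a * x + b

# 2) Probabilistic but not statistical: a fixed, hand-specified distribution
#    P(y | x) = Normal(2*x, 1) -- no data was used to choose it.
def sample_y(x):
    return random.gauss(2 * x, 1.0)

print(round(predict(5.0), 2))  # ≈ 9.85
```

The HMM case would combine the two: a specified distribution over hidden states whose transition and emission parameters are then estimated from data.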

~~~
Rainymood
In the linear regression model the points (x, y) have the following
relationship (in the simplest example)

> y_t = x_t + e_t

where e_t is some error term. Would this be a statistical or probabilistic
model?

We do know that the best predictor is just the conditional expectation (in a
linear setting bla bla)

> y_pred = E[x_t + e_t | .. ] = x_t

Or is this what you mean by "model"? The predictor? Sorry for being a bit
confused.

~~~
in9
Usually, every statistical model is a special case of a probabilistic model
whose parameters have been estimated from data.

The regression E[Y|X] is basically the mean of a Gaussian distribution of Y
given X, with sigma given by the error term. The whole Gaussian-distribution
part is the probabilistic model. But estimating which parameters make up this
model in a particular application (together with checking whether it is valid
or not) is the statistical-modeling part.
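
A small sketch of that split, with made-up data: the Gaussian form Y | X ~ Normal(beta * X, sigma^2) is the probabilistic model; fitting beta and sigma from the sample is the statistical part.

```python
import math

# Hypothetical data; probabilistic model: Y | X ~ Normal(beta * X, sigma^2).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.8, 4.1, 6.2, 7.9, 10.1]

# Statistical-modeling step: estimate beta by least squares (the MLE under
# the Gaussian assumption), then sigma from the residuals.
beta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
residuals = [y - beta * x for x, y in zip(xs, ys)]
sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))

# E[Y | X = x] is just the mean of the fitted Gaussian: beta * x.
print(round(beta, 3), round(sigma, 3))
```

Checking whether the fitted model is valid (e.g., looking at the residuals for non-Gaussian structure) belongs to the same statistical-modeling step.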

------
dpf
Previous discussion:

[https://news.ycombinator.com/item?id=11951444](https://news.ycombinator.com/item?id=11951444)
(2016)

[https://news.ycombinator.com/item?id=2591154](https://news.ycombinator.com/item?id=2591154)
(2011)

------
ronilan
They both must rattle, an Epic Rap Battle.

