"I believe that Chomsky has no objection to this kind of statistical model [the Newtonian model of gravitational attraction]. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two."
This is no more than an objection to problems of fitting your chosen model to data. If you only have a small number of free parameters, then you can fit your model with a reasonable amount of data. If you have a large number of parameters then you have to introduce some extra assumptions, as Norvig (of course) acknowledges slightly earlier (described as "smoothing", in context):
"For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data."
Thus, although both models are statistical, it is much easier to have confidence in Newton's law of gravitation than it is in a Markov model of some communication channel, because the data paint a clear picture. The imprecision of Newton's law in certain parts of the problem space (unobserved during his time) is a moot point - any such objections apply equally well to models with many parameters, and then you _still_ have to accept that you have made extra assumptions "outside" the scope of your model.
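The quadrillion figure in the quoted passage is easy to check, and a toy example shows what "smoothing" buys you. This is only a sketch: the tiny corpus, the name `smoothed_prob`, and the choice of add-one (Laplace) smoothing are assumptions made here for illustration, not something the quoted passage specifies.

```python
from collections import Counter

# Parameter count for a second-order Markov model over a 100,000-word
# vocabulary: one probability per (previous two words, next word) triple.
vocab_size = 100_000
num_params = vocab_size ** 3
print(num_params == 10 ** 15)  # True

# Toy corpus (hypothetical) to show add-one smoothing on trigrams.
corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))

trigrams = list(zip(corpus, corpus[1:], corpus[2:]))
trigram_counts = Counter(trigrams)
context_counts = Counter((u, v) for (u, v, _w) in trigrams)

def smoothed_prob(u, v, w):
    """Add-one estimate of P(w | u, v); never zero, even for unseen triples."""
    return (trigram_counts[(u, v, w)] + 1) / (context_counts[(u, v)] + len(vocab))

print(smoothed_prob("the", "cat", "sat"))  # seen once after "the cat"
print(smoothed_prob("the", "cat", "mat"))  # never seen, still non-zero
```

The non-zero value for the unseen triple comes entirely from the smoothing rule, which is exactly the kind of extra assumption being discussed: the data themselves say nothing about that case.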
If you can explore your entire problem space, then you can build a complete "model". If not, then having more parameters than data _requires_ additional assumptions. Chomsky's point stands.
One small point to add: Chomsky's interest has been in identifying the abstract characteristics of the language center of the human brain, which, for various reasons, does not seem likely to work like a Markov model.
Analogously, one could look at the inputs and outputs of the human heart and potentially imagine a variety of physical structures that would explain them, and some of those physical structures would be biologically real/plausible and others would not.
Things like constraints on working memory, exposure to input, and cross-language studies have informed the constraints that Chomsky has proposed to determine what kind of model would best capture the essential quality of the brain system.
Statistical models can be useful but they generally aren't meaningful to scientific progress.
The problem with Chomsky's knowledge-based approach is that it seems to be domain-specific. You can learn the fundamentals of music but you can't apply them to physics. You can learn the fundamental mechanics of weather but you can't apply them to psychology, etc.
General learning though requires exactly that and there the statistical model seems to be much better because it doesn't run into the domain problem.
"a mathematical model which is modified or trained by the input of data points."
He then illustrates how what would be considered a "scientific" model, Newton's law of gravitation, is a statistical model under his definition, but a simple one with not many parameters. He contrasts this with a Markov model of a communication channel with a large vocabulary, which has many parameters. His argument is, then, that Chomsky dislikes statistical models with large numbers of parameters, as stated in the passage I quoted before.
My point was that Chomsky's concerns are, with reference to Norvig's argument, equivalent to concerns about model fitting, namely that to fit models with many more parameters than you have data, you require additional assumptions about your model structure. It is difficult (though not necessarily impossible) to learn about the system from your model, because in order to construct your model you have had to assume things about reality that you will not be verifying against observations.
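A minimal sketch of the "more parameters than data" case, using made-up points: two observations cannot pin down the three coefficients of a quadratic, so every value of the free coefficient `c` below gives a perfect fit. Picking one value of `c` is an assumption the data never test. The names `quadratic` and `max_error` are just illustrative helpers.

```python
# Two observations (hypothetical data).
data = [(0.0, 0.0), (1.0, 1.0)]

def quadratic(c):
    """Return y = x + c*x*(x - 1), which passes through both points for any c."""
    return lambda x: x + c * x * (x - 1.0)

def max_error(model):
    """Largest absolute residual of the model on the observed data."""
    return max(abs(model(x) - y) for x, y in data)

for c in (-5.0, 0.0, 5.0):
    model = quadratic(c)
    # Each model fits the data exactly, yet they disagree away from the data.
    print(c, max_error(model), model(2.0))
```

All three candidates have zero error on the observations while giving very different predictions at x = 2, so the data alone cannot say which structure is the "real" one.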
In the opposite case, where you have many more data than parameters, you can fit your model with confidence, given only assumptions about your sampling (which you address by being a good experimentalist). This is what allows you to "learn about the underlying system" - you have a model that describes reality well by itself, without requiring additional assumptions about the nature of reality, so the structure of your model reflects something about the structure of reality, and you can explore your model as though you were exploring reality. Of course, sometimes it turns out the equivalence wasn't as good as we thought, but often it provides us with new directions of investigation.
Hopefully that clarifies the equivalence between the two statements - I apologise for not making it more obvious earlier.
On the other hand, IMO there is nothing wrong or unscientific with having empirically estimated relationships as part of the model -- I just see them as shortcuts whose purpose is to parcel the problem so as to allow other analysis, and as something to potentially investigate further to see why the relationship takes a particular form.
Some ML methods are more amenable to this type of analysis than others though.
To your second point - I agree. I too do not reject that there is utility in constructing models that make no effort to match the form of the underlying reality. However, the fact remains that in such cases it is very difficult to use your model to gain deeper understanding, and as such these models simply aren't useful for a lot of science in their current form, precisely because they don't tell you anything about reality. Now if someone were to devise a way of extracting "intelligent" (meaning sensible given existing understanding), simplified relationships from high-dimensional models, that might be a different matter...
I agree with your second sentence - the key question is indeed whether the model is explanatory rather than merely predictive. I offered a definition for explanatory as being when "the structure of your model reflects something about the structure of reality, and you can explore your model as though you were exploring reality", which isn't a terrible attempt, from my experience.
At the risk of repeating myself ad nauseam, the relationship to model fitting is found in the presence or absence of additional assumptions required for finding your fit. The difference is between having a very low-dimensional model that fits the data and requires few if any extra assumptions to fit (fitting the model being equivalent to "validating your theory", in this context), or a very high-dimensional model that fits the data (making no claim to "theory"), but by definition requires extra assumptions to get the fit.
In another sub-thread you said:
> If you model a system using the smallest possible mathematical model you don't, from that act alone, understand how the system works.
This is correct inasmuch as the understanding doesn't leap forth immediately, but if the model is a good representation of the data (an important if), then modelling a system in the most parsimonious way possible does provide you with understanding, pretty much for free, by looking for systems with a similar structure and learning about their properties. As an example, if you have some random variable, and you realise that it might be modelled with a Poisson distribution, then (assuming you are correct) you immediately gain a lot of understanding, because there is an enormous amount of literature exploring the implications of such a model.
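To make the Poisson example concrete, here is a small sketch with invented counts. Once the single parameter is estimated (the sample mean is the maximum-likelihood estimate), standard properties of the distribution follow without any further fitting. The data and the helper name `poisson_pmf` are assumptions for illustration.

```python
import math

# Hypothetical counts of events per interval.
counts = [2, 3, 1, 4, 2, 3, 2, 3]

# The maximum-likelihood estimate of the one Poisson parameter is the sample mean.
lam = sum(counts) / len(counts)

def poisson_pmf(k, lam):
    """Probability of observing exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Consequences that come with the model once it is accepted:
print(lam)                  # the mean, which is also the model's variance
print(poisson_pmf(0, lam))  # chance of an interval with no events
print(1 - sum(poisson_pmf(k, lam) for k in range(6)))  # chance of more than 5 events
```

Whether those consequences hold for the real system still depends on the "important if" above: the model has to be a good representation of the data in the first place.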
This is what substantiates the link between model fitting and the explanatory vs. predictive question. If you can successfully fit a small model to a problem, without adding assumptions, then that model gives you understanding, by virtue of being a good representation of the data, and having structure. That is simply not the case with the high-dimensional models used in machine learning.
I would be interested to see a counter example - a small model that fits a particular set of data well, but does not provide any explanatory power.
> I would be interested to see a counter example - a small model that fits a particular set of data well, but does not provide any explanatory power.
As I said above, a linear regression typically doesn't lend itself to understanding. In fact, I would be interested in an example of a statistical model that does provide any meaningful explanation. Most ML and AI researchers don't even seem to pursue scientific understanding as a goal; predictive power is their measure of success. That is Chomsky's criticism.
Science has the fundamental aim to make testable predictive models of the world; any "understanding" other than that represented by a testable predictive model is irrelevant to science, except insofar as it might provide intuition on which to found hypotheses of better predictive models.
Statistical models, to the extent that they are testable and predictive, are exactly the kind of thing that science is about.
To use one of Chomsky's examples, if you built a deep neural net that could predict the weather with 100% accuracy then you have performed an amazing feat of engineering that is incredibly useful to the world. But you haven't necessarily learned anything about the weather.
Of course, were the model really 100% accurate, then converting it to the most parsimonious perfectly accurate model would be simply a matter of reduction; the real problem is that real neural net models are not 100% accurate, and are often far more complex than more accurate models, and often aren't convenient to analyze to deduce the more-accurate and simpler model.
Has this ever been done? It strikes me as impossible.
> Of course, were the model really 100% accurate, then converting it to the most parsimonious perfectly accurate model would be simply a matter of reduction
There are entire sub-fields of physics focused on reduced order models and phenomenological prediction. It is not 'simply a matter of reduction', it is non-trivial and the resulting models are almost always highly imperfect.
> It strikes me as impossible.
"Impossible" is probably, in principle, not quite right, but when I said "nontrivial", I really did mean "almost certainly impractical in virtually all real cases".
> There are entire sub-fields of physics focused on reduced order models and phenomenological prediction. It is not 'simply a matter of reduction', it is non-trivial and the resulting models are almost always highly imperfect.
"Simply" there meant that it was just reduction, not that reduction is necessarily simple. The point was that in real-world cases, reduction to an equivalent model isn't the only problem.
Chomsky is a Grand System Builder in the style of the medieval and renaissance thinkers. Data is, at best, irrelevant and, at worst, a confusing distraction that gets in the way of his program, which is elaborating on his system, basing it entirely on intuition.
One advantage of such grand systems is that they can be made to look like the mathematical model, based on "intuitively obvious" axioms and built by thinking deeply in the comfort of one's armchair.
Let's assume (I don't know if it is actually true or not) that in the history of the supreme court, there have never been two judges of the exact same seniority. In that case, a model learned from handshake data would not include the slightest hint of this unwritten social law.
I think what Chomsky is saying is that if we do not understand the generative principle behind any data, we cannot possibly know what circumstance might completely invalidate our model. There may not be a way to smooth this out.
Language understanding, contrary to things like speech recognition, does not lend itself very well to smoothing.
Truth is you can't do much with statistical models of language without some sort of way to account for what's missing from your data, which is always most of language. On the other hand, anything you might do is never going to be enough when that's the case: that you're missing the majority of language from your training data.
I agree with your second paragraph, and if I understand Chomsky correctly, that is part of why he argues in favor of a generative grammar. I can't say that I completely understand how such a grammar would be linked to semantics and experience though.
Out of curiosity, how does Chomsky's generative model account for language understanding?
1. You have a certain concept you wish to express.
2. You apply a generative grammar to the concept, producing a linguistic statement.
3. You express the statement in a linguistic performance.
4. I perceive the linguistic performance.
5. ¿I reverse the generative grammar to produce the concept?
6. I understand the concept.
My understanding is that Chomsky is only interested in steps 2 and 5 (and that he is explicitly uninterested in 3 and 4). But how does step 5 work?
Somewhat similar to Buridan's Ass.
... an ass that is equally hungry and thirsty is placed precisely midway between a stack of hay and a pail of water. Since the paradox assumes the ass will always go to whichever is closer, it will die of both hunger and thirst since it cannot make any rational decision to choose one over the other.
Yes the machine could guess. It could even guess correctly. But I think what Chomsky means is that we have no valid scientific reason to believe that it would.
No, what Chomsky is saying is that if you don't know the generative principle behind the data, then you don't know the generative principle behind the data. You don't understand a system just because you can faithfully predict it.
So you'd have a model where handshakes are dependent on the relative heights, and it would fail because some senior judge who is short goes around and initiates.
And then you'd come to the correct one, if the data is there for it. Like a hypothesis that's never refuted.
Of course then there's noise. If it's a general rule, but on casual Fridays...
If you have a limited dataset containing the actual generating data, say heights, seniority, age, hair color, day of the week, gender, etc, can't you end up discovering at least the parts that are exposed in the data?
I guess scientists don't guess completely at random though, do they? They use some sort of heuristics, or their instinct (I've no idea what that is, no) to decide which hypotheses are worth pursuing.
Most statistical models aren't very good at taking background knowledge like that into account. Not to mention most are also sensitive to "noise" (which is to say, useful information that they can't make use of).
On a higher level, I believe that he speaks for most (if not all) social sciences and what happens when they run into mathematical models that blindly try to understand and predict the real world through a flawed, limited formalism.
If we can do this eventually, then perhaps we can make our way to models that capture a larger fraction of the truth and can be tested against reality.
Nonetheless, many of these models are still falsifiable. For example, the textbook model of the benefits of trade due to comparative advantage only contains two countries and two goods. Although extremely simplistic, the theory does have falsifiable predictions and there is good empirical evidence supporting the broad theory. But this simple model will not be able to accurately predict how much a particular country will benefit from free trade.
Even if you did have all of that data, you would still be missing a lot of data. For example, an important factor in economics is asymmetric information. An actor's decision may be optimal given the information available to them at the time even if the decision is suboptimal given perfect information. So for an even more accurate model, you would also need to have data on what each actor knows at any given point in time.
Furthermore, an actor probably does not have the ability to calculate the optimal outcome given the information available and there are innumerable subjective factors in economic decision making.
Big data will help you find some kind of estimate, but it will not suddenly make economic modelling simple.
Anyone who uses the term "neoliberalism" is a charlatan. It is an absurd conspiracy theory that has since blown into an amorphous meme out of Marxist historiographers who struggle with the explanatory power (or lack thereof) of their framework.
Mathematical models do have their uses in formalizing assumptions and analyzing dependencies, so they're not all bad, even if the Cowles Commission did go overboard.
(Also let's not forget that the modern methodology of economics came as a result of the Keynesian research program, particularly since Hicks (1937)'s introduction of IS-LM, first modeled by Keynes himself in 1933 as four simultaneous equations.)
I don't think people who use the term think it refers specifically to a shadowy cabal of free marketeers conspiring with each other over some hidden agenda. It's just a term for what a bunch of people, many of whom are in positions of power, happen to openly think and do.
No, it's not. It's used by defenders of neoliberalism quite a bit.
> A lot of people describe themselves as libertarians of various types, or even as classical liberals, but I've never met a person calling herself a neoliberal.
Which says a lot more about who you do (and don't) know than it says about anything else.
But, even here it does not look well-defined at all: at best, authors simply classify specific policies as "neoliberal".
It is always used to criticize things from the left, which creates an impression that the term describes economically right-of-center ideologies, but that's IMO just because of the difference in vocabulary between different groups -- e.g. you may see the same thing described as neoliberal by the left, liberal by the right, and "statist" by the libertarians.
Thinking about it, "statist" is a similarly empty term used exclusively by opponents -- I've never seen anyone calling herself a "statist" either.
> It is an absurd conspiracy theory that has since blown into an amorphous meme out of Marxist historiographers who struggle with the explanatory power (or lack thereof) of their framework.
"neoliberalism" as a term is no more (and, arguably, much less) an invention of Marxist historiography than is "capitalism"; its true that both terms were coined by critics of the systems they describe, but that's not uncommon.
When I've heard him speak, he attacks overly simplistic mathematical models which badly fit complex real world behaviour.
If Chomsky's generative grammar provided a better model of language, Google Translate would be running on that rather than on statistical models.
Chomsky's ideas don't explain the butterflies. In the same way Varoufakis targets theory led economics which don't explain recent events.
Stats led models take the opposite approach, explaining the butterflies very well. The trade off is decreased ability to act in novel situations.
Chomskian grammars (CFGs) are used widely in compilers and similar tools to model computer languages and even for limited subsets of natural language they don't do half bad.
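For readers who haven't seen one in the compiler setting, here is a rough sketch of a context-free grammar for arithmetic and a recursive-descent parser for it. The grammar, the tokenizer, and the function names are all made up for illustration; real compilers use more elaborate grammars and tooling.

```python
import re

# Toy context-free grammar (hypothetical, for illustration):
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'

def tokenize(text):
    # Numbers become one token; any other non-space character is its own token.
    return re.findall(r"\d+|\S", text)

def parse(text):
    """Evaluate text under the grammar above; raise SyntaxError if it is not derivable."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        value = term()
        while peek() == "+":
            advance()
            value += term()
        return value

    def term():
        value = factor()
        while peek() == "*":
            advance()
            value *= factor()
        return value

    def factor():
        tok = peek()
        if tok is not None and tok.isdecimal():
            advance()
            return int(tok)
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(parse("2+3*(4+1)"))  # 17
```

Each grammar rule maps onto one small function, which is why this works so well for designed languages: the whole language is covered by a handful of hand-written rules.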
The problem with phrase structure grammars is that they're very costly to develop and maintain and so far there's never been one such grammar that can model the whole of a natural language.
If there was a way to learn CFGs with good coverage from text they'd be in much wider use, but unfortunately grammar induction is hard.
Also, Google has its own political reasons not to want to use grammars. They champion neural networks and statistical AI. It's their schtick, innit.
Here's an enlightening quote from Chomsky:
"Linguistic theory is concerned primarily with an ideal speaker-listener, in a completely homogeneous speech community, who know its (the speech community's) language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic) in applying his knowledge of this language in actual performance. (Chomsky, 1965, p. 3)"
Chomsky is uninterested in linguistic data of the kind used to build statistical language models; those are "linguistic performances" and he is only looking at "linguistic competence", the ability of an ideal speaker to "produce and understand an infinite number of sentences in their language, and to distinguish grammatical sentences from ungrammatical sentences."
Now, I'm personally happy to criticise statistical techniques for their lack of explanatory power. But I'm not willing to go further and say that data is irrelevant. Chomsky is.
Never mind the debates about the ultimate nature of the human cognitive process; the fact of the matter is that, as observed, it's always-already wrapped in emotional-social thinking. Enough that there's reason to question the subject-object split altogether.
Now, maybe Chomsky is a kind of extreme social-cognitivist and his abstract generative trees apply to societies as learning and meaning-producing wholes. But on the face of facts, rather than metaphysical speculation as to the nature of personality, intentionality and individuality, it would seem to me that the statistical/machine learning approach already faces language as it happens: as embodied in media, social context and so on.
In other words: I fail to see much value in an abstract account of "pure language" as dissociated from the real communicative process as it happens right now as you read me. Sure, "insights" -- but it remains to be shown that "pure linguistics" is a worthwhile endeavor on the level of "pure quantum mechanics" as formal model.
But this assumption makes assumptions of purity as well. The communicative process may just be a byproduct of a mental process which has little to do with communication. A mutation happens tens of thousands of years ago (say 50,000 years ago), a change happens in Broca's (and/or Wernicke's) area of the brain, and suddenly a new mental process kicks off. This mental process can be modeled as a state machine, and has the abilities and limitations of a state machine. It also has known limitations of output, which Chomsky has talked about.
You're assuming the mutations which gave rise to the brain changes which created an internal language generator and parser have only one purpose - communication. But that's an assumption on your part. The ability to communicate may be just one byproduct of those changes which made things like communication possible.
Jacques Lacan insists on some ideas related to that. I've never been too fond of psychoanalysis either, but I've been known to be wrong often.
I don't believe this is the case with psychology or linguistics. Of course, I could be wrong; in particular, they may work as "applied pseudosciences" (much like economics, which is really useful) even though we will never arrive at their foundations.
EDIT: More succinctly put: the objects of interest of chemistry always arrived as abstracted from context, while the objects of linguistics and psychology are the context themselves.
What a spectacularly obtuse and improbable sentence! Its mere existence underlines its point. Nice.
Maybe "embeddedness" is the better word. The symbol system that is language is never the whole story of communication with language; we hardly code expression that's verbal but nonlexical (tone of voice, prosody, etc.); nor are we able to incorporate the layers upon layers of material mediation (reproduction and transmission technologies, air, light) into our generative account of linguistic phenomena.
Which is maybe another way of putting that what you say isn't what you mean - it's what's out there.
Maybe it's still a stupid point, but I thought it was worth trying to restate it once more.
"Linguistic theory is mentalistic, since it is concerned with discovering a mental reality underlying actual behavior. Observed use of language ... may provide evidence ... but surely cannot constitute the subject-matter of linguistics, if this is to be a serious discipline."
As for disdain, while I agree he can be overly dry sometimes, seeing the kind of stuff he often has to put up with, around political subjects anyway, I'd say he shows a lot of patience, too. If more people would do their part most of the more unpleasant subjects he debates people on wouldn't even be an issue. He wouldn't be able to use harsh words for, say, war criminals and their apologists, if those didn't exist in the first place. It's not actually his or anyone's job to try and help clean up the mess others are making, they do that on the side because it's required to look in the mirror.
^ This is what and who matters. Without a (decent) world, all the rest ain't happening anyway. He's the kind of person I'd imagine to be rather friendly to, say, a cleaning lady, and that to me are the more significant bits of the number that makes up grace. At least from what I see, he's sometimes a dick to people who are used to being lauded (and well paid), and he's a champion of people who are used to getting shit on.
> "Courage is indispensable because in politics not life but the world is at stake." -- Hannah Arendt, "Between Past and Future"
We're now at the point where we're not just worried about dying, or getting jailed, or losing a friend or two; but about merely being offended. Crooks and madmen aren't content anymore with merely getting away with it, now they want respect, too. I'm not even sure how polemic or exaggerated that is. Criminals are tired of laundering money so to speak, they just want to whitewash the crimes themselves and do them in the open. You can respectfully disagree in a way that leaves room for you not actually being sure of what you're saying, and that's it.
Also, saying "this house is on fire" is not an "absolute stance" on fires or building safety, and if the house is on fire, and as long as no serious efforts are made to change that, it's simply principled to repeat this over and over, instead of coming up with something "new" for the sake of it. Hannah Arendt again:
> The ceaseless, senseless demand for original scholarship in a number of fields, where only erudition is now possible, has led either to sheer irrelevancy, the famous knowing of more and more about less and less, or to the development of a pseudo-scholarship which actually destroys its object.
I think this could also be applied to discussion of politics. Every time a government or the NSA does something, a bunch of people say "why does this surprise anyone?", as if something that sucked on day 1 would somehow be less alarming on day 50, with the same or even increased levels of suckage.
Same for Chomsky criticizing his own, which is still valid. Why does it have to be new, or elegant, or otherwise pleasing? If a society shits its pants, that's already not something pleasant to point out, but the more years pass, the longer it keeps sitting in and adding to it, the less pleasant it becomes. It's not anyone's job to add sugar to the medicine, or give it with a big smile, or anything. I think it's awesome if they do, but still perfectly fine if they don't. Insofar as his or anyone's arguments have merit, you can always repackage them without any vitriol or snark, and they stay intact. That is what really matters, and that kind of elegance is also in the ear of the listener.
Chomsky is right that language has meaning and that many modern statistical techniques essentially ignore this.
My take on the issue is that you can't separate linguistic command from true intelligence / cognition. There's a long tail of tricks that us intelligent people can use, but fundamentally they'll only be tricks. And if we truly get something resembling a perfect linguistic-aware AI by this long tail of tricks then we've probably accidentally created real cognition. Maybe after typing this all out I finally understand what Turing meant.
[Edit] No really, take the ideal gas law. To get a real understanding of a given situation, you would have to know the position and momentum of all of the molecules of the gas. But the ideal gas law says, no, you don't need all of that to get a very good approximation; all you need to know to get the pressure of a gas is the temperature, volume, and amount of gas.
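As a quick sketch of how little the ideal gas law needs as input, the snippet below computes pressure from just the amount of gas, temperature and volume, P = nRT / V. The function name `pressure_pa` and the example numbers are chosen here for illustration.

```python
R = 8.314462618  # molar gas constant, J/(mol*K)

def pressure_pa(n_mol, temperature_k, volume_m3):
    """Ideal gas law, P = n*R*T / V, with the result in pascals."""
    return n_mol * R * temperature_k / volume_m3

# One mole at 273.15 K in 22.4 litres comes out close to one atmosphere (101325 Pa).
print(pressure_pa(1.0, 273.15, 0.0224))
```

No positions or momenta of individual molecules appear anywhere; three bulk quantities are enough for a good approximation.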
Physics went through this same argument with thermodynamics and statistical mechanics. Predictive, statistical models pretty much won. And then along came quantum mechanics---there's nothing descriptive there at all.
* Statistical mechanics: https://en.wikipedia.org/wiki/Statistical_mechanics
It is obvious from serious psychological studies of language acquisition that the process is similar to training a neural network: some knowledge representation grows in the brain, but the training/learning is possible only because the brain has the appropriate machinery.
It seems that we have more than two a priori notions, those of time and space. We perhaps also have a priori notions of a thing (noun), a process (verb), an attribute (adjective) and even a predicate, at the very least, as reflections of our perception of the physical universe around us through the sensory-processing machinery we happened to evolve.
It is a mutually recursive process - we evolved our "inner representation" of reality constrained by our senses, but nature selects, in some cases, those with more correct representations.
How these a priori notions map to sounds (the details of phonology and morphology) is rather irrelevant - we evolved machinery for that. This is why there are no fundamental differences in principle between human languages. The difference is in degree, not in kind.
It also seems that we learn not the rules (schools are a very recent innovation) but "weights", by being exposed to the medium of a local spoken language. Children do it on their own, at least in remote areas, such as among the nomads of the Himalayas, no worse than Americans. This, by the way, is proof that we have everything we need to be a Buddha or an Einstein.
How exactly training occurs is absolutely unknown, but it has nothing to do with probabilities. Nature knows nothing about probabilities, but it obviously "knows" rates - how often something happens. Animals "know" how often something happens.
Probability is an invention of the mind, and it leads to many errors in cases where not all possible outcomes and their causes are known, which is almost always the case. Nature could not rely on such a faulty tool.
So, like every naturally complex system, it has both "procedures" and "weighted" data. Language capacity is hardwired, but grammar "grows" according to exposure.
To speak about hows, and especially how-exactlys in terms of either pure procedures or pure statistics is misleading. It is both.
And Mr. Chomsky is right - mere data, let alone probabilistic models, describe nothing about the principles behind what is going on. They do not even describe what's going on correctly, only some approximation to an overview of something unknown being partially observed.
A more or less correct model, as a philosophy, must be grounded in reality, especially in that part of it which we call the mind. It has been pointed out that the mind itself is possible because of hardwired a priori notions (grounded in the physical universe) of succession and distance, so models should be augmented with these notions too. Pure statistics is nothing.