True, probability theory is merely a tool among many and shouldn't be used to attempt to solve every problem there is.
However, the article is misleading about the initial gif being any sort of argument against Bayesian probabilistic approaches, as it's precisely an argument for them. It's an argument against using point estimates and summary stats, as they obviously lose a lot of information and can be misleading (you can get the intuition from a CS perspective as well: compressing a 2D image down to 4 numbers is necessarily an extremely lossy compression). Fully Bayesian approaches would maintain the distributions all the way through the calculations, rather than collapse them at some point into one or a few numbers summarizing the distribution.
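To make the lossy-compression point concrete, here's a minimal sketch (not from the article) showing two very different 2D point sets rescaled so that their per-axis means and standard deviations are identical. The shapes (a blob and a ring) are my own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Gaussian blob and a noisy ring: visually nothing alike.
blob = rng.normal(size=(1000, 2))
theta = rng.uniform(0, 2 * np.pi, 1000)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
ring += rng.normal(scale=0.05, size=ring.shape)

# Standardize both so the four summary numbers match exactly.
blob = (blob - blob.mean(0)) / blob.std(0)
ring = (ring - ring.mean(0)) / ring.std(0)

print(blob.mean(0), blob.std(0))  # [0, 0] and [1, 1]
print(ring.mean(0), ring.std(0))  # identical summaries, very different shape
```

Four numbers per dataset simply cannot distinguish the two; keeping the full (empirical) distribution can.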
So the author quotes Judea Pearl, whose causal calculus led to PGMs playing a critical role in computer science and science as a whole, and somehow uses that to support the idea that probability theory is useless? That is absurd and out of context.
Take a look at Kaldi or any state-of-the-art platform for speech-to-text and NLP analysis, or take a look at deep belief networks. They all roughly extend probabilistic graphical models with deep architectures and some nonlinearity to do what couldn't be done before. (See also the Neural Hawkes Process [https://arxiv.org/abs/1612.09328].)
The article seems to be primarily about promoting the author's book rather than making a cohesive argument.
Yes, nonlinearity matters, but I'm not remotely convinced that tossing probability is a good thing.
That gif is wildly intriguing to me. Probability theory is the bread and butter of mainstream science. Have people been doing it wrong this whole time?
I want to dig deeper into this but two things come to mind:
2. Andrei Khrennikov argues that quantum models can be expressed with Kolmogorovian (classical) probability theory. He points out that many probability theorists have glossed over a subtlety of Kolmogorov's work: probability spaces need not be finite-dimensional; they can be subspaces of higher-dimensional, even infinite-dimensional probability spaces. Khrennikov's philosophical convictions are not widely shared by quantum theorists, but his math is solid (he seems to be in pursuit of quantum theories with "hidden variables", like the ones Einstein was looking for). http://www.sciencedirect.com/science/article/pii/S0375960103...
Even in the gif shown in this article, you might suppose that each point set could be modeled in higher-dimensional terms, for example by adding some kind of discrete curl-like measure. But even without going there: something as simple as changing to a polar basis should reveal different statistics.
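As a quick sketch of the polar-basis point (my own toy example, not from the article): two point sets whose Cartesian per-axis summaries can be made identical are separated immediately by the statistics of their radial coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)

blob = rng.normal(size=(2000, 2))
theta = rng.uniform(0, 2 * np.pi, 2000)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
ring += rng.normal(scale=0.05, size=ring.shape)

def radius(pts):
    # The radial coordinate of the polar representation.
    return np.hypot(pts[:, 0], pts[:, 1])

# Per-axis Cartesian summaries can be matched, but the spread of the
# radius distinguishes the shapes at a glance.
print(radius(blob).std())  # substantial spread (a Rayleigh-distributed radius)
print(radius(ring).std())  # tiny spread around r = 1
```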
And isn't this kind of what deep learning is all about? Going between higher and lower dimensional representations?
I'm a bit slow today but I'm not sure I got what the GIF was about. It is well known that the mean (1st moment) and stdev/variance (2nd moment) are insufficient to uniquely describe an arbitrary probability distribution except in the case of the normal distribution.
Therefore it shouldn't be surprising that you can find probability distributions that share the first two moments but differ wildly in the higher moments (skewness, kurtosis) or even in the number of modes (bimodal, trimodal, etc.).
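For instance, here's a quick sketch: a standard normal and a symmetric bimodal mixture, constructed (my parameter choice) to share mean 0 and variance 1 while the kurtosis differs drastically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

normal = rng.normal(size=n)

# Symmetric mixture with components at ±0.9 and component std sqrt(1 - 0.9**2),
# which gives overall mean 0 and variance 0.9**2 + (1 - 0.9**2) = 1.
signs = rng.choice([-1.0, 1.0], size=n)
bimodal = signs * 0.9 + rng.normal(scale=np.sqrt(1 - 0.9**2), size=n)

def first_four_moments(x):
    m, s = x.mean(), x.std()
    z = (x - m) / s
    return m, s, (z**3).mean(), (z**4).mean()  # mean, std, skewness, kurtosis

print(first_four_moments(normal))   # roughly (0, 1, 0, 3)
print(first_four_moments(bimodal))  # roughly (0, 1, 0, 1.7): same first two moments
```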
Not everything in the world is Gaussian, but we knew that.
I'm not sure why that gif is so interesting to you; that wasn't the reason I submitted this piece. Average and standard deviation are hardly useful data points in evaluating a dataset, and also don't have anything to do with probability theory. The second half of the post contains the bulk of its interesting content (as I read it).
Maybe I'm missing something? To me, that gif illustrates sets of samples that are on the surface compatible with the same Gaussian distribution, even if a Gaussian distribution is "obviously" lacking.
I think 0xBABAD00C's comment is on the money. The gif is almost a non-sequitur and does not seem to support the conclusion the article is trying to reach.
The second part of the article makes a claim that is worth discussing -- that probabilistic inference does not work well in nonlinear domains.
I think this is true in one sense, but in many practical situations we can usually get something to work well enough for it to be useful. Assuming we can (approximately) split up the domain into multiple regions and get probability distributions for each region (with transition functions between them), we can continue to make inferences under nonlinearity. That said, this is a concession. Perhaps there are techniques in deep learning that are more general in nature.
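A minimal sketch of the region-splitting idea, using a toy nonlinear relationship of my own choosing (y = sin(x) plus noise): a single global Gaussian for y would be useless, but per-region Gaussians capture the local behavior well.

```python
import numpy as np

rng = np.random.default_rng(2)

# A nonlinear relationship over [0, 2π].
x = rng.uniform(0, 2 * np.pi, 5000)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Split the domain into 8 regions and fit a Gaussian to y within each;
# locally, the nonlinear relationship is well summarized by (mean, std).
edges = np.linspace(0, 2 * np.pi, 9)
region = np.digitize(x, edges[1:-1])
params = [(y[region == k].mean(), y[region == k].std()) for k in range(8)]

for k, (mu, sigma) in enumerate(params):
    print(f"region {k}: mean={mu:+.2f} std={sigma:.2f}")
```

The per-region means trace the sine curve, and each region's std stays small; inference within a region reduces to the linear/Gaussian case, with the transitions handled at the region boundaries.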
As someone who has seen enough of the math, I strongly feel that probability just makes ideas seem "sexier" in ML; often you don't really need probability, or you only need it to apply Jensen. Worse, using probabilities generally keeps deeper ideas from surfacing. Bayesianism has become another cargo cult, in many ways.
Conventional probability theory, as commonly applied, seems to have some major blind spots. Using deep learning, Yann LeCun's team was able to find distinctly different sets of points that share the same summary statistics. They are calling into question the approach of throwing probability theory at any and every problem. Nonlinear systems in particular often exhibit behavior that is hard to capture with probability theory.
But deep learning is terrible at learning nonlinear systems which are easy to model with things like stochastic differential equations... this is just another case of machine learning being pushed too far.
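To illustrate how simple such models can be (a generic textbook example, not tied to any system from the thread): an Ornstein-Uhlenbeck process simulated with the Euler-Maruyama scheme in a dozen lines, with a known closed-form stationary distribution to check against.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ornstein–Uhlenbeck SDE: dX = theta * (mu - X) dt + sigma dW,
# simulated with the Euler–Maruyama scheme.
theta, mu, sigma = 1.5, 0.0, 0.3
dt, n_steps = 0.01, 10_000

x = np.empty(n_steps)
x[0] = 2.0  # start away from the mean to show mean reversion
for t in range(1, n_steps):
    dw = rng.normal(scale=np.sqrt(dt))
    x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt + sigma * dw

# The stationary distribution is N(mu, sigma**2 / (2 * theta)),
# i.e. mean 0 and std 0.3 / sqrt(3) ≈ 0.17 here.
print(x[5000:].mean(), x[5000:].std())
```

A handful of interpretable parameters and an exact stationary law: the sort of structure a generic deep net would need far more data to recover.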