The most convincing rationale for adopting a Bayesian perspective is contained in the collected works of Jim Berger, which I see are cited by the author... but not used in the manuscript.
* Of course, a Bayesian is just a statistician that uses Bayesian techniques even when it's not appropriate -- Andrew Gelman
Well, the one thing that they ought to actually give a shit about is generalization. All else (Bayesian or not) is only in service of that.
Deep learning systems require too much data (and often not raw data), crazy amounts of compute, and still fail under comically irrelevant tweaks of the problem (e.g., adversarial noise).
Yes, the PAC learning model starts off by assuming that the test set has the same “data distribution” as the training set. It’s obviously not exactly the same (otherwise learning is kinda irrelevant), but deep learning seems incredibly fragile in its ability to generalize. It seems to require crazy amounts of domain randomization during training to avoid overfitting.
E.g.: the recent OpenAI Rubik’s cube demonstration (while an impressive achievement) was an indication of the fundamental silliness of deep RL (at least as of now). To solve the problem, they had to use classical Rubik’s cube solving algorithms, completely model the physics of the situation (including friction on the surfaces!), and use that model to generate training data while varying every parameter for domain randomization. All that structured prior knowledge is absolutely infeasible for even a slightly more complicated situation. And after all that, the robot had something like a 20% success rate IIRC. Impressive work on the team’s part, no doubt... but the conceptual framework does not inspire confidence.
Since data stands in as a proxy for the environment, a NN will only generalize insofar as it has the data to do so. There is no 3rd dimension in an infinite number of 2D photographs -- and no "bayesianification" will put it there.
But that is how all learning works. Even the most advanced physics models are verified using experimental data. If data is found that contradicts the model, the model is deemed faulty and will be adjusted to match the insights gained from the new data. DL just automates that approach, aiming for the minimum gap between 'theory' and 'data', where the theory is expressed as a set of weights, thus encoding 'insight' in a - for now - non-reversible way. Just as wetware typically has no clue how it arrived at a certain conclusion, we use the word 'intuition' and its synonyms when we actually mean 'a guess based on past experience'.
And for inventing new physics, there are two very different approaches. One is perhaps similar to machine learning, where people look at an existing model, find experimental data that doesn't match it, and adjust the model slightly until it matches the new data. But another mode is where you have a model with certain mathematical properties, then you discover new mathematics that gives it more properties, and only then ask for more data to check these extensions.
As someone who has a baby in the house, I assure you that this is indeed how a human learns and refines motor skills, at the very least: by repeated experiment and play, with thousands of permutations of position, speed, weight, and situation. They also try to infer causal models.
Just what do you think 'play' is for a child, if not (in part) an opportunity to safely explore its surroundings and refine mental models?
Even children, after some point, start developing a model and then designing experiments to test hypotheses to refine that model, which is not at all how machine learning works.
Many of these criticisms generalize to 'first make it work, then make it better, then make it fast'.
Being able to solve some of these problems at all is the kicker, whether or not it is efficient is at the moment not nearly as important as being able to solve them in principle.
I suspect that in the very near future we will see something that extracts the actual insights required from a DL model that makes it work which then can be used to power an analytical solution that performs a lot faster than the model itself, and hopefully with more resistance against noise.
I've already seen several glimpses of this and I'm hoping for some kind of breakthrough where DL is used to continue where feature engineering left off.
There are deep (not universally applicable) assumptions in that framing... It works only if you didn't "make it work" in a way that fundamentally limits/handicaps the future steps, or gives no insights in that direction.
Based on my understanding of the subject, I think that much recent progress in deep learning is less of a breakthrough than it's commonly made out to be, and that there are fundamental conceptual reasons why the approach is limited, almost like achieving the first step of drawing an owl: https://i.kym-cdn.com/photos/images/newsfeed/000/572/078/d6d...
But the jury is out on that one, and I think there is enough room for reasonable people to differ in their opinions. ¯\_(ツ)_/¯
Maybe folks making snapchat filters don't care, but this is absolutely vital if you're doing something with a low margin of error (self-driving cars, financial work, etc...). If your neural net can tell you if it thinks it could be making an error, that information is absolutely vital for keeping your system from messing up big time.
The problem is that none of the Bayesian methods for machine learning work well and quickly (please do correct me if I'm wrong--I will become my boss's favorite person). If they did, many many practitioners would be very excited to use them.
The basic idea is brilliant: take a deep neural network with dropout, keep the dropout at evaluation time, run it several times and deduce the uncertainty from the observed variance (with a bit of math).
This lets you add an uncertainty estimation to the output of any deep neural network that uses dropout.
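A minimal numpy sketch of that idea (the tiny network and its weights here are made up for illustration; in real use you would keep dropout enabled on an already-trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP with fixed (stand-in for pretrained) weights.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, p_drop=0.5):
    """One stochastic forward pass: dropout stays ON at evaluation time."""
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop      # random dropout mask
    h = h * mask / (1.0 - p_drop)            # inverted-dropout scaling
    return h @ W2

x = rng.normal(size=(1, 4))

# Run the same input through the net T times; the spread across the runs
# is the approximate predictive uncertainty.
T = 200
preds = np.array([forward(x) for _ in range(T)])  # shape (T, 1, 1)
mean, std = preds.mean(), preds.std()
print(f"prediction = {mean:.3f} +/- {std:.3f}")
```

The appeal is that this needs no retraining; the cost, as noted below, is T forward passes per prediction.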
Usually you train a neural net that's as large as possible with a runtime that's within your latency and GPU budget. Running the net even twice will blow up your budget.
Also, I haven't seen a great evaluation on how well this method works in practice. For example, how well does predicted uncertainty capture mistakes on a real-world dataset?
There are loads of methods based on variational autoencoders that seem to work fairly well for various problems.
Detecting whether we see an espresso or a cappuccino, sure, fine.
But uncertainty on a CT scan would be extremely valuable just to avoid people (even trained professionals) falling into the common bias of trusting the output.
Say the model outputs a 90% chance of nothing and a 10% chance of a deadly condition that needs to be investigated ASAP. Doctors are often overworked and tired and can miss it. Knowing where the model's certainties and uncertainties lie would allow the doctor to compare them to their own training ("Ah, but it didn't judge based on X and Y at all") and limit decisions made purely on trusting the algorithm.
This is a problem similar to the one that rating systems based on just a number of stars face: you need to read the comments and the "why" just to filter out astroturfing.
Coming back to ML, this is why I think more emphasis should be given to LIME, SHAP, and also Microsoft's Explainable Boosting Machine, which ironically is not a gradient boosting algorithm but a Generalized Additive Model (i.e. like a GLM, but with flexible shape functions instead of linear terms).
Regarding your specific point about how to quantify error with respect to your chosen prior: the answer in a Bayesian framework lies in things like credible intervals, posterior distributions, and posterior predictive distributions, depending on what you’re actually trying to quantify.
Very long story short, the latter allows you to sample from a distribution of predictions, from which you can then form a variety of error estimates (and the distribution doesn’t have to be parametric). That will quantify uncertainty for the model you’ve made using the prior you chose (and the data). But usually we wouldn’t recommend changing your prior just to get lower uncertainty unless you have some rationale behind changing it. Priors should really be chosen based on domain knowledge etc etc etc.
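A minimal sketch of posterior predictive sampling, using a conjugate Beta-Binomial model; the data, prior, and trial counts here are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed data: 7 successes out of 10 trials.
successes, trials = 7, 10

# Prior: Beta(a, b), ideally chosen from domain knowledge (here a weak Beta(2, 2)).
a, b = 2.0, 2.0

# Conjugacy: the posterior over the success rate is Beta(a + s, b + f).
post_a, post_b = a + successes, b + (trials - successes)

# Posterior predictive: for each posterior draw of theta, simulate new data.
theta = rng.beta(post_a, post_b, size=10_000)
y_new = rng.binomial(n=10, p=theta)   # predicted successes in 10 new trials

# A 95% credible interval for the prediction, straight from the samples.
lo, hi = np.percentile(y_new, [2.5, 97.5])
print(f"posterior mean of theta: {theta.mean():.3f}")
print(f"95% predictive interval for new successes: [{lo:.0f}, {hi:.0f}]")
```

Note the interval comes directly from samples, so nothing about the predictive distribution needs to be parametric.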
I highly recommend it because it's free and hands-on, and if you have a solid stats background it will go fast and easy.
Anyways, the thing with Bayesian methods is you need to look at problems through a different lens and learn new names for the same things you know from the classical setting (maximum likelihood is now called maximum a posteriori, and whatnot).
If I had to summarize the whole uncertainty thing, I would say: both in classical and Bayesian statistics it holds that a function of random variables yields another random variable. Models are functions of random variables. As such, every point estimate you give has a full distribution behind it. In the classical setting this is accounted for with asymptotics and confidence intervals. The classical setting has priors all the same, but uses defaults and doesn't talk about them.
In Bayesian statistics you talk about your priors, and if you have extra information about your inference problem you include it there. Or you abuse the priors to make the model yield results that will make your boss/client happy (I worked as an economics researcher at a big bank, and the whole department was Bayesian). Or you choose priors that make your custom model actually compute. Or you use priors that do regularization (like a classical lasso)... Different priors are different models, and you need to do your model checking / selection.
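The "priors that do regularization" point can be made concrete: for linear regression, the negative log posterior under an independent Laplace prior on the weights is exactly the lasso objective. A small numpy check, with made-up data and scales:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=50)

sigma, b = 0.5, 1.0  # noise scale and Laplace prior scale (assumed values)

def neg_log_posterior(w):
    # Gaussian likelihood + independent Laplace(0, b) prior, constants dropped.
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma**2)
    neg_log_prior = np.sum(np.abs(w)) / b
    return nll + neg_log_prior

def lasso_objective(w, lam):
    # The classic lasso loss: least squares plus an L1 penalty.
    return np.sum((y - X @ w) ** 2) / (2 * sigma**2) + lam * np.sum(np.abs(w))

w = rng.normal(size=3)
# With lam = 1/b the two objectives coincide: the Laplace prior *is* the lasso.
print(np.isclose(neg_log_posterior(w), lasso_objective(w, lam=1.0 / b)))
```

So the "default" classical regularizer is just a prior someone chose not to mention.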
I am not so sure what you mean by continuous priors. Any gamma should be continuous. But I don't think that is what you mean.
Spot on. Personally, I always enjoy Alex Kendall and Yarin Gal's writing on the topic, for example:
There are alternative ways of getting uncertainty estimates that are far less work.
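One such alternative (my reading of the comment, not something it names) is a small ensemble: fit several models independently and use their disagreement as the uncertainty signal. A toy bootstrap-ensemble sketch in numpy, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=100)

# Ensemble: fit K models on bootstrap resamples; spread ~ uncertainty.
K = 20
coefs = []
for _ in range(K):
    idx = rng.integers(0, len(X), len(X))            # bootstrap resample
    A = np.column_stack([np.ones(len(idx)), X[idx, 0], X[idx, 0] ** 2])
    w, *_ = np.linalg.lstsq(A, y[idx], rcond=None)   # quadratic fit
    coefs.append(w)
coefs = np.array(coefs)

def predict_all(x):
    a = np.array([1.0, x, x * x])
    return coefs @ a                                 # one prediction per member

inside = predict_all(0.5)    # inside the training range
outside = predict_all(3.0)   # far outside it: members should disagree more
print(f"x=0.5: {inside.mean():.2f} +/- {inside.std():.2f}")
print(f"x=3.0: {outside.mean():.2f} +/- {outside.std():.2f}")
```

Unlike MC dropout, this needs multiple trained models, but each prediction is a single cheap pass per member and needs no dropout layers at all.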