The most convincing rationale for adopting a Bayesian perspective is contained in the collected works of Jim Berger, which I see is cited by the author... but not used in the manuscript.
* Of course, a Bayesian is just a statistician that uses Bayesian techniques even when it's not appropriate -- Andrew Gelman
Well, the one thing that they ought to actually give a shit about is generalization. All else (Bayesian or not) is only in service of that.
Deep learning systems require too much data (and often not raw data), crazy amounts of compute, and still fail under comically irrelevant tweaks of the problem (eg: adversarial noise).
Yes, the PAC learning model starts off by assuming that the test set has the same “data distribution” as the train set. It’s obviously not exactly the same (otherwise learning is kinda irrelevant), but deep learning seems incredibly fragile in its ability to generalize. It seems to require crazy amounts of domain randomization during training to avoid overfitting.
Eg: The recent OpenAI Rubik’s cube demonstration (while an impressive achievement) was an indication of the fundamental silliness of deep RL (at least as of now). To solve the problem, they had to use classical Rubik’s cube solving algorithms, and completely model the physics of the situation (including friction on the surfaces!) and use that model to generate training data while varying every parameter for domain randomization. All that structured prior knowledge is absolutely unfeasible for even a slightly more complicated situation. And after all that, the robot had something like a 20% success rate IIRC. Impressive work on the team’s part, no doubt... but the conceptual framework does not inspire confidence.
Since data stands in as a proxy for the environment, a NN will only generalize insofar as it has the data to do so. There is no 3rd dimension in an infinite number of 2D photographs -- and no "bayesianification" will put it there.
But that is how all learning works. Even the most advanced physics models are verified using experimental data. If data is found that contradicts the model, the model is deemed faulty and adjusted to match the insights gained from the new data. DL just automates that approach, aiming for the minimum gap between 'theory' and 'data', where the theory is expressed as a set of weights, thus encoding 'insight' in a - for now - non-reversible way. Just like wetware, it typically has no clue how it arrived at a certain conclusion; we use the word 'intuition' and its synonyms when we actually mean 'guess based on past experience'.
And for inventing new physics, there are two very different approaches. One is perhaps similar to machine learning, where people look at an existing model, find experimental data that doesn't match the model, and adjust the model slightly until it matches the new data. But another mode is where you have a model with certain mathematical properties, then you discover new mathematics that yields more properties, and only then do you ask for more data to check these extensions.
As someone who has a baby in the house, I assure you that this is indeed how a human learns and refines motor skills, at the very least: by repeated experiment and play, with thousands of permutations of position, speed, weight, and situation. They also try to infer causal models.
Just what do you think 'play' is for a child, if not (in part) an opportunity to safely explore its surroundings and refine mental models?
Even children, after some point, start developing a model and then designing experiments to test hypotheses to refine that model, which is not at all how machine learning works.
Many of these criticisms generalize to 'first make it work, then make it better, then make it fast'.
Being able to solve some of these problems at all is the kicker, whether or not it is efficient is at the moment not nearly as important as being able to solve them in principle.
I suspect that in the very near future we will see techniques that extract the actual insights that make a DL model work, which can then power an analytical solution that runs a lot faster than the model itself, and hopefully with more resistance to noise.
I've already seen several glimpses of this and I'm hoping for some kind of breakthrough where DL is used to continue where feature engineering left off.
There are deep (not universally applicable) assumptions in that framing... It works only if you didn't "make it work" in a way that fundamentally limits/handicaps the future steps, or gives no insights in that direction.
Based on my understanding of the subject, I think that much recent progress in deep learning is less of a breakthrough than it's commonly made out to be, and that there are fundamental conceptual reasons why the approach is limited, almost like achieving the first step of drawing an owl: https://i.kym-cdn.com/photos/images/newsfeed/000/572/078/d6d...
But the jury is out on that one, and I think there is enough room for reasonable people to differ in their opinions. ¯\_(ツ)_/¯
Maybe folks making snapchat filters don't care, but this is absolutely vital if you're doing something with a low margin of error (self-driving cars, financial work, etc...). If your neural net can tell you if it thinks it could be making an error, that information is absolutely vital for keeping your system from messing up big time.
The problem is that none of the Bayesian methods for machine learning work well and quickly (please do correct me if I'm wrong--I will become my boss's favorite person). If they did, many many practitioners would be very excited to use them.
The basic idea is brilliant: take a deep neural network with dropout, keep the dropout at evaluation time, run it several times and deduce the uncertainty from the observed variance (with a bit of math).
This lets you add an uncertainty estimation to the output of any deep neural network that uses dropout.
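A minimal NumPy sketch of the idea; the two-layer network, its weights, and the input here are all made up for illustration, but the mechanism (fresh dropout masks at evaluation time, variance across repeated passes) is the one described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with fixed, purely illustrative weights.
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, p_drop=0.5):
    """One stochastic forward pass: dropout stays ON at eval time."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # fresh dropout mask each call
    h = h * mask / (1.0 - p_drop)        # inverted-dropout scaling
    return (h @ W2).item()

x = np.array([[0.2, -1.0, 0.5]])
samples = np.array([forward(x) for _ in range(200)])  # T stochastic passes

mean = samples.mean()  # point prediction
std = samples.std()    # uncertainty proxy from the observed variance
```

In practice you'd do this with the trained network itself rather than random weights; the "bit of math" in the original papers relates this variance to an approximate posterior, but the spread of `samples` is the operational quantity.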
Usually you train a neural net that's as large as possible with a runtime that's within your latency and GPU budget. Running the net even twice will blow up your budget.
Also, I haven't seen a great evaluation on how well this method works in practice. For example, how well does predicted uncertainty capture mistakes on a real-world dataset?
There are loads of methods based on variational autoencoders that seem to work fairly well for various problems.
Detecting if we see an espresso or a cappuccino, sure, fine.
But uncertainty on a CT scan would be extremely valuable just to avoid people (even trained professionals) falling into the common bias of trusting the output.
Say the model outputs a 90% chance of nothing and a 10% chance of a deadly condition that needs to be investigated ASAP. Doctors are often overworked and tired and can miss it. Knowing where the model's certainties and uncertainties lie would allow the doctor to compare them against their own training ("Ah, but it didn't judge based on X and Y at all") and limit decisions made purely on trusting the algorithm.
This is similar to the problem that rating systems based on a bare star count face: you need to read the comments and the "why" just to filter out astroturfing.
Coming back to ML, this is why I think more emphasis should be given to LIME, SHAP, and also Microsoft's Explainable Boosting Machine, which ironically is not a gradient boosting algorithm but a Generalized Additive Model (i.e., like a GLM but with a learned shape function per feature instead of a purely linear term).
Regarding your specific point about how to quantify error with respect to your chosen prior: the answer in a Bayesian framework lies in things like credible intervals, posterior distributions, and posterior predictive distributions, depending on what you’re actually trying to quantify.
Very long story short, the latter allows you to sample from a distribution of predictions - for which you can then form a variety of error estimates (and the distribution doesn’t have to be parametric). That will quantify uncertainty for the model you’ve made using the prior you chose (and the data). But usually we wouldn’t recommend changing your prior just to get lower uncertainty unless you have some rationality behind changing the prior. Priors should really be chosen based on domain knowledge etc etc etc.
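As a toy illustration of sampling from a posterior predictive distribution, here's a conjugate Beta-Binomial setup; the counts and the Beta(2, 2) prior are numbers I'm making up for the example, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 7 successes out of 10 trials, Beta(2, 2) prior.
successes, trials = 7, 10
a, b = 2.0, 2.0  # prior pseudo-counts (stand-in for domain knowledge)

# Conjugate update: posterior is Beta(a + successes, b + failures).
post_a = a + successes
post_b = b + (trials - successes)

# Posterior predictive: draw theta, then draw future outcomes given theta.
theta = rng.beta(post_a, post_b, size=10_000)
future = rng.binomial(n=10, p=theta)  # predicted successes in 10 new trials

# Non-parametric error estimates straight from the samples:
lo, hi = np.percentile(future, [2.5, 97.5])  # 95% predictive interval
```

The predictive samples fold in both parameter uncertainty (the spread of `theta`) and sampling noise, which is exactly what a point estimate hides.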
I highly recommend it because it's free and hands-on, and if you have solid stats it will go fast and easy.
Anyways, the thing with Bayesian statistics is that you need to look at problems through a different lens and learn new names for things you already know from the classical setting (the analogue of maximum likelihood is now called maximum a posteriori, and whatnot).
If I had to summarize the whole uncertainty thing, I would say: in both the classical and the Bayesian setting, a function of random variables yields another random variable. Models are functions of random variables, so every point estimate you give has a full distribution behind it. Classical statistics accounts for this with asymptotics and confidence intervals. You have priors in the classical setting all the same, you're just using defaults and not talking about them.
In the Bayesian setting you talk about your priors, and if you have extra information about your inference problem you include it there. Or you abuse the priors to make the model yield results that will make your boss/client happy (I worked as an economics researcher at a big bank; the whole department was Bayesian). Or you choose priors that make your custom model actually compute. Or you use priors that do regularization (like a classical lasso)... Different priors are different models, and you need to do your model checking / selection.
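The "priors that do regularization" point can be made concrete: for linear regression with Gaussian noise, a zero-mean Gaussian prior on the weights makes the MAP estimate exactly ridge regression, with penalty lambda = sigma^2 / tau^2. A sketch with made-up data (all the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up regression data.
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

# MAP under w ~ N(0, tau^2 I) and y ~ N(Xw, sigma^2 I) is ridge
# regression with lambda = sigma^2 / tau^2 (closed form below).
sigma2, tau2 = 0.01, 1.0
lam = sigma2 / tau2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Swap the Gaussian prior for a Laplace prior and the same derivation gives the lasso penalty instead, which is the correspondence mentioned above.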
I am not so sure what you mean about continuous priors. Any gamma should be continuous. But I don't think that is what you mean.
Spot on. Personally, I always enjoy Alex Kendall and Yarin Gal's writing on the topic, for example:
There are alternative ways of getting uncertainty estimates that are far less work.
Most of contemporary AI is really just a combination of probability theory and some upper division math classes in an executable form. None of it is magic and the more people that know the vocabulary the less likely people are to buy into the hype.
If you want a high level overview of all this then Melanie Mitchell has a good book as well: https://melaniemitchell.me/aibook/. She does a really good job of putting everything into the right context and dispelling the marketing hype about the coming singularity and human obsolescence. In one of the chapters she covers deep reinforcement learning and it's one of the best high level explanations I've come across yet.
For those of us who think probability theory captures a large fraction of what intelligence does, this reads a lot like "don't believe the hype behind nuclear weapons! It's really just physics and some upper division chemistry in an explosive form. None of it is magic..."
Maybe this argument will convince people who think that the human brain is magic? But for those of us who think that the human brain is not magic, but proof that general intelligence is actually computable and only needs 20 watts... If probability theory and upper division math in executable form is all it takes to be dangerous, that wouldn't be surprising.
It only takes physics and upper division chemistry in explosive form to level a city, after all. Not magic.
Would this book be a good one for that?
Any other recommendations?
So I'd say it depends on your use cases. If you want high level understanding then go with Melanie Mitchell's book. If you want more technical details then d2l.ai or fast.ai are good choices.
There's also "Deep Reinforcement Learning Hands-On" by Maxim Lapan. That one is also pretty good but I'm not far enough along to give a review one way or another.
wow thanks for opening my eyes.
I guess I should just get together with my undergrad friends, implement the next DeepMind competitor, and collect our $400m from Google.
The funds must flow, the people must stay, and work. Together.
Vision (the right vision) is a huge differentiator and ingredient x. I think you (I) can pull together a team with knowledge, practice, management and people structures, and make it work. The folks who do epoch making stuff have the amazing vision thing as well.
It's easy to come up with model families where a given data set has a high total probability because it has high probability in every model in the family, so the total probability on its own cannot function as a general epistemological measurement.
Unless you're already modeling a neural net it's extremely unlikely that the model family represented by your Bayesian Deep Learning system includes anything representing the actual data-generating process. It's not just me saying this; Gelman & Shalizi point this out in their "Philosophy and the Practice of Bayesian Statistics":
> ...it is hard to claim that the prior distributions used in applied work represent statisticians’ states of knowledge and belief before examining their data, if only because most statisticians do not believe their models are true, so their prior degree of belief in all of ϴ is not 1 but 0. The prior distribution is more like a regularization device, akin to the penalization terms added to the sum of squared errors when doing ridge regression and the lasso (Hastie, Tibshirani, & Friedman, 2009) or spline smoothing (Wahba, 1990)
(Although they're only talking about the prior here, it applies equally well to the total probability, which is just the probability of the data averaged over the prior.)
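To see concretely what that total probability (marginal likelihood) looks like, here's the closed form for a Beta-Binomial model with toy numbers I'm choosing for illustration. Note how a prior concentrated near the observed frequency inflates it, which is one way to see why it can't serve as an epistemological measure on its own:

```python
from math import comb, exp, lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(k, n, a, b):
    """P(data) = C(n, k) * B(a + k, b + n - k) / B(a, b) under a Beta(a, b) prior."""
    return exp(log(comb(n, k)) + log_beta(a + k, b + n - k) - log_beta(a, b))

k, n = 7, 10
ml_flat = marginal_likelihood(k, n, 1.0, 1.0)     # uniform prior: P(data) = 1/11
ml_sharp = marginal_likelihood(k, n, 70.0, 30.0)  # prior concentrated near 0.7
```

The prior peaked at the observed rate "wins" on total probability simply because it was chosen to match the data, not because it encodes any knowledge about the data-generating process.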
It makes some sense to talk about Bayesian methods quantifying epistemological information when you have some good reason to believe that some portion of your model family accurately captures the data-generating process, or at least the parts of that process you care about and are relevant for the predictions you want to make. But that's almost never the case for non-parametric methods.
And in the former case, models will be parameterized by meaningful causal variables & their effect-strength; in the latter case, parameters have no explanatory role.
And finally that: only in the explanatory case can a bayesian model be interpreted epistemically.
I think I agree with this -- the relevant epistemic interpretation of a model is how well it fits to the world -- NOT -- to data! Data is the means by which models are selected. So if a model is not explanatory (ie., about the world) there is no sense in which it "fits"; and thus no epistemic interpretation.
I love the idea of man and machine working together and wonder.
So, if you're proposing some theory, first implement it and demonstrate it. Otherwise, it's just vaporware.
Sometimes those functions work as classifiers, but other times they are just peculiar functions between two domains (imagine style transfer models or GANs).
Is Bayesian Deep Learning worth looking into if I don't have much interest in statistical applications but much more just using deep learning as general purpose functions? For that matter, what about causal inference?
If you manage to build a model with high accuracy and a sensible uncertainty on the output, you can use that information to do a lot of great things, such as:
- obviously, apply the method to domains that require uncertainty for legal or technical reasons (simulation)
- add samples to improve your knowledge around uncertain inputs (active learning)
- use an optimized betting strategy that takes risk into account (Bayesian optimization)
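A tiny sketch of the active-learning item: given per-point predictive uncertainty (faked with random numbers here, since any real model would do), query the points the model is least sure about:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pool of 100 unlabeled points and a model's predictive
# standard deviation for each (stand-in values for illustration).
pool = rng.normal(size=(100, 5))
pred_std = rng.random(100)

# Uncertainty sampling: label the k points the model is least sure about.
k = 5
query_idx = np.argsort(pred_std)[-k:][::-1]  # highest-variance first
next_batch = pool[query_idx]                 # send these for labeling
```

Labeling exactly where the model is uncertain is usually far more data-efficient than labeling at random, which is the whole appeal of having calibrated uncertainty in the first place.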