
The Case for Bayesian Deep Learning - hardmaru
https://cims.nyu.edu/~andrewgw/caseforbdl/
======
clircle
I'm a self-described Bayesian* at my day job, but the author needs to do
better to convince me that the Bayesian approach is worth it in the deep
learning space. As far as I can tell, deep learning folks don't give two shits
about uncertainty intervals, much less marginalization. All that matters is
minimizing that test error as fast as possible. So what if you get a posterior
for each parameter... Who cares about the parameters in a neural network as
long as the predictions seem well calibrated?
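(For what it's worth, "well calibrated" is something you can check directly. A
minimal sketch of expected calibration error, with toy probabilities and labels
made up for illustration:)

```python
import numpy as np

# Toy predicted probabilities for class 1 and the true labels (made up).
probs  = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0  ])

# Expected calibration error: bin by predicted probability, then compare
# the mean prediction in each bin to the observed frequency of class 1.
edges = np.linspace(0, 1, 5)             # 4 equal-width bins
bin_of = np.digitize(probs, edges[1:-1])
ece = 0.0
for b in range(len(edges) - 1):
    mask = bin_of == b
    if mask.any():
        gap = abs(probs[mask].mean() - labels[mask].mean())
        ece += mask.mean() * gap         # weight by the bin's share of samples
```

A perfectly calibrated model would give ece == 0; here every bin is off by
0.15, so ece comes out to 0.15.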

The most convincing rationale for adopting a Bayesian perspective is contained
in the collected works of Jim Berger, which I see are cited by the author... but
not used in the manuscript.

* Of course, a Bayesian is just a statistician that uses Bayesian techniques even when it's not appropriate -- Andrew Gelman

~~~
lostdog
> As far as I can tell, deep learning folks don't give two shits about
> uncertainty intervals, much less marginalization.

Maybe folks making Snapchat filters don't care, but this is absolutely vital
if you're doing something with a low margin of error (self-driving cars,
financial work, etc.). If your neural net can tell you when it thinks it could
be making an error, that information is invaluable for keeping your system
from messing up big time.

The problem is that none of the Bayesian methods for machine learning work
well and quickly (please do correct me if I'm wrong--I will become my boss's
favorite person). If they did, many many practitioners would be very excited
to use them.

~~~
nestorD
After a bit of searching, I found something that appears to work both well and
quickly:
[http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa...](http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa532c1ce.html)

The basic idea is brilliant: take a deep neural network with dropout, keep the
dropout at evaluation time, run it several times and deduce the uncertainty
from the observed variance (with a bit of math).

This lets you add an uncertainty estimation to the output of any deep neural
network that uses dropout.
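A minimal numpy sketch of the recipe (toy network with made-up weights; the
real method in the link also derives a correction term from the dropout math):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network with fixed, pretend-"trained" weights.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, drop_p=0.5):
    """One stochastic forward pass: dropout stays ON at evaluation time."""
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
    mask = rng.random(h.shape) > drop_p      # sample a fresh dropout mask
    return (h * mask / (1.0 - drop_p)) @ W2  # inverted-dropout scaling

x = rng.normal(size=(1, 4))
samples = np.stack([forward(x) for _ in range(100)])  # 100 stochastic passes
prediction  = samples.mean(axis=0)   # predictive mean
uncertainty = samples.std(axis=0)    # spread across passes ~ uncertainty
```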

~~~
lostdog
Yes, this is the best I've seen too. Unfortunately, "quickly" is a matter of
perspective.

Usually you train a neural net that's as large as possible with a runtime
that's within your latency and GPU budget. Running the net even twice will
blow up your budget.

Also, I haven't seen a great evaluation on how well this method works in
practice. For example, how well does predicted uncertainty capture mistakes on
a real-world dataset?

------
sgstfevsh
For folks that want to start making sense of what's being said I recommend
going through d2l.ai, fast.ai, and deeplearningbook.org.

Most of contemporary AI is really just a combination of probability theory and
some upper-division math classes in executable form. None of it is magic, and
the more people who know the vocabulary, the less likely they are to buy into
the hype.

If you want a high level overview of all this then Melanie Mitchell has a good
book as well:
[https://melaniemitchell.me/aibook/](https://melaniemitchell.me/aibook/). She
does a really good job of putting everything into the right context and
dispelling the marketing hype about the coming singularity and human
obsolescence. In one of the chapters she covers deep reinforcement learning
and it's one of the best high level explanations I've come across yet.

~~~
jbay808
> Most of contemporary AI is really just a combination of probability theory
> and some upper division math classes in an executable form. None of it is
> magic and the more people that know the vocabulary the less likely people
> are to buy into the hype

For those of us who think probability theory captures a large fraction of what
intelligence does, this reads a lot like "don't believe the hype behind
nuclear weapons! It's really just physics and some upper division chemistry in
an explosive form. None of it is magic..."

Maybe this argument will convince people who think that the human brain is
magic? But for those of us who think the human brain is not magic, but rather
proof that general intelligence is computable and needs only 20 watts, it
wouldn't be surprising if probability theory and upper-division math in
executable form were all it takes to be dangerous.

It only takes physics and upper division chemistry in explosive form to level
a city, after all. Not magic.

~~~
whataretheodds
What do you suppose is dangerous about linear algebra and basic multivariable
calculus? I personally don't think there is any danger in more people knowing
more linear algebra and calculus.

~~~
IAmEveryone
The nuclear bombs were an analogy, and the point it was trying to make is not
"danger" but "power".

------
AlexCoventry
> _The BMA [Bayesian Model Average, or total probability] represents epistemic
> uncertainty — that is, uncertainty over which setting of weights
> (hypothesis) is correct, given limited data_

It's easy to come up with model families where a given data set has a high
total probability because it has high probability in every model in the
family, so the total probability on its own cannot function as a general
epistemological measurement.
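A toy illustration (the Bernoulli family here is made up for the example): if
the data is roughly equally likely under every model in the family, the total
probability can be high while the posterior tells you nothing about which
hypothesis is right.

```python
import numpy as np

# Three nearly identical Bernoulli models and a uniform prior over them.
thetas = np.array([0.49, 0.50, 0.51])
prior = np.ones_like(thetas) / len(thetas)

data = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # 5 heads, 5 tails
k, n = data.sum(), len(data)

lik = thetas**k * (1 - thetas)**(n - k)  # likelihood of the data per model
marginal = np.sum(prior * lik)           # "total probability" of the data
posterior = prior * lik / marginal       # stays almost exactly uniform
```

The data sits comfortably in every model, so the marginal likelihood is as
high as it can be for this family, yet nothing epistemic has been learned
about which hypothesis is correct.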

Unless you're already modeling a neural net, it's extremely unlikely that the
model family represented by your Bayesian Deep Learning system includes
anything representing the actual data-generating process. It's not just me
saying this; Gelman & Shalizi point this out in their "Philosophy and the
Practice of Bayesian Statistics":

> _...it is hard to claim that the prior distributions used in applied work
> represent statisticians’ states of knowledge and belief before examining
> their data, if only because most statisticians do not believe their models
> are true, so their prior degree of belief in all of ϴ is not 1 but 0. The
> prior distribution is more like a regularization device, akin to the
> penalization terms added to the sum of squared errors when doing ridge
> regression and the lasso (Hastie, Tibshirani, & Friedman, 2009) or spline
> smoothing (Wahba, 1990)_

(Although they're only talking about the prior here, it applies equally well
to the total probability, which is just the likelihood of the data averaged
over the prior.)

It makes some sense to talk about Bayesian methods quantifying epistemological
information when you have some good reason to believe that some portion of
your model family accurately captures the data-generating process, or at least
the parts of that process you care about and are relevant for the predictions
you want to make. But that's almost never the case for non-parametric methods.

~~~
mjburgess
It seems like what you're saying is: Bayesian methods may be used as models of
causal processes, or as mere mechanisms to generate predictions.

And in the former case, models will be parameterized by meaningful causal
variables & their effect-strength; in the latter case, parameters have no
explanatory role.

And finally that: only in the explanatory case can a Bayesian model be
interpreted epistemically.

I think I agree with this -- the relevant epistemic interpretation of a model
is _how well it fits the world_ -- NOT how well it fits the data! Data is the
means by which models are selected. So if a model is not explanatory (i.e.,
about the world), there is no sense in which it "fits", and thus no epistemic
interpretation.

~~~
AlexCoventry
Yes, without some kind of semantics to the model, Bayesian methods are
essentially an elaborate form of regularization.

------
riku_iki
Did any model within bayesian DL achieve SOTA in any more or less popular
benchmark?

~~~
benrbray
I interned for a team working in this area, and my supervisor was quite happy
with their results using Bayesian Deep Learning on ImageNet:
[https://arxiv.org/abs/1906.02506](https://arxiv.org/abs/1906.02506)

------
sixdimensional
Does anybody on this thread know about Bayesian subjective probability? I have
been fascinated by it for years, and have recently come to wonder if it
represents something interesting in the way it combines expert knowledge with
an easily explainable, probability-based mathematical approach.

I love the idea of man and machine working together.

~~~
clircle
Do you have a specific question?

------
1024core
In today's ML world, theory follows practice. You have to demonstrate SOTA
results in a problem area before people will pay attention to your theory.

So, if you're proposing some theory, first implement it and demonstrate it.
Otherwise, it's just vaporware.

------
etaioinshrdlu
So to me, deep learning means learning very complex functions from data.

Sometimes those functions work as classifiers, but other times they are just
peculiar functions between two domains (imagine style transfer models or
GANs).

Is Bayesian Deep Learning worth looking into if I don't have much interest in
statistical applications but much more just using deep learning as general
purpose functions? For that matter, what about causal inference?

~~~
nestorD
> Is Bayesian Deep Learning worth looking into if I don't have much interest
> in statistical applications but much more just using deep learning as
> general purpose functions?

If you manage to build a model with high accuracy and a sensible uncertainty
on the output, you can use that information to do a lot of great things, such
as:

  - applying the method to domains that require uncertainty estimates for
    legal or technical reasons (simulation)
  - adding samples to improve your knowledge around uncertain inputs (active
    learning)
  - using an optimized betting strategy that takes risk into account
    (Bayesian optimization)

Gaussian processes are a prime example of that, but I am not aware of a deep
learning approach that realises this (yet).
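To make the active-learning point concrete, here is a rough sketch (everything
in it is a stand-in: the target function is made up, and a bootstrap ensemble
of polynomial fits plays the role of a Bayesian model's uncertainty):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)   # hypothetical expensive function to learn

X = rng.uniform(-1, 1, size=6)         # initial labelled points
y = f(X)
candidates = np.linspace(-1, 1, 101)   # places we could sample next

for _ in range(5):
    # Cheap uncertainty proxy: a bootstrap ensemble of cubic fits.
    preds = []
    for _ in range(20):
        idx = rng.integers(0, len(X), size=len(X))   # resample the data
        coef = np.polyfit(X[idx], y[idx], deg=3)
        preds.append(np.polyval(coef, candidates))
    uncertainty = np.std(preds, axis=0)

    # Active-learning step: label the point the ensemble disagrees on most.
    x_new = candidates[np.argmax(uncertainty)]
    X = np.append(X, x_new)
    y = np.append(y, f(x_new))
```

Each round spends its one "expensive" label where the model is least sure,
which is exactly the loop a calibrated uncertainty estimate buys you.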

~~~
nestorD
I stand corrected: after some research, I found a paper that offers a
promising way to get good uncertainty information from a deep learning model:
[http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa...](http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa532c1ce.html)

------
benrbray
Related: My former supervisor (Dr. Emtiyaz Khan @ RIKEN AIP) held a NeurIPS
2019 Tutorial on "Deep Learning with Bayesian Principles". He gives a nice
high-level overview of this area of research.

