
A Sober Look at Bayesian Neural Networks - hardmaru
https://jacobbuckman.com/2020-01-17-a-sober-look-at-bayesian-neural-networks/
======
manthideaal
The critique is about the importance of priors in BNNs. In my humble
understanding of Bayesian reasoning, the argument in defense of any prior is
that with enough data the learning method converges to the real distribution;
so if the result of a learning method depends heavily on some prior
assumption, then that assumption is crucial and can in no way be chosen
arbitrarily. On the other hand, it is well known that deep learning can learn
any random model, so in the end I think all of this is about the bias-variance
trade-off. If your prior has an infinite number of adjustable parameters (zero
bias), then the variance becomes infinite (your result will depend on, and
effectively become equal to, the training set).
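
For reference, this is the standard squared-error decomposition I'm appealing
to (textbook form; the expectation is taken over training sets and noise):

    \mathbb{E}\big[(y - \hat{f}(x))^2\big]
      = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
      + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
      + \sigma^2_{\text{noise}}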

So in practice one should choose the prior with the minimum number of
parameters (bias) that still shows good learning performance on the available
training set.

Anyway, trying to measure whether a prior generalizes in BNNs seems to be
another way of thinking about bias-variance; if there is more to it than this,
I would like to know.

~~~
jacobbuckman
Have you seen the latest research on double descent? Here's a good intro, with
references to some of the foundational work:
[https://openai.com/blog/deep-double-descent/](https://openai.com/blog/deep-double-descent/)

It seems bias-variance doesn't apply to neural networks at all! So your
intuitions are good, but there's definitely more to the story.
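
If you want to see the effect without training a neural net, here's a minimal
sketch of my own (random-feature regression solved with the minimum-norm
least-squares solution, a setup common in that literature; the exact numbers
are arbitrary). The test error often peaks near the interpolation threshold
and descends again beyond it:

    # Toy double-descent probe (my own sketch, not from the OpenAI post):
    # minimum-norm least squares on random cosine features. Test error often
    # peaks near n_features == n_train and descends again past it.
    import numpy as np

    rng = np.random.default_rng(0)
    n_train = 40
    x_tr = rng.uniform(-1, 1, n_train)
    x_te = rng.uniform(-1, 1, 500)
    f = lambda x: np.sin(2 * np.pi * x)      # ground-truth function
    y_tr = f(x_tr) + 0.1 * rng.normal(size=n_train)

    def features(x, freqs):
        return np.cos(np.outer(x, freqs))    # random cosine features

    for k in (5, 20, 40, 80, 400):           # 40 == interpolation threshold
        freqs = rng.normal(scale=8.0, size=k)
        w, *_ = np.linalg.lstsq(features(x_tr, freqs), y_tr, rcond=None)
        test_mse = np.mean((features(x_te, freqs) @ w - f(x_te)) ** 2)
        print(k, round(float(test_mse), 3))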

~~~
manthideaal
About the work you cite, I think that double descent arises simply because the
extra parameters (low bias) produce a large variance when the input data is
small, but as more data is introduced the extra parameters stop playing any
role, that is, they are pruned. So whether the bias is high or low is relative
to the quantity of available information (training data). The second descent
starts when the extra parameters are pruned: in practice their coefficients
tend to zero, and the system learns that those coefficients don't play any
role.

------
bjornsing
Lack of good priors is definitely a weak spot for BNNs. I also like the
concept of Generalization-Agnostic Priors. But...

> So viewed through this lens, BNNs with arbitrary priors are nothing more
> than an architectural decision. A BNN is just a neural network that maps its
> input to a distribution over outputs; the prior is just a hyperparameter of
> the model. Just making the network Bayesian bought us nothing. It will only
> be helpful if we find a good prior, and validate that we are actually doing
> accurate inference. If you personally believe that exploring this space of
> priors (similar to exploring the space of architectures or hyperparameters)
> is particularly promising, then that is a good reason to keep working on
> BNNs.

Isn't this overstating it a bit... Making the network Bayesian (even if the
priors are generalization-agnostic) still bought us something: after training
we have a posterior distribution over weights that contains _both_ weights
that generalize and weights that don't generalize. That's better than being
unlucky and only having weights that don't generalize at the end of training.
I'm not saying this advantage is easy to achieve in practice, or to benefit from.
But it's not nothing.

Also Bayesian inference as a framework has a distinct advantage over the usual
aimless architectural twiddling that so much "research" seems to focus on:
it's intellectually/mathematically sound.

~~~
jvanderbot
Agree. Throwing prior weights into hyperparameter sets is not helpful. If each
hyperparameter or architecture decision were rigorously tied to an
interpretable event or process, we'd have much better explainability and
traceability. It's a step in the right direction to try.

------
psv1
The paragraph beginning with "Let’s consider how we might apply the Bayesian
framework..." where he introduces the notation is a great example of
everything I hate about mathematical notation. We have big-F, small-f, f-of-x,
f-sub-x, f-star, big-F-star... and then he decides to abbreviate what he just
introduced. If I didn't already know what was happening and was trying to
understand this for the first time, I would have no chance and would just give
up right there.

~~~
jpeanuts
The reason those are all "f"s is that they are all versions of the same thing:
the function mapping features to outputs, or approximations of it. The capital
"F"s refer to random variables/processes describing the same function (using
capitals for RVs and lower-case for samples is standard practice in
statistics).

By using this notation he is drawing careful distinctions between the various
approximations he's using. I think it's pretty good writing.

~~~
psv1
It's great that it's consistent. My problem is that the notation only makes
sense _if you already understand_ the very thing that he's trying to explain
with this notation.

~~~
tel
That's not entirely true. The point being made is about the consequences of
the design being set up with that notation. That design and that notation are
reasonably general, and they require some familiarity with notation around
mathematical statistics, modeling, Bayesian formalisms, and random variables.

The thing he's trying to explain is how those things interact and what their
behavior is.

------
CrazyStat
The title of the article is at odds with the purely speculative nature of the
core part of the argument--the section titled "Are Current BNNs
Generalization-Agnostic?". The authors admit themselves that it is pure
speculation. They present a speculative class of generalization-agnostic
priors, speculate that commonly used priors might belong to this class, and
then speculate about why we haven't observed commonly used priors belonging to
this class even though they speculatively might belong to this class. This
argument is not very convincing, to say the least.

Generalizability is a function of both the network structure and the priors.
Non-Bayesian networks without priors can be seen as pseudo-Bayesian networks
with (improper) flat priors. The use of independent priors that are unimodal
at zero, like the independent normal priors mentioned in the article, will
tend to shrink weights toward zero, which makes the function smoother.
Smoother functions tend to generalize better (measuring generalizability as
the gap between in-sample and out-of-sample error). The classic bias-variance
trade-off is that they will generally perform worse in-sample and may also
perform worse out-of-sample. Heuristically, if you take any neural network and
make it Bayesian by putting priors on the weights, it will (averaging over
datasets) generalize no worse than the original network. It may have worse
absolute performance.
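
To make the shrinkage point concrete, here's a minimal sketch of my own (a toy
linear-Gaussian model where everything has a closed form, not anything from
the article): the MAP estimate under independent normal priors is exactly
ridge regression, with the penalty strength set by the noise-to-prior variance
ratio.

    # Minimal sketch (toy linear model): an independent N(0, sigma2_prior)
    # prior on each weight turns MAP estimation into L2-penalized fitting,
    # i.e. shrinkage toward zero.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))            # 50 samples, 10 features
    y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=50)

    sigma2_noise, sigma2_prior = 0.25, 1.0
    lam = sigma2_noise / sigma2_prior        # equivalent ridge strength

    # Closed forms: MAP under the Gaussian prior == ridge regression.
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)  # flat (improper) prior

    print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True: shrunk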

The authors also argue that we can't be confident in the uncertainties from
BNNs until we have a good theoretical understanding of NN generalization in
general. They ignore the possibility of building confidence in the
uncertainties, both in general and for any particular network, by empirical
observation--the same way we've built confidence in generalization of non-
Bayesian NNs, since, as the article points out, we don't yet have a good
theoretical understanding of NN generalization.

------
nafizh
Frankly, just solve an important problem with BNNs in a meaningful way, with
advantages that only BNNs bring (e.g. uncertainty estimation), and you won't
have to write blog posts defending them.

~~~
tw1010
There's value in collaboratively discussing half-baked ideas in public before
any real applications have been produced.

Imagine this article as just another in a stream of posts trying to "think
aloud" about BNNs, without any immediate pressure of applications for it.

------
tel
My summary:

Neural nets have massive "capacity", which means that, in the face of finite
data sets, they can both (a) reasonably represent generalizable and non-
generalizable functions and (b) take on priors which do not distinguish
between those classes of functions. The upshot is that after training, the
posterior weight of robust/generalizable models will equal that of
fragile/non-generalizable ones.

We need to believe that the priors that we actually use don't have that
property if we're to believe in the posteriors produced by BNNs. Should we?

Today, priors in networks arise mostly out of network topology since
initialization methods are somewhat constrained by practicalities in training.
The article criticizes those who would assert that network topology (+
initialization) leads to a reasonable prior in the space of effective input ->
output functions as realized by the network.

To put that in different terms, you might imagine an argument saying that
network topologies are biologically inspired and thus represent a decent
approximation of the space of "achievable" implementations of functions in the
given task. But does an argument like this say anything about the
generalization capability of functions favored by this prior? You might
characterize this as "Easy" versus "Correct".

I'm not actually trying to reproduce the arguments that neural network
topologies are reasonable in shaping "uninformative" Gaussian priors in weight
space into "uninformative" yet "generalizable" priors in function space. There
may be some really good arguments out there. But if we're going to understand
NNs as reasonable Bayesian processes, then that question needs to be
interrogated.
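
To make "prior in the space of functions" concrete, here's a rough sketch of
my own (the layer sizes and scales are arbitrary, hypothetical choices):
drawing the weights of a tiny MLP from a Gaussian gives samples from the
induced functional prior, which you can inspect directly.

    # Weight-space prior -> function-space prior, by sampling (toy sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)[:, None]     # 1-D inputs for easy inspection

    def sample_function(sigma=1.0, hidden=64):
        # One draw from the functional prior induced by Gaussian weights.
        W1 = sigma * rng.normal(size=(1, hidden))
        b1 = sigma * rng.normal(size=hidden)
        W2 = sigma * rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
        return (np.tanh(x @ W1 + b1) @ W2).ravel()

    draws = np.stack([sample_function() for _ in range(5)])
    # The smoothness/variability of these draws *is* the induced prior over
    # functions; nothing in it says whether the favored functions generalize.
    print(draws.shape, float(draws.std()))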

------
1e-9
> We should ask, “what evidence are you providing that your priors are any
> good?”

This is valid. Anyone pursuing a Bayesian approach should be asking themselves
this question about every prior they use. To fully benefit from a Bayesian
framework, one needs to construct models with understandable parameters for
which there is some sound theoretical or practical insight that can be
embedded with priors and that is not well-represented by the training data.
Doing this can help your solution avoid the kind of wildly unpredictable and
costly mistakes you might get if you used a completely blackbox approach. For
critical applications, this can be highly useful. If you can't come up with
priors that are clearly beneficial, then you are likely better off using a
non-Bayesian approach.

------
YeGoblynQueenne
EDIT: I misread the quote below: it applies to a distribution over functions,
not examples. My bad, and thanks to one of the authors of the post for
politely correcting me in the reply below.

>> But there is one core problem with the Bayesian framework. In practice, we
never have access to the prior distribution Pr(f)! Who could ever claim to
know the real-world distribution of functions that solve classification tasks?
Not us, and certainly not Bayesians. Instead, BNNs simply choose an arbitrary
prior distribution q(f) over functions, and Bayesian inference is performed to
compute q(f∣D). The question of whether q(f) is close to the true distribution
Pr(f) is swept under the rug.

This is true but it's also nothing new: it's the standard PAC-Learning
assumption that the examples (the dataset) are drawn from the same
distribution as the target theory (the real-world distribution).

This assumption, and the complete impossibility of verifying it in practice,
is not unique to Bayesian Neural Nets. It is true for _every_ machine learning
algorithm.

And this is certainly no surprise for machine learning researchers (or, if it
is, it is really concerning that it is). So the done thing in machine learning
research is to demonstrate that, under PAC-Learning assumptions, a certain
technique or algorithm _can_ correctly identify a hypothesis that approximates
a "true" function to within some amount of error.

I mean to say, when people publish papers reporting a new SOTA on such-and-
such dataset, they are not really claiming that their technique somehow finds
the "true" distribution of the real-world process that generated the data in
their dataset. They're claiming "we can correctly classify instances in this
dataset and if PAC-Learning assumptions hold, this technique should also work
in real-world data from the same domain".

Of course this is often left implicit, and the article makes me wonder to what
extent that is because researchers tend to forget or even :gasp: ignore it
completely. A disturbing thought.

~~~
cgel
Author here. There is a misunderstanding in your argument: your point is about
the dataset being sampled from the true distribution. We are happy with that
assumption (it's orthogonal to our point).

The problem we have is that to apply Bayes' rule you NEED a prior distribution
over the correct functions, one that covers both the points in the dataset and
the points outside of it. In other words, it is one thing to assume that the
dataset is representative of the (unknown) classification task, and quite
another to assume that you know what the distribution over classification
tasks is.

~~~
YeGoblynQueenne
Hi. Yes, I see- I misunderstood this. My apologies for the hasty reading of
your post.

But, in that case, there does exist a very good generalisation prior on
function space that is well known and well understood: the simplest hypothesis
(e.g. the one with the minimum description length) is always better, because
it reduces the hypothesis search space, with a corresponding reduction in the
error on unseen data while keeping the number of examples constant.

See:

Occam's Razor (Blumer and friends):

[https://www.sciencedirect.com/science/article/pii/0020019087...](https://www.sciencedirect.com/science/article/pii/0020019087901141)

Quoting from the abstract:

 _We show that a polynomial learning algorithm, as defined by [ "A theory of
the learnable", Valiant 1984], is obtained whenever there exists a polynomial-
time method of producing, for any sequence of observations, a nearly minimum
hypothesis that is consistent with these observations._
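
For concreteness, the quantified version of that parenthetical is (if I recall
the standard form correctly) the PAC bound for a consistent learner over a
finite hypothesis class H: with probability at least 1 - delta over m
examples, every h in H consistent with the sample satisfies

    \operatorname{err}(h) \;\le\; \frac{\ln|\mathcal{H}| + \ln(1/\delta)}{m}

so shrinking the hypothesis space (preferring simpler hypotheses) shrinks the
error bound while keeping the number of examples constant.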

Would that begin to address your concerns?

------
salty_biscuits
I am going to be obtuse and say that since Bayes' theorem is an
uncontroversial rule about conditional probability, if you interpret your NN
as a probability distribution over outputs that is updated by data, it can
always be interpreted as "Bayesian". This can be a helpful way to examine what
your implicit priors are (i.e. via architecture or regularization terms) and
to see if they are reasonable for the problem at hand. It is no surprise that
weak uninformative priors are sort of useless. Explicit priors shine when you
know something about the actual problem (say some moments, or some invariances
in the problem setup).

------
earlyadopter2
Shouldn't the priors be updated and improved at each step to get closer to a
good prior, and isn't that why inaccurate priors may be acceptable? (Do BNNs
not iteratively update the next step's prior with the previous step's
posterior?) I haven't worked on BNNs, but since Bayesians are always talking
about updating their priors I thought this would be the case.

~~~
1e-9
You are describing a recursive Bayesian approach, which can have significant
computational and storage advantages for filtering (for example, Kalman
filters). For this to work well, the prior must be able to adequately
represent the learning of the posterior, which may be practical with a self-
conjugate prior or a Monte Carlo approximation such as what particle filters
use. In practice, for nontrivial machine learning applications, self-conjugate
distributions rarely model the problem well and good approximations of the
posterior into a concise prior are rarely practical.
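
Here's a minimal sketch of why conjugacy makes this tractable (a scalar
Gaussian toy of my own, not a BNN; a Kalman filter is the multivariate,
dynamic version of the same update):

    # Recursive Bayes with a self-conjugate (Gaussian) prior: the posterior
    # after each observation is again Gaussian, so it can serve directly as
    # the next step's prior.
    import numpy as np

    rng = np.random.default_rng(0)
    theta_true, obs_var = 1.5, 0.25          # latent value, noise variance
    mu, var = 0.0, 10.0                      # broad initial prior on theta

    for _ in range(20):
        y = theta_true + rng.normal(scale=obs_var ** 0.5)
        k = var / (var + obs_var)            # Kalman-style gain
        mu = mu + k * (y - mu)               # posterior mean
        var = (1.0 - k) * var                # posterior variance
        # (mu, var) is now the prior for the next observation.

    print(round(mu, 3), round(var, 4))       # mean near 1.5, variance shrunk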

------
mbeex
The author starts with: P(A|B) = P(B|A)P(B)/P(A)

This is Bayes the wrong way around: the last factor should be P(A)/P(B), i.e.
P(A|B) = P(B|A)P(A)/P(B). I have no hand in the dispute (I'm reading about it
here for the first time), but not getting the basics right is not very
convincing.

~~~
tw1010
It's not "not getting the basics right", it's a simple typo. A thing like that
shouldn't invalidate a whole article (unless you have skin in the game for the
opponent argument).

Why does HN have a pattern of dismissing whole articles due to simple typos?
It's as if we're so habituated to skim and do tldr-reading that our brain is
working overdrive to find the slightest excuse not to have to do any type of
reading beyond surface level.

~~~
mbeex
> unless you have skin in the game for the opponent argument

I stated explicitly otherwise (I'm not even in research), something you
certainly couldn't have missed. I'm not quite sure your allegation doesn't
backfire.

Especially in the heat of the moment, people make errors. But precisely then
they should avoid making themselves an easy target. The author's site also has
some problems with rendering math (some of it renders, some still shows dollar
signs), and this adds to a first impression of sloppiness. It is his decision
to go public, but then he has to take the consequences.

~~~
cgel
Author here. Sorry that the typo and the render errors affected you so much.
We don't see any rendering issues on our end; if you tell us what browser you
are using, maybe we can replicate and fix them.

~~~
mbeex
> Sorry that the typo and the render errors affected you so much.

It seems to concern some commenters here to a much greater extent, judging
from the whole downvoting dance. I'm old-school: I received my master's in
mathematics more than 25 years ago. Having been in stochastics, I simply
spotted an error, and also some dispute, the latter from the context in the
article. I want to mention that in the past, people considered behavior like
mine helpful; rigour was a value, and especially when under fire, people were
expected to try even harder.

> what browser you are using maybe we can replicate and fix them

FF 72.0.1 on Win 10 here. If it helps, in the following sequence:

\--- snip ---

Bayes’s Rule simply says that for any two non-independent random variables $A$
and $B$, seeing that $B$ took a specific value $b$ changes the distribution of
the random variable $A$. In standard lingo, the term Pr(A=a) is called the
prior, Pr(B=b∣A=a) is the likelihood, and Pr(A=a∣B=b) is the posterior.

\---snap---

all the single capitals (A, B, ...) appear wrapped in dollar signs, while all
the Pr(...) expressions render correctly (even after removing uBlock/NoScript
restrictions)

~~~
cgel
Thanks! We just fixed an issue. I was able to test it on FF for macOS, but it
would be very helpful if you could confirm the problem is fixed on your end.

Also, I'll admit that messing up Bayes' rule in a blog post criticizing
Bayesian Neural Networks is pretty comical. I should have taken the time to
proofread the whole thing.

~~~
mbeex
> if you could confirm the problem is fixed

Yes, it's working :)

------
_jmw125
some feedback for the authors:

\- some simple 1d plots to illustrate your points would be very illuminating;
without them, i'm not very convinced by your arguments

\- 'generalization-agnostic' is very frustrating as a term, i'm sure you can
think of something clearer

\- i'm not sure that your argument 'a BNN is only as good as its prior' is any
better than 'a NN is only as good as its initialisation'; yes, any NN model is
for sure a victim of local optima, but for most practitioners this is good
enough

