The critique is about the importance of priors in BNNs. In my humble understanding of Bayesian reasoning, the argument for defending any prior is that with enough data the learning method converges to the real distribution; so if the result of a learning method depends heavily on some prior assumption, then that assumption is crucial and can in no way be chosen arbitrarily. On the other hand, it is well known that deep learning can learn any random model, so in the end I think all of this is about the bias-variance trade-off. If your prior has an infinite number of adjustable parameters (zero bias), then the variance becomes infinite (your result depends entirely on, and becomes equal to, the training set).
So in practice one should choose the prior with the minimum number of parameters (bias) that shows good learning performance on the available training set.
Anyway, trying to measure how well a prior generalizes in BNNs seems to be another way of thinking about bias-variance; if there is more to it than this, I would like to know.
About the work you cite, I think double descent happens simply because the extra parameters (low bias) produce a large variance when the amount of input data is small, but as more data is introduced the extra parameters stop playing any role, i.e. they are pruned. So the high bias is relative to the quantity of available information (training data). The second descent starts when the extra parameters are pruned: in practice their coefficients tend to zero, as the system learns that those coefficients play no role.
Lack of good priors is definitely a weak spot for BNNs. I also like the concept of Generalization-Agnostic Priors. But...
> So viewed through this lens, BNNs with arbitrary priors are nothing more than an architectural decision. A BNN is just a neural network that maps its input to a distribution over outputs; the prior is just a hyperparameter of the model. Just making the network Bayesian bought us nothing. It will only be helpful if we find a good prior, and validate that we are actually doing accurate inference. If you personally believe that exploring this space of priors (similar to exploring the space of architectures or hyperparameters) is particularly promising, then that is a good reason to keep working on BNNs.
Isn't this overstating it a bit... Making the network Bayesian (even if the priors are generalization-agnostic) still bought us something: after training we have a posterior distribution over weights that contains both weights that generalize and weights that don't generalize. That's better than being unlucky and only having weights that don't generalize at the end of training. I'm not saying this advantage is easy to achieve in practice, or benefit from. But it's not nothing.
Also Bayesian inference as a framework has a distinct advantage over the usual aimless architectural twiddling that so much "research" seems to focus on: it's intellectually/mathematically sound.
Agree. Throwing prior weights into hyperparameter sets is not helpful. If each hyperparameter or architecture decision were rigorously tied to an interpretable event or process, we'd have much better explainability and traceability. It's a step in the right direction to try.
The paragraph beginning with "Let’s consider how we might apply the Bayesian framework..." where he introduces the notation is a great example of everything I hate about mathematical notation. We have big-F, small-f, f-of-x, f-sub-x, f-star, big-F-star... and then he decides to abbreviate what he just introduced. If I didn't know what's happening and I was trying to understand this for the first time, I would have no chance and would just give up right there.
The reason those are all "f"s is that they are all versions of the same thing: the function mapping features to outputs, or approximations of it. The capital "F"s refer to random variables/processes describing the same function (using capitals for RVs and lower-case for samples is standard practice in statistics).
By using this notation he is drawing careful distinctions between the various approximations he's using. I think it's pretty good writing.
It's great that it's consistent. My problem is that the notation only makes sense if you already understand the very thing that he's trying to explain with this notation.
That's not entirely true. The point being made is about the consequences of the design being set up with that notation. That design and that notation is reasonably general. It requires some familiarity with notation around mathematical statistics, modeling, Bayesian formalisms, and random variables.
The thing he's trying to explain is how those things interact and what their behavior is.
Hey, author here. The writing on this section was a bit tricky to get right, but we did our best to keep it as clean as possible while still being precise about the concepts we were considering. And it's definitely not perfect; I've just made some small edits to hopefully make things a bit more clear.
This blog post is a response to ongoing discussion with the Bayesian community, so it was primarily aimed at a more technical audience. If you have any suggestions for how to make the writing more accessible, without becoming so overly expository that the mathematically-robust folk lose interest, I would love to discuss them.
So here’s one thing about mathematical notation. If it uses similar symbols/scripts/subscripts that means that the objects are related, up to the minor difference expressed. So choosing “f” for everything is often a deliberate and well-motivated decision (Very much like naming variables). Doing that well is an art form.
Unfortunately, yes, it does take some time to get used to it (both reading, and generating such names), but IMHO it’s far better notation than otherwise.
The title of the article is at odds with the purely speculative nature of the core part of the argument--the section titled "Are Current BNNs Generalization-Agnostic?". The authors admit themselves that it is pure speculation. They present a speculative class of generalization-agnostic priors, speculate that commonly used priors might belong to this class, and then speculate about why we haven't observed commonly used priors belonging to this class even though they speculatively might belong to this class. This argument is not very convincing, to say the least.
Generalizability is a function of both the network structure and the priors. Non-Bayesian networks without priors can be seen as pseudo-Bayesian networks with (improper) flat priors. The use of independent priors that are unimodal at zero, like the independent normal priors mentioned in the article, will tend to shrink weights toward zero, which makes the function smoother. Smoother functions tend to generalize better (measuring generalizability as the gap between in-sample and out-of-sample error). The classic bias-variance trade-off is that they will generally perform worse in-sample and may also perform worse out-of-sample. Heuristically, if you take any neural network and make it Bayesian by putting priors on the weights, it will (averaging over datasets) generalize no worse than the original network. It may have worse absolute performance.
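To make the shrinkage point concrete, here is a minimal Python sketch (the sizes, seed, and noise level are made up) of the textbook correspondence: MAP estimation under independent Normal(0, prior_var) priors on the weights of a linear model is just ridge regression, i.e. L2 weight decay, which is one precise sense in which such priors pull weights toward zero.

    import numpy as np

    # Toy regression problem; the sizes, seed, and noise are arbitrary illustrations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(15, 10))
    y = X @ rng.normal(size=10) + rng.normal(size=15)

    noise_var = 1.0   # assumed observation-noise variance
    prior_var = 1.0   # variance of the independent Normal(0, prior_var) prior on each weight

    # MAP under the Gaussian prior minimizes
    #   ||y - Xw||^2 / (2 * noise_var) + ||w||^2 / (2 * prior_var),
    # which is ridge regression with lambda = noise_var / prior_var.
    lam = noise_var / prior_var
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

    # Maximum likelihood (i.e. a flat prior): plain least squares.
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

    print(np.linalg.norm(w_map), np.linalg.norm(w_mle))  # the MAP weights have the smaller norm

The same trade-off described above shows up directly: a tighter prior (larger lambda) means more bias and smaller weights; a flatter prior lets the fit track the training data more closely.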
The authors also argue that we can't be confident in the uncertainties from BNNs until we have a good theoretical understanding of NN generalization in general. They ignore the possibility of building confidence in the uncertainties, both in general and for any particular network, by empirical observation--the same way we've built confidence in generalization of non-Bayesian NNs, since, as the article points out, we don't yet have a good theoretical understanding of NN generalization.
Frankly, just solve an important problem with BNNs in a meaningful way, with advantages that only BNNs bring (e.g. uncertainty estimation), and you won’t have to write blog posts defending them.
Neural nets have massive "capacity" which means that, in the face of finite data sets, they can both (a) reasonably represent generalizable and non-generalizable functions and (b) can take on priors which do not distinguish between those classes of functions. The upshot is that after training, the posterior weight of robust/generalizable models will equal that of fragile/non-generalizable ones.
We need to believe that the priors that we actually use don't have that property if we're to believe in the posteriors produced by BNNs. Should we?
Today, priors in networks arise mostly out of network topology since initialization methods are somewhat constrained by practicalities in training. The article criticizes those who would assert that network topology (+ initialization) leads to a reasonable prior in the space of effective input -> output functions as realized by the network.
To put that in different terms, you might imagine an argument saying that network topologies are biologically inspired and thus represent a decent approximation of the space of "achievable" implementations of functions in the given task. But does an argument like this say anything about the generalization capability of functions favored by this prior? You might characterize this as "Easy" versus "Correct".
I'm not trying to actually represent argumentation that neural network topologies actually are reasonable in shaping "uninformative" Gaussian priors in the weight space into "uninformative" and "generalizable" priors in the function space. There may be some really good arguments out there. But, if we're going to understand NNs as reasonable Bayesian processes, then that question needs to be interrogated.
> We should ask, “what evidence are you providing that your priors are any good?”
This is valid. Anyone pursuing a Bayesian approach should be asking themselves this question about every prior they use. To fully benefit from a Bayesian framework, one needs to construct models with understandable parameters for which there is some sound theoretical or practical insight that can be embedded with priors and that is not well-represented by the training data. Doing this can help your solution avoid the kind of wildly unpredictable and costly mistakes you might get if you used a completely black-box approach. For critical applications, this can be highly useful. If you can't come up with priors that are clearly beneficial, then you are likely better off using a non-Bayesian approach.
EDIT: I misread the quote below: it applies to a distribution over functions, not examples. My bad, and thanks to one of the authors of the post for politely correcting me in the reply below.
>> But there is one core problem with the Bayesian framework. In practice, we never have access to the prior distribution Pr(f)! Who could ever claim to know the real-world distribution of functions that solve classification tasks? Not us, and certainly not Bayesians. Instead, BNNs simply choose an arbitrary prior distribution q(f) over functions, and Bayesian inference is performed to compute q(f∣D). The question of whether q(f) is close to the true distribution Pr(f) is swept under the rug.
This is true but it's also nothing new: it's the standard PAC-Learning assumption that the examples (the dataset) are drawn from the same distribution as the target theory (the real-world distribution). This assumption, and the complete impossibility of verifying it in practice, is not unique to Bayesian Neural Nets. It is true for _every_ machine learning algorithm.
And this is certainly no surprise for machine learning researchers (or, if it is, it is really concerning that it is). So the done thing in machine learning research is to demonstrate that, under PAC-Learning assumptions, a certain technique or algorithm _can_ correctly identify a hypothesis that approximates a "true" function to within some amount of error.
I mean to say, when people publish papers reporting a new SOTA on such-and-such dataset, they are not really claiming that their technique somehow finds the "true" distribution of the real-world process that generated the data in their dataset. They're claiming "we can correctly classify instances in this dataset and if PAC-Learning assumptions hold, this technique should also work in real-world data from the same domain".
Of course this is often left implicit, and the article makes me wonder to what extent this is because researchers tend to forget or even :gasp: ignore it completely. A disturbing thought.
Author. There is a misunderstanding in your argument. Your point is about the dataset being sampled from the true distribution. We are happy with that assumption (it's orthogonal to our point).
The problem we have is that to apply Bayes rule you NEED a prior distribution over the correct functions, applied to the points in the dataset and to the points outside the dataset. In other words, it's one thing to assume that the dataset is representative of the (unknown) classification task; it's another to assume that you know what the distribution over classification tasks is.
Hi. Yes, I see: I misunderstood this. My apologies for the hasty reading of your post.
But, in that case, there does exist a very good generalisation prior on function space that is well known and well understood: the simplest hypothesis (e.g. the one with the smallest minimum description length) is always better, because it results in a reduction of the hypothesis search space, with a corresponding reduction in the error on unseen data while keeping the number of examples constant (the standard finite-class bound, sketched after the quote below, makes this precise).
> We show that a polynomial learning algorithm, as defined by ["A theory of the learnable", Valiant 1984], is obtained whenever there exists a polynomial-time method of producing, for any sequence of observations, a nearly minimum hypothesis that is consistent with these observations.
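For reference, the parenthetical argument above is essentially the standard finite-hypothesis-class PAC bound (written here in LaTeX; H is the hypothesis space, m the number of examples, \epsilon the error on unseen data, \delta the failure probability):

    \Pr\big[\,\exists\, h \in H :\ \hat{\mathrm{err}}_D(h) = 0 \;\wedge\; \mathrm{err}(h) > \epsilon \,\big]
        \;\le\; |H|\,(1 - \epsilon)^m \;\le\; |H|\, e^{-\epsilon m} \;\le\; \delta
        \quad \text{whenever} \quad m \ge \tfrac{1}{\epsilon}\big(\ln|H| + \ln\tfrac{1}{\delta}\big).

Shrinking the hypothesis space (e.g. by preferring hypotheses with shorter descriptions) shrinks the bound at a fixed number of examples, which is the sense in which "prefer the simplest consistent hypothesis" buys generalization.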
I am going to be obtuse and say that since Bayes theorem is an uncontroversial rule about conditional probability, if you interpret your NN as a probability distribution over outputs that is updated by data, it can always be interpreted as "Bayesian", and this can be a helpful way to examine what your implicit priors are (i.e. via architecture or regularization terms) to see if they are reasonable for the problem at hand. It is no surprise that weak uninformative priors are sort of useless. Explicit priors shine when you know something about the actual problem (say some moments or some invariances in the problem setup).
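As a tiny illustration of that last point, here is a hedged Python sketch (the numbers are made up) of the conjugate Normal-Normal case: with only a couple of observations, a prior that encodes real knowledge about the problem ("the mean is near 5") yields a noticeably tighter posterior than a vague one, while with enough data both converge to the same answer.

    import numpy as np

    def posterior(prior_mean, prior_var, data, noise_var):
        # Conjugate update for the mean of a Normal with known noise variance:
        # precisions add, and the posterior mean is a precision-weighted average.
        n = len(data)
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
        return post_mean, post_var

    data = np.array([4.8, 5.3])  # just two hypothetical observations
    print(posterior(0.0, 100.0, data, noise_var=1.0))  # vague prior: posterior variance ~0.50
    print(posterior(5.0, 0.5, data, noise_var=1.0))    # informative prior: posterior variance 0.25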
Shouldn’t the priors be updated and improved each step to be closer to a good prior, and that’s why inaccurate priors may be acceptable? (Do BNNs not iteratively update the (next step’s) prior with the previous step’s posterior?) I haven’t worked on BNNs, but since Bayesians are always talking about updating their priors I thought this would be the case.
You are describing a recursive Bayesian approach, which can have significant computational and storage advantages for filtering (for example, Kalman filters). For this to work well, the prior must be able to adequately represent the learning of the posterior, which may be practical with a self-conjugate prior or a Monte Carlo approximation such as what particle filters use. In practice, for nontrivial machine learning applications, self-conjugate distributions rarely model the problem well and good approximations of the posterior into a concise prior are rarely practical.
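A minimal Python sketch of the recursion described above, in the one-parameter conjugate case where it works exactly (Beta-Bernoulli); the observation stream is made up. The posterior after each observation is literally reused as the next step's prior, which is what has no exact analogue for BNN posteriors and forces the approximations mentioned above.

    # Beta(1, 1) prior on an unknown success probability.
    alpha, beta = 1.0, 1.0

    stream = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical binary observations arriving one at a time
    for x in stream:
        # Conjugacy: the Beta posterior after a Bernoulli observation is again a Beta,
        # so it can serve directly as the prior for the next observation.
        alpha += x
        beta += 1 - x

    print("posterior mean:", alpha / (alpha + beta))  # Beta(7, 3) -> 0.7
    # Identical to a single batch update on all eight observations, which is
    # exactly why the recursion is lossless here.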
This is Bayes the wrong way around. The last part should be P(A)/P(B). I have no hand in the dispute (reading here for the first time about it), but not getting the basics right is not very convincing.
It's not "not getting the basics right", it's a simple typo. A thing like that shouldn't invalidate a whole article (unless you have skin in the game for the opponent argument).
Why does HN have a pattern of dismissing whole articles due to simple typos? It's as if we're so habituated to skim and do tldr-reading that our brain is working overdrive to find the slightest excuse not to have to do any type of reading beyond surface level.
Critical and technical literature needs to be held to a standard.
When the rhetor introduces errors in the artifact, the rhetor's ethos with the audience is diminished.
The more fundamental the error (getting a basic equation wrong I guess?) the more trust you lose with a knowledgeable audience. If the author doesn't see that a fundamental issue was introduced, they may not have been expert enough to not introduce additional errors; the reader must spend more time double-checking the components of the argument rather than thinking about the argument itself.
If someone comes to this article as a novice in the topic and stores the error as a fact, they may end up at least confused when approaching it again in the future. HN tends to have an audience representing deep knowledge in many fields, who end up providing a thorough and varied set of quality filters. These quality filters are also really helpful to the novice who may otherwise miss the typo.
> Critical and technical literature needs to be held to a standard.
That's a sloppy statement. You haven't defined what standard. Clearly, everything is held to "a standard", making that an entirely empty claim.
> When the rhetor introduces errors in the artifact, the rhetor's ethos with the audience is diminished.
A "rhetor" is a teacher of rhetoric. This is not the correct word in this case. In this case, the correct word is the more general "author", since the post was not teaching rhetoric. Further, the entire point of "ethos" in rhetoric is that we shouldn't be so lazy as to allow minor issues cloud our judgement.
> The more fundamental the error (getting a basic equation wrong I guess?) the more trust you lose with a knowledgeable audience.
Quite the opposite. A knowledgeable audience can decide whether to trust something based on the actual content, rather than minor surface issues. Only a lazy or uninformed audience need get distracted by typos.
> That's a sloppy statement. You haven't defined what standard.
Actually, I think I established a basis for discussion, then later illustrated this basis with the way HN has grown and tends to enforce its standard.
> A "rhetor" is a teacher of rhetoric. This is not the correct word in this case.
A rhetor is a person practicing rhetoric, or a person delivering persuasive or effective communication. Teachers are indeed a subset of that, but also public speakers, negotiators, and e.g. authors who write to influence their audiences' understanding or perception.
I think you may also misunderstand the function of ethos in rhetoric.
> A knowledgeable audience can decide whether to trust something based on the actual content.
Perhaps we're in agreement? If the content reflects clear understanding it can improve the efficacy of a rhetorical artifact, while sloppiness can reduce its persuasiveness.
Not really, it's one of the most common mistakes I see in Bayesian calculations. If the author were basing a lengthy series of calculations on that first step, it would be worse (but in this case the expression is quickly replaced by a corrected version for the classification discussion).
I have a similar experience, but precisely because of this I am very careful to check the formula each time I have to type it. Mistakes in formulas are not "just typos"; they are a very annoying and potentially harmful kind of typo, and we must take great care to avoid them.
Like I said above, "typos" in sequences of calculations are obviously problematic and (almost always) lead to mistakes in the final result. In this case that's not applicable, since there's no "second arithmetic manipulation" following the typo'd one. (The author replaces the incorrect equation with the correct Bayesian one in the next section.)
> unless you have skin in the game for the opponent argument
I stated explicitly otherwise (I'm not even in research), something you certainly couldn't have missed. I'm not quite sure whether your allegation backfires on you.
Especially in the heat of the moment, people make errors. But that is exactly when they should avoid making themselves an easy target. The author's site also has some problems rendering math (some of it renders, some still shows dollar signs), and this adds to a first impression of sloppiness. It is his decision to go public, but then he has to take the consequences.
Author here. Sorry that the typo and the render errors affected you so much. We don't see any rendering issues on our end; if you tell us what browser you are using, maybe we can replicate and fix them.
> Sorry that the typo and the render errors affected you so much.
It seems to concern some commentators here to a much greater extent, judging from the whole downvoting dance. I'm old-school; I received my master's in mathematics more than 25 years ago. Being in stochastics, I simply spotted an error and also some dispute, the latter from the context in the article. I want to mention that in the past people considered behavior like mine helpful; rigour was a value, and especially when under fire, people were expected to try even harder.
> what browser you are using maybe we can replicate and fix them
FF 72.0.1 on Win 10 here. If it helps, in the following sequence:
--- snip ---
Bayes’s Rule simply says that for any two non-independent random variables $A$ and $B$, seeing that $B$ took a specific value $b$ changes the distribution of the random variable $A$. In standard lingo, the term Pr(A=a) is called the prior, Pr(B=b∣A=a) is the likelihood, and Pr(A=a∣B=b) is the posterior.
---snap---
all the single capitals (A, B, ...) appear enclosed in dollar signs, while all the Pr(...) expressions render correctly (even after removing uBlock/noScript restrictions)
Thanks! We just fixed an issue. I was able to test it on FF for MacOS, but it would be very helpful if you could confirm the problem is fixed on your end.
Also, I'll admit that messing up Bayes rule in a blog post criticizing Bayesian Neural Networks is pretty comical. Should have taken the time to proofread the whole thing.
Author here. The blog post contains a clear explanation of Bayesian Inference and provides careful arguments about its potential benefits and limitations. Further, Bayes rule was typed correctly when it was being used to show that the posterior q(f^*|D) \approx q(f_\theta|D).
We understand Bayes rule... we just had a typo, and when proofreading the article we didn't check the first equation because: who would mess up Bayes rule?
Is the posterior really an update of our prior? That doesn't make any sense to me. It's P(A|B). I can't then use it as P(A) in another inference based on different observations. What am I missing about your description of Bayesian inference?
Imagine you have two random variables A and B, which are 0 with prob 0.5 and 1 with prob 0.5. They just have the property that when A=1 then B is always 0 and vice versa. Thus, when you have seen the value of B, that clearly has changed the distribution of A. You should read P(A) as: the distribution of A when I know nothing about the world. And P(A|B=0) as: the distribution of A when I know that B took on value 0.
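For what it's worth, here is the same toy example worked numerically in Python (the joint table is just the one implied by "when A=1 then B is always 0 and vice versa"):

    # Joint distribution P(A=a, B=b) implied by the description above.
    joint = {(0, 1): 0.5, (1, 0): 0.5, (0, 0): 0.0, (1, 1): 0.0}

    p_A1 = sum(p for (a, b), p in joint.items() if a == 1)   # prior P(A=1) = 0.5
    p_B0 = sum(p for (a, b), p in joint.items() if b == 0)   # P(B=0) = 0.5
    print(p_A1, joint[(1, 0)] / p_B0)                        # P(A=1) = 0.5, P(A=1 | B=0) = 1.0

Seeing B=0 moves the distribution of A from (0.5, 0.5) to certainty, which is all that "updating the prior into a posterior" means here.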
- some simple 1d plots to illustrate your points would be very illuminating; without them, i'm not very convinced by your arguments
- 'generalization-agnostic' is very frustrating as a term, i'm sure you can think of something clearer
- i'm not sure that your argument 'a BNN is only as good as its prior' is any better than 'a NN is only as good as its initialisation'; yes, any NN model is for sure a victim of local optima, but for most practitioners this is good enough