The critique is about the importance of priors in BNNs. In my humble understanding of Bayesian reasoning, the argument for defending any prior is that with enough data the learning method converges to the real distribution; so if the result of a learning method depends heavily on some prior assumption, then that assumption is crucial and can in no way be chosen arbitrarily. On the other hand, it is well known that deep learning can learn any random model, so in the end I think all of this is about the bias-variance trade-off. If your prior has an infinite number of adjustable parameters (zero bias), then the variance becomes infinite (your result depends entirely on, and becomes equal to, the training set).
So in practice one should choose the prior with the minimum number of parameters (bias) that shows good learning performance on the available training set.
Anyway, trying to measure how well a prior generalizes in BNNs seems to be another way of thinking about bias-variance; if there is more to it than this, I would like to know.
About the work you cite, I think double descent happens simply because the extra parameters (low bias) produce a large variance when the amount of input data is small, but as more data is introduced the extra parameters stop playing any role, i.e. they are pruned. So the high bias is relative to the quantity of available information (training data). The second descent starts when the extra parameters are pruned: in practice their coefficients tend to zero, as the system learns that those coefficients play no role.
Lack of good priors is definitely a weak spot for BNNs. I also like the concept of Generalization-Agnostic Priors. But...
> So viewed through this lens, BNNs with arbitrary priors are nothing more than an architectural decision. A BNN is just a neural network that maps its input to a distribution over outputs; the prior is just a hyperparameter of the model. Just making the network Bayesian bought us nothing. It will only be helpful if we find a good prior, and validate that we are actually doing accurate inference. If you personally believe that exploring this space of priors (similar to exploring the space of architectures or hyperparameters) is particularly promising, then that is a good reason to keep working on BNNs.
Isn't this overstating it a bit... Making the network Bayesian (even if the priors are generalization-agnostic) still bought us something: after training we have a posterior distribution over weights that contains both weights that generalize and weights that don't generalize. That's better than being unlucky and only having weights that don't generalize at the end of training. I'm not saying this advantage is easy to achieve in practice, or benefit from. But it's not nothing.
Also Bayesian inference as a framework has a distinct advantage over the usual aimless architectural twiddling that so much "research" seems to focus on: it's intellectually/mathematically sound.
Agree. Throwing prior weights into hyperparameter sets is not helpful. If each hyperparameter or architecture decision were rigorously tied to an interpretable event or process, we'd have much better explainability and traceability. It's a step in the right direction to try.
The paragraph beginning with "Let’s consider how we might apply the Bayesian framework..." where he introduces the notation is a great example of everything I hate about mathematical notation. We have big-F, small-f, f-of-x, f-sub-x, f-star, big-F-star... and then he decides to abbreviate what he just introduced. If I didn't know what's happening and I was trying to understand this for the first time, I would have no chance and would just give up right there.
The reason those are all "f"s is that they are all versions of the same thing: the function mapping features to outputs, or approximations of it. The capital "F"s refer to random variables/processes describing the same function (using capitals for RVs and lower-case for samples is standard practice in statistics).
By using this notation he is drawing careful distinctions between the various approximations he's using. I think it's pretty good writing.
It's great that it's consistent. My problem is that the notation only makes sense if you already understand the very thing that he's trying to explain with this notation.
That's not entirely true. The point being made is about the consequences of the design being set up with that notation. That design and that notation is reasonably general. It requires some familiarity with notation around mathematical statistics, modeling, Bayesian formalisms, and random variables.
The thing he's trying to explain is how those things interact and what their behavior is.
Hey, author here. The writing on this section was a bit tricky to get right, but we did our best to keep it as clean as possible while still being precise about the concepts we were considering. And it's definitely not perfect; I've just made some small edits to hopefully make things a bit more clear.
This blog post is a response to ongoing discussion with the Bayesian community, so it was primarily aimed at a more technical audience. If you have any suggestions for how to make the writing more accessible, without becoming so overly expository that the mathematically-robust folk lose interest, I would love to discuss them.
So here’s one thing about mathematical notation. If it uses similar symbols/scripts/subscripts that means that the objects are related, up to the minor difference expressed. So choosing “f” for everything is often a deliberate and well-motivated decision (Very much like naming variables). Doing that well is an art form.
Unfortunately, yes, it does take some time to get used to it (both reading, and generating such names), but IMHO it’s far better notation than otherwise.
The title of the article is at odds with the purely speculative nature of the core part of the argument--the section titled "Are Current BNNs Generalization-Agnostic?". The authors admit themselves that it is pure speculation. They present a speculative class of generalization-agnostic priors, speculate that commonly used priors might belong to this class, and then speculate about why we haven't observed commonly used priors belonging to this class even though they speculatively might belong to this class. This argument is not very convincing, to say the least.
Generalizability is a function of both the network structure and the priors. Non-Bayesian networks without priors can be seen as pseudo-Bayesian networks with (improper) flat priors. The use of independent priors that are unimodal at zero, like the independent normal priors mentioned in the article, will tend to shrink weights toward zero, which makes the function smoother. Smoother functions tend to generalize better (measuring generalizability as the gap between in-sample and out-of-sample error). The classic bias-variance trade-off is that they will generally perform worse in-sample and may also perform worse out-of-sample. Heuristically, if you take any neural network and make it Bayesian by putting priors on the weights, it will (averaging over datasets) generalize no worse than the original network. It may have worse absolute performance.
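To make the shrinkage point concrete, here is a minimal Python sketch (the sizes, seed, and noise level are made up) of the textbook correspondence: MAP estimation under independent Normal(0, prior_var) priors on the weights of a linear model is just ridge regression, i.e. L2 weight decay, which is one precise sense in which such priors pull weights toward zero.

    import numpy as np

    # Toy regression problem; the sizes, seed, and noise are arbitrary illustrations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(15, 10))
    y = X @ rng.normal(size=10) + rng.normal(size=15)

    noise_var = 1.0   # assumed observation-noise variance
    prior_var = 1.0   # variance of the independent Normal(0, prior_var) prior on each weight

    # MAP under the Gaussian prior minimizes
    #   ||y - Xw||^2 / (2 * noise_var) + ||w||^2 / (2 * prior_var),
    # which is ridge regression with lambda = noise_var / prior_var.
    lam = noise_var / prior_var
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

    # Maximum likelihood (i.e. a flat prior): plain least squares.
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

    print(np.linalg.norm(w_map), np.linalg.norm(w_mle))  # the MAP weights have the smaller norm

The same trade-off described above shows up directly: a tighter prior (larger lambda) means more bias and smaller weights; a flatter prior lets the fit track the training data more closely.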
The authors also argue that we can't be confident in the uncertainties from BNNs until we have a good theoretical understanding of NN generalization in general. They ignore the possibility of building confidence in the uncertainties, both in general and for any particular network, by empirical observation--the same way we've built confidence in generalization of non-Bayesian NNs, since, as the article points out, we don't yet have a good theoretical understanding of NN generalization.
Frankly, just solve an important problem with BNNs in a meaningful way, with advantages that only BNNs bring (e.g. uncertainty estimation), and you won’t have to write blog posts defending them.
Neural nets have massive "capacity" which means that, in the face of finite data sets, they can both (a) reasonably represent generalizable and non-generalizable functions and (b) can take on priors which do not distinguish between those classes of functions. The upshot is that after training, the posterior weight of robust/generalizable models will equal that of fragile/non-generalizable ones.
We need to believe that the priors that we actually use don't have that property if we're to believe in the posteriors produced by BNNs. Should we?
Today, priors in networks arise mostly out of network topology since initialization methods are somewhat constrained by practicalities in training. The article criticizes those who would assert that network topology (+ initialization) leads to a reasonable prior in the space of effective input -> output functions as realized by the network.
To put that in different terms, you might imagine an argument saying that network topologies are biologically inspired and thus represent a decent approximation of the space of "achievable" implementations of functions in the given task. But does an argument like this say anything about the generalization capability of functions favored by this prior? You might characterize this as "Easy" versus "Correct".
I'm not trying to actually represent argumentation that neural network topologies actually are reasonable in shaping "uninformative" Gaussian priors in the weight space into "uninformative" and "generalizable" priors in the function space. There may be some really good arguments out there. But, if we're going to understand NNs as reasonable Bayesian processes, then that question needs to be interrogated.
> We should ask, “what evidence are you providing that your priors are any good?”
This is valid. Anyone pursuing a Bayesian approach should be asking themselves this question about every prior they use. To fully benefit from a Bayesian framework, one needs to construct models with understandable parameters for which there is some sound theoretical or practical insight that can be embedded with priors and that is not well-represented by the training data. Doing this can help your solution avoid the kind of wildly unpredictable and costly mistakes you might get if you used a completely black-box approach. For critical applications, this can be highly useful. If you can't come up with priors that are clearly beneficial, then you are likely better off using a non-Bayesian approach.
EDIT: I misread the quote below: it applies to a distribution over functions, not examples. My bad, and thanks to one of the authors of the post for politely correcting me in the reply below.
>> But there is one core problem with the Bayesian framework. In practice, we never have access to the prior distribution Pr(f)! Who could ever claim to know the real-world distribution of functions that solve classification tasks? Not us, and certainly not Bayesians. Instead, BNNs simply choose an arbitrary prior distribution q(f) over functions, and Bayesian inference is performed to compute q(f∣D). The question of whether q(f) is close to the true distribution Pr(f) is swept under the rug.
This is true but it's also nothing new: it's the standard PAC-Learning assumption that the examples (the dataset) are drawn from the same distribution as the target theory (the real-world distribution). This assumption, and the complete impossibility of verifying it in practice, is not unique to Bayesian Neural Nets. It is true for _every_ machine learning algorithm.
And this is certainly no surprise for machine learning researchers (or, if it is, it is really concerning that it is). So the done thing in machine learning research is to demonstrate that, under PAC-Learning assumptions, a certain technique or algorithm _can_ correctly identify a hypothesis that approximates a "true" function to within some amount of error.
I mean to say, when people publish papers reporting a new SOTA on such-and-such dataset, they are not really claiming that their technique somehow finds the "true" distribution of the real-world process that generated the data in their dataset. They're claiming "we can correctly classify instances in this dataset and if PAC-Learning assumptions hold, this technique should also work in real-world data from the same domain".
Of course this is often left implicit, and the article makes me wonder to what extent this is because researchers tend to forget or even :gasp: ignore it completely. A disturbing thought.
Author. There is a misunderstanding in your argument. Your point is about the dataset being sampled from the true distribution. We are happy with that assumption (it's orthogonal to our point).
The problem we have is that to apply Bayes rule you NEED a prior distribution over the correct functions, applied to the points in the dataset and to the points outside the dataset. In other words, it's one thing to assume that the dataset is representative of the (unknown) classification task; it's another to assume that you know what the distribution over classification tasks is.
Hi. Yes, I see: I misunderstood this. My apologies for the hasty reading of your post.
But, in that case, there does exist a very good generalisation prior on function space that is well known and well understood: the simplest hypothesis (e.g. the one with the smallest minimum description length) is always better, because it results in a reduction of the hypothesis search space, with a corresponding reduction in the error on unseen data while keeping the number of examples constant (the standard finite-class bound, sketched after the quote below, makes this precise).
> We show that a polynomial learning algorithm, as defined by ["A theory of the learnable", Valiant 1984], is obtained whenever there exists a polynomial-time method of producing, for any sequence of observations, a nearly minimum hypothesis that is consistent with these observations.
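For reference, the parenthetical argument above is essentially the standard finite-hypothesis-class PAC bound (written here in LaTeX; H is the hypothesis space, m the number of examples, \epsilon the error on unseen data, \delta the failure probability):

    \Pr\big[\,\exists\, h \in H :\ \hat{\mathrm{err}}_D(h) = 0 \;\wedge\; \mathrm{err}(h) > \epsilon \,\big]
        \;\le\; |H|\,(1 - \epsilon)^m \;\le\; |H|\, e^{-\epsilon m} \;\le\; \delta
        \quad \text{whenever} \quad m \ge \tfrac{1}{\epsilon}\big(\ln|H| + \ln\tfrac{1}{\delta}\big).

Shrinking the hypothesis space (e.g. by preferring hypotheses with shorter descriptions) shrinks the bound at a fixed number of examples, which is the sense in which "prefer the simplest consistent hypothesis" buys generalization.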
I am going to be obtuse and say that since Bayes theorem is an uncontroversial rule about conditional probability, if you interpret your NN as a probability distribution over outputs that is updated by data, it can always be interpreted as "Bayesian", and this can be a helpful way to examine what your implicit priors are (i.e. via architecture or regularization terms) to see if they are reasonable for the problem at hand. It is no surprise that weak uninformative priors are sort of useless. Explicit priors shine when you know something about the actual problem (say some moments or some invariances in the problem setup).
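As a tiny illustration of that last point, here is a hedged Python sketch (the numbers are made up) of the conjugate Normal-Normal case: with only a couple of observations, a prior that encodes real knowledge about the problem ("the mean is near 5") yields a noticeably tighter posterior than a vague one, while with enough data both converge to the same answer.

    import numpy as np

    def posterior(prior_mean, prior_var, data, noise_var):
        # Conjugate update for the mean of a Normal with known noise variance:
        # precisions add, and the posterior mean is a precision-weighted average.
        n = len(data)
        post_var = 1.0 / (1.0 / prior_var + n / noise_var)
        post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
        return post_mean, post_var

    data = np.array([4.8, 5.3])  # just two hypothetical observations
    print(posterior(0.0, 100.0, data, noise_var=1.0))  # vague prior: posterior variance ~0.50
    print(posterior(5.0, 0.5, data, noise_var=1.0))    # informative prior: posterior variance 0.25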
Shouldn’t the priors be updated and improved each step to be closer to a good prior, and that’s why inaccurate priors may be acceptable? (Do BNNs not iteratively update the (next step’s) prior with the previous step’s posterior?) I haven’t worked on BNNs, but since Bayesians are always talking about updating their priors I thought this would be the case.
You are describing a recursive Bayesian approach, which can have significant computational and storage advantages for filtering (for example, Kalman filters). For this to work well, the prior must be able to adequately represent the learning of the posterior, which may be practical with a self-conjugate prior or a Monte Carlo approximation such as what particle filters use. In practice, for nontrivial machine learning applications, self-conjugate distributions rarely model the problem well and good approximations of the posterior into a concise prior are rarely practical.
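A minimal Python sketch of the recursion described above, in the one-parameter conjugate case where it works exactly (Beta-Bernoulli); the observation stream is made up. The posterior after each observation is literally reused as the next step's prior, which is what has no exact analogue for BNN posteriors and forces the approximations mentioned above.

    # Beta(1, 1) prior on an unknown success probability.
    alpha, beta = 1.0, 1.0

    stream = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical binary observations arriving one at a time
    for x in stream:
        # Conjugacy: the Beta posterior after a Bernoulli observation is again a Beta,
        # so it can serve directly as the prior for the next observation.
        alpha += x
        beta += 1 - x

    print("posterior mean:", alpha / (alpha + beta))  # Beta(7, 3) -> 0.7
    # Identical to a single batch update on all eight observations, which is
    # exactly why the recursion is lossless here.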
This is Bayes the wrong way around. The last part should be P(A)/P(B). I have no hand in the dispute (reading here for the first time about it), but not getting the basics right is not very convincing.
It's not "not getting the basics right", it's a simple typo. A thing like that shouldn't invalidate a whole article (unless you have skin in the game for the opponent argument).
Why does HN have a pattern of dismissing whole articles due to simple typos? It's as if we're so habituated to skim and do tldr-reading that our brain is working overdrive to find the slightest excuse not to have to do any type of reading beyond surface level.
Critical and technical literature needs to be held to a standard.
When the rhetor introduces errors in the artifact, the rhetor's ethos with the audience is diminished.
The more fundamental the error (getting a basic equation wrong I guess?) the more trust you lose with a knowledgeable audience. If the author doesn't see that a fundamental issue was introduced, they may not have been expert enough to not introduce additional errors; the reader must spend more time double-checking the components of the argument rather than thinking about the argument itself.
If someone comes to this article as a novice in the topic and stores the error as a fact, they may end up at least confused when approaching it again in the future. HN tends to have an audience representing deep knowledge in many fields, who end up providing a thorough and varied set of quality filters. These quality filters are also really helpful to the novice who may otherwise miss the typo.
> Critical and technical literature needs to be held to a standard.
That's a sloppy statement. You haven't defined what standard. Clearly, everything is held to "a standard", making that an entirely empty claim.
> When the rhetor introduces errors in the artifact, the rhetor's ethos with the audience is diminished.
A "rhetor" is a teacher of rhetoric. This is not the correct word in this case. In this case, the correct word is the more general "author", since the post was not teaching rhetoric. Further, the entire point of "ethos" in rhetoric is that we shouldn't be so lazy as to allow minor issues cloud our judgement.
> The more fundamental the error (getting a basic equation wrong I guess?) the more trust you lose with a knowledgeable audience.
Quite the opposite. A knowledgeable audience can decide whether to trust something based on the actual content, rather than minor surface issues. Only a lazy or uninformed audience need get distracted by typos.
> That's a sloppy statement. You haven't defined what standard.
Actually, I think I established a basis for discussion, then later illustrated this basis with the way HN has grown and tends to enforce its standard.
> A "rhetor" is a teacher of rhetoric. This is not the correct word in this case.
A rhetor is a person practicing rhetoric, or a person delivering persuasive or effective communication. Teachers are indeed a subset of that, but also public speakers, negotiators, and e.g. authors who write to influence their audiences' understanding or perception.
I think you may also misunderstand the function of ethos in rhetoric.
> A knowledgeable audience can decide whether to trust something based on the actual content.
Perhaps we're in agreement? If the content reflects clear understanding it can improve the efficacy of a rhetorical artifact, while sloppiness can reduce its persuasiveness.
Not really, it's one of the most common mistakes I see in Bayesian calculations. If the author were basing a lengthy series of calculations on that first step, it would be worse (but in this case the expression is quickly replaced by a corrected version for the classification discussion).
I have a similar experience, but precisely because of this I am very careful to check the formula each time I have to type it. Mistakes in formulas are not "just typos"; they are a very annoying and potentially harmful kind of typo, and we must take great care to avoid them.
Like I said above, "typos" in sequences of calculations are obviously problematic and (almost always) lead to mistakes in the final result. In this case that's not applicable, since there's no "second arithmetic manipulation" following the typo'd one. (The author replaces the incorrect equation with the correct Bayesian one in the next section.)
> unless you have skin in the game for the opponent argument
I stated explicitly otherwise (I'm not even in research), something you certainly couldn't have missed. I'm not quite sure whether your allegation backfires on you.
Especially in the heat of the moment, people make errors. But that is exactly when they should avoid making themselves an easy target. The author's site also has some problems rendering math (some of it renders, some still shows dollar signs), and this adds to a first impression of sloppiness. It is his decision to go public, but then he has to take the consequences.
Author here. Sorry that the typo and the render errors affected you so much. We don't see any rendering issues on our end; if you tell us what browser you are using, maybe we can replicate and fix them.
> Sorry that the typo and the render errors affected you so much.
It seems to concern some commentators here to a much greater extent, judging from the whole downvoting dance. I'm old-school; I received my master's in mathematics more than 25 years ago. Being in stochastics, I simply spotted an error and also some dispute, the latter from the context in the article. I want to mention that in the past people considered behavior like mine helpful; rigour was a value, and especially when under fire, people were expected to try even harder.
> what browser you are using maybe we can replicate and fix them
FF 72.0.1 on Win 10 here. If it helps, in the following sequence:
--- snip ---
Bayes’s Rule simply says that for any two non-independent random variables $A$ and $B$, seeing that $B$ took a specific value $b$ changes the distribution of the random variable $A$. In standard lingo, the term Pr(A=a) is called the prior, Pr(B=b∣A=a) is the likelihood, and Pr(A=a∣B=b) is the posterior.
---snap---
all the single capitals (A, B, ...) appear enclosed in dollar signs, while all the Pr(...) expressions render correctly (even after removing uBlock/noScript restrictions)
Thanks! We just fixed an issue. I was able to test it on FF for MacOS, but it would be very helpful if you could confirm the problem is fixed on your end.
Also, I'll admit that messing up Bayes rule in a blog post criticizing Bayesian Neural Networks is pretty comical. Should have taken the time to proofread the whole thing.
Author here. The blog post contains a clear explanation of Bayesian Inference and provides careful arguments about its potential benefits and limitations. Further, Bayes rule was typed correctly when it was being used to show that the posterior q(f^*|D) \approx q(f_\theta|D).
We understand Bayes rule... we just had a typo, and when proofreading the article we didn't check the first equation because: who would mess up Bayes rule?
Is the posterior really an update of our prior? That doesn't make any sense to me. It's P(A|B). I can't then use it as P(A) in another inference based on different observations. What am I missing about your description of Bayesian inference?
Imagine you have two random variables A and B, which are 0 with prob 0.5 and 1 with prob 0.5. They just have the property that when A=1 then B is always 0 and vice versa. Thus, when you have seen the value of B, that clearly has changed the distribution of A. You should read P(A) as: the distribution of A when I know nothing about the world. And P(A|B=0) as: the distribution of A when I know that B took on value 0.
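For what it's worth, here is the same toy example worked numerically in Python (the joint table is just the one implied by "when A=1 then B is always 0 and vice versa"):

    # Joint distribution P(A=a, B=b) implied by the description above.
    joint = {(0, 1): 0.5, (1, 0): 0.5, (0, 0): 0.0, (1, 1): 0.0}

    p_A1 = sum(p for (a, b), p in joint.items() if a == 1)   # prior P(A=1) = 0.5
    p_B0 = sum(p for (a, b), p in joint.items() if b == 0)   # P(B=0) = 0.5
    print(p_A1, joint[(1, 0)] / p_B0)                        # P(A=1) = 0.5, P(A=1 | B=0) = 1.0

Seeing B=0 moves the distribution of A from (0.5, 0.5) to certainty, which is all that "updating the prior into a posterior" means here.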
- some simple 1d plots to illustrate your points would be very illuminating; without them, i'm not very convinced by your arguments
- 'generalization-agnostic' is very frustrating as a term, i'm sure you can think of something clearer
- i'm not sure that your argument 'a BNN is only as good as its prior' is any better than 'a NN is only as good as its initialisation'; yes, any NN model is for sure a victim of local optima, but for most practitioners this is good enough