
Making a neural network say “I Don’t Know”: Bayesian NNs using Pyro and PyTorch - paraschopra
https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd
======
chrisfosterelli
Uh, the lead of the article claims it achieves "97% accuracy on MNIST". If you
are removing 12% of the MNIST images at your neural network's discretion, you
can't claim the 96% accuracy on the remaining set as your overall accuracy!
You're essentially self-selecting for test data the classifier is confident
in, so of course that accuracy will be high.

If this was allowed then I'd make an ImageNet classifier which just discards
all images except for one I'm really good at classifying, then claim I have
100% accuracy on ImageNet.
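
To put numbers on the distinction (hypothetical figures, just to illustrate
the coverage point):

    n_test = 10000
    abstain_rate = 0.12
    selective_acc = 0.97  # accuracy on the images the model agrees to classify

    covered = int(n_test * (1 - abstain_rate))  # 8800 images answered
    correct = int(covered * selective_acc)
    print(correct / covered)  # ~0.97: accuracy at 88% coverage
    print(correct / n_test)   # ~0.85: fraction of the full test set answered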

~~~
paraschopra
My main point is not about aggregate accuracy; it's about comparing confidence
across classes of datasets.

Edit: the entire point of the Bayesian approach is that you can make decisions
using a loss function that trades off the (business) cost of making a wrong
decision against the (business) regret of not making a decision. For
real-world applications, accuracy is a useless metric. What matters is the net
(business) benefit or loss.

~~~
chrisfosterelli
Your article bolds the following in the introductory paragraph:

> My final classifier had accuracy of ~97% on MNIST and it refused to classify
> white noise and the majority of unrelated (non-MNIST) images.

I'm saying that you are making a misleading claim when you say this. It is
categorically untrue that you achieve 97% on MNIST with this approach.
Accuracy on a standardized test set makes no sense when you select a subset of
the test set for maximal accuracy. I'm not saying that your approach doesn't
have value or make sense in some business cases; I think the overall idea is
interesting. But these statements are not correct.

~~~
pacala
Correct. That's why people use
[https://en.wikipedia.org/wiki/F1_score](https://en.wikipedia.org/wiki/F1_score)
instead of just accuracy.
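
For reference, F1 is the harmonic mean of precision and recall; a minimal
sketch with made-up counts:

    def f1_score(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # abstaining on hard inputs keeps precision high while recall drops,
    # and F1 penalizes that drop
    print(f1_score(tp=880, fp=30, fn=120))  # ~0.92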

------
Fragoel2
I have limited knowledge of machine learning so forgive me if I am wrong. I
recall that some classification techniques provide you a confidence score of
the classification... so can't you just treat a classification score lower
than a certain threshold as "I don't know"?

~~~
warent
This question was answered in the comments section of the article.

Basically, softmax will only give you values close to 1 or close to 0, so
there's no room for your own interpretation or threshold.

[https://medium.com/@paraschopra/softmax-amplifies-small-differences-so-probabilities-that-come-out-of-it-get-skewed-towards-either-cf1c8c50fab9](https://medium.com/@paraschopra/softmax-amplifies-small-differences-so-probabilities-that-come-out-of-it-get-skewed-towards-either-cf1c8c50fab9)
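
A quick numeric sketch of that amplification (logits made up for
illustration):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()

    print(softmax(np.array([1.0, 2.0, 3.0])))     # ~[0.09, 0.24, 0.67]
    print(softmax(np.array([10.0, 20.0, 30.0])))  # ~[2e-9, 4.5e-5, 1.0]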

~~~
anonymousDan
Do you need to do softmax? Why not just look at the underlying amplitudes of
the output before applying softmax? Or is it that they won't be normalized?

~~~
cultus
Yes, they wouldn't be normalized, so you can't interpret them as
probabilities. Softmax is the function that normalizes them. However, even the
softmax output can't be interpreted as the probability of some outcome,
because it does not take the prior into account.

For that, you need actual Bayesian techniques.

~~~
anonymousDan
Can you elaborate on what you mean by not taking into account the prior? What
is the output of softmax if not a probability?

~~~
cultus
Well, it's the probability of the output _given_ the fixed, fitted model
parameters. This is the likelihood. It is not the probability of getting that
output, since we don't know what the parameters should really be. See the
inverse probability fallacy [0]. This confusion is literally the main reason
behind the replication crisis in scientific research.

In ML, you typically have a model f(w;u), and you are trying to fit the
parameters w to some data, given some hyperparameters u that represent some
kind of prior information. In frequentist techniques, these could be
regularization parameters.

In frequentist techniques, the hyperparameters and the fitted parameters are
assumed fixed, not random. In contrast, Bayesian methods treat them as random
as well. This extra uncertainty is neglected by the predictions of frequentist
techniques.

Nonetheless, the uncertainty is real.

[0]
[https://en.wikipedia.org/wiki/Confusion_of_the_inverse](https://en.wikipedia.org/wiki/Confusion_of_the_inverse)
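
A toy sketch of the gap between the two (the model, point estimate, and
posterior samples are all made up; in practice the samples would come from an
actual inference procedure):

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(w, x):
        # class probabilities p(y | x, w) for a toy linear model + softmax
        logits = w @ x
        e = np.exp(logits - logits.max())
        return e / e.sum()

    x = np.array([2.0, -1.0])
    w_hat = np.array([[1.5, 0.0], [0.0, 1.0], [-1.0, 0.5]])  # point estimate

    # pretend these are draws w ~ p(w | data) from some posterior
    posterior_w = w_hat + rng.normal(0.0, 0.8, size=(2000, 3, 2))

    p_plugin = predict(w_hat, x)  # likelihood at the fitted parameters
    p_predictive = np.mean([predict(w, x) for w in posterior_w], axis=0)

    print(p_plugin)      # sharper: ignores parameter uncertainty
    print(p_predictive)  # flatter: averages over p(w | data)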

~~~
anonymousDan
Thanks!

------
jl2718
In Christopher Bishop's PRML book, he shows a procedure (the Laplace
approximation) for approximating the posterior of a neural net using estimates
of the Hessian evaluated at the point of inference.

Any idea how sampling compares in performance and computational efficiency?

~~~
program_whiz
SVI still uses sampling under the hood. Any probabilistic programming
framework (such as Pyro) uses random sampling; there's no other way to
estimate the shape of a complex set of interacting probability distributions.
Obviously there are some known optimizations for simple distributions (e.g. we
know the expected values of simple Gaussians in closed form, and can evaluate
the PDF or CDF at a given sample).
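
A minimal Pyro sketch of that Monte Carlo nature (toy model, not the
article's network; each step samples from the guide to estimate the ELBO
gradient):

    import torch
    import pyro
    import pyro.distributions as dist
    from pyro.distributions import constraints
    from pyro.infer import SVI, Trace_ELBO
    from pyro.optim import Adam

    data = torch.randn(100) + 3.0  # toy observations with unknown mean

    def model(data):
        mu = pyro.sample("mu", dist.Normal(0.0, 10.0))  # prior on the mean
        with pyro.plate("data", len(data)):
            pyro.sample("obs", dist.Normal(mu, 1.0), obs=data)

    def guide(data):
        loc = pyro.param("loc", torch.tensor(0.0))
        scale = pyro.param("scale", torch.tensor(1.0),
                           constraint=constraints.positive)
        pyro.sample("mu", dist.Normal(loc, scale))  # variational posterior

    svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
    for _ in range(1000):
        svi.step(data)  # draws guide samples to estimate the ELBO gradient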

------
tdgunes
For people who are interested, there is an upcoming NIPS paper that might be
relevant for this problem:

Paper: [http://arxiv.org/abs/1806.01768](http://arxiv.org/abs/1806.01768)

Demonstration:
[https://muratsensoy.github.io/uncertainty.html](https://muratsensoy.github.io/uncertainty.html)

------
sharemywin
I thought you were going to add a class called "I don't know".

~~~
svantana
The nice thing about using likelihood as the training criterion is that you
don't even need an extra category; you can just combine the outputs with basic
decision theory, like so:

    import numpy as np
    thres = wrong_cost / (wrong_cost + right_gain)
    best = int(np.argmax(out_probs))
    response = best if out_probs[best] > thres else "I don't know"
    

While Bayesian methods can theoretically perform better on out-of-set inputs,
it feels like an oversight not to include this simple method as a baseline. My
gut feeling is that it would perform equally well, if not better.

~~~
paraschopra
This works if you get nicely calibrated probabilities out of the neural
network.

Softmax, which is the popular way to transform network outputs into
probabilities, pushes probabilities towards either zero or one.

There may be other ways to transform outputs into probabilities where your
approach could work, but I'm not aware of them.

Also, to reply to the OP's comment: to have a class like 'I don't know' you
need a dataset of things that don't look like your training dataset, and
you'll always miss examples of things that don't belong to your input classes.
The Bayesian approach solves this in a nice way by not requiring this infinite
not-input training data.

------
mmm1111
the author of this article would do well to admit that he "doesn't know"
bayesian inference very well. unfortunately, the usage of pyro is incorrect
here. in particular, data subsampling is handled incorrectly (there may be
more issues). the weights of the network represent a global random variable.
consequently, when doing data subsampling, care must be taken that the ELBO
being constructed is scaled correctly. that is not done here. probabilistic
programming languages like pyro are a great tool for making bayesian inference
easier and more accessible, but they can produce arbitrarily bad results if
used incorrectly.

~~~
jlmaccal
Could you elaborate on how to do this correctly? This is one of the only
examples of using Pyro to build a Bayesian NN, and it would be really valuable
to have a correct implementation.
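
For what it's worth, the pattern Pyro itself provides for this is plate-based
subsampling; a hedged sketch (toy model, assuming the full dataset fits in
memory). Declaring the full size N lets Pyro rescale the minibatch
log-likelihood by N / batch_size, keeping the ELBO estimate correctly scaled:

    import torch
    import pyro
    import pyro.distributions as dist

    N, batch_size = 60000, 128

    def model(data):  # data holds all N observations
        # global latent variable, analogous to the network weights
        mu = pyro.sample("mu", dist.Normal(0.0, 1.0))
        # size=N with subsample_size makes Pyro scale the minibatch
        # likelihood by N / batch_size
        with pyro.plate("data", N, subsample_size=batch_size) as idx:
            pyro.sample("obs", dist.Normal(mu, 1.0), obs=data[idx])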

------
jameszhao00
How does this compare to Dropout based methods?
[http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa...](http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa532c1ce.html)

~~~
sjg007
My guess is you use dropout during training.
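
For reference, Gal's MC dropout keeps dropout active at test time as well and
averages over stochastic forward passes; a hedged PyTorch sketch (architecture
and sample count made up):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                          nn.Dropout(p=0.5), nn.Linear(256, 10))

    def mc_predict(model, x, n_samples=50):
        model.train()  # leaves dropout enabled during inference
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1)
                                 for _ in range(n_samples)])
        return probs.mean(0), probs.std(0)  # predictive mean and spread

    mean, spread = mc_predict(model, torch.randn(1, 784))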

------
sharemywin
found this online (might be useful):

Bayesian Approach: Pros and Cons

Pros:

Can calculate explicit probabilities for hypotheses.

Can be used for statistically-based learning.

Can in some cases outperform other learning methods.

Prior knowledge can be (incrementally) combined with newer knowledge to better
approximate perfect knowledge.

Can make probabilistic predictions.

Cons:

Require initial knowledge of many probabilities.

Can become computationally intractable.

[http://www.ru.is/faculty/thorisson/courses/v2008/gervigreind...](http://www.ru.is/faculty/thorisson/courses/v2008/gervigreind/Bayesian.html)

~~~
paraschopra
> Require initial knowledge of many probabilities.

This is not a really big issue. You can _always_ use uninformative priors,
which is simply the Bayesian way of saying "I don't know" what the probability
distribution looks like upfront. Then the data can help you infer the right
shape of probabilities.
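
For instance (a sketch using Pyro distributions, as in the article; the
numbers are arbitrary):

    import torch
    import pyro.distributions as dist

    # an informative prior commits to a narrow region; a wide, weakly
    # informative prior spreads belief thinly and lets the data dominate
    informative = dist.Normal(torch.tensor(0.0), torch.tensor(0.1))
    weakly_informative = dist.Normal(torch.tensor(0.0), torch.tensor(100.0))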

> Can become computationally intractable.

This is a bummer, but modern methods like variational inference and better
sampling techniques are making Bayesian analysis on large-scale data and
complex models tractable.

I linked a video in the article that I highly recommend watching: Variational
Bayes and Beyond: Bayesian Inference for Big Data
[https://www.youtube.com/watch?v=DYRK0-_K2UU](https://www.youtube.com/watch?v=DYRK0-_K2UU)

------
mlevental
sorry for posting twice in this thread but i figure it's a good place to ask:
what other methods are there for getting a nn to report "i don't know"?

~~~
JD557
My Master's thesis supervisor had a fun trick for this that can be used not
only on NNs, but also on other classifiers:
[https://arxiv.org/pdf/1011.3177.pdf](https://arxiv.org/pdf/1011.3177.pdf)

The gist is, you train a single classifier that will work as two classifiers
(with certain restrictions): one that says "false/not-false" and another that
says "true/not-true".

If the output of the classifier is "not-false" and "not-true", then you
consider that as a "I don't know".
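
A minimal sketch of that decision rule (names and threshold hypothetical, for
a single binary question):

    def decide(p_true, p_false, thres=0.5):
        # p_true: confidence the input IS the class
        # p_false: confidence the input is NOT the class
        if p_true > thres and p_false <= thres:
            return "true"
        if p_false > thres and p_true <= thres:
            return "false"
        return "I don't know"  # not-true and not-false (or both fire)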

~~~
yboris
Is this what you mean: for example, in MNIST, your model would output 20
probabilities (1-yes: 0.8, 1-no: 0.2, 2-yes: 0.1, 2-no: 0.9, etc.), where each
pair sums to 1?

------
mlevental
how effective would this be in deep nets with small data sets?

edit: does anyone know of any papers that talk about this technique formally?

~~~
paraschopra
Bayesian techniques are effective with small data sets if you build your
model carefully. For example, if you are able to encode the generating process
of your dataset, even a _single_ example could be used for effective training;
see
[http://web.mit.edu/jgross/Public/lake_etal_cogsci2011.pdf](http://web.mit.edu/jgross/Public/lake_etal_cogsci2011.pdf)

By providing a good generative model, you are shifting the effort from
collecting more data to carefully encoding your expert knowledge. In my view,
this is an exciting paradigm: it allows human knowledge (as the generating
process and priors) to be combined with data, and overridden (in the posterior
distributions) if enough data is accumulated against it.

------
rgrieselhuber
Socrates was on to something.

