
My Trouble with Bayes - another
http://themultidisciplinarian.com/2016/01/21/my-trouble-with-bayes/
======
wuch
If you consider subjective priors to be a problem, this can be addressed to
some degree using so-called "objective priors". They are objective in the
sense that if two people agree on the underlying principles of how priors
should be assigned, then they will arrive at the same priors. The catch is
that you must decide what principles to use, as the principles themselves are
not objective.

Updating multiple times on the same evidence can be bad -- it overstates the
evidence you have for some hypothesis -- but you could do much worse. Instead
of discovering that H implies E, suppose instead that you conditioned on H,
which, as it later turned out, is logically inconsistent. This is in general a
serious mistake regardless of whether what you are doing has the word
"frequentist" or "Bayes" attached to it, but the consequences are not
necessarily always the same. Larry Wasserman, in the chapter titled "Strengths
and Weaknesses of Bayesian Inference" of his "All of Statistics", has an
example concerned with estimating a normalizing constant. He compares the two
approaches: the frequentist one, which works just fine, and the Bayesian one,
which fails miserably. There is no additional commentary, so I have always
wondered whether he never realized that the derivation makes inconsistent
assumptions, or whether he realized it but intended to show that the
frequentist approach comes out just fine anyway. Ex falso quodlibet.

Regarding the raven paradox, the underlying reasoning and conclusions have
always appeared to me perfectly natural and reasonable. I think it is to the
great detriment of mathematics and statistics that people come up with catchy
names containing the word "paradox" for things that are merely unintuitive to
them. For example, Simpson's paradox is the simple observation that the
probability of an event is not a simple average of the event probability
across all groups; instead, those within-group event probabilities should also
be weighted by relative group sizes. What's paradoxical about that?
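To make the weighting point concrete, here is a small script using the well-known kidney-stone treatment counts, the standard textbook illustration of Simpson's paradox (any similar numbers would do; the point is only the arithmetic):

```python
# The classic kidney-stone treatment counts: (successes, trials) per
# treatment within each severity group.
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

# Within each group, treatment A has the higher success rate...
for g in groups:
    rate_a = groups[g]["A"][0] / groups[g]["A"][1]
    rate_b = groups[g]["B"][0] / groups[g]["B"][1]
    assert rate_a > rate_b

# ...yet pooled over groups, B comes out ahead, because the overall rate
# weights the within-group rates by group sizes, not as a simple average.
pooled = {}
for arm in ("A", "B"):
    s = sum(groups[g][arm][0] for g in groups)
    n = sum(groups[g][arm][1] for g in groups)
    pooled[arm] = s / n

print(pooled["A"], pooled["B"])  # 0.78 vs ~0.83
```

No paradox: treatment A was simply given to more of the severe cases, so the pooled numbers weight it toward the harder group.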

Regarding the negation of H not being a real hypothesis -- this is only true
if you claim that you somehow consider all alternative hypotheses. I don't
think people claim that. It seems to me that ~H is rather taken to represent
only those alternative hypotheses that are under consideration given your
modeling assumptions. Then it is perfectly fine and valid. I like how Jaynes
avoided this kind of misinterpretation by conditioning everything on the
background information and other assumptions used. Let all of those be
represented by B. Then you would talk about P(H|B) and P(~H|B), which makes it
clearer that you are not talking about all unknown unknowns.

------
dkbrk
I have yet to see any convincing criticism of Bayesian reasoning when
performed using priors derived from the Principle of Maximum Entropy [0]. By
incorporating all information available and nothing more, such a prior neither
makes unwarranted assumptions nor throws away information (as is very commonly
done with other methods). In principle the process of generating such a prior
demands absolutely no subjectivity; rather, it is the result of a logical
deduction from all information available. In practice some information may be
difficult to specify or incorporate; however, this is automatically accounted
for by the Principle of Maximum Entropy, since it guarantees that nothing
unspecified is assumed: being unable to incorporate some information will
merely result in all possibilities being considered without bias. In the very
worst case, when you have no relevant information that can be incorporated,
this regresses to an uninformative prior (such as the uniform distribution),
which is easily and rigorously handled by this principle even in far more
complex cases where other approaches fail entirely. Furthermore, given a
different prior, this process can tell you exactly what additional
(unwarranted) assumptions are made.
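As a concrete sketch of the principle, here is Jaynes' die example solved numerically: among all distributions on {1,...,6} constrained to have mean 4.5, the maximum-entropy one takes the exponential-family form p_k ∝ exp(λk), and the code below (a minimal illustration, not any particular library's MaxEnt API) solves for the multiplier:

```python
import numpy as np
from scipy.optimize import brentq

# Jaynes' die example: among all distributions on {1,...,6} whose mean is
# constrained to 4.5, the maximum-entropy one is p_k proportional to
# exp(lam * k), with lam chosen to satisfy the mean constraint.
support = np.arange(1, 7)

def mean_given(lam):
    w = np.exp(lam * support)
    return (w / w.sum()) @ support

# Solve for the Lagrange multiplier that matches the constrained mean.
lam = brentq(lambda l: mean_given(l) - 4.5, -5.0, 5.0)
w = np.exp(lam * support)
p = w / w.sum()

# With no constraint at all (lam = 0) this collapses to the uniform prior.
print(np.round(p, 4))
```

With the constraint removed, the same machinery returns the uniform distribution, which is the "worst case" regression mentioned above.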

Once the priors are specified, the actual process of Bayesian reasoning is
formal logical reasoning generalised to the case where you possess incomplete
information. It tells you exactly the degree of belief you can assign to a
proposition given some information; and, given that information, assigning
either less or more belief than the precise amount this deductive process
prescribes are equally grave mistakes.
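As a minimal worked instance of that deductive step (all numbers invented):

```python
# Toy numbers for the deductive step: given P(H), P(E|H) and P(E|~H),
# Bayes' rule fixes the exact degree of belief P(H|E) -- no more, no less.
p_h = 0.01
p_e_given_h = 0.95
p_e_given_not_h = 0.05

# Total probability of the evidence, then the posterior.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 4))  # ~0.161: strong evidence, but the prior matters
```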

For further information on the Principle of Maximum Entropy I recommend
reading Prior Probabilities (1968) [1] and chapters 11 and 12 of Probability
Theory: The Logic of Science [2]. If you are unconvinced of the theoretical
validity or universality of Bayesian reasoning I heartily recommend reading
chapters 1 and 2 of Probability Theory: The Logic of Science [2].

[0]:
[https://en.wikipedia.org/wiki/Principle_of_maximum_entropy](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy)
[1]:
[http://bayes.wustl.edu/etj/articles/prior.pdf](http://bayes.wustl.edu/etj/articles/prior.pdf)
[2]:
[http://bayes.wustl.edu/etj/prob/book.pdf](http://bayes.wustl.edu/etj/prob/book.pdf)

------
mcguire
I'd like to point out that frequentist statistics suffer from just as many
philosophical shenanigans. (For examples, see any introduction to Bayesian
statistics.)

If you want "rationality", you're going to have to look elsewhere.

------
ced
_This is on everyone’s short list of problems with Bayes. In the simplest
interpretation of Bayes, old evidence has zero confirming power. If evidence E
was on the books long ago and it suddenly comes to light that H entails E, no
change in the value of H follows. This seems odd – to most outsiders anyway._

I don't understand what he's referring to here. If we now know that H entails
E, then that means our model of the world changed, and thus our posterior on H
changed as well. Did I miss anything?

It's an interesting article, but there is a lot of debate around the
interpretation of Bayesian inference, and IMO there are answers to be found.
In particular, Andrew Gelman argues against the subjective interpretation:
[http://www.stat.columbia.edu/~gelman/research/published/phil...](http://www.stat.columbia.edu/~gelman/research/published/philosophy_online4.pdf)
It's the best article I've read on the subject.

~~~
jsprogrammer
E was already known, and thus, taken into account.

------
graycat
So, he finds that intuitive guesstimates of probabilities are not very
effective. No joke.

While probability is a really nice and useful theoretical construct, in
practice getting an accurate numerical estimate often ranges from challenging
to not doable.

Broadly there are three approaches:

(1) For something like coin flipping, just call it 1/2 and move on!

(2) There are stacks of theorems that can help, sometimes a lot. E.g., there
is the renewal theorem, which says that lots of stochastic arrival processes
converge to Poisson processes; often in practice a good estimate of the
arrival rate is easy, and then a lot more probabilities just drop right out of
various expressions for Poisson processes. Can also make use of the central
limit theorem, the law of large numbers, the martingale inequality, a Markov
assumption, etc. Here one of the best little tricks is to use intuition to
justify independence (generally much more effective than using intuition to
estimate Bayes _priors_ ) and then exploit that assumption.
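A toy sketch of that workflow, with invented numbers: estimate the rate once, then read probabilities straight off the Poisson and exponential formulas:

```python
import math

# Hypothetical numbers: estimate the arrival rate once, and many
# probabilities drop out of the Poisson expressions for free.
arrivals_seen = 120                    # arrivals observed...
hours_watched = 8.0                    # ...over this many hours
rate = arrivals_seen / hours_watched   # 15 per hour

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# P(at most 10 arrivals in the next hour), straight from the pmf:
p_le_10 = sum(poisson_pmf(k, rate) for k in range(11))

# Gaps between arrivals are Exponential(rate), so P(next gap > 10 minutes):
p_long_gap = math.exp(-rate * (10 / 60))

print(round(p_le_10, 3), round(p_long_gap, 3))
```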

(3) Start with a Bayes _prior_ or whatever the heck, but have an iterative
scheme that can run several or many iterations, backed by some solid theorems
that the scheme converges. Then iterate your way there.
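A minimal sketch of such a scheme, using a conjugate Beta-Binomial model with made-up numbers (here convergence is backed by standard posterior-consistency results):

```python
import random

# Start from a prior, fold in one observation per iteration; the posterior
# provably concentrates on the true value as data accumulates.
random.seed(0)
true_p = 0.3             # unknown quantity the scheme is estimating
a, b = 1.0, 1.0          # Beta(1, 1) prior, i.e. start from uniform

for _ in range(5000):    # one Bernoulli observation per iteration
    x = 1 if random.random() < true_p else 0
    a, b = a + x, b + 1 - x

posterior_mean = a / (a + b)
print(round(posterior_mean, 3))  # iterates its way toward true_p = 0.3
```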

~~~
elcritch
Great post. Couldn't have said it any better myself. Point 3 is essential,
IMHO. Superforecasting by Philip Tetlock & Dan Gardner [1] gives an excellent
description of this process in the realm of human forecasting, even though
they don't phrase it as a Bayesian approach. Essentially they found that those
best able to predict world events continuously honed their estimates using an
iterative process, updating what could really be described as the
superforecasters' priors.

It's an enlightening read as they describe some of the processes used to hone
intuited estimates using outward- and inward-looking processes. I'm going to
have to look into what you mean by using intuition to judge independence. Any
good sources on that?

[1]:
[https://en.m.wikipedia.org/wiki/Superforecasting](https://en.m.wikipedia.org/wiki/Superforecasting)

~~~
graycat
> I'm going to have to look into what you mean by using intuition to judge
> independence. Any good sources on that?

If you look at the relevant theorems, you will conclude that, intuitively,
random variables X and Y are independent if and only if knowledge of one of
them does nothing to tell you more about the other one.

If you don't have independence, then you can often have conditional
independence; e.g., random variables X, Y being conditionally independent
given Z implies

E[XY|Z] = E[X|Z] E[Y|Z]

So,

X = number of auto accidents today in LA

Y = number of auto accidents today in NY.

----------

In a family with two children

X = gender of the first child

Y = gender of the second child

---------

At a Web site

X = number of new user sessions from 10:00 AM to 10:01 AM

Y = number of new user sessions from 10:01 AM to 10:02 AM

-----------

On a commercial airline flight

X = height of the first passenger to buy a ticket

Y = height of the 10th passenger to buy a ticket

Etc. Just from intuition, knowledge of one of these two tells you nothing more
about the other one (we assume we know the two distributions -- getting
insight into a distribution doesn't count).

Then, using independence and very little more, you can use the main limit
theorems: the law of large numbers, the central limit theorem, etc.

If you don't have independence, then sometimes you can fake it: given a list
of n phone numbers, if you toss them into a bucket, stir it, and pull out the
numbers, labeling them X_1, X_2, ..., then in common situations you can assume
that the X's are independent.

There are tests for independence. Likely the first one to know about is a
chi-squared test on a contingency table of counts of the two variables.
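A quick sketch of that chi-squared test, using scipy's implementation on a hypothetical 2x2 table of counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table of counts for two binary variables;
# a small p-value argues against independence.
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 2), p < 0.05)
```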

------
tansey
It seems like most of these complaints center around the idea of the inherent
subjectivity of the prior. In cases like astronomy and other hard sciences,
the prior reflects actual scientific knowledge and is not really subjective at
all. In cases where we don't have that kind of evidence, empirical Bayes
methods work very well by just peeking at some subset of the data and finding
a good point estimate for the prior.
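A small simulated sketch of that empirical-Bayes move (the true prior, the counts, and all numbers below are invented for illustration):

```python
import numpy as np

# Peek at the observed per-item rates to fit a Beta(a, b) prior by the
# method of moments, then shrink each raw estimate toward it.
rng = np.random.default_rng(0)
true_rates = rng.beta(8.0, 12.0, size=500)   # latent, unknown in practice
n = 40                                        # trials per item
successes = rng.binomial(n, true_rates)
raw = successes / n

m, v = raw.mean(), raw.var()
v_prior = max(v - m * (1 - m) / n, 1e-6)      # remove binomial sampling noise
ab_sum = m * (1 - m) / v_prior - 1            # from Beta mean/variance formulas
a_hat, b_hat = m * ab_sum, (1 - m) * ab_sum

# Posterior mean per item: raw estimate shrunk toward the fitted prior mean.
shrunk = (successes + a_hat) / (n + a_hat + b_hat)
print(round(a_hat, 1), round(b_hat, 1))
```

The shrunken estimates typically beat the raw per-item rates in aggregate error, which is the practical payoff of the fitted prior.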

I'm also not sure why the OP thinks that calculating the normalizing constant
is a huge issue. You rarely need it, since you're likely going to end up doing
MCMC or some other sampling method for the posterior, in which case you only
need proportionality.
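A minimal Metropolis sketch of why proportionality suffices (the target density here is made up):

```python
import math
import random

# The Metropolis acceptance step compares a *ratio* of posterior densities,
# so the unknown normalizing constant Z cancels out entirely.
def log_unnorm(x):
    return -x ** 4 / 4.0        # proportional to the posterior; Z unknown

random.seed(1)
x, samples = 0.0, []
for _ in range(50000):
    proposal = x + random.gauss(0.0, 1.0)
    # Accept with probability min(1, p(proposal)/p(x)) -- no Z anywhere.
    if math.log(random.random()) < log_unnorm(proposal) - log_unnorm(x):
        x = proposal
    samples.append(x)

mean = sum(samples) / len(samples)
print(round(mean, 2))  # the target is symmetric, so this sits near 0
```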

There are lots of problems with Bayesian methods in practice, but most of them
revolve around the scalability of modern methods to massive data sets and very
complicated models. Many Bayesians tend to think that it's absolutely crucial
to quantify uncertainty and that the added computational cost and human effort
is worthwhile. In practice, point estimate methods to find MAP or even just
maximum likelihood values work really well for most problems. If you look at
the trend in most machine learning, for instance, generally people find a cool
way to solve some problem with good performance (e.g. SGD + Deep Nets), then
some Bayesian lab spends a few years trying to interpret everything as a
generative model and coming up with a clever way to sample everything (e.g.
Lawrence Carin's lab at Duke has done a lot of this work in Deep Bayesian
Nets). The end result is usually better, but by then most people have moved on
to a newer problem and the appeal of getting a marginal boost in performance
is harder for me to see. The Bayesian nonparametrics crowd has historically
done a pretty good job of hitting a sweet spot of compromise on this by
keeping a Bayesian view but still (usually) treating everything as an
optimization problem first (e.g. variational inference methods).

~~~
skybrian
It seems like your attitude towards statistical errors is largely going to
depend on the risk of making bad predictions.

For example, the problem with the denominator containing "unknown unknowns"
isn't much of an issue if you're searching photographs or optimizing ad
revenue. It's much more important for something safety-related like building
an airplane or a driverless car.

Finance is in the middle: not directly safety related, but modeling tail risk
the wrong way could bankrupt the company.

------
indiana-b
I do not understand the problem people have with priors in Bayesian
methodology. Yes, it is true that a poor choice of prior can affect results.
But classical, frequentist techniques incorporate priors implicitly: a flat
prior indicating we have no information other than the data. And just as a
poor Bayesian prior based on subjective belief can ruin an analysis, a
noninformative prior adopted implicitly can be just as catastrophic. And it is
truly a rare case when there is absolutely nothing known about a process; in
those cases, a flat prior _is_ the kind of poor prior that these people are so
afraid of.

~~~
cwyers
Right. All human endeavor involves subjectivity in some fashion, Bayesian
thinking just lets you quantify it and handle it explicitly.

~~~
jsprogrammer
I believe this is a contradiction. How do you quantify subjectivity?

~~~
bordercases
Let's start with something lighter: how do you categorize subjective states of
mind? We do it through words like "happy", "sad", "good-will". Additionally,
we can make statements as to the degree of sadness, or degree of happiness.
Psychologists do this professionally with their survey instruments, but even
the layman is capable of saying "I was most happy at my wedding and least
happy at my great-grandfather's funeral", which sets bounds with two points of
reference.

Oops! Anything with degrees of effect and bounds is basically a finite
ordering. And a finite ordering can be given a rank, with numbers. So now we
have given numbers to subjective states.

This is obviously quite handwavy, but the core ideas are here. I might be
missing something. The big difference is that probability assigns a continuous
measure to belief. How do you get an infinite, uncountable number of
experiences?

One way is to have the acceptable accuracy be less than an infinite number of
digits. That's always going to happen in practice, but it's not quite what
we're looking for. Another way to do it is to use preferences and expected
value. Do you think your knowledge is complete enough that you'd be
indifferent to betting an amount X of money on a roulette wheel with a given
probability of winning P? Then P is your degree of belief.
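As a toy version of that betting operationalization (numbers invented):

```python
# If you are indifferent between keeping a stake and wagering it for a fixed
# payout, expected-value indifference pins down your degree of belief:
# stake = degree_of_belief * payout.
stake = 10.0    # amount you would just barely risk...
payout = 40.0   # ...for a ticket paying this if the bet wins

degree_of_belief = stake / payout
print(degree_of_belief)  # 0.25
```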

In the end there are very good reasons to doubt subjective probability
estimates. But as long as one works with bounds and acceptable degrees of
error, then you at least get a sense of how it is you're going wrong.

~~~
jsprogrammer
You can maybe gain some sense, but I'm not sure it can tell you exactly how
you're going wrong (except possibly to say, " _if_ you believe these things,
it makes no sense to also believe this other thing", where a thing is a
probability value assignment -- the danger is in believing that such analysis
can give you positive knowledge [for instance, "I believe these things, so
this other thing _must_ be the case"]). I believe you need a total ordering,
not just a finite, partial ordering. Maybe you can say the wedding is good and
the funeral is bad, but how would you incorporate something like the birth of
a child, the taste of some food, etc.? Basically, you have to reduce all
aspects of subjectivity to an integer, which could very well be impossible
(making the activity of trying to turn subjectivity into integers highly
suspect).

I understand that economic games sometimes use degree of belief, but in the
example of a roulette wheel, we can actually count up all of the possibilities
and assign numbers based on that. I don't think we can count all of the
possible subjective states.

>How do you get an infinite, uncountable number of experiences?

I experience this all the time (though I can't say the experiences are
infinite, there are certainly more than I can count).

~~~
bordercases
I lost my reply because the site went down! I'll give a brief one here.

> integers

Or a matrix. Or with some dimensions but not others. A total ranking is
implausible and lossy, but partial rankings for some traits are tractable: you
just have to be careful. If motivated by a decision, the use of probability
becomes clearer, since you can declare what kinds of errors you can handle and
what you can't, relative to the information that you specifically want from an
event.

> positive knowledge

This is a problem with statistics generally, not Bayes specifically. Null
hypothesis testing with its p-values and t-tests can only reject a known
distribution; it cannot tell you the real distribution without testing all of
them. At that point it can be as prone to GIGO as Bayesian methods are.

There are some statisticians who dissolve the whole Bayes vs Not debate by
focusing on the optimisation of loss functions. Although it lacks the
philosophical pyrotechnics of subjective probability, in practice it's
probably the most reasonable approach: do what works.

