
How Bayes’ Rule Emerged Triumphant from Two Centuries of Controversy - evanb
http://www.mcgrayne.com/the_theory_that_would_not_die__how_bayes__rule_cracked_the_enigma_code__hunted_d_107493.htm
======
arcanus
Bayesian here. 'The theory that would not die' is a wonderful read, and notes
how many scientists used Bayesian techniques (subjective probability) in a
variety of contexts while the field was still unpopular (arguably, heretical)
in the mainstream statistical community.

Bayes was nevertheless used to inform ballistics calculations, help crack the
Enigma code, and plan search patterns for lost nuclear weapons.

I don't think we have seen the end of Bayes, either, as it is very useful for
uncertainty quantification in the engineering sciences, machine learning
techniques, and even the discovery of the Higgs boson.

~~~
amasad
As someone new to Bayesianism I'd be interested in hearing your experience
applying it in day-to-day life. How useful do you think it is to the ordinary
Joe?

From my brief experience, after learning Bayes, my intuition about things
involving probabilities grew very different from that of the people around me.
For example, my friends were planning a skydiving trip, and I googled the name
of the skydiving business and found that they had had fatal incidents in the
past. My friends were convinced that, given those incidents, either the
probability that we would have an incident using them didn't change, or,
worse, that since an incident had already happened, one was now less likely to
happen. Wat.

~~~
mikekchar
The one place I've used Bayes (hopefully properly!) is in a spaced repetition
flash card program. Usually spaced repetition algorithms wait a certain amount
of time based on how many times you have seen and remembered a card. The more
times you have remembered it, the longer you wait. It then creates a schedule
for each day. You review the cards that have "expired" their wait time.

I wanted to turn this upside down. Instead I sorted the cards by how likely
you would be to remember them (a function of the number of times you have
already remembered it and the amount of time that has passed since you last
saw it). I put the least likely to be remembered cards first and the most
likely to be remembered cards last.

As you reviewed the cards, you would either remember them or not. I supposed
that cards grouped together had a similar probability of being remembered
(only valid if my sort algorithm was correct, but I had some confidence in
that since I based it on somebody else's research ;-) ). I then used Bayes to
estimate the probability of getting the card correct.

So instead of scheduling the cards, I simply had the user keep reviewing until
I had a certain confidence that there was a 90% or greater chance of getting
the cards correct. At that point I left the rest for another day (the
probability goes down over time). Interestingly, as the test is binary
(remembered/not remembered), Bayes could be simplified down to getting a card
right a certain number of times in a row. This was simpler for the user to
understand without significantly reducing the accuracy of the estimate.

Gratifyingly, the system worked incredibly well. I added an option to continue
drilling the cards after you hit the 90% mark, and I _very_ rarely got into a
situation where the estimate was wrong.
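A minimal sketch of the idea above, assuming a Beta-Bernoulli model over the recall probability of a group of similar cards (this is a hypothetical reconstruction for illustration, not the actual JLDrill code):

```python
from math import exp, lgamma, log

def update(a, b, remembered):
    """Beta-Bernoulli conjugate update after one review."""
    return (a + 1, b) if remembered else (a, b + 1)

def prob_recall_at_least(a, b, threshold, steps=10_000):
    """P(p >= threshold) under a Beta(a, b) posterior (midpoint rule)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    dx = (1.0 - threshold) / steps
    total = 0.0
    for i in range(steps):
        x = threshold + (i + 0.5) * dx
        total += exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x)) * dx
    return total

# Start from a uniform Beta(1, 1) prior and keep drilling until we are,
# say, 50% confident that the recall probability is at least 0.9.  With
# a uniform prior and no failures this reduces to "get it right n times
# in a row", since after n successes P(p >= 0.9) = 1 - 0.9 ** (n + 1).
a, b = 1, 1
streak = 0
while prob_recall_at_least(a, b, 0.9) < 0.5:
    a, b = update(a, b, remembered=True)
    streak += 1
# streak is now 6: six correct answers in a row reach the target.
```

This also shows why the binary simplification mentioned above works: with a uniform prior and an unbroken streak, the posterior depends only on the streak length.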

~~~
kybernetikos
That sounds very interesting. Is your program available anywhere?

~~~
mikekchar
See my reply to mtrimpe below. I maintained this program for quite a long
time, but realistically my choice of development platform was a poor one ;-)
Also, my code was pretty awful, as I was experimenting with several strange
ideas and also writing Ruby code as if I had spent the last 20 years writing
C++ (which... um... might have been true...)

You can likely get it to work for some definitions of "work" on a Linux box,
but anything else would require serious effort ;-)

Link in case you don't see the other message:
[https://github.com/mikekchar/JLDrill](https://github.com/mikekchar/JLDrill)

------
hacker42
I could never quite understand the divide between Bayesian statistics and
frequentist statistics. Both seem to be ultimately about counting the
frequency with which something occurs and normalizing this frequency with
respect to the number of all possible outcomes. Bayesian statistics is
essentially concerned with the application of the Bayesian updating
technique, by which one can iteratively improve a distribution over the values
of a parameter toward the true distribution using Bayes' rule. One can prove
that given sufficiently many updates, the initial distribution does not matter
as long as it is non-zero for all possible values.

What I am not quite seeing is how this has philosophical implications which
divide the field into frequentists and Bayesians. It rather seems that
Bayesian statistics is just a frequentist method that is helpful for dealing
with noisy measurements, as one can always recover from a bad distribution
using more updates.

~~~
twanvl
Bayesian and frequentist approaches ultimately have a different notion of
probability.

In the frequentist approach, a probability of 10% means that if you repeat an
experiment many times, roughly 1 out of 10 times you will observe an event.

In Bayesian statistics, a probability of 10% means that you are that certain
about the event happening, so you would be willing to bet on it at 9 to 1 odds
against. There doesn't have to be any repetition of the experiment for the
probability to make sense. And, as you can hopefully see, there is always a
prior, that is, your belief about the event before doing any experiments.

~~~
eximius
Another way to look at it is this way:

P(H|D) = P(D|H) P(H) / P(D)

Bayesians are interested in the probability of various hypotheses H given some
data D.

Frequentists calculate the probability of some data given a hypothesis (a
p-value is not strictly a probability, though it can be one; it is ALWAYS a
measure of how extreme the data are under the assumed hypothesis, which can be
considered a relative probability).

Most interesting to me is that the Bayesian formula _includes_ P(D|H), which
is basically what the frequentists are calculating. In this sense, the
question Bayesians answer is far closer to what we want to ask and far more
powerful. In practice, though, the frequentist approach is often more than
enough. The tradeoff is computability and simplicity.
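As a toy illustration of the formula with made-up numbers (nothing here is from the thread): suppose H is "this coin is biased toward heads, p = 0.75", the alternative is a fair coin, and D is "eight heads in ten flips".

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_h = 0.5                                # prior P(H): 50/50 the coin is biased
p_d_given_h = binom_pmf(8, 10, 0.75)     # likelihood P(D|H) under "biased"
p_d_given_not_h = binom_pmf(8, 10, 0.5)  # likelihood under "fair"

# P(D) by total probability, then Bayes' rule for the posterior P(H|D)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
posterior = p_d_given_h * p_h / p_d      # about 0.87
```

Note that both likelihood terms in the denominator are exactly the "probability of data given a hypothesis" quantities a frequentist would compute; the Bayesian step is just weighing them by the prior.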

~~~
hacker42
Interesting, I've never thought about the likelihood as a confidence, but it
makes sense. But sometimes the confidence also seems to reflect the opposite
of extremity (e.g. for the null hypothesis).

------
jordigh
The biggest historical obstacles to Bayesian stats were the small amounts of
data available and the difficulty of computation. Frequentist stats is
optimised around these constraints. With 30 samples and very easy
computations, you can produce a reasonable frequentist confidence interval,
sometimes even with less data. On the other hand, even the simplest Bayesian
analysis, determining the probability of heads in a coin flip, requires some
interesting integration and the somewhat obscure Beta distribution.

Now that we have lots of data and lots of computing power, Bayesian stats can
show its results, after getting rebranded as "machine learning".

~~~
tgb
> On the other hand, even the simplest Bayesian analysis of determining the
> probability of heads in a coin flip requires some interesting integration
> and the somewhat obscure Beta distribution.

Isn't this kind of misleading? The end result of the Beta distribution, etc.
is just the extremely simple-to-compute Rule of Succession [1].

[1]
[https://en.wikipedia.org/wiki/Rule_of_succession](https://en.wikipedia.org/wiki/Rule_of_succession)
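To make the connection concrete, here is a minimal sketch (assuming a uniform Beta(1, 1) prior): the posterior after s heads in n flips is Beta(1 + s, 1 + n - s), whose mean is exactly the Rule of Succession's (s + 1) / (n + 2).

```python
def rule_of_succession(s, n):
    """Laplace's estimate after s successes in n trials."""
    return (s + 1) / (n + 2)

def beta_posterior_mean(s, n, prior_a=1, prior_b=1):
    """Mean of the Beta posterior over the heads probability."""
    a = prior_a + s        # prior pseudo-count of heads + observed heads
    b = prior_b + (n - s)  # prior pseudo-count of tails + observed tails
    return a / (a + b)     # mean of Beta(a, b) is a / (a + b)

# e.g. 7 heads in 10 flips: both give 8 / 12
```

So the "interesting integration" only has to be done once, symbolically; the estimate a user actually computes is one addition and one division.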

~~~
jordigh
A bit, but in general, computing posterior distributions tends to very quickly
lead to more complicated integrals. It just so happens that the Beta
distribution is somewhat nice and symmetrical, if a bit obscure.

~~~
XFrequentist
More generally, this is why there's (still) a mild obsession with "conjugate"
prior/likelihood distribution pairs, i.e. combinations of prior and likelihood
that give analytically tractable posteriors[1], despite the ability to easily
get results with MCMC.

[1] Yes, everyone wants a nice posterior. You're very clever.
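As a sketch of why conjugacy is so convenient, here is another classic pair besides Beta/binomial: a Normal prior on an unknown mean with Normal observations of known variance (all numbers illustrative). The posterior stays Normal, and the update is a couple of arithmetic operations rather than an integral or an MCMC run.

```python
def normal_normal_update(prior_mean, prior_var, obs, obs_var):
    """Posterior over an unknown mean after one Normal observation.

    With a Normal prior and a Normal likelihood of known variance, the
    posterior is Normal: precisions (inverse variances) add, and the
    posterior mean is the precision-weighted average of prior and data.
    """
    prior_prec = 1.0 / prior_var
    obs_prec = 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs)
    return post_mean, post_var

# e.g. prior N(0, 1), observe 2.0 with noise variance 1 -> posterior N(1.0, 0.5)
```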

------
dkbrk
I found this talk extremely interesting despite being already somewhat
familiar with the history of Bayesian methods; I heartily recommend watching
it in its entirety regardless of prior familiarity with the subject.

The talk was also presented very well. It's almost novel nowadays for
something like this not to be some sort of PowerPoint presentation, and I
can't say it suffered for it.

------
danbruc
Is there any (uncontroversial) theory that rigorously defines what a 50 %
probability for heads and tails means? It certainly doesn't mean that in the
long run you will obtain the same number of heads and tails because there is a
(vanishing) chance that you will always get heads even though the coin is
actually fair. And just saying that you will obtain the same or at least
similar number of heads and tails with high probability is a circular
definition at best.

Does Bayesian thinking just deny the existence of intrinsic probabilities, or
consider them out of scope? Assuming there is such a thing as true randomness,
for example in quantum physics (and it is not due to our ignorance of hidden
variables as in Bohmian mechanics), could a Bayesian assign probabilities to
the outcomes? Would a Bayesian be misled into assigning something other than
50/50 for spin up and spin down if he, by chance, only observes spin up,
although he is really confident that this is just a statistical fluke? If he
is misled, does that mean that uncertainty about the truth of his assumptions
creeps in whenever the system momentarily deviates from the expected behavior?

~~~
jordigh
> Is there any (uncontroversial) theory that rigorously defines what a 50 %
> probability for heads and tails means?

Yeah, Kolmogorov's axioms:

[https://en.wikipedia.org/wiki/Probability_axioms](https://en.wikipedia.org/wiki/Probability_axioms)

Under these axioms, a 50% probability means that the measure of the event
"heads" under the corresponding probability measure is one-half.

But "rigorous" doesn't have anything to do with the natural world. You can't
make physics "rigorous", for example, but you can make the mathematics
inspired by physics rigorous. Kolmogorov's axioms just give a mathematical
description of probability in a formal sense, i.e. only discussing its form,
not its meaning. Formalism is about saying, "whatever this means, this is how
it should behave". It's a very 20th-century notion of mathematics.

~~~
danbruc
I know Kolmogorov's axioms, but I am really more interested in the part they
avoid: what is the meaning of a probability of 0.5? It is surely nice that we
can operate with probabilities in a (hopefully) self-consistent way, but it
bugs me quite a bit that I don't precisely understand what the result of a
calculation implies for the real world.

~~~
jordigh
The modern approach to mathematics is that there is no "meaning", just like
"2" or "derivative" has no meaning. We just say how it behaves, or define it
in terms of other things, which eventually bottoms out with undefined terms,
such as sets and set membership. This is formalism.

How you apply mathematics to the world is not the business of formal
mathematics. Whatever you want to do with it is "mere" philosophy. ;-)

~~~
danbruc
That is what I meant - I am more asking for a solid philosophical
interpretation than a mathematical theory, after all frequentism, Bayesianism
and all the other interpretations seem to carry quite a bit of philosophy.

On the other hand, in another comment it boiled down to the question of
whether there is a measure that gives 1 to the set of all infinite binary
sequences with a 50/50 proportion of zeros and ones and 0 to the set of all
other sequences. So what I am interested in is not pure philosophy.

------
moon_of_moon
If you are in London you can visit the tombstone of the good Reverend Bayes at
the Bunhill Fields burial grounds, a short walk from the Old Street tube
station, and pay your respects. There is also a bench facing a spacious lawn
that you can sit on and contemplate Bayes in silence. He probably sat there at
some point, come to think of it.

------
dschiptsov
Brexit is the best example so far.

That painful dissonance between so called reality and these probabilistic
models.

~~~
dschiptsov
Probability makes sense only with absolutely certain things, like a fair coin
or a die.

In cases where there is no absolute certainty about how many sides or
dimensions your "die" has, that it is not biased, and that no other forces or
factors are in play, probability ceases to make sense.

The probability of A given B becomes meaningless when either A or B isn't
precisely defined (as in the case of a "fair coin"), and so does the
relationship between the two.

Applying Bayes' rule to "estimated" probabilities is just wrong and
unscientific (in the face of ambiguity, avoid the temptation to guess).
Multiplying and dividing nonsense by nonsense yields nonsense.

The global financial crisis and the recent cocksure consensus about the
outcome of the Brexit referendum the day before the vote are good evidence.

~~~
greenshackle
I disagree. There's a large grey area between 'completely unknown' and
'scientific certainty'. I prefer guessing and doing computations with my
guesses to throwing my hands up in the air and calling it unknowable.

When presented with new evidence, it's better to write down a number for your
degree of belief in X, ask how much you should change that belief based on the
new data, and update your probability estimate, than to just go with your
feelings. You're not doing actual Bayesian computations (that's totally
intractable for anything that's not a very well-defined problem), but doing
'pseudo-bayesian' updating is better than not.

I think of it as the Fermi approximation of probabilities. You won't get
accurate numbers that way, but you'll get better numbers than if you just
invent the answer.

EDIT to add: most of the time you should then _throw out the number_. Just
like with a Fermi estimate, you get a ballpark sense of the answer, not a
_precise_ answer.

In the superforecasting experiments by Tetlock, the best forecasters did this.
They wrote down probability estimates and methodically updated them based on
new information (news articles, data, etc.). They were forecasting
geopolitical events, not dice rolls, and it worked (better than the
alternative; obviously no one can forecast geopolitical events with high
certainty).
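The "write down a belief and update it" routine described above can be sketched in odds form; the numbers are invented for illustration, and the likelihood ratio is exactly the rough, guessed quantity a forecaster would have to supply.

```python
def update_belief(prior_prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Start at 30% belief in some event, then read a news item you judge
# twice as likely in worlds where the event happens as in worlds where
# it doesn't: belief rises to about 46%.
belief = update_belief(0.30, 2.0)
```

The appeal of the odds form is that each new piece of evidence is a single multiplication, which is why it suits this kind of back-of-the-envelope updating.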

