
The Ghost of Statistics Past - jimsojim
http://crypto.stanford.edu/~blynn/pr/ghost.html
======
stdbrouw
The author seems to imply that frequentists are not just wrong but borderline
malicious – they use all sorts of ad hoc procedures, have no theoretical
justification for anything they're doing, and so on and so on. But in fact,
there is an extensive body of theory about when you can "throw away
information" (cf. minimally sufficient statistics), the difference between
P(D|H) and P(H|D) (cf. likelihood theory) and so on. His undergraduate
textbook might not talk about all that, but a graduate textbook sure would --
it's certainly not lore full of implicit assumptions that everyone's
forgotten.

A much better explanation of what is happening to statistics is this: flipping
P(D|H) to P(H|D) used to be really hard, so we came up with all sorts of
tricks to either approximate it or to get by without having to bother about
P(H|D) at all, things like p-values. Now there's computers and really good
sampling algorithms so we don't have to use the approximations anymore. But
some people still prefer the approximations because they're used to them.
There. No incompetence or malice involved.
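
To make that concrete, here's a minimal sketch (in the same Haskell style as
the article's snippet, and with its 12-and-8 counts; the names are mine) of
computing P(H|D) directly: put a uniform prior on the coin's bias, score each
candidate bias by how well it explains the data, and normalise.

    -- Grid approximation of the posterior over the coin's bias p,
    -- assuming a uniform prior and a 12-and-8 split over 20 flips.
    ps :: [Double]
    ps = [0, 0.01 .. 1]                      -- grid of candidate biases

    likelihood :: Double -> Double
    likelihood p = p^12 * (1 - p)^8          -- P(D | bias = p), up to a constant

    posterior :: [Double]
    posterior = [likelihood p / z | p <- ps] -- P(bias = p | D) on the grid
      where z = sum (map likelihood ps)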

Other than that, if you ignore the tone of the article, it's an insightful
read if you're new to Bayesian statistics and want to understand what the
point is.

~~~
eterm
The author is also wrong when he states, "We used the entire sequence, not
just the number of heads": the calculation only uses the number of heads and
tails, not the sequence.

The calculation used,

> sum[p^12*(1 - p)^8 | p <- [0,0.1..1]] / 11

does not include anything that marks position.

You would get the same result with the sequence

HHHHHHHHTTTTTTTTTTTT as with TTHHTHTTHHTHTTTTHTTH

To say "In other words, we’ve shown it’s fine to forget the particular
sequence and only count the number of heads after all. What is not fine is
doing so without justification." is a confused conclusion from his statements,
he hasn't actually justified it, he's just used a calculation that doesn't use
position (but he claimed it did) then when it got to the same answer ignoring
position then said "so position doesn't matter!".

But you would have a hard time believing sequence 1 above was random.
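
For what it's worth, here's a quick sketch of that point (the score helper is
mine, not the article's): compute the article-style average likelihood
directly from a sequence and note that only the head and tail counts ever
enter, so the two sequences above get exactly the same score.

    -- Score a sequence the way the article's calculation does; position never enters.
    score :: String -> Double
    score flips = sum [p ^ h * (1 - p) ^ t | p <- [0, 0.1 .. 1]] / 11
      where
        h = length (filter (== 'H') flips)
        t = length (filter (== 'T') flips)

    -- score "HHHHHHHHTTTTTTTTTTTT" == score "TTHHTHTTHHTHTTTTHTTH"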

------
ryanmonroe
One reason people might not follow this method for inference is that it
depends heavily on the experimenter's personal beliefs about the world. The
"improvements" here only lead to more accurate inference if the author's
assumptions about the world are true. In business applications you often just
want to get to a conclusion and make a decision, so this method makes sense.
In scientific publication you want to verify results with a larger community
with minimal assumptions. When you calculate a p-value and print it in
publication, you might not be giving much information, but at least that
information is objective and invariant to readers' personal beliefs. When
making inferences based on p-values you can at least say "In the long run I
will incorrectly reject the null hypothesis 5% of the time with this method",
while there is no similar statement for a method that depends on the
experimenter's personal beliefs.
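
As an illustration of that kind of statement (a toy sketch with my own helper
functions, using a 12-8 split over 20 flips like the article's example): the
p-value is just a tail probability under the null hypothesis, and the long-run
error guarantee it carries holds no matter what anyone believes about the
coin.

    -- One-sided p-value for a 12-8 split in 20 flips of a fair coin.
    choose :: Integer -> Integer -> Integer
    choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

    binomProb :: Integer -> Integer -> Double
    binomProb n k = fromIntegral (choose n k) * 0.5 ^ n   -- P(k heads | fair coin)

    pValue :: Double
    pValue = sum [binomProb 20 k | k <- [12 .. 20]]       -- about 0.25, so no rejection at 5%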

In addition, the probability being calculated here is a little misleading in
that it doesn't fit with the traditional definition of "probability of X".
Despite the same notation, P(H|D) is not the same type of probability as
P(D|H). The coin is either biased or not, so there isn't actually any random
process there and the statement "the probability this coin is fair is 50%"
makes little sense under the traditional definition.

~~~
ivan_k
> while there is no such similar statement for a method that depends on the
> experimenter's personal beliefs

The posterior probability P(H|D) is exactly this kind of statement. You say
"Based on the data, I am 78% sure that the coin is biased". I think this is
both more direct and interpretable.

~~~
ryanmonroe
By "similar statement" I meant a statement about how often the method will
lead you to the incorrect conclusion. If you can't quantify that, it seems to
me you don't have any basis for claiming your method "works".

~~~
ivan_k
If that is what you meant, then you are getting into the realm of hypothesis
testing. The best equivalent of a p-value is then the unabashedly named "Bayes
factor" [1], which is the ratio of the marginal likelihoods of the data under
two competing hypotheses.

[https://en.wikipedia.org/wiki/Bayes_factor](https://en.wikipedia.org/wiki/Bayes_factor)
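
As a sketch of what that looks like for the article's data (my code, reusing
the article's quoted likelihood and its grid style): compare "fair coin"
against "coin of unknown bias, uniform prior".

    likFair :: Double
    likFair = 0.5 ^ (20 :: Int)                      -- P(D | fair)

    likBiased :: Double                              -- P(D | unknown bias), averaged over a uniform prior
    likBiased = sum [p^12 * (1 - p)^8 | p <- ps] / fromIntegral (length ps)
      where ps = [0, 0.01 .. 1] :: [Double]

    bayesFactor :: Double
    bayesFactor = likBiased / likFair                -- >1 favours "biased", <1 favours "fair"

For the 12-and-8 data this comes out a bit below 1, i.e. the data on their own
mildly favour the fair coin.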

~~~
ryanmonroe
Thanks, didn't know about this.

------
e12e
Perhaps the post is alluding to some different use of statistics than I'm used
to - but isn't it normal to view a certain set of outcomes as a sample from
some larger population, and _not_ as eg: a _sequence_? In the case of a given
coin, and a controlled _sequence_ of flips, we can ask _different_ questions,
like what is the probability of heads following heads vs. heads following
tails? (Eg: when a certain experimenter flips the coin, does the side that is
up at the start of the experiment affect the outcome? Will someone that
mechanically flips a coin 10s or 100s of times in a row end up with such a
similar mechanical motion that the coin tends to spin approximately the same
number of times from "flip" until it lands?)

I _think_ the author mixes up the mental models involved, when mixing
"throwing away information", "sequence" and "a (typical) fair coin" and "a
(typical) unfair coin".

There's _this_ coin, _this_ sequence, and _those_ typical fair/unfair coins.

Inasmuch as I've been able to grasp anything about proper statistics, it's
the idea that without some notion of the population and sample type (eg: can
we expect a Poisson distribution?), most modern statistics makes no sense. And
one can look at things differently. Just like Newtonian physics is correct _at
the same time_ as quantum physics is correct (most of the time). But sometimes
we need to change our theoretical model to be more precise (quantum) in order
to predict and model behaviour mathematically.

~~~
stdbrouw
A sequence corresponds to multiple draws from that population, and a statistic
(number of successes / number of draws) is a summarization of those draws in a
single number. The author isn't really arguing that a binomial distribution is
not the right way to model this problem, but that the way frequentist
statistics weighs the evidence is faulty.

Edit: although, actually, you're right that the author does at one point say
"Why should all sequences containing exactly 12 heads be treated the same?"
but then goes on to model everything with the binomial distribution anyway.

~~~
lottin
Since each draw is independent from each other (I don't think anyone disputes
that), the order is irrelevant. You want to think about it as a set, not as a
sequence.

------
mgraczyk
The coin flipping example used in the article is pretty weak. We are convinced
that the coin is unfair when we see "HHHHHHHH" but not "HHTHTHTHH" because the
former event is more likely given an unfair coin. We choose the hypothesis
that makes the data most likely. I doubt the authors of the undergraduate
textbook were unaware of maximum likelihood estimation. I suspect that the
post's author simply did not read far enough into his textbook and got hung up
on a simplified model given in an early chapter.

Any hypothesis we test about the coin's "fairness" is implicitly a test of how
close p is to 1/2. The only difference between a frequentist analysis and a
Bayesian analysis in this case would be that the latter might impose a prior
on p (although in reality we would likely impose a uniform prior, rendering
the two analyses identical).
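
To spell out that last parenthetical, a small sketch (my code, illustrative
numbers): the maximum likelihood estimate of p is heads/flips, and with a
uniform prior the posterior over p is proportional to the same likelihood, so
its mode is the same point.

    mle :: Int -> Int -> Double
    mle heads flips = fromIntegral heads / fromIntegral flips

    posteriorMode :: Int -> Int -> Double    -- mode of the grid posterior under a uniform prior
    posteriorMode heads flips =
      snd (maximum [(p ^ heads * (1 - p) ^ (flips - heads), p) | p <- grid])
      where grid = [0, 0.001 .. 1] :: [Double]

    -- e.g. mle 12 20 == 0.6, and posteriorMode 12 20 lands on (approximately) the same value.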

------
gohrt
Computing with frequentist statistics just means making a bunch of simplifying
assumptions, setting some things constant to make computation tractable. The
author correctly hints at that in the middle of the article, but then glosses
past it.

Frequentist vs Bayesian _interpretation_ is like different interpretations of
quantum mechanics. It has no impact on the calculations.

Novice self-labelled "Bayesians" overlook the reality that, as wikipedia
explains:

> where appropriate, Bayesian inference (meaning in this case an application
> of Bayes' theorem) is used by those employing a frequentist interpretation
> of probabilities.

[https://en.wikipedia.org/wiki/Frequentist_inference](https://en.wikipedia.org/wiki/Frequentist_inference)

~~~
ivan_k
I would like to interject. First of all, you are right on several points.

* Most of the mathematics is the same in both schools of thought, and the interpretation does not change it.

* Some basic ideas (i.e. the nature of probability) are quite different, and this is where most of the argument (Frequentist vs. Bayesian) comes from.

However, this second point has a major impact on calculations. So I disagree
here:

* The notion of a _prior_ (probability of the hypothesis, P(H)) is essentially nonsensical in the Frequentist view. Any frequentist would just call it 'bias'. However, for a Bayesian, this is the degree of belief that you put in your system before you do any measurements. Practically, it is either non-informative (you don't give more belief to any hypothesis a priori), or comes from earlier data. The prior gives you a natural way to incorporate multiple experiments (see the sketch below). I think that is a large difference in calculation (or at least the structure of calculation).
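
A sketch of that last point, using the standard Beta-Binomial conjugate pair
(my example, not from the article): a Beta(a, b) prior on the coin's bias
updated with h heads and t tails is just Beta(a + h, b + t), so yesterday's
posterior is literally today's prior.

    data Beta = Beta { alpha :: Double, beta :: Double } deriving Show

    update :: Beta -> (Int, Int) -> Beta          -- (heads, tails) from one experiment
    update (Beta a b) (h, t) = Beta (a + fromIntegral h) (b + fromIntegral t)

    -- foldl update (Beta 1 1) [(12, 8), (7, 13)]  -- non-informative prior, then two experiments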

More importantly, Bayesian statistics inspires (and is enabled by) Markov
Chain Monte Carlo inference. It is the main mathematical machinery used for
today's Bayesian data analysis, and is impossible in the frequentist
framework. This approach allows you to scale to very complicated (i.e.
feature-rich, multiparameter) data and to explore very complex (i.e. non-convex,
hard to optimize) probability surfaces. All of this stuff is very hard (if at
all possible) in the frequentist framework.
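
For concreteness, here is about the smallest possible Metropolis sampler for
the coin's bias posterior (a toy sketch with a fixed seed and a naive
random-walk proposal; a real analysis would reach for Stan or PyMC):

    import System.Random (StdGen, mkStdGen, randomR)

    -- Unnormalised posterior: uniform prior times the 12-and-8 likelihood.
    post :: Double -> Double
    post p | p <= 0 || p >= 1 = 0
           | otherwise        = p ^ 12 * (1 - p) ^ 8

    step :: (Double, StdGen) -> (Double, StdGen)
    step (p, g) =
      let (eps, g1) = randomR (-0.1, 0.1) g             -- random-walk proposal
          (u,   g2) = randomR (0, 1) g1
          p'        = p + eps
      in  (if u * post p < post p' then p' else p, g2)  -- accept with prob min(1, ratio)

    samples :: [Double]
    samples = map fst (iterate step (0.5, mkStdGen 42))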

So there are differences. But people don't really argue about them. There is no
grand flame war, or anything of that sort.

Oh, and contrary to what the article suggests, no statistician likes p-values.

~~~
TuringTest
> The notion of a _prior_ (probability of the hypothesis, P(H)) is essentially
> nonsensical in the Frequentist view. Any frequentist would just call it
> 'bias'. However, for a Bayesian, this is the degree of belief that you put
> in your system before you do any measurements.

Thanks, I was having problems with that point. Stating that "a priori, all
hypotheses are equally likely" looks like too strong an assumption to make
from a complete lack of information. If you interpret it instead as "lacking
information, I don't have a reason to prefer any hypothesis over the others"
it seems more reasonable.

However, that doesn't solve my qualms with the Bayesian approach as explained
in this article.

I understand the justification of Bayes' Theorem from a frequentist approach,
as starting with all the possible outcomes and filtering that initial
probability through the lens of available information; i.e. removing outcomes
that we know can no longer be true, and counting those that can. In such a
context, the theorem seems intuitively true.

However, if the a priori probability is interpreted as a lack of knowledge,
the form of the theorem looks much more arbitrary. Why would that particular
computation be the best way to increase our confidence, if the starting point
is arbitrary and the shape of the formula is not related to the number of
cases that can be true or false in the current state of the world?

I understand that Bayesian analysis comes with well-developed and practical
tools. But what I get from this article is that their particular form seems to
come from tradition rather than any intrinsic property of that model - if you
reject frequentism, any counting model might _a priori_ work as well as the
Bayesian one.

Edit: Apparently Wikipedia agrees with me on this point.[1] There are other
rational models for updating your probabilistic beliefs, and Bayesian updating
is used primarily for being computationally convenient rather than
theoretically incontestable. Or am I reading too much into it? I'm certainly
not an expert in probability.

[1]
[https://en.wikipedia.org/wiki/Bayesian_inference#Alternative...](https://en.wikipedia.org/wiki/Bayesian_inference#Alternatives_to_Bayesian_updating)

~~~
ivan_k
I am not sure what you mean by "the form of the theorem looks much more
_arbitrary_". The derivation of Bayes' law comes from the axioms of
conditional probability.

Given two events A, B; we have:

P(A^B) = P(A|B) * P(B)

Probability of A and B = Probability of A given B happened times probability
of B

Symmetrically, we can say:

P(A^B) = P(B|A) * P(A)

Now we have:

P(A|B) * P(B) = P(B|A) * P(A)

Rearranging, we get:

P(A|B) = P(B|A) * P(A) / P(B)
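
As a concrete (made-up) application to the coin: suppose the only hypotheses
are "fair" (P(heads) = 0.5) and "biased" (P(heads) = 0.8), each with prior
probability 0.5, and we observe a single head. Then

P(biased|H) = P(H|biased) * P(biased) / P(H)
            = 0.8 * 0.5 / (0.8 * 0.5 + 0.5 * 0.5)
            ≈ 0.62

so one head shifts belief towards "biased" by exactly the factor by which that
hypothesis makes the observation more likely.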

So I do not see this as being particularly arbitrary. While other rational
models are possible, I find this one rather practical and satisfying.

[Edit]: markup

~~~
TuringTest
What I mean is that those axioms of conditional probability seem intuitively
true because of their frequentist interpretation, i.e. counting the possible
cases that satisfy each probability.

If you divorce them from the combinatorics that justify their meaning, there's
no special reason to accept these particular axioms, nor the law derived from
them.

------
s_gupta
This is just the maximum likelihood test, which is a frequentist method.

