
Response to the ASA’s Statement on p-Values - leephillips
https://errorstatistics.com/2019/01/19/a-letter-in-response-to-the-asas-statement-on-p-values-by-ionides-giessing-ritov-and-page/
======
chalst
Andrew Gelman, a pioneer of the methodology of Bayesian statistics (cited in
the original article; he doesn't call himself a Bayesian in the philosophical
sense), responded to the ASA statement as follows:

[http://www.stat.columbia.edu/~gelman/research/published/asa_...](http://www.stat.columbia.edu/~gelman/research/published/asa_pvalues.pdf)

His conclusion there is that scientific institutions do not accept systematic
uncertainty to the degree they need to. Elsewhere he says that scientists who
are not statisticians should concentrate on gathering quality data, since
noise in data often leads to spurious point estimates:

[https://statmodeling.stat.columbia.edu/2017/02/11/measuremen...](https://statmodeling.stat.columbia.edu/2017/02/11/measurement-error-replication-crisis/)

In the first link, Gelman talks of the "garden of forking paths": this is
essentially a generalisation of the idea of p-hacking, recognising that even
perfectly honest researchers will not conduct unbiased analyses, because of
the myriad of choices and parameters that estimates depend on. The solution is
to move away from summarising results through point estimates and instead
construct statistical models in which you can explore the space of possible
analyses; there has been a revolution in the techniques for doing so through
the application of Markov chain Monte Carlo (MCMC) techniques to construct
posterior distributions.
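To make the MCMC point concrete, here is a minimal sketch of a Metropolis
sampler exploring the posterior of a single mean parameter (the Gaussian
model, flat prior, and all numbers here are my own illustrative choices, not
Gelman's):

```python
import math
import random

def log_posterior(mu, data, sigma=1.0):
    # Flat prior on mu and a Gaussian likelihood with known sigma (toy
    # choices), so the log-posterior is the log-likelihood up to a constant.
    return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)

def metropolis(data, steps=20_000, scale=0.2):
    mu, draws = 0.0, []
    for _ in range(steps):
        proposal = mu + random.gauss(0, scale)
        delta = log_posterior(proposal, data) - log_posterior(mu, data)
        if delta >= 0 or random.random() < math.exp(delta):
            mu = proposal  # accept; otherwise keep the current value
        draws.append(mu)
    return draws[2_000:]  # discard burn-in

data = [random.gauss(0.3, 1.0) for _ in range(50)]
draws = metropolis(data)
print(sum(draws) / len(draws))  # posterior mean, near 0.3
```

The point is that the output is a whole distribution over the parameter, not
a single point estimate, so you can see how conclusions shift under different
priors and model choices.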

------
outlace
It seems clear to me that Popperian falsification is indeed the only way to
separate the wheat from the chaff in theoryspace; however, generating theories
generally requires inductive reasoning.

You observe data, you make an inference that leads to a theory (induction),
you then subject that theory to falsificationist testing.

"It appears all the swans I've seen are white, therefore I posit all swans are
white. Oh, wait, there's a black one, nevermind."

On the Bayesian vs Frequentist aspect... Falsification is what you should do
to theories, not model parameters. If you have a coin and you're trying to
figure out the probability of heads P(H), then you have your model of coin-
flipping (Bernoulli process) and you're trying to estimate the model's
parameter, so you do statistical inference given some sequence of coin flips.

It doesn't seem right to apply frequentist null testing here, because you want
to estimate the model parameter, not make some binary decision. What if you
had some prior data you want to include? Or you observe new data in the
future? This is exactly what Bayesian inference is set up for. And a lot of
science is not about falsifying theories in theoryspace but about estimating
model parameters in parameter space, in cases where we all agree on a
particular model.
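As a minimal sketch of what I mean (the uniform prior, the made-up flip
sequences, and the use of scipy are all just illustrative choices):

```python
from scipy.stats import beta

# Beta(a, b) is conjugate to the Bernoulli likelihood, so updating on
# coin flips reduces to counting heads and tails.
a, b = 1, 1                               # uniform prior over P(H)

# Fold in prior data, then new data observed later, in any order.
for flips in ([1, 1, 0, 1], [0, 0, 1]):   # 1 = heads (made-up sequences)
    a += sum(flips)
    b += len(flips) - sum(flips)

posterior = beta(a, b)
print(posterior.mean())            # point estimate of P(H)
print(posterior.interval(0.95))    # 95% credible interval
```

Prior data and future data enter the same way, which is exactly the updating
story that null testing has no natural place for.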

Moreover, a big advantage of Bayesian statistics is that it generally requires
you to make your model and assumptions explicit, which makes the model much
easier to scrutinize than a frequentist statistical test.

~~~
claudiawerner
> It seems clear to me that Popperian falsification is indeed the only way to
> separate the wheat from the chaff in theoryspace, however, generating
> theories generally requires inductive reasoning.

A report analyzed the entries to _Nature_ one year and found that very, very
few of the papers actually met Popper's criteria for falsifiable hypotheses;
in fact, most of the papers started out with an exploratory aim and documented
their findings. Adopting the Popperian idea that only falsifiable claims are
science would therefore mean rejecting good science. One must also consider
what opportunities would be missed if we forced good science to adopt strictly
falsifiable hypotheses before research commences: every exploratory paper
would need to be redefined (or rather, firstly defined) as a falsifiable
hypothesis... in an attempt to please Popperians.

It is also a stretch to say that science must be empirical or carried out in a
particular way to which falsifiability is congenial. There are good arguments
that even philosophy[1] or mathematics can be considered sciences, and indeed
they were (see _Wissenschaft_ in Kant and Hegel for instance).

Furthermore, it's been argued that Popper's theory of falsification actually
includes bad science (pseudoscience). From SEP[0]:

> Strictly speaking, his criterion excludes the possibility that there can be
> a pseudoscientific claim that is refutable. According to Larry Laudan (1983,
> 121), it “has the untoward consequence of countenancing as ‘scientific’
> every crank claim which makes ascertainably false assertions”. Astrology,
> rightly taken by Popper as an unusually clear example of a pseudoscience,
> has in fact been tested and thoroughly refuted (Culver and Ianna 1988;
> Carlson 1985). Similarly, the major threats to the scientific status of
> psychoanalysis, another of his major targets, do not come from claims that
> it is untestable but from claims that it has been tested and failed the
> tests.

[0] [https://plato.stanford.edu/entries/pseudo-science/#KarPop](https://plato.stanford.edu/entries/pseudo-science/#KarPop)

[1] [https://web.archive.org/web/20170621073301/http://www.philos...](https://web.archive.org/web/20170621073301/http://www.philosophyisscience.com:80/p/philosophy-is-not-science.html)

~~~
JohnStrangeII
Short summary first, long rant follows. Summary: Mathematics and small parts
of philosophy and many other disciplines rightly qualify as non-empirical
science, but no lesson can or should be drawn from them about empirical
science. Empirical questions are fundamentally different from non-empirical
questions.

Long rant:

I don't disagree fundamentally, but what you say about philosophy irks me as a
philosopher. If you looked at a random sample of publications in philosophy,
you would find that only a very small percentage of them have the rigour and
exactness that we associate with science. I have also met many philosophers,
perhaps even the majority of all I've met so far, who would not describe
themselves as scientists.

The remaining part of philosophy, which adheres to strict standards and uses
mathematical methods, is akin to mathematics and computer science, and is
perhaps most similar to formal linguistics and economics in the way it works.
I personally consider this part of philosophy a science; it is a kind of
applied mathematics, although often speculative and conditional on axioms that
are not as evident as in mathematics. Other disciplines have similar non-
empirical parts; a typical example is social choice theory in sociology, which
I would consider a scientific theory although it is not empirical. In the end,
it's applied mathematics.

I agree that mathematics and the relevantly rigorous and formal parts of other
disciplines are non-empirical science, but to these areas the debate about
hypothesis testing and the right use of statistics simply doesn't apply. It is
a fallacy to presume that, because these fields exist, clearly empirical
disciplines (or parts thereof) could do without statistics. If a question is
empirical, then it _has_ to be addressed with proper quantitative methodology.

This is extremely important to me personally, since I've been in deep
disagreement with colleagues for many years about this issue. They work in
related disciplines within our philosophy institute and habitually make
"qualitative empirical analyses" of texts without treating them as mere
precursors to quantitative studies. They see no problems with their
methodology, even when I point out to them that the size of their samples
would be too small to support the generalizations they make if they _did_ make
quantitative studies. To me, this is absolutely crazy; I just can't see how a
qualitative analysis of 20 texts could allow you to take these as
representative of hundreds of thousands of texts if a quantitative study of
the same texts could not possibly reveal anything useful because the sample
size is too small. What's worse, their whole discipline seems to be based on
this kind of extremely small-scale qualitative empirical study, plus a very
vague mix of fairly imprecise philosophy and common sense. I'm a nice person
and get along with my colleagues well, so I won't ever tell them my opinion,
but if I'm honest, I'd say that their discipline is pseudo-science or, at
best, imprecise, non-scientific philosophy in disguise. (To make this clear, I
have no problem with imprecise philosophy and have done it occasionally
myself; I just don't think it can qualify as science, and not many
philosophers consider it as such.)

Long story short, empirical questions have to be addressed with quantitative
methodology or you get what I'd call "elaborate opinions".

~~~
cbkeller
Sounds fair enough.

I do wish that physical scientists, including myself, had more philosophical
training though, even if it's not science _per se_, such that we could
reliably have original, educated opinions on empiricism, rationalism, etc.

~~~
learnfromerror
Might I mention my book, Statistical Inference as Severe Testing: How to Get
Beyond the Statistics Wars (Mayo, CUP).
[https://twitter.com/learnfromerror/status/107197020637171302...](https://twitter.com/learnfromerror/status/1071970206371713026)
I did not know of Hacker News, but I was trying to trace why my blog
errorstatistics.com got its maximum number of hits in over 7 years as a result
of my posting their letter on the ASA statement. Now I'll disappear.

~~~
JohnStrangeII
Very interesting, I'll order your book tomorrow!

------
learnfromerror
The authors of the letter use “induction” to mean what I call a probabilism:
here probability is used to quantify degrees of belief, confirmation, or
plausibility support, whether absolute or comparative. This is rooted in the
old idea of confirmation by enumerative induction. Conclusions of statistical
tests go strictly beyond their premises, but it does not follow that we cannot
assess the warrant of the inference without using a probabilism. They are
qualified by the error-probing capacities of the tools. A claim is severely
tested when it is subjected to, and passes, a test that probably would have
found it flawed if it were. The notion isn’t limited to formal testing, but
holds as well for estimation, prediction, exploration and problem solving. You
don’t have evidence for a claim if little if anything has been done to probe
and rule out how it may be flawed. That is the minimal principle for evidence.

Popper spoke similarly of corroboration, only he was unable to cash it out. He
wasn’t sufficiently well versed in statistics, and anyway, he wanted to
distinguish corroboration from induction as the latter was being used at the
time. The same impetus led Neyman to speak of inference as an act (of
inferring). I explain all this in my recent book, Statistical Inference as
Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP). As I
say there:

“In my view an account of what is warranted and unwarranted to infer – a
normative epistemology – is not a matter of using probability to assign
rational beliefs, but to control and assess how well probed claims are.” (p.
54)

------
DSingularity
Okay, so we agree with some of these points. But after reading the ASA
statement, it seems like this letter was written in response to something
other than what I read. The ASA statement does not seem to be fundamentally
dismissing p-values. On the contrary, it seems to provide guidelines for
researchers and reviewers on how to use them properly.

~~~
learnfromerror
I think it is the way they blithely mention that some statisticians prefer to
use other methods, with a list of examples, that suggests they are blessing
them. Surely these other methods ought to be scrutinized; we don't know that
they would detect irreplication as significance tests do. The big issue is
really all about using frequentist methods with biasing selection effects:
multiple testing, cherry-picking, data-dredging, post hoc subgroups, etc. The
only problem is that many who espouse the "other methods" declare that these
data-dependent moves do not alter their inferences. Some are against
adjustments for multiplicity, and some even deny that error control matters
(this stems from the Likelihood Principle). If you consider the ASA guide as
allowing that (in tension with principle 4, against data dredging, when it
comes to frequentist methods), then the danger the authors mention is real.
What was, and is, really needed is a discussion about whether error control
matters to inference.

------
wbl
Hume would have a fit. All scientific knowledge is inductive!

~~~
cbkeller
Yeah, or Francis Bacon, or any of the original Empiricists.

I find a lot more to like in the ASA's statement [1] than in any of these
responses, which seem to act as if Karl Popper were the only one to ever have
a worthwhile philosophy of science.

This paragraph of the response was particularly telling to me:

> _A judgment against the validity of inductive reasoning for generating
> scientific knowledge does not rule out its utility for other purposes. For
> example, the demonstrated utility of standard inductive Bayesian reasoning
> for some engineering applications is outside the scope of our current
> discussion._

Translation: Ok, so maybe induction works fine when you're going to build a
bridge where someone's life is on the line, but it still has no place in
science. Falsification or bust!

[1]
[https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1...](https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108)

~~~
astazangasta
It's ridiculous. Even the idea that frequentist use of p-values is deductive
is garbage. Where do they think the models being tested are coming from?
Induction!

------
tfgg
I'm guessing here that ASA is 'American Statistical Association', rather than
say, the UK's 'Advertising Standards Authority'?

------
roenxi
One of the joys of programming is that, unlike maths, when discovering an
exciting new paradigm there is an opportunity to rationalise all the notation
and use a new language.

By contrast, I struggle to deal with probability and statistics without
developing a strong suspicion that the names the objects are called by mean
something completely different from what those names mean in common English.

It is nice to see ongoing authoritative commentary that the large majority do
not understand what a p-value actually implies. The thread of discourse seems
to be that, even assuming all the academics are completely honest (i.e., no
academic fraud, no hand-waving), the number of false results that are awarded
statistical significance is much higher than it should be. The standard
p-value threshold of 5% does not imply that 95% of statistically significant
studies are not due to chance, particularly amongst the subset that make it
into the public eye.
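A back-of-the-envelope calculation shows why (the base rate and power below
are assumed numbers, purely for illustration):

```python
# Suppose only 10% of tested hypotheses are actually true, tests have
# 80% power, and the significance threshold is 0.05 (all assumed).
base_rate, power, alpha = 0.10, 0.80, 0.05

true_positives = base_rate * power           # 0.08 of all tests
false_positives = (1 - base_rate) * alpha    # 0.045 of all tests
fdr = false_positives / (true_positives + false_positives)
print(f"{fdr:.0%} of 'significant' results are false")  # 36%
```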

~~~
BenoitEssiambre
The language of statistics is indeed dishonest. In normal language,
"significant" has a meaning close to "large". What they mean in statistics is
only that there is a detectable signal: data this extreme would be unlikely if
it were pure noise.

They should call it "detectable"; some people have suggested "discernible".

However, there is another fatal flaw in how p-values are used. They are
usually used for rejecting infinitesimally small hypotheses. The null
hypothesis is stated as "effect exactly equals 0.00000000000...". In practice,
there are no experiments that have exactly zero effect. There is always at
least a very small systematic bias due to imperfectly calibrated instruments
or small methodological variations between researchers.

Even if you do everything else right, even if you pre-register your study,
with enough data a null hypothesis test will always pick up on these small
biases and make the results significant.
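A quick simulation makes the point (the bias of 0.002 and the sample sizes
are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# The "effect" is a tiny systematic bias of 0.002, not a real phenomenon.
for n in (1_000, 100_000, 10_000_000):
    sample = rng.normal(loc=0.002, scale=1.0, size=n)
    t, p = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n = {n:>10,}   p = {p:.2g}")
# p hovers around chance at small n, then collapses toward zero as n
# grows, "detecting" nothing but the bias.
```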

If you are looking to reject a null hypothesis, I can tell you in advance: all
experiments have a non-zero bias, so with enough data all results are
statistically significant at p < 0.000000001. There, I just saved the
scientific world a ton of money; they don't have to do all these experiments.
Just reference this comment in your paper to show significance.

At least here, the language seems honest. Rejecting a null hypothesis
correctly conveys the idea of having rejected nothing.

Using a Popperian approach is great, but you should reject a portion of the
hypothesis space that is bigger than zero.

~~~
joshuamorton
This isn't the case. I can make two hypotheses that are contradictory. They
can't both be statistically significant from the same data, no matter how
biased my equipment or how large the sample.

~~~
BenoitEssiambre
Yes, the problem is that with null hypothesis testing, the two contradictory
hypotheses are usually "effect is exactly 0.00000000..." and "there is an
effect different from 0.000000000...". The problem stems from the fact that
people pit an infinitesimally small hypothesis against an infinitely large
one. Rejecting an infinitesimally small hypothesis is very, very easy: the
tiniest experimental bias will allow rejection if you have enough data.

It would be impressive if "there is an effect different from zero" were
rejected (but no one would ever be able to do this). Science should try to
reject finitely large hypotheses, at a minimum something like: "the effect is
larger than some reasonable margin to account for experimental bias and
imperfect tools". At least a small chunk of the hypothesis space should be
rejected for your experiment to be worth something. You sort of get that with
confidence intervals, since you can see how far the lower bound is from zero.
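Here is a sketch of one way to do that, with a shifted, one-sided null in
place of the usual point null (the margin and the data are made up, and this
amounts to a minimum-effect test rather than the standard one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.5, scale=1.0, size=200)  # made-up measurements

margin = 0.1  # bias we are willing to blame on imperfect instruments
# Test H0: effect <= margin instead of H0: effect == 0 exactly, so a
# rejection rules out a finite chunk of the hypothesis space.
result = stats.ttest_1samp(data, popmean=margin, alternative="greater")
print(f"p = {result.pvalue:.3g}")
```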

~~~
joshuamorton
I mean, you can still run into issues here:

The effect size is <1% and the effect size is >=1%.

Or 100 different hypotheses for 1-100% effect sizes. They're mutually
exclusive, so only one will be true.

So again, while in general I agree with you that Bayesian methods have
significant advantages, this objection isn't well founded.

