
It’s not just p=0.048 vs. p=0.052 - luu
https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/
======
Symmetry
Put a Number on It! did a piece a while ago going through some psychology
papers that were part of a replication effort. They found that half failed to
replicate, but that people in a betting market could often tell which ones
were going to replicate. The author also did a blind test himself and was
also able to guess which ones would replicate. He laid out several rules of
thumb, the most relevant here being _Jacob’s Rule of Anti-Significance: A
result with a p-value just above 0.05 could well be true. A result with a
p-value just below 0.05 is almost certainly false._

 _More importantly, p=0.06 means that the researchers are honest. They could
have easily p-hacked the results below 0.05 but chose not to. The opposite is
true when p=0.049._

[https://putanumonit.com/2018/09/07/the-scent-of-bad-psychology/](https://putanumonit.com/2018/09/07/the-scent-of-bad-psychology/)

~~~
UncleMeat
This is true in more than just psych. A strategy for systems papers is to
select the system that is second on the graphs. The author's results are not
trustworthy, but the second place system was usable and ran reasonably well in
the hands of somebody other than the original author.

~~~
hadsed
For ML practitioners out there, this is a great method for a field also in a
replication crisis.

------
rom1v
Related, about p-values:

> Here's the problem in a nutshell: If you run 1000 experiments over the
> course of your career, and you get a significant effect (p < .05) in 95 of
> those experiments, you might expect that 5% of these 95 significant effects
> would be false positives. However, as an example later in this blog post
> will show, the actual false positive rate may be 47%.

> […] However, this is a statement about what happens when the null hypothesis
> is actually true. In real research, we don't know whether the null
> hypothesis is actually true. If we knew that, we wouldn't need any
> statistics! In real research, we have a p value, and we want to know whether
> we should accept or reject the null hypothesis. The probability of a false
> positive in that situation is not the same as the probability of a false
> positive when the null hypothesis is true. It can be way higher.

[https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-in-p-values](https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-in-p-values)

> Here's a simpler thought experiment that gets across the point of why
> p(null | significant effect) /= p(significant effect | null), and why
> p-values are flawed as stated in the post.

> Imagine a society where scientists are really, really bad at hypothesis
> generation. In fact, they're so bad that they only test null hypotheses that
> are true. So in this hypothetical society, the null hypothesis in any
> scientific experiment ever done is true. But statistically using a p value
> of 0.05, we'll still reject the null in 5% of experiments. And those
> experiments will then end up being published in scientific literature. But
> then this society's scientific literature now only contains false results -
> literally all published scientific results are false.

> Of course, in real life, we hope that our scientists have better intuition
> for what is in fact true - that is, we hope that the "prior" probability in
> Bayes' theorem, p(null), is not 1.

[https://news.ycombinator.com/item?id=16917158](https://news.ycombinator.com/item?id=16917158)
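
A minimal simulation sketch of the quoted point, in Python. The prior probability that a tested hypothesis is real, the power, and the alpha below are made-up illustrative values, not numbers from the linked post (though with these particular choices the share of false positives comes out near the 47% mentioned above):

    import numpy as np

    rng = np.random.default_rng(0)
    n_experiments = 1_000_000
    p_real_effect = 0.10   # hypothetical: 10% of tested hypotheses are real effects
    power = 0.50           # hypothetical: 50% chance of detecting a real effect
    alpha = 0.05

    real = rng.random(n_experiments) < p_real_effect
    significant = np.where(real,
                           rng.random(n_experiments) < power,   # true positives
                           rng.random(n_experiments) < alpha)   # false positives

    false_share = np.mean(~real[significant])
    print(f"share of 'significant' results that are false: {false_share:.2f}")
    # ~0.47 here: 0.9*0.05 false positives vs. 0.1*0.5 true positives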

~~~
swsieber
For about a year or so now I've been wanting to make a game about science.
It'd basically be a research and discovery simulator, and there would be a
free-play mode. Some of the knobs would be the number of required
replications, the required p-value, and how good people are at generating
hypotheses.

I think it'd be eye opening.

~~~
marcosdumay
There is a "people will mostly replicate/extend articles about X, and ignore
articles about Y" (groupthink) effect that I imagine is also very relevant.

~~~
swsieber
Oooh, that's a great one to add to the list!

------
twelfthnight
I think the problem with p-values is that they train us to think about
uncertainty without nuance. They hide the inherent trade-off between the cost
of taking on risk and the cost of reducing uncertainty, since the threshold is
fixed at p=0.05. Taken to the extreme, with a large enough sample we can
nearly always find significant differences between populations; the difference
will just be very small and the sample size enormous.
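
As a sketch of that last point (hypothetical numbers: a true difference of 0.01 standard deviations, which nobody would care about in practice):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_diff = 0.01  # a practically meaningless difference

    for n in (1_000, 100_000, 10_000_000):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        print(f"n={n:>10,}  p={stats.ttest_ind(a, b).pvalue:.3g}")
    # The effect never gets any bigger, but with enough samples the
    # p-value still becomes arbitrarily small.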

Recently I worked with a client to interpret results from an A/B test where A
performed better than B with 85% confidence (based on credible intervals,
accounting for multiple comparisons). We therefore recommended A. In a group
phone call, the client told her colleagues that our company doesn't know what
we're talking about because 85% confidence of a difference isn't statistically
significant (i.e. isn't 95% confident). We lost their business.
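
For anyone curious what such an "85% confidence" number looks like mechanically, here is a minimal Beta-Binomial sketch with made-up conversion counts (not the client's data, and not necessarily the exact model we used):

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up counts: conversions / trials in each arm.
    conv_a, n_a = 125, 1000
    conv_b, n_b = 110, 1000

    # Posterior draws under uniform Beta(1, 1) priors.
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)

    print(f"P(A's rate > B's rate | data) ~ {np.mean(post_a > post_b):.2f}")
    # Around 0.85 with these counts: direct, decision-relevant evidence,
    # whether or not it clears an arbitrary 0.95 line.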

This was a shame because gathering the data for the experiment was expensive
and the downside of making the wrong choice was low. It is often the case that
taking on more risk makes more sense than hitting diminishing returns on
shrinking p-values with extra sample.

------
tgb
I think p-values are actually somewhat demonized and I have grown to like them
more and more over time. The standard interpretation is actually overly
complicated for some reason and can be simplified to "over the long run, your
p-value cutoff is an upper bound on your rate of type I errors." That's simple
and actionable, and it's an immediate consequence of the definition of
p-values! Frankly, I don't know why textbooks don't give this as the
definition of p-values, and they should reserve "probability of an event at
least as extreme conditional upon the null hypothesis" as the thing-you-show-
to-prove-it's-a-p-value.
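
That long-run reading is easy to check by simulation; a minimal sketch with the null actually true:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    alpha = 0.05

    # Both groups come from the same distribution, so every rejection
    # is a type I error.
    rejections = sum(
        stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue < alpha
        for _ in range(20_000)
    )
    print(f"long-run type I error rate: {rejections / 20_000:.3f}")  # ~0.05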

The current trend of saying that "cutting p-values off at a specific value is
bad" makes me worry. Now you can argue that your p=0.06 result shouldn't be
rejected when really we should probably be pushing for stricter standards
rather than inching towards looser ones. It also destroys the nice
interpretation of p-values above. P-values were literally made to be cut off -
if you want to stop doing that, you need to show me a coherent philosophy of
what to do instead.

What I do think is true is the problem you have where part A of the experiment
suggests X so you test X more directly in part B with a weaker but more
specific test and get p=0.06 and now you can't publish. That's a dumb cutoff:
a p=0.06 result still shifts our belief towards X, so it does nothing but
bolster part A. Typically papers do this several times, and the
marginal 'failure' of one step should not sink the entire ship. This is a case
where a Bayesian analysis might be more useful as it can incorporate weak
evidence.

But the problem I see often is not that p-values are misused but that they
were junk in the first place. For example, the widely-used DESeq2 (as well as
some competitors in RNA-seq differential expression analysis) will happily
spit out p-values of 10^-100 for an experiment with only four replicates in
each of two conditions! There is no way you can get that level of evidence
from just four replicates, even if the values are 0,0,0,0 and 1e6,1e6,1e6,1e6.
The assumption of normality is reasonable near the mean but gets increasingly
inaccurate in the tail, which is exactly where you end up when you do things
like sort 30,000 tests by their p-values. In fact taking a p-value cutoff is
probably the only reasonable thing to do here - that way you'll ignore the
fact that it's absurdly small and just treat it as "small enough".
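
One rough way to see why four-versus-four cannot support a p-value like 10^-100: an exact permutation test on the most extreme data imaginable (a generic permutation test, not what DESeq2 actually does):

    from itertools import combinations

    # The most extreme possible data: four zeros vs. four huge values.
    values = [0, 0, 0, 0, 1e6, 1e6, 1e6, 1e6]
    observed = abs(sum(values[4:]) - sum(values[:4])) / 4

    # Enumerate every way of splitting the 8 samples into two groups of 4
    # and count how often the group difference is at least as extreme.
    as_extreme, total = 0, 0
    for group_a in combinations(range(8), 4):
        group_b = [i for i in range(8) if i not in group_a]
        diff = abs(sum(values[i] for i in group_a) -
                   sum(values[i] for i in group_b)) / 4
        as_extreme += diff >= observed
        total += 1

    print(as_extreme, "/", total)  # 2 / 70: the smallest two-sided p-value
    print(as_extreme / total)      # you can possibly get is about 0.029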

~~~
BenoitEssiambre
That interpretation makes the test basically useless because we know _a
priori_ that any two variables for things within each other's light cone
affect each other at least a little. More practically, since the test tells
you nothing about the size of the effect, it will pick up on the tiniest bias
in your experimental procedure and always reject the null if you have enough
data.

From the author of the article: "The general point reminds me of my dictum
that statistical hypothesis testing works the opposite way that people think
it does. The usual thinking is that if a hyp test rejects, you’ve learned
something, but if the test does not reject, you can’t say anything. I’d say
it’s the opposite: if the test rejects, you haven’t learned anything—after
all, we know ahead of time that just about all null hypotheses of interest are
false—but if the test doesn’t reject, you’ve learned the useful fact that you
don’t have enough data in your analysis to distinguish from pure noise."

([https://statmodeling.stat.columbia.edu/2019/08/18/i-feel-like-the-really-solid-information-therein-comes-from-non-or-negative-correlations/](https://statmodeling.stat.columbia.edu/2019/08/18/i-feel-like-the-really-solid-information-therein-comes-from-non-or-negative-correlations/))

~~~
tgb
Compound/interval null hypotheses basically solve this effect-size problem and
probably should be used more.
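
A crude sketch of the idea (a one-sided test against a smallest-effect-of-interest delta; a full equivalence/TOST setup would handle both directions, and all the numbers below are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    delta = 0.1  # smallest effect anyone would actually care about

    # Huge sample with a real but negligible effect of 0.02.
    sample = rng.normal(0.02, 1.0, 1_000_000)

    # Point null (mu = 0): rejected, even though the effect is negligible.
    print(stats.ttest_1samp(sample, popmean=0.0).pvalue)

    # Interval-style null (mu <= delta): not rejected, which is the
    # answer you actually want here.
    print(stats.ttest_1samp(sample, popmean=delta, alternative='greater').pvalue)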

~~~
BenoitEssiambre
Yes, they should be required.

------
pacbard
It looks like the blog author completely missed the point of the statistical
significance discussion going on. Most first-tier journals in the social
sciences have an acceptance rate of about 5%. At the margin, the difference
between acceptance and rejection could be having one more statistically
significant result in your table than the paper that was submitted right
before or after yours.

The problem with a 0.048 and a 0.052 is not a mathematical one but an
interpretational one. Reviewers are conditioned to be very skeptical of
non-significant results and to use “underpowered-ness” as grounds for
rejection. As a result, we get publication bias and p-hacking.

~~~
JustFinishedBSG
I think you should reread the article because that's exactly what the blog
author says. The author, btw, is Andrew Gelman, not just some random guy on
Medium, and his blog is well worth reading. Fighting bad stats in science is
kind of his hobby/life mission.

~~~
dash2
Yeah, he had a major part in _creating_ the stats discussion that is going on.
I doubt that he has missed its point.

------
thanatropism
It's worth noting that at their inception, p-values and null hypothesis
testing were separate methodologies; there was even a bitter rivalry between
their chief developers (Fisher on one side, Neyman and Pearson on the other).

It wasn't until much later that textbooks started to merge the two. It may be
worth reviewing Neyman and Pearson's attacks on Fisher on this matter.

------
mehrdadn
> To say it again: it is completely consistent with the null hypothesis to see
> p-values of 0.2 and 0.005 from two replications of the same damn experiment.

I don't really follow this. Could someone clarify what is meant here? At what
point would this author say something is _not_ consistent with the null
hypothesis?

~~~
pacbard
He is coming at that conclusion from a Bayesian point of view on statistics.
He sees the p-value as a random variable that can take values from 0 to 1 and
follows some distribution. Under that view, observing p-values of 0.20 and
0.005 is completely reasonable, even if unlikely. Those are just two draws
from a random variable.

Edit: Under Bayesian statistics, testing the null hypothesis is a moot point,
as it becomes possible to directly model the distribution of the possible
effects. Think of it as looking at a picture of something (the p-value) vs.
watching a movie of it (the distribution of the effects).
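
A quick way to see how far apart two replications can land when the null really is true (a generic simulation, not anything from the article):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # Run the "same experiment" many times with no real effect.
    pvals = np.array([
        stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
        for _ in range(10_000)
    ])

    # Under the null the p-value is (approximately) uniform on [0, 1],
    # so one replication at 0.005 and another at 0.2 is unremarkable.
    print("share below 0.005:", np.mean(pvals < 0.005))  # ~0.005
    print("share above 0.2:  ", np.mean(pvals > 0.2))    # ~0.8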

~~~
pfortuny
What he says is (I gather) worse: those events are only separated by 1.1
standard deviations, which is little.
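
For reference, the 1.1 figure can be reproduced directly, assuming two-sided tests and equal standard errors for the two replications:

    from math import sqrt
    from scipy import stats

    # z-scores for two-sided p-values of 0.2 and 0.005
    z1 = stats.norm.isf(0.2 / 2)    # ~1.28
    z2 = stats.norm.isf(0.005 / 2)  # ~2.81

    # Difference between the two estimates, in units of the standard error
    # of that difference (sqrt(2) because both estimates carry noise).
    print((z2 - z1) / sqrt(2))      # ~1.1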

~~~
mehrdadn
I think that's what he's saying too, but what is that supposed to show? Is he
arguing against some claim that every interval of 1 standard deviation is
equally significant? Did anybody make this claim? So far as I know, nobody
considers (say) a 6-sigma effect to be 6 times stronger than a 1-sigma
effect...

~~~
kgwgk
I don’t get it either. It’s a trivial consequence of having a threshold: if we
say two cities are “far” when they are at least 1000 miles away then
Washington D.C. is not far from Jacksonville while Boston is far from
Jacksonville, even though Boston is not far from Washington.

~~~
nerdponx
That's not quite the problem.

Let's assume you've already decided in advance what "far" means.

Without moving either city from its current location, the same experiment can
give you "very far" and "very close" in identical replications.

~~~
kgwgk
I’m commenting on pfortuny’s (correct) interpretation of the original post by
Gelman, “those events are only separated by 1.1 standard deviations, which is
little” (the difference between a significant result and a non-significant
result may not itself be significant = a city which is far from Jacksonville
and a city which is not far from Jacksonville may not be far from each other).

------
paulddraper
> Also, to get technical for a moment, the p-value is not the “probability of
> happening by chance.” But we can just chalk that up to a casual writing
> style.

Isn't it though? The probability of a deviation this large (or larger)
happening purely by chance[1]?

This article is highly critical, but the criticism goes over my head at least.

[1] assuming a normally distributed population

~~~
andy_wrote
I think the part that you have correctly included that people forget or elide
is that it's the probability under a specific null hypothesis. So it's a
function of what you have chosen for that - normal distributions, a certain
parameter value of 0, etc. So this means that a) it's not the probability
you'd see in the real world under repeated performance, and b) it's not the
probability under other reasonable null hypotheses. Maybe under the null
hypothesis parameter = 0 the observed result looks improbably extreme (a tiny
p-value), but under parameter = 0.1, or with a different assumed underlying
distribution, you wouldn't see something so extreme.
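
A tiny illustration of point b), with made-up data and two different point nulls (nothing here is from the article):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    sample = rng.normal(0.08, 1.0, 200)  # made-up data

    # Same data, two different nulls: the p-value is a statement about
    # the chosen null, not about the data alone.
    for null_mean in (0.0, 0.1):
        print(f"null mean = {null_mean}: p = "
              f"{stats.ttest_1samp(sample, popmean=null_mean).pvalue:.3f}")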

I'd guess that the original writer understands this, and that Gelman is only
pointing it out because casual readers sometimes don't mentally retain the
full baggage that the p-value carries.

------
deanstag
For somebody who has never had formal training in statistics or discussions
like this, what is a good book to start grokking these concepts?

~~~
itcrowd
If you want a college textbook, there are literally hundreds with titles
similar to "introduction to probability and statistics" and you should choose
the cheapest second hand book you can find (or that they have in your
library). Content is mostly the same, of course the writing style will be
different and some may be more appealing to you personally. Search engines are
your friend for narrowing down your list.

For a popsci work, you could check out "How to Lie with Statistics", a
classic.

[https://en.m.wikipedia.org/wiki/How_to_Lie_with_Statistics](https://en.m.wikipedia.org/wiki/How_to_Lie_with_Statistics)

------
blauditore
> Also, to get technical for a moment, the p-value is not the “probability of
> happening by chance.”

Is it not? According to Wikipedia, it's "[...] the probability that, when the
null hypothesis is true, the statistical summary [...] would be equal to, or
more extreme than, the actual observed results." This sounds pretty much like
"probability of happening by chance".

~~~
itcrowd
It sounds pretty much the same, but is distinctly different. That's where much
of the confusion in the popular press comes from.

The difference is that, as highlighted in your quote, there is some null
hypothesis that is assumed when discussing p-values.

For example: what is the probability of drawing x>2 when the underlying
distribution is assumed to be a standard normal distribution N(0,1)?

The probability is small in this case, and observing x>2 could therefore
provide evidence to reject the null hypothesis (i.e. to conclude the
distribution is not standard normal). But it doesn't tell you the probability
that the null hypothesis is true or that the observation "happened by chance";
it only gives you evidence for rejecting (or not rejecting) the null
hypothesis.
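
To put a number on that example (assuming the standard normal null):

    from scipy import stats

    # Probability of drawing x > 2 if the distribution really is N(0, 1).
    print(stats.norm.sf(2))  # ~0.023: the one-sided p-value for an observation of 2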

The wiki has a more elaborate explanation, and probably better examples than
mine.

------
Istom1n
All this reminded me of an exchange where similar games were closed the
fastest. [0]

[0] [https://www.bestbitcoindice.com/wp-content/uploads/2017/11/YoBit_Screenshot_1903x1080.jpg](https://www.bestbitcoindice.com/wp-content/uploads/2017/11/YoBit_Screenshot_1903x1080.jpg)

------
kazinator
A 5% chance of producing the result by fluke even when the hypothesis is false
is obviously too high for anything important. Splitting hairs over 0.048 and
0.052 is ridiculous: it revolves around tiny differences within a gaping
uncertainty. Neither value is anywhere in the neighborhood of where the
benchmark should be.

------
pontus
Just putting this here in case anyone is interested in a slightly different
view on p values.

[https://mindbowling.wordpress.com/2016/07/19/p-values/](https://mindbowling.wordpress.com/2016/07/19/p-values/)

------
ardacinar
Well, obviously, the result with p=0.052 is the will of the people and should
be implemented to its most extreme.

------
DonHopkins
If p = np, what's the chance that n = 0.045 or 0.052?

~~~
firebacon
1/420 :)

~~~
Istom1n
The cube has 420 sides?

------
not_a_cop75
Sure. Apply some fuzzy logic: highly significant, somewhat significant. And
honestly, we're in an age where a somewhat significant result can often be
bolstered later (in the drug industry) by coupling drugs. It's really time to
stop believing everything has to be unifactorial. Everything we care about is
multifactorial, and even slight significance could make a difference when
added up.

