
What a debate about p-values shows about science - fanf2
https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005
======
cwyers
> Ideally, Lakens says, the level of statistical significance needed to prove
> a hypothesis depends on how outlandish the hypothesis is.

> Yes, you’d want a very low p-value in a study that claims mental telepathy
> is possible. But do you need such stringent criteria for a well-worn idea?
> The high standards could impede young PhDs with low budgets.

Watching someone try to be Bayesian without using Bayesian techniques is
painful.

> There are also new, advanced statistical techniques — like Bayesian analysis
> — that, in some ways, more directly evaluate a study’s outcome.

Almost as painful as watching Bayesian analysis described as "new."

~~~
vinchuco
Naive question: Is there a good technique for a bot that detects factually
wrong statements?

~~~
09094920394314
Sure, just give me a definition of "factually wrong" and I'll build it for you :)

Less snarky: not really

------
DamonHD
Right at the end is a key thing: reward failure better.

Publish negative results.

I'm kinda embarrassed by my negative results, such as they are, but in this
latest case I'm clear that it was the method I most wanted reviewed, since
others may then be able to use it.

(Warning: not a real scientist, but being validated by real scientists in the
case in question.)

~~~
kurthr
Many have discussed this, but as with writing bad/buggy code, there are just
too many (combinatorial) ways to get a negative result, and most of them are
frankly boring, distracting, or worse than useless. You would need a review
paper of all the most common failures, screw-ups, and anti-patterns for them
to be particularly useful... preferably with a good write-up of proper
procedure. Some labs publish good protocols, but many (academics!) consider
them to be secrets to be used for advantage over other labs.

Of course, given the current p-hacking, publish-at-all-costs environment,
it's not clear you need to be particularly useful to be published. I like the
idea that has been promoted of requiring a certain number of novel
replications in a tenure application as one way forward... at least as a goal
or as a professional shaming tool.

~~~
Sacho
Er, the type of result (positive/negative) should be orthogonal to the
methodology, from the point of view of the reviewer. Reviewers should be
checking for all these many combinatorial ways to get an _incorrect_ result,
whether it's positive or negative.

It may be true that positive results are more useful than negative ones, but
that doesn't mean we should place zero value on negative results. I think
that's what the parent was trying to point out.

~~~
danstanflan
Agreed, a negative result should not be confused with shoddily executed
science. A solid methodology that produces a negative result could save other
researchers from going down the same rabbit hole to reach the same
conclusion. We should record these conclusions. Perhaps part of the problem
would be that the increase in the number of publishable results would overload
editors and reviewers?

------
claytonjy
Pretty good take on this for a general-purpose audience. I was ready to
complain about this line for implying P(H|D) (rather than P(D|H_0), as
actually occurs):

> The researcher basically asks: How ridiculous would it be to believe the
> null hypothesis is the true answer, given the results we’re seeing?

but I think they (indirectly) correct it pretty well in the next section.

It isn't published yet, but Gelman and co's recent piece on this seems like an
important contribution to the debate [1]. Short version: take the p-value down
from its pedestal, abandon the dichotomous view of is/isn't significant, and
consider it with all the other evidence and data.

[1]: http://andrewgelman.com/2017/09/26/abandon-statistical-significance/

~~~
RA_Fisher
I really like Gelman's response. We should be using likelihoods, information,
and intervals to describe parameters.

------
bitL
The p-value is such a stupid metric. I never understood why people still use
it and conjure up the 0.05 magic number as the one that "proves" something. It
should be considered a joke, but well, there is economics and a bunch of other
"sciences" that are considered "reputable", and people there make a daily
living off dubious theories, so nothing will change.

------
paulddraper
> The case against p<.05

> The case against p<.005

> The real problem isn’t with statistical significance; it’s with the culture
> of science

Right. The substance happens in the last 10% of the article.

...and there isn't any solution or recommendation.

Unfortunately, I think this article is a really long way of saying, "There
exists p-hacking". I was hoping for more :(

------
godelski
538 actually has a good model and discussion of p-hacking. It allows you to
play around with the values.

https://fivethirtyeight.com/features/science-isnt-broken/#part1

------
teekert
Why not just be clearer and, instead of writing "p-value < 0.05", write: the
probability that the difference between these two groups is by chance is less
than 5%. Then everybody knows that if you do this a bit more than 20 times,
this result is achieved by chance alone. Then, just show the damn plots and
let me see how much the distributions overlap. I have seen insanely small
p-values with completely overlapping distributions, just because of large
amounts of data. I find the p-value near useless.

Personally I much prefer ROC curves; they are much better at showing the
difference between distributions. Still, nothing beats the raw data and
healthy skepticism.
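
To make that concrete, here is a minimal sketch (Python, made-up numbers) of
two almost completely overlapping distributions that still yield an
astronomically small p-value once the groups are large enough, together with
the ROC-style overlap measure:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000                   # hypothetical, very large groups
    a = rng.normal(0.00, 1.0, n)    # group A
    b = rng.normal(0.02, 1.0, n)    # group B: tiny true difference

    t, p = stats.ttest_ind(b, a)
    print(p)                        # astronomically small despite ~99% overlap

    # ROC-style summary: P(random draw from B > random draw from A),
    # i.e. the area under the ROC curve, via the Mann-Whitney U statistic.
    auc = stats.mannwhitneyu(b, a).statistic / (n * n)
    print(auc)                      # ~0.506, barely better than a coin flip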

~~~
FabHK
> The probability that the difference between these two groups is by chance is
> less than 5%.

That's a common misunderstanding of the p-value, but that's not what it means.
(It means: assuming there is no difference, the probability that we'd
nevertheless see a difference as big as we are seeing or even bigger is less
than 5%).
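
A minimal sketch (Python, made-up numbers) of that definition: simulate a
world in which there really is no difference, and count how often a gap at
least as big as the observed one shows up anyway.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50                 # hypothetical group size
    observed_diff = 0.4    # hypothetical observed difference in means

    # Simulate the null: both groups drawn from the *same* distribution,
    # then count how often a difference at least this large appears by chance.
    sims = 100_000
    null_diffs = (rng.normal(0, 1, (sims, n)).mean(axis=1)
                  - rng.normal(0, 1, (sims, n)).mean(axis=1))
    p = np.mean(np.abs(null_diffs) >= observed_diff)
    print(p)   # ~0.05: P(difference this big or bigger | no real difference)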

------
ggm
I co-authored a paper submitted to IMC discussing stats and network questions
with a statistician. We got bounced for being too polemical and teaching-mode
instead of classical science. The whole point of the paper was to try and
argue for a more mature outlook on statistical methods applied to network
operations!

My motivation to co-write was strongly influenced by being called out for
p-hacking. It's remarkably easy to fall from grace. I did. Walking back is
very hard when peer review doesn't want to have a polemical discussion.

------
Myrmornis
When I looked at the author list I got the impression that it was mostly
sociologists, public health researchers, etc., but not very many well-known
statisticians, or people from the traditional physical and biological
sciences. It seems like it would have been better to have more authors from
the latter groups when making such a broad suggestion; their absence suggests
that the proposal doesn't have much support in those fields.

------
SubiculumCode
A rebuttal to the p < .005 proposal:
https://www.researchgate.net/publication/319880949_Justify_your_alpha_a_response_to_Redefine_statistical_significance/figures?lo=1

------
kringldt
Along with grit, growth mindset, power posing, priming, and stereotype threat.
A lot of non-replicating trash in psychology right now.

------
searine
Only badly designed experiments over-rely on p-values.

A robustly designed experiment should produce repeatable, intuitive results.
Anyone looking at the presented data should be able to reason out the result
without needing to conduct a statistical test.

Obviously, these tests are important, but people focus too much on the p-value
rather than designing an experiment which will produce a meaningful result
(regardless of whether the result is positive or negative).

Proper experimental design is like a cake, and p-values are the frosting.
Everybody wants cake, most often with frosting. Nobody wants frosting without
the cake, or at least those that do have questionable taste.

~~~
tlb
Not all phenomena are strong enough to allow such experiments. Some make only
a small difference to an inherently noisy process. These can only be
discovered with statistical tests on substantial samples.

~~~
toufka
Certainly, but a well-designed experiment can identify the wedges that
differentiate hypothesis A from hypothesis B without requiring a statistical
analysis of noisy data.

And true, there are certain kinds of experiments that can only be done
statistically. But depending on the field, the statistical route can simply be
the easier one to take, and then it really should be considered lazy science -
again, the details matter.

In my field (biochemistry/genetics) there really is a very real effort to
design experiments that do not require statistics to verify the result. And
that effort in design is amply rewarded when the result of the experiment
becomes quite clearly binary in nature, from multiple angles, rather than a
statistical "probably".

~~~
tlb
Are there many real examples of studies where statistical tests are needed to
see the effect, but a different experiment could demonstrate the same effect
without any statistics? I haven't been able to think of one in the fields I
know well.

~~~
toufka
In my field it often comes down to 'mechanism'. If you see an effect of part X
on part Y, via statistics, you could write a paper, and do good science.

However, if you have an experiment that pulls X from the system, watches Y
fail, then re-adds X in a new way to recreate the system, you can be much more
certain that X influences Y. That becomes a much desired, "elegant"
experiment.

The latter experiment is often harder, and might not even be proposable until
you've seen some statistical effect. However, the scientific results from that
experiment really don't hinge on statistical significance. The results are
either binary, or better yet contradictory - elucidating hitherto unknown
variables (something a purely statistical result won't help with). And if the
mechanism is shown in multiple systems, with multiple techniques, the results
can very quickly become near-definitive.

~~~
sndean
> However, if you have an experiment that pulls X from the system, watches Y
> fail, then re-adds X in a new way to recreate the system, you can be much
> more certain that X influences Y. That becomes a much desired, "elegant"
> experiment.

Yeah, in molecular biology this always seemed to be the most convincing
evidence. Deleting a gene and then adding it back. Sometimes going a step
further and overexpressing the gene. You can generally publish that data
without any statistical analysis.

> The latter experiment is often harder, and might not even be proposable
> until you've seen some statistical effect.

I think the better journals/reviewers can push back at this step, though: If
you don't have a demonstrable mechanism, you can't publish. Don't let people
stop at "...this gene is important (p < 0.005)." (I know not all things are as
simple as gene deletion and complementation)

------
BenoitEssiambre
They should change the nomenclature: "significant", which in normal language
tends to imply a large size, should become statistically "detectable" or
"discernible".

P=0.05 is best interpreted as "There is a 95% chance that this data isn't pure
noise".

Then researchers should always put more emphasis on confidence intervals.
Readers of papers can then see if the size is enough to make the results
relevant and not likely caused by experimental accident. Plus given the
perverse incentives researchers are subject to, maybe assume that the real
effect size is probably closer to the lower bound of the interval.
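
A minimal sketch (Python, hypothetical numbers) of why the interval is the
more useful readout: with a big enough sample the p-value drops far below
0.005, while the interval makes plain that the effect itself is tiny.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 500_000                    # hypothetical, very large groups
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)   # true effect is tiny

    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    _, p = stats.ttest_ind(b, a)

    print(f"p = {p:.2g}")   # typically far below 0.005
    print(f"95% CI for the difference: [{lo:.4f}, {hi:.4f}]")
    # Both ends of the interval are tiny, so "significant" says nothing here
    # about whether the effect is big enough to matter.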

Null hypotheses have limited usefulness even at p=0.005. A tiny systematic
bias in the experiment can make it cross that threshold, and there are
_always_ at least small biases. These can be caused by imperfectly calibrated
instruments, or by small differences in the way tests are performed by
different researchers on the team, etc. Inherent in null hypothesis testing is
this ridiculous assumption that there are no systematic biases. This is its
fatal flaw.

The null hypothesis test never tells you that the effect is zero or that it is
any other value; it only gives you a hint about whether or not the data is too
noisy to say anything at all.

~~~
tedsanders
> P=0.05 is best interpreted as "There is a 95% chance that this data isn't
> pure noise".

No, that's not what P=0.05 means. This misconception is exactly why p values
are so problematic. The question we want to answer is unanswerable, so instead
we subtly substitute a different question in its place and don't even realize
we've swapped concepts.

~~~
BenoitEssiambre
When I say that "There is a 95% chance that this data isn't pure noise" I'm
being slightly indirect. More directly it would be something like "There is a
5% chance that a pure noise generator would give you this result", but since I
want to reason about the actual world that the experiment is trying to study,
I think it is legitimate to do the Bayesian flip and infer the more useful
first statement about whether the world is too noisy to detect an effect using
this experiment.

~~~
nonbel
If you see clouds how likely is rain? If you see rain how likely is it there
are clouds? Are the answers to these at all the same? No.

There is absolutely nothing valid about your "flip" (called transposing the
conditional) and the damage done to the human species by people doing the
"flip" is too huge for most to comprehend. Sorry, but I really cannot
overstate the problems with this "flip".
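
To put numbers on the clouds/rain version (hypothetical base rates, purely
for illustration):

    # Hypothetical base rates, purely to make the point numeric.
    p_rain = 0.10                 # P(rain)
    p_clouds = 0.40               # P(clouds)
    p_clouds_given_rain = 0.99    # rain almost always comes with clouds

    # Bayes' rule: P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
    p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
    print(p_clouds_given_rain)    # 0.99
    print(p_rain_given_clouds)    # ~0.25: nowhere near 0.99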

Perhaps people who get paid by grants should get three chances regarding this
flip. If they strike out, they are banned from ever being funded by the NIH,
etc., for life. The result (either 99+% of current researchers would be
culled, or p-values would become appropriately rare) would do more to advance
science than any other idea I have seen.

~~~
BenoitEssiambre
If you want your science to be about the real world and not about some
hypothetical scenarios that don't exist you have no choice but to do the
transposition. Experiments are useless otherwise. There is a reason humans
naturally and instinctively do it when they talk about experiments. This is
how you describe things in real world human language.

Sure there are debates about priors. These are unavoidable. The hidden
assumption in my example is a 50/50 prior on the "too noisy" vs. "not too
noisy" hypotheses - IMO a reasonable uniform (maximum entropy) prior to start
a discussion from. Of course, one should be more precise when doing this type
of analysis in a paper. The reason a good prior is hard to define here is not
because you shouldn't do the flip; it is because the question that the t-test
is trying to answer doesn't make much sense.
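
A minimal sketch of what the flip actually requires - a stated prior and an
assumed power, both hypothetical here, using the simpler "crossed the
threshold" version of the statement:

    prior_real = 0.5   # assumed 50/50 prior that there is a real effect
    alpha = 0.05       # significance threshold
    power = 0.8        # assumed P(significant result | real effect)

    p_significant = power * prior_real + alpha * (1 - prior_real)
    posterior_real = power * prior_real / p_significant
    print(posterior_real)   # ~0.94 here; ~0.67 if power were only 0.1

    # The flipped statement depends on the prior *and* the power, not on the
    # threshold alone; the p-value by itself is only
    # P(data at least this extreme | pure noise).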

~~~
nonbel
>"If you want your science to be about the real world and not about some
hypothetical scenarios that don't exist you have no choice but to do the
transposition."

How can saying false things make your science "about the the real world"? The
only thing I can think of is that by "real world" you are referring to using
the publication of massive amounts of incorrect/questionable information as a
metric for success. Yes, that is indeed the sham going on.

>"There is a reason humans naturally and instinctively do it when they talk
about experiments. This is how you describe things in real world human
language."

I have no problem talking about data without transposing the conditional, if
you do that is some issue with your training.

>"Sure there are debates about priors."

The error you advocate (of transposing the conditional) has nothing to do with
priors.

~~~
BenoitEssiambre
There is no point in talking about the data if it doesn't tell you useful
things about the population or type of thing you are studying, if it's not for
building a predictive model of the subjects. This is the entire reason for
research.

Also P(A|B) = P(B|A) if the prior probabilities are equal, often a reasonable
prior.

~~~
nonbel
>"There is no point in talking about the data if it doesn't tell you useful
things about the population or type of thing you are studying"

Ok, but I fail to see the relevance of this to anything.

>"Also P(A|B) = P(B|A) if the prior probabilities are equal, often a
reasonable prior."

Here, A = Hypothesis and B = data (or vice versa). You are claiming that the
"probability of seeing your data whether or not the hypothesis is correct", is
often near the "probability the hypothesis is correct independent of your
data"?

Based on what? What principle leads you to think these two probabilities
should be near each other? BTW, if you actually have one this is like the holy
grail of statistics.

~~~
Bromskloss
> You are claiming that the "probability of seeing your data whether or not
> the hypothesis is correct", is often near the "probability the hypothesis is
> correct independent of your data"?

Shouldn't that be "the probability of seeing the data, _given_ that the
hypothesis is correct" and "the probability of the hypothesis being correct,
_given_ that the data has been observed"?

~~~
nonbel
No, the other poster is referring to Bayes' rule P(A|B) = P(A)*P(B|A)/P(B).

The claim is that it is reasonable to assume P(A) ~ P(B), so that P(A|B) ~
P(B|A). I am referring to P(A) and P(B) in the equation.

~~~
Bromskloss
Ah, OK, got it.

~~~
nonbel
Its hard to tell over the internet, but if you are being sarcastic please let
me know. I have no problem with clarifying further.

~~~
Bromskloss
No sarcasm! I believe I understand what you were saying now.

------
justwantaccount
I personally find NHST suspicious, even if the p-value is less than 0.005. It
means that, ASSUMING that the null hypothesis is correct, the probability of
observing data at least this extreme is less than 0.5%. That's still not zero,
though.

For example, if I try to decide whether native Hawaiians are US citizens, and
the null hypothesis is that they are, then since only ~0.2% of the total US
population is native Hawaiian, NHST would conclude that native Hawaiians
aren't US citizens. Conversely, the p-value being high isn't really meaningful
either. For example, if I try to decide whether a white person is from South
Africa, and the null hypothesis is that they are, then since ~9% of the
population in South Africa is white, I would fail to reject the null
hypothesis, even if the person is actually from Europe or literally any other
country in the world.

P-values are dependent on the sample size, too, and become more and more
sensitive to smaller differences in the ASSUMED population distribution(s),
even when the effect size is in fact small. NHST seems highly dependent on a
quality null hypothesis that's correct, which you can't really establish since
that's what you're trying to find out. So the approach doesn't really conclude
anything either way, and just seems weak as an approach to scientific testing
overall.
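
Putting the Native Hawaiian example through Bayes' rule explicitly (all
numbers hypothetical):

    p_citizen = 0.93                      # hypothetical share of US residents who are citizens
    p_hawaiian_given_citizen = 0.002      # the small "p-value-like" conditional
    p_hawaiian_given_noncitizen = 0.0005  # hypothetical

    p_hawaiian = (p_hawaiian_given_citizen * p_citizen
                  + p_hawaiian_given_noncitizen * (1 - p_citizen))
    p_citizen_given_hawaiian = p_hawaiian_given_citizen * p_citizen / p_hawaiian
    print(p_citizen_given_hawaiian)   # ~0.98: rejecting "citizen" because
                                      # P(Hawaiian | citizen) is small gets
                                      # the conditioning backwards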

~~~
majormajor
I don't really follow. Your examples seem like weird or poorly thought out
experiments (well, surveys, not experiments), not anything to do with
significance testing.

If I wanted to see whether native Hawaiians are citizens and didn't have
access to the law, real data, anything like that, I'd sample from the
population of Hawaiians, not citizens as a whole, and then regardless of the
null hypothesis being "are citizens" or "aren't citizens" I think the results
would be overwhelmingly one-sided enough to work out. And the South Africa
example is asking about one single person? That's not a question of science,
then, just a question?

~~~
justwantaccount
Maybe my example was contrived, but my main argument was that p-values
represent P(D|H), i.e. the probability of seeing your data conditioned on the
hypothesis being correct, and that approach seems inherently flawed to me. If
you're doing an experiment, p-values don't say anything about whether the
hypothesis is true or false; in fact the approach can't theoretically give
that information. If you're doing a basic t-test, it goes something like this:

1. You have a null hypothesis you assume to be true.

2. Based on this null hypothesis and the central limit theorem, if you
theoretically conduct many experiments you expect some summary statistic to
have a specific distribution around the 'true' population summary statistic
and 'true' population variance, which you don't have, so you assume them to be
the null-hypothesis values.

3. You compare your experiment's summary statistic to the hypothesized
distribution. If the probability of seeing your summary statistic is below
some threshold, you say it's unlikely to see this summary statistic again.

But the probability of seeing that summary statistic again actually depends on
P(H), which NHST doesn't provide any information on. My examples were meant to
highlight the nature of conditional probabilities, rather than how real life
experiments are conducted.
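
For reference, a minimal sketch (Python, made-up measurements) of the basic
t-test described in steps 1-3 above, with the conditioning spelled out in a
comment:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    control = rng.normal(10.0, 2.0, 12)     # hypothetical measurements
    treatment = rng.normal(11.5, 2.0, 12)

    # Welch's t-test. The p-value is P(statistic at least this extreme | null),
    # a statement about the data under an assumed null hypothesis,
    # not P(hypothesis | data).
    t, p = stats.ttest_ind(treatment, control, equal_var=False)
    print(t, p)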

~~~
Keysh
Your examples weren't so much contrived as incoherent, to the point where I
begin to suspect you don't really understand NHST.

"if I try to decide that native Hawaiians are US citizens, and the null
hypothesis is that they are, but since only ~0.2% of total US population is
native Hawaiian, NHST would conclude that native Hawaiians aren't US
citizens."

So what is the _observation_ in this case, and what would the corresponding
prediction from the null hypothesis be? The observation that "0.2% of the US
population is native Hawaiian" has no relation to your claimed null hypothesis
at all.

The rest of your objection seems like one of those confused arguments trying
to rule out basic reductio ad absurdum ("but if X really isn't true, then your
arguments about seeing or not seeing the consequences of X have no basis!").

(And the central limit theorem has nothing to do with null hypothesis testing:
you can do NHST with completely non-Gaussian statistics.)

~~~
justwantaccount
Yes, you can conduct NHST without the central limit theorem. However, it's
used very widely in NHST. Was there anything wrong with what I said about what
a typical t-test usually looked like? My lab would use that approach to do
molecular biology.

You don't seem to understand my argument, so let me rephrase: the example
about Native Hawaiians was meant to highlight the nature of conditional
probabilities, and p-values are conditional probabilities. Just because a
p-value is below some threshold doesn't necessarily mean that the null
hypothesis is incorrect and therefore should be rejected. Just because a
p-value is high doesn't mean that the null hypothesis should fail to be
rejected. P-values do not theoretically give that information. A p-value
doesn't even represent the probability of observing that value, since it's a
conditional probability - the probability of observing that value given that
the null hypothesis is true, not the probability of observing that value.

If the p-value is below 0.005, can you scientifically, theoretically conclude
that the null hypothesis should be rejected? The probability of seeing a
Native Hawaiian person given that the person is a US citizen is below the
threshold of 0.005, but does that mean the conditioned part (the US-citizen
part) should be rejected? Granted, it's hard to relate that example to actual
experiments, but my argument is that p-values don't theoretically give any
conclusions either way, and trying to make it "scientific" to draw conclusions
by introducing thresholds on a conditional probability, no matter how strict,
seems inherently flawed.

Using the p-value as a single metric among many, as a tool for exploration,
makes sense to me. Even to make strong suggestions, sure, especially with all
the controls RCTs put in. But to make hard conclusions, as in NHST? The
approach itself doesn't have the theoretical power to do so.

