

 Is it time to up the statistical standard for scientific results? - 001sky
http://arstechnica.com/science/2013/11/is-it-time-to-up-the-statistical-standard-for-scientific-results/

======
batbomb
> These are the sorts of nuts-and-bolts reproducibility issues that drive
> researchers crazy, because they can be affected by things like the specific
> strain of mice you use, where you buy your chemicals, and even the pH of
> your lab's water supply. No amount of statistical thinking is going to
> change any of that.

This screams systematic error and error propagation to me. It's possible that
we don't need to tighten the p-value threshold; we just need to make sure
researchers properly account for all sources of error. The problem is that's
often a skill acquired over time, not something younger researchers typically
think about, especially those who aren't multidisciplinary.
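
To make that concrete, here's a rough simulation (all numbers made up): an
unmodeled batch effect - say, the treatment group got a different reagent lot
or water pH - produces "significant" group differences far more often than the
nominal alpha, and tightening the threshold barely helps.

    # Rough sketch, made-up numbers: an unmodeled systematic offset between
    # batches (reagent lot, water pH, mouse strain...) inflates the false
    # positive rate far past the nominal alpha, at any threshold.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, trials = 30, 5000
    spurious = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(0.0, 1.0, n) + 0.8   # batch offset, not biology
        if stats.ttest_ind(control, treated).pvalue < 0.05:
            spurious += 1
    print(spurious / trials)   # close to 1.0, even though the true effect is zero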

~~~
flatline
Yeah, the 95% confidence interval is completely arbitrary to begin with. Is it
really better than, say, 94% -- which will most assuredly not get you
published? In many fields, I think a p-value indicating 95% confidence in the
results is fine; you are never going to get 99.999% or even 99% due to the
errors inherent in the subject of study, and that doesn't invalidate the
results out of hand. It's the whole host of other errors, up to and including
blatant cherry-picking or fabrication of results, where the focus should be.
There's been a series of recent posts encouraging scientists (and, moreover,
journals) to publish negative results, which is probably a good starting point
for promoting greater transparency and honesty in the paper-publishing game.

~~~
apsec112
This is part of the problem. A p-value of 0.05 absolutely does _not_ (!!!)
indicate 95% confidence in the results! What a p-value means is, _if_ the
hypothesis in question is false, _then_ we have an x% chance of seeing the
observed data. It tells you absolutely nothing about the reverse: _if_ you
observe such-and-such data, _then_ how strongly we should believe the hypothesis.

For example, you could have an extremely unlikely hypothesis (e.g. "dice are
controlled by alien telepathy"), test it, and still come out with p < 0.05
because of something much more plausible ("the dice were badly made and are
biased"). This is known as the base rate fallacy; an extraordinary claim
requires extraordinary proof. Moreover, to prove an extraordinary claim, you
must show it is more likely _not_ just compared to chance, but compared to
every less extraordinary alternative, including "the experimenters are faking
the data".

It's also entirely possible to get a result at p < 0.05 that makes the
hypothesis you're testing _less_ likely. Suppose you want to know how far away
the nearest star is. The value in the textbook is 13.4 light-years, and you
think it's really 12,000. You take some measurements, and get values of 19.6,
17.4, 20.1, 20.4 and 18.5.

Now, this is a significant result at p < 0.05 - _if_ the star is 13.4 light-
years away, _then_ getting these numbers has a probability of less than 5%.
However, these results completely _rule out_ the hypothesis you're testing.
The numbers you get are pretty unlikely if the real number is 13.4, but
_extraordinarily_ unlikely if the real number is 12,000, so this experiment
makes the 13.4 number _more_ credible. This kind of thing is why lots of
researchers still believe in psychic powers - they keep testing for psychic
powers, and keep getting results at p < 0.05, but don't notice their results
make psychic powers _less_ plausible. (Not joking - see
[http://commonsenseatheism.com/wp-content/uploads/2010/11/Wag...](http://commonsenseatheism.com/wp-content/uploads/2010/11/Wagenmakers-Why-Psychologists-Must-Change-the-Way-They-Analyze-Their-Data.pdf)
for a detailed explanation.)

This is an elementary mistake that every freshman statistics course warns
against, and yet it's absolutely pervasive.
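
To put numbers on the star example above (a sketch; the comment doesn't specify
a test, so assume a plain one-sample t-test and a normal error model): both
13.4 and 12,000 get "rejected" at p < 0.05, but the data are overwhelmingly
more likely under 13.4 than under 12,000.

    # Sketch of the star example (the one-sample t-test is an assumption; the
    # point is only that "p < 0.05 against 13.4" says nothing in favor of 12,000).
    import numpy as np
    from scipy import stats

    data = np.array([19.6, 17.4, 20.1, 20.4, 18.5])   # measured distances, ly

    for hypothesis in (13.4, 12000.0):
        t, p = stats.ttest_1samp(data, popmean=hypothesis)
        print(f"H0 = {hypothesis:>7}: t = {t:.1f}, p = {p:.2g}")

    # Relative support under a normal error model with the sample's own spread:
    # 13.4 beats 12,000 by an astronomical factor.
    sd = data.std(ddof=1)
    def loglik(mu):
        return stats.norm.logpdf(data, mu, sd).sum()
    print("log-likelihood difference (13.4 vs 12000):", loglik(13.4) - loglik(12000.0))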

~~~
jmpeax
> What a p-value means is, if the hypothesis in question is false, then we
> have an x% chance of seeing the observed data.

The hypothesis in question is the null hypothesis, which must be assumed
_true_, not _false_, for the p-value to give the probability of seeing data at
least as extreme as what was observed.

I can see why elementary mistakes are so pervasive.

------
RA_Fisher
The real problem is that effect size is rarely discussed. P-values only relate
to variability; sometimes high variability is acceptable, sometimes it's not.
Taken by itself, a p-value is worthless, no matter how small it is. For
example, suppose you develop a fertilizer that you're 99.99999% certain will
produce one additional ear of corn per 10,000 bushels. Who cares!? You see a
lot of that in published papers. A lot of researchers stop once they get a
p-value below 0.05, and neglecting effect size is the norm.

~~~
ProblemFactory
Effect size is important, because _all_ events in the real world except for
true random numbers have _some_ correlation.

Eating an apple might help a broken leg heal faster or slower. There is
certainly an extremely small correlation, and with a sufficiently large number
of controlled experiments, a result rejecting the null hypothesis with p <
0.05 _will_ be found. The required number of experiments might be
astronomically large, but if any correlation exists, it can be found with
enough samples.

But without looking at the effect size, the result is useless. Even if eating
an apple helps your broken leg heal 2 seconds faster on average, it is
pointless to suggest this as medical advice.
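
A back-of-the-envelope sketch of just how "astronomically large" that sample
would need to be (every number here is an assumption: a two-second effect on
healing times that vary with a standard deviation of about five days, two-sided
alpha = 0.05, 80% power, the usual normal approximation):

    # Back-of-the-envelope: sample size per group needed to detect "apples make
    # broken legs heal 2 seconds faster" at p < 0.05 with 80% power, assuming
    # healing times vary with a standard deviation of about 5 days.
    from scipy import stats

    sigma = 5.0                  # days (assumed spread of healing times)
    delta = 2.0 / 86400          # 2 seconds, expressed in days
    z_alpha = stats.norm.ppf(1 - 0.05 / 2)   # 1.96
    z_beta = stats.norm.ppf(0.80)            # 0.84

    n_per_group = 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2
    print(f"{n_per_group:.1e} patients per group")   # ~7e11: findable, but meaningless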

------
yes_procrast
In physics, we use 2-sigma (95%) limits all the time. 5-sigma (99.9999%) is
generally required only for a claim of detection.

If you're just feeling your way around in the dark, 2-sigma is a useful way to
work, so we use that to guide exploration.

Why 2-sigma? Well, it's twice as big as 1-sigma.

Experiment didn't go well, but you need a more-impressive result? Use a 90%
confidence interval instead of 95%.

~~~
minimaxir
1-sigma is 68% confidence. 90% confidence intervals assume 1.64 sigma.
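
For reference, the conversions are easy to check (two-sided coverage of a
normal distribution):

    # Sigma <-> confidence for a normal distribution (two-sided coverage).
    from scipy import stats

    for k in (1, 2, 5):
        coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
        print(f"{k}-sigma: {coverage:.5%}")       # 68.27%, 95.45%, 99.99994%

    for conf in (0.90, 0.95):
        z = stats.norm.ppf(1 - (1 - conf) / 2)
        print(f"{conf:.0%} CI: {z:.2f} sigma")    # 1.64, 1.96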

~~~
yes_procrast
Agreed, I'll add a line break to denote that they're separate thoughts.
1-sigma is indeed 68%.

------
capnrefsmmat
The problem with upping the standard is, as the PNAS article acknowledges,
that you need a larger sample size to detect any given effect.

Unfortunately, many studies -- particularly those in medicine -- are already
conducted with samples that are too small to detect any effect you'd
reasonably expect to see, because many researchers do not calculate in advance
what sample size would be required. This has interesting paradoxical effects:
the only published studies are those that _overestimate_ the size of the true
effect.

[http://www.refsmmat.com/statistics/power.html](http://www.refsmmat.com/statistics/power.html)
[http://www.refsmmat.com/statistics/regression.html#truth-inflation](http://www.refsmmat.com/statistics/regression.html#truth-inflation)
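
A quick simulation of that truth-inflation effect (made-up numbers: a small
true effect, small samples): the studies that happen to clear p < 0.05 report
effects several times larger than the truth.

    # Sketch: with underpowered studies, the ones that clear p < 0.05
    # systematically overestimate the true effect ("truth inflation").
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect, n, studies = 0.2, 25, 10_000
    published = []
    for _ in range(studies):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            published.append(treated.mean() - control.mean())

    print("true effect:", true_effect)
    print("mean 'published' estimate:", round(float(np.mean(published)), 2))  # ~3x too big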

So there's a tradeoff. Do you want to eliminate false positives at the cost of
more false negatives? It's a difficult balance. I suspect there are many areas
where poor statistical practice can be remedied to produce better results
without greater expense.

~~~
toufka
I still think the answer is not any particular standard in any particular
field, but rather publishing the final, worked datasets. Making the numbers
themselves available and usable to the 'reader', or to the scientist following
up, will immediately clear up questions of reproducibility or of not meeting
statistical thresholds. If I see the same data, run it through my own
processes, and it looks like noise, I'm likely to discount the results moving
forward. There's nothing inherently wrong with saying 'look, I did this once,
and this is what I saw'. And many times it's useful. It's less useful, but
still reasonable in certain circumstances, to say 'look, I did this a hundred
times and I saw this once'. There are occasions when the experiment cannot
reasonably have more than a statistically insignificant 'n' - but it still
might be useful to see what happens. What is not reasonable is to say 'this is
what I (sometimes) see always (trust me)' and hide your data behind a
'representative' jpg and a 'statistically significant' p-value.

As a scientist who has worked with both rich and poor datasets, from standard
and invented datatypes, I'd just say, "let me see the data".

------
coherentpony
> In most fields, if there's less than a five percent chance that you'd get the
> two numbers by random chance, then you can reject chance—the results are
> considered significant. In statistical terms, this is called having a p value
> of less than 0.05.

No it isn't [1].

[1]:
[https://en.wikipedia.org/wiki/P-value](https://en.wikipedia.org/wiki/P-value)

~~~
capnrefsmmat
The Ars definition of a p value is not precisely worded but, I think,
reasonably accurate. It doesn't include the bit about also including the
probability of obtaining results _more extreme_ than what you obtained.

The best definition I know is

> The P value is defined as the probability, under the assumption of no effect
> or no difference (the null hypothesis), of obtaining a result equal to or
> more extreme than what was actually observed.

S. N. Goodman. Toward evidence-based medical statistics. 1: The P value
fallacy. Annals of Internal Medicine, 130:995–1004, 1999.

edit: oh dear, and then the Ars article says "Individual experiments may be
wrong five percent of the time," but that's exactly what _p_ values do _not_
measure. Statistics is hard.

~~~
coherentpony
Yeah that definition is good.

For the purposes of understanding the definition, another way of looking at it
is as a 'statistical proof by contradiction' (sketched in code below):

1. Assume the null hypothesis is true.

2. Compute the test statistic.

3. Ask the question, "What is the probability of obtaining that test
statistic or one more extreme?" (this probability is the p-value).

4. Pick a threshold (usually 0.05, but this is totally arbitrary).

5. If p < threshold, then conclude the null hypothesis is false and reject it.

Reductio ad statistico absurdum.
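
Walking those steps through a toy example of my own (is a coin biased toward
heads?):

    # Toy version of the steps above: is a coin biased toward heads?
    from scipy import stats

    heads, flips = 61, 100

    # 1. Assume the null hypothesis: the coin is fair, P(heads) = 0.5.
    # 2. The test statistic is just the number of heads observed.
    # 3. Probability of a result at least this extreme under the null (one-sided):
    p_value = stats.binom.sf(heads - 1, flips, 0.5)   # P(X >= 61 | fair coin)
    print(p_value)                                    # ~0.018

    # 4./5. Against the usual (arbitrary) 0.05 threshold, we'd reject "fair coin".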

------
aidenn0
Isn't the solution just to put funding into reproducing results? It's fine to
have a wide filter for exploratory research, but a big problem is that it can
be decades before anybody goes back to reproduce.

Let's assume that 90% of published results are false positives. Replicating
with a higher-powered experiment, even at the same standard of p=0.05, would
weed out most of those while hopefully keeping the majority of the true
positives (thanks to the higher power of the new test). That would take you
from 90% false positives to less than half, a considerable improvement.
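
The arithmetic behind that (the 90% false-positive share is the assumption
above; 80% power for the replication is my assumption):

    # Replication arithmetic: 90% false positives going in, replicate once at
    # alpha = 0.05, assume 80% power for the true effects.
    false_share, alpha, power = 0.90, 0.05, 0.80

    false_surviving = false_share * alpha         # 4.5% of all results
    true_surviving = (1 - false_share) * power    # 8.0% of all results

    print(false_surviving / (false_surviving + true_surviving))   # ~0.36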

It just seems that nobody ever advanced their career by trying to reproduce
even a landmark result in their field.

~~~
niels_olson
The issue in day-to-day science is that people will work their numbers to get
p=0.05. They'll have a hypothesis that, say, lactating adenomas confer a
protective effect against DCIS. And they'll pull, say, all lactating adenoma
cases from 2001 to 2013. If that doesn't work, they might actually try
_dropping_ the 2001 data and just using the 2002 to 2013 data if it gets them
to p=0.05. The result is more brittle (we often test findings by asking "How
would the p value change if we added one negative case?"), but the alternative
(pun intended) wouldn't otherwise be published.
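
Here's what that brittleness check looks like with hypothetical counts (not
real data) and Fisher's exact test: a single extra negative case pushes a
"significant" protective effect back over 0.05.

    # Hypothetical counts, not real data: a borderline "protective effect"
    # finding, and the brittleness check of adding a single negative case.
    from scipy.stats import fisher_exact

    #                     DCIS   no DCIS
    lactating_adenoma = [   3,     10   ]
    controls          = [   9,      4   ]
    print(fisher_exact([lactating_adenoma, controls])[1])   # ~0.047: "significant"

    lactating_adenoma[0] += 1    # one more adenoma patient who did develop DCIS
    print(fisher_exact([lactating_adenoma, controls])[1])   # ~0.057: significance gone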

This search for p=0.05 also leads to a lot of hair-splitting studies: take two
diagnoses, say, usual and atypical ductal hyperplasia. Now, if you can find
some constellation of parameters that define a middle category, say,
"borderline ductal hyperplasia", you have a wide-open field to all sorts of
p=0.05's, even if there's no change in treatment or outcome. You can say
"cases previously characterized as UDH with a <parameter x> greater than <x>
are 67% more likely to have <parameter y> (p=0.002)" because you lumped
together a bunch of stuff that people already mostly agreed on anyway.

------
mathattack
My impression is there are a couple of issues:

- Methodology - If you're mining for answers and then presenting only the
statistically significant results, you will still find false positives on
larger datasets; it will just take more work (see the sketch at the end of
this comment).

- Reproduction - If false positives were more fiercely chased down, this would
force researchers to be more careful. There is improvement along these lines.

Changing the p value is arbitrary, and may miss important results. I believe
that encouraging reproduction of results, and reducing blind data mining is a
better solution.
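
A sketch of the data-mining point (this is the jelly-bean problem from the
xkcd linked elsewhere in the thread): test twenty hypotheses that are all
null, and on average one of them will come out "significant" at 0.05.

    # Sketch of the mining problem: twenty hypotheses, all of them null, tested
    # at 0.05 -- on average one will come up "significant" anyway.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    hits = sum(
        stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue < 0.05
        for _ in range(20)
    )
    print(hits, "spurious finding(s) out of 20 null comparisons")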

------
adobriyan
If professional scientists make genuine and not-so-genuine mistakes, imagine
what programmers (who are generally clueless about statistics and experimental
design) do with benchmark numbers!

------
qwerta
My teachers always told me that the main criterion in "science" is the ability
to verify and reproduce. A study that relies on statistics but presents no raw
data or code is not scientific.

It does not take millions to check for basic mistakes. One person with a
computer and a free afternoon is enough.

It is like having open source, but without any source code.

~~~
dllthomas
While I think availability of data and code is important, note that it does
not accomplish many of the functions of replication of experiments. If two
people implement the described algorithm and get the same results, we can be a
lot more confident than if two people run the same implementation, because the
same implementation is more likely to have the same bugs.

~~~
qwerta
Code has to be reviewed and _VERIFIED_. How can an article pass peer review if
nobody has even checked the code for basic mistakes?

~~~
dllthomas
Yes, precisely. Publishing code and data is necessary for _peer review_ , not
for _replication_. Both are necessary parts of science.

------
gwern
No, it's not, unless you want _even more_ data dredging and p-value hacking.
This won't solve anything, and it'll make matters worse.

------
Bsharp
Researchers should just be forced to append ", probably." to their paper's
title and any conclusions reached.

------
Millennium
Is it ever NOT the time to up the statistical standard? Should science not be
constantly seeking to improve itself?

~~~
Bsharp
Many researchers in less mathematically rigorous fields lack basic statistical
knowledge, such as how a p-value is calculated or what a chi-squared test is.
And that isn't to be snarky towards them - they have a lot to absorb in their
own field. It just means that they are slower to adopt or even know about
various mathematical techniques.

This reminds me of the med researcher who rediscovered integration in 1994 and
was cited many, many times:

[http://care.diabetesjournals.org/content/17/2/152.abstract](http://care.diabetesjournals.org/content/17/2/152.abstract)

------
pkolaczk
[http://xkcd.com/882/](http://xkcd.com/882/)

------
yetanotherphd
In most areas (social sciences, medicine) there is no "standard" for
statistical significance. In economics, people try to deal with the issue by
looking at the robustness of a result. If a result remains when you make
various changes to the specification of your model, then it is less likely to
be a statistical artifact.

Another step in the right direction is greater reproducibility, so that at
least people can play with your analysis and see if what you did was the most
direct and natural analysis, or if there are clear signs of playing with the
parameters until you get the result you want.

