
50% of neuroscience papers suffer from a major statistical error. - zacharyvoase
http://www.badscience.net/2011/10/what-if-academics-were-as-dumb-as-quacks-with-statistics/
======
polyfractal
In academic biology, statistics is routinely considered an afterthought. In
the eyes of many labs/PIs, their results are already real. You just have to do
those damn statistical tests so the editor will get off your back.

Most biologists don't understand statistics either. To the majority, the
unpaired T-Test is the only test that is needed. Ever. Doesn't matter if you
have one or two tails, paired or unpaired trials, normally distributed
population or skewed. Most biologists don't take proper statistics classes and
most just don't care. I doubt most biologists would even be able to name
alternative statistical tests.

~~~
markkat
So very true. I went from physics into biology, and the shoddy application of
statistics is surprising. This is a failure that could be easily addressed at
the university level. More stats classes.

~~~
onemoreact
The best and simplest approach IMO would be improving peer review. If it were
understood that scientific journals would reject papers with poor statistical
analysis, you would see changes fairly rapidly.

~~~
carbocation
Nature and the NEJM, and possibly others, assign statistical reviewers. So
this is a known problem that some high profile journals are attempting to
correct.

------
_delirium
The short version of the error: If in your data you find that A has no
statistically significant effect, but B does have a statistically significant
effect, this does _not_ automatically show that B has, with statistical
significance, more effect than A. To do that you have to do a statistical test
on the difference in the effects.

I believe (?) this is normally done correctly in medical studies with
placebos, where the typical analysis is to show that a drug has a
statistically significant effect compared to the placebo baseline; it's not
sufficient to show that the drug has an effect compared to a no-drug baseline,
and, separately, that a placebo doesn't.
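
A rough sketch of the point in code (Python/scipy; the effect sizes, sample
sizes and seed are made up purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 20)
    a = rng.normal(0.5, 1.0, 20)           # treatment A, true effect 0.5
    b = rng.normal(0.5, 1.0, 20)           # treatment B, same true effect

    _, p_a = stats.ttest_ind(a, baseline)  # may well fall below 0.05
    _, p_b = stats.ttest_ind(b, baseline)  # may well not, by chance alone

    # Wrong: concluding from (p_a < 0.05, p_b >= 0.05) that A beats B.
    # Right: test the difference between A and B directly.
    _, p_diff = stats.ttest_ind(a, b)
    print(p_a, p_b, p_diff)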

~~~
markkat
It's true. Many anti-depressant drugs have no significant effect when compared
to placebo.

[http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fj...](http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0050045)

~~~
_delirium
I don't believe the controversy over that is whether the statistics were done
properly, but over: 1) replicability, in particular whether significance is
still found when doing large-scale meta-analyses; and 2) whether blindness of
the studies is compromised by "unblinding" effects where a drug's side-effects
can tip off the doctor or patient about whether they got a placebo or not. But
it's been a while since I waded into that flamewar of a debate...

------
gizmo
Not to nitpick, but the headline "50% of neuroscience papers suffer from a
major statistical error." is false. Out of a sample of 513 papers 78 contained
this specific mistake.

~~~
noelwelsh
I think you skimmed a bit fast:

"Nieuwenhuis looked at 513 papers published in five prestigious neuroscience
journals over two years. In half the 157 studies where this error could have
been made, it was made."

~~~
fluidcruft
In other words, you're arguing that the title should be: "50% of 31% of
neuroscience papers suffer from a major statistical error." ?

~~~
numeromancer
How about "50% of eligible neuroscience papers assume that statistical
significance is transitive.".

------
jules
This should be the least of our worries. The incentives are set up so strongly
to produce a statistically significant result that indeed nearly all
experiments do produce a statistically significant result. Scientists measure
a data set. Then they work to find a subset of the data and a statistical test
that gives them a significant result. Just ask a couple of scientists whether
they ever did an experiment that didn't provide a statistically significant
result. Proper statistics, where you decide on the test you'll apply and on
what data before doing the experiment, is almost unheard of. And then there's
publication bias, of course.
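
A rough simulation of that subset-hunting (Python/scipy; pure noise with no
true effect at all, numbers made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    control = rng.normal(0, 1, 200)
    treated = rng.normal(0, 1, 200)    # drawn from the same distribution

    hits = 0
    for _ in range(100):               # 100 post-hoc "subgroup analyses"
        idx = rng.choice(200, size=50, replace=False)
        _, p = stats.ttest_ind(treated[idx], control[idx])
        hits += p < 0.05
    print(hits, "of 100 subgroup tests come out 'significant' by chance")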

------
impendia
Here is a wonderful research paper that illustrates the misuse of statistics
in neuroscience:

<http://prefrontal.org/files/posters/Bennett-Salmon-2009.jpg>

------
kghose
This statistical point is true. However, there is a practical problem with
experimental neuroscience - data is painful to get and the means are often
comparable to the variances.

So, with the data you have, it is often much harder to show A > B than to show
A > C and B !> C.

~~~
jvm
So your point is, inappropriate stats are okay because science is hard???

------
ArchD
Maybe someone should follow up and do a similar study for multiple fields and
see whether the difference in error rate between fields is statistically
significant.

~~~
barik
There is some literature in other fields. Sjoberg, in "A Survey of Controlled
Experiments in Software Engineering," did a similar study in software
engineering to compare with other fields such as medicine and the social
sciences. The survey showed that only 1.9% of software engineering studies
were actually controlled experiments.

They examined 5453 scientific articles in 12 leading journals from 1993 to
2002. I like how the authors tactfully state that "the gathered data reflects
the relevance of software engineering experiments to industrial practice and
the scientific maturity of software engineering research."

------
loup-vaillant
"Statistical significance" is starting to tire me. It is too binary for my
test: either a given result "achieved" statistical significant, or it is not.
Obviously you have to choose a threshold, and which it should be is much less
obvious.

Couldn't we just do away with statistical significance, and just publish
likelihood ratios, or decibels of evidence (in favour of one hypothesis over
another) ? That way, we should know _exactly_ how much an experiment is
supposed to be worth. No arbitrary threshold. Plus, you get to combine several
experiments, and get the compound evidence, which can be much stronger (or
_weaker_ ) than the evidence you get from any single one of them. And _then_
you may have found something worthwhile.

This is especially crucial when said evidence is expensive. In teaching, for
instance, one researcher can hardly do experiments on more than two or three
classrooms, over little more than a year. This is often not enough to
accumulate enough evidence _at once_ for reaching statistical significance.
But _a bunch of_ such experiments may very well be. (Or not, if the first one
proved to be a fluke.)
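
A toy sketch of what I mean (Python/scipy; the point hypotheses, score scale
and classroom data are invented purely for illustration):

    import numpy as np
    from scipy import stats

    def decibels(data, mu0=70.0, mu1=75.0, sigma=10.0):
        # 10 * log10 of the likelihood ratio for H1 (mean 75) over H0 (mean 70)
        ll0 = stats.norm.logpdf(data, mu0, sigma).sum()
        ll1 = stats.norm.logpdf(data, mu1, sigma).sum()
        return 10 * (ll1 - ll0) / np.log(10)

    rng = np.random.default_rng(2)
    studies = [rng.normal(74, 10, 25) for _ in range(4)]  # four small classrooms
    per_study = [decibels(s) for s in studies]
    # Evidence simply adds across experiments; no arbitrary threshold needed.
    print([round(d, 1) for d in per_study], "combined:", round(sum(per_study), 1))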

------
tokenadult
The submission title of the submitted article (which does NOT appear as the
original title of the article) is probably a hat tip to the famous article by
John P. A. Ioannidis, "Why Most Published Research Findings Are False."

[http://www.plosmedicine.org/article/info:doi/10.1371/journal...](http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124)

The article by Ioannidis, which I think is the most downloaded ever article on
PLoS Medicine, is well worth reading for guidance in how to avoid errors in
research design.

But to the point of the accuracy of the submission title, what the underlying
journal paper

<http://www.nature.com/neuro/journal/v14/n9/full/nn.2886.html>

said was, "We reviewed 513 behavioral, systems and cognitive neuroscience
articles in five top-ranking journals (Science, Nature, Nature Neuroscience,
Neuron and The Journal of Neuroscience) and found that 78 used the correct
procedure and 79 used the incorrect procedure. An additional analysis suggests
that incorrect analyses of interactions are even more common in cellular and
molecular neuroscience." In other words, half the time when the specific issue
came up, the neuroscience authors got the procedure wrong.

AFTER EDIT: Meanwhile, a lot of other interesting comments have mentioned the
role of statistics education or math-aversion in higher education in various
disciplines. I'll comment here about some things I've learned about statistics
since I completed my higher education, things I've self-educated about as a
homeschooling parent for the last two decades and now a mathematics teacher in
private practice.

First of all, even when undergraduate students take courses in statistics, the
courses are not likely to be very helpful. Many statistics textbooks used in
colleges are poorly chosen

<http://statland.org/MAAFIXED.PDF>

and many statistics courses are taught by professors who themselves have very
poor backgrounds in statistics, so the essential point that statistics is all
about DATA never gets emphasized. Moreover, the undergraduate statistics
curriculum historically has emphasized the wrong issues about valid inference

<http://escholarship.org/uc/item/6hb3k0nz>

and most undergraduates who complete one or two statistics courses still have
a very weak sense of what valid statistical inference is.

And all of this is not even to get into issues such as Bayesian versus
frequentist approaches to statistics

<http://yudkowsky.net/rational/bayes>

in modeling reality. Yes, biologists need to get over fear of mathematics for
biology to progress as a science,

[http://www.guardian.co.uk/books/2011/apr/16/mathematics-of-l...](http://www.guardian.co.uk/books/2011/apr/16/mathematics-of-life-ian-stewart-review)

and everyone can gain by learning more about statistics,

[http://knowledge.wharton.upenn.edu/article.cfm?articleid=192...](http://knowledge.wharton.upenn.edu/article.cfm?articleid=1928)

but statistics education isn't easy, and it can still be greatly improved even
for the students who do step up to take statistics courses.

~~~
cop359
I think bad statistics education is just a facade that hides what is really
happening. Now I don't have proof, but I think the majority of these "errors"
are done on purpose. It's far better to fudge your math, get amazing (and
wrong) conclusions and then get published in Nature than it is to not get
published in Nature.

The prestige of getting published in Nature or Science far outweighs the
criticism you will get for forging or manipulating your data. In large part
because the latter can almost never be proven. You can always say you just made
a mistake or plead ignorance.

~~~
mbreese
These aren't errors that would be caused by manipulating your data. If you
were manipulating your data, you'd have made sure that your results were
significant with the correct tests. These are errors where people didn't use
the proper test.

At worst, you could claim that people submitted only the results of the test
that made their research look better than it otherwise would have been (with
the correct test).

In this case, I think it is more of an issue with the reviewers catching the
problems than the authors deliberately misleading.

~~~
jvm
Actually, in the case of this specific error, reviewers are often shut down by
editors who want to publish the finding. It happened to my friend's advisor
recently: she called them out for it and the editor said it would be published
anyway.

------
narkee
I wonder how many people make this error while A/B testing their websites...

~~~
3pt14159
Nobody does since we have this thing called the G-test. We can make other
errors, but this specific one isn't possible.
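
For the plain two-variant case, a minimal sketch (Python/scipy; the counts are
made up):

    import numpy as np
    from scipy.stats import chi2_contingency

    #                converted  not converted
    table = np.array([[310, 9690],    # champion
                      [370, 9630]])   # challenger

    # lambda_="log-likelihood" turns the chi-square test into a G-test.
    g, p, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
    print(g, p)   # one test on the whole table; no comparing separate p values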

~~~
btilly
I only wish it were so simple. I've presented correct statistical results for
A/B tests only to have people try to argue me into accepting incorrect results
that would follow from this logical error. If I knew less statistics, or was
unwilling to argue with my boss, this error would have been made.

And another common variant of the problem happens when you're testing 10
variations. People want to do a pairwise test on the top and bottom right
away, without realizing that, even if all are equal, the top and bottom
frequently look different. Or the flip side of that error is that people see
that the G-test says that there is a difference, and conclude that the current
top one must be better than the current bottom one. Which is again incorrect.

There is a lot of subtlety, and just saying, "I have this statistical test
that most people don't understand" is not really going to cut it.
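
A rough simulation of that ten-variant trap (Python/scipy; the traffic and
conversion numbers are made up):

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(3)
    n, rate = 2000, 0.05               # every variant converts at the same rate
    false_alarms = 0
    for _ in range(200):
        conv = rng.binomial(n, rate, size=10)           # 10 identical variants
        top, bottom = conv.max(), conv.min()
        table = [[top, n - top], [bottom, n - bottom]]  # best vs. worst, post hoc
        _, p, _, _ = chi2_contingency(table, lambda_="log-likelihood")
        false_alarms += p < 0.05
    print(false_alarms, "of 200 rounds flag the extremes as 'different'")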

~~~
3pt14159
Right, which is why I said "We can make other errors, but this specific one
isn't possible" to the question: "I wonder how many people make this error
while A/B testing their websites".

I'm familiar with the drawbacks of Taguchi methods, the subtle problems caused
by changing distributions, and the problem of checking the G-test continuously
and thereby reducing its effectiveness. But for a simple A/B test (and by that
I mean challenger versus champion served randomly from the backend at a static
split, say 50-50 throughout the life of the test), unless I need to hit the
books again, this specific problem is not possible if everyone on board trusts
the G-test (with the Yates correction on, etc.).

------
bluecalm
The whole concept of "statistically significant" has to go imo. It causes much
confusion and influences experiment design (in short, it pushes towards taking
the maximum number of measurements under unchanged conditions rather than the
measurements that maximize information value), all in the name of some
arbitrarily chosen threshold for statistical significance.

------
zerostar07
That's quite impressive. Especially when they suggest that researchers may
choose to report differences in significance because the actual interaction
effect is not significant. Part of the publish-or-perish craze, I guess. An
open science approach would help identify these errors.

That is not to say that the significant results are not significant, just that
any claim you read in the discussion section should be taken with a grain of
salt (which I think is already the case, given that brain phenomena are
borderline chaotic).

This may simply reflect the fact that in neuroscience there are no really
large, deep labs (think LHC scale); instead there are thousands of small labs
doing largely overlapping work with usually small sample sizes, all competing
for small grants.

------
lliiffee
This error builds on a simpler, even more common one (at least among
students): Suppose you have just one treatment (A), and you find that A has no
statistically significant effect. This does _not_ show that A has no effect,
or that A's effect is small. (It could just mean your dataset isn't large
enough.) The error discussed in the article seems to build off this mistake.

~~~
alexholehouse
Could you elaborate on this? Does this assume "you" extrapolate the effect of
treatment A to the general population?

I mean, I understand that if you have a sample size of two, find that
treatment A does not induce an effect, and conclude that treatment A has no
effect [for all test subjects] this does not hold. However, surely if the
sample size _is_ big enough (which obviously isn't always clear, but for the
sake of the argument let's assume it is) then drawing such conclusions does
hold (within the certainty thresholds predefined for your statistical test of
choice, such as 99% or 95% probability). Or have I misunderstood?

~~~
sparsevector
The problem is that standard statistical tests have two outcomes: (1) "reject
the null hypothesis" or (2) "failure to reject the null hypothesis". Moreover
in the most common statistical tests the null hypothesis amounts to something
like "the two samples were drawn from the same distribution" so if you fail to
find a significant difference you haven't shown they're the same, you've just
failed to show they're different. If what you want to do is show that they're
nearly the same, you can design a statistical test where the null hypothesis
is instead something of the sort "these two samples were drawn from
distributions that differ by > X amount" so rejecting the null hypothesis
shows they differ by <= X.
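
A minimal sketch of that kind of test, TOST-style equivalence via two
one-sided t-tests (Python/scipy 1.6+ for the alternative argument; samples
and margin are made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x1 = rng.normal(10.0, 2.0, 60)
    x2 = rng.normal(10.1, 2.0, 60)
    margin = 1.0   # null hypothesis: the means differ by more than 1 unit

    _, p_lower = stats.ttest_ind(x1 + margin, x2, alternative="greater")
    _, p_upper = stats.ttest_ind(x1 - margin, x2, alternative="less")
    p_tost = max(p_lower, p_upper)
    print(p_tost)  # small value: evidence the true difference is within +/- 1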

------
DaniFong
The author is himself imputing an absolutist interpretation of statistical
significance. Statistical significance always has a p value (and an assumed
distribution!) associated with it. The permutation of statistical
significances and insignificances the article describes could well be true
for various p values.

------
rbanffy
Many people enter biology and related areas precisely _because_ they dislike
math.

How ironic...

~~~
Sayter
That's not ironic. Irony occurs when the outcome is the opposite of what was
intended. If many people enter biology and related areas precisely because
they dislike math, then the expectation would be that math-related areas of
the field (such as statistics) would suffer as a result. Irony (in this case)
would be if people entered biology because they dislike math, but then biology
as a field ended up being stronger in statistics than the more mathematically
inclined fields.

~~~
rbanffy
It's ironic because they still have to learn math.

~~~
Sayter
Fair enough.

------
101010010101
In general, beware statistics. Check and double check before making
assumptions. This applies to more than just neuroscience papers.

When reading scientific papers, beware conclusions.

Isn't the important thing whether someone else can replicate the experiments
and achieve similar/same results?

~~~
pbhjpbhj
> _Isn't the important thing whether someone else can replicate the
> experiments and achieve similar/same results?_ //

Or whether a statistically significant proportion of experiments return a
significantly similar result?

;0)>

~~~
101010010101
If the experiments are replicated enough times by other labs to create a
sufficient sample size.

