
Neuroimaging results altered by varying analysis pipelines - seesawtron
https://www.nature.com/articles/d41586-020-01282-z
======
mola
I wish university PR teams would stop peddling "cutting edge" science to the
masses. People in the field know that fMRI is problematic, but the public
doesn't. We need to stop confusing the public with PR pieces and silly TED
talks. The public are not idiots; they just end up not trusting science, and
rightly so, because they are exposed to it mainly through half-assed puff
pieces peddling half-baked science. That adds up to an incoherent and wrong
narrative, which breeds mistrust.

------
colincooke
I'm friends with people in one of the labs that took part in this study, so
here's some perspective on how this was received in the field.

For the most part it wasn't a radical finding, but it was really good to
quantify and see the effects. It also demonstrated the importance of including
and documenting the parameters of the analysis pipeline.

Something else to keep in mind, especially when thinking about cognitive
neuroscience (where scanning is an important tool), is that analyses are not
done in isolation. Every experiment is motivated by behavioural results,
neurobiological results, or both. The goal of fMRI is to gain insight into the
activity of the brain, but it's also not very precise (given the scale of the
brain).

Basically, this data is analyzed in the context of what has been shown before
and of how it matches the current hypotheses. This doesn't mean such variation
is acceptable (see the first part of my comment), but it does mean that it
doesn't render all fMRI results invalid.

------
daedalus_f
There have been large problems with fMRI studies for a long time, even leaving
aside the potentially sketchy coupling between the BOLD signal and actual
neural activity and the difficulty of accounting for movement artifacts.

This paper, published in 2016, suggests that commonly used statistical packages
for analysis of fMRI data can produce false-positive rates of up to 70%:
[https://www.pnas.org/content/113/28/7900](https://www.pnas.org/content/113/28/7900)

Even more fun, this poster presents the results of fMRI in a dead salmon given
an open-ended mentalising task:
[https://www.psychology.mcmaster.ca/bennett/psy710/readings/B...](https://www.psychology.mcmaster.ca/bennett/psy710/readings/BennettDeadSalmon.pdf)

~~~
jointpdf
> _Subject: One mature Atlantic Salmon (Salmo salar) participated in the fMRI
> study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was
> not alive at the time of scanning._

> _Task: The task administered to the salmon involved completing an open-ended
> mentalizing task. The salmon was shown a series of photographs depicting
> human individuals in social situations with a specified emotional valence.
> The salmon was asked to determine what emotion the individual in the photo
> must have been experiencing._

This is just so, so good. Thank you to this salmon for participating in
science.

~~~
SubiculumCode
The salmon experiment was ages ago, and it just showed that the threshold for
statistical significance was too liberal. At that time, when neuroimaging was
quite new, it was common practice to just pick a threshold, say .001, and not
do corrections for multiple tests. That proved to be too liberal. It wasn't a
totally stupid guess, though: much of the brain's BOLD activity is correlated,
so Bonferroni correction was stupidly conservative (the tests are not
independent). Newer techniques (cluster-based correction) were then adopted,
which made an explicit assumption about the spatial autocorrelation of the
data. The catch was that this technique assumed the autocorrelation is the
same throughout the brain, which turned out not to be a good assumption,
though it was computationally convenient. Newer methods today either model the
autocorrelation in the data for cluster-based correction or have moved to
other numerical techniques for multiple-comparisons correction via repeated
resampling.
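
A toy sketch of that progression on synthetic null data (nothing here is from
the paper; the numbers are invented for illustration): an uncorrected p < .001
threshold flags "activation" in pure noise, while Bonferroni and a
resampling-based max-statistic correction keep the familywise error rate under
control.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_voxels = 20, 10_000

    # Null data: no true activation anywhere.
    data = rng.normal(size=(n_subjects, n_voxels))
    t, p = stats.ttest_1samp(data, popmean=0.0, axis=0)

    print("uncorrected p<.001:", int((p < 0.001).sum()))            # ~10 false hits
    print("Bonferroni p<.05:  ", int((p < 0.05 / n_voxels).sum()))  # ~0

    # Max-statistic permutation: flip subject signs, record the largest |t|
    # each time, and threshold the real map at the 95th percentile of that null.
    max_t = [np.abs(stats.ttest_1samp(data * rng.choice([-1, 1], size=(n_subjects, 1)),
                                      popmean=0.0, axis=0).statistic).max()
             for _ in range(500)]
    print("permutation-corrected:", int((np.abs(t) > np.quantile(max_t, 0.95)).sum()))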

So, while the dead salmon experiment was eye opening to some when the
neuroimaging field was very new, it is no longer relevant now.

Neuroimaging techniques are constantly improving, both through better
analytical techniques and through knowledge of the factors that can compromise
the veracity of analyses. But neither those earlier problems nor the present
paper show that neuroimaging is bunk science or useless; that conclusion is
most certainly incorrect.

~~~
jointpdf
Thanks for this, it was interesting and helpful. My comment was mostly
appreciating the sardonic and dry humor in the poster (and the concept of the
study itself).

It was my first time hearing about this, and it struck me as the type of funny
anecdote that could be slipped into a talk/lecture (I find it takes quite a
bit of finesse to keep non-technical people engaged when talking about
statistics and data science, so humor and relatable examples always help). So,
it’s good to know the broader context.

------
jdoliner
This is a big problem, and not just in neuroscience. Reproducible science is
inherently hard and there aren't a lot of great tools that make it easy. The
key to solving it is having a way to track data lineage and a reproducible way
to run the processing. For the past several years I've been building a system
called Pachyderm [0] that implements those ideas. We've helped many scientists
across different fields run their pipelines in a reproducible way. If you're
suffering from this problem we'd love to chat with you.

[0] [https://www.pachyderm.com/](https://www.pachyderm.com/)

~~~
SubiculumCode
This is why I am a big supporter of the Configurable Pipeline for the Analysis
of Connectomes (C-PAC) [https://fcp-indi.github.io/docs/latest/user/index.html](https://fcp-indi.github.io/docs/latest/user/index.html)
for neuroimaging. This software facilitates both exact reproducibility and
testing the variability of results as a function of multiple preprocessing
choices, the latter letting you check that your results are not idiosyncratic
to specific preprocessing choices (e.g., smoothing, nuisance correction
methods, etc.).
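
A bare-bones sketch of that second use in plain Python (this is not C-PAC's
actual interface; the parameter grid and the simulated analysis are stand-ins):

    from itertools import product
    import numpy as np

    rng = np.random.default_rng(42)

    smoothing_fwhm = [4, 6, 8]                              # spatial smoothing, mm
    nuisance = ["motion", "motion+compcor", "motion+gsr"]   # nuisance models

    def run_analysis(fwhm, nuisance_model):
        # Stand-in for the real pipeline: simulate an effect estimate whose
        # value drifts a little with the preprocessing choices.
        return 0.30 + 0.02 * fwhm + rng.normal(scale=0.05)

    effects = np.array([run_analysis(f, n) for f, n in product(smoothing_fwhm, nuisance)])
    print(f"effect range across pipelines: {effects.min():.2f} to {effects.max():.2f}")
    # A finding that survives only one combination of choices is suspect.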

------
fluidcruft
I don't really see why standardized pipelines are the answer. That just makes
things reproducible while ignoring the question of accuracy. Phrenology can be
perfectly reproducible, but unless there's a real effect to measure it's just
chasing ghosts. So few people actually understand what these calculations are
doing. I don't think making things even more cook-by-numbers is really going
to achieve anything other than empowering more people to not know what they're
doing.

~~~
jonnycomputer
Actually the paper (our lab participated in it, though I wasn't personally
involved) doesn't really support the idea of standardizing. What it really
lends credence to is the idea of running multiple pipelines and averaging the
results, because the average of the results was really quite robust.

Put another way: if your effect of interest disappears with slight changes in
your pipeline, then the effect probably isn't that robust to begin with. That
doesn't mean it isn't a true effect, but on average it probably isn't.

~~~
Kednicma
I brought this up in a grand-cousin thread, but to what degree does numerical
analysis matter here? I mentioned [0], which is a must-read paper for folks
doing numerical work, and the author brings up two examples (sections 9 and
11) where averaging over an ensemble of pipelines could not just fail to
converge on the right answer, but actually mask systemic biases in the
ecosystem.

If your effect of interest disappears with slight changes to your pipeline
(perturbation), then maybe the effect is weak or chaotic or unstable, but also
maybe there is a systemic bias in your pipeline!
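
A toy illustration of that worry (numbers invented): averaging over an
ensemble of pipelines shrinks pipeline-to-pipeline variance, but it cannot
remove a bias that every pipeline shares.

    import numpy as np

    rng = np.random.default_rng(1)
    true_effect = 0.0
    shared_bias = 0.1        # e.g. a flawed assumption baked into every package

    # Each "pipeline" adds its own noise on top of the shared bias.
    pipelines = true_effect + shared_bias + rng.normal(scale=0.05, size=70)

    print(round(pipelines.std(), 3))   # small: the ensemble looks reassuringly consistent
    print(round(pipelines.mean(), 3))  # ~0.1, not 0: the consensus is confidently wrong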

[0]
[https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf](https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf)

~~~
jonnycomputer
Agree with fluidcruft here. I don't want to discount the importance of
numerical analysis, but the preprocessing and analysis packages that I am
personally aware of make substantively different decisions about how to
proceed. For example, in a psychophysiological interaction analysis (a
connectivity analysis that tries to show how connectivity with a seed region
changes in response to the task condition), SPM deconvolves the BOLD signal in
the seed region to get at a purported neural signal, while in FSL the
condition regressor is convolved with the haemodynamic response function. Both
are "correct" in that in an ideal world you should get similar results doing
either, but in the real world you very well may not. Similarly, different
packages may make different assumptions about the smoothness of the data
across brain regions when calculating whether the size of a cluster is
statistically significant.
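
To make the first divergence concrete, here is a rough sketch of the FSL-style
construction of that interaction regressor (the HRF here is only an
approximation of the canonical double-gamma, and the seed time series is
random stand-in data):

    import numpy as np
    from scipy import stats

    tr, n_vols = 2.0, 200
    t = np.arange(0, 30, tr)
    hrf = stats.gamma.pdf(t, 6) - stats.gamma.pdf(t, 16) / 6.0  # rough double-gamma HRF
    hrf /= hrf.sum()

    task = np.zeros(n_vols)
    task[20:30] = task[80:90] = 1.0                       # boxcar task regressor
    seed = np.random.default_rng(0).normal(size=n_vols)   # stand-in seed time series

    # FSL-style: form the interaction after convolving the task with the HRF.
    ppi_fsl = np.convolve(task, hrf)[:n_vols] * seed

    # SPM-style would instead deconvolve `seed` to a neural-level signal,
    # multiply by the raw task, and reconvolve; with noisy real data the two
    # regressors (and hence the fitted effects) can differ.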

And even when you use the same software package, there are various things you
might do differently: parameters you might set differently, and even the order
of certain operations may be switched around. A classic example has to do with
slice-timing correction. Since fMRI slices are acquired at different times, a
typical thing to do is to interpolate the slices so that you can treat them as
if they were all acquired at the same time. However, motion correction between
volumes, in which subject motion is partially corrected by jiggling each image
around until they maximally overlap, can be done before or after slice-timing
correction. The consequence is that motion artifacts propagate through the
volumes in different ways depending on the order you do it in, and neither
order is definitively better: it depends on how much motion is a problem in
the scans, how long it takes to acquire a volume (e.g., a TR of 2 seconds),
and so on.
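
A package-agnostic sketch of that ordering choice (the two functions are
hypothetical placeholders for the real interpolation and realignment
routines):

    import numpy as np

    def slice_time_correct(volumes, tr=2.0):
        # Placeholder: the real step interpolates each slice to a common time point.
        return volumes

    def motion_correct(volumes):
        # Placeholder: the real step realigns each volume to a reference image.
        return volumes

    volumes = np.zeros((64, 64, 30, 200))   # x, y, slices, time (toy shape)

    order_a = motion_correct(slice_time_correct(volumes))   # slice timing first
    order_b = slice_time_correct(motion_correct(volumes))   # realignment first
    # With real data the two orderings propagate motion artifacts differently;
    # which is less harmful depends on how much the subject moved and on the TR.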

------
mensetmanusman
Unfortunately, this is big news because it means all of brain-imaging science
suffers from the reproducibility crisis.

It means that all the TED talks saying 'we know this because fMRI' should be
assumed incorrect until proven otherwise with multiple double-blind studies
(and actually reproducing all of these studies is too expensive in the current
era).

~~~
disgruntledphd2
Yeah, but to be fair, fMRI has been problematic for years and years. When I
was doing my PhD (2008-11), I read a bunch of neuroimaging studies and was
very sceptical about the results because of the small samples and the
undocumented pre-processing.

That said, you shouldn't believe any science in most disciplines unless it's
been replicated by multiple labs/authors.

~~~
aeternum
Is it common for labs to attempt replication?

From my limited understanding, replication does not have much prestige and can
be difficult to get published. Is this a problem, and have research
universities found a solution?

~~~
disgruntledphd2
Yeah, this is a real problem. The incentive structure is such that people are
encouraged to find "novel" things. This leads to lots of data-dredging for
significant findings that are counter-intuitive.

And because of the bias towards novelty, replications are not regarded as
worth publishing and so these results remain unchallenged.

It's a hard problem, and has been known to be a hard problem for at least the
last forty years (within psychology, at least) and yet very little has been
done.

Brian Nosek and the OSF are doing good work here though.

------
etiam
From the article's lead paragraph:

 _This finding highlights the potential consequences of a lack of standardized
pipelines for processing complex data._

Couldn't it at least as well be concluded to 'highlight the potential
consequences should standardized pipelines for processing complex data be
introduced'?

If they had all been doing the same thing, aren't the odds that the results
would be just as fishy, only even harder to notice? Seems to me the way you
know something really is there is generally that it keeps showing up in
analyses with a multitude of different appropriate methods.

------
jacob-peacock
It seems like building a statistical consensus on analysis pipelines for given
study designs could be very worthwhile. I'm still surprised there's no
declarative statistical programming language to analyze RCTs with code like
"Maximize primary outcome, control age, account for attrition." Of course,
written this way it sounds extremely naive: well, _how_ do you account for
attrition? But people analyze data this way all the time, just with more
verbose code. And that's the point of declarative programming: focus on what
you want rather than how to get it.
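
As a rough sketch of the gap: the declarative spec in the comments is
hypothetical (no such language exists, which is the point), and below it is
the sort of imperative equivalent people actually write, here with statsmodels
and invented column names plus a deliberately naive attrition step.

    # Hypothetical declarative spec (no such language exists today):
    #   analyze rct.csv:
    #     maximize primary_outcome
    #     control   age
    #     account   attrition
    #
    # The verbose imperative equivalent (column names invented for illustration):
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("rct.csv")
    df = df.dropna(subset=["primary_outcome"])   # one crude way to "account for attrition"
    model = smf.ols("primary_outcome ~ treatment + age", data=df).fit()
    print(model.summary())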

(Also, not sure what the HN norms are, but to avoid self-plagiarization, note
that this is cross-posted from my Twitter account.)

------
Kednicma
Some of us like to talk of numerical accuracy as an arcane and trivial detail,
but there is no automated technique for always eliminating accuracy problems
[0], and this article shows that there are real consequences in applied
practice when our arithmetic isn't careful.
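
A three-line reminder of what "careful arithmetic" means here: floating-point
addition is not associative, so even the order in which a pipeline accumulates
values changes the answer, and compensated summation (as in `math.fsum`) is
one of the fixes.

    import math

    print(1e16 + 1.0 - 1e16)               # 0.0: the 1.0 is absorbed by the large term
    print(1e16 - 1e16 + 1.0)               # 1.0: same numbers, different order
    print(math.fsum([1e16, 1.0, -1e16]))   # 1.0: compensated summation recovers it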

[0]
[https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf](https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf)

~~~
dekhn
none of the problems reported here are because of numerical accuracy.

