Neuroimaging results altered by varying analysis pipelines

mola · on Aug 21, 2020

I wish university PR teams will stop peddling "cutting edge" science to the masses. I know people in the field know that fmri is problematic, but when the masses don't. We need to stop confusing the public with PR pieces and silly Ted talks. The public are not idiots, they just end up not trusting science, and rightly so because they are exposed to it mainly through half assed puff pieces that peddles half baked science. This ends up as an incoherent and wrong narrative. Thus causing mistrust.

colincooke · on Aug 21, 2020

I'm friends with one of the labs that took part in this study, so to give a perspective of how this was received in the field.

For the most part it wasn't a radical finding, but really good to quantify and see the effects. It also demonstrated the importance of including and documenting the parameters of the analysis pipeline.

Something else to keep in mind, especially when thinking about cognitive neuroscience (where scanning is an important tool), is that analysis are not done in isolation. Every experiment is motivated by behavioural results, neurobiological results or both. The goal of fMRI is to gain insight into the activity of the brain, but its also not very precise (given the scale of the brain).

Basically this data is analyzed in the context of what has been shown before, and how it matches our current hypothesis. This doesn't mean such variation is acceptable (see first part of my comment), but it does mean that this doesn't cause all results on fMRI to be invalid.

daedalus_f · on Aug 21, 2020

There have been large problems with fMRI studies for a long time, even leaving aside the potentially sketchy coupling between the BOLD signal and actual neural activity and the difficulties accounting for movement artifact.

This paper published in 2016 suggests that commonly used statistical packages for analysis of fMRI data can have a false discovery rate of up to 70%: https://www.pnas.org/content/113/28/7900

More fun, this poster presents the results of fMRI in dead salmon given an open ended mentalising task: https://www.psychology.mcmaster.ca/bennett/psy710/readings/B...

jointpdf · on Aug 21, 2020

> Subject: One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.

> Task: The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.

This is just so, so good. Thank you to this salmon for participating in science.

SubiculumCode · on Aug 21, 2020

The salmon experiment was ages ago, and just showed that the threshold for statistical significance was too liberal. At that time, when neuroimaging was quite new, it was common practice to just pick a threshold, say .001, and not do corrections for multiple tests. That proved to be too liberal. It wasn't a totally stupid guess though. Much of the brain's BOLD activity is correlated, so Bonferroni correction was stupidly conservative (not independent tests). It was only then that newer techniques (cluster-based correction) were adopted, which explicitly made an assumption about the spatial autocorrelation of the data. The one assumption of this technique was that this autocorrelation was the same throughout the brain, which ended up not being a good assumption, though computationally useful. Newer methods today try to model the autocorrelation in the data for cluster based correction, or have moved to other numerical techniques for correction of multiple comparisons re repeated resampling.

So, while the dead salmon experiment was eye opening to some when the neuroimaging field was very new, it is no longer relevant now.

Neuroimaging techniques are constantly improving, both from better analytical techniques and knowledge about the factors that can compromise the veracity of analyses, but they, including the present paper, show that neuroimgaging is bunk science or useless, which is most certainly incorrect.

jointpdf · on Aug 22, 2020

Thanks for this, it was interesting and helpful. My comment was mostly appreciating the sardonic and dry humor in the poster (and the concept of the study itself).

It was my first time hearing about this, and it struck me as the type of funny anecdote that could be slipped into a talk/lecture (I find it takes quite a bit of finesse to keep non-technical people engaged when talking about statistics and data science, so humor and relatable examples always help). So, it’s good to know the broader context.

phreeza · on Aug 21, 2020

I think Jack Gallant and colleagues get it right by basically applying the method that works well in machine learning. Train your model on one dataset, and then validate on a held out test set. That should get rid of the majority of all statistical artifacts (as long as you do the split right of course).

9214 · on Aug 21, 2020

Salmon story is more than a decade old at this point, and still as relevant as before; became an instant classic in my neuroscience-related MS programme. This Wired article elaborates on the motivation behind the study:

https://www.wired.com/2009/09/fmrisalmon/

SubiculumCode · on Aug 21, 2020

How is it relevant now? Please expand. Sure its relevant in the sense that fmri requires multiple tests, and that has to be accounted for correctly, but we do so now...with greater and greater accuracy as newer and newer techniques take hold, mind you, for a statistical problem that is very difficult to model. The salmon experiment showed that the thresholds scientists used at the BEGINNING of neuroimaging science were insufficient...this is a new field, methods take time to develop.The salmon experiment is not relevant now because we are not making the same mistakes we made then, over a decade ago.

9214 · on Aug 21, 2020

My comment about relevancy was more in line with @jdoliner's in the neighboring thread. I totally agree that tools and practices weren't quite there at the time and that things are different now.

The takeaway from Salmon experiment IMO is not specific to fMRI: you cannot trust a single technique or field "golden" standard (whether flawed or not), and should rather pursue accuracy across different pre- post-processing methods.

That's not realistically doable single-handily (at least in my experience), but in the current age it should become the norm to rely on publically available pipelines, probably with some sort of configuration files or even GUIs. I personally worked with fMRI only sparingly, so cannot comment on that in the context of the posted article.

LargoLasskhyfv · on Aug 22, 2020

I'll just link to a comment I made about 3 months ago, regarding this: https://news.ycombinator.com/item?id=23418356

jdoliner · on Aug 21, 2020

This is a big problem and not just in neuroscience. Reproducible science is inherently hard and there aren't a lot of great tools that make it easy. The key to solving it is having a way to track data lineage and a reproducible way to run the processing. I've been building a system that implements those ideas for the past several years called Pachyderm [0]. We've helped many scientists across different fields run their pipelines in a reproducible way. If you're suffering from this problem we'd love to chat with you.

[0] https://www.pachyderm.com/

SubiculumCode · on Aug 21, 2020

THis is why I am a big supporter of Configurable Pipeline for the Analysis of Connectomes (C-PAC) https://fcp-indi.github.io/docs/latest/user/index.html for neuroimaging. This software facilitates both exact reproducibility and testing variability in results as a function of multiple preprocessing choices, the latter allowing one to be sure that your results is not idiosyncratic to specific choices in proprocessing (e.g., smoothing, nuisance correction methods, etc).

fluidcruft · on Aug 21, 2020

I don't really see why standardized pipelines are the answer. That just makes things reproducible while ignoring the question of accuracy. Phrenology can be perfectly reproducible, but unless there's a real effect to measure it's just chasing ghosts. So few people actually understand what these calculations are doing. I don't think making things even more cook-by-numbers is really going to achieve anything other than empowering more people to not know what they're doing.

jonnycomputer · on Aug 21, 2020

Actually the paper, which our lab participated in, though I wasn't personally involved, doesn't really support the idea of standardizing. What it really lends credence to is the idea of running multiple pipelines and averaging the results. Because the average of the results were really quite robust.

Put another way. If your effect of interest disappears with slight changes in your pipeline then the effect probably isn't that robust to begin with. That doesn't mean it isn't a true effect, but on average it probably isn't.

fluidcruft · on Aug 21, 2020

Yeah, that's the better approach. I'm mostly responding to this link's framing in the subtitle

> This finding highlights the potential consequences of a lack of standardized pipelines for processing complex data.

Lack of a standardized pipeline is not a problem in my mind. Replication of findings on multiple pipelines is more convincing that results are not computational artifacts or analysis quirks. Replication using independently acquired data is of course even more convincing, but that's can be expensive and difficult in this field.

Kednicma · on Aug 21, 2020

I brought this up in a grand-cousin thread, but to what degree does numerical analysis matter here? I mentioned [0], which is a must-read paper for folks doing numerical work, and the author brings up two examples (sections 9 and 11) where averaging over an ensemble of pipelines could not just fail to converge on the right answer, but actually mask systemic biases in the ecosystem.

If your effect of interest disappears with slight changes to your pipeline (perturbation), then maybe the effect is weak or chaotic or unstable, but also maybe there is a systemic bias in your pipeline!

[0] https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf

jonnycomputer · on Aug 23, 2020

Agree with fluidcraft here. I don't want to discount the importance of numerical analysis, but the different preprocessing and analysis packages that I am personally aware of make substantively different decisions about how to proceed. For example, a psycho-physiological analysis, a connectivity analysis that tries to show how connectivity with a seed region changes in response to task condition, done in SPM will deconvolve the BOLD signal in the seed region to get at a purported neural signal, while in FSL, the condition regressor is convolved with haemodynamic response function. Both are "correct" in that in an ideal world you should get similar results doing either thing, but in the real world you very well may not. Similarly, different packages may make different assumptions about smoothness of brain regions in calculating whether the size of a cluster is significantly above zero.

And even when you use the same software package, there are various things you might do differently, parameters you might set differently; even the order of certain operations maybe switched around. A classic example of that has to do with slice timing correction. Since fMRI slices are acquired at different times, a typical thing to do would be to interpolate the slices so that you can treat them as if they were acquired all at the same time. However, motion correction between volumes, in which subject motion is partially corrected by jiggling around each image until they maximally overlap each other, can be done before or after slice timing correction; but the consequences of that is that motion artifacts can propagate through volumes in different ways depending on the order you do it in; but neither order is definitely better, because it matters how much motion is a problem in the scans, and how long it takes to acquire a volume (e.g. tr 2 seconds), and so on.

fluidcruft · on Aug 21, 2020

IMHO there are far bigger differences in how the different pipelines approach very basic things such as how to identify which parts of one person's brain corresponds to which part of a different person's brain and what sorts of corrections should be applied for the distortions. There's a lot of stuff done that "looks good" or based on analogy (for example by assuming the brain is essentially a block of jello or a fluid) or because a particular structure is needed for a type of desired analysis (i.e. the need for an invertible displacement field) that don't necessarily have any biologic basis. There's also the issue that some people just don't have "extra" or "missing" cortical folds depending on your perspective. Additionally a real problem in general is we can't really see enough of the microstructure of the brain with enough detail to see basic things that are seem to relate to processing capabilities like the cortical layer structure without histology.

SubiculumCode · on Aug 21, 2020

The field is not moving towards standardization. The field is moving to requiring (via reviewer requests) to test the results with multiple preprocessing parameters to probe the robustness of results.

mensetmanusman · on Aug 21, 2020

Unfortunate, this is big news because it means all the brain imaging science suffers from the reproducibility crisis.

This means that all the TED talks saying ‘we know this because fmri’ should be assumed incorrect until proven otherwise with multiple double blind studies (which is too expensive in the current era to actually reproduce all of these studies).

disgruntledphd2 · on Aug 21, 2020

Yeah, but to be fair, fMRI has been problematic for years and years. When I was doing my PhD (2008-11), I read a bunch of neuroimaging studies, and was very sceptical about the results because of the small samples, and the undocumented pre-processing.

To be fair though, you shouldn't believe any science in most disciplines unless its been replicated by multiple labs/authors.

aeternum · on Aug 22, 2020

Is it common for labs to attempt replication?

From my limited understanding, replication does not have much prestige and can be difficult to get published. Is this a problem, and have research universities found a solution?

disgruntledphd2 · on Aug 22, 2020

Yeah, this is a real problem. The incentive structure is such that people are encouraged to find "novel" things. This leads to lots of data-dredging for significant findings that are counter-intuitive.

And because of the bias towards novelty, replications are not regarded as worth publishing and so these results remain unchallenged.

It's a hard problem, and has been known to be a hard problem for at least the last forty years (within psychology, at least) and yet very little has been done.

Brian Nosek and the OSF are doing good work here though.

pc86 · on Aug 21, 2020

Interesting, what is your PhD in?

disgruntledphd2 · on Aug 21, 2020

Psychology (mostly psychometrics, social and health). I was studying the placebo effect, so read very very widely.

I know very little about the brain, except what I did in my undergrad (which wasn't a lot), but all the statistics in papers looked really, really dodgy to me.

etiam · on Aug 21, 2020

From the ingress:

This finding highlights the potential consequences of a lack of standardized pipelines for processing complex data.

Couldn't it at least well be concluded to 'highlight the potential consequences should standardized pipelines for processing complex data be introduced' ?

If they had all been doing the same thing, aren't odds the results would be just as fishy but it would have been even harder to notice? Seems to me the way you know something really is there is generally that it keeps showing up in analysis with a multitude of different appropriate methods.

jacob-peacock · on Aug 21, 2020

It seems like building stats consensus on analysis pipelines for given study designs could be very worthwhile. Still surprised there's no declarative statistical programming language to analyze RCTs with code like "Maximize primary outcome, control age, account for attrition." Of course, written this way it sounds extremely naive—well how do you account for attrition? But, of course, people analyze data this way all the time, just with more verbose code. And that's the point of declarative programming: focus on what you want rather than how to get it.

(Also, not sure what the HN norms are, but to avoiding self-plagarization, note this is cross-posted from my twitter account.)

Kednicma · on Aug 21, 2020

Some of us like to talk of numerical accuracy as an arcane and trivial detail, but there is no automated technique for always eliminating accuracy problems [0], and this article is showing us that there are real consequences to applied practices when our arithmetic isn't careful.

[0] https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf

dekhn · on Aug 21, 2020

none of the problems reported here are because of numerical accuracy.