fMRI software bugs could upend years of research (theregister.co.uk)
400 points by taylorbuley on July 4, 2016 | 180 comments

"and along the way they swipe the fMRI community for their “lamentable archiving and data-sharing practices” that prevent most of the discipline's body of work being re-analysed."

That's quite funny. My girlfriend recently finished her master's thesis on data sharing for neuroscience data and created a model for universal access to research data across institutions, but came to the conclusion that making researchers share their data is a bigger hurdle than actually implementing the system.

The main reason for the lack of sharing, she postulated, is that studies (which generate funding for the researcher who publishes them) can be done using just the raw data. Researchers who create data want to publish all the studies/papers themselves (because "they" paid for the data acquisition), and they also withhold the underlying data so that it is harder for others to falsify their results, which would, in their opinion, lead to funding going away.

Edit: of course there are privacy issues for the test subjects as well.

Meanwhile, the research group I work for publishes all of its data. In fact, the main reason I was hired by their professor as a programmer/interaction designer (through HN, no less) is that the next project will produce a mountain of data and he does not want it to collect dust after the project is done. I am supposed to make the data even more accessible to other researchers through a web interface.

Aside from the ethical motivations, other benefits are pretty obvious: at least one postdoc in the research department that I know of was recruited because he found a completely different use for data shared in the past and published a paper based on his own analysis.

(The research group works with mouse brains so privacy is not an issue)

I guess it depends on the research field and journal. In biology, most journals require you to deposit your raw genomic reads in the Sequence Read Archive, though this is still sometimes missed by editors and reviewers.

Random example of a data-deposit requirement, from the Journal of Plant Physiology: "If new sequence data are reported, insert the following: 'Sequence data from this article have been deposited at XXX under accession number(s) YY000000.'" http://cdn.elsevier.com/promis_misc/jpp_Instruction.pdf

Another random example from the Nature Group http://www.nature.com/authors/policies/availability.html

AFAIK only a few funding bodies require the deposit of raw data...

Please ask her to keep working on the problem and keep fighting the good fight. I think it's a political problem. If research grants were only given out if all data was made available (and while we're at it open source all the code produced) we'd be a huge step further down the right path.

There is very, very strong resistance to this sort of idea. The idealist in me likes to think that there shouldn't be much resistance; after all, scientists should be happy to have their results double-checked (for free!). I can only assume that a lot of people are very afraid of having their data double-checked (or possibly never had data in the first place). Additionally, access to data is a competitive advantage, as it's often hard to generate.

It's a rather sad reflection of society; there's no telling how many breakthroughs we're missing out on due to the lack of cooperation in science on this front.

There is also the motivation of the scanner owners to consider. They are paid to do scans. A cheap scan might be in the low hundreds of dollars per hour; a more commercial rate would be well into the thousands. We wouldn't want that to dry up now, would we?

Not really.

The scanner owners are almost all universities. One typically does have to pay to use the scanner (and quite a bit -- $500+ per hour is typical), but this is meant to cover the scanner costs. The equipment costs at least four million dollars, plus some ongoing costs (cryogens, typically a full-time tech or two, and maintenance; the power bill is probably not trivial either).

As someone who works in the biomedical imaging business and is also a fan of philosophy, I think this news will matter more to folks in the latter camp. For a couple years now philosophers have insisted that fMRI images prove there is no such thing as free will. Today's revelation should put an end to that whole line of reasoning (and the absurd amount of fatalism that it engendered).

(The back story: Apparently fMRI showed motor signals arising before the cognitive / conscious signals that should have created them, assuming we humans have free will. This has led to the widely adopted belief among philosophers that we humans act before we think, thus we don't and can't act willfully and freely. To wit, science has proven there is no such thing as free will; we're all just automatons.)

Just this week there was an article in The Atlantic on how we all must accept that we're mere robots and we don't really choose our actions (nor can we choose to believe in a god).

Ah well. It seems philosophers STILL haven't learned the importance of applying the scientific method before leaping to a conclusion -- sometimes just to check that someone else didn't just abuse the scientific method.

Why would this put an end to that line of reasoning? I'd expect it to flare up the debate, not end the debate. They didn't disprove behavior being computed, they demonstrated that a class of data supporting it was useless. The natural reaction to this isn't "okay we give up", it's "better go get some good data".

(Also, I don't like mixing the question "Is our behavior computed?" with the question "Does computed behavior imply no free will?".)

The problem with free will debates has to do with its definition. I think once people start hammering out the definition, free will either becomes a wimpy variant that people didn't want to talk about in the first place, or it becomes so ambitious that it's outside the scope of science to discuss.

If you define free will as the ability to generate and choose options based on constraints, then people get bored of that discussion because it seems to lack the freedom they want. If you define free will as the ability to escape biophysics, then it becomes an unscientific discussion of metaphysics.

But what people really want is freedom from biophysics, not a discussion of a bounded system that generates and chooses options. People want a reality where this state does not need to relate to the one before it, a non-Markovian world.

That's why people mix the discussion of behavioral computation and free will. People want the most ambitious form of free will, something too special for computers -- freedom from biophysics, freedom from the prior state's tyranny over its future state.

Well put. Serendipitously, Dan Dennett and Sam Harris just discussed this last week and released the recording of their conversation[0].

The conversation was interesting and disappointing.

Harris argues that the commonly-held notion of 'free will' is nonsensical. Dennett doesn't disagree but worries that people may construe this to mean that 'all bets are off' and the world will descend into chaos. Harris attempts to explain why this would not be the case; that we'd still have good reasons to imprison people who want to do harm. Dennett agrees but then restates the same worry differently. They never manage to get past this.

Still, a good listen.


Every time I see Dennett deliver his "free will exists, but it's not what you think it is" line, I imagine him in a Santa suit, with his grandchildren, when they realize that it's grandpa Dan in there. He must certainly say, "No, no, children, Santa exists, he just isn't who you thought he was."

Listened to that whole thing as well, and was just as frustrated. I prefer Brian Greene's 1 minute case for the absence of free will: https://www.youtube.com/watch?v=KBNzaXx6eKg

When a future state does not depend on previous states, it's called "randomness". Not quite unattainable, but does not sound too desirable either.

Whenever people claim we have free will, the first thing I do is generally to ask them to define it, and that is usually where that conversation ends up in a quagmire.

This has little to do with the study in question. It wasn't about weird metaphysics; it was that MRI scans could predict what choice people would make before they reported making a choice, showing that the unconscious mind makes choices before our conscious mind is even aware of them. I don't think this result is affected by this new issue, because it seems to have been replicated with old-fashioned EEGs too.

You can read more about this stuff here: http://io9.gizmodo.com/5975778/scientific-evidence-that-you-...

This adds to other work on split-brain patients, showing that our left brain explains away the choices we've made even when those explanations are totally false. E.g. they instruct the right hemisphere to pick up a toy soldier, then ask the left hemisphere (which controls speech) why the subject did that. The subject says "well, because I always liked toy soldiers when I was a kid", or some other made-up explanation. There are even cases where one side of the brain is paralyzed and can't move its arm, and the other side makes up explanations for why it doesn't want to move the arm and refuses to believe it's paralyzed.

I think it's possible humans don't have Free Will. Not just in a philosophical, or determinism vs nondeterminism sense, but in a very practical sense. That our actions are highly predictable. And that once you start to see the inner workings of the machine that is our minds, it starts to seem a whole lot less magical.

This intuition is hard to explain, but in general many systems seem to have "agency" until you understand how they work, and then they start to seem just like normal "non agency" things. As we learn more about how humans work, we start to look a lot less agenty. More on that here: http://lesswrong.com/lw/mb0/agency_is_bugs_and_uncertainty/

Conway and Kochen have proved a "free will theorem."


What are the actionable insights that have come out of fMRI studies? Even when properly conducted (no false positives), the conclusions that are often drawn have always felt dubious to me. Basically you are looking for regions of the brain that light up with various stimuli. Except that's as far as it goes, we don't yet understand much beyond that.

It's as if you figure out that your car is making a funny sound, and you can pinpoint where it is coming from, you can even reproduce the sound on demand - but you have no idea WHY it sounds the way it does.

I was a participant in a linguistics study [1] that compared native Polish and native German speakers, who were put into an fMRI scanner and played speech sounds, both Polish and German ones.

It clearly showed that speech sounds from your native language are processed in a different part of the brain than non-native speech sounds.

Yes, that does not explain much. But it leads towards all kinds of questions. And I found that fascinating.

[1] Silvia Lipski, Neurosci Lett. 2007 Mar 19, "A magnetoencephalographic study on auditory processing of native and nonnative fricative contrasts in Polish and German listeners."

A small correction: the cited study used MEG, not fMRI, as its modality.

The conclusion itself doesn't look very surprising to me. We already know that sound processing in general is the same in both hemispheres while speech processing is strongly lateralized. From continuity, I would say there should be a border where speech-like sounds begin to sound like speech and are therefore processed differently between the hemispheres. This study seems to estimate that border.

Thanks for the correction!

I misremembered because my professor's group did a lot of fMRI stuff, as well, and in the seminars we mostly talked about those.

Speech/language and the brain is a fascinating topic. There are resident linguists at major hospitals who are consulted before neurosurgery. Speech sounds are processed faster than other sounds in our brain. Rearranging sentences from active voice to passive voice, silently in your head, leads to easily seen activity in fMRI, distinct from non-linguistic mental actions. And so on.

When you're about to dig a chunk out of someone's brain and have a variety of routes to the offending region, it's handy to know what you might be able to avoid knocking out.

Well, the non-pop philosophers will understand that as a minor hit to the model of consciousness as solely concerned with post-facto rationalization and basically removed from the real-time decision making loop. Those studies were never strong evidence anyway, the protocols were pretty weak as they required asking the subjects to self-report at what time they "decided" to act.

The real free will debate is about both the definition of free will (for the compatibilists and libertarians) and a debate about whether the evidence for materialism outweighs the subjective experience of free will.

But yes, this will hopefully stop those confused pop-philosophy stories.

Interestingly, one of the best arguments opposed to free will is that people are terrible random number generators, which tends to be hand-waved away.

Which suggests people have already decided which side they believe and only try to rationalize it after the fact. ;)

I've never heard of that argument, though I admit I'm new to this discussion.

But given that people can and have built pretty good random number generators, by finding sources of entropy and even with pure math, thus sidestepping our limitations, the point is kind of moot IMHO.

The logic is odd but straightforward. Basically, free will requires people to make choices independent of the past state of the universe. I.e., if you replayed the universe up to this point several times, they would not always respond in the same way.

If that's the case, people should be able to make unpredictable choices by definition. However, when asked to be unpredictable, they fail. So at best people have a limited form of free will.

PS: This is something of an upper bound. At the other end, if, replaying someone's life, they make only a single unpredictable choice, then they may have free will. Similarly, if they make the same choices in a very large number of universes and a different choice in only one, you could still argue that that's free will. But the lower bound is not that meaningful a distinction.
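A toy sketch of that predictability test (purely illustrative; the sequence and the predictor are made up for this example, not taken from any study). A short-context n-gram predictor exploits the human tendency to over-alternate when asked to "be random":

```python
from collections import Counter

def predict_next(seq, k=3):
    """Guess the next symbol: the most common continuation of the
    last k symbols seen so far (a simple n-gram predictor)."""
    context = seq[-k:]
    counts = Counter(seq[i + k] for i in range(len(seq) - k)
                     if seq[i:i + k] == context)
    if counts:
        return counts.most_common(1)[0][0]
    return seq[-1]  # no history for this context: repeat the last symbol

def prediction_rate(seq, k=3):
    """Fraction of symbols the predictor guesses correctly."""
    hits = sum(predict_next(seq[:i], k) == seq[i]
               for i in range(k + 1, len(seq)))
    return hits / (len(seq) - k - 1)

# A strictly alternating "random" sequence (the extreme case of
# over-alternation) is almost perfectly predictable:
print(prediction_rate("HT" * 30))  # close to 1.0
```

A truly unpredictable sequence would hold this rate near chance (0.5 for two symbols); human-generated sequences reliably land above it.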

Well OK, but if you want nondeterministic behavior, you can find a source of entropy and make decisions based on a random number (e.g. flip a coin). You chose to do it and it invalidates the hypothesis, because if you replay the universe up to this point, you won't behave in the same way and the choices you just made by flipping a coin are independent of the past state of the universe.

So what am I missing, as I feel we are going into non-falsifiable territory.

If you decide to follow an entropy source and I see the same output as you, I can predict your behavior. Free will must be unpredictable, as it does not depend on the past state of the universe. (Unless you mean an entropy source outside the 'universe', which is a rather strange and circular requirement for free will.)

Now, you can argue for a lesser form of free will which is influenced by, but not dependent on, the state of the universe. However, that's progress even if somewhat obvious.

Quick, don't think of an elephant.

There are many things that, when asked, people are suddenly bad at doing. I don't think that's evidence.

I've always found that these "Quick, don't think of 'x'" things are a silly example. You end up thinking of X the moment you read X, just before it gets put in context of "not" and "think". It's like telling the parser not to tokenize a given word. Then there's automaticity, which you have to turn off, which is made harder by "Quick", further setting you up to think further of elephants till you get around to grasping the meaning of the phrase.

So what you're saying is that this is a good example of context affecting the outcome.

That I had no problem doing. What's the point?

To be fair, there are more arguments to be made contra free will (thought experiments like: At what point from conception to adulthood would free will develop (if at all), could free will be a merely probabilistic byproduct, etc...) and even if the details are wrong in these fMRI images, the general notion that a motor signal is generated before it appears in consciousness might still be valid.

Funny. One measures when the motor signal occurs and claims to measure when it appears in consciousness. Do me a favor: how can you be sure that what you measure is actually consciousness? Give me a "measurable" consciousness.

Moreover, Freud basically says that free will doesn't exist because he distinguishes between the conscious and the subconscious, the latter being a huge bunch of thoughts you don't make or change at will...

Moreover, predicting that a ball will fall when held above the ground is one thing; predicting the next second of the life of the universe is something else. Likewise, predicting that a subject will raise their left or right hand is one thing; predicting what their opinion on abortion will be is another story entirely.

All of this reminds me of studies done long ago that tried to predict whether someone was a criminal by measuring the shape of their skull...

Free will as in "there's something that can't be explained by chemistry and electrical signals in the brain" is just an attempt by some believers to rationalize their belief in a kind of supernatural "soul."

Exactly. The whole concept is flawed.

The brain computes an abstraction of reality (or, most likely, multiple competing abstractions) and decides on actions. There's your free will.

There is some kind of idea that consciousness "floats" on top of this abstract soup, like a running program in an operating system, making the calls, but this is most likely not how it works.

All these abstractions of sensory input, probabilities, decisions, etc., are probably what create consciousness, but I doubt that there is a meaningful boundary.

After-the-fact rationalization may be true or not; it's obvious that events can't be processed in no time at all, and that there must be some processing time behind the experienced "now."

But trying to build some meaningful philosophical arguments from it is probably a red herring.

After all, when trying to explain the reasons behind a decision made without explicit reasoning or external lists of pros and cons, it's often clear that there are many "hunches" that are not easy to summarize or evaluate.

The idea that after the fact rationalization negates free will has never been convincing. Observing the consequences of a "choice" then actively attempting to rewire the brain to make a different one in the future is just as much free will as anything else. Free will doesn't require instantaneous control over decision making.

Are you suggesting there is no room for quantum effects in the brain? Honest question, because I thought determinism was out of fashion nowadays for that reason.

Determinism and non-determinism are both just models that can be converted into each other fairly easily. Either way, that has nothing to do with free will, because getting random outcomes is not 'will' at all, let alone free will -- as ill-defined as that concept is.

I subscribe 100% to the 'free will as an emotion' description you gave elsethread, but to play devil's advocate, a non-deterministic, acausal process could be the gateway for a metaphysical soul to affect the physical mind. How this could be done without violating the expected probability density function of the non-deterministic process, I leave for others to explain.

Quantum effects are still physical. If your behavior is determined by the collapse of a wavefunction, is that more free-willish than behavior determined by a prior state? A double-slit photon doesn't choose a slit via free will, so I'm not sure how quantum mechanics helps inject free will into the brain.

It's hard to prove a negative but....

No one in "mainstream" neuroscience or cell biology takes these proposals seriously. Hameroff and Penrose have proposed some possible mechanisms (e.g., dendritic lamellar bodies), but none of these have panned out experimentally; for example, the DLBs are in the wrong place. They keep revising the theory around these data, but I think a lot of experimentalists have lost interest. I gather there are other more philosophical/first principles arguments against this too.

It would be neat if it were true though....

I don't quite get why quantum effects are required, or why determinism has to be removed, to explain "free will".

The deterministic effects in the brain must be negligible, as you can't reasonably reproduce the internal and external state of 100 billion neurons, swimming in an incomprehensibly complex chemical soup that is altered by the brain itself while it experiences a changing reality.

There's free will as in a decision-making system that generates different options and selects one based on constraints, which is something a computer can do, and then there's free will as in the ability to defy the mechanics of one's biophysics, which is just an indirect way of saying metaphysical soul.

I don't think empiricists can even discuss the kind of free will which defies causal explanation, the kind of mind which exists outside of the brain.

If you reject that kind of metaphysical soul from the discussion, what remains is whatever soul can be housed in a cage of biophysics. And whether you believe there is fundamental randomness in the universe, or whether given perfect information the universe becomes predictable, both perspectives are equally hostile to the kind of free will people dream about.

People who talk about free will want to escape biophysics, and the only way is to talk about the mind outside the brain, or the ghost outside the machine.

Free will is just the sensation of having a brain that models counterfactuals. It allows you to have emotions like hope and regret, since you are capable of imagining the world being different from how it is. It allows you to understand your role in the consequence-that-is-the-current-reality.

And here there's nothing to debate. Of course free will exists. It exists just like every other aspect of mental perspective exists. Nobody is debating whether love, hunger, confusion, or boredom exist just because they aren't quantified and modeled in a properly predictable manner. They're debating free will because religious people want a common experience and a long history of unaddressed confusion to leverage for their self-comfort. A "mystery" that not even scientists can touch.

I don't think there's such a thing as a metaphysical soul. What would that even look like? But I do think any sufficiently complex system can show patterns which could be perceived as random, although they may be hard to reason about. I believe consciousness, and the perception of free will, arise from such processes.

There are loads of other reasons not to accept dualistic free will.

So have these studies been invalidated by the software bug as well? If so, do you have any pointers? I. e. which were the infamous studies, and did they indeed use the faulty software to derive their conclusions? I'm genuinely interested.


This article preceded awareness of the bug.

> It seems philosophers STILL haven't learned the importance of applying the scientific method before leaping to a conclusion

To be fair, if you can apply the scientific method it's not really philosophy anymore, it's science. Philosophy exists in order to attempt understanding of domains we cannot rigorously apply empirical reasoning to.

"Apparently fMRI showed motor signals arising before the cognitive / conscious signals that should have created them, assuming we humans have free will."

Has anyone presented any theory where consciousness precedes neural activity that doesn't invoke hard-core dualism and an immaterial soul?

This paper (http://www.ncbi.nlm.nih.gov/pubmed/19423830) tried to prove exactly what you talk about and called it free will, but they used the SPM software that was invalidated.

I thought someone had found that while "fMRI showed motor signals arising before the cognitive / conscious signals", we could also choose to negate/override that signal, thus allowing free will. That is, free will is expressed by overriding the default.

The explanation I heard about that is that recognizing that we made a choice happens after we actually make the choice.

The real takeaway lesson from this research should be the vital importance of Open Data to the modern scientific enterprise:

> "lamentable archiving and data-sharing practices" that prevent most of the discipline's body of work being re-analysed.

Keeping data private before publication is (at this point in time) understandable. Once results are published, however, there is no excuse for not depositing the raw data in an open repository for later re-evaluation.

This is medical data. "Open repository" and "medical data on individuals" don't really mix well. Ask the next person denied health insurance based on an open-access fMRI scan (just to name the most basic/trivial example).

In fact, "simpler" things like heart rate are not so simple: combined with the other factors that are needed as controls, such data can be surprisingly hard to anonymize meaningfully. I'm not saying we should give up on open medical data, but it is definitely different from, e.g., open data from a physics or chemistry experiment, materials science, etc.

[ed: There are of course projects that collect data but limit access; in that sense the data can be "open" while still requiring permission and approval to work with. There are many such projects, like this one for Norway, for questionnaires and similar research: http://www.nsd.uib.no/personvern/om/english.html - research that uses standard clauses for possible re-use is also available to be combined with new studies and meta-studies. Often these data are only available in aggregate and/or anonymized form.]

Uh, where would you store it, though? IIRC from the time a relative was going through a chemistry PhD, they produced several GBs of raw data every half hour. Storage is cheap, but it's not that cheap...

It's not that much data. You're storing roughly a 2M-voxel image every 2 s or so. There are open repositories out there already: https://openfmri.org/
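For scale, a back-of-envelope estimate from those numbers (the 2 bytes per voxel is an assumption on my part, corresponding to raw int16 intensities):

```python
# Rough fMRI storage estimate, using ~2M voxels per volume and one
# volume every 2 s, with an assumed 2 bytes (int16) per voxel.
voxels_per_volume = 2_000_000
bytes_per_voxel = 2      # int16 raw intensity (assumed)
tr_seconds = 2           # one volume acquired every ~2 s

bytes_per_second = voxels_per_volume * bytes_per_voxel / tr_seconds
gb_per_hour = bytes_per_second * 3600 / 1e9
print(f"{gb_per_hour:.1f} GB per hour of scanning")  # 7.2 GB per hour
```

So a full hour of scanning is on the order of single-digit gigabytes, far from the "several GBs every half hour" of some chemistry instruments, and well within reach of a public repository.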

But how expensive was the lab and the instruments that produce that amount of data?

Not to mention the researchers' and post-docs' and students' time. By not paying to store it, you're risking the possibility that all of this money is essentially flushed down the toilet by a software bug because it cannot now be reanalyzed.

Think of the flipside: by paying to store it you are paying to run the risk that your research will be invalidated.

Whilst the incentives are all wrong in this situation, that's a good thing, right?

Even storing - and making accessible - a fraction of it would be a drastic improvement over the current status quo.

Yeah, it's pretty frustrating. I tried getting my own data from an fMRI study, but was told that the signed paperwork specifically disallowed this sort of thing. Not even my doctor could request it. The only option I have is to completely withdraw from the study, but that would be a pretty dick move. The other option is to wait until next year when the study's wrapped up and then request it. Though I'm not even sure that'll get me the actual data...

FWIW, typical raw fMRI data is mostly useless unless you know the design parameters of the study (stimulus timing, imaging onsets, etc.). Though there are some interesting data-driven analysis techniques, especially for resting-state data.

Many research studies in the U.S. are required to do "structural" scans (high-resolution T1 or T2) and send them for a safety read by a radiologist, for liability reasons. At the very least, this scan should be available directly from the hospital imaging department. If you are lucky, all the data will have been sent through the hospital PACS and the imaging department will indiscriminately dump everything. At a research-only center it might be more complicated, because such images are likely sent off-site for the safety read.

Right, they did actually give me the surface scan. It was really, really cool, though with really low resolution.

What got me interested in getting the data was this: [1]. It might be difficult to work with, but there was a lot of motivation to learn it.

[1] http://nbviewer.jupyter.org/github/GaelVaroquaux/nilearn_cou...

I completely agree that this can be frustrating. For what it's worth, the data from a research study aren't necessarily going to be clinically informative to a GP or even to a radiologist. Structural MRI data could potentially be interpreted, but the sequences collected for research are still different than what you'd get if you went in for a clinical scan.

And if you're interested in having the data for your own purposes, there may still be some liability concerns for the researchers. There's always an agreement to keep MRI data under strict supervision to avoid leaks of personal health information, so providing those data to participants (safeguards or no) could potentially count as a breach of protocol.

Agreed on leaking information. It makes total sense why they wouldn't want the participants to get access to the data.

Though, when I found that out, I did some naive scans to see if they had any public webservers that would have the research information. I didn't see any, but there were a huge number of HIPAA violations where many, many computers with health information were exposed directly to the internet. Their security team told me that it wasn't anything to be concerned about. A HIPAA complaint didn't go anywhere, either. Already spoke with a journalist to try to get some traction on it..

I was able to get a copy of my data from a MEG study, supposedly because it was part of my electronic medical record so I was entitled to it. Not sure whether that was sound reasoning, but it got the result I wanted.

I would send it back and ask for a detailed description of the null hypothesis they are testing, because they are not clear on this point at all:

>"All of the analyses to this point have been based on resting-state fMRI data, where the null hypothesis should be true."

They are not careful to explicitly define this null hypothesis anywhere, but earlier in the paper they describe some issues with the model used:

>"Resting-state data should not contain systematic changes in brain activity, but our previous work (14) showed that the assumed activity paradigm can have a large impact on the degree of false positives. Several different activity paradigms were therefore used, two block based (B1 and B2) and two event related (E1 and E2); see Table 1 for details."

This means that they actually know the null model to be false and have even written papers about some of the major contributors to this:

>"The main reason for the high familywise error rates seems to be that the global AR(1) auto correlation correction in SPM fails to model the spectra of the residuals" http://www.sciencedirect.com/science/article/pii/S1053811912...

If the null hypothesis is false, it is no wonder they detect this. In fact, if the sample size was larger (they used only n=20/40 here) they would get near 100% false positive rates. The test seems to be telling them the truth, it is a trivial truth, but according to their description it is correct nonetheless.

Edit: I was quoting from the actual paper.


Nononono. The null hypothesis here is definitely true. They take resting state data ("lie here while we scan your brain") and pretend that it was acquired during an experiment, using either block or event-related designs. Next, they look for differences between the fake blocks (or events).

The null hypothesis is that neural activity is the same across blocks. It must be true because the block boundaries are totally arbitrary. In fact, they were defined after the scanning, so they could not possibly influence neural activity (assuming the resting state data is stationary, which may or may not be true; they discuss that).
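The construction can be sketched with a toy simulation (a Python sketch, not the actual fMRI pipeline; all numbers here are illustrative): if the underlying "resting" series really is i.i.d. noise, then carving it into arbitrary ON/OFF blocks after the fact and t-testing the two groups rejects at almost exactly the nominal 5% rate.

```python
import math
import random

def two_sample_p(a, b):
    """Pooled two-sample t-test, using a normal approximation to the
    t distribution (fine at these sample sizes)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    t = (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))
    return math.erfc(abs(t) / math.sqrt(2))  # two-sided p-value

random.seed(0)
n_scans, block_len, n_sim = 200, 20, 2000  # toy numbers, not real scan parameters
rejections = 0
for _ in range(n_sim):
    resting = [random.gauss(0, 1) for _ in range(n_scans)]  # i.i.d. "resting" data
    # Impose an arbitrary block design after the fact: alternate ON/OFF blocks.
    on = [x for i, x in enumerate(resting) if (i // block_len) % 2 == 0]
    off = [x for i, x in enumerate(resting) if (i // block_len) % 2 == 1]
    if two_sample_p(on, off) < 0.05:
        rejections += 1
rate = rejections / n_sim
print(rate)  # close to the nominal 0.05: relabeling cannot create a real effect here
```

This is the sense in which the null "must be true by construction", but note it holds only because the simulated resting series has no temporal structure; the stationarity caveat above is exactly where that can break.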

The NeuroImage paper you linked to isn't disagreeing with this. They're saying that if you use the above procedure to make your null model, it doesn't work and it doesn't work because of the reason outlined in this paper.

>"The null hypothesis is that neural activity is the same across blocks."

You may have written this before seeing my other response, but I will repeat the point here. You have mentioned only one part of the null hypothesis, there are other assumptions being made that can be violated. From their description, it sounds like they discovered one that is causing the null model to make bad predictions.

Btw, I am using "null hypothesis" and "null model" interchangeably here. Really, "null model" should be the preferred term, but people are so used to "hypothesis".

Can you explain what those assumptions are then, even roughly?

I read the PNAS paper as being an extension of their previous work, not a contradiction of it. They've scaled the data set up (the previous paper had less data), added more packages (the previous paper used only SPM), and ran it as a group analysis (the previous paper was within subjects). They pretty much say that right in the PNAS text:

"That work found a high degree of false positives, up to 70% compared with the expected 5%, likely due to a simplistic temporal autocorrelation model in SPM. It was, however, not clear whether these problems would propagate to group studies. Another unanswered question was the statistical validity of other fMRI software packages. We address these limitations in the current work with an evaluation of group inference with the three most common fMRI software packages [SPM (15, 16), FSL (17), and AFNI (18)]."

Furthermore, the NeuroImage paper says that the AR(1) temporal autocorrelation model in SPM is actually not too problematic for event-related designs:

"Overall, a simple AR(1) model for temporal correlations appears to be adequate for fast designs (E1 and E2) at all three TRs. However, there is a massive inflation of false positive rates at short TRs that is particularly pronounced for slower (block) designs. "

If that's what is bugging you, just ignore the red and yellow bars in the PNAS figures.

>"Can you explain what those assumptions are then, even roughly?"

I have not looked at the code myself. I only read their description of it. As I quoted earlier:

"assumed activity paradigm can have a large impact on the degree of false positives"

If the assumed activity paradigm affects (what they are calling) the false positive rate, it is obviously part of the model. If they are arbitrarily varying it because they don't know which one is "best", it is clearly wrong.

The t-tests will pick up this wrong assumption if there is enough data to do so (even without knowing the details, remember they say it affects the "false positive" rate). The t-test is doing its job correctly; however, they are plugging in a known-to-be-incorrect null hypothesis, rendering this testing procedure pointless.

Maybe we're agreeing?

The ENTIRE POINT of the paper is that there are inappropriate assumptions baked into the defaults of FSL and SPM. (AFNI has an actual code-does-the-wrong-thing bug). These errors result in cluster-wise tests being inappropriately sized: when thresholds are set to p<0.001, there are more than 0.1% false positives.

They tested this by dividing a single stream of "off" data into segments of "ON" and "OFF" data. They do this in four different ways (two block paradigms and two event-related paradigms). Most fMRI experiments use a paradigm that is close to one of these, so this isn't a fishing attempt--it's meant to make the results broadly applicable.

They argue that, by construction, no ON data is present in this pseudo-data. Therefore, any ON calls must be spurious/false alarms/type I errors/etc, which lets you calculate the actual FWER and compare it with its nominal size.

You seem to be arguing that this step is not true--some other assumption, which they knew about(?), causes it. The NeuroImage paper says that different assumed activity paradigms have different false positive rates because the assumed model is dumb. The event-related paradigms alternate between on and off rapidly (and at random), which washes out the effects of unmodelled low frequencies. The much slower block designs, on the other hand, are especially susceptible to this problem because they spend a long time in each condition.
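That intuition is easy to check with a toy simulation (a sketch, not the SPM/FSL pipeline; the AR(1) coefficient, block length, and scan counts are all illustrative): give the "resting" series strong autocorrelation that the t-test does not model, and a slow block design rejects far more often than 5%, while a rapidly alternating random design stays roughly at the nominal rate.

```python
import math
import random

def two_sample_p(a, b):
    """Pooled two-sample t-test assuming independent samples
    (normal approximation to the t distribution)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    t = (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))
    return math.erfc(abs(t) / math.sqrt(2))

def rejection_rate(label_fn, phi=0.9, n=200, n_sim=500):
    """Fraction of simulations where ON vs OFF comes out 'significant'."""
    hits = 0
    for _ in range(n_sim):
        x, series = 0.0, []
        for _ in range(n):                      # AR(1) noise the t-test does not model
            x = phi * x + random.gauss(0, 1)
            series.append(x)
        labels = label_fn(n)
        on = [v for v, l in zip(series, labels) if l]
        off = [v for v, l in zip(series, labels) if not l]
        if two_sample_p(on, off) < 0.05:
            hits += 1
    return hits / n_sim

random.seed(1)
slow_blocks = lambda n: [(i // 20) % 2 == 0 for i in range(n)]     # long ON/OFF blocks
fast_events = lambda n: [random.random() < 0.5 for _ in range(n)]  # rapid random labels
block_rate = rejection_rate(slow_blocks)
event_rate = rejection_rate(fast_events)
print(block_rate, event_rate)  # block design badly oversized; event design roughly fine
```

The slow design concentrates its contrast at a low frequency, exactly where the AR(1) spectrum has most of its power; the fast random design averages over neighboring, highly correlated samples, so the unmodelled correlation largely cancels out.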

The current work extends that to look at other ways in which the default assumptions might be dumb in SPM, while adding in more data, different tests, and other packages.

I am open to the argument that the Block Design data (e.g., in Figure 2 in PNAS) does not add much because the NeuroImage paper already suggests those designs are screwed. However, it also says that the event-related designs are actually in pretty good shape: the voxel-based familywise error rates in the NeuroImage paper are about right (Figure 2), but the cluster FWER for the same design is through the roof in the PNAS paper.

Otherwise though, I don't see many unwarranted/undiscussed assumptions in the paper. If you want to keep claiming that the authors (and I) are making wild assumptions, it would be nice to have a vague sense of what they are and why they/I am wrong.

I think we are largely in agreement on the surface. The problem is that they are attributing the "positive" results to statistical error, when it isn't due to that.

The stats are working just fine and correctly telling us the null hypothesis is false. The problem is with the crappy null hypothesis, not the stats.

The experiment is set up so that there are no true positives.

When the threshold is p<0.05, then 5% of the calls should be positives, and they are all false positives. If anything else happens, SPM (etc) are borked. The only other alternatives are that the (fake) event design is somehow flawed in a way that exploits the non-stationarity of the resting-state data, or the analysis code for the paper is somehow buggy.

Are you trying to say that it's not the t-test per se, but something before it in the pipeline? If so, sure, of course, but that's not the null. The null is that chunks randomly labeled as ON are the same as chunks randomly labeled as OFF, and since it's not physically possible for the labels to affect the signal, this null must be true.

>"The experiment is set up so that there are no true positives."

No, a true positive is when the null model is called false when it is false. They say the null is known to be false. The errors here are actually the false negatives, in this case that is every "non-significant" result.

>"When the threshold is p<0.05, then 5% of the calls should be positives"

Only if the null model is correct. In this case they describe problematic aspects of their null model. Ie they know it is false.

> They say the null is known to be false.

Can you quote where they say that, because the article says:

"Because two groups of subjects are randomly drawn from a large group of healthy controls, the null hypothesis of no group difference in brain activation should be true. Moreover, because the resting-state fMRI data should contain no consistent shifts in blood oxygen level-dependent (BOLD) activity, for a single group of subjects the null hypothesis of mean zero activation should also be true."

There is also this quote:

"Resting-state data should not contain systematic changes in brain activity, but our previous work (14) showed that the assumed activity paradigm can have a large impact on the degree of false positives. "

This does not mean the null is false. That paper shows that temporal autocorrelations can also introduce spurious false positives. These factors affect block, but not event-related designs. They can be avoided by filtering--and maybe by averaging over many subjects (since these oscillations aren't locked to anything). In any case, they don't appear, as you can verify on the right side of Figure 1A.

Finally, the discussion has a paragraph describing reasons why the experiment might be flawed, but most of those possible objections seem answered.

The key point is that I am saying the second quote does say the null is false. This is where we differ. The actual null hypothesis they are testing is different from the one they actually care about and talk about. It includes additional auxiliary assumptions that they say are wrong.

What gets plugged in as the null hypothesis is not a problem in the realm of statistics. That is a choice of the researcher. The stats algorithm worked fine.

That sentence is closer to a link than actual content.

If you look at paper (14), it has the following graph: http://imgur.com/a/oxSqp It shows that the null model is perfectly appropriate for individual voxels under the event-related paradigms. The FWER there is indistinguishable from 5%, as it should be if the null were true.

It also shows that FWER for the null model is inflated by ~6x for blocked paradigms. As they say in the new paper, it is unclear if that would re-occur in a multi-subject experiment (that was apparently a criticism of the first paper).

Based on this, the absolute worst you can say is "the data on blocked paradigms is not very interesting", but that leaves their event-related data completely intact.

As for the stats thing, the actual formal test itself is obviously fine--it's a t-test! They're saying that the pipeline that runs between the raw DICOM output of the scanner and the t-test, which is an equally important part of the testing procedure, is flawed.

I'm happy to keep discussing this with you, but we seem to be going in circles. If "additional auxiliary assumptions" are the problem here, please describe them. If you think the 2012 NeuroImage paper proves that their null model is wrong in all cases, I'd love to read a few sentences describing how that follows.

>"They're saying that the pipeline that runs between the raw DICOM output of the scanner and the t-test, which is an equally important part of the testing procedure, is flawed."

Yes, in other words the null hypothesis is being rendered false, by some aspect of the analysis pipeline. I am saying that is the correct description of the problem.

Describing the problem as excess false positives is confused, because these are true positives.

They appear to have successfully identified a problem, but then described and analyzed it incorrectly.

>"The FWER there is indistinguishable from 5%, as it should be if the null were true."

If the null is true, the p-values should be samples from a uniform distribution. They should also have shown histograms of these; the 5% below 0.05 is not the whole story. There could be other ways the deviation manifests, and/or they just chose sample sizes powered to get that result.

"I'd love to read a few sentences describing how that follows."

We have already read the exact sentences that mean the null model is false, multiple times. You have agreed that something was wrong, but want to call it something other than the null model.

> Describing the problem as excess false positives is confused, because these are true positives.

This is deeply wrong. A "true positive" in this context would mean that the resting brain activity of multiple subjects is actually correlated with arbitrary length, randomized, moving test windows. Again, the data is untreated; they are imposing arbitrary test windows and looking for increases in brain activity (increased image intensity) correlated with the stimulus-ON sections of the pseudo-treatment.

From reading some of your prior comments on the subject, it seems prudent to point out that the causal link between stimulus and BOLD effect is quite literally observable: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130346/

>"This is deeply wrong. A "true positive" in this context would mean that the resting brain activity of multiple subjects is actually correlated with arbitrary length, randomized, moving test windows."

This is the same misconception I have been trying to dispel. The null hypothesis is not the inverse of that "layman's" statement, it is a specific set of predicted results calculated in a very specific way. It is a mathematical statement, not an English prose statement.

In this case, apparently one part of this calculation (this assumption about autocorrelation; whatever it is doesn't matter to my point) has led to such a large deviation from the observations that the null model has been rejected. The null model has been rejected correctly. This is not a false positive.

The problem here is not false-positives due to faulty stats. It is the poor mapping between the hypothesis the researchers want to test, and the null hypothesis they are actually testing.

The tools provided by statisticians look like they worked just fine in this case. If the researchers have decided to use them for GIGO, that is not a statistical problem.

> This is the same misconception I have been trying to dispel. The null hypothesis is not the inverse of that "layman's" statement, it is a specific set of predicted results calculated in a very specific way.

The entire point of this discussion is that the software is not calculating the null hypothesis that users expect. That the model it does in fact calculate is internally consistent is tautological and irrelevant (though actual bugs were found in at least one package).

As you yourself said: "I didn't read the code, or even the paper very closely." (https://news.ycombinator.com/item?id=12037207) Perhaps you should do?

There really is no reason for me to read the paper closely. They say that the null hypothesis is wrong, and they know exactly why. Then, like multiple people responding to me here, they also want to say somehow the null hypothesis is true.

Everyone who has questioned me also does admit there is something wrong with that hypothesis they tested. You do it too: "the software is not calculating the null hypothesis that users expect", but then you also want to say the null hypothesis they tested is true! Just bizarre; what underlying confusion is making people repeat something clearly incorrect? The null hypothesis cannot be both true and false at the same time.

There is a big difference between a "positive" result that is a "false positive" and a "positive" result due to a false null hypothesis. This is a clear cut case of the second.

> You have agreed that something was wrong, but want to call it something other than the null model.

That's not really what I've said.

In the previous NeuroImage paper, they generated null data by "imposing" block and event-related designs over resting state data. When they did a single subject analysis of that data with SPM, they found that the block designs had excess "positive[1]" results. However, analyzing data with an event-related design had the expected proportion of positive events.

Based on this, the event-related null model looks fine. The block designs may also be fine when enough data from multiple subjects is analyzed together. This makes sense because the problem was related to low-frequency oscillations that aren't synchronized across subjects.

However, you don't even have to assume this. The right-most panel of Figure 1A (and S5, S6, and S11) of the PNAS paper repeats this analysis and the voxel-by-voxel tests are appropriately sized, p<0.05 yields a 5% FWER.

> If the null is true, the pvalues should be samples from a uniform distribution. Another thing is they should have shown histograms of these. The 5% below 0.05 is not the whole story. There could be other ways the deviation manifests and/or they just chose sample sizes to be powered to get that result.

I agree that a p-value histogram would be nice, but I don't think it's essential. However, there's no way the sample sizes were cherry-picked, as you suggested: they get essentially the same result with 20, 40, and 499 subjects.


[1] As in the paper, I think of these as false positives. We know that there's no way the designs actually influenced neural activity. They are totally arbitrary and were assigned after the data was collected.

You seem to want to call these true positives instead. There are two reasons you might want to do this, but they both strike me as a little off. I suppose these are "true positives" from the perspective of the final t-test, but only because an earlier part of the analysis failed to remove something it should have. It seems weird to draw a line in the middle of the analysis like that.

Alternately, you might call them true positives because you're not convinced that the slice-and-dice procedure generates two sets of indistinguishable data. If so, you should say that and say why you think that. The one sentence you quoted does not count, for the reasons I outlined above.

> "I agree that a p-value histogram would be nice, but I don't think it's essential."

Sure it is, here is an example using R where only about 2% of the tests report p<0.05:

  Nsim = 1000; n = 100
  p = rep(NA, Nsim)                             # p-value storage
  for (i in 1:Nsim) {
    a = rnorm(n, 0, 1); b = rcauchy(n, 0, 1)    # different distributions: the null is false
    p[i] = t.test(a, b, var.equal = T)$p.value
  }
  sig = sum(p < 0.05, na.rm = T) / Nsim         # share of tests with p < 0.05 (~2%)
  hist(p, col = "Grey", breaks = seq(0, 1, by = .01), freq = F, main = round(sig, 4))
You can see from the histogram that there is a very clear deviation from the null hypothesis, and the percentage of p-values under 0.05 doesn't come close to telling the whole story of what is going on: https://s31.postimg.org/n9w3ydr63/Capture.jpg

> "I suppose these are "true positives" from the perspective of the final t-test, but only because an earlier part of the analysis failed to remove something it should have. It seems like weird to draw a line in the middle of the analysis like that."

Yes, precisely. That is the only perspective that matters because the p-value is being used as the actionable statistic in the end. This p-value has no necessary connection to what you think the null hypothesis was; it has to do with the actual values and calculations that were used.

Any other perspective is the perspective of someone who is confused about what hypothesis they are testing. This is, once again, not any fault of the statistics (maybe stats teachers... but that is another issue). Choosing an appropriate null hypothesis (that actually reflects what you believe is going on) is an investigation-specific logical problem outside the realm of statistics.

> That is the only perspective that matters because the p-value is being used as the actionable statistic in the end.

You're obviously entitled to your own perspective, but extracting the t-test from the rest of analysis like that is...idiosyncratic...at best.

Forget about all the MRI stuff for a second and imagine we were working on a grocery store self-checkout system. As you scan your purchases, it weighs each item and tests the weight against some distribution of weights for that product. If the weight is way out in the tails of the distribution, a human checks your cart; this keeps you from scanning some carrots while stuffing your cart with prime rib.

The checkout computer will occasionally flag a legitimate purchase. Perhaps the customer found an incredibly dense head of lettuce[1]. I would call this a false positive: the z-test (or whatever) is incorrectly reporting that the sample is drawn from a different distribution.

Now, suppose that the scale incorrectly adds some weight to each item. Maybe the sensor is broken or a bunch of gunk has accumulated on the tray. As a result, the checkout system now flags more legit purchases for human review. Are you actually refusing to call these "an increase in false positives" (since the z-test is working correctly, but it's been fed a database of accurate weights instead of "item weights + gunk weight")? How about if the item database is wrong instead (e.g., chips now come in a slightly smaller bag)?
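The gunk scenario is simple enough to simulate (hypothetical numbers throughout: a 500 g item with 10 g of natural variation, and 15 g of gunk on the tray): the z-test against the clean weight database is correctly sized until a constant offset is added to every measurement, at which point the flag rate explodes even though the test itself is computed exactly as designed.

```python
import random

random.seed(2)
MU, SIGMA = 500.0, 10.0  # hypothetical item: 500 g nominal weight, 10 g sd

def flag_rate(offset, trials=20000):
    """Share of legitimate purchases flagged when the scale reads
    true weight + offset but the database assumes offset == 0."""
    flags = 0
    for _ in range(trials):
        measured = random.gauss(MU, SIGMA) + offset
        z = (measured - MU) / SIGMA      # z-test against the clean database
        if abs(z) > 1.96:                # two-sided 5% threshold
            flags += 1
    return flags / trials

clean_rate = flag_rate(0.0)   # ≈ 0.05: nominal size, the tested null really is true
gunk_rate = flag_rate(15.0)   # far higher: the tested null ("clean scale") is false
print(clean_rate, gunk_rate)
```

Whether you call those extra flags "false positives" (the purchase was legitimate) or "true positives" (the null actually tested, clean-scale weights, really is false) is precisely the terminological dispute in this thread.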

Or, here's a purely statistical version. Suppose you fit two linear models to some data--one full model and one reduced one--then compare them via an F-test. However, the regression code has a bug that somehow deflates the SSE for the reduced model. I would say this procedure has an inflated false positive rate. You seem to be saying that this does not count as a false positive: the F-test is doing the right thing given its garbage input.

> Any other perspective is the perspective of someone who is confused about what hypothesis they are testing.

Again, this almost makes sense from the perspective of the t-test, but that is a bizarre perspective to take. An MRI researcher wants to know if the BOLD signal in a cluster, once corrected for various nuisance factors[2], varies across conditions or not. That ("or not") is a perfectly reasonable null hypothesis.

If the corrections are imperfect, it doesn't mean that fuzzy thinking led them to choose the wrong null hypothesis. At worst, it means that they were too trusting with regard to the correction (but let's not be too hard on people--this is a hard problem).


[1] Assume it's sold by the head and not by weight.

[2] And sure, you can write out what the motion correction, field inhomogeneity correction, etc. mean in excruciating detail if you want to actually calculate them.

>"extracting the t-test from the rest of analysis like that is...idiosyncratic...at best."

So much nonsense about statistics has been institutionalized at this point that this sounds like a compliment. To be sure, the issue here is minor compared to the dominant misconceptions like "the p-value is the probability my theory is wrong", etc.

>"Are you actually refusing to call these "an increase in false positives", since the z-test is working correctly, but it's been fed a database of accurate weights instead of "item weights + gunk weight")? "

Yes, the hypothesis should include extra uncertainty due to gunk if that is an issue. These are true positive rejections. It is a bad hypothesis.

>"How about if the item database is wrong instead (e.g., chips now come in a slightly smaller bag)?"

Once again, the hypothesis was wrong. We are right to reject it.

In both these cases calling it a "false positive" is misleading because it focuses the attention on something other than the source of the problem: a bad hypothesis. These are true positives.

>"A MRI researcher wants to know if the BOLD signal in a cluster, once corrected for various nuisance factors, varies across conditions or not. That ("or not") is a perfectly reasonable null hypothesis."

Then they should deduce a distribution of expected results directly from that hypothesis and compare the observations to that prediction. What seems to be going on is they say that is the hypothesis but then go on to test a different one. Honestly, just the fact that they are using a t-test as part of a complicated process like this makes me suspect they don't know what they are doing... Maybe it makes sense somehow, but I doubt it. Disclaimer: I haven't looked into the code or anything in detail.

Maybe this will help. I am thinking of the same distinction discussed by Meehl in Figure 2 of this paper:

"The poor social scientist, confronted with the twofold problem of dangerous inferential passage (right-to-left) in Figure 2 is rescued as to the (H → O) problem by the statistician. Comforted by these “objective” inferential tools (formulas and tables), the social scientist easily forgets about the far more serious, and less tractable, (T → H) problem, which the statistics text does not address." http://rhowell.ba.ttu.edu/Meehl1.pdf

Here, we are not even yet talking about the substantive theory, although that issue will also exist. There is also a difference between the statistical hypothesis (the mathematical object) and what the researcher wants to test. I think you are conflating those two hypotheses.

See mattkrause's comment below. I think you might not be understanding what they're testing. They're taking only resting state data, randomly sorting some of it into "pretend active state data", and then asking whether they get any statistically significant difference between these two groups when they shouldn't. But they do. That means the tests they used, the same tests many authors use, are giving false positives. The null hypothesis is "there will be no difference between resting state data and whatever data we get when we ask the subject to do some activity". They can "reject" that hypothesis using only randomly shuffled resting state data, so there's something wrong with the stats packages themselves.

I think that like mattkrause (and the authors of the current work), you have forgotten that the null hypothesis is something larger than condition 1 == condition 2. There are various other components, usually (somewhat misleadingly) referred to as assumptions, that can also cause the predictions derived from the null model to deviate from the data.

For stuff like a t-test, one parameter value (ie the mean of the distribution) gets all the attention. But this is wrong, it is only one part of the model being tested.

I'm aware that there are assumptions implicit in what the null hypothesis is. You are the one who keeps saying the authors don't even realize what those assumptions are, but you haven't pointed out anything besides what the authors said. What are the other faulty assumptions you've identified that the authors are missing? I guess you're referring to some sort of issue with power that you mentioned in your previous comment?

>"You are the one who keeps saying the authors don't even realize what those assumptions are, but you haven't pointed out anything besides what the authors said."

I am saying they are confused because they say "the null hypothesis should be true" under their conditions, when they know for certain that it is false! Therefore these "false positives" are not false at all. They are totally legit "true positives".

These authors are blaming the statistical test when the problem lies with their crappy choice of null hypothesis.

There may very well be other issues, but I have not inspected the code or done anything other than read the description in the paper.

Okay, why do you think the null is crappy?

In the absence of any information whatsoever, the idea that an analysis pipeline produces false positives at or below its nominal rate seems pretty reasonable.

But, they have some prior information.

Let's look at the E1 paradigm (2 sec on, 6 sec off). In the NeuroImage Paper (Figure 1A, 2A), the FWER on voxel tests is statistically indistinguishable from 5%. In other words, it's appropriately sized. They replicate this result in the rightmost panel of the PNAS paper, where it's also within the 95% CI around 5%.

Now, for the cluster inference, look at the left-most panel of PNAS Figure 1A. The E1 paradigm is the green bar. Using the defaults for FSL (left panel) and SPM (middle panel), the FWER is about 30% and 25%, respectively. That is not good.

I agree that the Block designs look awful in the NeuroImage paper, which makes it hard to say whether this phenomenon gets worse in the PNAS data. It's unfortunate that these numbers are going to be in the press release (70% is much sexier than 30%), but 6x the nominal rate is still bad.

>"Okay, why do you think the null is crappy?"

Because their goal is to determine if some sort of treatment has an effect. If the null is false for other reasons, then statistical significance can't be used to support the existence of a treatment effect. So these would be pointless, pedantic calculations.

@nonbel: Are you saying that when an fMRI is taken of a subject at rest, twice, the data here show that this is likely to be interpreted as the subject being in two different states? And that the researchers are ignoring this fact, instead insisting that we should be able to tell that these two states are the same (and then, perhaps, tell them apart from other states)?

I see a few ways this could come about: perhaps the way we record and model activity doesn't conform to the distribution we assume (I'm not sure if they assume a normal distribution here - or if that even makes sense given the nature of the data) -- or perhaps the issue is with taking 3d/4d data and "turning it into" an easy-to-model statistical model (like the normal distribution)?

At any rate, it does seem that they're saying we can't tell that one individual at rest, measured twice, is in the same (rest) state both times? Hence, their null hypothesis is bunk?

>"Are you saying that an fMRI, when taken of a subject at rest, twice - then what the data here show - is that this is likely to be interpreted as a subject in two different states?"

Yes, that is what they seem to be saying. I didn't read the code, or even the paper very closely. However, from what I quoted, they seem to be saying there is some assumption about autocorrelation that introduces what they call "false positives".

I am saying they have mischaracterized the problem. These are true positives.

@mattkrause: you beat me to it

Eh, you'll scoop me on something when it matters.

PS: If you are the song bird guy, I saw a talk of yours on youtube. Cool stuff.

Doesn't sound like a straight up bug, but rather unsound statistical methods which can happen with or without software. You get the same problem with finite element analysis software: the operator has to be aware of all the assumptions baked in, and has to ensure that the input conforms to them.

I'd say it's a bug, since the unsound statistical methods are incorporated into the three most common software packages that are specifically intended to be used for fMRI analysis. If these methods don't work well for fMRI data, fMRI software shouldn't be using them.

From the paper: "Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%." (http://www.pnas.org/content/early/2016/06/27/1602413113.full)

Choosing a bad algorithm for your software is as much of a bug as a null-pointer crash, and is something that needs to be tested for when building software.

No, a statistical modeling assumption is not the same as a software bug. There is no perfect model, and nonparametric tests are approximations at best. The important thing is that the publication reports the modeling assumption so that readers can judge the conclusions relative to that choice. Now we know to judge results using the Gaussian model to have a high false positive rate.

On the other hand, no researcher would knowingly publish results affected by software bugs, e.g. null pointers or arithmetic errors... hopefully.

And it's a little disheartening when it takes this long to find such a significant bug. Lots of work gone to waste. I wonder if the data can be rerun? Sounds like lots of the raw data wasn't archived correctly or at all.

The real "bug" here isn't even really statistical. It is the usual logical issue that has been well known for years (yet the problem has only grown and spread):

"In most psychological research, improved power of a statistical design leads to a prior probability approaching 1/2 of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by “success” is very weak, and becomes weaker with increased precision. “Statistical significance” plays a logical role in psychology precisely the reverse of its role in physics." http://cerco.ups-tlse.fr/pdf0609/Meehl_1967.pdf

In other words, the fMRI researchers have been testing a null hypothesis known to be false beforehand, so statistical significance tells us nothing of value, and definitely nothing about the theory of interest.

This paper is really interesting, because the authors don't seem to realize that this is what they are discovering. Getting a "statistically significant" result is the correct answer for the tests to return, these are not false positives. The problem is a poorly chosen null model.

Are you sure about that? As I understand it, they are testing a really-really-null hypothesis.

Specifically, they started with some resting state data (a condition where the subjects just lie there, attempting to remain alive). They then took this data and imposed fake task structures over it, as if the subject were doing two different things (e.g., condition #1 starts at t=0, t=1 min, t=3 min, t=6 min; condition #2 starts at t=2, t=4, t=5 min). Once this is done, the task+data is submitted to their analysis pipeline.

The null hypothesis (fake condition #1 == fake condition #2) has to be true, by construction. This assumes that the original data doesn't have any structure, which may or may not be true (they discuss it a bit towards the end of the paper).
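To make the construction concrete, here's a toy simulation (my own sketch, not the authors' actual pipeline): impose a fake two-condition split on "resting" time series and count how often a two-sample t-test flags a difference that cannot exist. The design, scan counts, and noise parameters below are all made up for illustration.

```python
# Toy version of the null-by-construction design described above.
# White noise behaves as the theory promises (~5% false positives at
# alpha = 0.05); temporally autocorrelated noise does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(make_noise, n_scans=60, n_sims=2000, alpha=0.05):
    # Fake task structure: first half of scans = condition 1, rest = condition 2
    labels = np.arange(n_scans) < n_scans // 2
    hits = 0
    for _ in range(n_sims):
        ts = make_noise(n_scans)
        _, p = stats.ttest_ind(ts[labels], ts[~labels])
        hits += p < alpha
    return hits / n_sims

def white(n):
    return rng.normal(size=n)

def ar1(n, rho=0.9):
    # AR(1) noise: each scan is correlated with the previous one
    ts = np.empty(n)
    ts[0] = rng.normal()
    for t in range(1, n):
        ts[t] = rho * ts[t - 1] + rng.normal()
    return ts

print(false_positive_rate(white))  # close to the nominal 0.05
print(false_positive_rate(ar1))    # far above 0.05: autocorrelation bites
```

That second number is the thread's point in miniature: the t-test's independence assumption, not the null hypothesis itself, is what breaks.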

>"The null hypothesis (fake condition #1 == fake condition #2) has to be true, by construction."

The null hypothesis is not only that condition 1 == condition 2, it involves other assumptions being made. From their own description, they knew it was false before even doing this analysis and used this knowledge to design the study. Apparently there is some incorrect assumption about autocorrelation being made.

It seems the authors did not really grasp what hypothesis they were testing, leading to their confused (but still productive) description of the problem. I go into more detail in this post: https://news.ycombinator.com/item?id=12032772

You're correct that the main finding reflects flawed statistical assumptions made by the major neuroimaging packages (or so these authors contend), but they did uncover a specific bug in one of the three packages (AFNI):

"... a 15-year-old bug was found in 3dClustSim while testing the three software packages (the bug was fixed by the AFNI group as of May 2015, during preparation of this manuscript). The bug essentially reduced the size of the image searched for clusters, underestimating the severity of the multiplicity correction and overestimating significance ..."

The paper has been rebutted by other researchers who argue that the original results hold:

"This technical report revisits the analysis of family-wise error rates in statistical parametric mapping - using random field theory - reported in (Eklund et al., 2015). Contrary to the understandable spin that these sorts of analyses attract, a review of their results suggests that they endorse the use of parametric assumptions - and random field theory - in the analysis of functional neuroimaging data. We briefly rehearse the advantages parametric analyses offer over nonparametric alternatives and then unpack the implications of (Eklund et al., 2015) for parametric procedures."


Sort of. The rebuttal (by Flandin and Friston) suggests that properly-applied parametric statistics of the kind they favor are valid. Eklund et al. wouldn't disagree with that because their own findings support it, but they would point out that not all researchers necessarily adhered to the conservative statistical approach that F&F discuss. More specifically, both sets of authors describe the importance of using a conservative "cluster defining threshold" to identify spatially contiguous 3D blobs of brain activation. Eklund et al. use their findings to raise the question of whether the bulk of fMRI reports were conservative in this regard.
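For readers unfamiliar with the term, here's a toy illustration of what a cluster defining threshold does (my own sketch using scipy.ndimage on simulated noise, not any package's actual pipeline; the thresholds are just the usual one-sided z-values):

```python
# Toy illustration of a cluster defining threshold (CDT): voxel-wise
# statistics are thresholded first, then contiguous supra-threshold
# voxels are grouped into clusters whose spatial extent gets tested.
# On pure noise, a lenient CDT lets many more voxels through.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(7)
z_map = rng.normal(size=(32, 32, 32))       # toy "statistical map" of z-scores

# ndimage.label groups face-adjacent supra-threshold voxels into clusters
_, n_lenient = ndimage.label(z_map > 1.64)  # CDT of p ~ 0.05 (one-sided)
_, n_strict = ndimage.label(z_map > 3.09)   # CDT of p ~ 0.001

print(n_lenient, n_strict)  # far fewer clusters survive the stricter CDT
```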

"Further: “Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape”."

This has nothing to do with bugs and everything to do with bad statistical analysis. It's Google Flu all over again.

It's all relative. At the right level of abstraction, bad statistical analysis is a bug.

"Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape."

In other words, researchers cut corners. You should never assume that something is a certain way without rigorously proving it. How did these papers make it past peer review?

Peer review in biology-related fields doesn't pay much attention to the details of your code. Only recently have a few journals begun asking authors to submit source code when the manuscript is mainly about a new analysis method. So even if you're reviewing a manuscript about a computer program, you most likely won't read the source code alongside the paper.

Another issue is that statistical analysis is something of a missing piece in many researchers' training. It may not be that "researchers cut corners"; they may simply not know the prerequisites of the analyses they use. Even this year, 2016, when talking with a colleague at my institute, a senior postdoc who uses the t-test every day, I found he didn't know a single one of its assumptions.

They were probably peer-reviewed by biologists, not statisticians. Peer review is mostly just a good think and read for about a month, not a perfect analysis.

...as long as by "read for about a month" you mean "sit on it for a month, then read for a few hours".

haha, so true

Developing a statistical threshold requires some null hypothesis, which very often takes Gaussian or linear form. The paper on which the method is based likely states their assumptions clearly, allowing their result to pass peer review.

Papers merely employing a method are rarely reviewed by the same peers as the methods papers themselves, and their authors are often neither aware of the method's assumptions nor state them.

This is how this happens, at least in neuroscience. You can look down on it and judge, but neuroscience has not really found its Kuhnian paradigm yet, so be nice.

I reckon there is a vast number of similar problems in other studies across most fields. Linear regression, ANOVA, and t-tests are widely used techniques with an assumption of Gaussian errors that is rarely checked. I wonder how many of these papers would become null results if switched to nonparametric tests...
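Checking the assumption is cheap, too. A minimal sketch (simulated, deliberately skewed data; nothing from any actual study):

```python
# Minimal sketch of checking the Gaussian-errors assumption before
# reaching for a t-test, plus the rank-based fallback. Samples here are
# lognormal, i.e. heavily skewed on purpose.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=40)
group_b = rng.lognormal(mean=0.0, sigma=1.0, size=40)

# Shapiro-Wilk: a small p-value means "not plausibly Gaussian"
_, p_normal = stats.shapiro(group_a)

_, p_t = stats.ttest_ind(group_a, group_b)      # assumes Gaussian errors
_, p_mw = stats.mannwhitneyu(group_a, group_b)  # rank-based, no such assumption

print(p_normal < 0.05)  # True: normality rejected, prefer the rank test here
```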

fMRI papers tend to be written by neuro-psychologists or similar folks. Not engineers, statisticians, or physicists. They get reviewed by them too.

n=30 ought to be enough for anybody

Depends how big the effect size is. For putting a loaded revolver next to your head, and pulling the trigger, n=30 is plenty.

n >= 30 is not the same as a Gaussian distribution


The dead salmon study seems relevant here in discussion of how fMRI is used, especially the theory-ladenness of observations.


Maybe I missed the link, but the full text of the readable, relevant, and enjoyable article that blog post discusses is here:

The principled control of false positives in neuroimaging

Bennett, Wolford, Miller 2009


Thanks. I'm on a mobile in a foreign land, so had issues tracking down a useful link.

And from a couple of days before.


Just goes to show that when you're doing science you need to test and validate your experimental methodology, including the tools you use. In computer vision, it's common to need to do some kind of calibration for many algorithms, which can usually reveal some kind of statistical error or problem. I wonder why none of the researchers thought to do some very simple validation of the data?

And I wonder if the software was at one point correct and then this bug was introduced at a later point? Many times it feels like after a company does a formal scientific validation they never do it again despite the fact they have engineers constantly working on the code...

Well, I think the problems with interpreting fMRI scans have been at least vaguely known since that time a dead salmon "activated" its neurons when asked to judge the emotion of a person from a photo; this was in 2009.


The dead salmon article is a bit of a red herring here. It's a clear demonstration that a shoddy statistical approach can undermine fMRI findings. Critically, the implications of the current paper extend to even research that has been rigorously analyzed using field-standard software. Statistical issues are at the heart of both papers, but the newer paper identifies problems that are subtle and ubiquitous.

I don't see how that's really relevant. The point of the dead salmon paper was "do your damn FWE corrections". The point of this paper, rather, is "popular tools make unsound assumptions" (or, inversely, "researchers are violating the assumptions made by their tools").

The problem in the salmon paper is easy to fix. The problem in this paper is trickier.

If this turns out to be true, it could be one of the most expensive bugs in computer history.

Only if you assume that those studies would have been of value without the error. Obligatory xkcd: https://xkcd.com/1453/

Does it mean studies like these are likely bunk?

And by "bunk" I mean they don't show what they claim.

"How X looks like" -- what's with this gramatical mistake? I see it everywhere. Is it a regional thing?

It's a common error for native Swedes; it's a direct translation from Swedish. Could be true for other languages as well, as I believe most Germanic languages use "how".

English uses "how" too: it's correct to say "how X looks", with no "like".

"what … like" is synonymous with "how".

It's already been known for several years that almost all MRI brain scan research is wrong; what exactly is new here?

There's this weird snobbishness about fMRI: it's uniquely terrible, the people are hacks, etc. It seems particularly common amongst first and second-year grad students who are doing something they think is "harder" science. I hate it.

In the right hands, fMRI can be a really powerful technique for probing neural activity in healthy human subjects; in fact, it's one of the only ways to do so that has decent spatial resolution and thus lets you link brain structure and function.

It certainly does have problems. There are plenty of ways to subtly mess up the data analysis or over-interpret results. The experimental design is often lacking, etc.

However, I think these largely reflect the very low barrier of entry to fMRI research--all you really need is a laptop and somewhere willing to sell you scanner time (almost all major universities or hospitals)--rather than some intrinsic limitation of the field. The good work remains very good.

It's not that the research is 100% wrong, but the way the research is reported in the news is generally wrong. No, researchers can't really tell if you're thinking about a sailboat from looking at an fMRI, despite whatever the newspaper says.

Source: worked in an fMRI research lab for 4 years.

There's a fair bit of junky data and quite a few analysis methods that cannot compare data accurately between magnets, position wrt isocentre, or even processing software revisions. But your claim is a lot bolder than that.

> There's a fair bit of junky data and quite a few analysis methods that cannot compare data accurately between magnets, position wrt isocentre, or even processing software revisions. But your claim is a lot bolder than that.

What is the prevalence of each of those types of errors? Since AFAIK many of those errors occur in 30%+ of papers, unless you're assuming close to 0% independence, my claim doesn't seem especially bold...

The methods I am familiar with, mostly dealing with volumetric measurements, registration, and automated segmentation were either robust or had well characterized limitations.

I take some offense to blanketing the entire field of MRI based on a couple articles pointing out that some (admittedly a lot) fMRI experiments are statistically unsound.

How would you estimate the prevalence of inaccurate research in the field, if not by multiplying each (known) source of error by its estimated prevalence, and also using some estimate of the independence between errors? (And of course including general sources of error that affect all scientific research and aren't specific to any particular field.)

And as far as independence goes, I haven't seen any research to suggest these errors have anything less than 100% independence, although if there is any research suggesting that these errors (or any other errors in science research) are correlated then I'd love to see that.
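For what it's worth, the arithmetic of that estimate is simple under full independence (the rates below are made up purely for illustration, not taken from any survey):

```python
# Under independence, the fraction of papers free of every known error
# type is the product of (1 - prevalence); everything else has at least
# one error. Illustrative rates only.
error_rates = [0.30, 0.30, 0.10]   # hypothetical per-error-type prevalences

p_clean = 1.0
for p in error_rates:
    p_clean *= 1.0 - p

print(1.0 - p_clean)  # fraction with at least one error: 0.559
```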

> almost all MRI brain scan research is wrong

Source for a layman?

I wasn't in fMRI research, but some fellow students in my lab were. I do know there was a study with a dead salmon which showed up as having "active" brain regions. I have to find it but am on my phone.

It's easy to imagine the main difficulty with this is mapping signal to actual thoughts and regions. The underlying biology and physics are really complex to reduce.

The salmon thing was a statistical problem, not a technical one.

fMRI data generate an incredible number of data points: imagine a movie, but in three dimensions, so you get a sequence of x/y/z-volumes. A typical scan has ~128 to 256 voxels in each spatial dimension, for ~1 million voxels per volume.

This means that if your analysis contains voxel-by-voxel tests, you're going to be running a huge number of them. Even if each test has a fairly low false-positive rate (say 0.1%), there are still a huge number of tests, and thus, a huge number of false positives.

There are principled ways of correcting for this; there are also hacky "folk methods" like setting a more stringent false positive threshold. The fish poster argues that the latter doesn't work, using a deliberately silly example.


> The salmon thing was a statistical problem, not a technical one.

How is that relevant to the question of whether or not most fMRI research is wrong?

1. It's trivially checkable for any existing individual paper. You skim the methods or search for "multiple comparisons" or "false discovery rate" or something like that.

2. For papers where this wasn't done, the already-collected data can be reanalyzed. In fact, you can often correct it without access to the raw data (at least approximately).

3. It means that future papers (where future is somewhere after 2008-9 here) can be done correctly from the get-go; it's not a limitation of the technique or the signal itself.

If you believe that the majority of papers in the field are not incorrect, what are the assumptions behind your estimation of the numbers?

The significance of the dead salmon paper is waaaay overblown.

That paper was a warning to researchers that family-wise error corrections are, indeed, important. A small (but non-negligible) portion of researchers were not doing these corrections simply because "the results looked nicer".

The dead fish guys wanted to demonstrate the extent to which this could produce false positives. They (rightly) make no comment on the soundness of fMRI as a tool for cognitive science.

Searching for "dead salmon fMRI" turned up the following:


Back when I was involved in MRI research, multiple comparisons correction was the rock that many a paper in the field crashed upon. Neuro psych. suffers from weak stats, which is a far cry from saying either fMRI itself is meaningless or saying MRI as a whole is.

what is your field? Neuroscience?

Electrical engineering (medical imaging technologies)

This is the best I could find for the paper; not what I originally meant to cite. http://onlinelibrary.wiley.com/doi/10.1002/scin.5591761320/a...

Here are a couple Ben Goldacre articles on neuroscience in general:


For fMRI research specifically, I seem to remember that almost all of the analysis software is closed source, and that comparing the readings between software versions isn't actually meaningful even though a lot of papers do so.

> almost all of the analysis software is closed source

This is almost certainly not true. BrainSurfer is closed-source, but almost everything else I can think of is open-source.

* SPM is available under the GPL: http://www.fil.ion.ucl.ac.uk/spm/software/

* FSL (including the source) is available under their own noncommercial "As-Is" license. FSLView seems to have its own separate GPL license: http://fsl.fmrib.ox.ac.uk/fsldownloads/

* AFNI used to require (free) registration, but is now GPL: https://afni.nimh.nih.gov/afni/about/legal

* The BIC-MNI tools + source appear to be freely available under an "as-is" + attribution license. I only checked a few; please let me know if any are not and I will nag people accordingly: https://github.com/BIC-MNI

* Caret is GPL: http://brainvis.wustl.edu/wiki/index.php/Caret:About

* FreeSurfer has its own license which is very open: http://www.freesurfer.net/fswiki/FreeSurferSoftwareLicense

I suspect you're thinking of this, which reported differences between FreeSurfer versions: http://journals.plos.org/plosone/article/comment?id=info%3Ad...

That turned out to be a combination of 1) algorithmic changes 2) Different random seeds (some of these functions use Monte Carlo-like methods), and 3) 32 vs 64 bit platforms.
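Point (2) is easy to reproduce in miniature: any Monte Carlo estimate drifts run to run unless the seed is pinned (a toy example of the general issue, nothing to do with FreeSurfer's actual code):

```python
# Toy illustration of the random-seed point: a Monte Carlo estimate of pi
# only reproduces bit-for-bit when the RNG seed is pinned.
import numpy as np

def mc_pi(rng, n=100_000):
    # Fraction of random points in the unit square that land inside the
    # quarter circle, times 4, estimates pi.
    xy = rng.random((n, 2))
    return 4.0 * np.mean((xy ** 2).sum(axis=1) < 1.0)

seeded_a = mc_pi(np.random.default_rng(1234))
seeded_b = mc_pi(np.random.default_rng(1234))

print(seeded_a == seeded_b)  # True: same seed, identical result
```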

Err...I was actually thinking of Brain Voyager, which is closed source and incredibly expensive. However, luckily for me, there is also a closed source package called Brain Surfer(!)

Actually, for fMRI studies most people use MATLAB and share their scripts, so it's not really closed-source. But I suspect many people in the field won't bother to read the code of a widely used and well-cited MATLAB script.

That's actually the same article on two sites. Itself just a layman's piece on an actual article in Nature Neuroscience.

Oops, deleted the second link. The reason I didn't link to the original journal article is that JumpCrisscross specifically asked for a source for a layman.


I thought that fMRI hasn't been taken seriously since they scanned a dead fish and found it was thinking.

>fMRI hasn't been taken seriously

I don't know where you got that (very strange) impression.

The dead fish paper showed that multiple comparisons yield false positives, which is a problem that is much broader than fMRI methodology. With proper method, fMRI is a very reliable and insightful tool. It's just very difficult to do properly.

That was just a false positive.

fMRI is great for generating headlines and pretty pictures for the popular media.

However most neurologists view the vast majority of fMRI research as junk science.

Title should really read "fMRI" instead of "MRI". The referenced journal article is titled "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates".

Fwiw there's a response to an earlier (preprint) version of that paper from some of the developers of the packages in question.

Preprint version of the "Cluster failure" paper, from last year: http://arxiv.org/abs/1511.01863

Response: http://arxiv.org/abs/1606.08199

Flandin & Friston's response is interesting because it essentially endorses the findings of Eklund et al. (except for one element of the Eklund analysis that they suggest is a modest error). F&F believe that by setting one parameter correctly (i.e., using a conservative cluster forming threshold) the validity of their preferred parametric statistical approach is upheld. Eklund et al. might quibble because their take-home message is that non-parametric methods should be used instead, but their findings are not misrepresented by F&F.

Regardless, an open and important question is how often other authors used a sufficiently conservative cluster forming threshold for their fMRI analyses. If nothing else, Eklund et al. will cause future reports to be more cautious in this regard.

The same tools (AFNI, SPM, and FSL) are used in a lot of other MR research protocols as well - not just fMRI.

They are but 3dClustSim (the culprit) is essentially used only for fMRI.

This is right and wrong: AFNI's 3dClustSim did contain a specific bug (since corrected) and is used for fMRI analysis, but the problem that the authors identified is more general than that and encompasses all of the tested packages.

Ok, we consed an f.

What a crap article.

This is why I always eat my science with a large helping of humble pie and extra skepticism.

So people don't perform complex, goal-focused motor tasks without having a goal ahead of time after all.

Wow, philosophy people.

EDIT: Cry about it all you want. It won't change the fact that in 100 years people will look back and wonder if an entire academic discipline was afflicted with some form of literal mental retardation.

