The 'results' here aren't about confirming or rejecting a hypothesis; they're about analyzing data for the strength of an effect, which seems fair enough.
> The researchers gave scientist-participants one of two data sets and an accompanying research question: either “To what extent is the growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?” or “How does grass cover influence Eucalyptus spp. seedling recruitment?”
What follows is that variation in how the data is analyzed gives a wide range of effect 'strengths': averaged together they are not significant, while individual results can be extreme and point in opposite directions. What this highlights is that the data analysis itself (even on identically reproduced data) is a source of irreproducibility, meaning it needs to be made explicit and validated.
Does anyone know if there’s a reason why we’ve not agreed on a standard way of analysing data for any given type of study or type of outcome?
Maybe it’s naive but my intuition would be that if someone were conducting a randomised controlled trial with a hypothesis that A > B, then there should be a known best practice for analysing that data to check the hypothesis. A way that is reproducible, otherwise surely it’ll become a shitfest of people trying multiple methods of analysis and then publishing whichever produced the ‘best’ result.
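For what it's worth, here's a toy simulation (mine, not from the article; the four analysis variants are invented for illustration, not anyone's actual pipeline) of why "try several analyses and publish the best-looking one" is a problem: even with no real difference between A and B, reporting whichever analysis gives the smallest p-value inflates the false-positive rate above the nominal 5%.

```python
# A minimal sketch: normal null data, n=30 per arm, no true difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n = 5000, 30
hits_fixed, hits_best = 0, 0

for _ in range(n_trials):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)  # A and B are identical by construction

    pvals = [
        stats.ttest_ind(a, b).pvalue,                                # plain t-test
        stats.mannwhitneyu(a, b).pvalue,                             # rank-based test
        stats.ttest_ind(a[np.abs(a) < 2], b[np.abs(b) < 2]).pvalue,  # "outliers" trimmed
        stats.ttest_ind(a, b, equal_var=False).pvalue,               # Welch variant
    ]
    hits_fixed += pvals[0] < 0.05   # one pre-specified analysis
    hits_best += min(pvals) < 0.05  # cherry-pick the best-looking result

print(f"false-positive rate, one fixed analysis: {hits_fixed / n_trials:.3f}")  # ~0.05
print(f"false-positive rate, best of four:       {hits_best / n_trials:.3f}")   # noticeably higher
```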
This doesn't seem like quite the same issue as the reproducibility problems in psychology, for instance. Researchers were presented with some data, and they came up with different analyses for it. That doesn't seem especially surprising—for a complex phenomenon, where you don't have a ton of data, and maybe the thing you're studying isn't very strong—yeah, you're going to get different results.
It's like people arguing over which sports team is better—the facts of the game are the facts of the game, but there's plenty of room for disagreement over what stats are most important, who has an easier season—and even if the two teams were to play each other directly, it's not always that the "better" team wins.
But the reproducibility crisis in social sciences seemed much deeper—one team would say that some effect exists, and other teams would try to reproduce the results and come up with nothing.
There is a relation between them, but not a strong one, I agree. The more conclusions you can draw from a data set, the more will be wrong. A possible reason, hinted at in the article, is underestimating the noise. It's one of the things that went bad in psych studies, but also in genetics (in a lecture some 10 years ago, a statistician explained how results in genetics would be reproducible only at 5 sigma or higher).
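To put a rough number on that 5-sigma remark (my own back-of-the-envelope, just the two-sided normal tail probability, nothing field-specific):

```python
# What a k-sigma threshold means as a two-sided p-value, vs the usual 0.05.
from scipy.stats import norm

for sigma in (2, 3, 5):
    p = 2 * norm.sf(sigma)  # two-sided tail probability of a standard normal
    print(f"{sigma} sigma -> p ~ {p:.1e}")
# 2 sigma -> ~4.6e-02, 3 sigma -> ~2.7e-03, 5 sigma -> ~5.7e-07
```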
So as I understand Fisher's stance on using stats, you are supposed to start with a hypothesis. As in, you have a mechanism which could cause the effect. The stats are there to provide supporting evidence, to limit BS causal chains.
Exactly. I feel like this is asking the biologists a trick question. The correct answer is, "I don't fucking know. Are you offering me a grant to study this?"
Starting with a hypothesis doesn't require a detailed mechanism. Both questions posed in the article have a proposed mechanism - they clearly have a cause and proposed effect listed. This isn't the same thing as mining a giant generic database looking for arbitrary correlation.
The replication issue highlighted here is how choices in analytical approach can yield differing answers. This would be true even if you had a detailed proposed mechanism.
But neither one specified a falsifiable hypothesis. Which, to this layman's mind, means that any general statement made based upon the data is itself a hypothesis.
That's what research is. It's not homework. No one defines the question for you. You just become curious about some relationships.
If they had asked very specific questions the whole study would be meaningless.
The questions they asked are spot on: "To what extent is the growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?" and
"How does grass cover influence Eucalyptus spp. seedling recruitment?"
I disagree for a study on replication: people having different interpretations is totally fine, but that's not really related to replication as such. There are many valid answers to these questions. What they studied is comprehensiveness or something like that (edit: and to some extent craftsmanship of sorts).
Imagine in chemistry they had asked to synthesize a compound given a range of inputs and then see a variety of yields based on different synthesis approaches.
Phrasing it as "how does X influence Y" makes it sound vague, but I think that the question they asked would be better phrased as "to what degree does X influence Y"—they got results which varied along a continuum, not in random different directions. So the actual questions themselves weren't vague, and the researchers all treated them as meaning the same thing; they just came up with different degrees in the results.
I agree (and said as much in another comment). The different groups didn't come up with different effects, they came up with different magnitudes of the same effect. This doesn't seem like a replication crisis.
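As a toy illustration of that (the variable names and data-generating process below are entirely made up, not the blue tit data): two defensible model specifications on the same dataset give the same sign of effect but clearly different magnitudes.

```python
# Minimal sketch: same synthetic data, with and without adjusting for a
# covariate that is correlated with the predictor of interest.
import numpy as np

rng = np.random.default_rng(1)
n = 500
brood_size = rng.normal(0, 1, n)                                        # hypothetical "X"
parent_quality = 0.6 * brood_size + rng.normal(0, 1, n)                 # covariate correlated with X
growth = 0.5 * brood_size + 1.0 * parent_quality + rng.normal(0, 1, n)  # hypothetical "Y"

def slope_on_first(y, *cols):
    """OLS coefficient on the first listed predictor."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

print("unadjusted effect of X on Y:", round(slope_on_first(growth, brood_size), 2))                  # ~1.1
print("adjusted for the covariate :", round(slope_on_first(growth, brood_size, parent_quality), 2))  # ~0.5
```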
Non-scientist here. Isn't this just stating that "experiments", even those involving non-laboratory measurements, need to begin with a falsifiable hypothesis? From my understanding, you can't draw any real conclusions from the data unless you predicted that it would look that way beforehand.
>From my understanding, you can't draw any real conclusions from the data unless you predicted that it would look that way beforehand.
Oh my sweet summer child, a lot of lab work and data collection is expensive, and in the game of research you spend a lot of time gaming the system and meeting expectations rather than doing actual fundamental research. A lot of the incentive is to take the hard, low-return work, like running tests and collecting data, and re-leverage it with increasingly complex statistical approaches.
Having worked in many research environments, I find it's more often the case that there's a selection towards research that can be pursued and falsified with existing data, rather than the other way around. Here's this set of data and how it was collected; what arbitrary new, novel thing can we say about it? It may not be something interesting, but it may be statistically or theoretically valid. The result is you get a paper/publication out of it without doing the footwork.
This is part of the reason researchers often hold their data tightly. You'd think scientists would want to share data but it's a highly competitive environment and if you took the risk to invest time and money in some costly data collection process, you want to do everything you can to say everything you can about it before someone else does it without any of the underlying cost. Sure, you may get a reference or footnote for your data but that's not going to help that much in the big scheme of things, not as much as a fresh publication. Also, if you're only being referenced for the data collection portion of your work... it doesn't speak a lot about the work you did around that data collection.
But doesn't that explain the reproducibility crisis, in a nutshell? If you work backwards from a dataset and mine it for correlations, doesn't that effectively push the chance of finding a spurious "significant" result towards 1?
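A toy simulation of that "working backwards" scenario (all variables below are pure noise; the names and sizes are invented): screen enough candidate predictors against one outcome and you will almost always find at least one nominally "significant" correlation.

```python
# Minimal sketch: mine a purely random dataset for correlations and count how
# often at least one predictor clears p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_rows, n_predictors, n_datasets = 100, 50, 1000
found_something = 0

for _ in range(n_datasets):
    outcome = rng.normal(size=n_rows)
    predictors = rng.normal(size=(n_rows, n_predictors))  # no real relationships at all
    pvals = [stats.pearsonr(predictors[:, j], outcome)[1] for j in range(n_predictors)]
    found_something += min(pvals) < 0.05

print(f"datasets with at least one 'significant' correlation: {found_something / n_datasets:.2f}")
# ~0.92 with 50 predictors; more predictors pushes it toward 1.0
```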
I'm not disagreeing with you, just pointing out how reality diverges from how science is often taught to be practiced as this innocent discovery process.
The structure behind research and funding in research, which leads to all sorts of system gaming, is definitely a core contributor to irreproducibility, but it's not the only issue. A lot of issues are in the sheer variability of the things being studied and how context-sensitive some of them can be. Combine that with limited time, funding, and the expectation to produce positive results in some form, and you get a lot of the reproducibility issues out there. Another huge factor beyond just analytic approach is analytic follow-through. There's a lot of highly questionable computational code out there that researchers earnestly believe is doing something that it's actually not.
The issue here is the same across all 'science' which is unable to effectively control causes, and which studies non-linear phenomena (or, more simply, cases where the distribution of means isn't defined, or is far away from normal).
The issue is: you can't do stats. The reality is it is highly likely that all data here is consistent with any effect size you care to choose.
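A quick sketch of the "distribution of means isn't defined, or far from normal" point (synthetic data, with the standard Cauchy standing in for a badly behaved process): the sample mean of heavy-tailed data never settles down, however much you collect.

```python
# Minimal sketch: sample means of normal vs Cauchy data as the sample grows.
import numpy as np

rng = np.random.default_rng(3)

for n in (100, 10_000, 1_000_000):
    normal_means = [rng.normal(0, 1, n).mean() for _ in range(5)]
    cauchy_means = [rng.standard_cauchy(n).mean() for _ in range(5)]
    print(f"n={n:>9}  normal means: {np.round(normal_means, 3)}")  # shrink towards 0
    print(f"{'':11}  cauchy means: {np.round(cauchy_means, 3)}")   # stay erratic
```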
What this study does, nevertheless, is quantify the 'model risk', i.e., the "p-value of the method itself" -- how often does research provide reliable effect sizes? Rarely. So the research methods themselves have a huge 'epistemic discounting' associated with them.
But we can go much further than this -- recurse: what is the risk of using this methodology to assess methodology? Actually, it's fairly high. This isn't a very reliable method for judging research methods. And so on.
What we find here is that in the vast majority of cases, where we'd care to research anything, the recursing risks lead to an essentially insurmountable error: measurement error, model error, methodology error, research-assumptions error, etc. There isn't enough data to resolve between competing possibilities at the outermost stage of error (often you'd need much more data than is practically possible to collect).
What do we do with this? I'd say, mostly, throw away stats here. Do case analyses, risk-based recommendations, even broadly philosophical analysis.
You're not wrong, but I think you may have headed out the wrong way from your conclusion.
Reproducibility trumps causal models. If you call the psychic hotline and get the same results time-after-time, then keep using the psychic hotline. Leave the modeling to others.
When developing new science, stats lag behind reproducibility. First you see something that correlates, then somebody is able to make A always infer B, then you start doing some statistical surveying in an effort to build up a larger model.
Observation first, then Abduction (in the Peircian sense), then measurement, then modeling, and finally Deductive/Inductive reasoning.
We keep trying to do it backwards where we have the model for everything, then bang actual data into various parts of the model. The saddest thing in this process is that if we're only doing incremental, normal science (Kuhn), no one's the wiser. This is why reproducibility tests are so important. We lie to ourselves. We're quite good at it.
BTW, I agree with your general conclusion about case analysis and such, but only in the sense that it drives Abduction, not as a goal in itself.
The case analysis would drive, what I'd hope to be, "humble abduction" rather than the variety we've become accustomed to.
I.e., rather than say "vaccines cause autism" we'd say: here are some cases where they co-occur; here is an empirical investigation into possible causes in these cases; here's the risk profile across various possible 'abductions'.
Were we to lack any good science of immunology, etc. we'd have to stop there and then start a risk analysis assuming each possible universal conclusion was true, hence giving a kind of "policy ensemble".
Since we have good-enough immunological science, and so on, we can show that the lack of anything other than time-correlated instances makes the likelihood of a causal relationship near zero (in epistemic risk terms, at least).
In other words, I expect for most areas we're interested in, there is never going to be anything close even to immunology -- let alone physics. I don't think many of the world's universals are statable.
Namely, I think chaotic non-linear systems preclude it; and systems with power-law-like behaviour make collecting enough data practically impossible.
I'm sceptical there will ever be anything more than sociology for society; literature for articulating inner psychological experiences; etc.
This is, in part, why stats is doubly harmful in these cases. It isn't a kind of protoscience, it's actually pseudoscience which gets in the way of 'humble abductions'.
To continue mining the Peircian vein, it appears to me that the most troubled areas of inquiry are the ones where the pragmatic maxim gives us the least.
I've been using a phrase for several years, "profound ignorance". I might use "humble abductions." Thanks.
We always end up back with cluster/cohort analysis, then definitions. I think a big part of the problem is that we've already convinced ourselves of the definitions of the terms. These definitions prevent actual progress. We end up defining ourselves into our own locked room.
I think the first task in this sort of convo is to convince the audience (client) that we are necessarily profoundly ignorant. You cannot predict prices, you cannot predict interest rates, and so on --- it's still always a little surprising to see the shock in their eyes.
No, indeed, the thing you've been asked to do is an exercise in risk management -- there is no statistical reality to the predictions here; ignore them.
So, I'd still say that opener is vital. I try to move on to ensemble scenario modelling, prose-style, by several experts, etc., then, eventually, I tell them about geometric Brownian motion -- only when I'm convinced they're absolutely certain it's smoke-and-mirrors
And that therefore they are aware that their job is primarily to manage risk, and a statistical model is only one minor (often blinding) tool in that box
If you need a point estimate, so be it, but claw it from my cold dead fingers. Most of the world isn't knowable, sorry to disappoint. Now go actually manage your damn responsibilities.
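To put toy numbers on the geometric Brownian motion remark above (the parameters below are invented purely for illustration): even taking the model at face value, the spread of outcomes dwarfs any single point forecast.

```python
# Minimal sketch: terminal values of a geometric Brownian motion,
# S_T = S_0 * exp((mu - sigma^2/2) * T + sigma * sqrt(T) * Z).
import numpy as np

rng = np.random.default_rng(4)
s0, mu, sigma, years, n_paths = 100.0, 0.05, 0.20, 5, 10_000

z = rng.normal(size=n_paths)
s_T = s0 * np.exp((mu - 0.5 * sigma**2) * years + sigma * np.sqrt(years) * z)

print("point forecast (mean)  :", round(s_T.mean(), 1))                      # ~128
print("5th / 95th percentile  :", np.round(np.percentile(s_T, [5, 95]), 1))  # roughly 56 to 242
```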