What my retraction taught me (nature.com)
128 points by mellosouls 45 days ago | 52 comments



The workflow for modern scientific research always seems so amateurish from the perspective of basic software development.

> We set up a video meeting, and decided that Susanne would go through simulations, and I would go through my old data, if I could dig them up. That was a challenge. Only months before, my current university had suffered a cyberattack, and access to my back-up drive was prohibited at first. ... I spent a week piecing together the necessary files and coding a pipeline to reproduce the original findings. To my horror, I also reproduced the problem that Susanne had found. The main issue was that I had used the same data for selection and comparison, a circularity that crops up again and again.

The proper solution would be to publish the code along with the paper so that others could directly review it. Ideally, the raw data would also be published, but there might be privacy issues with that. Peer review for journals is supposed to catch these types of workflow problems, but just throwing the code & paper up onto GitHub would probably be more effective than publishing just the paper and waiting for others to catch the problem.


My lab practice is to publish the repo path and the full hash of the commit that gave the results in the paper. The repo might have subsequent bug fixes, new code or data, etc. But you can get and run the exact code & data that the claims are based on if you wish.

The published paper contains the hash printed in the text.
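
For a reader of the paper, getting to that exact state is a one-liner. A minimal sketch, with a hypothetical repo URL and a placeholder for the hash printed in the paper:

    git clone https://github.com/example-lab/paper-code.git   # repo path given in the paper (hypothetical)
    cd paper-code
    git checkout <full-commit-hash-printed-in-the-paper>       # the exact state the claims are based on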


I partially blame software infrastructure here.

I've tried to do reproducible research, but just setting something up where I get the exact versions of GCC, Python and the Python packages I want, in such a way that I can get the same versions two years later, is a colossal pain.

Just dumping a pile of code people won't be able to run isn't super useful, and becomes a source of continuous complaints and requests for fixes.
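
For what it's worth, even just recording the versions you actually used helps the next person. A minimal low-tech sketch, assuming a pip-based project (this pins package versions, but not the interpreter or compiler, which is exactly the gap tools like Guix/Nix or containers try to fill):

    # record the toolchain and exact package versions the results were produced with
    gcc --version | head -n 1 > environment.txt
    python3 --version >> environment.txt
    pip freeze > requirements.txt

    # two years later, someone reproducing the work installs the same package versions
    pip install -r requirements.txt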


You might be interested in using Guix [1] for reproducible research, as was reported earlier (for genomics analysis) [2]; it is still actively developed [3].

As a bonus, Guix and Nix [4] allow for modular and composable pipelines, where you can swap out different components (such as upgrading Python from 3.8 to 3.9, enabling certain optimizations for some Python libraries, or running on macOS and Linux while keeping the whole stack more or less the same). That would be more difficult to do reproducibly with other tools such as Docker: composing Guix or Nix functions is easier than composing Dockerfiles or Docker Compose. A minimal command-line sketch follows the links below.

[1]: https://ambrevar.xyz/guix-advance/

[2]: https://academic.oup.com/gigascience/article/7/12/giy123/511...

[3]: https://hpc.guix.info/blog/2020/11/hpc-reproducible-research...

[4]: https://builtwithnix.org/
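
The promised sketch of what pinning and replaying a whole environment looks like with Guix (the package names are real; the analysis script is a hypothetical placeholder):

    # record the exact Guix revision (and thus the exact package graph) used for the analysis
    guix describe --format=channels > channels.scm

    # later, anyone can re-enter that same environment and re-run the pipeline
    guix time-machine -C channels.scm -- \
      environment --ad-hoc python gcc-toolchain -- python3 analysis.py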


When you publish the code, the chances of someone getting it to run increase from 0% to something much, much higher. Maybe 50% of those who really try can get it running. That it is hard to get to 100% does not at all invalidate this value.

It is fine to not do much support and bugfixing on such a codebase - but one should make sure to state it clearly in the documentation.

I try out published research code (in machine learning) all the time. Almost every time I have to debug and fix things. But in a couple of hours or days I can usually get it fixed, normally without any help. That is a shorter time than it takes (in the best case) to even ask a researcher for access to their code.

Reduce barriers to entry. Make it easy for others to build on your work. Your chances of impact and value created from your work will be higher.


I have over a dozen open source projects, many where I am the major author :) Therefore, I also get a significant number of complaining emails.

For a while my e-mail address was in a header in GCC's C++ standard library; in retrospect, that was a seriously bad choice.


You could make a container with the pipeline and data set up to go. Also, ideally, you'd publicly host the Dockerfile or whatever build-time config you used, too.


> Also, ideally, you'd publicly host the Dockerfile

Docker doesn't make it easy to produce a reproducible config. People commonly write Dockerfiles with stuff like this in them:

    RUN apt-get install -y my-favourite-package

And then which version of my-favourite-package you get depends on whatever the latest one on the update servers is at the time. Maybe a new version is released tomorrow with a regression bug that breaks everything.

There are ways to get around this problem, but the default ways of doing things in Docker don't encourage reproducibility.
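
There are partial workarounds within plain Docker, e.g. pinning the base image by digest and the package by an exact version (the digest and version number below are placeholders, not real values):

    # pin the base image by content digest rather than by a mutable tag
    FROM debian:bullseye@sha256:<digest-you-actually-built-against>

    # pin the exact package version instead of "whatever is newest today"
    RUN apt-get update && apt-get install -y my-favourite-package=1.2.3-1

Even that breaks once the pinned version disappears from the mirrors, which is why the comments below suggest archiving the built image itself.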


Is the solution to put the actual package into the Dockerfile instead of a reference to it? Asking because I don't actually know the answer.


The solution is to publish the built Docker container for examination and reuse in recreating the results. It is possible to create reproducible Docker containers, but not with Docker itself.


At one company I worked at, we hosted our own npm registry and mirrored all the packages we used there. For our own stuff we stored all old versions in it too. The software itself always installed specific versions rather than "latest".

Another option is to store the whole Docker image. Granted, this will not give you the ability to rebuild it later, but at least you have a "binary" you will be able to run.


This is true, although the built Docker image itself (rather than the Dockerfile) wouldn't have these problems. I think in situations where reproducibility is paramount, it would make sense to offer a tarball of the image (`docker save`) along with the Dockerfile itself.
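
A minimal sketch of that, with a hypothetical image name:

    # archive the built image (not just the Dockerfile) alongside the paper's other artifacts
    docker save paper-pipeline:v1 | gzip > paper-pipeline-v1.tar.gz

    # years later, a reader restores and runs it without rebuilding anything
    gunzip -c paper-pipeline-v1.tar.gz | docker load
    docker run --rm paper-pipeline:v1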


After writing the Dockerfile, you can additionally push your built Docker image to Docker Hub or another public-facing registry. People can then pull your image directly in case of any issue with the Dockerfile.
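
Something like this, assuming a hypothetical Docker Hub account and image name:

    # after docker login, tag and publish the built image
    docker tag paper-pipeline:v1 youruser/paper-pipeline:v1
    docker push youruser/paper-pipeline:v1

    # readers then pull the exact published image instead of rebuilding from the Dockerfile
    docker pull youruser/paper-pipeline:v1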


I look at this whole idea another way: if your results are so fragile that they need a docker container (with all the caveats others are mentioning about versions etc), it's not really "reproducible" just because someone can re-run your code. Real reproducibility would come from having discovered some objective fact that shows itself again when similar work is attempted.

Reproducibility =/= generality, which is a much more important aspect of research.


While I agree that science should aim for replicability (same result for similar methods on the same scientific question) instead of reproducibility (same methods) [1], it does not hurt to let others reuse part of your research pipeline.

I think we can in fact test and improve replicability if the whole pipeline is published. We should aim for allowing others to change certain parameters or data in the pipeline. That is, the result should be robust with respect to minor changes in the pipeline, which could be more easily tested by others with a published, fully specified pipeline.

Referring to my other comment [2], Guix or Nix really shine in allowing others to change part of the pipelines, due to their functional approach to composing components.

[1]: https://phys.org/news/2019-05-replicability-science.html

[2]: https://news.ycombinator.com/item?id=25866651


Downvoters: explain? Maybe this suggestion is aimed too much at a specific piece of technology (Docker), but there are plenty of tools out there to create reproducible builds, which I think is the overall point of this response.


There is so much on the plate of the academic that learning yet another piece of software (like Docker, which can be trying at the best of times) is yet another straw that can break the back.


Speculation: many downvoters are people who have repeatedly had frustrating user experiences and are just exasperated at being repeatedly told that their tools are good enough.


I didn't downvote but...

I'm skeptical that specific tools and "formalities" like requiring Docker containers will do a lot to improve the quality of scientific research. Obviously, being able to re-run an old analysis is better than not being able to do so, but that wouldn't have helped (much) with this issue.

The root cause here was a conceptual problem, not a library version mismatch. They used the same data to a) group data points into bins and b) test for differences between those bins. Having a perfectly engineered, smoothly running pipeline might have made it easier to confirm the result, but I don't think it would have helped detect it. The things that would have helped are rewarding collaboration (more eyes make bugs shallower) instead of rewarding mostly lead authorship, time and support to carefully mull over findings instead of a helter-skelter rush to publish, and a publication culture that doesn't avoid ambiguity. Those are much harder to achieve, though...


Because, as far as I can tell, Docker doesn't make reproducible images easy.

How do I make a Dockerfile which will create the same image, bit for bit (or at least version for version), in two years' time? I usually find it's necessary to update packages, which is good for security but bad for not breaking things.
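
For Debian-based images, one partial answer is to install from a dated snapshot archive instead of the live mirrors, so "latest" means the same thing in two years. A rough sketch (the snapshot date, base-image digest and package name are illustrative):

    FROM debian:bullseye@sha256:<digest>

    # point apt at a frozen snapshot of the archive
    RUN echo "deb http://snapshot.debian.org/archive/debian/20220101T000000Z bullseye main" \
          > /etc/apt/sources.list \
     && apt-get -o Acquire::Check-Valid-Until=false update \
     && apt-get install -y my-favourite-package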


> workflow for modern scientific research

Careful with the word "modern": the latest generation of students and postdocs is using Git, CI/CD, etc., sometimes militantly. But this amateurish approach reflects the incentive structure, where the goal isn't to produce a robust solution but to achieve notoriety and impact in one's domain. That impact factor plays into getting the next grant, which creates a feedback loop that makes any shortcuts to the latest shiny thing obligatory.

Even when failures become apparent, scientists can play the feature-not-a-bug card very frequently to stay on top.


The journal I edit has a policy that the code and data have to be made available to the reviewers. I was surprised at the number of reviewers that dig into the code.


TBH, code (even when all you can do is look at it and understand how it works) is probably the most useful thing to the reviewers themselves.

Imagine the equivalent for reviewing an experimental paper: if you were given free, instant travel to the place where they did the experiment, and you could look at their setup, tinker with it and see how it works. I guarantee you a lot of reviewers would go for that.


I've learned that what you say is true. This is particularly the case for papers that use small datasets and not especially complicated empirical analysis.


Surprised that the reviewers do or don't look at the code?


Surprised that the reviewers do look at the code. I had assumed they wouldn't want to put the time in. In hindsight, I think it's often the case that it's faster to look at the code than to read an unclear methodological description.


Speaking as a reviewer, reviews do take an immense amount of time, especially just in trying to figure out what was done and how. Code would actually speed things up a lot.


Yes, but there needs to be a more complete solution to this. For this to happen, somebody should start funding one (relatively) professional developer per research group. Or, at least, there should be a way to fund "methodology/development" researchers/PhD students that have some domain knowledge, but mainly focus on aspects of publishing code, data, stats methodology, maintaining it, and ensuring that others can use it. Unfortunately, in many fields there does not seem to be a niche for that. But clearly this is a very important problem, possibly more important than funding a larger number of research ideas?


In most fields, papers/articles are still the one metric crucial for career advancement. You will sit on data and code until you are ready to publish, sometimes for years.

Datasets and analytical code must be regarded as a scientific achievement on their own, albeit not as a substitute for the "full" substantive analysis in a paper.

A social science PhD in Denmark, for example, must currently write two to four peer-reviewed papers. The option to substitute one of those articles with a data set? It doesn't exist.

Things are getting better, though: people nowadays at least understand my citations better, since I explicitly cite the R packages used in my work.


> The workflow for modern scientific research always seems so amateurish from the perspective of basic software development.

It's sad but it's true. Although there are fundamental pressures that come because academia is academia, I pretty firmly believe that this can be fixed culturally too.

Ultimately, I think people in some fields enjoy programming but view it as either beneath them or beneath their paper. This is clearly not the case for computer science, but I noticed a few papers in the Tire Society journal which presented promising results with absolutely no source code.

One slightly ugly truth is that a lot of academics don't really know how to program properly or how computers actually work (again, outside CS, for obvious reasons). I'm reminded of the Imperial COVID model's bugs as a fairly easy example.

As to the culture question: the obvious solution is to start requiring disclosure of almost all materials and data used to construct a paper. For instance, if Hendrik Schön had had to publish the data behind the curves in his graphs, his fraud could have been spotted almost immediately.


I think the catch here is that people in academia who like CS/engineering rapidly leave academia for the much better-paying SDE/data science roles. Then what you have left are the people who care less about the code, at least in aggregate.


I've gotta say, I don't read conservative papers, but I follow news closely, and up to this day I didn't know about the bugs. The media landscape is crazy.

However, it looks like they've been updating the model (and are at version 10 on GitHub). I have no idea what their results look like now.


> The proper solution would be to publish the code along with the paper so that others could directly review it. Ideally, the raw data would also be published,

Definitely. Some journals and conferences have badging for papers that also submit their digital artifacts, and also for experiments that have been independently reproduced.

Personally I'd like to see entire journals or conference sessions where the code and experimental data for every paper are available for review.

Sometimes authors can't submit their code for corporate IP reasons and data for privacy or other reasons, but the more digital artifacts that can be submitted, reviewed, and built upon, the better.


> The proper solution would be to publish the code along with the paper so that others could directly review it.

This is taking root! More CS conferences should get onboard.

https://www.artifact-eval.org/about.html

(certificate seems to be expired...)


Universities are trying, but the incentives for researchers just aren't there. https://deepblue.lib.umich.edu/


> The proper solution would be to publish the code along with the paper so that others could directly review it.

Most ML researchers write code to prove their theory works, not to check if it works.


Bummer that they didn't catch it until years after it was published, but it is heartwarming to hear that the process went smoothly at least.

I had a near miss publishing a paper in grad school, where buried in the ~10k LOC data analysis script there was a bug in my data processing pipeline. In summary, I had meant to evaluate B=sqrt(f(A^2)), but what I actually evaluated was B=sqrt(f(A)^2), which caused the resulting output to be slightly off. In the review process, one of the reviewers looked at the output and said, "Wait a second, this seems fishy: can you explain why it has such-and-such artifact?" Their comments quickly allowed me to pinpoint what was going wrong and correct the analysis script appropriately -- which actually ended up improving the result significantly!

What I take away from all this is that for every article about academic misconduct and p-hacking there are 100 more where the peer review process (both before and during submission to a journal) caught the issues in time.

But... there are also probably a decent number where the error is still in the wild to this day...


> What I take away from this all is that for every article about academic misconduct and p hacking there are 100 more where the peer review process (both before and during submission to a journal) caught the issues in time.

p-hacking isn't a mathematical error. It cannot be "caught" because it is not presented to referees--you slightly modify your hypotheses after the experiment based on its results, or you throw out "outliers" that blow up your theory. These are things that don't even show up in a paper; they happen during the compilation of the paper.

How you could conclude that p-hacking is rare based on a completely unrelated experience is beyond me.


>How you could conclude that p-hacking is rare based on a completely unrelated experience is beyond me.

I was involved in the submission of >100 papers through peer review processes, none of which involved p-hacking. In fact, they couldn't have been p-hacked because their novelty did not rely on any statistical analysis, or they were preregistered with the journal.

I did have a run-in with 1 publication that I suspected involved academic misconduct (fabrication of experimental results), but it was a thesis so it did not go through the peer review process.


That is great. But isn't pre-registration far from standard in many fields? And don't more and more papers involve some level of statistical analysis?


I listen to a weekly podcast titled "Everything Hertz", which includes a large amount of discussion related to methodology as well as the problematic citing of previous research, etc. It's pretty fascinating (as is this article) - https://everythinghertz.com/.


Thank you, this looks interesting indeed


Really happy to see this in a major journal. Simultaneous sharing of code and data necessary to replicate a result should be the minimum expectation for journal publication.



This stuck out as an important point: [As a] "student, I was even told never to attempt to replicate before I publish."

An actual scientist would never say this. I know about the real world, costs, pragmatism, etc., but no actual scientist should ever be proud of refusing replication attempts that would allow verification. It's that simple. Fundamentally, that should be enough to initiate an investigation towards disciplinary proceedings. It's that serious an offence, e.g. a form of corruption.

Fake scientists, charlatans, "science believers", science authoritarians... sure. But who wants them around?

Admitting a mistake is actually good science.


I shudder to imagine what the world of software would be like if discovering bugs in your code would limit your career prospects similar to how negative results or methodology errors do in scientific research.


> That [the method of statistical analysis] could be a problem in our particular context didn’t dawn on me and my colleagues — nor on anyone else in the field — before [whoever]'s discovery.

This is troubling to me. It does not seem like the bias was due to something arcane, some finer point in advanced statistics, something hidden from view etc. The poster says:

> It involved regression towards the mean — when noisy data are measured repeatedly, values that at first look extreme become less so.

This may not be 100% straightforward when it's buried in the middle of a paper, but if you actually consider the methodology you are likely to notice this happening.

So here's what _I_ learn from this case:

* It is possible that the reviewers at Nature don't properly scrutinize the methodological soundness of some submissions (I say "possible" since this is a single example, not a pattern)

* PhD advisors, like the author's, may not be exercising due diligence on statistical research done with their PhD candidates. The author's advisor had this to say:

> "It's great that we’ve persisted in attempting to understand our methodology and findings!”

so he says it's "great" that they did not fully understand their methodology before submitting a paper using it. Maybe that's not exactly what he meant, but still, pretty worrying.


You present a view of journals as an outlet for correct results. Realistically, that's not possible, and I wish more people accepted that it's not possible. If a result is confirmed in many studies by many authors using many datasets and many methodologies, that's how we know we can trust a result. I personally do not put much weight on a single paper's results unless there's something special about it.

The most important thing is that everything about the investigation is open: the methodology (including everything not found in the paper), the programs, the data. It's more like posting code for a large project on GitHub and then having new researchers make PRs to correct bugs or extend the program in useful ways.


I think you're conflating correctness and degrees of validity.

Yes, I do expect journals to have correct results - not in the sense that the generalizations from their findings are universally valid, but in that the statements of fact and of logical implication are valid.

To be more explicit:

* "Events X, Y, Z occurred" <- expect this kind of sentences to be correct

* "We did X, Y, Z" <- expect this kind of sentences to be correct

* "A and A->B, so B" <- expect this kind of sentences to be correct

* "We therefore conclude that X" <- Don't expect X to necessarily be correct.


> in that the statements of fact and of logical implications are valid.

See for example https://slatestarcodex.com/2019/05/07/5-httlpr-a-pointed-rev... for a slightly different take.


I don't see how that disagrees with what I wrote.


Pay attention to the context of that quote about regression to the mean. The problem in this case manifested as the opposite: 'egression from the mean', which can happen due to the combined effects of binning, cross-thresholding and heteroscedasticity (see https://www.biorxiv.org/content/10.1101/2020.12.15.422942v1). I think that _is_ a fairly arcane point, at least to most life scientists.



