> We set up a video meeting, and decided that Susanne would go through simulations, and I would go through my old data, if I could dig them up. That was a challenge. Only months before, my current university had suffered a cyberattack, and access to my back-up drive was prohibited at first. ... I spent a week piecing together the necessary files and coding a pipeline to reproduce the original findings. To my horror, I also reproduced the problem that Susanne had found. The main issue was that I had used the same data for selection and comparison, a circularity that crops up again and again.
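To make that circularity concrete, here is a minimal sketch with synthetic data (nothing to do with the actual study): pure noise is used both to decide which points count as "extreme" and then to test whether the extreme group differs from the rest, and the test comes back wildly "significant" from noise alone.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)            # pure noise, no real structure at all

    # Selection: the same data decide which points count as "extreme".
    cut = np.quantile(x, 0.9)
    extreme, rest = x[x > cut], x[x <= cut]

    # Comparison: the same data are then used to test extreme vs. rest.
    print(stats.ttest_ind(extreme, rest).pvalue)   # essentially zero, from noise alone

The selection step guarantees that the comparison will look impressive, which is exactly why the split and the test need independent data.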
The proper solution would be to publish the code along with the paper so that others could directly review it. Ideally, the raw data would also be published, but there might be privacy issues with that. Peer review for journals is supposed to catch these types of workflow problems, but just throwing the code & paper up onto GitHub would probably be more effective than publishing just the paper and waiting for others to catch the problem.
The published paper contains the hash printed in the text.
I've tried to do reproducible research, but just setting something up where I get the exact versions of GCC, Python and the Python packages I want, in such a way that I can get the same versions two years later, is a colossal pain.
Just dumping a pile of code people won't be able to run isn't super useful, and it becomes a source of continuous complaints and requests for fixes.
As a bonus, Guix and Nix allow for modular and composable pipelines, where you can swap different components (such as upgrading Python from 3.8 to 3.9, enabling certain optimizations for some Python libraries, or running on macOS and Linux while keeping the whole stack more or less the same), which would be more difficult to do reproducibly with other tools such as Docker (composing Guix or Nix functions is easier than composing Dockerfiles or Docker Compose files).
It is fine not to do much support and bug-fixing on such a codebase, but one should make sure to state that clearly in the documentation.
I try out published research code (in machine learning) all the time. Almost every time I have to debug and fix things. But in a couple of hours or days I can usually get it fixed, normally without any help. That is a shorter time than it takes (in the best case) to even ask a researcher for access to their code.
Reduce barriers to entry. Make it easy for others to build on your work. The chances of your work having impact and creating value will be higher.
For a while my e-mail address was in a header in GCC's C++ standard library; in retrospect that was a seriously bad choice.
Docker doesn't make it easy to produce a reproducible config. People commonly write Dockerfiles with stuff like this in them:
RUN apt-get install -y my-favourite-package
And then which version of my-favourite-package you get depends on whatever the latest one on the update servers is at the time. Maybe a new version is released tomorrow with a regression bug that breaks everything.
There are ways to get around this problem (for example, pinning an exact version with apt-get install -y my-favourite-package=1.2.3, or referencing the base image by digest), but the default way of doing things in Docker doesn't encourage reproducibility.
Another option is to store the whole Docker image; granted, this will not give you the ability to rebuild it later, but at least you have a "binary" that you will be able to run.
Reproducibility != generality, which is a much more important aspect of research.
I think we can in fact test and improve replicability if the whole pipeline is published. We should aim for allowing others to change certain parameters or data in the pipeline. That is, the result should be robust with respect to minor changes in the pipeline, which could be more easily tested by others with a published, fully specified pipeline.
Referring to my other comment, Guix and Nix really shine in allowing others to change parts of the pipeline, due to their functional approach to composing components.
I'm skeptical that specific tools and "formalities" like requiring Docker containers will do a lot to improve the quality of scientific research. Obviously, being able to re-run an old analysis is better than not being able to do so, but that wouldn't have helped (much) with this issue.
The root cause here was a conceptual problem, not a library version mismatch. They used the same data to a) group data points into bins and b) test for differences between those bins. Having a perfectly engineered, smoothly running pipeline might have made it easier to confirm the result, but I don't think it would have helped detect the problem. The things that would have helped are things like rewarding collaboration (more eyes make bugs shallower) instead of rewarding mostly lead authorship, time and support to carefully mull over findings instead of a helter-skelter rush to publish, and a publication culture that doesn't avoid ambiguity. Those are much harder to achieve, though...
How do I make a Dockerfile that will create the same image, bit for bit (or at least version for version), in two years' time? I find it's usually required to update packages, which is good for security but bad for not breaking things.
Careful with the word "modern": the latest generation of students and postdocs is using Git, CI/CD, etc., sometimes militantly. But this amateurish approach reflects the incentive structure, where the goal isn't to produce a robust solution but to achieve notoriety and impact in one's domain. That impact plays into getting the next grant, which creates a feedback loop that makes shortcuts to the latest shiny thing obligatory.
Even when failures become apparent, scientists can play the feature-not-a-bug card very frequently to stay on top.
Imagine the equivalent for reviewing an experimental paper: you are given free, instant travel to the place where they did the experiment, and you can look at their setup, tinker with it, and see how it works. I guarantee you a lot of reviewers would go for that.
Datasets and analytical code must be regarded as a scientific achievement in their own right, albeit not as a substitute for the "full" substantive analysis in a paper.
A social science PhD in Denmark, for example, must currently write two to four peer-reviewed papers. How about the option to substitute one of the articles with a data set?
Things are getting better, though: people nowadays at least understand my citations better, since I cite at least the R packages explicitly used in my work.
It's sad, but it's true. Although there are fundamental pressures that exist simply because academia is academia, I pretty firmly believe this can be fixed culturally too.
Ultimately, I think people in some fields enjoy programming but view it as either beneath them or beneath their paper. This is clearly not the case for computer science, but I noticed a few papers in the Tire Society journal which presented promising results with absolutely no source code.
One slightly ugly truth is that a lot of academics don't really know how to program properly or how computers actually work (again, outside CS, for obvious reasons). I'm reminded of the Imperial COVID model's bugs as a fairly easy example.
As to the culture question, the obvious solution is to start requiring disclosure of almost all materials and data used to construct a paper. For instance, if Hendrik Schön had had to publish the data behind the curves in his graphs, his fraud could have been spotted almost immediately.
However, it looks like they've been updating the model (and are at version 10 on GitHub). I have no idea what their results look like now.
Definitely. Some journals and conferences have badging for papers that also submit their digital artifacts, and also for experiments that have been independently reproduced.
Personally I'd like to see entire journals or conference sessions where the code and experimental data for every paper are available for review.
Sometimes authors can't submit their code for corporate IP reasons and data for privacy or other reasons, but the more digital artifacts that can be submitted, reviewed, and built upon, the better.
This is taking root! More CS conferences should get on board.
(certificate seems to be expired...)
Most ML researchers write code to prove their theory works, not to check if it works.
I had a near miss publishing a paper in grad school, where buried in the ~10k LOC data analysis script there was a bug in my data processing pipeline. In short, I had meant to evaluate B=sqrt(f(A^2)), but what I actually evaluated was B=sqrt(f(A)^2), which caused the resulting output to be slightly off.
In the review process, one of the reviewers looked at the output and said, "wait a second, this seems fishy; can you explain why it has such-and-such artifact?" Their comments quickly allowed me to pinpoint what was going wrong and correct the analysis script appropriately -- which actually ended up improving the result significantly!
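For anyone skimming, those two expressions only agree for special choices of f. A toy illustration, with a made-up f standing in for the real processing step (which isn't given here):

    import numpy as np

    def f(x):
        # hypothetical stand-in for the real processing step
        return x + 1.0

    A = np.array([0.5, 1.0, 3.0])
    intended = np.sqrt(f(A ** 2))   # B = sqrt(f(A^2))
    actual = np.sqrt(f(A) ** 2)     # B = sqrt(f(A)^2), i.e. just |f(A)|
    print(np.round(intended, 3))    # [1.118 1.414 3.162]
    print(np.round(actual, 3))      # [1.5 2. 4.]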
What I take away from all this is that for every article about academic misconduct and p-hacking, there are 100 more where the peer review process (both before and during submission to a journal) caught the issues in time.
But... also probably a decent number where the errors are still in the wild to this day...
p-hacking isn't a mathematical error. It cannot be "caught" because it is not presented to referees: you slightly modify your hypotheses after the experiment, based on the results, or you throw out "outliers" that blow up your theory. These are things that don't even show up in a paper; they happen during the compilation of the paper.
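For readers who haven't seen the mechanics spelled out, here is a toy simulation (entirely synthetic, not modelled on any particular paper) of the hypothesis-shopping variant: measure many unrelated outcomes where no real effect exists, then write up only whichever one happened to cross p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    trials, hits = 1000, 0
    for _ in range(trials):
        # 20 unrelated outcome variables, none of which has any real effect
        outcomes = rng.normal(size=(20, 2, 30))
        p = [stats.ttest_ind(a, b).pvalue for a, b in outcomes]
        # "flexible" reporting: keep whichever outcome happened to come out significant
        if min(p) < 0.05:
            hits += 1
    print(hits / trials)   # roughly 1 - 0.95**20 ≈ 0.64, not the nominal 0.05

The final manuscript only ever shows the one test that "worked", so referees have nothing to catch.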
How you could conclude that p-hacking is rare based on a completely unrelated experience is beyond me.
I was involved in the submission of >100 papers through peer review processes, none of which involved p-hacking. In fact, they couldn't have been p-hacked because their novelty did not rely on any statistical analysis, or they were preregistered with the journal.
I did have a run-in with 1 publication that I suspected involved academic misconduct (fabrication of experimental results), but it was a thesis so it did not go through the peer review process.
The problem is far more rife than I realised...
An actual scientist would never say this. I know about the real world, costs, pragmatism, etc. But no actual scientist should ever be proud to refuse a replication attempt that would allow verification. It's that simple. Fundamentally, that should be enough to initiate investigations leading towards disciplinary proceedings. It's that serious an offence, e.g. a form of corruption.
Fake scientists, charlatans, "science believers", science authoritarians... sure. But who wants them around?
Admitting a mistake is actually good science.
This is troubling to me. It does not seem like the bias was due to something arcane, some finer point in advanced statistics, something hidden from view etc. The poster says:
> It involved regression towards the mean — when noisy data are measured repeatedly, values that at first look extreme become less so.
This may not be 100% straightforward when it's buried in the middle of a paper, but if you actually consider the methodology, you are likely to notice this happening.
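The quoted mechanism takes only a few lines to see for yourself (synthetic data, not the paper's): measure stable quantities with noise, select the ones that look extreme the first time, and on an independent re-measurement they look much less extreme even though nothing changed.

    import numpy as np

    rng = np.random.default_rng(0)
    true_value = rng.normal(size=10000)           # stable underlying quantities
    first = true_value + rng.normal(size=10000)   # first noisy measurement
    second = true_value + rng.normal(size=10000)  # independent second measurement

    top = first > np.quantile(first, 0.95)        # select what looked extreme at first
    print(first[top].mean(), second[top].mean())  # the second mean is much closer to zero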
So here's what _I_ learn from this case:
* It is possible that the reviewers at Nature don't properly scrutinize the methodological soundness of some submissions (I say "possible" since this is a single example, not a pattern)
* PhD advisors, like the author's, may not be exercising due diligence on statistical research done with their PhD candidates. The author's advisor had this to say:
> "It's great that we’ve persisted in attempting to understand our methodology and findings!”
so he says it's "great" that they did not fully understand their methodology before submitting a paper using it. Maybe that's not exactly what he meant, but still, pretty worrying.
The most important thing is that everything about the investigation is open: the methodology (including everything not found in the paper), the programs, the data. It's more like posting code for a large project on GitHub and then having new researchers make PRs to correct bugs or extend the program in useful ways.
Yes, I expect journals to have correct results - not in the sense that the generalizations from their findings are universally valid, but in that the statements of fact and of logical implications are valid.
To be more explicit:
* "Events X, Y, Z occurred" <- expect this kind of sentences to be correct
* "We did X, Y, Z" <- expect this kind of sentences to be correct
* "A and A->B, so B" <- expect this kind of sentences to be correct
* "We therefore conclude that X" <- Don't expect X to necessarily be correct.
See for example https://slatestarcodex.com/2019/05/07/5-httlpr-a-pointed-rev... for a slightly different take.