So to reproduce paper A from 2015 that used numpy 1.234 in Python 2.654 on a 32-bit Windows system and version abcxyz123 of obscure library X, I would need to recreate those exact conditions.
Whatever happened to the software afterwards is irrelevant. Bugfixes might change results anyway.
I'm working through this now, actually, trying to reproduce some research from last year, and between fixing things that changed between PyTorch 0.3 and 0.4 and cleaning up some of the underlying data, I've wasted a couple of days (the fixes were easy; figuring out what caused the problems in the first place, not so much).
> the main issue is that you cannot suppose that everyone (all program authors and users) know exactly what to do and do it correctly. In practice, the approach you describe almost never works because some information is missing. To make it practical, we'd need easy-to-use tooling for all phases: producing a complete list of versioned dependencies (including C libraries), verifying the completeness of this list, and restoring the environment on a different machine.
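For the first of those phases, a rough sketch of the kind of tooling meant here (the package names are just examples): record the interpreter and package versions next to the results so a matching environment can be rebuilt later. Note that it only captures Python-level packages, not C libraries or the OS, which is exactly the completeness problem the quote describes.

```python
# Minimal sketch: dump interpreter and package versions alongside the results.
# Package names are examples only; this does not capture C libraries or the OS.
import importlib
import json
import sys

def snapshot_environment(packages=("numpy", "scipy", "pandas", "matplotlib")):
    versions = {"python": sys.version}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions

if __name__ == "__main__":
    with open("environment_snapshot.json", "w") as fh:
        json.dump(snapshot_environment(), fh, indent=2)
```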
> There are three main reasons why this is not a sufficient solution:
> 1. Freezing code is fine for archival reproducibility, as I mentioned in my original post. It is not sufficient for living code bases that people work on over decades. ...
(His example was his MMTK project, which started in the Python 1.3 days. Twenty years later, it will take a lot of work to make the 2->3 transition.)
> 2. The technical solutions proposed all depend on yet more infrastructure whose longevity is uncertain. For how long will a Docker container image produced in 2017 remain usable? For how long will conda and its repositories be supported, and for how long will the binaries in these repositories function on current platforms?
> 3. None of today’s code freezing approaches comes with easy-to-use tooling and clear documentation that make it accessible to the average computational scientist. The technologies are today in a “good for early adopters” state. This means we cannot rely on them to preserve today’s research even though they may well take on this role in the future.
There are many issues related to reproducibility. I think it's acceptable for different people to focus on different pain points, and that one need not mention all of the possible pain points that others may have when talking about one's own concerns.
2. True longevity means a fully self-contained system image. Relying on third parties is of course not a good idea.
3. Install VirtualBox, double-click the image, run.
pandas changed NaN handling for ewma before 0.15, which can greatly change the output if your data is sparse.
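For anyone running into this today, a small sketch of the difference using the modern `.ewm` API (the numbers are purely illustrative; as far as I recall, `ignore_na=True` corresponds to the pre-0.15 weighting):

```python
# Sketch of the behaviour difference on sparse (NaN-heavy) data.
# ignore_na=False weights by absolute position (post-0.15 default);
# ignore_na=True weights by relative position (the old pre-0.15 behaviour).
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 10.0])

post_015 = s.ewm(span=3, ignore_na=False).mean()
pre_015 = s.ewm(span=3, ignore_na=True).mean()

print(pd.DataFrame({"ignore_na=False": post_015, "ignore_na=True": pre_015}))
```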
And even then his only supporting anecdote is that matplotlib made a breaking change to the way it handles legends. Meanwhile he was perfectly capable of reproducing his scientific results 4 years after publication despite the update from python 2.7 to 3.5 and minor updates to the rest of the cited libraries.
In light of that paucity of evidence I find it hard to support the many hyperbolic statements that the situation is a "big mistake", "calamity", or "earthquake" for the scientific community.
I do agree with the more general point that scientific code requires funding consideration for long-term maintenance. Many areas of research have adopted provisions for resources like reagents and computing hardware; these are considered core infrastructure and are often shared among labs. I could see a future where software development is supported in a similar way.
And if you concede that the author’s proposed timescales have merit, how does 4 years of stability (interrupted by the requirement to make minor changes) meet the requirement?
To me, the premise of longer-than-four-year reproducibility timelines seems obvious. And I hope it at least seems plausible to others. Hinsen’s point is that the SciPy community actively markets itself to the scientific community as a way of carrying out computationally aided research, but hasn’t even articulated what its disposition toward reproducibility is. I think transparency in this regard is probably the right thing. And I also find compelling Hinsen’s supposition that we should bias a little more in favor of reproducibility.
EDIT: and I should note that, given the SciPy community has chosen to make its home on top of Python, and is the layer of abstraction that many researchers are now interacting with, the Python 2 -> 3 transition is very much a SciPy issue... specifically so for the reasons articulated in this article.
Why? If you store the software and the data together, that shouldn't be an issue. Anyway, this is the case with any software, not just Python or SciPy.
However, that's an entirely different point from "the move from 2 -> 3 means that there's a bunch of science that isn't reproducible anymore". No there isn't, just use older versions of Python. Python 2.7 isn't going to self-destruct in 2020, you can still install it if you want to verify and examine an old result.
It's not exactly the same issue, but it feels similar to me - asking other people to hold back on what is presumably, for them, progress, so that they will remain able to use something he wrote without any special effort on his part.
It ain't that easy. The main issue is summarized in this tweet:
In an attempt to summarize Hinsen's argument:
Reproducibility is important for computational science. This means it needs to be possible to run the same code with the same data, and get the same result. And because code represents scientific models, not experiments, and these models are used for decades, reproducibility for decades is needed.
Yet Hinsen (who is a main contributor to both Numerical Python (numpy) and Scientific Python) observes that a typical Python script will only run for two or three years, not five or more.
Further, Hinsen points out that scientific software consists of four layers: domain-specific code, domain-specific libraries, scientific infrastructure, and non-scientific infrastructure. While the scientific infrastructure code, for example Numerical Python or Pandas, has already been updated to Python 3, for many domain-specific libraries this is not going to happen, because of restrictions on time and funding.
(An interesting insight for me: while by far most of the code running in any given program consists of OS routines, system libraries, and libraries such as numpy, the total amount of scientific code outside of these layers, summed over all programs, is far larger, and will probably never be rewritten.)
There are two other interesting statements Hinsen has cited in other blog posts. One is Linus Torvalds ["We don't break user space"](https://lkml.org/lkml/2012/12/23/75).
The other is Rich Hickey's "Spec-ulation" keynote on maintaining compatibility of interfaces; here is an old HN thread: https://news.ycombinator.com/item?id=13085952
The article also seems to think other languages are somehow immune:
> Today’s Java implementations will run the very first Java code from 1995 without changes,
Even though Java release notes document incompatible changes: http://www.oracle.com/technetwork/java/javase/8-compatibilit...
Python has had one backwards-incompatible change in 10 years, and I wouldn't even have called it a major one at the time. In these 10 years nobody who's on these long-term projects and complaining on Twitter seems to have so much as started testing on 3, let alone budgeted time or resources for a migration. Someone like Red Hat keeps coming to the rescue when the complaints get loud and extends the date by a few years; and rather than taking that as a clue that time is limited and starting work on a gentle, n-year upgrade plan, these groups just go back to acting like they have an infinite timeline. Until it happens all over again.
The good thing about Linux's backwards compatibility that they mention is that the Python 2 interpreter will probably remain compilable long after its EOL.
* the developer variety: "running the same code produces a bitwise-identical result"
* the scientific variety: "repeating the experiment produces the same result"
The second one is much broader; a lot of the work in science is documenting what factors are relevant in "repeating" an experiment: a result which could only be obtained with one exact piece of apparatus would be dangerously close to being irreproducible. Similarly, the idea of "the same result" is subject to interpretation - for instance, measuring the muon lifetime by timing decays in a scintillator should produce a result close to the known value but it can't produce the same value every time since it's a fundamentally random process.
I think at universities and research facilities there are many Python scripts that see little development, and with those, breaking changes are of course a problem, since the person who wrote them has potentially already left and unit tests are not common at all. So I understand his view, although I don't make it my own. I think they could get a long way with version pinning and virtual environments.
I never experienced many problems with breaking changes in numpy and scipy, sometimes maybe with pandas. Writing and operating Python-based web services that heavily use the scientific Python stack, I have hardly felt problems in terms of breaking changes, even though the code bases were considerable in size. I remember that binary compatibility of numpy had been an issue once, which we resolved by linking against an older numpy version when building wheels, then running with more recent numpy versions.
Hinsen mainly thinks in terms of namespacing and maintaining backwards compatibility, yet I think this is a tough burden to impose upon open source developers. What I would like is for each breaking change to be announced in the last release before the change, via a deprecation warning that I can programmatically check for.
I enjoy Rich Hickey's talks very much. His approach to versioning, however, I am careful about: when writing Clojure code, I repeatedly stumbled over API "bugs" or irregularities which were not fixed because they did not want to introduce a breaking change. In the end this cost me a lot of time writing new code, because I thought I was using a mature API when in fact it was sprinkled with immature functionality that just stayed there for the sake of backwards compatibility.
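For what it's worth, a hypothetical sketch of that trick in a `setup.py` (package, module, and source names made up): build the extension against the oldest numpy you plan to support, since extensions built against an older numpy generally keep working with newer numpy at runtime, but not the other way around.

```python
# Hypothetical setup.py: at *build* time, the installed numpy should be the
# oldest release the wheel is meant to support; the resulting binary then
# stays compatible with newer numpy versions at runtime.
from setuptools import setup, Extension
import numpy

setup(
    name="mylib",  # hypothetical package name
    ext_modules=[
        Extension(
            "mylib._fast",  # hypothetical extension module
            sources=["src/fast.c"],
            include_dirs=[numpy.get_include()],
        )
    ],
)
```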
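On the consumer side, a minimal sketch of what that programmatic check could look like today, assuming the library emits standard `DeprecationWarning`s or `FutureWarning`s:

```python
# Promote deprecation warnings to errors in the test suite, so the release
# *before* a breaking change fails loudly instead of warning quietly.
import warnings

warnings.simplefilter("error", DeprecationWarning)
warnings.simplefilter("error", FutureWarning)  # numpy/pandas often use FutureWarning

# equivalent from the command line:
#   python -W error::DeprecationWarning my_script.py
#   pytest -W error::DeprecationWarning
```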
It's easy for a paper to say something like "pick the set of N points that minimizes `median(f(x))`" without saying how the points were chosen (gradient descent? simulated annealing? brute force? some analytical method?).
What happens if you pick one optimization method and your results are different than the ones in the paper? There's no way for you to know if the paper is wrong or if you just picked the wrong method.
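A small made-up illustration of how much the unstated choice can matter; the objective below is not from any paper, but two perfectly defensible methods land in different minima:

```python
# Toy illustration: the same "minimize f" instruction, two defensible
# methods, two different answers.
import numpy as np
from scipy import optimize

def f(x):
    x = np.asarray(x).ravel()[0]  # both solvers pass a length-1 array
    return np.sin(3 * x) + 0.1 * (x - 1.0) ** 2  # made-up multimodal objective

local = optimize.minimize(f, x0=-2.0, method="Nelder-Mead")  # local search from one start
grid = optimize.brute(f, ranges=[(-5.0, 5.0)], Ns=1000)      # crude global grid search

x_local = float(np.atleast_1d(local.x)[0])
x_grid = float(np.atleast_1d(grid)[0])
print("Nelder-Mead from x0=-2:", x_local, f(x_local))
print("brute-force grid:      ", x_grid, f(x_grid))
```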
Losing the work generated before, or making it not clearly reproducible, is bad, but being a stick in the mud will hold back advancement for the generations hereafter, as in the High Energy Physics example: when was the last time you saw undergraduates learning Fortran en masse? Keeping an esoteric version of software around only helps you, unless you make sure it also helps those who come after.
However, it won't be possible to use them together in the same project with modern Python 3 code. So that's bad. I wonder whether one option is to create a dockerized version of the library running against Python 2 and then write a serialization/deserialization layer allowing the old API to be used from Python 3?
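Something along those lines could be sketched like this; everything here is hypothetical (`oldlib` stands in for the legacy Python 2-only library, and the server half would run inside the Python 2 container):

```python
# Very rough sketch of the "wrap the Python 2 library behind a serialization
# layer" idea, using XML-RPC from the standard library on both sides.

# --- server.py, runs under Python 2 inside the container ---------------------
# from SimpleXMLRPCServer import SimpleXMLRPCServer
# import oldlib  # hypothetical legacy Python 2-only library
#
# server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
# server.register_function(oldlib.analyze, "analyze")  # expose the old API
# server.serve_forever()

# --- client side, modern Python 3 code ---------------------------------------
import xmlrpc.client

def analyze(data):
    """Call the legacy routine running in the Python 2 container."""
    # arguments and results must be XML-RPC serializable (numbers, strings,
    # lists, dicts); anything fancier needs its own (de)serialization step
    with xmlrpc.client.ServerProxy("http://localhost:8000/", allow_none=True) as proxy:
        return proxy.analyze(data)
```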
Great Filter Hypothesis: Software rewrite chokes civilizations.
But... there is the problem of context. In their case they had relatively few dependencies and I was able to get everything to run, but there are more complex environments and ecosystems. Even if I create a docker container, at some point it will no longer run. I think what we can do is try to make it possible for referees and early authors to run our code for a time. We can't hope that 5 or 10 years from now this will still be possible, but hopefully if we document our reduction steps, then if someone really wants to reproduce the work, they can see the flow.
Now, why might this become important? One example is a case of outright fraud. I went to a talk by someone from MD Anderson about a case of fraud at Duke in an oncology study. They saw an amazing result and their colleagues wanted to be able to use the same statistical methodology. They initially tried to work with the original authors, but once they discovered problems with the work, the original author stopped being responsive. They spent an amazing number of man-years trying to reproduce the result and figuring out what went wrong (intentionally and not). This was important because human trials were beginning. If the original source code (and infrastructure) had been publicly available, this could have been avoided.
For those that say a mathematical description should be sufficient: I would say, not always. In some cases the math could be fine, but the implementation could be flawed. Often if you find an error in a previous result, you need to at least make a guess as to what could have gone wrong before. The early days of Monte Carlo simulations sometimes suffered from flaws in the implementations of random number generators even if the overall algorithm was fine...
Containerization might solve the problem over the short term (which I would argue is the most relevant time period). But, it won't solve the author's second problem which is maintaining software. Here, I think the problem is a lack of resources--there's not much credit or funding for maintaining scientific software...
- Matlab changes some behaviour here and there from time to time, deprecates functionality, ...
- both C and C++ have gone through quite a few changes as well, and so have the compilers; so it's not exactly uncommon to find 20-year-old code written without this in mind that makes use of some specifics which are now unavailable
- assembly code written for specific DSP hardware can quickly turn obsolete: hardware unavailable anymore, build tools not running on recent OS etc
- not an uncommon problem in psychological research etc: parallel ports are disappearing
Problems with hardware/compiler/platform-specifics changing can usually be avoided in software with the proper abstractions (quite the work though, sometimes) but it's kinda hard to foresee what the evolution of a language/library is going to be in 10 or 20 years, let alone work around that. Or if you're going the other way and don't want to update to newer versions but are looking to recreate environments/dependencies: what is going to happen to the tools you are using to recreate environments in 20 years, what if they get breaking changes? And to the sources used to recreate the environment?
tldr; not sure if there really is a complete failsafe solution for this which spans multiple decades
Also, looking just at the end results of science: suppose you need to reproduce a result from a dataset in 20 years. Maybe there's a point where trying to make everything future-proof now is more work than starting over from scratch in 20 years (just the analysis, supposing the data is still there)? For example, at some point there was C but not yet SciPy. I can certainly imagine cases where rewriting a certain analysis from scratch in SciPy now would be less work than trying to get the ancient C working again.
C, C++, and Fortran have ISO standards, and if you write standard-conforming code, the committees that revise standards are very careful not to break things. It's not like Python where the BDFL changes the syntax of the print statement.
You still need the right interpreter version. And of course if you have any exotic dependencies, you’d need to figure out what to do there.
I’d say the major difference is one of disposition. The R community seems more on board with the premise that this is a problem worthy of solving. I suspect that that may be a function of the R community having a longer / deeper academic history than the Python community.
EDIT: I should mention that the R community maintains some documentation about the issue as well. https://cran.r-project.org/web/views/ReproducibleResearch.ht...