Hacker News
A plea for stability in the SciPy ecosystem (khinsen.net)
53 points by jnxx 8 months ago | 47 comments

I am not convinced by the reproducibility issues. To me, it is obvious that reproducing a scientific finding requires the exact same versions of all software involved (unless the authors rule out specific side effects).

So to reproduce paper A from 2015 that used numpy 1.234 in Python 2.654 on a 32-bit Windows system and version abcxyz123 of obscure library X, I would need to recreate those exact conditions.

Whatever happens to the software in the future is irrelevant. Bugfixes might change results anyway.

Yeah, that'll work when papers stop being accepted/published/paid attention to unless the authors publish a container/image, or at least an environment.txt file listing all dependencies and a Python version.

I'm working through this now, actually, trying to reproduce some research from last year, and between fixing things that changed between Pytorch 0.3 and 0.4, and cleaning up some of the underlying data, I've wasted a couple of days (the fixes were easy; figuring out what caused the problems in the first place, not so much).

The author wrote this in the comments, and it fits as a reply to your comment:

> the main issue is that you cannot suppose that everyone (all program authors and users) know exactly what to do and do it correctly. In practice, the approach you describe almost never works because some information is missing. To make it practical, we'd need easy-to-use tooling for all phases: producing a complete list of versioned dependencies (including C libraries), verifying the completeness of this list, and restoring the environment on a different machine.
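The first of those phases (producing a complete list of versioned dependencies) can at least be sketched with the standard library alone. This is an illustrative helper, not the tooling the author is asking for, and it misses exactly the hard part he names: C libraries and anything else below the Python layer.

```python
import platform
import sys
from importlib import metadata  # stdlib since Python 3.8

def freeze_environment(path="environment.txt"):
    """Write the interpreter version and every installed package's pinned
    version to a requirements-style file, and return the lines written."""
    lines = ["# python %s on %s" % (platform.python_version(), sys.platform)]
    dists = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or "unknown"
        dists[name] = dist.version
    lines += ["%s==%s" % (name, dists[name])
              for name in sorted(dists, key=str.lower)]
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines
```

Verifying that such a list is *complete*, and restoring it on a different machine years later, is where the real difficulty starts.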

So basically a container or VM image. To be honest the python version seems like the smallest issue for reproducibility. I would be more worried about different floating point operations with respect to GPUs for instance.

The author addressed "a container or VM image" in the followup, at http://blog.khinsen.net/posts/2017/11/22/stability-in-the-sc... :

> There are three main reasons why this is not a sufficient solution:

> 1. Freezing code is fine for archival reproducibility, as I mentioned in my original post. It is not sufficient for living code bases that people work on over decades. ...

(His example was his MMTK project which started in the 1.3 days. Twenty years later, it will take a lot of work to make the 2->3 transition.)

> 2. The technical solutions proposed all depend on yet more infrastructure whose longevity is uncertain. For how long will a Docker container image produced in 2017 remain usable? For how long will conda and its repositories be supported, and for how long will the binaries in these repositories function on current platforms?

> 3. None of today’s code freezing approaches comes with easy-to-use tooling and clear documentation that make it accessible to the average computational scientist. The technologies are today in a “good for early adopters” state. This means we cannot rely on them to preserve today’s research even though they may well take on this role in the future.

There are many issues related to reproducibility. I think it's acceptable for different people to focus on different pain points, and that one need not mention all of the possible pain points that others may have when talking about one's own concerns.

1. That's not what reproducibility is about imo. Living code bases are a whole different game.

2. True longevity is a fully self-contained system image. Of course relying on third parties is not a good idea.

3. Install VirtualBox, doubleclick image, run.

alternatively, scientific results should be robust enough to still be observable with marginally different software versions

Not sure if that’s possible in practice. Every analysis would have to be written using several different software stacks... It’s probably doable only if the software is fairly simple or if someone has a lot of extra funding to burn through.

Key word there is marginally. If your finding holds up using NumPy 1.10 but not 1.11... you do not have a finding.

That's probably true for numpy 1.10-1.11, but I've seen other libraries make dramatic changes (for example, changing the default behavior of a function).


pandas changed NaN handling for ewma pre 0.15, which can greatly change the output if your data is sparse
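For the curious, pandas' modern `ewm` API (0.18+) still exposes both NaN treatments via the `ignore_na` flag, so the size of the discrepancy on sparse data is easy to demonstrate; the series below is a made-up example:

```python
import numpy as np
import pandas as pd

# A sparse series: two observations separated by missing values.
s = pd.Series([1.0, np.nan, np.nan, 10.0])

# Old behaviour (roughly pre-0.15): NaNs are skipped, so the two observations
# are treated as adjacent. Newer default: NaNs still count toward the spacing,
# so the older observation is down-weighted much more heavily.
old_style = s.ewm(span=2, ignore_na=True).mean()
new_style = s.ewm(span=2, ignore_na=False).mean()

assert old_style.iloc[-1] != new_style.iloc[-1]  # same data, different answer
```

The gap between the two final values grows with the length of the NaN runs, which is why sparse data is the worst case here.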

But again, if there are NaNs cropping up in your analysis, and your results depend on what an intermediate tool decides to do with them, you have overlooked something substantial.


Imagine if BLAS had to be frozen and run from a container.

This article never really addresses SciPy or NumPy despite the title and much of the discussion. Rather, the author is ranting about the change from python 2 to 3.

And even then his only supporting anecdote is that matplotlib made a breaking change to the way it handles legends. Meanwhile he was perfectly capable of reproducing his scientific results 4 years after publication despite the update from python 2.7 to 3.5 and minor updates to the rest of the cited libraries.

In light of that paucity of evidence I find it hard to support the many hyperbolic statements that the situation is a "big mistake", "calamity", or "earthquake" for the scientific community.

I do agree with the more general point that scientific code requires funding consideration for long term maintenance. Many aspects of research have adopted provisions for equipment like reagents and computing hardware. These are considered core infrastructure and are often shared among labs. I could see a future where software development is supported in a similar way.

Do you disagree with the author’s premise that multi-decade reproducibility is of value to the scientific community? On what basis would you make that disagreement?

And if you concede that the author’s proposed timescales have merit, how does 4 years of stability (interrupted by the requirement to make minor changes) meet the requirement?

To me, the premise of longer-than-four-year reproducibility timelines seems obvious. And I hope it at least seems plausible to others. Hinsen’s point is that the SciPy community actively markets itself to the scientific community as a way of carrying out computationally aided research, but hasn’t even articulated what its disposition toward reproducibility is. I think transparency in this regard is probably the right thing. And I also find compelling Hinsen’s supposition that we should bias a little more in favor of reproducibility.

EDIT: and I should note that, given the SciPy community has chosen to make its home on top of Python, and is the layer of abstraction that many researchers are now interacting with, the Python 2 -> 3 transition is very much a SciPy issue... specifically so for the reasons articulated in this article.

"The disappearance of Python 2 will leave much scientific software orphaned, and many published results irreproducible. Yes, the big well-known packages of the SciPy ecosystem all work with Python 3 by now, but the same cannot be said for many domain-specific libraries that have a much smaller user and developer base, and much more limited resources."

Why? If you store software and data together, that shouldn't be an issue. Anyway, this is the case with any software, not just Python or SciPy.

I think he's referring to the resource problem. Let's suppose that you write a library in your copious free time (like he apparently did). Once you reach around 100,000 lines of code, it may be nontrivial to update it (for example, if it has bindings to C), given that he probably doesn't have a team, but maybe just himself and a graduate student.

I think it's a fair point to say "the move from 2 -> 3 caused an enormous burden for library maintainers, and many packages aren't going to be upgraded for 3". There's no question that this is true, inside the scientific computing community and out. It was arguably worth it to get a better language, but still, the transition doesn't come without cost.

However, that's an entirely different point from "the move from 2 -> 3 means that there's a bunch of science that isn't reproducible anymore". No there isn't, just use older versions of Python. Python 2.7 isn't going to self-destruct in 2020, you can still install it if you want to verify and examine an old result.
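In that spirit, a script published alongside a result can at least state which interpreter it assumes, so a future reader running it under the wrong Python gets a clear message rather than a confusing failure. A minimal sketch (the function name and message are illustrative):

```python
import sys

def check_interpreter(required, actual=None):
    """True if the running interpreter matches the (major, minor) version
    the analysis was originally published against."""
    actual = tuple(actual or sys.version_info[:2])
    return actual == tuple(required)

# E.g. a script accompanying a 2015 paper might begin with:
if not check_interpreter((2, 7)):
    print("warning: published results assumed Python 2.7; "
          "this run is on %d.%d and may not match" % sys.version_info[:2])
```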

I'm curious if people who sunk large efforts into MATLAB or LabView packages have ever made demands that people stop migrating to Python so that people can keep using their stuff.

It's not exactly the same issue, but it feels similar to me - asking other people to hold back on what is presumably, for them, progress, so that they will remain able to use something he wrote without any special effort on his part.

Instead of asking others to shoulder the (enormous) burden of backwards compatibility so that his software may continue to work, I think this person would do well to ask package maintainers to make sure their package metadata is correct (e.g. it lists which versions of Python are supported), and to adopt containerization-based tools for packaging his software dependencies in a sustainable way. The beauty of Docker (and LXC and other container technologies) is that the Linux kernel ABI becomes your backwards-compatible interface, and the kernel maintainers are willing to shoulder that burden.

The end-of-life of Python 2 will cause a lot of scientific software to break. Many readers who work in different domains of software development will just shrug and say "Then, Scientists should just change to Python 3 and write all new code in it".

It ain't that easy. The main issue is summarized in this tweet:


In an attempt to summarize Hinsen's argument:

Reproducibility is important for computational science. This means it needs to be possible to run the same code with the same data, and get the same result. And because code represents scientific models, not experiments, and these models are used for decades, reproducibility for decades is needed.

Yet Hinsen (who is a main contributor to both Numerical Python (numpy) and Scientific Python) observes that a typical Python script will only run for two or three years, not five or more.

Further, Hinsen points out that scientific software consists of four layers: domain-specific code, domain-specific libraries, scientific infrastructure, and non-scientific infrastructure. While the scientific infrastructure code, for example Numerical Python or Pandas, has already been updated for Python 3, for many domain-specific libraries this is not going to happen, because of restrictions on time and funding.

(An interesting insight for me: while by far most of the actual code in any given program will consist of OS routines, system libraries, and libraries such as numpy, the total amount of scientific code outside these layers is far larger, and will probably never be rewritten.)

There are two other interesting statements Hinsen has cited in other blog posts. One is Linus Torvalds ["We don't break user space"](https://lkml.org/lkml/2012/12/23/75).

The other is Rich Hickey's "Spec-ulation" keynote on maintaining compatibility of interfaces - here is an old HN thread: https://news.ycombinator.com/item?id=13085952

I understand a lot of the changes cause pain in the academic community. There's a lot that could be better. But I really don't agree with the reproducibility issues. Python 2 doesn't "disappear". It's still available. D.Beazley did an experiment of compiling old python versions recently and went back all the way to the pre-1.0, pre-vcs versions. They still compile (with minimal changes) and still run. Old packages are still available (you should keep copies of specific versions anyway if you want reproducibility). It's not trivial, but it's not rocket surgery either.

The article also seems to think other languages are somehow immune:

> Today’s Java implementations will run the very first Java code from 1995 without changes,

Even though Java release notes document incompatible changes: http://www.oracle.com/technetwork/java/javase/8-compatibilit...

I don’t understand what they mean with the Java comments. Moving from 5 to 6 was a massive pain at one of my companies. It was a two year transition that went into the EOL of 5. 6 to 7 was just barely any better. Maybe they wrote forward-compatible Java and never had that ugly experience. Given that python 3 has been on the radar for 10 years, they could have done the same for python.

Python has had one backwards-incompatible change in 10 years, and I wouldn’t even call it a major one at the time. In these 10 years nobody who’s on these long-term projects and complaining on Twitter seems to have so much as started testing on 3, let alone budgeted time or resources for a migration. Someone like Red Hat keeps coming to the rescue when the complaints get loud and extends the date by a few years — and rather than taking that as a clue that time is limited and to start working on a gentle, n-year upgrade plan, these groups just go back to acting like they have an infinite timeline. Until it happens all over again.

The good thing about Linux's backwards compatibility that they mention is that the python2 interpreter will probably be compilable long after its EOL.

It might compile, but with modern compilers doing increasingly aggressive optimizations (especially on undefined behaviour), it becomes more of a gamble whether the old source with a modern compiler produces the same results as the old code with an old compiler.

If it doesn’t, the results from that computation are scientifically worthless. Reproducibility means more than just getting the same results from the same code with the same compiler (although that’s clearly a desirable prerequisite)

So use an old compiler?

Seems like there are two different senses of "reproducibility" happening here:

* the developer variety: "running the same code produces a bitwise-identical result"

* the scientific variety: "repeating the experiment produces the same result"

The second one is much broader; a lot of the work in science is documenting what factors are relevant in "repeating" an experiment: a result which could only be obtained with one exact piece of apparatus would be dangerously close to being irreproducible. Similarly, the idea of "the same result" is subject to interpretation - for instance, measuring the muon lifetime by timing decays in a scintillator should produce a result close to the known value but it can't produce the same value every time since it's a fundamentally random process.
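The two senses can be made concrete with a toy simulation; the 2.2 µs muon lifetime is the textbook value, everything else here is illustrative:

```python
import random
import statistics

# Developer reproducibility: a fixed seed makes the stream bitwise-identical.
rng_a, rng_b = random.Random(42), random.Random(42)
assert [rng_a.random() for _ in range(5)] == [rng_b.random() for _ in range(5)]

# Scientific reproducibility: two independent "measurements" of the muon
# lifetime (tau ~ 2.2 microseconds) via simulated exponential decays agree
# only statistically, never bitwise.
TAU = 2.2
run_1 = [random.expovariate(1 / TAU) for _ in range(100_000)]
run_2 = [random.expovariate(1 / TAU) for _ in range(100_000)]
assert statistics.mean(run_1) != statistics.mean(run_2)            # not identical
assert abs(statistics.mean(run_1) - statistics.mean(run_2)) < 0.1  # but close
```

A seeded rerun satisfies the developer's definition; only the unseeded pair of runs tests the scientific one.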

I remember reading Hinsen's blog post in 2017 when it made rounds in the scientific python community.

I think at universities and research facilities, there are many Python scripts that see little development, and with those, breaking changes are of course a problem, as the person who wrote them has potentially already left and unit tests are not common at all. So I understand his view, although I don't make it my own. I think they could get a long way by version pinning and using virtual environments.

I never experienced many problems with breaking changes in numpy and scipy, sometimes maybe with pandas. Writing and operating Python-based web services that heavily use the scientific Python stack, I have hardly felt problems in terms of breaking changes, even though the code bases were considerable in size. I remember that binary compatibility of numpy had been an issue once, which we resolved by linking against an older numpy version when building wheels, then running with more recent numpy versions.

Hinsen mainly thinks in terms of namespacing and maintaining backwards compatibility, yet I think this is a tough burden to impose upon open source developers. What I would like is to have each breaking change announced, in the last patch release before it lands, by a deprecation warning that I can programmatically check for.
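The programmatic check wished for here is already possible with the stdlib `warnings` machinery: promote deprecation warnings to hard errors in a test run, so an announced breaking change fails the suite instead of scrolling by. A minimal sketch (`strict_call` and `legacy_api` are made-up names):

```python
import warnings

def strict_call(func, *args, **kwargs):
    """Run func with DeprecationWarning promoted to a hard error."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        return func(*args, **kwargs)

def legacy_api():  # stand-in for a library call slated for removal
    warnings.warn("legacy_api will change in the next release",
                  DeprecationWarning)
    return 42

try:
    strict_call(legacy_api)
except DeprecationWarning:
    print("caught the upcoming breakage before it shipped")
```

Of course this only works if library authors actually emit the warnings a release ahead of time, which is exactly the commenter's request.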

I enjoy Rich Hickey's talks very much. His approach to versioning, however, I am careful about: when writing Clojure code, I repeatedly stumbled over API "bugs" or irregularities which were not fixed because they did not want to introduce a breaking change. In the end, this meant I lost much time writing my new code, because I thought I was using a mature API when in fact it was sprinkled with immature functionality that just stayed there for the sake of backwards compatibility.

This may sound trite, but this is why science is based on math, and not code. Math results can be fully reproduced from a paper document no matter how old it is. That is the point of math. Until code reaches that level of self consistency and rigor, no one is going to waste their time building a friggin virtual machine for their one-off incremental molecular dynamics result that all of 2 people will try to reproduce.

"Science" is not discussed here, but academic research publishing. And to state that "Academic research papers are based on math" is, lacking a more polite term, false.

There are some cases where math alone is not enough to reproduce the results though.

It's easy for a paper to say something like "pick the set of N points that minimizes `median(f(x))`" without saying how the points were chosen (gradient descent? simulated annealing? brute force? some analytical method?).

What happens if you pick one optimization method and your results are different than the ones in the paper? There's no way for you to know if the paper is wrong or if you just picked the wrong method.
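A toy illustration of that ambiguity: the objective below is fully specified by its formula, yet two defensible optimization methods report different minimizers. The function, starting point, and grid are all made up for the example:

```python
import math

def f(x):
    """A fully specified objective with more than one local minimum."""
    return math.sin(3 * x) + 0.1 * x * x

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        grad = 3 * math.cos(3 * x) + 0.2 * x  # analytic derivative of f
        x -= lr * grad
    return x

def grid_search(lo=-3.0, hi=3.0, n=10_000):
    return min((lo + i * (hi - lo) / n for i in range(n + 1)), key=f)

x_gd = gradient_descent(1.0)  # settles into the nearby local minimum
x_gs = grid_search()          # finds the grid's global minimum
# Both are legitimate readings of "minimize f", yet x_gd and x_gs disagree.
```

A paper that only says "minimize f" leaves a reader no way to tell which answer the authors obtained.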

This feels like a plea for Fortran that has hindered some of the other communities of scientific software like High Energy Physics where they are struggling to get key software ported to C++. If the SciPy community glues itself to Python 2 it's just going to be that much worse when Python 4 exists years from now. I agree with a lot of the other people commenting here that you should be able to reproduce the results with the code at the time it was written, if that is not being preserved as either an artifact or a versioning piece that is a process problem when releasing work.

Losing the work generated before, or making it not clearly reproducible, is bad, but trying to be a stick in the mud will prevent advancement for the generations hereafter, as the High Energy Physics example shows: when was the last time you saw undergraduates taking Fortran en masse? Keeping an esoteric version of software only helps you, unless you make sure it also helps those who come after.

I don't think reproducibility will actually be hurt; you can always download an old copy of Python 2 somewhere (there are places that archive it) and run the code. Unsupported doesn't mean the software version implodes into a vacuum. It just means that if you install it, it'll be a rather ancient version with possibly many bugs and vulnerabilities, and you should probably not let it touch production.

So development on the author’s Molecular Modeling Toolkit will stop, I see that. But it sounds like these older packages can still be used, it’s just that scientists can’t blindly run everything on the same python installation. But with pyenv/virtualenv, or dockerized versions of python2, it should be perfectly possible to keep using them for a very long time. For the same reason, the author’s comment that scientific python scripts tend not to be usable after 5 or more years seems not entirely accurate.

However, it won't be possible to use them together in the same project with modern Python 3 code. So that's bad. I wonder whether one option is to create a dockerized version of the library running against Python 2 and then write a serialization/deserialization layer allowing the old API to be used from Python 3?

I asked the author; he says that there would be too much serialization/deserialization overhead.


It would be great to have a PDF/A equivalent for a subset of scientific software that can be distributed and archived. It would allow easy access for future generations. Alas, it's not going to happen. Perlis epigram #14: "In the long run every program becomes rococo - then rubble."

Great Filter Hypothesis: Software rewrite chokes civilizations.

SciPy was 0.x software until 2017. Demanding stability to 0.x software is unreasonable. The author used SciPy with full knowledge that it's 0.x software.

I'll start by saying that I don't have a good answer to this problem. But I do have some thoughts. I recently reviewed a paper and was extremely happy to see that they included a Jupyter notebook and their Keras model along with some testing data. For those of you not in science, this was rather amazing. And when the paper is published, it will be part of the supplemental materials, so at least for a while, readers will be able to play with it (and the authors plan to release their training data once they find a good way to share such a large set of data--again, a limit for academic researchers). So, this is going above and beyond what I've seen in normal practice (I'm in condensed matter physics).

But...there is the problem of context. In their case, they had relatively few dependencies and I was able to get everything to run, but there are more complex environments and ecosystems. Even if I create a docker container, at some point it will no longer run. I think what we can do is try to make it possible for referees and early readers to run our code for a time. We can't hope that 5 or 10 years from now this will still be possible--but hopefully, if we document our reduction steps, then if someone really wants to reproduce the work, they can see the flow.

Now, why might this become important? One example is a case of outright fraud. I went to a talk by someone from MD Anderson about a case of fraud at Duke in an oncology study. They saw an amazing result, and their colleagues wanted to be able to use the same statistical methodology. They initially tried to work with the original authors, but once they discovered problems with the work, the original author stopped being responsive. They spent an amazing number of man-years trying to reproduce the result and figure out what went wrong (intentionally and not). This was important because human trials were beginning. If the original source code (and infrastructure) had been publicly available, this could have been avoided.

For those who say a mathematical description should be sufficient--I would say, not always. In some cases, the math could be fine, but the implementation could be flawed. Often, if you find an error in a previous result, you need to at least make a guess as to what could have gone wrong before. The early days of Monte Carlo simulations sometimes suffered from flaws in the implementations of random number generators even if the overall algorithm was fine...

Containerization might solve the problem over the short term (which I would argue is the most relevant time period). But, it won't solve the author's second problem which is maintaining software. Here, I think the problem is a lack of resources--there's not much credit or funding for maintaining scientific software...

For sharing large training data have a look at https://zenodo.org which is run by the CERN people. Up to 50GB is no problem and after that they say just talk to us :).

I've given this some brief thought, and looking at other software and hardware being used in scientific research, it is clear this isn't limited to Python at all. In other words: no matter what you use, it looks like at some point you're going to have to change something (or run the entire thing in a virtualized environment, which will last way longer but has its own problems). What exactly needs changing, and how much work it is, depends on the implementation originally selected and on the amount of time spent trying to avoid such change, but it looks like this is nearly unavoidable. Off the top of my head, these are problems we had to deal with in the past year or so:

- Matlab changes some behaviour here and there from time to time, deprecates functionality, ...

- both C and C++ have gone through quite some changes as well, and so have the compilers; so it's not exactly uncommon to find 20 year old code written without this in mind and making use of some specifics which now are unavailable

- assembly code written for specific DSP hardware can quickly turn obsolete: hardware unavailable anymore, build tools not running on recent OS etc

- not an uncommon problem in psychological research etc: parallel ports are disappearing

Problems with hardware/compiler/platform specifics changing can usually be avoided in software with the proper abstractions (quite some work, though), but it's kinda hard to foresee what the evolution of a language/library is going to be in 10 or 20 years, let alone work around that. Or if you're going the other way and don't want to update to newer versions but are looking to recreate environments/dependencies: what is going to happen to the tools you are using to recreate environments in 20 years, what if they get breaking changes? And to the sources used to recreate the environment?

tldr; not sure if there really is a complete failsafe solution for this which spans multiple decades

Also, looking just at the end results of science: suppose you need to reproduce a result from a dataset in 20 years. Maybe there's a point where trying to make everything future-proof now is more work than starting over from scratch in 20 years (just the analysis, supposing the data is still there)? For example, at some point there was C but not yet SciPy. I can certainly imagine cases where rewriting certain analyses from scratch now in SciPy would be less work than trying to get the ancient C working again.

"- both C and C++ have gone through quite some changes as well, and so have the compilers; so it's not exactly uncommon to find 20 year old code written without this in mind and making use of some specifics which now are unavailable"

C, C++, and Fortran have ISO standards, and if you write standard-conforming code, the committees that revise standards are very careful not to break things. It's not like Python where the BDFL changes the syntax of the print statement.

Do these issues impact the R community to the same extent?

They’ve got some infrastructure such as “Packrat” that helps a little. You can at least be sure you have the correct version of each dependency stored away with your sources.

You still need the right interpreter version. And of course if you have any exotic dependencies, you’d need to figure out what to do there.

I’d say the major difference is one of disposition. The R community seems more on board with the premise that this is a problem worthy of solving. I suspect that that may be a function of the R community having a longer / deeper academic history than the Python community.

EDIT: I should mention that the R community maintains some documentation about the issue as well. https://cran.r-project.org/web/views/ReproducibleResearch.ht...

I can't speak to how the extent compares as I don't use Python nearly as much as R, but it is an issue. In some ways it may be worse, as there are a much wider variety of packages available for R from many different authors, so it isn't unheard of for packages to not be available for newer versions of R. That said, many of those tend to be niche, and I haven't really had old files need any kind of major revision.


? The date given is 2017-11-16. The blog started in 2015.

Sorry, must have mixed up dates in my head. You are correct.
