Challenge to scientists: does your ten-year-old code still run? (nature.com)
305 points by sohkamyung on Aug 24, 2020 | 477 comments



This article brings up scientific code from 10 years ago, but how about code from... right now? Scientists really need to publish their code artifacts, and we can no longer just say "Well, they're scientists or mathematicians" and allow that as an excuse for terrible code with no testing specs. Take this for example:

https://github.com/mrc-ide/covid-sim/blob/e8f7864ad150f40022...

This was used by the Imperial College for COVID-19 predictions. It has race conditions, seeds the model multiple times, and therefore has totally non-deterministic results[0]. Also, this is the cleaned up repo. The original is not available[1].

A lot of my homework from over 10 years ago still runs (some of it requires the right Docker container: https://github.com/sumdog/assignments/). If journals really care about the reproducibility crisis, artifact reviews need to be part of the editorial process. Scientific code needs to have tests and a minimal amount of test coverage, and the code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer-science journals.

[0] https://lockdownsceptics.org/code-review-of-fergusons-model/

[1] https://github.com/mrc-ide/covid-sim/issues/179


I am all for open science, but you have to understand that the links in your post illustrate the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

By the way, yes I tested my ten year old code and it does still work. What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.


Let's be clear - scientific-grade code is held to a lower standard than production-grade code. But it is still a real standard.

Does scientific-grade code need to handle a large number of users running it at the same time? Probably not a genuine concern, since those users will run their own copies of the code on their own hardware, and it's not necessary or relevant for users to see the same networked results from the same instance of the program running on a central machine.

Does scientific-grade code need to publish telemetry? Eh, usually no. Set up alerting so that on-call engineers can be paged when (not if) it falls over? Nope.

Does scientific-grade code need to handle the authorization and authentication of users? Nope.

Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, when there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation, and what matters for cryptographic purposes is that the seed numbers for the pseudorandom generation are sufficiently hidden / unknown. For scientific purposes, the seed numbers should be published on purpose, so that a) the exact results you found, sufficiently random as they are for the purpose of your experiment, can still be independently verified by a peer reviewer, b) a peer reviewer can intentionally decide to pick a different seed value, which will lead to different results but should still lead to the same conclusion if your decision to reject / refuse to reject the null hypothesis was correct.
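
To make that concrete, here is a toy sketch (plain numpy, nothing to do with any particular paper) of what publishing the seed buys a reviewer:

    # Toy sketch: a seeded Monte Carlo estimate of pi. Publishing `seed` lets a
    # reviewer replay the exact run; picking a different seed should still lead
    # to the same conclusion, just with different noise.
    import numpy as np

    def estimate_pi(seed, n=1_000_000):
        rng = np.random.default_rng(seed)       # explicit, published seed
        xy = rng.random((n, 2))                 # points in the unit square
        inside = (xy ** 2).sum(axis=1) < 1.0    # inside the quarter circle
        return 4.0 * inside.mean()

    print(estimate_pi(seed=12345))   # bit-identical on every rerun
    print(estimate_pi(seed=67890))   # different draws, same conclusion (~3.14)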


> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes.

I agree that this is a good property for scientific code to have, but I think we need to be careful not to treat re-running of existing code the same way we treat genuinely independent replication.

Traditionally, people freshly constructed any necessary apparatus and walked through the steps of the procedure themselves. That interaction between experiment and human brain meats is missing when code is simply reused (whether we consider it apparatus or procedure).

Once we have multiple implementations, if there is a meaningful difference between them, at that point replayability is of tremendous value in identifying why they differ.

But it is not reproducibility, as we want that term to be used in science.


This! I struggled with this topic in university. I was studying pulsar astronomy, and there were only one or two common tools used at the lower levels of data processing, and they had been the same tools for a couple of decades.

The software was "reproducible" in that the same starting conditions produced the same output, but that didn't mean the _science_ was reproducible, as every study used the same software.

I repeatedly brought it up, but I wasn't advanced enough in my studies to be able to do anything about it. By the time I felt comfortable with that, I was on my way out of the field and into a non-academic career.

I have kept up with the field to a certain extent, and there is now a project to create a fully independent replacement for that original code, which should help shed some light (it has been in progress for a few years now and is still going strong).


> The software was "reproducible" in that the same starting conditions produced the same output, but that didn't mean the _science_ was reproducible, as every study used the same software.

This is the difference between reproducibility and replicability [1]. Reproducibility is the ability to run the same software on the same input data to get the same output; replication would be analyzing the same input data (or new, replicated data following the original collection protocol) with new software and getting the same result.

I've experienced the same lack of interest with established researchers in my field, but I can at least ensure that all my studies are both reproducible and replicable by sharing my code and data.

[1] Plesser HE. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Front Neuroinform. 2018;11:76.


This is almost an argument for not publishing code. If you publish all the equations but not the code, then everybody has to write their own implementation from them.

Something like this is the norm in some more mathematical fields, where only the polished final version is published, as if done by pure thought. To build that, first you have to reproduce it, invariably by building your own code -- perhaps equally awful, but independent.


Maybe gate release of the code by some number of attempted replications.


Should this be surprising? I'm not saying it is correct, but it is similar to the response many managers give concerning a badly needed rewrite of business software. Doing so is very risky and the benefits aren't always easy to quantify. Also, nobody wants to pay you to do that. Research is highly competitive, so no researcher is going to want to spend valuable time rebuilding a tool that already exists, even if it's needed, when no other researchers are doing so.


Conversely though, it is often impossible to obtain the original code to replay and identify differences once that step is reached without some sort of strong incentive or mandate for researchers to publish it. When the only copy is lost in the now-inaccessible home folder of some former grad student's old lab machine, there is a strong disincentive to try replicating at all because one has little to consult on whether/how close the replicated methods are to the original ones.


And so we find ourselves in the same situation as the rest of the scientific process, throughout history. When I try to replicate your published paper and I fail, it's completely unclear whether it's "your fault" or "my fault" or pure happenstance, and there's a lot of picking apart that needs to be done with usually no access to the original experimental apparatus and sometimes no access to the original experimenters.

The fact that we can have that option is an amazing opportunity that a confluence of attributes of software (specificity, replayability, ease of copying) afford us. Where we are not exploiting this like we could be, it is a failure of our institutions! But it is different-in-kind from traditional reproducibility.


Of course, but the flip side is that same confluence of attributes has also exacerbated issues of reproducibility. Just as science and the methods/mediums by which we conduct/disseminate it have changed, so too should the standard of what is considered acceptable to reproduce. This is especially relevant given how much broader the societal and policy implications have become.

More concretely, it is 100% fair (and I might argue necessary) to demand more of our institutions and work to improve their failures. I'm sure many researchers have encountered publications of the form "we applied <proprietary model (TM)> (not explained) to <proprietary data> (partially explained) after <two sentence description of preprocessing> and obtained SOTA results!" in a reputable venue. Sure, this might have been even less reproducible 200 years ago than it is now, but the authors would also be less likely to be competing with you for limited funding! Debating the traditional definition of reproducibility has its place, but we should also be doing as much as possible to give reviewers and replicators a leg up. This often flies in the face of many incentives the research community faces, but shifting blame to institutions by default (not saying you're doing this, but I've seen many who do) is taking the easy road out and does little to help the imbalanced ratio of discussion:progress.


This. I absolutely agree there needs to be more transparency, and scientific code should be as open as possible. But this should not replace replication.


But "rerunning reproducability" is mostly a neccessary requirement for independent reproducability. If you can't even run the original calculations against the original data again how can you be sure that you are not comparing apples to oranges?


In some simulations, each rerun produces different results because you’re simulating random events (like lightning formation) or using a non-deterministic algorithm (like Monte Carlo sampling). Just “saving the random seed” might not be sufficient to make it deterministic either, because if you do parallelized or concurrent work in your code (common in scientific code), the same pseudorandom numbers may be consumed in a different order each time you run it.

But repeating the simulation a large number of times, with different random seeds, should produce statistically similar output if the code is rigorous. So even if each simulation is not reproducible, as long as the statistical distribution of outputs is reproducible, that should be sufficient.
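
A toy sketch of what that check could look like in practice (nothing to do with any particular simulation):

    # Toy sketch: check that the *distribution* of outcomes is stable across
    # seeds, instead of demanding bit-identical output from any single run.
    import numpy as np

    def simulate(seed, n=100_000):
        rng = np.random.default_rng(seed)
        # stand-in for a real stochastic simulation: mean of an exponential
        return rng.exponential(scale=2.0, size=n).mean()

    results = np.array([simulate(seed) for seed in range(50)])
    print(results.mean(), results.std())
    # A replication with 50 *different* seeds should land within a few standard
    # errors of the published mean; the individual runs won't match exactly.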


Very interesting. I was thinking of software as most similar to apparatus, and secondarily to procedure. You raise a third possible comparison: calculations, which IIUC would be expected to be included in the paper.

There are some kinds of code (a script that controls a sensor or an actuator) where I think that doesn't match up well at all. There are plenty of kinds of code where they are, in fact, simply crunching numbers produced earlier. For the latter, I'm honestly not sure the best way to treat it, except to say that we should be sure that enough information is included in some form that replication should be possible, and that we keep in mind the idea that replication should involve human interaction.


This is not clear at all. It depends on the "result" in question. If I wrote a paper describing a superior numerical algorithm for inverting matrices, and no one is able to replicate the superior performance of my algorithm despite following the recipe in my paper, then whether they can run my code or not doesn't seem to be of the highest priority.

Edit: more careful phrasing.


> whether they can run my code or not doesn't seem to be of the highest priority.

On the contrary; in that case there are four possibilities:

a: your algorithm doesn't work at all, and your observations are an artifact of convenient inputs or inept measurements.

b: your algorithm works, but the description in the paper is wrong or incomplete

c: your algorithm works as described, but the replicator implemented it incorrectly

d: other

Having the original implementation code is necessary to distinguish between cases a and b versus case c, and if the former, the code for the test harness is likely to help distinguish a versus b. (Case d is of course still a problem, but that doesn't mean it's reasonable to just give up.)


I agree with the case analysis, but disagree with the implication that the code needs to be runnable (which seems to be the point of the discussion at hand). In many cases having the source code, even if it no longer runs, should be sufficient.


I completely agree with your case analysis, but disagree with the conclusion that the code needs to be runnable for it to be useful -- I thought this was the point of the discussion at hand? In most situations, having the source code, even if it no longer runs, would be sufficient to conduct the analysis you describe.

I'm all for more transparency, and this includes making codes and data public as much as is reasonable. But the real test is if someone can independently replicate the result, and how to incentivize replication studies (in both computational and experimental science) is also important, and in my view should not be divorced from discussions of reproducibility.

Edit: rewrote to clarify my position.


I would also place as a requirement that the code be comprehensible to someone familiar with the domain - i.e., a "peer".


I do agree with you on publishing seeds for Monte Carlo simulations; however, the argument against it is also very strong. Usually when you run a Monte Carlo simulation you are quoting the results in terms of statistics. I think it would be sufficient to say that you can 'reproduce' the results as long as your statistics (over many simulations with different seeds) are consistent with the published results. If you run a single simulation with a particular seed you should get the same results, but this might be cherry-picking a particular simulation result. That is good for code testing but probably not for scientific results. I think running the code with new seeds is a better way to test the science.


As an ex-scientist who used to run lots of simulations, I really fail to see a compelling reason why most numerical results (for publication purposes) truly need to come with published (and supported) deterministic seeding.

We've certainly done a lot, scientifically speaking (in terms of post-validated studies), without that level of reproducibility.


If nothing else, it helps debugging code which tries to reproduce your findings.


The code I work with is not debuggable in that way under most circumstances. It's a complex distributed system. You don't attempt to debug it by being deterministic; you debug it by sampling its properties.


> there is no such thing as truly random number generation on contemporary computers

well that's just not true. there's no shortage of noise we can sample to get true random numbers. we just often stretch the random numbers for performance purposes.


There is a shortage of truly random easily sampled noise though.


You can run the script multiple times and get a statistical representation of what the results should be. That's the point of science.

This reminds me of being in grad school and the comp-sci people complaining that we don't get bit-wise equal floats when we solve DEs.

Having to re-implement a library from scratch for a project is much more valuable than running the same code in two places. The same way that getting the same results from two different machines is a lot more significant than getting the same result from two cloned machines.

In short: code does not need to be reproducible because scientists know how to average.


> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes

This is definitely not correct. The experiment as a whole needs to be reproducible independently. This is very different from, and more robust than, requiring that a particular portion of a previous version of the experiment be reproducible in isolation.


> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, [...]

Reproducibility in the scientific sense is different from running the same program with the same input and getting exactly the same result. Reproducibility means that if you repeat the measurements in another environment, getting somewhat different data, and apply the same theory and methods, you get to the same conclusion.

The property of a computer program that when you run it again with the same input, you get the same output, is nice and very helpful for debugging. But the fact that you can run the same program does not mean that it is bug-free, as much as the fact that you can copy a paper with a mathematical proof does not mean that the proof is correct.

Also, multi-threaded and parallel code is inherently non-deterministic.

> when there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation,

That is wrong. Linux, for example, uses latency measurements from drivers, such as hard-drive seek latencies or keyboard timings, to generate entropy. While it might not be the best thing to rely on for purposes of cryptography, it is surely not deterministic. If it mattered, you could download real-time astronomical noise measurements and use them to seed your Mersenne Twister generator.


As a software developer, I think maybe you misunderstand scientific reproducibility.

Other scientists should be building their own apparatus, writing and running their own code. That the experiment is actually different is what validates the hypothesis, which specifies the salient conditions leading to the outcome.

That an identical experiment leads to an identical outcome fails to validate the hypothesis, because the causal factors may have been misidentified.

Precise reproducibility still matters if generalized reproduction fails, however, because the differences in the experimental implementation may lead to new and more accurate hypotheses about causality.


> Other scientists should be building their own apparatus, writing and running their own code. That the experiment is actually different is what validates the hypothesis, which specifies the salient conditions leading to the outcome.

The problem here is a step before this: The results of these "identical" experiments are so wildly different there is nothing valid to propose, let alone for a recreation to compare against.


> That an identical experiment leads to an identical outcome [..]

So what happens if/when an identical experiment fails to lead to an identical outcome?


Controlling randomness can be extremely difficult to get right, especially when there's anything asynchronous about the code (e.g. multiple worker threads populating a queue to load data). In machine learning, some of the most popular frameworks (e.g. TensorFlow [0]) don't offer this as a feature, and in other frameworks that do (PyTorch [1]) it will cripple the speed you get, as GPU accelerators rely on non-deterministic accumulation for reasonable speed.
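
For what it's worth, here is roughly what the "best effort" knobs look like in PyTorch (a sketch based on the randomness notes linked at [1]; even with all of this, some ops and asynchronous data loading can still differ run to run):

    # Sketch of "best effort" determinism in PyTorch; see the randomness notes
    # linked below for the caveats.
    import random
    import numpy as np
    import torch

    SEED = 0
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)                    # seeds CPU and CUDA generators
    torch.use_deterministic_algorithms(True)   # error out on non-deterministic ops
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
    # (per the linked notes, some CUDA ops additionally need the
    # CUBLAS_WORKSPACE_CONFIG environment variable set)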

Scientific reproducibility does not mean, and has never meant, you rerun the code and the output perfectly matches bit-for-bit every time. If you can achieve that, great -- it's certainly a useful property to have for debugging. But a much stronger and more relevant form of reproducibility for actually advancing science is running the same study e.g. on different groups of participants (or in computer science / applied math/stats / etc., with different codebases, with different model variants/hyperparameters, on different datasets) and the overall conclusions hold.

To paraphrase a comment I saw from another thread on HN: "Plenty of good science got done before modern devops came to be."

[0] https://github.com/tensorflow/tensorflow/issues/12871 https://github.com/tensorflow/tensorflow/issues/18096

[1] https://pytorch.org/docs/stable/notes/randomness.html

==========

EDIT to reply to solatic's replies below (I'm being rate-limited):

The social science arguments are probably fair (or at least I'll leave it to someone more knowledgeable to defend them if they want) -- perhaps I shouldn't have led with the example of "different groups of participants".

> If you can achieve that, for the area of study in which you conduct your experiment, it should be required. Deciding to forego formal reproducibility should be justified with a clear explanation as to why reproducibility is infeasible for your experiment, and peer-review should reject studies that could have been reproducible but weren't in practice.

This might be a reasonable thing to enforce if everyone in the field were using the same computing platform. Given that they're not (and that telling everyone that all published results have to be done using AWS with this particular machine configuration is not a tenable solution) I don't see how this could ever be a realistic requirement. Or if you don't want to enforce that the results remain identical across different platforms, what's the point of the requirement in the first place? How would it be enforced if nobody else has the exact combination of hardware/software to do so? And then even if someone does, almost inevitably there'll be some detail of the setup that the researcher didn't think to report and results will differ slightly anyway.

Besides, if you're allowing for exemptions, just about every paper in machine learning studying datasets larger than MNIST (where asynchronous prefetching of data is pretty much required to achieve decent speeds) would have a good reason to be exempt. It's possible that there are other fields where this sort of requirement would be both useful and feasible for a large amount of the research in that field, but I don't know what they are.

> Also, reading through the issues you linked points to: https://github.com/NVIDIA/framework-determinism which is a relatively recent attempt by nVidia to support deterministic computation for TensorFlow. Not perfect yet, but the effort is going there.

(From your other comment.) Yes, there exists a $300B company with an ongoing-but-incomplete funded effort of so far >6 months' work (and that's just the part they've done in public) to make one of its own APIs optionally deterministic when it's being used through a single downstream client framework. If this isn't a perfect illustration that it's not realistic to expect exact determinism from software written by individual grad students studying chemistry, I'm not sure what to say.


You're right about bit-for-bit reproducibility possibly being overkill, but I don't think that invalidates the parent's point that Monte Carlo randomization doesn't obviate reproducibility concerns. It just means that e.g. your results shouldn't be hypersensitive to the details of the randomization. That is, reviewers should be able to take your code, feed it different random data from a similar distribution to what you claimed to use (perhaps by choosing a different seed), and get substantively similar results.


That brings up a separate issue that I didn't comment on above: the expectation that the code runs in a completely different development/execution environment (e.g. the one the reviewer is using vs. the one that the researcher used). That means making it run regardless of the OS (Windows/OSX/Linux/...) and hardware (CPU/GPU/TPU, and even within those, which one) the reviewer is using. This would be an extremely difficult if not impossible thing for even a professional software engineer to achieve. It could easily be a full time job. There are daily issues on even the most well-funded projects in machine learning by huge companies (ex: TF, PyTorch) that the latest update doesn't work on GPU X or CUDA version Y or OS Z. It's not a realistic expectation for a researcher even in computer science, let alone researchers in other fields, most of whom are already at the top of the game programming-wise if they would even think to reach for a "script" to automate repetitive data entry tasks etc.

==========

EDIT to reply to BadInformatics' reply below (I'm being rate-limited): I fully agree that a lot of ML code releases could be better about this, and it's even reasonable to expect them to do some of these more basic things like you mention. I don't agree that bit-for-bit reproducibility is a realistic standard that will get us there.


I don't think that removes the need to provide enough detail to replicate the original environment though. We write one-off scripts with no expectation that they will see outside usage, whereas research publications are meant for just that! The bar isn't terribly high either: for ML, a requirements.txt + OS version + CUDA version would go a long way, no need to learn docker just for this.
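
As a sketch of how little effort that takes (hypothetical file name, and assuming a Python project; the torch import is only an example):

    # Sketch: dump enough environment info alongside the results to make the
    # run re-creatable later (package versions, OS, CUDA version if present).
    import platform, subprocess, sys

    with open("environment.txt", "w") as f:
        f.write(f"python: {sys.version}\n")
        f.write(f"os: {platform.platform()}\n")
        try:
            import torch   # only relevant if the project uses it
            f.write(f"cuda: {torch.version.cuda}\n")
        except ImportError:
            pass
        # pip freeze gives a requirements.txt-style snapshot of installed packages
        freeze = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                                capture_output=True, text=True)
        f.write(freeze.stdout)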


Have you tried running a specific CUDA version from 10 years ago?

Because I have, and I pity anyone who tries to build a kernel that can run it.


It does seem like a valid response to OP's objection to the Imperial College's COVID model, though. Doesn't it?


Reviewing the original comment, I think so (that is, that the original comment is overcritical). For purposes of reproducibility, it's enough that you can validate that you can run the model with different random data and see that their results aren't due to pathological choices of initial conditions. If the race conditions and non-determinism just transform the random data into another set of valid random data, that doesn't compromise reproducibility.


> or in computer science / applied math/stats / etc., with different codebases, with different model variants, on different datasets) and the overall conclusions hold

A lot of open sourced CS research is not reproducible.

"the code still runs and gives the same output" is not the same as reproducibility.


> A lot of open sourced CS research is not reproducible.

I'm not sure if this was meant to be a counter-argument to me, but I completely agree!

> "the code still runs and gives the same output" is not the same as reproducibility.

Yes, bit-for-bit identical results are neither necessary nor sufficient for reproducibility in the usual scientific sense.


> I'm not sure if this was meant to be a counter-argument to me

It wasn't :)


> But a much stronger and more relevant form of reproducibility for actually advancing science is running the same study e.g. on different groups of participants (or in computer science / applied math/stats / etc., with different codebases, with different model variants/hyperparameters, on different datasets) and the overall conclusions hold

> Plenty of good science got done before modern devops came to be

This isn't as strong of an argument as you think. This is more-or-less the underlying foundation behind the social sciences, which argues that no social sampling can ever be entirely reproduced since no two people are alike, and even the same person cannot be reliably sampled twice as people change with time.

Has there been "good science" done in the social sciences? Sure. I don't think that you're going to find anybody arguing that the state of the social sciences today is about the same as it was in the Dark Ages.

With that said, one of the reasons why so many laypeople look at the social sciences as a kind of joke is because so many contradictory studies come out of these peer-reviewed journals that their trustworthiness is quite low. One of the reasons why there's so much confusion surrounding what constitutes a healthy diet and how people should best attempt to lose weight is precisely because diet-and-exercise studies are more-or-less impossible to reproduce.

> If you can achieve that, great -- it's certainly a useful property to have for debugging

If you can achieve that, for the area of study in which you conduct your experiment, it should be required. Deciding to forego formal reproducibility should be justified with a clear explanation as to why reproducibility is infeasible for your experiment, and peer-review should reject studies that could have been reproducible but weren't in practice.


Plenty of good physics got done before modern devops came to be, too! Maybe the pace of advancement was slower when the best practice was to publish a cryptographic hash of your discoveries in the form of a poetic latin anagram rather than just straight-up saying it, but it's not like Hooke's law is considered unreproducible today because you can't deterministically re-instantiate his experimental setup with a centuries-old piece of brass and get the same result to n significant figures.


And physicists have been writing code for a while, simply because the number of software engineers who have a working knowledge of physics (as in, ready for research), have been trained in numerical analysis (as in, able to read applied mathematics), and are then willing to help you with your paper for peanuts is about zero.

I don't understand why it is so hard to see that for this line of work you need either a pretty big collaboration, where somebody else has isolated the specifications so that you don't really need to know anything about the problem your code solves, or to become a physics graduate student yourself.


I wouldn't argue for it, but I would be extremely reluctant to argue against the assertion that the state of the social 'sciences' today is about the same as it was in the Dark Ages. To the extent that any "good science" gets done in the social 'sciences', it is entirely in spite of the entire (social-'science'-specific) underlying foundation thereof. If your results aren't reproducible with (the overwhelming majority of[0]) other samples collected based on the same criteria, your results aren't.

0: specifically, for results claiming a 95% confidence level (p<0.05), if nineteen replications are attempted, you should encounter a replication failure roughly once. I would accept perhaps four or five out of nineteen (or one out of two or three) under the reasoning that the law of large numbers hasn't kicked in yet, but anything with zero successful replications is not science, it's evidence (in this case, against the entire field of study).


Also, reading through the issues you linked points to: https://github.com/NVIDIA/framework-determinism which is a relatively recent attempt by nVidia to support deterministic computation for TensorFlow. Not perfect yet, but the effort is going there.


The correct way to control randomness in scientific code is to have the RNG seeded via a flag and have the result checked against a snapshot value. Almost no one does this, but that doesn't mean it shouldn't be done.
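
Something like this, presumably (a toy golden-file sketch; in real code the seed would come from a command-line flag, and the "model" here is just a stand-in):

    # Toy regression check: run the model with a fixed seed and compare the
    # output against a snapshot recorded from a trusted earlier run.
    import os
    import numpy as np

    SNAPSHOT_FILE = "snapshot_seed42.txt"      # hypothetical golden file

    def model(seed):
        rng = np.random.default_rng(seed)
        return rng.random(1_000_000).mean()    # stand-in for the real computation

    def test_seeded_run_matches_snapshot():
        result = float(model(seed=42))
        if not os.path.exists(SNAPSHOT_FILE):  # first trusted run records the value
            with open(SNAPSHOT_FILE, "w") as f:
                f.write(repr(result))
        snapshot = float(open(SNAPSHOT_FILE).read())
        assert result == snapshot              # a seeded single-threaded run is bit-identical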


This is not correct on several levels. Reproducibility is not achievable in many real world scenarios, but worse it's not even very informative.

Contra your assertion, many people do some sort of regression testing like this, but it isn't terribly useful for verification or validation - though it is good at catching bad patches.


Did you read my post? I know what a seed is. Setting one is typically not enough to ensure bit-for-bit identical results in high-performance code. I gave two examples of this: CUDA GPUs (which do non-deterministic accumulation) and asynchronous threads (which won't always run operations in the same order).


Most scientific runs are scaled out, where you run multiple replicates. And not all scientific runs are high-performance in the HPC sense. Even if your code is HPC in that sense, and requires CUDA and 40,000 cores, you should consider creating a release flag where an end user can do at least a single "slow" run on a CPU, on a reduced dataset, in single-threaded mode, to sanity-check the results and at least verify that the computational and algorithmic pipeline is sound at the most basic level.
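
A sketch of what such a flag might look like (hypothetical names, argparse-style; the threading variable has to be set before the numeric libraries are imported):

    # Hypothetical --sanity-check flag: fixed seed, single thread, tiny input,
    # so anyone can re-run the pipeline end to end on a laptop CPU.
    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--sanity-check", action="store_true",
                        help="small, single-threaded, seeded run for verification")
    args = parser.parse_args()

    if args.sanity_check:
        os.environ["OMP_NUM_THREADS"] = "1"   # set before importing numpy/BLAS
        seed, n_samples = 0, 1_000            # fixed seed, reduced dataset
    else:
        seed, n_samples = None, 10_000_000    # full-scale run

    # ... the rest of the pipeline uses `seed` and `n_samples` ...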

I used to be a scientist. I get it, getting scientists to do this is like pulling teeth, but it's the least you could do to give other people confidence in your results.


> consider creating a release flag where an end user can do at least a single "slow" run on a CPU, on a reduced dataset, in single-threaded mode, to sanity-check the results and at least verify that the computational and algorithmic pipeline is sound at the most basic level.

Ok, that's a reasonable ask :) But yeah as you implied, good luck getting the average scientist, who in the best case begrudgingly uses version control, to care enough to do this.


Monte-Carlo can and should be deterministic and repeatable. It’s a matter of correctly initializing your random number generators and providing a known/same random seed from run to run. If you aren’t doing that, you aren’t running your Monte-Carlo correctly. That’s a huge red flag.

Scientists need to get over this fear about their code. They need to produce better code and need to actually start educating their students on how to write and produce code. For too long many in the physics community have trivialized programming and seen it as assumed knowledge.

Having open code will allow you to become better and you’ll produce better results.

Side note: 25 years ago I worked in accelerator science too.


Hello fellow accelerator physicist!

Yes I understand how seeding PRNGs work and I personally do that for my own code for debugging purposes. My point was that not using a fixed seed doesn't invalidate their result. It's just a cheap shot and, to me, demonstrates that the lockdownskeptics author doesn't have a real understanding of the methods being used.

Also, to be clear, I support open science and have some of my own open-source projects out in the wild (which is not the norm in my own field yet). I'm not arguing against releasing code, I'm arguing against OP arguing against this particular piece of code.


Indeed it was a cheap shot; the code does give reproducible results: https://www.nature.com/articles/d41586-020-01685-y

The main issue is whether it used sensible inputs, but that's entirely different from code quality and requires subject matter expertise, so programmers don't bother with such details -_-


I write M-H samplers for a living. While I agree that being able to rerun a chain using the same seed as before is crucial for debugging, and while I'm very strongly in favour of publishing the code used for a production analysis, I'm generally opposed to publishing the corresponding RNG seeds. If you need the seeds to reproduce my results, then the results aren't worth the PDF they're printed on. [edit: typo]


> Monte-Carlo can and should be deterministic and repeatable

I guess it can be made so, but not necessarily easily or cheaply (if it's parallel, and sensitive to floating-point rounding). And that sounds like the kind of engineering effort GP is saying isn't worth it. Re-running exactly the same Monte Carlo chain does tell you something, but is perhaps the wrong level to be checking. Re-running from a different seed, and getting results that are within error, might be much more useful.


I guess the best thing would be for it to use a different random seed every time it's run (so that, when re-running the code, you'll see similar results, which verifies that the result is not sensitive to the seed), while noting the particular seed that produced the particular results published in a paper.
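
Concretely, that could look something like this (a toy numpy sketch):

    # Toy sketch: draw a fresh seed every run, but print/log it so the exact run
    # behind a published figure can still be replayed on demand.
    from numpy.random import SeedSequence, default_rng

    ss = SeedSequence()                    # fresh entropy on every invocation
    print("seed entropy:", ss.entropy)     # record this next to the results
    rng = default_rng(ss)

    # ... the simulation uses `rng` ...
    # To replay a published run: rng = default_rng(SeedSequence(<logged entropy>))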

But still, for code running on different machines, especially for numeric-heavy code that might be running on a particular GPU setup, distributed big data source (where you pull the first available data rather than read in a fixed order), or even on some special supercomputer, it's hard to ask that it be totally reproducible down to the smallest rounding error.


Then you need to re-imagine the system in such a way that junior scientific programmers (i.e. Grad Students) can at least imagine having enough job security for code maintainability to matter, and for PIs to invest in their students' knowledge with a horizon longer than a couple person-years.


> Monte-Carlo can and should be deterministic and repeatable.

That's a nitpick, but if the computation is executed in parallel threads (e.g. on multicore, or on a multicomputer), and individual terms are, for example, summed in a random order, caused by the non-determinism introduced by the parallel computation, then the result is not strictly deterministic. This is a property of floating-point computation, more specifically, the finite accuracy of real floating-point implementations.

So, it is not deterministic, but that should not cause large qualitative differences.
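
A toy illustration of the floating-point point, with no parallelism at all, just the same terms summed in a different order:

    # Floating-point addition is not associative, so summing the *same* terms in
    # a different order (as parallel reductions do) can change the last bits.
    import random

    xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

    total_in_order = sum(xs)
    shuffled = xs[:]
    random.shuffle(shuffled)
    total_shuffled = sum(shuffled)

    print(total_in_order, total_shuffled, total_in_order == total_shuffled)
    # The two sums agree to many digits, but usually not bit-for-bit.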


> Monte-Carlo can and should be deterministic and repeatable. It’s a matter of correctly initializing your random number generators and providing a known/same random seed from run to run.

Perhaps, if you use only single-threaded computation, you are interested in averages, and the processes you are interested in behave well and are mostly linear.

But

- running code in parallel easily introduces non-determinism, even if your result computation is as simple as summing up results from different threads

- the processes one is examining might be highly non-linear - like lightning, weather forecasts, simulation of wildfires, and also epidemic simulations

- especially for all kind of safety research, you might actually be interested not only in averages, but in freak events, like "what is the likelihood that you have two or three hurricanes at the same time in the Gulf of Mexico", or "what happens if your nuclear plant gets struck by freak lightning in the first second of a power failure".

What should be reproducible are the conclusions you come to, not the hashed bits of program output.

> If you aren’t doing that, you aren’t running your Monte-Carlo correctly. That’s a huge red flag.

No, it does not follow from that.


Speaking as someone with a bit of experience in this area: quasi-Monte Carlo methods also work quite well and ensure deterministic results. They're not applicable to all situations though.
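
For example, a sketch using SciPy's qmc module (an unscrambled Sobol sequence is a fixed point set, so the estimate is exactly repeatable):

    # Sketch: quasi-Monte Carlo estimate of pi with an unscrambled Sobol
    # sequence -- the points are fixed, so the result is the same on every run.
    from scipy.stats import qmc

    sampler = qmc.Sobol(d=2, scramble=False)   # deterministic low-discrepancy points
    pts = sampler.random_base2(m=16)           # 2**16 points in the unit square
    inside = (pts ** 2).sum(axis=1) < 1.0
    print(4.0 * inside.mean())                 # identical value on every run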


Doesn't it concern you that it would be possible for critics to look at your scientific software and find mistakes (some of which the OP mentioned are not "minor") so easily?

Given that such software forms the very foundation of the results of such papers, why shouldn't it fall under scrutiny, even for "minor" points? If you are unable to produce good technical content, why are you qualified to declare what is or isn't minor? Isn't the whole point that scrutiny is best left to technical experts (and not subject experts)?


> Doesn't it concern you that it would be possible for critics to look at your scientific software and find mistakes (some of which the OP mentioned are not "minor") so easily?

A non-native English speaker may make grammatical mistakes when communicating their research in English—it does not in any way invalidate their results or hint that there is anything amiss. It is simply what happens when you are a non-native speaker.

Some (many?) code critiques by people unfamiliar with the research's field of study will be about superficial mistakes that do not invalidate the results. They are the code equivalents of grammatical mistakes. That's what the OP is talking about.


Journals employ copy editors to address just those sorts of mistakes; why should we not hold software to the same standard as academic language? But more importantly, these software best practices aren't mere "grammatical mistakes": they exist because well-organized, well-tested code has fewer bugs and is easier for third parties to verify.

Third-parties validating that the code underlying an academic paper executes as expected is no different than third-parties replicating the results of a physical experiment. You can be damn sure that an experimental methodology error invalidates a paper, and you can be damn sure that bad documentation of the methodology dramatically reduces the value/reliability of the paper. Code is no different. It's just been the wild west because it is a relatively new and immature field, so most academics have never been taught coding as a discipline nor been held to rigorous standards in their own work.

Is it annoying that they now have to learn how to use these tools properly? I'm sure it is. That doesn't mean it isn't a standard we should aim for, nor that we shouldn't teach the relevant skills to current students in the sciences so that they are better prepared when they become researchers themselves.


> Third-parties validating that the code underlying an academic paper executes as expected is no different than third-parties replicating the results of a physical experiment.

First, it's not no different--it's completely different. Third parties have always constructed their own apparatus to reproduce an experiment. They don't go to the original author's lab to perform the experiment!

Second, a lot of scientific code won't run at all outside the environment it was developed in.

If it's HPC code, it's very likely that the code makes assumptions about the HPC cluster that will cause it to break on a different cluster. If it's experiment control / data-acquisition code, you'll almost certainly need the exact same peripherals for the program to do anything at all sensible.

I see a lot of people here on HN vastly over-estimating the value of bit-for-bit reproducibility of one implementation, and vastly underestimating the value of having a diversity of implementations to test an idea.


I’m glad someone else feels this way. It’s an expectation that scientists can share their work with other scientists using language. Scientists aren’t always the best writers, but there are standards there. Writing good code is a form of communication. It baffles me that there are absolutely no standards there.


I agree with your overall point, but I just want to point out that many (most?) journals don't employ copy-editors, or if they do, then they overlook many errors, especially in the methods section of papers.


On the contrary: If I'm (in industry) doing a code review and see simple, obvious mistakes like infinite loops, obvious null pointer exceptions, ignored compiler warnings, etc., in my mind it casts a good deal of doubt over the entire code. If the author is so careless with these obvious errors, what else is he/she being careless about?

Same with grammatical or spelling errors. I don't review research but I do review resumes, and I've seen atrocious spelling on resumes. Here's the candidate's first chance to make an impression. They have all the time in the world to proofread, hone, and have other eyes edit it. Yet, they still miss obvious mistakes. If hired, will their work product also be sloppy?


This sort of scrutiny only matters once someone else has a totally different code that gives incompatible results. Before that point there's no sense in looking for bugs, because all you're proving is that there are no obvious mistakes: that says nothing about the interesting questions, since you only bother writing code for things with non-obvious answers.


When you say OP, do you mean djsumdog? If so, what mistakes does he mention that aren't minor?


How is it possible to know the difference between minor and major, if the mistakes are kept secret?

If we're supposed to accept scientific results on faith, why bother with science at all?


> exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

If code is what is substantiating a scientific claim, then code needs to stand up to scientific scrutiny. This is how science is done.

I came from physics, but systems and computer engineering was always an interest of mine, even before physics. I thought it was kooky-dooks that CS people can release papers w/o code; fine if the paper contains all the proofs, but otherwise it shouldn't even be looked at. PoS (proof-of-science) or GTFO.

We are at the point in human and scientific civilization where knowledge needs to prove itself correct. Papers should be self-contained execution environments that generate PDFs and the resulting datasets. The code doesn't need to be pretty, or robust, but it needs to be sealed inside of a container so that it can be re-run and re-validated, and someone else can confirm the result X years from now. And it isn't about trusting or not trusting the researcher; we need to fundamentally trust the results.


The history of physics is full of complex, one-off custom hardware. Reviewers have not been expected to take the full technical specs and actually build and run the exact same hardware, just to verify correctness for publication.

I doubt any physicist believes we need to get the Tevatron running again just to check decade-old measurements of the top quark. I don't understand why decade-old scientific software code must meet that bar.


They didn't rebuild the Tevatron but were still able to rediscover the top within a different experimental environment (i.e. the LHC with tons of different discovery channels) and have lots of fits for its properties from indirect measurements (LEP, Belle). Physics is not an exact science. If you have only one measurement (no matter if it's software- or hardware-based), no serious physicist would fully trust the result as long as it wasn't confirmed by an independent research group (by doing more than just rebuilding/copying the initial experiment, but maybe using slightly different approximations or different models/techniques). I'm not so much into computer science, but I guess here it might be a bit different once a proof is based on rigorous math. However, even if so, I guess it's sometimes questionable whether the proof is applicable to real-world systems, and then one might be in a similar situation.

Anyways, in physics we always require several experimental proofs for a theory. There are also several "software experiments" for e.g. predicting the same observables. Therefore, researchers need to be able to compile and run the code of their competitors in order to compare and verify the results in detail. Here, bug-hunting/fixing is sometimes also taking place - of course. So applying the article's suggestions would have the potential to accelerate scientific collaboration.

Btw, I know some people who still work with the data taken at the LEP experiment, which was shut down almost 20 (!) years ago, and they have a hard time combining old detector simulations, Monte Carlos, etc. with new data-analysis techniques, for the exact same reasons mentioned in the article. For large-scale experiments it is a serious problem which nowadays gets much more attention than in the LEP era, since the LHC anyway has obvious big-data problems to solve before its next upgrade, including software solutions.


If you could have spun up a Tevatron at will for $10, would the culture be the same today?

I suspect that software really is different in this way, and treating it like it's complex, one off hardware is cultural inertia that's going to fade away.


All of my 2010 scientific code runs on the then-current edition of Docker. /s


I made no mention of Docker, VMs or any virtualization system. Those would be an implementation detail and would obviously change over time.

A container can be a .tar.gz, a zip or a disk image of artifacts, code, data and downstream deps. The generic word has been co-opted to mean a specific thing which is very unfortunate.


My point, which I guess I did not make clearly enough, is that container systems don't necessarily exist or remain supported over the ten-year period being discussed. The idea of ironing over long-term compatibility issues using a container environment seems like a great one! (For the record, .tgz -- the "standard" format for scientific code releases in 2010 -- does not solve these problems at all.)

But the "implementation detail" of which container format you use, and whether it will still be supported in 10 years, is not an implementation detail at all -- since this will determine whether containerization actually solves the problem of helping your code run a decade later. This gets worse as the number, complexity and of container formats expands.

Of course if what you mean is that researchers should provide perpetual maintenance for their older code packages, moving them from one obsolete platform to a more recent one, then you're making a totally different and very expensive suggestion.


Of course, of course. I am not trying to boil the ocean here, or we would have a VM like wasm and an execution env like wasi and run all our 1000-year code inside of that.

The first step is just having your code, data and deps in an archive. Depending on the project and its age, more stuff makes it into the archive. I have been on projects where the source to the compiler toolchain was checked into the source repo and the first step was to bootstrap the tooling (from a compiler binary checked into the repo).

We aren't even to the .tar.gz stage yet.


> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

Specifically, to that point, I want to cite the saying:

"The dogs bark, but the caravan passes."

(There is a more colorful German variant which is, translated: "What does it bother the mighty old oak tree if a dog takes a piss...").

Of course, if you publish your code, you expose it to critics. Some of the criticism will be unqualified. And as we have seen in the case of e.g. climate scientists, some might even be nasty. But who cares? What matters is open discussion, which is a core value of science.


That's not how the game is played. If you cannot release the code because the code is too ugly or untested or has bugs, how do you expect anyone with the right expertise to assess your findings?

It reminds me of Kerckhoffs's principle in cryptography, which states: A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.


In GIS, there's a saying "the map is not the terrain". It seems like HN is in a little SWE bubble, and needs to understand "the code is not the science".

In science, code is not an end in-and-of-itself. It is a tool for simulation, data reduction, calculation, etc. It is a way to test scientific ideas.

> how do you expect anyone with the right expertise to assess your findings

I would expect other experts in the field to write their own implementation of the scientific ideas expressed in a paper. If the idea has any merit, their implementations should produce similar results. Which is exactly what they would do if it were a physical experiment.


No one is saying that code is the science.

If I'm given bad information and I act on that information, then problems can occur.

Similarly, if the software is giving the scientist bad information, problems can occur.

How many more stories do we have to read about some research getting published in a journal only to have to retract it down the road because they had a bug in the software before we start asking if maybe there needs to be more rigor in the software portion of the research as well?

There was a story on HN a while back about a professor who had written software, had come to some conclusions, and even had a Ph.D. student working on research based on that work. Only to find out that a software flaw meant the conclusions weren't useful to anyone and that student ended up wasting years of their life.

---

This stuff matters. This isn't a model of reality, it's an exploration of reality. It would be like telling a hiker that terrain doesn't matter. They would, rightfully, disagree with you.


> How many more stories do we have to read about some research getting published in a journal only to have to retract it down the road because they had a bug in the software before we start asking if maybe there needs to be more rigor in the software

We will always hear stories like that, as we will always hear stories about major bugs in stable software releases. Asking a scientist to do better than whole teams of software engineers makes little sense to me.

Of course, a bug that was introduced or kept with the conscious intention of fooling the reviewers and the readers is another story.


> Asking a scientist to do better than whole teams of software engineers makes little sense to me.

This is not what is being asked, shame on you for the strawman.

Your entire post can be summed up with the following sentence: "if we can't be perfect then we may as well not try to be better".


I was reacting to the part of your post I quoted.

The thing is that it has little to do with rigor -- or, if I may sin again, it is equivalent to saying that software developers lack rigor: sure, some of them do (as some scientists do), but even among the most significant and severe bugs in the history of software, it is seldom the case that we can tell "right, definitely the guy who wrote that lacked rigor and seriousness".

Of course this is not a blank forgiveness for every bad scientist out there. Of course we should aim at getting better. But we should make the difference between the ideal science process and the science as performed by a human, prone to errors, misunderstandings and mistakes, and realize that these things will always happen, however many times we call for "more rigor because bugs have consequences".


All you did was restate the argument that I've already rejected.

And stop comparing scientists to software developers, it's a hidden argument by authority, and it isn't needed.


I don't even understand the point you are making then (apart from me apparently arguing solely with sophisms, which is kind of a feat).

How this "rigor" you are calling for should manifest, then? Put bluntly, my point was that every software has bug, so how "more rigor" would help? What should we do, what should we ask for in _practical_ terms?

Also, please do not rephrase this last sentence as "oh so since every software has bugs, then you obviously say that we shouldn't fix bugs, anyway other bugs will remain!".


> Also, please do not rephrase this last sentence as "oh so since every software has bugs, then you obviously say that we shouldn't fix bugs, anyway other bugs will remain!".

That's exactly what I'm going to do. Point out that we can demand better even in the face of a lack of perfection.

There are two problems here with your stance.

1. The assumption that all bugs are created equal, and 2. The assumption that the truth isn't the overriding concern of science.

It's real easy to define the set of bugs that are unacceptable in science. Any bug that would render the results inaccurate is unacceptable.

The fact that some jackass web developer wrote a bug that deleted an entire database in no way obviates that responsibility of the scientists.


I don't entirely disagree, but haven't there also been cases of experimental results being invalidated due to subtle mechanical, electrical, chemical, etc complications with the test equipment, when none of the people involved in the experiment were experts in those fields?

I think that, while we could use a bit more training in software engineering best practices in the sciences, the thesis is still that science is hard and we need real replication of everything before reaching important conclusions, and over-focusing on one specific type of error isn't all that helpful.


If they're setting up experiments whose correct results require electrical expertise, then yes, they should either get better training or bring in someone who has it.

It's not clear to me why you think I would argue that inaccuracies should be avoided in software but accept that they're ok for electrical systems.


> In GIS, there's a saying "the map is not the terrain". It seems like HN is in a little SWE bubble, and needs to understand "the code is not the science".

And if you're a map maker, it's a bit rich to start claiming that the accuracy of your maps is unimportant. If code is "a way to test scientific ideas", then it kinda needs to work if you want meaningful results. Would you run an experiment with thermometers that were accurate to +-30° and reactants from a source known for contamination?


In many parts of scientific research, researchers are, to stay in your metaphor, more travelers using a map, than map makers.

Of course, there is a difference between making a clinical study of drugs and using a pocket calculator to compute a mean, doing research in numerical analysis, and presenting a paper on how to use Coq to more efficiently prove the four-color theorem or Fermat's last theorem.

In short, much of science is not computer science, and for it, computation is just a tool.


Mathematicians are expected to publish their proofs. Not so that people can do the proof again independently, but so that other mathematicians can find and point out if they have a tangible error in their proof that tangibly invalidates the result.

Sure, some people might point out spurious bugs and "design issues" or whatever, boo hoo. But others might actually find flaws in the code that meaningfully affect science itself: true bugs.

Sure, they could do this by doing a full replication in a lab and then custom coding everything from scratch. But even then, all you have is two conflicting results, with no good way yet to determine which one is more right or why they disagree. Technically, you can use scientific progress to eventually find bugs in the scientific process, but why waste so much time when publishing the code will allow reviews to find bugs so much faster? It's a pure benefit to science not to obscure its proofs and rigor.


If you’re saying you produced certain results with code, then the code is indeed the science. Not being able to vouch for the code is like believing a mathematical theorem without seeing the proof.


You are missing the point.

How many actually try to reproduce the results by writing corresponding code themselves? Apparently a lot of papers with slightly wrong findings due to code errors have passed peer review (all of us in the SWE bubble know how often bugs occur), at least in less prestigious journals.

There is nothing wrong with mandating that the code be supplied with the paper, because much of the time the code sits somewhere between the experimental setup and the proof/result.


The findings really should be independent of the code. Reproduction should occur by taking the methodology and re-implementing the software and running new experiments.


That's exactly the philosophy we follow e.g. in particle physics, and it's a common excuse to dismiss all the guidelines in the article. However, this kind of validation/falsification is often done between different research groups (maybe using different but formally equivalent approaches), while people within the same group have to deal with the 10-year-old code base.

I myself had a very bad experience with extending the undocumented Fortran 77 code (lots of gotos and common blocks) of my supervisor. Finally, I decided to rewrite the whole thing, including my new results, instead of just somehow embedding my results into the old code, for two reasons: (1) I'm presumably faster rewriting the whole thing including my new research than struggling with the old code, and (2) I simply would not trust the numerical results/phenomenology produced by the old code. After all, I'm wasting 2 months of my PhD on the marriage of my own results with known results, which -in principle- could have been done within one day if the code base allowed for it.

So yes, if it's a one-man show I would not put too much weight on code quality (though unit tests and git can save quite a lot of time during development), but if there is a chance that someone else is going to touch the code in the near future, better code will save your colleagues time and improve the overall (scientific) productivity.

PS: quite excited about my first post here


> If it's a one-man show I would not put too much weight on code quality

This makes me a little uneasy, as "I'm not too worried about code quality" can easily translate into "Yes, I know my code is full of undefined behaviour, and I don't care."

> PS: quite excited about my first post here

Welcome to HN! reddit has more cats, Slashdot has more jokes about sharks and laserbeams, but somehow we get by.


Are we talking actual undefined behavior or just behavior that's undefined by the language standard?

The latter isn't great practice, but if your environment handles behavior deterministically, and you publish the version of the compiler you're using, it doesn't seem to be a problem for this type of code.


> Are we talking actual undefined behavior or just behavior that's undefined by the language standard?

'Undefined behaviour' is a term-of-art in C/C++ programming, there's no ambiguity.

> if your environment handles behavior deterministically, and you publish the version of the compiler you're using, it doesn't seem to be a problem for this type of code.

Code should be correct by construction, not correct by coincidence. Results from such code shouldn't be considered publishable. Mathematicians don't get credit for invalid proofs that happen to reach a conclusion which is correct.

Again, this isn't some theoretical quibble. There are plenty of sneaky ways undefined behaviour can manifest and cause trouble. [0][1][2]

In the domain of safety-critical software development in C, extreme measures are taken to ensure the absence of undefined behaviour. If scientists adopt a sloppier attitude toward code quality, they should expect to end up publishing invalid results. Frankly, this isn't news, and I'm surprised the standards seem to be so low.

Also, of all the languages out there, C and C++ are among the most unforgiving of minor bugs, and are a bad choice of language for writing poor-quality code. Ada and Java, for instance, won't give you undefined behaviour for writing int i; int j = i;.

[0] https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

[1] https://blog.regehr.org/archives/213

[2] https://cryptoservices.github.io/fde/2018/11/30/undefined-be...

See also my longer ramble on this topic at https://news.ycombinator.com/item?id=24264376


I think it's poor practice, but undefined behavior shouldn't instantly invalidate results. In fact, this mindset is what keeps people from publishing the code in the first place.

Let the scientists publish UB code, and even the artifacts produced, the executables. Then, if such problems are found in the code by professionals, they can investigate it fully and find if it leads to a tangible flaw that invalidates the research or not.

You would drive yourself mad pointing out places in math proofs where some steps, even seemingly important ones, were skipped. But the papers are not retracted unless such a gap actually holds a flaw that invalidates the rest of the proof.

Let them publish their gross, awful, and even buggy code. Sometimes the bugs don't affect the outcomes.


> undefined behavior shouldn't instantly invalidate results

Granted, it's not a guarantee that the results are wrong, but it's a serious issue with the experiment. I agree it wouldn't generally make sense to retract a publication unless it can be determined that the results are invalid. It should be possible to independently investigate this, if the source-code and input data are published, as they should be.

(It isn't universally true that reproduction of the experiment should be practical given that the source and data are published, as it may be difficult to reproduce supercomputer-powered experiments. iirc, training AlphaGo cost several million dollars of compute time, for instance.)

> this mindset is what keeps people from publishing the code in the first place

As I explained in [0], this attitude makes no sense at all. It has no place in modern science, and it's unfortunate the publication norms haven't caught up.

Scientific publication is meant to enable critical independent review of work, not to shield scientists from criticism from their peers, which is the exact opposite.

> Let the scientists publish UB code, and even the artifacts produced, the executables. Then, if such problems are found in the code by professionals, they can investigate it fully and find if it leads to a tangible flaw that invalidates the research or not.

I'm not sure what to make of 'professionals', but otherwise I agree, go ahead and publish the binaries too, as much as applicable. Could be a valuable addition. (In some cases it might not be possible/practical to publish machine-code binaries, such as when working with GPUs, or Java. These platforms tend to be JIT based, and hostile to dumping and restoring exact binaries.)

I agree with your final two paragraphs.

[0] https://news.ycombinator.com/item?id=24264376


> Code should be correct by construction, not correct by coincidence.

Glad we agree, if you're aware of how your compiler handles these things, you can construct it to be correct in this way.

It won't be portable at all (even to the next patch version of the compiler), I would never let it pass a code review, but that doesn't sound like an issue that's relevant here.


> if you're aware of how your compiler handles these things, you can construct it to be correct in this way.

I presume we agree but I'll do my usual rant against UB: Deliberately introducing undefined behaviour into your code is playing with fire, and trying to outsmart the compiler is generally a bad idea. Unless the compiler documentation officially commits to a certain behaviour (rollover arithmetic for signed types, say), then you should take steps to avoid undefined behaviour. Otherwise, you're just going with guesswork, and if the compiler generates insane code, the standards documents define it to be your fault.

It might be reasonable to make carefully disciplined and justified exceptions, but that should be done very cautiously. JIT relies on undefined behaviour, for instance, as ultimately you're treating an array as a function pointer.

> It won't be portable at all (even to the next patch version of the compiler)

Right, doing this kind of thing is extremely fragile. Does it ever crop up in real-life? I've never had cause to rely on this kind of thing.

It would be possible to use a static assertion to ensure my code only compiles on the desired compiler, preventing unpleasant surprises elsewhere, but I've never seen a situation where it's helpful.

This isn't the same thing as relying on 'ordinary' compiler-specific functionality, such as GCC's fixed-point functionality. Such code will simply refuse to compile on other compilers.

> I would never let it pass a code review, but that doesn't sound like an issue that's relevant here.

Disagree. It should be possible to independently reproduce the experiment. Robust code helps with this. Code shouldn't depend on an exact compiler version, there's no good reason code should.


> After all, I'm wasting 2 months of my PhD for the marriage of my own results with known results which -in principle- could have been done within one day if the code base would allow for it.

Sounds like it is quite good science to do that, because it puts the computation on a pair of independent feet.

Otherwise, it could just be that the code you are using has a bug and nobody notices until it is too late.


I see your and MaxBarraclough's concerns. In my case, there exist 5-6 codes which do -at their core- the same thing as ours does, and they all have been cross-checked against each other within either theoretical or numerical precision (where possible). That's the spirit that sjburt was referring to, I guess, and which triggered me because it is only true to a certain extent.

The cross-checking is anyway good scientific practice, not only because of bugs in the code (that's actually a sub-leading problem imho), but because of the degree of difficulty of the problems and the complexity of their solutions (and their reproducibility). In that sense, cross-checking should discover both scientific "bugs" and programming bugs. The "debugging" is partly also done at the community level - at least in our field of research.

However, it is also a matter of efficiency. I -and many others too- need to re-implement not because of bug-hunting/cross-checking but simply because we do not understand the "ugly" code of our colleagues, and instead of taking the risk of breaking existing code we simply write new code, which is extremely inefficient (others may take the risk and then waste months on debugging and reverse-engineering, which is also inefficient). So my point on writing "good code" is not so much about avoiding bugs but about being kind to your colleagues, saving them nerves and time (which they can then spend on actual science) and thus also saving taxpayers' money...


> If you cannot the release the code because the code is too ugly or untested or has bugs, how do you expect anyone with the right expertise to assess your findings?

Yes, that should be this way.

Also, in all cases where some company research team goes to a scientific conference and presents a nifty solution to problem X without telling how it was purportedly done, publishing the code and data should be absolutely required.

(And that's also something which is broken about software patents - patents are about open knowledge, software which uses such patents is not open - this combination should not be allowed at all.)


With the caveat that while in some cases, like computational science, numerical analysis, machine learning algorithms, computer-assisted proofs, and so on, details of the code could be crucial, in other cases, they should not matter that much. I too have the impression that the HN public tends to over-value the importance of code in these cases when it is mostly a tool for evaluating a scientific result.


I am interested to know the distinction between "production-ready" and "science-ready" code.

I do not think "non-experts" should be able to use your code, but I do think an expert who was not involved in writing it should be.


Hard-coded file paths for input data. File paths hard-coded to use somebody's Google Drive so that it only runs if you know their password. Passwords hard-coded to get around the above problem.

In-code selection statements like `if( True ) {...}`, where you have no idea what is being selected or why.

Code that only runs in the particular workspace image that contains some function that was hacked out to make things work during a debugging session 5 years ago.

Distributed projects where one person wrote the preprocessor, another wrote the simulation software, and a third wrote the analysis scripts, and they all share undocumented assumptions worked out between the three researchers over the course of two years.

Depending on implementation-defined behavior (like zeroing out of data structures).

Function and variable names, like `doit()` and `hold`, which make it hard to understand the intention.

Files that contain thousands of lines of imperative instructions with documentation like "Per researcher X" every 100 lines or so.

Code that runs fine for 6 hours, then stops because some command-line input had the wrong value.

I've seen all of these over the years. Even as a domain expert who has spoken directly with authors and project leads, this kind of stuff makes it very hard to tease out what the code actually does, and how the code corresponds to the papers written about the results.
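Most of these have fixes that cost almost nothing. For the hard-coded-path problem, for instance, even a minimal sketch like this (script and argument names are hypothetical, just for illustration) lets someone else run the code on their own data without editing the source:

    # A minimal sketch (hypothetical names): take the input path and output
    # directory as arguments instead of hard-coding /home/someone/data.csv.
    import argparse
    from pathlib import Path

    def main():
        parser = argparse.ArgumentParser(description="Run the analysis on one dataset.")
        parser.add_argument("input_csv", type=Path, help="path to the input data file")
        parser.add_argument("--output-dir", type=Path, default=Path("results"),
                            help="where to write the outputs")
        args = parser.parse_args()

        args.output_dir.mkdir(parents=True, exist_ok=True)
        raw = args.input_csv.read_text()          # stand-in for the real loading code
        # ... the actual analysis would go here ...
        (args.output_dir / "summary.txt").write_text(f"read {len(raw)} bytes\n")

    if __name__ == "__main__":
        main()

It isn't production engineering; it just removes the "only runs on one person's laptop" failure mode.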


You’re giving me flashbacks! I spent a year as an admin on an HPC cluster at my university building tools/software and helping researchers get their projects running and re-lead the implementation of container usage. The amount of scientific code/projects that required libraries/files to be in specific locations, or assumed that everything was being run from a home directory, or sourced shell scripts at run time (that would break in containers) was staggering. A lot of stuff had the clear “this worked on my system so...” vibe about it.

As an admin it was quite frustrating, but I understand it sometimes when you know the person/project isn’t tested in a distributed environment. But when it’s the projects that do know how they’re used and still do those things...


One example: My code used to crash for a long time if you set the thermal speed to something greater than the speed of light. Should the code crash? No. And by now I have found the time to write extra code to catch the error and mildly insult the user (it says "Faster than light? Please share that trick with me!"). Does it matter? No. It didn't run and give plausible-but-wrong results. So that is code that I would call "science-ready", but I wouldn't want it criticized by people outside my domain.
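For what it's worth, the check itself is only a few lines. A minimal Python sketch of that kind of input validation (names are hypothetical, not my actual code):

    # A minimal sketch of the input check described above (hypothetical names).
    C_LIGHT = 299_792_458.0  # speed of light in m/s

    def check_thermal_speed(v_thermal: float) -> None:
        """Reject unphysical inputs before the simulation starts."""
        if v_thermal >= C_LIGHT:
            raise ValueError("Faster than light? Please share that trick with me!")

    check_thermal_speed(1.0e6)    # fine
    # check_thermal_speed(1.0e9)  # raises ValueError instead of crashing later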


I don't think that would be any problem (why should it?).

Code exhibiting undefined behavior is a different kettle of fish...


Which is why I run valgrind on my code (with a parameter file containing physically valid inputs) to get rid of all undefined behavior. But I gave up on running afl-fuzz, because all it found was crashes following from physically invalid inputs. I fixed the obvious ones to make the code nicer for new users, but once afl started to find only very creative corner cases I stopped.


Well done!


Then you publish your work and critics publish theirs and the community decides which claims have proven their merit. This is the fundamental structure of the scientific community.

How is "your code has error and I rebuke you" a more painful critique than "you are hiding your methodology and so I rebuke you"?


Nothing limits the field of critics to people who have written their own code and know what they are doing.


There's a ton of overlap, because science code might be a long running, multi-engineer distributed system and production code might be a script that supports a temporary business process. But let's assume production ready is a multi customer application and science ready is computations to reproduce results in a paper.

Here's a quick pass, I'm sure I'm missing stuff, but I've needed to code review a lot of science and production output and below is how I tend to think of it, especially taking efficiency of engineer/scientist time into account.

Production Ready?

* code well factored for extensibility, feature change, and multi-engineer contribution

* robust against hostile user input

* unit and integration tested

Science Ready?

* code well factored for readability and reproducibility (e.g. random numbers seeded, time calcs not set against 'now')

* robust against expected user input

* input data available? testing optional but desired, esp unit tests of algorithmic functions

* input data not available? a schema-correct facsimile of input data available in a unit test context to verify the algorithms are correct (see the sketch after this list)

Both?

* security needs assessed and met (science code might be dealing with highly secure data, as might production code)

* performance and stability needs met (production code more often requires long term stability, science sometimes needs performance within expected Big O to save compute time if it's a big calculation)
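To make the "schema-correct facsimile" point concrete, here's a minimal pytest-style sketch; the function, the columns, and the expected value are all made up for illustration:

    # A minimal pytest-style sketch of the "schema-correct facsimile" idea above;
    # the function name, columns and expected value are all hypothetical.
    import math

    def weekly_mean(rows):
        """Toy stand-in for an algorithmic function from the real analysis."""
        values = [r["count"] for r in rows if r["count"] is not None]
        return sum(values) / len(values)

    def test_weekly_mean_on_facsimile_data():
        # Tiny hand-written rows with the same schema as the (unavailable) real data.
        facsimile = [
            {"week": "2020-W01", "count": 10},
            {"week": "2020-W02", "count": 30},
            {"week": "2020-W03", "count": None},  # real data has missing weeks too
        ]
        assert math.isclose(weekly_mean(facsimile), 20.0)

The facsimile never leaves the repo, so no sensitive data is exposed, but anyone can verify the algorithmic core behaves as claimed.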


Your requirements seem to push 'Science ready' far into what I'd consider "worthless waste of time", coming from the perspective of code that's used for data analysis for a particular paper.

The key aspect of that code is that it's going to be run once or twice, ever, and it's only ever going to be run on a particular known set of input data. It's a tool (though complex) that we used (once) to get from A to B. It does not need to get refactored, because the expectation is that it's only ever going to be used as-is (as it was used once, and will be used only for reproducing results); it's not intended to be built upon or maintained. It's not the basis of the research, it's not the point of research, it's not a deliverable in that research, it's just a scaffold that was temporarily necessary to do some task - one which might have been done manually earlier through great effort, but that's automated now. It's expected that the vast majority of the readers of that paper won't ever need to touch that code, they care only about the results and a few key aspects of the methodology, which are (or should be) all mentioned in the paper.

It should be reproducible to ensure that we (or someone else) can obtain the same B from A in future, but that's it, it does not need to be robust to input that's not in the input datafile - no one in the world has another set of real data that could/should be processed with that code. If after a few years we or someone else obtains another dataset, then (after those few years, if that dataset happens) there would be a need to ensure that it works on that dataset before writing a paper about that dataset, but it's overwhelmingly likely that you'd want to modify that code anyway, both because that new dataset would not be 'compatible' (because the code will be tightly coupled to all the assumptions in the methodology you used to get that data, and because it's likely to be richer in ways you can't predict right now) and because you'd want to extend the analysis in some way.

It should have a 'toy example' - what you call 'a schema-correct facsimile of input data' that's used for testing and validation before you run it on the actual dataset, and it should have test scenarios and/or unit tests that are preferably manually verifiable for correctness.

But the key thing here is that no matter what you do, that's still in most cases going to be "write once, run once, read never" code, as long as we're talking about the auxiliary code that supports some experimental conclusions, not the "here's a slightly better method for doing the same thing" CS papers. We are striving for reproducible code, but actual reproductions are quite rare, the incentives are just not there. We publish the code as a matter of principle, knowing full well that most likely no one will download and read it. The community needs the possibility for reproduction for the cases where the results are suspect (which is the main scenario where someone is likely to attempt reproducing that code), it's there to ensure that if we later suspect that the code is flawed in a way where the flaws affect the conclusions then we can go back to the code and review it - which is plausible, but not that likely. Also, if someone does not trust our code, they can (and possibly should) simply ignore it and perform a 'from scratch' analysis of the data based on what's said in the paper. With a reimplementation, some nuances in the results might be slightly different, but all the conclusions in the paper should still be valid, if the paper is actually meaningful - if a reimplementation breaks the conclusions, that would be a successful, valuable non-reproduction of the results.

This is a big change from industry practice where you have mantras like "a line of code is written once but read ten times"; in a scientific environment that ratio is the other way around, so the tradeoffs are different - it's not worth investing refactoring time to improve readability if it's expected that most likely no one will ever read that code; it makes sense to spend that effort only if and when you need it.


Yep! I don't disagree with anything you're saying when I think from a particular context. It's really hard to generalize about the needs of 'science code', and my stab at doing so was certain to be off the mark for a lot of cases.


Yes, there are huge differences between the needs of various fields. For example, some fields have a lot of papers where the authors are presenting a superior method for doing something, and if code is a key part of that new "method and apparatus", then it's a key deliverable of that paper and its accessibility and (re-)usability is very important; and if a core claim of their paper is that "we coded A and B, and experimentally demonstrated that A is better than B" then any flaws in that code may invalidate the whole experiment.

But I seem to get the vibe that this original Nature article is mostly about the auxiliary data analysis code for "non-simulated" experiments, while Hacker News seems biased towards fields like computer science, machine learning, etc.


> the distinction between "production-ready" and "science-ready" code

In the first case, you must take into account all (un)imaginable corner cases and never allow the code to fail or hang up. In the second case it needs to produce a reproducible result at least for the published case. And do not expect it to be user-friendly at all.


I would regard (from experience) "science ready" code as something that you run just often enough to get the results to create publications.

Any effort to get code working for other people, or documented in any way would probably be seen as wasted effort that could be used to write more papers or create more results to create new papers.

This kind of reasoning was one of the many reasons I left academic research - I personally didn't value publications as deliverables.


My experience has been similar.

Still, there's plenty of room to encourage good(/better) practices which cost essentially nothing, e.g. using $PWD rather than /home/bob/foo
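Concretely, in a Python script that's the difference between these two lines (file name hypothetical):

    from pathlib import Path

    # data_file = Path("/home/bob/foo/input.dat")   # only runs on bob's machine
    data_file = Path.cwd() / "input.dat"             # runs wherever you launch it from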


If your experiment is not repeatable, it's an anecdote not data.

Any effort to write a paper readable for other people, or document the experiment in any way would probably be seen as wasted effort that could be used to create more results.

The "don't show your work" argument only makes sense if you are doing PR, not science.


If it's repeatable by you then it's a trade secret, not an anecdote


I specifically got told off by my supervisor for trying to "improve" some of the software we were working on!


Disclaimer, I'm a professional engineer and not a researcher.

The kind of code I'll ship for production will include unit testing designed around edge or degenerate cases that arose from case analysis, usually some kind of end to end integration test, aggressive linting and crashing on warnings, and enforcing of style guidelines with auto formatting tools. The last one is more important than people give it credit for.

For research it would probably be sufficient to test that the code compiles and given a set of known valid input the program terminates successfully.


>I am interested to know the distinction between "production-ready" and "science-ready" code.

In general, scientists don't care how long it takes or how many resources the code uses. It is not a big deal to run a script for an extra hour, or use up a node of a supercomputer. Extravagant solutions or added packages to make the code run smoother or faster only waste time. Speed/elegance only really matters when you know the code is going to be distributed to the community.

Basically, scientists only care whether the result is true - whether the result it outputs is sensible, defensible, reliable, reproducible. It would be considered a dick move to criticize someone's code if the code was proven to produce the correct result.


> It would be considered a dick move to criticize someone's code if the code was proven to produce the correct result.

Formal proof is much much harder than making code understandable and reviewable. It can be done but it is not easy, and can yield surprising results:

https://en.wikipedia.org/wiki/CompCert

http://envisage-project.eu/proving-android-java-and-python-s...


Do you know how you could get to the state that "the code was proven to produce the correct result"?

If not by unit tests, code review or formal logic, then what?


Not all scientific code is amenable to unit testing. From my own experience from a PhD in condensed matter physics, the main issue was that how important equations and quantities “should” behave by themselves was often unknown or undocumented, so very often each such component could only be tested as part of a system with known properties.

You can then use unit testing for low-level infrastructure (e.g. checking that your ODE solver works as expected), but do the high-level testing via scientific validation. The first line of defense is to check that you don’t break any laws of physics, e.g. that energy and electric charge is conserved in your end results. Even small implementation mistakes can violate these.
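To give a flavour of the low-level part, a minimal sketch (the toy solver, the test and the tolerance are illustrative, not from my actual code):

    # A minimal sketch of "test the low-level infrastructure"; toy example only.
    import math

    def euler_solve(f, y0, t0, t1, steps):
        """Very simple fixed-step explicit Euler integrator."""
        y, t = y0, t0
        dt = (t1 - t0) / steps
        for _ in range(steps):
            y = y + dt * f(t, y)
            t = t + dt
        return y

    def test_solver_against_known_solution():
        # dy/dt = -y with y(0) = 1 has the exact solution y(t) = exp(-t).
        y1 = euler_solve(lambda t, y: -y, 1.0, 0.0, 1.0, steps=100_000)
        assert abs(y1 - math.exp(-1.0)) < 1e-4  # loose tolerance: Euler is first order

The same pattern extends upward: a full run of the system can be checked in a test against conservation laws, with appropriately loose tolerances.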

Then you search for related existing publications of a theoretical or numerical nature, trying to reproduce their results; the more existing research your code can reproduce, the more certain you can be that it is at least consistent with known science. If this fails, you have something to guide your debugging; or if you’re very lucky, something interesting to write a paper about :).

The final validation step is of course to validate against experiments. This is not suited for debugging though, since you can’t easily say whether a mismatch is due to a software bug, experimental noise, neglected effects in the mathematical model, etc.


>If not by unit tests, code review or formal logic, then what?

Cross referencing independent experiments and external datasets.

Science doesn't work like software. The code can be perfect and still not give results that reflect reality. The code can be logical and not reflect reality. Most scientists I know go in with the expectation that "the code is wrong" and its results must be validated by at least one other source.


I'm a scientist in a group that also includes a software production team. For me, the standard of scientific reproducibility is that a result can be replicated by a reasonably skilled person, who might even need to fill in some minor details themselves.

Part of our process involves cleaning up code to a higher state of refinement as it gets closer to entering the production pipeline.

I've tested 30 year old code, and it still runs, though I had to dig up a copy of Turbo Pascal, and much of it no longer exists in computer readable form but would have to be re-entered by hand. Life was actually simpler back then -- with the exception of the built-ins of Turbo Pascal, it has no dependencies.

My code was in fact adopted by two other research groups with only minor changes needed to suit slightly different experimental conditions. It contained many cross-checks, though we were unaware of modern software testing concepts at the time.

For a result to have broader or lasting impact, replication is not enough. The result has to fit into a broader web of results that reinforce one another and are extended or turned into something useful. That's the point where precise replication of minor supporting results becomes less important. The quality of any specific experiment done in support of modern electromagnetic theory would probably give you the heebie jeebies, but the overall theory is profoundly robust.

The same thing has to happen when going from prototype to production. Also, production requires what I call push-button replication. It has to replicate itself at the click of a mouse, because the production team doesn't have domain experts who can even critique the entirety of their own code, and maintaining their code would be nearly impossible if it didn't adhere to standards that make it maintainable by multiple people at once.


This sounds great. In your opinion, do you think your team is unusual in those aspects? Do you have any knowledge of the quality of code in other branches of physics or other sciences?


Well, I know the quality of my own code before I got some advice. And I've watched colleagues doing this as well.

My own code was quite clean in the 1980s, when the limitations on the machines themselves tended to keep things fairly compact with minimal dependencies. And I learned a decent "structured programming" discipline.

As I moved into more modern languages, my code kind of degenerated into a giant hairball of dependencies and abstractions. "Just because you can do that, doesn't mean you should." I've kind of learned that the commercial programmers limit themselves to a few familiar patterns, and if you try to create a new pattern for every problem, your code will be hard to hand off.

Scientists would benefit from receiving some training in good programming hygiene.


Nit: implementations of Monte Carlo methods are not necessarily nondeterministic. Whenever I implement one, I always aim for a deterministic function of (input data, RNG seed, parallelism, workspace size).


It really helps with debugging if your MC code is deterministic for a given input seed. And then you just run for a sufficient number of different seeds to sample the probability space.


Alternatively: seed the program randomly by default, but allow the user to specify a seed as a CLI argument or function argument (for tests).

In the common case, the software behaves as expected (random output), but it is reproducible for tests. You can then publish your RNG seed with the commit hash when you release your code/paper, and others may see your results and investigate that particular code execution.


Sure that works too. But word of advice from real life: Print the random seed at the beginning of the run so you can find out which seed caused it to crash or do stupid things.
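Both of those together are only a handful of lines - a minimal sketch (names are hypothetical):

    # A minimal sketch of the seed handling described above (hypothetical names).
    import argparse
    import random

    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=None,
                        help="RNG seed; omit to get a fresh random run")
    args = parser.parse_args()

    seed = args.seed if args.seed is not None else random.SystemRandom().randrange(2**32)
    print(f"RNG seed: {seed}")     # always log it, so any run (including a crash) can be replayed
    rng = random.Random(seed)      # pass this rng around; never touch the global RNG

    # ... the Monte Carlo run then uses rng.random(), rng.gauss(mu, sigma), etc. ...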


And it seems that the people from Imperial College have done that with their epidemiological simulation. What critics claim is that their code produces non-deterministic results even when given identical inputs and identical random seeds, i.e. that their code is seriously broken. Which would be a serious issue if true.


To be more specific, the critics claim the code would yield completely different results.


I have done research on Evolutionary Algorithm and numerical optimization. It was nigh impossible to reproduce poorly described algorithms from state of the art research at the time and researchers would very often not bother to reply to inquiries for their code. Even if you did get the code it would be some arcane C only compatible with a GCC from 1996.

Code belongs with the paper. Otherwise we can just continue to make up numbers and pretend we found something significant.


In 2006 or 2008 a university in England published some fluff about genetic/evolutionary algorithms that were evolving circuits on an FPGA; specifically, the published work reported that an FPGA without a clock was able to differentiate between two tones.

I've spent the intervening years trying to find a way to implement this myself, going as far as to buy things like the ice40 FPGA because the bitstreams are supposedly unlocked; this is a prerequisite for modifying the actual gates/logic on the chip.

I've emailed the professor listed as the headliner in the articles published about it to no avail.

Nearly my entire adult life has been spent reading some interesting article, chasing down the paper, finding out if any code was published, and seeing if I could run the code myself.

It wasn't until machine learning with pytorch became mainstream that I started having luck replicating results. Just some more data points for this discussion.


Our first job as scientists is to make sure we're not fooling ourselves. I wouldn't just use any old scale to take a measurement. I want a calibrated scale, adjusted to meet a specific standard of accuracy. Such standards and calibrations ensure we can all get "the same" result doing "the same" thing, even if we use different equipment from different vendors. The concerns about code are exactly the same. It's even scarier to me because I realize that unlike a scale, most scientists have no idea how to calibrate their code to ensure accurate, reproducible results. Of course with the scales, the calibration is done by a specialized professional who's been trained to calibrate scales. Not sure how we solve this issue with the code.


I’m very puzzled by this attitude. As an accelerator physicist, would you want you accelerator to be held together by duct tape, and producing inconsistent results? Would you complain that you’re not a professional machinist when somebody pointed it out? Why is software any different than hardware in this respect?


> I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.

> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

I guess you would not need to engage personally with the idiots at "acceleratorskeptics.com", but likely most of their critique would be easily shut off by a simple sentence such as this one. Since most of your readers would not be idiots, they could scrutinize your code and even provide that reply on your behalf. This is called the scientific method.

I agree that you produce science, not merely code. Yet, the code is part of the science and you are not really publishing anything if you hide that part. Criticizing scientific code because it is bad software engineering is like criticizing it because it uses bad typography. You should not feel attacked by that.


> In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.

How would a layperson identify a faulty critique? It would be picked up by the media who would do their usual “both sides” thing.


Not that they abstain from doing that shit today, when code is not often published.

An educated and motivated layperson at least would have the chance to learn whether the critique is faulty. Today, with secret code, it is impossible to verify for almost everybody.


Race conditions and certain forms of non-determinism could invalidate the results of a given study. Code is essentially a better-specified methods section, it just says what they did. Scientists are expected to include a methods section for exactly this reason, and any scientist worried about including a methods section in their paper would be rightly rejected.

However, a methods section is always under-specified. Code provides the unique opportunity to actually see the full methods on display and properly review their work. It should be mandated by all reputable journals and worked into the peer review process.


While you're running experiments, it doesn't matter, but publishing any sort of result or using your code in parts of other publishable code IS production code, and you should treat it as such.


> people claiming that their non-software engineering grade code invalidates the results of their study.

But that's exactly the problem.

Are you familiar with that bug in early Civ games where an overflow was making Gandhi nuke the crap out of everyone? What if your code has a similar issue?

What if you have a random value right smack in the middle of your calculations and you just happened to be lucky when you run your code?

I'm not that familiar with Monte Carlo, my understanding is that this is just a way to sample the data. And I won't be testing your data sampling, but I will expect that given the same data to your calculations part (eg, after the sampling happens), I get exactly the same results every time I run the code and on any computer. And if there are differences I expect you to be able to explain why they don't matter, which will show you were aware of the differences in the first place and you were not just lucky.

And then there is the matter of magic values that plaster research code.

Researchers should understand that the rules for "software engineering grade code" are not there just because we want to complicate things, but because we want to make sure the code is correct and does what we expect it to do.

/edit: The real problem is not getting good results with faulty code, is ignoring good solutions because faulty code.


> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.

If the proof on which the paper is based is in the code that produced the evidence, you absolutely need to be able to let an average user run it without specific knowledge, to abide by the reproducibility principle. Asking a reviewer to fiddle about like an IT professional to get something working is bound to promote lazy reviewing, and will result in either dismissing the result or approving it without real review.

And by the way, it could be argued that producing a paper isn't really science either, but if you are working with MSFT Office, you know a fair number of non-science work hours have been put into that as well.


> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

Not so fast. Monte Carlo code turns arbitrary RNG seeds into outputs. That process can, and arguably should be, deterministic.

To do your study, you feed your Monte Carlo code 'random enough' seeds. Coming up with the seeds does not need to be deterministic. But once the seeds are fixed, the rest can be deterministic. Your paper should probably also publish the seeds used, so that people can reproduce everything. (And so they can check whether your seeds are carefully chosen, or really produce typical outcomes.)
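In code terms, a toy illustration of that split (a π estimator, purely for the sake of example):

    # A toy illustration of "Monte Carlo output = deterministic function of the seed".
    import random

    def estimate_pi(n_samples: int, seed: int) -> float:
        rng = random.Random(seed)       # all randomness flows from the published seed
        hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n_samples))
        return 4.0 * hits / n_samples

    # Same seed -> bit-identical result, so a reviewer can reproduce the exact numbers.
    assert estimate_pi(100_000, seed=12345) == estimate_pi(100_000, seed=12345)
    # A different seed gives a different, but statistically consistent, estimate.
    print(estimate_pi(100_000, seed=12345), estimate_pi(100_000, seed=67890))

Publishing the seeds then makes the exact figures in the paper checkable, while anyone who distrusts the chosen seeds can swap in their own.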


I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

Sure, and that rationale works OK when your code operates in a limited, specialized domain.

But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.


> Sure, and that rationale works OK when your code operates in a limited, specialized domain.

There are a lot of domains which do have a deep impact on society but are completely underfunded. For example, research on ecology and declining insect populations, or research on education. And domains like epidemiology, climate change research, or cancer research are not known to pay scientists better-than-average salaries. Most scientists earn a pittance.

> But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.

What one can expect from scientists, in whatever subject they work, is honesty, integrity, a full account of their findings. What you can't expect is that they just turn into expert software engineers or make their working code beautiful. You can't expect them to work for free. What the academic system demands from them is that they work on their next paper instead, so if you want pretty code, you need at least in part to change the system.


>when that is the entire point of Monte Carlo methods and doesn't change their result.

Two nitpicks: a) it shouldn't change the conclusions, but MC calculations will get different results depending on the seed, and b) it is considered good practice in reproducible science to fix the seed so that the results of subsequent runs give exactly the same results.

Ultimately, I think there is a balance: really poor code can lead to incorrect conclusions... but you don't need production ready code for scientific exploration.


Sorry to be pedantic, but although Monte Carlo simulations are based on pseudo-randomness, I still think it is good practice that they have deterministic results (i.e., use a given seed) so that the exact results can be replicated. If the precise numbers can be reproduced then a) it helps me as a reviewer see that everything is kosher with their code and b) it means that if I tweak the code to try something out my results will be fully compatible with theirs.


Why is "doing software engineering" not "doing science"?

Anybody who has conducted experimental research will say they spent 80% of the time using a hammer or a spanner. Repairing faulty lasers or power supplies. This process of reliable and repeatable experimentation is the basis of science itself.

Computational experiments must be held to the same standards as physical experiments. They must be reproducible and they should be publicly available (if publicly funded).


What are the frameworks used in scientific endeavours? Given that scaling is not an issue, something like Rails for science seems like it could potentially return many $(B/M)illions of dollars for humanity.


What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.

Sounds like I should just become a scientist then.

Do you guys write unit tests or is that beneath you too?


edit: please read the grandchild comment before going off on the idea that some random programmer on the Internet dares to criticize scientific code he does not understand. What is crucial in the argument here is indeed the distinction between methods employing pseudo-randomness, like Monte Carlo simulation, and non-determinism caused by undefined behavior.

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.

The person who wrote the linked blog post claims to have been a software engineer at Google. Unfortunately, that claim is not falsifiable, as the person decided to remain anonymous.

> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

The claim is that even with the same seed for the random generator, the program produces different results, and this is explained by the allegation that it runs non-deterministically (in the sense of undefined behavior) across multiple threads. It also claims that it produces significantly different results depending on which output file format is chosen.

If this is true, the code would have race conditions, and as being impacted by race conditions is a form of undefined behavior, this would make any result of the program questionable, as the program would not be well-defined.

Personally, I am very doubtful whether this is true - it would be incredibly sloppy by the Imperial College scientists. Some more careful analysis by a recognized programmer might be warranted.

However it underlines well the importance of the main topic that scientific code should be open to analysis.

> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.

Fully agree with this. But it should try to document its limitations.


> If this is true, the code would have race conditions, and as being impacted by race conditions is a form of undefined behavior, this would make any result of the program questionable, as the program would not be well-defined.

That’s not at all what that means. What are you talking about? As long as a Monte Carlo process works towards the same result it’s equivalent.

You’re speaking genuine nonsense as far as I’m concerned. Randomness doesn't imply non-determinism. Non-determinism in no way implies race conditions or undefined behavior. We care that the random process reaches the same result, not that the exact sequence of steps is the same.

This is what scientists are talking about. A bunch of (pretty stupid) nonexperts want to criticize your code, so they feel smart on the internet.


I am referring to this blog post:

https://lockdownsceptics.org/code-review-of-fergusons-model/

It says, word-by-word:

> Clearly, the documentation wants us to think that, given a starting seed, the model will always produce the same results.

>

>Investigation reveals the truth: the code produces critically different results, even for identical starting seeds and parameters.

> I’ll illustrate with a few bugs. In issue 116 a UK “red team” at Edinburgh University reports that they tried to use a mode that stores data tables in a more efficient format for faster loading, and discovered – to their surprise – that the resulting predictions varied by around 80,000 deaths after 80 days: ...

The bugs which the blog post implies here are such as those described by John Regehr: https://blog.regehr.org/archives/213

Note that I do not endorse these statements in the blog - I am rather skeptical whether they are true at all.

What the author of the blog post clearly means is "undefined behaviour" in the sense of non-deterministic execution of a program that is not well-formed. It is clear that many non-experts could confuse that with the pseudo-randomness implicit in Monte Carlo simulations, but this is a very different thing. The first is basically a broken, invalid, and untrustworthy program. The second is the established method of producing a computational result by introducing stochastic behavior, which is, for example, how modern weather models work.

These are wildly different things. I do not understand why your comment just adds to the confusion between these two things??

> A bunch of (pretty stupid) nonexperts want to criticize your code, so they feel smart on the internet.

As said, I don't endorse the critique in the blog. However, a critique of a software implementation, as well as of scientific matters, should never rest on an appeal to authority - it should logically explain what the problem is, with concrete points. Unfortunately, the cited blog post remains very vague about this, while claiming:

> My background. I have been writing software for 30 years. I worked at Google between 2006 and 2014, where I was a senior software engineer working on Maps, Gmail and account security. I spent the last five years at a US/UK firm where I designed the company’s database product, amongst other jobs and projects. I was also an independent consultant for a couple of years.

It would be much better if, instead of claiming that there could be race conditions, it could point to lines in the code with actual race conditions, and show how the results of the simulation are different when the race conditions are fixed. Otherwise, it just looks like he claims that the program is buggy because he is in no position to question the science and does not like the result.


There is something I need to add, it is a subtle but important point:

Non-determinism can be caused by

a) random seeds derived from hardware, such as seek times in an HDD controller, which are fed into a pseudo-random number generator (PRNG). This is not a problem. For debugging, or comparison, it can make sense to switch it off, though.

b) data races, which are a form of undefined behavior. This not only can dramatically change the results of a program run, but also invalidates the program logic, in languages such as C and C++. This is what the blog post on "lockdownsceptics.org" suggests. For the application area and its consequences, this would be a major nightmare.

c) What I had forgotten is that parallel execution (for example in LAM/MPI, map/reduce or similar frameworks) is inherently non-deterministic and, in combination with properties of floating-point computation, can yield different but valid results.

Here is an example:

A computation is carried out on five nodes and they return the values 1e10, 1e10, 1e-20, -1e10, -1e10, in random order. The final result is computed by summing these up.

Now, the order of computation could be:

((((1e10 + 1e10) + 1e-20) + -1e10) + -1e10)

or it could be:

(((1e10 + -1e10) + 1e-20) + (+1e10 + -1e10))

In the first case, the result would be zero, in the second case, 1e-20, because of the finite length of floating point representation.
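The same example as runnable Python, if anyone wants to check it with IEEE-754 doubles:

    # The worked example above, in doubles.
    xs = [1e10, 1e10, 1e-20, -1e10, -1e10]
    print((((xs[0] + xs[1]) + xs[2]) + xs[3]) + xs[4])   # 0.0    -- the 1e-20 is absorbed into 2e10
    print(((xs[0] + xs[3]) + xs[2]) + (xs[1] + xs[4]))   # 1e-20  -- the cancellations happen first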

_However_... if the numerical model or simulation or whatever is stable, this should not lead to a dramatic qualitative difference in the result (otherwise, we have a stability problem with the model).

Finally, I want to cite the concluding paragraphs from the post on lockdownsceptics.org:

> Conclusions. All papers based on this code should be retracted immediately. Imperial’s modelling efforts should be reset with a new team that isn’t under Professor Ferguson, and which has a commitment to replicable results with published code from day one.

> On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector. Insurers employ modellers and data scientists, but also employ managers whose job is to decide whether a model is accurate enough for real world usage and professional software engineers to ensure model software is properly tested, understandable and so on. Academic efforts don’t have these people, and the results speak for themselves.


Race conditions aren't undefined behavior in C/C++. Data races are. Lots and lots of real systems contain race conditions without catastrophe.


> Race conditions aren't undefined behavior in C/C++. Data races are.

You are right with the distinction, I had data race conditions in mind.

Race conditions can well happen in a correct C/C++ multi-threaded program in the sense that the order of specific computation steps is sometimes random. And for operations such as floating-point addition, where order of operations does matter, the exact result can be random as a consequence. But the end result should not depend dramatically on it (which is what the poster at lockdownsceptics.org claims).


I want science to be held to a very high standard. Maybe even higher than "software engineering grade". Especially if it's being used as a justification for public policy.


Perhaps just a nitpick: software engineering runs the gamut from throwing together a GUI in a few hours, all the way up to avionics software where a bug could kill hundreds. There's no such thing as 'software engineering grade'.


> people that don't understand the material making low effort critiques of minor technical points

GPT-3 FTW!


At the risk of just mirroring points which have already been made:

> you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

It's profoundly unscientific to suggest that researchers should be given the choice to withhold details of their experiments that they fear will not withstand peer review. That's much of the point of scientific publication.

Researchers who are too ashamed of their code to submit it for publication, should be denied the opportunity to publish. If that's the state of their code, their results aren't publishable. Unpublishable garbage in, unpublishable garbage out. Simple enough. Journals just shouldn't permit that kind of sloppiness. Neither should scientists be permitted to take steps to artificially make it difficult to reproduce (in some weak sense) an experiment. (Independently re-running code whose correctness is suspect, obviously isn't as good as comparing against a fully independent reimplementation, but it still counts for something.)

If a mathematician tried to publish the conclusion of a proof but refused to show the derivation, they'd be laughed out of the room. Why should we hold software-based experiments to such a pitifully low standard by comparison?

It's not as if this is a minor problem. Software bugs really can result in incorrect figures being published. In the case of C and C++ code in particular, a seemingly minor issue can result in undefined behaviour, meaning the output of the program is entirely unconstrained, with no assurance that the output will resemble what the programmer expects. This isn't just theoretical. Bizarre behaviour really can happen on modern systems, when undefined behaviour is present.

A computer scientist once told me a story of some students he was supervising. The students had built some kind of physics simulation engine. They seemed pretty confident in its correctness, but in truth it hadn't been given any kind of proper testing, it merely looked about right to them. The supervisor had a suggestion: Rotate the simulated world by 19 degrees about the Y axis, run the simulation again, and compare the results. They did so. Their program showed totally different results. Oh dear.
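
In testing terms that's an invariance (metamorphic) check. A rough Python sketch of the idea, where `simulate` is a stand-in for the students' engine rather than real code:

    import numpy as np

    def rotate_y(points, degrees):
        """Rotate an (N, 3) array of points about the Y axis."""
        t = np.radians(degrees)
        R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(t), 0.0, np.cos(t)]])
        return points @ R.T

    def check_rotation_invariance(simulate, initial_state, degrees=19):
        direct = simulate(initial_state)
        rotated = simulate(rotate_y(initial_state, degrees))
        # Physics shouldn't care about the orientation of the world:
        # rotating the rotated run's result back should match the direct run.
        assert np.allclose(direct, rotate_y(rotated, -degrees), atol=1e-6)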

Needless to say, not all scientific code can so easily be shown to be incorrect. All the more reason to subject it to peer review.

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.

Why would you care? Science is about advancing the frontier of knowledge, not about avoiding invalid criticism from online communities of unqualified fools.

I sincerely hope vaccine researchers don't make publication decisions based on this sort of fear.


> people claiming that their non-software engineering grade code invalidates the results of their study.

How exactly is this a bad thing?

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

But it should be noted that what you didn't say is that you're here to turn out accurate science.

This is the software version of statistics. Imagine if someone took a random sampling of people at a Trump rally and then claimed that "98% of Americans are voting for Trump". And now imagine someone else points out that the sample is biased and therefore the conclusion is flawed, and the response was "Hey, I'm just here to do statistics".

---

Do you see the problem now? The poster above you pointed out that the conclusions of the software can't be trusted, not that the coding style was ugly. Most developers would be more than willing to say "the code is ugly, but it's accurate". What we don't want is to hear "the conclusions can't be trusted and 100 people have spent 10+ years working from those unreliable conclusions".


Oh, he didn't say 'accurate science', nice gotcha!

This is exactly the sort of pedantic cluelessness that scientists are seeking to avoid by not publishing their code.


I don't consider accuracy in science to be pedantic, and I suspect most others don't either.

To paraphrase what the other developer said: "I don't want my work to be checked, I'm not here for accuracy, just the act of doing science".

When I was young, the ability to invalidate was the core aspect of science, but apparently that's changed over the years.


As a theoretical physicist doing computer simulations, I am trying to publish all my code whenever possible. However all my coauthors are against that. They say things like "Someone will take this code and use it without citing us", "Someone will break the code, obtain wrong results and blame us", "Someone will demand support and we do not have time for that", "No one is giving away their tools which make their competitive advantage". This is of course all nonsense, but my arguments are ignored.

If you want to help me (and others who agree with me), please sign this petition: https://publiccode.eu. It demands that all publicly funded code must be public.

P.S. Yes, my 10-year-old code is working.


>"Someone will demand support and we do not have time for that",

Well ... that part isn't nonsense, though I agree it shouldn't be a dealbreaker. And it means we should work towards making such support demands minimal or non-existent via easy containerization.

I note with frustration that even the Docker people, whose entire job is containerization, can get this part wrong. Around 2015 we containerized our startup's app to the point that you could run it locally just by installing Docker and running `docker-compose up`, and it still stopped working within a few weeks (which we discovered when onboarding new employees) and required a knowledgeable person to debug and rewrite it.

(They changed the spec for docker-compose so that the new version you'd get when downloading Docker would interpret the yaml to mean something else.)


As a theoretical physicist your results should be reproducible based on the content of your papers, where you should detail/state the methods you use. I would make the argument that releasing code in your position has the potential to be scientifically damaging; if another researcher interested in reproducing your results reads your code, then it is possible their reproduction will not be independent. However they will likely still publish it as such.


> "No one is giving away their tools which make their competitive advantage"

This hits close to home. Back in college, I developed software, for a lab, for a project-based class. I put the code up on GitHub under the GPL license (some code I used was licensed under GPL as well), and when the people from the lab found out, they lost their minds. A while later, they submitted a paper and the journal ended up demanding the code they used for analysis. Their solution? They copied and pasted pieces of my project they used for that paper and submitted it as their own work. Of course, they also completely ignored the license.


I’m curious, are dedicated software assurance teams a thing in your research area? Or is quality left up to the primary researchers?


> Or is quality left up to the primary researchers?

Individual researchers, and in many disciplines (like physics), there is almost no emphasis on quality.

I left academia a decade ago, but at the time all except one of my colleagues protested when version control was suggested to them. Some of them have codebases of 30-40K lines.


I formerly worked in research, left and am now back in a quasi-research organization.

It’s a bit disconcerting seeing how much quality is brushed aside, particularly in software. Researchers seem to intuitively grasp that they need quality hardware to do their job, yet software rarely gets the same consideration. I’ve never been able to get many of them to come around to the idea that software should be treated the same as any other engineered product that enables their research.


> protested when version control was suggested

Academics are strange like this. The root reason is fear: fear that you're complicating their process, that you're going to interrupt their productivity or flow state, that you're introducing complication that has no benefit. They then build up a massive case in their minds for why they shouldn't do this; good luck fighting it.

Doubly so if you're IT staff and don't have a PhD. There's a fundamental lack of respect on behalf of (a vocal minority) of academics about bit plumbers, until of course when they need us to do something laughably basic. It's the seeds of elitism; in reality we should be able to work together, each of us understanding our particular domain and working to help the other.


> The root reason is fear: fear that you're complicating their process, that you're going to interrupt their productivity or flow state, that you're introducing complication that has no benefit.

Yes, but how does it compare to all the complicated processes that exist in academic institutions currently? Almost all of which originated from academics themselves, mind you.


It's not that complicated. No one individual process is that bad. The problem is that there's so many that you need to steep in it for ages to pick everything up.

This means it makes most sense to pick up processes that are portable and have longevity. Learning Git is a pretty solid example.


I think this is why industry does better science than academia, at least in any area where there are applications. Generally, they get paid for being right, not just for being published, so they put respect and money into people that help get correct results.


I think this is a much wider problem than just in academia/research. Really any area where software isn't the primary product tends to have fairly lax software standards. I work in the embedded firmware field and best practices are often looked at with skepticism and even derision by the electrical engineers who are often the ones doing the programming^[1].

I think software development as a field is incredibly vast and diverse. Programming is an amazing tool, but it's a tool that requires a lot of knowledge in a lot of different areas.

^[1] This isn't universally true of course, I'm not trying to be insulting here.


"quality" is a subjectit word. Let's be clear what this means:

Individual researchers, and in many disciplines (like physics), there is almost no emphasis on correct results, merely on believable results.


There are a few standardized definitions, the most succinct being “quality is the adherence to requirements”.

As an example, if your science has the requirement of being replicable (as it should), there are a host of best practices that should flow down to the software development requirements. Not implementing those best practices would be indicative of lower quality.


Most of my codes I develop alone. No one else ever looks at them. My supervisor also develops his code alone and never shows it to anyone (not even members of the group).

In other cases, a couple of other researchers may have a look at my code or continue its development. I worked with 4+ research teams and only saw one professional programmer in one of them helping the development. Never heard about a "dedicated software assurance team".


To clarify, nobody sees the code because they aren't allowed, or nobody ever ask to see it?


The second case. However, I am hesitant to ask to look at my supervisor's code. How would I explain why I need it (if it's not needed for my research)? It's also unlikely to be user-friendly, so it would take a lot of time to understand anything.


I think you touched on something important. Researchers are most concerned with “getting things working”.

One of my favorite points from the book Clean Code was that professional developers aren’t satisfied with “working code”, they aim to make it maintainable. Which may mean writing it in a way that is more clear and concise than we are used to


> I’m curious, are dedicated software assurance teams a thing in your research area?

Are these a thing in any research area? I've heard of exactly one case of an academic lab (one that was easily 99th+ percentile in terms of funding) hiring one software engineer not directly involved in leading a research effort, and when I tell other academics about this they're somewhat incredulous. (I admittedly have a bit of trouble believing it myself -- I can't imagine the incentive to work for low academic pay in an environment where you're inevitably going to feel a sense of inferiority to first year PhD students who think they're hot shit because they're doing "research".)


>Are these a thing in any research area

I can say there are some that have the explicit intent, but it can often fall by the wayside due to cost pressure. For example, government-funded research from large organizations (think DoD or NASA) has these quality requirements, but they can often be hand-waved away or just plain ignored due to cost concerns.


Interestingly each of those arguments also applies to publishing an article describing your work.


> Scientists really need to publish their code artifacts, and we can no longer just say "Well they're scientists or mathematicians" and allow that as an excuse for terrible code with no testing specs.

You are blaming scientists but speaking from my personal experience as a computational scientist, this exists because there are few structures in place that incentivize strong programming practices.

* Funding agencies do not provide support for verification and validation of scientific software (typically)

* Few journals assess code reproducibility and few require public code (few even require public data)

* There are few funded studies to reproduce major existing studies

Until these structural challenges are addressed, scientists will not have sufficient incentive to change their behavior.

> Scientific code needs to have tests, a minimal amount of test coverage, and code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer science journals.

I completely agree.


Second this. Research code is already hard, and with misaligned incentives from the funding agencies and grad school pipelines, it's an uphill battle. Not to mention that professors with an outdated mindset might discourage graduate students from committing too much time to work on scientific code. "We are scientists, not programmers. Coding doesn't advance your career" is often an excuse for that.

In my opinion, enforcing standards without addressing this root cause is not gonna fix the problem. Worse, students and early career researchers will bear the brunt of increased workload and code compliance requirements from journals. Big, well-funded labs that can afford a research engineer position are gonna have an edge over small labs that cannot do so.


The graphics community has started an interesting initiative at this end: http://www.replicabilitystamp.org/

After a paper has been accepted, authors can submit a repository containing a script which automatically replicates results shown in the paper. After a reviewer confirms that the results were indeed replicable, the paper gets a small badge next to its title.

While there could certainly be improvements, I think it's a step in the right direction.


But does this badge influence the scientific profile / resume of the researcher in any way?


You can always put "certified by the Graphics Replicability Stamp Initiative" next to each paper on your CV. It might influence people a little, even if it isn't part of the formal review for employment / promotion. Although "Graphics Replicability Stamp Initiative" does not sound very impressive. And Federal grant applications have rules about what can be included in your profile.

Informal reputation does matter though. If you want to get things done and not just get promoted, you need the cooperation of people with a similar mindset, and collaboration is entirely voluntary.


> If journals really care about the reproducibility crisis

All is well and good then, because journals absolutely don't care about science. They care about money and prestige. From personal experience, I'd say this intersects with the interests of most high-ranking academics. So the only unhappy people are idealistic youngsters and science "users".

Let's get back to non-profit journals.


I am in 100% agreement and would like to point out that many papers based on code don't even come with code bases, and if they do those code bases are not going to contain or be accompanied by any documentation whatsoever. This is frequently by design as many labs consider code to be IP and they don't want to share it because it gives them a leg up on producing more papers and the shared code won't yield an authorship.


If published research is based on a code base, then surely the documentation and working code are as important as the carefully written paper.


I completely agree, the problem is the journal editors and reviewers largely don't.


No, the paper is what matters. The code is a means to generate the paper.


I agree, but that’s similar to saying the data is what matters, not the methodology.

In the research germane to this conversation, software is the means by which the scientific data is generated. If the software is flawed, it undermines the confidence in the data and thus the conclusions.


Most researchers would agree with the first statement without significant qualification. Methods are at the end for a reason.


Not disagreeing with your assertion on the opinion of “most researchers” but you’ll often find quite a few people advocating for using the methodology sans data as a means to determine publication worthiness to try and avoid the perverse incentives for novel or meaningful data.

I think it’s too easy to game the data (whether knowingly or not) with poor methodology. I advocate process before product, in other words.


Institutions need to provide scientists and mathematicians with coders. It's a bit insane to expect them to be software engineers as well.


There are some efforts in this vein within academia, but they are very weak in the United States. The U.S. Research Software Engineer Association (https://us-rse.org/) represents one such attempt at increasing awareness about the need for dedicated software engineers in scientific research and advocates for a formal recognition that software engineers are essential to the scientific process.

In terms of tangible results, Princeton at least has created a dedicated team of software engineers as part of their research computing unit (https://researchcomputing.princeton.edu/software-engineering).

Realistically though even if the necessity of research software engineering were acknowledged at the institutional level at the bulk of universities, there would still be the problem of universities paying way below market rate for software engineering talent...

To some degree, universities alone cannot effect the change needed to establish a professional class of software engineers that collaborate with researchers. Funding agencies such as the NIH and NSF are also responsible, and need to lead in this regard.


Thank you for the link to the Princeton group. That is encouraging. Aside from that, I share your lack of optimism about the prospects for this niche.

Most research programmers, in my experience, work in a lab for a PI. Over time, these programmers have become more valued by their team. However, they often still face a hard cap on career advancement. They generally are paid considerably less than they'd earn in the private sector, with far less opportunity for career growth. I think they often make creative contributions to research that would be "co-author" level worthy if they came from someone in an academic track, but they are frequently left off publications. They don't get the benefits that come with academic careers, such as sabbaticals, and they often work to assignment, with relatively little autonomy. The right career path and degree to build the skills required for this kind of programming is often a mismatch for the research-oriented degrees that are essential to advancement in an academic environment (including leadership roles that aren't research roles).

In short, I think there is a deep need for the emerging "research software engineer" you mention, but at this point, I can't recommend these jobs to someone with the talent to do them. There are a few edge cases (lifestyle, trailing spouse in academia, visa restrictions), but overall, these jobs are not competitive with the pay, career growth, autonomy, and even job security elsewhere (university jobs have a reputation for job security, but many research programmers are paid purely through a grant, so often these are 1-2 year appointments that can be extended only if the grant is renewed).

The Princeton group you linked to is encouraging - working for a unit of software developers who engage with researchers could be an improvement. Academia is still a long, long way away from building the career path that would be necessary to attract and keep talent in this field, though.


No one expects them to be software engineers, but we do expect them to be _scientists_ - to publish results that are reproducible and verifiable. And that has to hold for code as well.


John Carmack, who did some small amount of work on the code, had a short rebuttal of the "Lockdown Skeptics" attack on the Imperial College code that probably mirrors the feelings of some of us here:

https://mobile.twitter.com/id_aa_carmack/status/125819213475...


Can you describe a bit more about what is going on in the project? The file you linked is over 2.5k lines of c++ code, and that is just the “setup” file. As you say, this is supposed to be a statistical model, I expected this to be R, Python or one of the standard statistical packages.

Why is there so much c++ code?


It's a Monte-Carlo simulation, not a statistical model. These are usually written in C++ for performance reasons.
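
For anyone unfamiliar with the distinction, here is a toy Monte Carlo estimate (in Python for brevity; the real thing is C++ for speed). The method is stochastic, but with a fixed seed the output is still reproducible:

    import numpy as np

    def estimate_pi(n_samples, seed):
        rng = np.random.default_rng(seed)        # fixed seed -> repeatable run
        xy = rng.random((n_samples, 2))          # random points in the unit square
        inside = (xy ** 2).sum(axis=1) <= 1.0    # inside the quarter circle?
        return 4.0 * inside.mean()

    print(estimate_pi(1_000_000, seed=2020))     # same seed, same answer, every time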


Or Fortran.


Oh gosh yes, the amount of `just works` Fortran in science is one of those things akin to COBOL in business. I just know some people are thinking 10 years - ha, there will be instances of 40 and possibly 50 years for some. Heck, the sad part is that many labs still run computer systems older than 10 years just because they link to some bit of kit: the RS232 connection works fine with the old DOS software, and the updated version had issues when they last tried it. That's a common theme with specialist kit attached to a computer for control - medical equipment has that as well.


I know two fresh PhDs from two different schools whose favorite language is fortran. I think it's rather different from cobol in that way -- yes, the old stuff still works, but newer code cuts down on the boilerplate and is much more readable. And yeah, the ability to link to 50 year-old battle-tested code is quite a feature.


Large chunks of this particular code were in fact originally written in Fortran and then machine-translated into C++.


It is essentially a detailed simulation of viral spread, not just a programmed distribution or anything. It's all in C++ because it's pretty performance-critical.


Because much of this code was written in the 80's, I suspect. In general, there's a bunch of really old scientific codebases in particular disciplines because people have been working on these problems for a looooonnngg time.


Who says anything about statistical models?


In computer science a lot of researchers already publish their code (at least in the domain of software engineering), but my biggest problem is not the absence of tests but the absence of any documentation on how to run it. In the best case you can open it in an IDE and it will figure out how to run it, but I rarely see any indication of what the dependencies are. So if you figure out how to run the code, you run it until you get the first import exception, get the dependency, run until the next import exception, and so on. I spent way too much time on that instead of doing real research.


The criticisms of the code from Imperial College are strange to me. Non-deterministic code is the least of your problems when it comes to modeling the spread of a brand new disease. Whatever error is introduced by race conditions or multiple seeds is completely dwarfed by the error in the input parameters. Like, it's hard to overstate how irrelevant that is to the practical conclusions drawn from the results.

Skeptics could have a field day tearing apart the estimates for the large number of input parameters to models like that, but they choose not to? I don't get it.


I do research for a private company, and open-source as much of my work as I can. It's always a fight. So I'll take their side for the moment.

Many years ago, a paper on the PageRank algorithm was written, and the code behind that paper was monetized to unprecedented levels. Should computer science journals also require working proof of concept code, even if that discourages companies from sharing their results; even if it prevents students from monetizing the fruits of their research?


For a seasoned software developer, encountering scientific code can be a jarring experience. So many code smells. Yet most of those code smells are really only code smells in application development. Most scientific programming code only ever runs once, so most of the axioms of software engineering are inapplicable or a distraction from the business at hand.

Scientists, not programmers, should be the ones spear-heading the development of standards and rules of thumb.

Still, there are real problematic practices that an emphasis on sharing scientific code would discourage. One classic one is the use of a single script that you edit each time you want to re-parameterize a model. Unless you copy the script into the output, you lose the informational channel between your code and its output. This can have real consequences. Several years ago I started up a project with a collaborator to follow up on their unpublished results from a year prior. Our first task was to take that data and reproduce the results they obtained before, because the person no longer had access to the exact copy of the script that they ran. We eventually determined that the original result was due to a software error (which we eventually identified). My colleague took it well, but the motivation to continue the project was much diminished.


You can blame all the scientists, but shouldn't we blame the CS folks for not coming up with suitable languages and software engineering methods that will prevent software from rotting in the first place?

Why isn't there a common language that all other languages compile to, and that will be supported on all possible platforms, for the rest of time?

(Perhaps WASM could be such a language, but the point is that this would be merely coincidental and not a planned effort to preserve software)

And why aren't package managers structured such that packages will live forever (e.g. in IPFS) regardless of whether the package management system is online? Why is Github still a single point of failure in many cases?


It's hard for me to publish my code in healthcare services research because most of it is under lock-and-key due to HIPAA concerns. I can't release the data, and so 90% of the work of munging and validating the data is un-releasable. So, should I release my last 10% of code where I do basic descriptive stats, make tables, make visualizations, or do some regression modeling? Certainly, I can make that available in de-identified ways, but without data, how can anyone ever verify its usefulness? And does anyone want to see how I calculated the mean, median, SD, IQR?...because it's with base R or tidyverse, that's not exactly revolutionary code.


One of the things I come across is scientists who believe they're capable of learning code quickly because they're capable in another field.

After they embark on solving problems, it becomes an eye-opening experience, and one that quickly turns into just keeping things running.

For those who have a STEM discipline in addition to a software development background >5Y, would you agree with seeing the above?

I would have thought the scientists among us would approach someone with familiarity with software development expertise. (something abstract and requiring a different set of muscles)

One positive that's emerging is the variety of low/no-code tooling that can replace a lot of this hornets' nest coding.


It's generally not plausible to "approach someone with familiarity with software development expertise" for organizational and budget reasons. Employing dedicated software developers is simply not a thing that happens; research labs overwhelmingly have the coding done by researchers and involved students without having any dedicated positions for software development.

In any case you'd need to teach them the problem domain, and it's considered cheaper (and simpler from organizational perspective) to get some phd students or postdocs from your domain to spend half a year getting up to speed on coding (and they likely had a few courses in programming and statistics anyway) than to hire an experienced software developer and have them learn the basics of your domain (which may well take a third or half of the appropriate undergraduate bachelor's program).


> Employing dedicated software developers is simply not a thing that happens

This is a really key point that is lost on devs outside of science looking in. In our case, good devs are out of budget by a factor of 2x at least (at an EU public university in a lab doing lots of computational work).

The best we get are engineers who are expected to keep the cluster running, order computers, organize seminars... and eventually resolve any software or dev problems. This doesn't leave much time for caring about reproducibility outside the very core algorithms. The overall workflow can fade away, since the next postdoc is going to redo it anyway.


Are the hiring scientists also paid well below market wages, to the same degree?


In many fields industry pays noticeably better than academia, but the difference is not that meaningful for the actual scientists; a principal investigator gets a reasonable amount of money but also significant degree of freedom and influence which helps job satisfaction even if the pay itself is lower.

The big issue with hiring software developers is that the 'payscale' is set according to the academic criteria, and an external developer coming from the industry - no matter how experienced or skilled - can usually be offered only a junior position with pay appropriate for that because they do not meet the criteria required for non-junior positions (no PhD, often not even a masters', no relevant publications, etc). From that perspective the only difference between a grad student who just started and a seasoned software developer is that the grad student can be employed as a part-time research assistant while a 'pure' developer could be full-time; but the hourly rate and conditions would be pretty much the same, targeted at less experienced employees. We can hire skilled mid-level individual contributors with reasonable pay for post-doc positions, but post-doc positions are limited to candidates who have a PhD. And it's not a that big limitation, since it's expected that everybody who's working "in the field" will get a PhD during their first few years of practical work experience as a grad student, the concept of "experienced/skilled but no degree" is not considered by the system as such people are rare in academia, and they stay rare due to the existing system.

So the disparity in evaluation criteria means that it's tricky to transfer between the different "career paths" - if you come from an environment where degrees mostly don't matter to an environment where a PhD is almost table stakes (to be a "hiring scientist", PhD is mandatory but nowhere near sufficient), then "getting your worth" is possible only if you are willing to put in quite some time and effort to fit the criteria used to evaluate scientists, even if you're there just to do software development.


Academia definitely can place software creators at a lower payscale to keep other higher. I wonder why such a caste system exists?


It does not place software creators at a lower payscale - all the software creators I know in academia are at the payscale level where they should be given their experience, however, all of them have a PhD or are in the process of getting one very soon.

My point is that it places outsiders (no matter if they're going to do software development or something else) on a lower payscale until they catch up on all the academia-specific factors of evaluation.

It's not a caste system between different types of activities, but rather a barrier of entry - in some sense, you have to start from 'level 1' no matter how much experience you have in other fields, so inexperienced people can join easily, but for senior/experienced people doing it is possible but costly.


It’s not a caste system, rather a capitalist imitation where the capital is impact factor, grants, and first authorship in Nature and Science. In this system software creators are just a means to an end, and good software engineers are an irrational cost given the PhD student who can churn out working code for the same impact factor for less money.


As a grad student in physics, I not only wrote code, but also designed my own (computer controlled) electronics, mechanics, optics, vacuum systems, etc. I was my own machinist and millwright. Today I work in a small R&D team within a larger business, and still do a lot of those things myself when needed.

There are many problems with using a dedicated programmer, or any other technical specialist in a small R&D team. The first is keeping them occupied. There was programming to be done, but not full time. And it had to be done in an extremely agile fashion, with requirements changing constantly, often at the location where the problem is occurring, not where their workstation happens to be set up. Many developers hate this kind of work.

Second is just managing software development. Entire books have been written about the topic, and it's not a solved problem how to keep software development from eating you alive and taking ownership of your organization. Nobody knows how to estimate the time and effort. You never know if you're going to be able to recover your source code and make sense of it, if your programmer up and quits.

With apologies to Clemenceau, programming is too important to be left to the programmers. ;-)


There's no problem with not leaving programming to programmers; it's about how to encourage anyone picking up programming to develop healthier habits so that others can participate in the creation in the future.


Indeed, and one thing that's lacking is any kind of coaching or training. Those of us doing it can't necessarily coach the next generation, because stuff has gotten ahead of us.


> research labs overwhelmingly have the coding done by researchers and involved students

This is a general problem we all have, whenever we should employ a professional to do necessary work. The right professional will take a tenth of the time and the job will be done some multiple better. But how do you pick the right person?

I have two experiences with post-grad work that I think are relevant:

1. A friend needed some work done in a statistics package that used a language that felt like it was from the 80's. I was able to complete the work in a few hours, but I don't think a student could have done it (complicated need combined with a crappy language and IDE).

2. Another postdoc engineering friend needed to do some heavy duty data analysis, and she was recommended to learn C++. I suspect she wasted years learning C++, time which should have been spent on investigating different forms of analysis. She wanted to listen to her engineering fellows, not some practicing software engineer, so wasted her life not achieving much...


> I would have thought the scientists among us would approach someone with familiarity with software development expertise.

Is there a pool of skilled software architects willing to provide consultations at well-below market wages? Or a Q&A forum full of people interested in giving this kind of advice? (StackOverflow isn't useful for this; the allowed question scope is too narrow.) I guess one incentive to publish one's code is to get it criticized on places like Hacker News. The best way to get the right answer on the internet is to post the wrong answer, after all.


I'll state the obvious and answer with No. There are not enough skilled software architects to go around and many who consider themselves skilled are not actually producing good code themselves, probably including many confident posters here in this forum.

The idiosyncrasies and tastes of many 'senior' software engineers would likely make the code unreadable and unmaintainable for the average scientist and possibly discourage them from programming altogether.

Software architecture is an unsolved problem, as evident in the frequent fundamental discussions about even trivial things, highlighted by a Cambrian explosion of frameworks that try to help herd cats, and made obvious by senior programmers struggling to get a handle on moderately complex code.

I propose scientists keep their code base as simple as possible, review the code along with the ideas with their peers, maybe use Jupyter notebooks to show the iterations and keep intermediate steps, and, as others state, show the code as appropriate and try to keep it running. There is no silver bullet and very few programmers could walk into your lab or office and really clean things up the way you'd hope.


I think the suggestion to keep the codebase as simple as possible for scientists applies as well to the software creators.

Life is different when you might have a relationship with a single code base for 2-5 years, or even more. Complexities will happen on their own; no need to add them in.


Are the hiring scientists also paid well-below market wages?


> Are the hiring scientists also paid well-below market wages?

Yes. Well, in engineering anyway. That's why most engineers use academia as a stepping stone to something else. Working in science is, I think, sort of like working at a startup that's perpetually short on cash with no possibility of an exit.

With a little digging you can find published tables of job codes and salaries for many universities, e.g., https://www.udel.edu/content/dam/udelImages/human-resources/...

Faculty are listed separately for some reason: https://www.udel.edu/faculty-staff/human-resources/compensat...

Positions requiring a PhD start being listed at 29E (midrange of $55k) or 30E (midrange of $63,800). You could easily get that with a bachelor's degree in engineering 10 years ago. I suspect you will find the "Information Technology" Job Family salaries particularly amusing.


My work position was created because scientists are not engineers. I had to explain -to my disappointment- why non-deterministic algorithms are bad, how to write tests, and how to write SQL queries, more than once.

However, when working as equals, scientists and engineers can create truly transformative projects. Algorithms account for 10% of the solution. The code, infrastructure and system design account for 20% of the final result. The remaining 70% of the value comes directly from its impact. A project that nobody uses is a failure. Something that perfectly solves a problem that nobody cares about is useless.


In the event, the code actually is reproducible: https://www.nature.com/articles/d41586-020-01685-y


> This was used by the Imperial College for COVID-19 predictions. It has race conditions, seeds the model multiple times, and therefore has totally non-deterministic results[0].

>

> [0] https://lockdownsceptics.org/code-review-of-fergusons-model/

This does not look like a good example at all, as it appears the blog author there just tries to discredit the program because he does not like the results. He also writes that all epidemiological research should be defunded.


There is a fundamental reason not to publish scientific code.

If someone is trying to reproduce someone else's results, the data and methods are the only ingredients they need. If you add code into this mix, all you do is introduce new sources of bias.

(Ideally the results would be blinded too.)


This is an easy argument to make because it was already made for you in popular press months ago.

Show me the grant announcements that identify reproducible long term code as a key deliverable, and I’ll show you 19 out of 20 scientists who start worrying about it.


Short answer: Yes, my 30 year old Fortran code runs (with a few minor edits between f77 and modern fortran), as did my ancient Perl codes.

Watching the density functional theory based molecular dynamics zip along at ~2 seconds per time step on my 2-year-old laptop, versus the roughly 6k seconds per time step on an old Sun machine back in 1991, is quite something. I remember the same code getting down to 60 seconds per time step on my desktop R8k machine in the late 90s.

Whats been really awesome about that has been the fact that I've written some binary data files on big endian machines in the early 90s, and re-read them on the laptop (little endian) adding a single compiler switch.

Perl code that worked with big XML file input in the mid 2000s continues to work, though I've largely abandoned using XML for data interchange.

C code I wrote in the mid 90s compiled, albeit with errors that needed to be corrected. C++ code was less forgiving.

Over the past 4 months, I had to forward port a code from Boost 1.41 to Boost 1.65. Enough changes over 9 years (code was from 2011) that it presented a problem. So I had to follow the changes in the API and fix it.

I am quite thankful I've avoided the various fads in platforms and languages over the years. Keep inputs in simple textual format that can be trivially parsed.


> Whats been really awesome about that has been the fact that I've written some binary data files on big endian machines in the early 90s, and re-read them on the laptop (little endian) adding a single compiler switch.

I want to second the idea of just dumping your floating point data as binary. It's basically the CSV of HPC data. It doesn't require any libraries, which could break or change, and even if the endianness changes you can still read it decades later. I've been writing a computational fluid dynamics code recently and decided to only write binary output for those reasons. I'm not convinced of the long-term stability of other formats. I've seen colleagues struggle to read data in proprietary formats even a few years after creating it. Binary is just simple and avoids all of that. Anybody can read it if needed.
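
A sketch of the idea in Python/NumPy (the file name and dtype here are just examples): fix the dtype, including endianness, and the "format" is fully specified by a single line of documentation.

    import numpy as np

    field = np.linspace(0.0, 1.0, 1000)

    # Write as little-endian 64-bit floats; '<f8' is the entire format spec.
    field.astype('<f8').tofile('field.f64')

    # Years later, on any machine: knowing the dtype (and shape, if the array
    # is multidimensional) is enough to get the data back bit-for-bit.
    restored = np.fromfile('field.f64', dtype='<f8')
    assert np.array_equal(field, restored)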


Counter argument: Binary dumps are horrible because usually the documentation that allows you to read the data is missing. Using a self-documenting format such as HDF5 is far superior. It will tell you whether the bits are floating-point numbers in single or double precision, which endianness they use, and what the layout of the 3D array was. (No surprise that HDF was invented for the Voyager mission, where they had to ensure readability of the data for half a century.)
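
A rough sketch of what that self-description looks like with h5py (assuming the h5py package is available; the dataset name and attributes here are invented):

    import numpy as np
    import h5py

    density = np.random.rand(64, 64, 64)

    with h5py.File('run042.h5', 'w') as f:
        ds = f.create_dataset('density', data=density, dtype='<f8')
        ds.attrs['units'] = 'kg/m^3'          # metadata travels with the data
        ds.attrs['grid_spacing_m'] = 0.01

    # Later, on any machine: dtype, endianness, shape and metadata are all recorded.
    with h5py.File('run042.h5', 'r') as f:
        ds = f['density']
        print(ds.shape, ds.dtype, dict(ds.attrs))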


I got into the habit of documenting each file with a file.meta that I could view later on.

I did binary dumps in the past because ascii dumps (remember, 90s) were far more time/space expensive. HDF wasn't quite an option then, either HDF4, or HDF5.

These days I would probably look at something like that, though, to be honest, there is always a danger of choosing something that may not be supported over the long term. This is why I generally prefer open and simple formats for everything. HDF5 is nice and open.

One needs to look carefully at the total risk of using a proprietary format/system for any part of their storage. Chances are you will not be able to even read older data within a small number of decades if any of the format/system dependent technologies goes away.

I've got old word processor files from the mid 80s, that I can't read. What I've written there (mostly college papers) is lost (which may be a net positive for humanity).

My tarballs, and zip files though, are readable 30+ years later. That is pretty amazing.

Simple, documented, and open formats. Picture a time when you can't read/open your pptx/xlsx/docx files any more. Same with data. Simple binary formats are like CSV files, but you do need to maintain metadata on their contents, and document it extensively in the code as to what you are reading/writing, why you are doing this, and how you are doing this.

I think this will get more important over time as we start asking questions on how to maintain open artefact repositories for data and code. The fewer dependencies the better.

And unlike the recent gene renaming snafu in biology[1], you really, never, want your tool to get in the way of the science. Either in terms of formats, or interpretation of data.

[1] https://www.theverge.com/2020/8/6/21355674/human-genes-renam...


Your argument raises a lot of good points. I actually agree that binary does lose all of the metadata and documentation that goes with it. That is a big problem. That is why I think it is also important to include some sort of documentation like an Xdmf file [1]. That is what I use to tie everything together in my particular project. HDF5 is fine. In fact, I would have strongly preferred my colleagues using HDF5 over the proprietary format that they did end up using. But HDF5 requires an additional library. I did not want to use any external libraries in my particular project (other than MPI), so I tried to look for a solution that achieves close to what HDF5 can achieve but without requiring something as "heavy" as HDF5. I have to admit that perhaps my design choice does not work for more complex situations, but I think it is something people should consider before tying themselves down too much.

[1] http://www.xdmf.org/index.php/Main_Page


Having an Xdmf file alongside is nice, but the breaking changes between v2 and v3 are very annoying. And I understand the want to have few external dependencies, but at least HDF5 is straight forward to compile and available as a pre-compiled module on all supercomputers that I have ever seen.


Why not dump into SQLite? It makes everything easy, and we will be able to use sqlite3 for a long time IMO.
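
For a single-process run it really is that simple; a sketch using Python's built-in sqlite3 module (the table layout is invented for illustration):

    import sqlite3

    samples = [(step, 0.1 * step) for step in range(100)]   # stand-in results

    con = sqlite3.connect('results.sqlite')
    con.execute('CREATE TABLE IF NOT EXISTS samples (step INTEGER, value REAL)')
    con.executemany('INSERT INTO samples VALUES (?, ?)', samples)
    con.commit()
    con.close()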


Because parallel IO from a lot of different MPI ranks is not supported. And filesystems tend to look unhappy when 100k processes try to open a new file at the same time.


Yes, I know a couple of Fortran 77 apps and libraries which were developed more than 25 years ago and which are still in use today.

My C++ Qt GUI application for NMR spectrum analysis (https://github.com/rochus-keller/CARA) has been running for 20 years now, with continuing high download and citation rates.

So obviously C++/Qt and Fortran 77 are very well suited to stand the test of time.


Nice. Interesting to know that GitHub stars aren't always a representative metric.


Yes, many of my apps and libs were more than ten years old when I pushed them to github. Some projects started before git was invented.


As someone who worked with bits of scientific code: "Does the code you write right now work on another machine?" might be the more appropriate challenge. I've seen a lot of hardcoded paths, unmentioned dependencies and monkey-patched libraries downloaded from somewhere; just getting new code to work is hard enough. And let's not even begin to talk about versioning or magic numbers.

Similar to other comments I don't mean to fault scientists for that - their job is not coding and some of the dependencies come from earlier papers or proprietary cluster setups and are therefore hard to avoid - but the situation is not good.


> their job is not coding

To me, that's like a theoretical physicist saying "My job is not to do mathematics" when asked for a derivation of a formula he put in the paper.

Or an experimental physicist saying "My job is not mechanical engineering" when asked for details of their lab equipment (almost all of which is typically custom built for the experiment).


On one hand, yes. But on the other hand, reusable code, dependency management, linting, portability etc. are not easy problems, and something junior developers tend to struggle with (and it's not like the problem never pops up for seniors, either). I really can't fault non-CS scientists for not handling that problem well. Of course, part of it (like publishing the relevant code) is far easier and should be done, but some aspects are really hard.

IMO the incentive problem in science (basically number of papers and new results is what counts) also plays into this, as investing tons of time in your code gives you hardly any reward.


> But on the other hand, reuseable code, dependency management, linting, portability etc are not that easy problems and something junior developers tend to struggle with

On the original hand, these are easier problems than all the years of math education they have. Once you're relying on simulations to get results to explain natural phenomena, it needs to be put on the same pedestal as mathematics.


There are tons of tutorials on using conda for dependency management, it's not rocket science. And using a linter is difficult? If a scientist needs to read and write code as part of their job then they should learn the basics of programming - that includes tools and 'best practices'.


The point is that as a scientist your code is a tool to get the job done and not the product. I can't spend 48 hours writing unit tests for my library (even though I want to) if it's not going to give me results. It's literally not my job and is not an efficient use of my time


If the code you base your work on is horrible it definitely makes me question your results. That's why it's called the reproducibility crisis.

Writing some tests, using a linter, commenting your code, and learning about best programming practices doesn't take long and pays off - even for yourself when writing the code or you need to touch the code again. "48 hours writing unit tests" is a ridiculous comparison.
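
Even a couple of cheap sanity checks help; a minimal pytest-style sketch (the model function and its invariants are invented for illustration):

    def growth_model(population, rate, steps):
        """Toy stand-in for 'the analysis code'."""
        return population * (1.0 + rate) ** steps

    def test_zero_rate_leaves_population_unchanged():
        assert growth_model(1000.0, 0.0, 50) == 1000.0

    def test_decline_never_goes_negative():
        assert growth_model(1000.0, -0.5, 10) >= 0.0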


No, the reproducibility crisis is that when 10 people write the same library you all get different results.

This is just complaining that science is too hard because you can't be bothered to replicate an experiment.


This is the same as any other argument against testing. Unless you are actually selling a library, code is not the product. Customers are buying results, not your code base. Yet, we've discovered the importance of testing to make sure customers get the right results without issues.

If you want your results to be usable by others, the quality of the code matters. If all you care about is publishing a paper, then I guess it doesn't matter whether anyone else can build off your work.


But the results are usable by others, in most fields of science the code is not part of these results and is not needed to enjoy, use and build upon the research results.

The only case where the code would be used (which is a valid reason why it should be available somehow) is to assert that your particular results are flawed or fraudulent; otherwise the quality of the code (or its availability, or even existence - perhaps you could have had a bunch of people do all of it on paper without any code) is simply irrelevant if you want your results to be usable by others.


> The only case where the code would be used (which is a valid reason why it should be available somehow) is to assert that your particular results are flawed or fraudulent;

Not true. Code is often used and reused to churn out a lot more results than the initial paper. A flaw in the code doesn't just show one paper/result as problematic. It can show a large chunk of a researcher's work in his area of expertise to be problematic.


> The point is that as a scientist your code is a tool to get the job done and not the product.

Everything you say is as true for experimental equipment and mathematical tools. Physicists are fantastic at mathematics, yet they are some of the most anti-math people I know - in the sense of "Mathematics is just a tool to get results that explain nature! Doing mathematics for its own sake is a waste of time!"

The equation is not the product - the explanation of physical phenomena is. If the attitude of "I don't need to show how I got this equation" is unacceptable, the same should go for code.


> I can't spend 48 hours writing unit tests for my library

No one is insisting on top quality code, but there has to be an acceptance that code can be flawed and that needs to be tested for.


How do you know it won't give you results? Maybe it will find a bug that would have resulted in an embarrassing retraction.

Maybe it wouldn't find any bugs, but give confidence to and encourage other users and increasing your citations and "impact".

Maybe it will just save you 48h later on when you need to adapt the code.

Software engineering has generally accepted that unit testing is a good practice and well worth the time taken. Why do you think science is different?


> Why do you think science is different?

It's really not, I guess his focus lies on cranking out irreproducible papers.


That's literally what they do.

Theoretical Physicists (literal conversation I had):

>Yeah, this looked like it simplifies to 1-ish and Smart John said it's probably right.

Experimental physicists (another literal conversation):

>Yeah, we built it with duct tape and there's hot glue holding the important bits that kept falling off. Don't put anything metal in that, we use it as a tea heater, but there's 1000A running through it so it shoots spoons out when we turn the main machine on.


Lots of people are saying it is the scientist's job to produce reproducible code. It is, and the benefits of reproducible code are many. I have been a big proponent of it in my own work.

But not with the current mess of software frameworks. If I am to produce reproducible scientific code, I need an idiot-proof method of doing it. Yes, I can put in the 50-100 hours to learn how to do it [1], but guess what, in about 3-5 years a lot of that knowledge will be outdated. People compare it with math, but the math proofs I produce will still be readable and understandable a century from now.

Regularly used scientific computing frameworks like MATLAB, R, the Python ecosystem and Mathematica need a dumb, guided method of producing releasable and reproducible code. I want to go through a bunch of 'next' buttons that help me fix the problems you indicate, and finally release a version that has all the information necessary for someone else to reproduce the results.

[1] I have. I would put myself in the 90th percentile of physicists familiar with best practices for coding. I speak for the 50% percentile.


The dumb guide is the following:

(1) Use a package manager that stores hash sums in a lock file.
(2) Install your dependencies from the lock file, treating it as the spec.
(3) Do not trust version numbers; trust hash sums. Do not believe in "But I set the version number!".
(4) Do not rely on downloads. Again, trust hash sums, not URLs.
(5) Hash sums!!!
(6) Wherever there is randomness, as in random number generators, use a seed. If the interface does not let you specify the seed, throw the thing away and use another generator. Be careful when concurrency is involved; it can destroy reproducibility. For example, this was the case with TensorFlow. Not sure it still is.
(7) Use a version control system.
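
To make points (3)-(6) concrete, a small Python sketch (the file name and expected hash are placeholders):

    import hashlib
    import numpy as np

    # (3)-(5): pin a dependency or input file by content, not by name or version.
    EXPECTED_SHA256 = '0000...placeholder...0000'
    digest = hashlib.sha256(open('dependency.tar.gz', 'rb').read()).hexdigest()
    assert digest == EXPECTED_SHA256, 'file changed underneath you'

    # (6): seed every source of randomness explicitly.
    rng = np.random.default_rng(seed=20200824)
    samples = rng.normal(size=1000)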


> in about 3-5 years a lot of that knowledge will be outdated

Yup, and most of the points you mentioned will probably not be outdated for quite some while. Every package manager I'm aware of with lock files that are that old can still consume them today.


I emailed an author of a 5-year-old paper and they said they had lost their original MATLAB code, which certainly brings their paper into question.


Definitely makes you question it more. Does the paper not explain the contents of the MATLAB code? That's all that is usually needed for reproducibility. You should be able to get the same results no matter who writes the code to do what is explained in their methods.

Of course, I have no idea about the paper you're talking about and just want to say that reproducibility isn't dependent on releasing code. There could even be a case where it's better if someone reproduces a result without having been biased by someone else's code.


If a scientist needs to write code then it's part of their job. It's as easy as that.


I think the idea that scientific code should be judged by the same standards as production code is a bit unfair. The point when the code works the first time is when an industry programmer starts to refactor it -- because he expects to use and work on it in the future. The point when the code works the first time is when a scientist abandons it -- because it has fulfilled its purpose. This is why the quality is lower: lots of scientific code is the first iteration that never got a second.

(Of course, not all scientific code is discardable; large quantities of reusable code are reused every day. We have many frameworks, and the code quality of those is completely different.)


That's not the point, though. If you obtain your results by writing and executing code then code quality matters - to reproduce and validate them.


> their job is not coding

But it often is. For most non-CS papers (mostly biosciences) I've read, there are specific authors whose contribution to a large degree was mainly "coding".


The gold standard for a scientific finding is not whether a particular experiment can be repeated; it is whether a different experiment can confirm the finding.

The idea is that you have learned something about how the universe works. Which means that the details of your experiment should not change what you find... assuming it's a true finding.

Concerns about software quality in science are primarily about avoiding experimental error at the time of publication, not the durability of the results. If you did the experiment correctly, it doesn't matter if your code can run 10 years later. Someone else can run their own experiment, write their own code, and find the same thing you did.

And if you did the experiment incorrectly, it also doesn't matter if you can run your code 10 years later; running wrong code a decade later does not tell you what the right answer is. Again--conducting new research to explore the same phenomenon would be better.

When it comes to hardware, we get this. Could you pick up a PCR machine that's been sitting in a basement for 10 years and get it running to confirm a finding from a decade ago? The real question is, why would you bother? There are plenty of new PCR machines available today, that work even better.

And it's the same for custom hardware. We use all sorts of different telescopes to look at Jupiter. Unless the telescope is broken, it looks the same in all of them. Software is also a tool for scientific observation and experimentation. Like a telescope, the thing that really matters is whether it gives a clear view of nature at the time we look through it.


Reproducibility is about understanding the result. It is the modern version of "showing your work".

One of the unsung and wonderful properties of reproducible workflows is the fact that it can allow science to be salvaged from an analysis that contains an error. If I had made an error in my thesis data analysis (and I did, pre-graduation), the error can be corrected and the analysis re-run. This works even if the authors are dead (which I am not :) ).

Reproducibility abstracts the analysis from data in a rigorous (and hopefully in the future, sustainable) fashion.


>Reproducibility is about understanding the result. It is the modern version of "showing your work".

That is something no one outside of high school cares about. The idea that you can show work in general is ridiculous. Do I need to write a few hundred pages of set theory to start using addition in a physics paper? No. The work you need to show is the work a specialist in the field would find new, which is completely different from what a layman would find new.

Every large lab, the ones that can actually reproduce results, has decades of specialist code that does not interface with anything outside the lab. Providing the source code is then as useful as giving a binary print out of an executable for an OS you've never seen before.


I heartily disagree -- reproducible analysis is essential for the intercomparison of analyses between specialist research groups.

Here is my thesis work, minus the data, which is too large to store within a GitHub repo. By calling `make`, it goes from raw data to final document in a single shot. The entire workflow, warts and all, can be audited. If you see a bug or concern, please let me know: https://github.com/4kbt/PlateWash


> running wrong code a decade later does not tell you what the right answer is.

It can tell, however, exactly where the error lies (if the error is in software at all). Like a math teacher that can circle where the student made a mistake in an exam.


Yes, this argument, along with the practices of cross checking within one project, is what saves science from the total doom its software practices would otherwise deliver.

However, reproducibility is a precondition to automation, and automation is a real nice thing to have.


Yes. 110% attributed to learning about unit-tests and gems/CPAN in grad school.

IMO there is a big fallacy in the "just get it to work" approach. Most serious scientific code, i.e. code supporting months to years of research, is used and modified a lot. It's also not really one-off; it's a core part of a dissertation or research program, and if it fails, you do too. I'd argue (and I found) that using unit tests, a deployment strategy, etc. ultimately allowed me to do more and better science, because in the long run I didn't spend as much time figuring out why my code didn't run when I tweaked stuff. This is really liberating stuff. I suspect this is all obvious to those who have gone down that path.

Frankly, every reasonably tricky problem benefits from unit-tests as well for another reason. Don't know how to code it, but know the answer? Assert lots of stuff, not just one at a time red-green style. Then code, and see what happens. So powerful for scientific approaches.
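
A minimal sketch of that workflow in Python (the trapezoid-rule routine is just a stand-in example, not anything from my research): write the assertions for answers you already know first, then iterate on the implementation until they all pass.

  import math

  def integrate(f, a, b, n=10_000):
      """Composite trapezoid rule -- the thing we are trying to get right."""
      h = (b - a) / n
      total = 0.5 * (f(a) + f(b))
      for i in range(1, n):
          total += f(a + i * h)
      return total * h

  def test_integrate():
      # Assert lots of things whose answers are already known.
      assert abs(integrate(lambda x: 1.0, 0, 3) - 3.0) < 1e-9
      assert abs(integrate(lambda x: x, 0, 1) - 0.5) < 1e-9
      assert abs(integrate(math.sin, 0, math.pi) - 2.0) < 1e-6
      assert abs(integrate(math.exp, 0, 1) - (math.e - 1)) < 1e-6

  if __name__ == "__main__":
      test_integrate()
      print("all known answers reproduced")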


And bugs can have quite big implications:

https://smw.ch/article/doi/smw.2020.20336


The longest-running code I wrote as a scientist was a sandwich ordering system. I worked for a computer graphics group at UCSF while taking a year off from grad school while my simulations ran on a supercomputer, and we had a weekly group meeting where everybody ordered sandwiches from a local deli.

It was 2000, so I wrote a cgi-bin in Python (2?) with a MySQL backend. The menu was stored in MySQL, as were the orders. I occasionally check back to see if it's still running, and it is- a few code changes to port to Python3, a data update since they changed vendors, and a mysql update or two as well.

It's not much but at least it was honest work.


My Tetris, first "serious" game I made, written in 1996, can play just as well in DosBox.


An interesting concern is that there often is no single piece of code that has produced the results of a given paper.

Often it is a mixture of different (and evolving) versions of different scripts and programs, with manual steps in between. Often one starts the calculation with one version of the code, identifies edge cases where it is slow or inaccurate, develops it further while the calculations are running, does the next step (or re-does a previous one) with the new version, possibly modifying intermediate results manually to fit the structure of the new code, and so on -- the process is interactive, and not trivially repeatable.

So the set of code one has at the end is not the code the results were obtained with: it is just the code with the latest edge case fixed. Is it able to reproduce the parts of the results that were obtained before it was written? One hopes so, but given that advanced research may take months of computer time and machines with high memory/disk/CPU/GPU/network speed requirements only available in a given lab -- it is not at all easy to verify.


>the process is interactive, and not trivially repeatable.

The kind of interaction you're describing should be frowned upon. It requires the audience to trust that the manual data edits are no different from rerunning the analysis. But the researcher should just rerun the analysis.

Also, mixing old and new results is a common problem in manually updated papers. It can be avoided by using reproducible research tools like R Markdown.


If it can't be trivially repeated, then you should publish what you have with an explanation of how you got it. Saying that "the researcher should just rerun the analysis" is not taking into account the fact that this could be very expensive and that you can learn a lot from observations that come from messy systems. Science is about more than just perfect experiments.


And any such "research" should go in the bin. Reproducibility of final results and their review is key.


No, you should publish this research and be clear with how it all worked out and someone will reproduce it in their own way.

Reproducibility isn't usually about having a button to press that magically gives you the researchers' results. It's also not always a set of perfect instructions. More often it is documentation of what happened and what was observed, as the researchers believe is important to the understanding of the research questions. Sometimes we don't know what's important to document, so we try to document as much as possible. This isn't always practical, and sometimes it is obviously unnecessary.


Back in the 80s/90s I was heavily into TeX/LaTeX—I was responsible for a major FTP archive that predated CTAN, wrote ports for some of the utilities to VM/CMS and VAX/VMS and taught classes in LaTeX for the TeX Users Group. I wrote most of a book on LaTeX based on those classes that a few years back I thought I'd resurrect. Even something as stable as LaTeX has evolved enough that just getting the book to recompile with a contemporary TeX distribution was a challenge. (On the other hand, I've also found that a lot of what I knew from 20+ years ago is still valid and I'm able to still be helpful on the TeX stack exchange site).


Just as a quick bit of context here, Konrad Hinsen has a specific agenda that he is trying to push with this challenge. It's not clear from this summary article, but if you look at the original abstract soliciting entries for the challenge (https://www.nature.com/articles/d41586-019-03296-8), it's a bit clearer that Hinsen is using this to challenge the technical merits of Common Workflow Language (https://www.commonwl.org/; currently used in bioinformatics by the Broad Institute via the Cromwell workflow manager).

Hinsen has created his own DSL, Leibniz (https://github.com/khinsen/leibniz ; http://dirac.cnrs-orleans.fr/~hinsen/leibniz-20161124.pdf), which he believes is a better alternative to Common Workflow Language. This reproducibility challenge is in support of this agenda in particular, which is worth keeping in mind; it is not an unbiased thought experiment.


Konrad Hinsen is an expert in molecular bioinformatics and also has significantly contributed to Numerical Python, for example, and has extensively published around the topic of reproducible science and algorithms - see his blog.

The fact that he might favor different solutions from you does not mean that he is pushing some kind of hidden agenda.

If you think that Common Workflow Language is a better solution, you are free to explain in a blog why you think this.

Are you saying that the reproducibility challenge poses a difficulty for Common Workflow Language? If this is so, would that not rather support Hinsen's point - without implying that what he suggests is already a perfect solution?


I never said that Konrad Hinsen's agenda was hidden; in fact, it's not at all hidden (which is why I linked the abstract). It's just that this context isn't at all clear in the Nature write-up, and it's relevant to take into account.

I haven't taken the time to seriously contemplate the merits of CWL vs Leibniz, although my gut instinct is that we don't really need another domain-specific language for science given the profusion of such languages that already exist (Mathematica, Maple, R, MATLAB, etc). That's the extent of my bias, but again, it's a gut instinct and not a comprehensive well-reasoned argument against Leibniz.


I never answered your last question so here goes:

> Are you saying that the reproducibility challenge poses a difficulty for Common Workflow Language?

I don't actually understand how the reproducibility challenge undermines the validity of using CWL / flow-based programming as an approach to promoting reproducible analyses. There certainly wasn't anything in the article that made me think that CWL was challenged, but Hinsen explicitly called out CWL in the abstract, which implies that for some reason he thinks, a priori, that it's a non-solution. He never justifies this implied assumption further, and as near as I can tell, none of the attempted replications used a flow-based language.

If Hinsen really aimed to argue against the viability of CWL/flow-based programming as an approach to reproducibility, he would have done a systematic comparison of historical analyses that used a flow-based system (like National Instruments' Labview or Prograph) vs analyses that are more similar to the approach that he seems to favor (i.e., analyses using Mathematica or Maple).

While I find the challenge interesting to follow, and the retrocomputing geek in me finds it fun, I don't actually understand what it really accomplished other than being a fun diversion. Assuming that an analysis was written in a Turing-complete language and you didn't use non-deterministic algorithms, you should theoretically be able to reproduce the results exactly on modern hardware, and using non-deterministic algorithms I would imagine that a result would be "close enough" within some kind of confidence interval. You may need to go to great lengths (in terms of emulating instruction sets, ripping tapes, etc), but I think a visit to any retrocomputing festival or computer history museum would have made that pretty obvious from the outset.


There seems to be some misunderstanding here.

CWL is intended for stringing together other programs. It is useful for reproducibility in that it attempts to provide a fairly specific description of the runtime environment needed to execute a program, and also abstracts site-specific details such as file system layout or batch system in use. CWL platforms such as Arvados also generate comprehensive provenance traces which are vital for going back and reviewing how a data result was produced.

Leibniz seems to be a numerical computing language for describing equations, which is more similar to something like NumPy or R. It seems like an apples-and-oranges comparison.

The original call-out is weird, because CWL did not exist 10 years ago, so you can't yet answer the question of whether it facilitates running 10-year-old workflows.


Plenty of actual professional programmers can't manage this, how is it a fair standard to hold scientists to, when the code is just one of the many tools they're trying to use to get their real job done?

I think moving away from the cesspool of imported remote libraries that update at random times and can vanish off the internet without warning, would help a lot of both cases.


Professional programmers should adopt package managers that focus on reproducibility, like Guix and Nix, and make them accessible enough for non-programmers to use.

Neither of these are perfect but in my experience they are worlds better than apk, Dockerfiles, and many other commonly used solutions.

http://guix.gnu.org/

https://nixos.org/


I've had the pleasure of setting up a guix system in production.

The next guy to come along didn't understand it and threw it all away.


That’s very unfortunate.


I think we have to hold scientists to higher standards for code quality because it has a direct impact on their findings. How many off-by-one or other subtle errors, only found later in testing, have most software engineers written in their careers? Is it fine to just say, eh, scientific results can be off by one because the standards should be lower?


>Plenty of actual professional programmers can't manage this, how is it a fair standard to hold scientists to

That's a good point. On a tangential note, prototype code tends to be at a higher level than production code, so there is a higher chance 10 year old code will continue to run on the scientist side, as long as the libraries imported haven't vanished.


The two main problems in academia are that a) few researchers have formal training in best practices of software engineering, and b) time pressure leads to "whatever worked two minutes before the submission deadline" becoming what is kept for posterity.

When I started working as a full-time researcher, I had come from working two years in a software shop, only to find people at the research lab having never used VCS, object-oriented programming, etc. Everyone just put together a few text files and Python or MATLAB scripts that output some numbers that went into Excel or gnuplot scripts that got copy-pasted into LaTeX documents with suffixes like "v2_final_modified.tex", shared over Dropbox.

Took a long time to establish some coding standards, but even then it took me a while to figure out that that alone didn't help: you need a proper way to lock dependencies, which, at the time, was mostly unknown (think requirements.txt, packrat for R, …).


Don't you think docker, dependencies, unit test frameworks, etc actually increase the need for ongoing maintenance as opposed to spitting out some C files or python scripts which last "forever"?


I don't think so. The source code is the same, but there's now metadata that helps in setting up the same environment again, even years later. You still have the original code in case, e.g., Docker is no longer available.

For instance, if you just have a Python script importing a statistical library, what version are you going to use? Scipy had a pretty nasty change in one of its statistical functions, changing the outcome of significance tests in our project. Depending on which version you happened to have installed it'd give you a positive or negative result.
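
Even without Docker, the cheapest useful metadata is a record of which interpreter and library versions actually produced a given result. A minimal sketch in Python (the package list and output file name are arbitrary examples):

  import json
  import platform
  from importlib.metadata import version, PackageNotFoundError

  def environment_snapshot(packages=("numpy", "scipy", "pandas")):
      """Record interpreter and package versions next to the results."""
      snapshot = {"python": platform.python_version()}
      for name in packages:
          try:
              snapshot[name] = version(name)
          except PackageNotFoundError:
              snapshot[name] = None
      return snapshot

  # Written alongside every results file, so a reader can tell which
  # library versions actually produced a given p-value.
  with open("results_env.json", "w") as f:
      json.dump(environment_snapshot(), f, indent=2)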


It makes sense that having more information is better than less.

I would argue that they should use no dependencies to avoid this problem entirely, or download them and include them as source in the project, or at least include a note of which version of a major library they used in a README or comment. I think this is what is often done in practice currently.

Perhaps as you are saying, docker is just a stable way to document this stuff formally. But it is a large moving part that assumes a lot of stuff is still on the internet. What if the docker hub image is removed or dramatically changed? What if that OS package manager no longer exists? It just doesn't seem like our software is getting more longevity, but less. I don't know why we would bring that extra complexity to academic research if the goal is longevity.


No.

Python/C files didn't work in a vacuum. They need dependencies; that is the point of Docker, after all.

Capture all necessary dependencies into a single image.


> Python/C files didn't work in a vacuum

They do if you use the standard library (which for Python is quite extensive), and copy any dependencies into your own source, as if they are your own. By "in a vacuum" we mean that if Python is installed, it will work.

> Capture all necessary dependencies

Docker doesn't capture any dependencies. They still exist on the internet. It just captures a list of which ones to download when you build the image.

Do you think software we write now has more longevity than older software that uses make or a shell script?


Linus is still able to run the first ./a.out he built on pre-0.1 Linux.

I can't build a Dockerfile from 5 years ago because all the links are dead.


requirements.txt is not a lockfile


This seems like a fluff piece because:

1) Prototype code scientists write tends to be written at a high level, so barring imported libraries up and disappearing, there is a high chance that code written by scientists will run 10 years later. There is a higher chance it will run than production code written at a lower level.

2) The article dives into documentation but scientists code in the Literate Programming Paradigm[0] where the idea is you're writing a book and the code is used as examples to support what you're presenting. Of course scientists write documentation. Presenting findings is a primary goal.

3) Comments here have mentioned unit testing. Some of you may scoff at this but when prototyping, every time you run your code, the output from it teaches you something, and that turns into an iterative feedback loop, so every time you learn something you want to change the code. Unit tests are not super helpful when you're changing what the code should be doing every time you run it. Unit tests are better once the model has been solidified and is being productionized. Having a lack of unit testing does not make 10 year old prototype code harder to run.

[0] https://en.wikipedia.org/wiki/Literate_programming


> scientists code in the Literate Programming Paradigm

I wish. In my career as a computational scientist I have never seen this in practice, either in academia or industry.

On unit testing, I half agree. Most unit tests get quickly thrown out as the code changes, so it's a depressing way to write research code. But tests absolutely help someone trying to run old code - they show what parts still work and how to use them.


It's even more common today. eg, Jupyter Lab or Jupyter Notebooks.


Notebooks are to Literate Programming what Word is to TeX.


I wrote a C++ implementation of the AMBER force field in 2003. Still have the source code with its original modification times. Let's see:

  /usr/bin/g++   -I/home/dek/sw/rh9/gsl-1.3/include    -c -o NBEnergy.o NBEnergy.cpp
  NBEnergy.cpp: In member function ‘virtual double NBEnergy::Calculate(Coordinates&, std::vector<Force*>)’:
  NBEnergy.cpp:20:68: error: no matching function for call to ‘find(std::vector<atom*>::const_iterator, std::vector<atom*>::const_iterator, const atom*&)’
   20 |       if (std::find(at1->Excluded.begin(), at1->Excluded.end(), at2) != at1->Excluded.end()) 
  {
        |                                                                    ^
  In file included from /usr/include/c++/9/bits/locale_facets.h:48,
                   from /usr/include/c++/9/bits/basic_ios.h:37,
                   from /usr/include/c++/9/ios:44,
                   from /usr/include/c++/9/ostream:38,
                   from GeneralParameters.h:6,
                   from NBEnergy.h:6,
                   from NBEnergy.cpp:1:
  /usr/include/c++/9/bits/streambuf_iterator.h:373:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT> >::__type std::find(std::istreambuf_iterator<_CharT>, std::istreambuf_iterator<_CharT>, const _CharT2&)’
    373 |     find(istreambuf_iterator<_CharT> __first,
      |     ^~~~
  /usr/include/c++/9/bits/streambuf_iterator.h:373:5: note:   template argument deduction/substitution failed:
  NBEnergy.cpp:20:68: note:   ‘__gnu_cxx::__normal_iterator<atom* const*, std::vector<atom*> >’ is not derived from ‘std::istreambuf_iterator<_CharT>’
   20 |       if (std::find(at1->Excluded.begin(), at1->Excluded.end(), at2) != at1->Excluded.end()) 
  {
        |                                                                    ^
  make: *** [<builtin>: NBEnergy.o] Error 1

I still have a hardcoded reference to RedHat 9 apparently. But the only error has to do with an iterator, so clearly, something in C++ changed. Looks like a 1-2 line change.


You probably didn't include the algorithm header that defines find directly and it stopped compiling once the standard library maintainers cleaned up their own includes. The iostreams headers you include define their own stream iterator specific overload of find and that doesn't match.


Yup, that was it.

After that, I had to install libpython27-dev, and add -fPIC. Then my 17 year old Python module that has linked-in C++ code runs just fine. I'm not surprised- I've been writing cross-platform code that runs for 10+ years for 20+ years.


I think it's unfair to expect anyone to maintain code forever when the code rot is completely beyond your control, let alone to expect this from scientists who have better things to do. Anything with a GUI is bound to self-destruct, for example, and it's not the programmer's fault. Blame the OS makers and framework/3rd-party library suppliers.

The damage can be limited by choosing a programming language that provides good long compatibility. Languages like ANSI C, Ada, CommonLisp, and Fortran fit the bill. There are many more. Heck, you could use Chipmunk Basic. Anything fancy and trendy will stop working soon, though, sometimes even within a year.


> CommonLisp

Common Lisp has fantastic long-term stability. I think that deserves more recognition, as Common Lisp is often almost as fast as C, but is (by default) not riddled with undefined behavior.

It would be superb if Rust could take C's space in computational science and libraries.


"Python 2.7 puts “at our disposal an advanced programming language that is guaranteed not to evolve anymore”, Rougier writes1." Oh no. That's not at all what was intended. Regarding my own research: I'm doing theoretical biophysics. Often I do simulations. If conda stays stable enough, my code should be reproducible. There's however some external binaries(like lammps) I did not turn into a conda package yet. There's no official package that fits my use-case in conda since compilation is fine-grained to each user's needs.


I added different variants of lammps to a Guix channel we maintain at our institute:

https://github.com/BIMSBbioinfo/guix-bimsb/blob/master/bimsb...

Thankfully, Guix makes it easy to take an existing package definition and create an altered variant of it.


Would an abandoned project I wrote 10 years ago still run? The code is probably fine, but getting it to actually run by linking up whatever libraries, SDKs and environment correctly could be troublesome. Even a small pipeline I wrote a few weeks ago I had trouble re-running, because I forgot there was a manual step I had to do on the input file.

Expecting more rigid software practices of scientists than of software engineers would be wrong. I don't think they should have to tangle with this; tools should aid them somehow.


If the same project had been packaged with Nix, it would probably still compile. People regularly checkout older versions of nixpkgs to get access to older package releases.

One of the key properties is that the build system enforces that all the build inputs are declared. The other is keeping a cache of all the build inputs, like sources, because upstream repositories tend to disappear over time.


It's interesting that it's often easier to get something 25+ years old running because I need fewer things. Not so hard to find, say, DosBox and an old version of Turbo Pascal.


When I was in my 20s I managed to get a contract updating some control software for a contact lens company on the basis of my happening to own an old copy of Borland C++ 1.0.


Had a similar experience getting a contract updating a mass spectrometer control system because I had extensive high school experience in Turbo Pascal.


This. In the last years, conventional software engineering has in many cases experienced an explosion in complexity, which will make it very, very difficult to maintain stuff in the long run. This only works because over 90% of startups go bust anyway, within a few years.


Sounds like simplicity for the win.

The complex house of cards we currently stand on seems fragile by comparison.


We also benefit, for that old stuff, from enthusiasts that build cool stuff. Like DosBox, Floppy Emulators, etc.

I doubt there are going to be folks nostalgic for the complex mess we have now.


Indeed. I participate in Atariage.com and the level of dedication is amazing.

Are there groups for Win 3.1, Win95?


I had an 18-year-old Python script. But it didn't work! And I couldn't make it work! Fortunately I had an even older version of the code in Perl, which did work after some very minor changes.

This wasn't scientific code. It was some snarly private code for generating the index for a book and I didn't look at it between one edition and the next. I hope I don't have to fix it again in another 18 years.

Applying some version of the "doomsday argument", Perl 5 might be a good choice if you're writing something now that you want to work (without a great tower of VMs) in 10 or 20 years' time. C would only be a reasonable choice if you have a way of checking that your program does not cause any undefined behaviour. A C program that causes undefined behaviour can quite easily stop working with a newer version of the compiler.


The day when code used to produce a paper must also be published can not come soon enough.


In all my papers the results were produced on multiple days (spanning months), with multiple versions of the code, and they are computationally too expensive to reproduce with the final version of the code. I'm trying to keep track of all the used versions, but given that there is no automated framework for this (is there?) and research involves lots of experiments, it's never perfect. Given this context, any ideas how to do it better?


My first thought: Demand the journals provide hosting for a code repo that is part of your paper. For every numerical result, specify the version (e.g. a git tag) used to generate your result.

And if that means scientists need to learn about version control, well... they should if they're writing code.


For a paper I recently submitted, the journal demanded a github release of the software.


I agree, except that AFAIK "tags" in git are not fixed, they can be deleted and re-created to point at a different commit. Hence I prefer to use (short) commit IDs, since changing them is infeasible.


I'm assuming the repo, once hosted and the paper is published, is "fixed" and cannot be changed by the authors.

But commit ids work just as well.


I should have mentioned that I of course use Git. But the need to manually keep track of the calculation~commit pairing is tiresome and error-prone.


That's no different than normal software engineering. We use version control software (VCS, like git) to deal with it. You can include your results in the tracked source.

For what it's worth, using results from outdated source code is extremely suspicious. This is a frequent problem in software development where we have tests or benchmarks based on stale code, and it's almost always incorrect. I would not trust your results if they are not created with the most up to date version of your software at all.
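
For the calculation~commit pairing specifically, one option is to have every run stamp its own output with the commit that produced it, so nothing has to be tracked by hand. A minimal sketch in Python (the helper names and JSON layout are made up):

  import json
  import subprocess
  from datetime import datetime, timezone

  def git_state():
      """Return the current commit and whether the working tree was dirty."""
      commit = subprocess.run(
          ["git", "rev-parse", "HEAD"],
          capture_output=True, text=True, check=True,
      ).stdout.strip()
      dirty = subprocess.run(
          ["git", "status", "--porcelain"],
          capture_output=True, text=True, check=True,
      ).stdout.strip() != ""
      return {"commit": commit, "dirty": dirty}

  def save_result(path, payload):
      """Write results together with the commit that produced them."""
      payload["_provenance"] = {
          "timestamp": datetime.now(timezone.utc).isoformat(),
          **git_state(),
      }
      with open(path, "w") as f:
          json.dump(payload, f, indent=2)

  # e.g. save_result("runs/run_042.json", {"energy": -1.2345})

Refusing to save when the tree is dirty is an obvious extension, and it removes the "which version produced this number?" question entirely.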


I tend to do the following (some or all, depending on the situation):

- Use known, plaintext formats like LaTeX, Markdown, CSV, JSON, etc. rather than undocumented binary formats like those of Word, Excel, etc.

- Keep sources in git (just a master branch will do)

- Write all of the rendering steps into a shell script or Makefile, so it's just one command with no options

- I go even further and use Nix, with all dependencies pinned (this is like an extreme form of Make)

- Code for generating diagrams, graphs, tables, etc. is kept in git alongside the LaTeX/whatever

- Generated diagrams/graphs/tables are not included in git; they're generated during rendering, as part of the shell-script/Makefile/Nix-file; the latter only re-generate things if their dependencies have changed

- All code is liberally sprinkled with assertions, causing a hard crash if anything looks wrong

- If journals/collaborators/etc. want things a certain way (e.g. a zip file containing plain LaTeX, with all diagrams as separate PNGs, or whatever) then the "rendering" should take care of generating that (and make assertions about the result, e.g. that it renders to PDF without error, contains the number of pages we're expecting, that the images have the expected dimensions, etc.)

- I push changes from my working copies into a 'repos' directory, which in turn pushes to my Web server and to github (for backups and redundancy)

- Pushing changes also triggers a build on the continuous integration server (Laminar) running on my laptop. This makes a fresh copy of the repo and tries to render the document (this prevents depending on uncommitted files, the absolute directory path, etc.)

Referencing a particular git commit should be enough to recreate the document (this can also be embedded in the resulting document somewhere, for easy reference). Some care needs to be taken to avoid implicit dependencies, etc. but Nix makes this much easier. Results should also be deterministic; if we need pseudorandom numbers then a fixed seed can be used, or (to prove there's nothing up our sleeves) we can use SHA256 on something that changes on each commit (e.g. the LaTeX source).
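
For the "nothing up our sleeves" seed, a minimal sketch in Python (the file name is just an example): derive the seed from the SHA-256 of the LaTeX source, so it changes with every commit yet is fully reproducible from the repository.

  import hashlib
  import random

  def seed_from_file(path):
      """Derive a deterministic RNG seed from the SHA-256 of a source file."""
      with open(path, "rb") as f:
          digest = hashlib.sha256(f.read()).hexdigest()
      return int(digest[:16], 16)  # use the first 64 bits of the digest

  # Tie the pseudorandomness to the current state of the manuscript.
  rng = random.Random(seed_from_file("paper.tex"))
  sample = [rng.gauss(0.0, 1.0) for _ in range(5)]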

For computationally-expensive operations (with relatively small outputs) I'll split this across a few git repos:

1) The code for setting up and performing the experiments/generating the data goes in one repo. This is just like any other software project.

2) The results of each experiment/run are kept in a separate git repo. This may be a bad idea for large, binary files; but I've found it works fine for compressed JSON weighing many MBs. Results are always appended to this repo as new files; existing files are never altered, so we don't need to worry about binary diffs. There should be metadata alongside/inside each file which gives the git commit of the experiment repo (i.e. step 1) that was used, alongside other relevant information like machine specs (if it depends on performance), etc. This could be as simple as a file naming scheme. The exact details for this should be written down in this repo, e.g. in a README and/or a simple script to grab the relevant experiment repo, run it, and store the results+metadata in the relevant place. Results should be as "raw" as possible, so that they don't depend on e.g. post-processing details, or choice of analysis, etc.

3) I tend to put the writeup in a separate git repo from the results, so that those results can be referenced by commit + filename, without a load of unrelated churn from the writeup. This repo will follow the same advice as above, e.g. code for turning the "raw" results into graphs, tables, etc. will be kept here and run as part of the rendering process. Fetching the particular commit from the results repo should also be one of the rendering steps (Nix makes this easy, or you could use a git submodule, etc.)

I don't know what the best advice is w.r.t. large datasets (GBs or TBs), but I've found the above to be robust for about 5 years so far.


Arguably, data is just as important. Academics hoard their data and try to milk out every paper they can from it. The reward system is based on publishing as many papers as possible rather than just making a meaningful contribution.


Data is much trickier, because your data sources for medical, educational or even just regular business data don't want the added legal weight of making the data freely available.

This is obviously a shame. I was working on segmentation of open wounds, and most papers include a "we are currently in talks with the hospital to make the data available". If you contact the authors directly they will tell you that their committee blocked it because the information is too sensitive.


It seems like there can be a balance between "the results are unverifiable because no one else can touch the data" and "effectively open-source the dataset"?

Something like: "To make it easier to verify the code behind this paper, we've used <accepted standard project/practice> to generate a synthetic dataset with the same fields as the original and included it with the source code. The <data-owning institution> isn't comfortable with publishing the full dataset, but they did agree to provide the same data to groups working on verification studies as long as they're willing to sign a data privacy agreement. Send a query to <blahblahblah> ..."


> but they did agree to provide the same data to groups working on verification studies as long as they're willing to sign a data privacy agreement. Send a query to <blahblahblah> ..."

This would be administrative overhead; it will be shut down 9 times out of 10. I understand why this might seem easy, but it really is not: you can have multiple hospitals that each have their own committee that agreed to give the researcher their data. They don't have a central authority that you can appeal to, much less someone who can green-light your specific access.

As for the synthetic datasets that's basically just having tests and was advocated for elsewhere in this thread.


> This would be administrative overhead

I didn't say it was easy--I just said it struck a balance relative to trying to openly publish the whole dataset. Yes. Obviously comes with administrative overhead. So did dealing with the initial researcher. If the institution can manage the one, it can manage the other.

> As for the synthetic datasets that's basically just having tests

An appropriate synthetic dataset would inevitably be part of a great test suite, but it's also pretty simple to write narrow unit tests that embed rather than stretch the same assumptions and biases that are also in the code (i.e., simple enough that even people who code for a living do it).

An independent project/practice for synthesizing sample datasets from the real dataset lowers the bar and clarifies the best-practice for releasing a dataset that a verifier could actually use to spot simple bugs, edge-cases, and algorithm issues. Ideally, yes, this practice nudges the researchers to bother running their program over generated sample datasets as well, and to pay attention to whether the results make sense.
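
As a sketch of what that lower bar could look like (the column names and distributions below are entirely made up, not tied to any real dataset), here is a short Python script that emits a synthetic CSV with the same fields, so a verifier can at least exercise the code path end to end:

  import csv
  import random

  # Hypothetical schema mirroring the real (private) dataset's fields.
  FIELDS = {
      "age": lambda r: r.randint(18, 95),
      "wound_area_cm2": lambda r: round(r.lognormvariate(1.0, 0.6), 2),
      "treatment_group": lambda r: r.choice(["A", "B"]),
      "healed_in_30_days": lambda r: r.choice([0, 1]),
  }

  def synthesize(path, n_rows=500, seed=0):
      """Write a synthetic CSV with the same columns as the real data."""
      rng = random.Random(seed)
      with open(path, "w", newline="") as f:
          writer = csv.DictWriter(f, fieldnames=list(FIELDS))
          writer.writeheader()
          for _ in range(n_rows):
              writer.writerow({k: gen(rng) for k, gen in FIELDS.items()})

  synthesize("synthetic_sample.csv")

A realistic synthesizer would preserve marginal distributions and some correlations, but even a crude one like this catches missing-column bugs, type errors, and off-by-one edge cases long before anyone needs to sign a data agreement.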


The reward system also prevents dead ends from being identified, publication of approaches that did not lead to the expected results or got nul results, publishing confirmations of prior papers, etc.

Basically, the reward system is designed to be easy to measure and administer, but is not actually useful in any way to the advancement of science.


It won't happen until researchers are forced to do it. Please sign petition at https://publiccode.eu and have a look at my other comment here.


Making this mandatory might have bad downstream effects like prohibiting publication of some research at all (GPT-X I am looking at you)


Closed source research isn't publication, it's advertisement.


So R&D is not a thing, but A&D is? That would be new to me


I agree, with the caveat: publicly funded research.


> Today, researchers can use Docker containers (see also ref. 7) and Conda virtual environments (see also ref. 8) to package computational environments for reuse.

Docker is also flawed. You can perfectly reproduce an image today, but what about in 10 years? I can barely go back to our previous release for some Dockerfiles.


Similarly, conda envs can break in weeks due to package changes.

Even if you remove build versioning and all transitive dependencies from your env (making it less reproducible...) they will break pretty damn quick.


Guix is arguably better.


You often run into code of the "just get it to work" variety, which has the problem that when it was written, maintainability was bottom of the list of priorities. Often the author has a goal that isn't described in terms of software engineering terms: calculate my option model, work out the hedge amounts, etc.

And the people who write this kind of code tend not to think about version control, documentation, dependency management, deployment, and so forth. The result is you get these fragile pieces holding up some very complex logic, which takes a lot of effort to understand.

IMO there should be a sort of code literacy course that everyone who writes anything needs to do. In a way it's the modern equivalent of everyone who writes needing to understand not just grammar but style and other writing related hygiene.


The fundamental problem here, as you note, is that scientists are rarely also engineers, and don't really share our desiderata. The point is to develop and publish a result, and engineering analysis code for resiliency is of secondary concern at best when that code isn't likely to need to be used again once the paper is finished.

The "Software Carpentry" movement [1] has in the past decade tried to address this, as I recall. It's very much in the vein of the "basic literacy" course you suggest. I can't say how far they've gotten, and I'm no longer adjacent to academia, but based on what I do still see of academics' code, there's a long way still to go.

[1] https://software-carpentry.org/


Nah.

The fundamental problem is that scientific code is produced by entry-level developers:

1. Paid below-market wages

2. With no way to move up in the organization

3. With lots of non-software responsibilities

4. With an expectation of leaving the organization in six years

As long as the grunt work of science is done by overworked junior scientists whose careers get thrown to the wolves no matter what they do, you're not going to get maintainable code out of it.


I mean, senior researchers in stable roles don't really do any better. Just to pick the first example off the top of my head - one of the investigators I worked with, during my year as a staff member of an academic institution most of a decade ago, is also one of my oldest friends; he's been a researcher there for what must be well past ten years by now. Despite one of his undergrad degrees being actually in CS, I still find ample reason whenever I see it to give him a hard time about the maintainability of his code.

Like I said before, it's a field in which people really just don't give a damn about engineering. Which is fair! There's little reason why they should, as far as I've ever been able to see.


Even more fundamental is that there is no maintenance budget for important scientific libraries and tools. Somebody wrote them as part of their job, and the person who wrote them is now working somewhere else.


And that scientists also are rarely supported by programmers, or if they are it's an unstable and unappreciated position.


Being in such a position, I can say that I am appreciated, but not in a manner that results in job stability and promotion. It's a massive problem in academia, and there's an attempt to get the position recognised and call it "Research Software Engineer", with comparable opportunities for promotion and job stability as a researcher. However, it's not going massively well. Academic job progression is still almost completely purely based on the ability to get first or last author papers in top journals. I have lots of papers where I am a middle author, because I wrote the software that did the analysis that was vital for the paper to even exist, but it largely doesn't count. And I'm lucky - many software engineers don't even get put in as a middle author on the paper they contributed to.


Having had that exact experience - yeah, that can be a big problem too.

Researchers and engineers can work really well together, because the strengths of each role complement the weaknesses of the other, and I think it would be very nice to see that actually happen some day.


It doesn't help with the issue of hard-to-reproduce work, but apparently working for a company making products aimed at scientists can be a place to see this happen (if the company is good about talking to customers).


Interesting, thanks! I'll keep that in mind for when I'm next looking for a new client.


I've seen job listings for "scientific programmers" where what they're asking for is a scientist who happens to know a little programming.


Yeah - who then likely doesn't have that much software experience, and worse, if they want to stay a scientist such a role is often a bad career move, because they help others get ahead with their research instead of publishing their own work. Even if they build some really great domain-specific software tool in that role, it often doesn't count as much.

Or it's an informal thing done by some student as a side-gig. Which can be cool, but is not a stable long-term thing.

I hope there's exceptions.

EDIT: weirdest example I've seen was a lab looking for sysadmins with PhD preferred. I wonder if they had some funding source that only paid for "scientists" or what was going on there...


Simple answer for that. University pay scales tend to be fairly inflexible in terms of which grades you are eligible for without a PhD, if you are counted as academic staff. If you're non-academic staff (like the cleaner, the receptionist, and the central IT sysadmin) then you can be paid a fair wage based upon your experience, but if you are academic staff, then you have a hard ceiling without a PhD. An individual research group with a grant may only be able to hire academic staff, but they want a sysadmin, so in order to be able to pay them more than a pittance they would have to have a PhD.


Unit testing, readability, version control, documentation, etc. are all engineering practices for the purpose of keeping ongoing development organized (especially for teams).

Why would a researcher need to do this, when in most cases all that they use is the output, and in CS/math it's only a minimal prototype demonstrating operation of their principle?

All of the other stuff would certainly be nice, but they don't need to adopt our whole profession to write code


Even with all the best practices, things outside your control can cause issues. A lot of the code that software engineers write is subject to tiny bits of continual maintenance as small changes in the runtime environment take place. Imagine ten years of those changes deployed all at once. Even something employing all the best practices of ten years ago could be a challenge. You've got a subversion repository somewhere with the code which was compiled to run on Windows XP with Windows Visual Studio C++ 2008 Express but you've abandoned Windows for Linux. If you're lucky the code will compile with the appropriate flags to support C++98 in gcc, but who knows? And maybe there's a bunch of graphical stuff that isn't supported at all anymore or a computational library you used which was only distributed as a closed-source library for 32-bit Windows.


Scientific programming is a perfect storm of extremely smart people with strong do-it-yourself abilities, disdain for the subject matter, and no direct experience with the price of failing to write portable code. In some circumstances, even parameterizing scripts so that they aren't re-edited with new values for every experiment is an uphill fight, never mind having promotion through environments.


Very related to this, see also Hinsens blog post: http://blog.khinsen.net/posts/2017/11/16/a-plea-for-stabilit...

I think that GNU Guix is extremely well-suited to improve this situation.

Also, one could think this is an academic problem, in the sense of an otherwise unimportant niche problem. It really isn't; it is just that, as with many other topics, academics are confronted with this issue first. I am sure that in many medium or large companies there are some Visual Basic or Excel code bases which are important but could turn out extremely hard to reproduce. This issue will only get more pressing with today's fast-moving ecosystems, where backward compatibility is more a moral ideal than an enforced requirement.

It is well known that ransomware can wipe-out businesses if critical business data is lost. But more and more businesses and organizations also have critical, and non-standard, software.


Guix is one of several solutions that has been touted as a solution. Another one that is quite popular in HPC circles is Spack (https://spack.readthedocs.io/en/latest/).

At my institute, we actually tried out Spack for a little bit, but consistently felt like it was implemented more as a research project rather than something that was production-level and maintainable. In large part, this was due to the dependency resolver, which attempts to tackle some very interesting CS problems I gather (although this is a bit above me at the moment; these problems are discussed in detail at https://extremecomputingtraining.anl.gov//files/2018/08/ATPE...), but which produces radically different dependency graphs when invoked with the same command across different versions of Spack.

I've since come to regard Spack as the kind of package manager that science deserves, with conda being the more pragmatic/maintainable package manager that we get instead. Spack/Guix/nix are the best solution in theory, but they come with a host of other problems that made them less desirable.


> Spack/Guix/nix are the best solution in theory, but they come with a host of other problems that made them less desirable.

I would be quite interested to learn more about what these problems are, in your experience. I've only tried Guix (on top of Debian and Arch), and while it is definitely more resource-hungry (especially in terms of disk space), I don't perceive it as impractical.


As someone coming from the computing side of things, I found nix to be quite difficult to grok enough to write a package spec, and guix was pretty close, at least in part because of the whole "packages are just side-effects of a functional programming language" idea. At least nix also suffers from a lot of "magic"; if you're trying to package, say, an autotools package then the work's done for you - and that's great, right up until you try to package something that doesn't fit into the existing patterns and you're in for a world of hurt.

Basically, the learning curve is nearly vertical.


> guix was pretty close, at least in part because of the whole "packages are just side-effects of a functional programming language" idea

This must be a misunderstanding. One of the big visible differences of Guix compared to Nix is that packages are first-class values.


You're right; on further reading I can see guix making packages the actual output of functions. I do maintain that the use of a whole functional language to build packages raises the barrier to entry, but my precise criticism was incorrect.


I can only speak to Spack in particular, but the main issue that I found with it was balancing researcher expectations for package installation speed with compile times. For most packages, compile times aren't a huge problem, but compilers themselves can take days to build, and it isn't unheard of for researchers to want a recent version of gcc for some of their environments.

In theory this isn't an issue with Spack (assuming that you have a largely homogeneous set of hardware or don't use CPU family-specific instruction sets), since you can set up cached, pre-compiled binaries on a mirror server (similar to a yum repo) and have people install from there.

Spack, however, has a lot of power/complexity. A lot of untamed power that means that bugs can sometimes be more likely than in other, more mature (or mature-ish) package managers. Namely, Spack allows you to not only specify the version number of a package, but also the compiler that you use to make that package, specific versions of dependencies that you want to use, which implementation of an API you want to use (i.e., MPICH or OpenMPI for MPI), and compiler flags for that package. When you run an install command / specify what you want to install, Spack then performs dependency resolution and "concretizes" a DAG that fulfills all of the constraints.

The issue that I ran into was that if you don't specify everything, Spack makes decisions for you about which version of a dependency, which compiler, etc to use (i.e., it fills in free variables in a space with a lot of dimensions). This would be great and dandy normally, although the version of Spack that I used occasionally constructed totally different graphs for the same "spack install gcc" command (if I recall correctly; take all of this with a grain of salt b/c I might be misremembering). This meant that it wouldn't use cached versions of gcc that had already been built, and ended up rebuilding minor variants of gcc with options I didn't care about.

At National Labs and larger outfits, the trade-offs between this kind of complexity/power and the accompanying bugginess (Spack has yet to hit 1.0) seem to favor complexity/power while accepting these sorts of bugs, but I don't work at a larger outfit and my group didn't need that level of power/control over dependencies and rather needed something that "just worked" and would allow researchers to be able to install packages independently of us (IT people). conda (mostly) fit the bill for this. I still think that Spack is the future and it has a special place in my heart, but it will have to be more stable for me to want to use it in production.


In addition - does your ten-year-old protocol still work? Do your 10-year-old results replicate? This isn't isolated to just programming - making robust and reproducible tools, code, equipment, protocols, and results is undervalued across all areas of research, leading to situations where published protocols weren't robust, so a change in reagent supplier leads to failure, or to protocols so dependent on weird local or unreported environmental conditions or random extra steps that attempting to replicate them leaves you nowhere. Robustness needs to be improved in general.


I favor open code, but like everything, there are issues. For example, the EPA years ago required that research can only inform policy when data is open; open data, however, takes a lot of effort to document and provide. Companies, however, with vested interest in EPA policy can easily produce open (and often very biased) data.

Requirements for open code can lead to similar issues—what happens when a government agency rejects the outcome of a supercomputer simulation because the code wasn't documented well enough? What happens when those with vested interests are the ones best able to produce scientific code?

Scientists already wear many hats. Any shift in policy and norms needs to consider that they have limited time, a fact that can have far-reaching consequences.


The point of being able to run ten-year-old code is the ability to replay an analysis (exact replication). This allows an analysis to be verified after the fact, which increases trust and helps figure out what happened when contradictions appear between experiments. However, if the original work involved physical experimentation or any non-automated steps (as is the case for most science) the ability to run the original code provides only partial replication. Overall the ability to re-run old code is a fairly low priority.

From the perspective of someone who primarily uses computers as a tool to facilitate research, the priority list is closer to:

1. Retain documentation of what was meant to happen. Objectives, experimental design, experimental & analysis protocols, relevant background, etc.

2. Retain documentation of what actually happened, usually in terms of noting deviations from the protocol. This is the purpose of a lab notebook. Pen & paper excels here.

3. Retain raw data files.

4. Retain files produced in the course of analysis.

5. Retain custom source code.

6. Version control all the above.

7. Make everything run in the correct order with one command (i.e, full automation).

Only once all the above is achieved would it be worth ensuring that the software used in the analysis can be re-run in 10 years. Solving the "packaging problem" in a typical scientific context (multiple languages, multiple OSes, commercial software, mostly short scripts) is complex. When the outcome of an analysis is suspect, the easiest and most robust approach is to check the analysis by redoing it from scratch. This takes less time than trying to ensure every analysis will run on demand even as the computing ecosystem changes out from under it.

Most of the time spent writing analysis code is deciding what the code should do, not actually writing the code. There is generally very little code because few people were involved, and they probably weren't programmers. So redoing the work from scratch is generally pretty easy, especially for anyone with the skill to routinely produce fully reproducible computational environments.


An excellent article full of good suggestions. I appreciated that it's less certain of the Best Practices TM than many comments on this subject. I am curious how the goals/techniques for reproducibility change with the percentage of software/computational work that a scientific project contains. It feels like as the percentage of a paper's ultimate conclusions that are computationally derived increases, the importance of strict "the tests pass and the numerical results are identical" reproducibility also increases. Most of my projects are mixed wet-lab/dry-lab - a fair amount of custom code is required, but it's usually less than 50% of the work. When I'm relying on other papers that have a similar mix of things, I'm often not interested if the continuous integration tests of their code pass. I am more interested in understanding well the specific steps they take computationally and in a sensitivity analysis of their computational portion (if you slightly alter your binning threshold do you still get that fantastic clustering?). I believe this is because in my field (microbiology), computational tools can guide, but physical reality and demonstrated biology are the only robust evidence of a phenomenon/mechanism/etc. For most research I do not demand tests of all the analytical pieces they are relying on (was their incubator actually set to 37C? was the pH of the media +- 0.2? etc) - I trust they've done good science. Why would I demand their code meet a higher standard?


Any scientist with good foresight would've implemented their code in 6502 for the NES. The emulators are nearly flawless and will probably be around until the end of time.


I once had a thought, that if I wanted to write something that would last forever and run anywhere, I should write it to target DOS, and make sure to test it on FreeDOS in a VM and on DOSBox. That way it would run on a stable ABI with loads of emulators, and via DOSBox it will happily run on all modern desktop OSs (and some non-desktops; IIRC there's at least an Android port).


Glad to see many mentions of Nix in this thread!

I wonder if Nix and Guix should standardize the derivation format both share, to establish it as the agreed-upon "thin waist" that other projects and academia can standardize around.


The derivation format is little more than a compilation artifact (a low-level representation of a build), and I think standardizing on it would not be as useful as it may seem.


Exactly, it's mostly boring, just like IP packets in isolation are boring. Great thing to standardize.

Also, we're working on making them a bit less boring :). I've been working on adding hashing schemes compatible with IPFS's IPLD. Also, we can make derivations that produce derivations, making a sort of recursive Nix that encourages more up-front planning than crude "nix-build inside derivations".

I hope Guix would want some of this stuff too.


As a scientist I've written massive amounts of shitty code that turned out to be reproducible by lucky accident. Part of the problem is the tools: depending on the field, scientists either use Matlab, C++, Fortran or some other framework that needs to die. They base their code on other ancient code that runs for unknown reasons, and use packages written by other scientists with the same problems.

As someone who's transitioning into industry, I can tell you that scientists will never adopt software engineering principles to any significant extent. It takes too much time to do things like write tests and thorough documentation, learn Git, etc., and software engineering just isn't interesting to most of them.

So the only alternative I see is changing the tools to stuff that's still easy to hack around with but where it's harder to mess up (or it's more obvious when you do so). That doesn't leave a ton of options (that I can see). Some I can think of are:

- Make your code look more like math and less like mathlib.linalg.dot(x1, x2).reshape(a, b).mean().euclidean_distance((x3, x4)) + (other long expression) or whatever: Use a language like Julia

- Your language/environment gets angry when you write massive hairballs, loads of nested for-loops and variables that keep getting changed: Use a language like Rust, and/or write more modular code with a functional-leaning language like Rust or Julia.

- You're forced to make your code semi-understandable to yourself and others more than an hour after writing it: forcing people to write documentation isn't going to work, for the most part. Forcing sensible variable names is slightly more realistic. More likely, you need some combination of the above two things that just makes your code more legible (see the sketch below).

How do you make that happen? No idea.
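
As a minimal sketch of that last point, in Python (all names here are hypothetical, not taken from any real codebase): the same calculation written as an anonymous hairball versus as small, named steps.

  import numpy as np

  # Hairball style: one anonymous expression that is hard to sanity-check.
  #   result = np.dot(x1, x2).reshape(a, b).mean(axis=0) / np.linalg.norm(x3 - x4)

  # Legible style: each scientific idea gets a name and a one-line docstring.
  def trial_responses(signal, weights, shape):
      """Project the signal onto the weights and reshape into trials x channels."""
      return np.dot(signal, weights).reshape(shape)

  def normalized_mean_response(signal, weights, shape, ref_a, ref_b):
      """Mean response per channel, normalized by the distance between two references."""
      responses = trial_responses(signal, weights, shape)
      return responses.mean(axis=0) / np.linalg.norm(ref_a - ref_b)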


Julia could be a big win, not just because of the notation, but because dependency control is a first-class language feature. Also, the Lispy-ness of Julia lets you do things like Latexify expressions.

To quote someone upthread: https://news.ycombinator.com/item?id=24260590

> Everyone just put together a few text files and Python or MATLAB scripts that output some numbers that went into Excel or gnuplot scripts that got copy-pasted into LaTeX documents with suffixes like "v2_final_modified.tex", shared over Dropbox.

It would be amazing to have an environment that could handle the entire workflow. Not everybody has time to make an executable thesis like this person did:

https://github.com/4kbt/PlateWash


Code written in Oak still works in Java 14. You can still write `public abstract interface BlaBla{}` and it still works. If it doesn't run (due to reflection safety changes in Java 9), it will still surely compile with a newer compiler.

Another thing: are the tools used to compile still available? I tried to compile my BSc Android + native OpenCV project and failed quickly. Gradle removed some plugin for native code integration; another plugin was no longer maintained, had an internal check for the Gradle version that said "I'm designed to work with gradle >= 1.x < 3.x", and just refused to run under 6.x... I would have to fork that plugin and make it work with newer Gradle, or find a replacement. I was obviously too lazy and stopped working on that project before I even started.

I'm sure that if I had put more effort into making the build process reproducible, it would work effortlessly, but I didn't care at that point. I wrote it using a beta release of OpenCV that's also no longer maintained, because there are better, faster official alternatives available.


Or use the old version of Gradle? It sounds like creating a vm/container/whatever with the old versions of everything is the fastest path, although I understand not wanting to do it after some point.


>Or use the old version of Gradle?

Intellij / Android Studio are nagging me to update gradle


I mean, Python 2->3 alone is gonna kill this challenge for most people.


You can always run old Python2 stuff in a Docker container, so long as the dependencies haven't disappeared.
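
A minimal sketch of that approach, assuming the official python:2.7 base image is still pullable and that the project ships a requirements.txt and a top-level script (file names are placeholders):

  FROM python:2.7-slim
  WORKDIR /app
  # Install pinned dependencies first so the layer can be cached.
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY analysis.py .
  CMD ["python", "analysis.py"]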


As long as it doesn't use CUDA hardware through TensorFlow and Numba, where Numba depends on a version of llvmlite that no longer supports Python 2...

This isn't a theoretical example.


Then you make a vm or whatnot and install all the old versions of everything. I haven't seen an open source project in a while that doesn't have old versions for download easily. Still, annoying if you can't stand that kind of stuff.

(people seem to be in two camps: either they hate it or have almost no problem with it)


To clarify, the issue is that the old version of software won't work with the new libraries, and the old libraries won't work with the current GPU models, so you can't run the old code without modification unless you have old hardware as well, and you can't virtualize the GPUs.


Well, where do you download the hardware ? ;-)


Most of the "requirements.txt" I come across in the real world do not actually lock down all deps to Python 2.7 compatible versions. I've been able to get most of them running again, but it's a long porcess looking through changelogs to find the last 2.7-compatible version of each dependency.


Yes, because the "requirements.txt" is a dependency requirements file and not a lockfile. It took the Node.js ecosystem an embarrassingly long time to arrive at that insight, and I feel like the Python ecosystem/community still isn't there yet (though finally it's easily usable with Poetry).
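
To make the distinction concrete (package names and versions below are only illustrative): a loose requirements file states intent, while a frozen one records the exact environment that produced the results, e.g. the output of `pip freeze` at the time of the run.

  # requirements.txt as usually written: constraints, not a snapshot
  numpy>=1.11
  scipy

  # what exact replay needs: every dependency pinned to the version actually used
  numpy==1.16.6
  scipy==1.2.3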


This is why backwards compatibility is important. Many people have a problem when this is raised as a primary concern and goal of the C and C++ languages, but it is a must have feature.


A well-written Nix package should be buildable at any point in the future, producing near-identical results. This is why I sometimes publish Nix packages for obscure & hard to build pieces of software that I'm not likely to maintain - because it's like rescuing a snapshot of them from oblivion.


Mine does.

I swear 40% of the idiocy of science code is because people fundamentally don’t understand how file paths work. Stop hardcoding paths to data and the world gets better by an order of magnitude.
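
A small sketch of the alternative in Python (script and argument names are made up): take the data location as an input instead of baking it in.

  import argparse
  from pathlib import Path

  # Instead of: data = open("C:/Users/alice/Desktop/run3/final_v2.csv")
  parser = argparse.ArgumentParser(description="Run the analysis on one data file.")
  parser.add_argument("data_file", type=Path, help="path to the input CSV")
  args = parser.parse_args()

  with args.data_file.open() as f:
      for line in f:
          pass  # analysis goes here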


The Fossil documentation has this gem:

> "The global state of a fossil repository is kept simple so that it can endure in useful form for decades or centuries. A fossil repository is intended to be readable, searchable, and extensible by people not yet born."

I always liked that they planned for the long-term. Keeping that in mind helps you build systems that will work in 10 years, or in 100, if it happens to last that long. When you are building a foundation, like a language or database, it is nice to plan for long term support since so much depends on it. C has stayed mostly recognizable over the years, much more so than C++ or other high level languages. When your design is simple, you can have a "feature complete" end.


We are building Nextjournal[0] exactly for this purpose.

It's a platform for interactive notebooks built on immutable and persistent storage (Datomic) and Docker:

- Changes to the document are automatically versioned to an immutable database (Datomic). Previous versions can be accessed and restored any time.

- Uploaded data or generated result files are automatically versioned in append-only content-addressed storage and can’t be accidentally overwritten.

- Changes to the file system state can be committed as Docker images. Reproducibility is ensured by referencing these images only by their immutable hashes.

[0]https://nextjournal.com


This was in the Guix-science mailing list today:

> Hello!

In an article entitled “Challenge to scientists: does your ten-year-old code still run?”, Nature reports on the Ten Years Reproducibility Challenge organized by ReScience C, led by Nicolas P. Rougier and Konrad Hinsen:

  https://www.nature.com/articles/d41586-020-02462-7

It briefly mentions Guix as well as the many obstacles that people encountered and solutions they found, including using Software Heritage and floppy disks. :-)

You can read the papers (and reviews!) at:

  https://rescience.github.io/read/#issue-1-ten-years-reproducibility-challenge

Ludo’.


With deep learning, using conda environments and environment modules to switch between various tensorflow, pytorch, cuda, and cudnn versions for reproducing/building off of others' results is essential.

It is sort of hilarious how many breaking changes there are even between minor versions of these.

Thank gosh I was able to install a 2 year old version of pytorch (0.4.1) today to reproduce some code from within the last year.


It's not academia but Kaggle that's really been at the forefront of building portable and reproducible computational pipelines.

The real key is incentives and there are two that standout to me:

- Incentive to get others to "star" and fork your code makes the coder compete to not only have an accurate result, but also prioritize producing code/notebooks that are digestible and instructive. That includes liberal commenting/markup, idiomatic syntax and patterns, diagnostic figures, and the use of modern and standard libraries.

- There is an incentive to move with the community on best practices for the libraries while still allowing experimental libraries. Traditionally, there is the incentive of inertia: e.g. "I always do my modelling in Lisp, and I won't change because then I'd be less productive". But with Kaggle, to learn from the insights and advances of others, you need the ability to work with the developing common toolset.

In academia, if these incentives were given weight on par with publication and citation then we'd see the tools and practices fall into place.


Let's not criticize people who release their code. Let's criticize the people who don't release their code instead. We don't need more barriers to releasing code.

I'd much rather fix someone's broken build than reimplement a whole research paper from scratch without the relevant details that seem to always be accidentally omitted from the paper.


Not to disagree with any points in the article, but I would point out that the sciences also have cases of very old code being maintained and used in production successfully. For example, we still use a kinematics code written in Fortran over half a century ago. In practice, parts of it get reimplemented in newer projects, but the original still sees use.


For my first scientific article, in 2007, I created a Subversion repo with a Makefile. Running `make` would recreate the whole paper: downloading data, running analyses, creating pictures (color or BW, depending on an environment flag) and generating the PDF.

I'm going to try to find the repo and see if it still works.


Wow, nice. I will be waiting :D


This is a challenge with many types of code.

Earlier this year it took me a weekend to get a 7 year old Rails project running again. It's a simple project but the packages it used had old system dependencies that were no longer available.

I ended up having to upgrade a lot of things, including code, just to get it running again.


I ran into this too. Rails has changed a lot in 7 years, even if you don't see it. My friend wanted to learn and somehow found the original demo/getting started page and was frustrated.


Would be super-useful to have a sciencecode.com service which is a long-term CI system for scientific code and its required artifacts. Journals could include references to sciencecode.com/xyz, and sciencecode.com/abc could be derived from sciencecode.com/xyz. Given GitHub Actions and forks, the only thing holding this back is scientists doing it (and, possibly, the HN community helping).

And I get that it's not fun to have your code publicly critiqued but it's also not fun to live lives based on (medical, epidemiological) unpublished, unaudited, unverified code...

EDIT: hell, just post a "HELP HN: science code @ github.com/someone/project" and I'd be surprised if you weren't overwhelmed with offers of help.


I wrote a tool to visualise algorithms for binary decision diagrams [1], also in an academic context, where the problem was basically the same: Does the code still run in ten years? In particular, the assumption is that I will not be around then, and no one will have any amount of time to spend on maintenance.

In the end, I chose to write it in C++ with minimal dependencies (only X11, OpenGL and stb_truetype.h), with custom GUI, and packed all resources into a single executable.

A lot of effort, but if it causes the application to survive 5x as long then it is probably worth spending twice the time.

[1] https://github.com/suyjuris/obst


I'm continuously surprised that Code Review isn't a part of the review process for journal acceptance. The majority of academic code for a given paper isn't particularly large - and the benefits are significant.


To throw in one more data point: several years ago I wrote a simulator for a paper, it involves tracking large numbers of particles bouncing through a 2D space with different geometries. There's a front end, in PLT Scheme, and a backend in C. I'm pretty sure that today

- the C code would still compile and run

- the Scheme code would not without some effort, because PLT Scheme 372 is no longer supported, Racket made a number of breaking changes, so one would either have to rewrite the Scheme code for Racket or rewrite it for another Scheme implementation.


The easiest way to preserve code for posterity is to wrap up the runtime environment in a VM. I can boot up a VM from 15 years ago (when I was in grad school) and it will run.

When you're writing code for science, preserving code for posterity is rarely a priority. Your priority is to iterate quickly because the goal is scientific results, not code.

(this is in fact, correct prioritization. Under most circumstances, though not all, most grad students who try to write pristine code find themselves progressing more slowly than those who don't.)


IMO, this is why ISO standard programming languages are so important and will be around forever. One can always compile with --std=c++11 (or whatever) and be certain it will work.


As the systems architect and infra programmer for a scientific startup I'll simply chime in on the production != scientific conversation. When you don't hold your modeling code to the minimal production standard where it counts (documentation, comments, debug) it _will_ cause your evolving team hardship. When that same code goes into production for a startup (as it could/should) you will be causing everyone long nights and 80 hour weeks.


GitHub offers a free tier for GitHub actions with 2,000 Actions minutes/month [1]. This could be useful:

1. write some unit tests which don't use too much compute resources (so you can stick to the free tier); see the sketch below

2. package your code into a docker where the tests can be run

3. wire up the docker with tests to GitHub Actions

This way you now have continuous testing and can make sure your code keeps running.
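
For step 1, even a couple of smoke tests go a long way. A sketch assuming a hypothetical `simulate` function in a module called `mymodel`:

  # test_model.py -- run with `pytest`
  import numpy as np
  from mymodel import simulate

  def test_same_seed_gives_identical_output():
      """Reproducibility: the same seed must give bit-identical results."""
      a = simulate(n_steps=100, seed=42)
      b = simulate(n_steps=100, seed=42)
      assert np.array_equal(a, b)

  def test_tiny_hand_checked_case():
      """A small case whose expected shape was checked by hand."""
      result = simulate(n_steps=1, seed=0)
      assert result.shape == (1,)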

References:

[1] https://github.com/pricing


> package your code into a docker

docker is not a general solution for this.

What is needed is a way to re-generate everything from source and from scratch.


Even if it broke, who would go back and fix it?

I do not see that happening, especially with complex library bugs.


I still use Windows binaries daily which I wrote and last modified over 20 years ago. I don't expect that to change in the next ten years either.


Code must be re-used or it rots.

Fortran was the first language I learned, and naming things was a problem.

Guix, mentioned in the fine article, or something like it, is the answer, as long as we keep reading and writing our code. This is why all scientific code should be public: we need continuous review.

Papers without code should be refused.

I'm in the middle of a storm in the west of Ireland. It's windy and wet and hn is still a thing. Happy out


Not a scientist but 13+ year old Perl code I wrote is still running (based on my catchup chats with ex-colleagues) to generate MIS reports.


I think code is remarkably persistent in the scheme of things. Try reproducing a wet lab experimental technique from 5 years ago.


I am not a scientist, but actually I think most of the code I wrote 10 years ago still is in production at different companies.


Companies with code in production have a short-term real incentive to keep that code running.

This is different from code from research projects, which in many cases is run just a few times, and which is often written by somebody who, if they want to make any kind of career in the field, has to move on to a new workplace and will not have any time to maintain that old code.

There are a few long-running major science projects, say in particle physics or astronomy, which are forced to work differently. And in those environments there are actually people with knowledge of both science and software engineering.


If it's still in production, it's most likely still getting some level of maintenance attention as well. When I was an undergrad I did some coding for some of the professors at the college. A lot of scientific programming is stuff that gets written and run once and never run again. Try dusting off some 10-year-old C++ and try compiling it with the current version of your compiler.


Last month I shut down a VB6 app that had been running since 1998 at the company I work for. The leadership team finally decided they didn't want to sell that particular feature anymore. We still have a handful of apps from around that time period that do various small tasks. One day it will be a priority to get rid of them.


Snark: No, I wrote it in Python 2.7, because Python 3 was still wearing very smelly diapers in 2010. Ask me in another 10 years and I'll tell you the same, because Python doesn't care about backwards compatibility. Hopefully I'll have moved on to a more stable language by then. Maybe fortran?


Forget 10 year old code. Try to get your 2 year old javascript + webpack + react set up running...


I'm not sure how interesting the question is, given how few software engineers outside academic sciences have 10-year-old code that still runs (unless they've maintained a dedicated hardware platform for it without regular software updates).


"Visual Basic," Maggi writes in his report, "is a dead language and long since has been replaced..."

In fact I still have the VB6 IDE installed on my primary workstation and use it for quick and dirty projects from time to time.


  CMake Error at /usr/share/cmake-3.18/Modules/FindQt4.cmake:1314 (message):
    Found unsuitable Qt version "5.15.0" from /usr/bin/qmake, this code
    requires Qt 4.x
Well fuck.


For those interested, the results of the challenge are published here: https://rescience.github.io/read/ (volume 6, issue 1).


"Ten Simple Rules for Reproducible Computational Research" http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fj... :

> Rule 1: For Every Result, Keep Track of How It Was Produced

> Rule 2: Avoid Manual Data Manipulation Steps

> Rule 3: Archive the Exact Versions of All External Programs Used

> Rule 4: Version Control All Custom Scripts

> Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

> Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

> Rule 7: Always Store Raw Data behind Plots

> Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

> Rule 9: Connect Textual Statements to Underlying Results

> Rule 10: Provide Public Access to Scripts, Runs, and Results
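
Rule 6 in particular costs almost nothing to follow; a minimal sketch in Python (the metadata file name is arbitrary):

  import json, random, time
  import numpy as np

  seed = 20200824          # an explicitly chosen value, not an implicit default
  random.seed(seed)
  np.random.seed(seed)

  # Record the seed alongside the results so the run can be replayed exactly.
  with open("run_metadata.json", "w") as f:
      json.dump({"seed": seed, "started_at": time.time()}, f)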

... You can archive a tag of a Git repo and get a free DOI for it with FigShare or Zenodo.

... re: [Conda and] Docker container images https://news.ycombinator.com/item?id=24226604 :

> - repo2docker (and thus BinderHub) can build an up-to-date container from requirements.txt, environment.yml, install.R, postBuild and any of the other dependency specification formats supported by REES: Reproducible Execution Environment Standard; which may be helpful as Docker Hub images will soon be deleted if they're not retrieved at least once every 6 months (possibly with a GitHub Actions cron task)

BinderHub builds a container with the specified versions of software and installs a current version of Jupyter Notebook with repo2docker, and then launches an instance of that container in a cloud.

“Ten Simple Rules for Creating a Good Data Management Plan” http://journals.plos.org/ploscompbiol/article?id=10.1371/jou... :

> Rule 6: Present a Sound Data Storage and Preservation Strategy

> Rule 8: Describe How the Data Will Be Disseminated

... DVC: https://github.com/iterative/dvc

> Data Version Control or DVC is an open-source tool for data science and machine learning projects. Key features:

> - Simple command line Git-like experience. Does not require installing and maintaining any databases. Does not depend on any proprietary online services. Management and versioning of datasets and machine learning models. Data is saved in S3, Google cloud, Azure, Alibaba cloud, SSH server, HDFS, or even local HDD RAID.

> - Makes projects reproducible and shareable; helping to answer questions about how a model was built.

There are a number of great solutions for storing and sharing datasets.

... "#LinkedReproducibility"


Open textual formats for data and open source application and system software (more precisely, FLOSS), are just as important.

Imagine that x86 - and with it, the PC platform - gets replaced by ARM within a decade. For binary software, this would be a kind of geological extinction event.


The likelihood of there being a [security] bug discovered in a given software project over any significant period of time is near 100%.

It's definitely a good idea to archive source and binaries and later confirm that the output hasn't changed with and without upgrading the kernel, build userspace, execution userspace, and PUT/SUT Package/Software Under Test.

- Specify which versions of which constituent software libraries are utilized. (And hope that a package repository continues to serve those versions of those packages indefinitely). Examples: Software dependency specification formats like requirements.txt, environment.yml, install.R

- Mirror and archive all dependencies and sign the collection. Examples: {z3c.pypimirror, eggbasket, bandersnatch, devpi as a transparent proxy cache}, apt-cacher-ng, pulp, squid as a transparent proxy cache

- Produce a signed archive which includes all requisite software. (And host that download on a server such that data integrity can be verified with cryptographic checksums and/or signatures.) Examples: Docker image, statically-linked binaries, GPG-signed tarball of a virtualenv (which can be made into a proper package with e.g. fpm), ZIP + GPG signature of a directory which includes all dependencies

- Archive (1) the data, (2) the source code of all libraries, and (3) the compiled binary packages, and (4) the compiler and build userspace, and (5) the execution userspace, and (6) the kernel. Examples: Docker can solve for 1-5, but not 6. A VM (virtual machine) can solve for 1-5. OVF (Open Virtualization Format) is an open spec for virtual machine images, which can be built with a tool like Vagrant or Packer (optionally in conjunction with a configuration management tool like Puppet, Salt, Ansible).

When the application requires (7) a multi-node distributed system configuration, something like docker-compose/vagrant/terraform and/or a configuration management tool are pretty much necessary to ensure that it will be possible to reproducibly confirm the experiment output at a different point in spacetime.
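
To make the checksum-verification point above concrete, a short Python sketch (archive and checksum file names are placeholders):

  import hashlib

  def sha256_of(path, chunk_size=1 << 20):
      """Stream the file through SHA-256 so large archives need not fit in memory."""
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  # Compare against the published digest (sha256sum-style "digest  filename" line).
  expected = open("analysis-archive.tar.gz.sha256").read().split()[0]
  assert sha256_of("analysis-archive.tar.gz") == expected, "archive does not match digest"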


I'm a grad student in biophysics - even if I wrote perfect code, it would almost certainly go obsolete in 10 years because the hardware that it interfaces with would go obsolete.


I visited my first employer recently (a local government) and found that the first MySQL/PHP database I created, an internal app, had been in continuous use for nearly 18 years.


I hear Common Lisp is quite good for this.

> but for fun, he spent €100 (US$110) on a Raspberry Pi, a single-board hobbyist computer that runs Linux and has Mathematica 12 pre-installed.

Did he get scammed?


Strangely, they were running (some of) the code on old hardware. That's hardly a useful case, and much easier than 'resurrecting' the code for modern reuse.


Something with non-standard asm?


That sounds like a big issue. And certainly part of getting 10-year-old code resurrected.


My 20-30+ year old code still runs: TI-85 BASIC, Turbo Pascal, Turing, Haskell, Smalltalk, Fortran, roughly in that order. Bit of a skeleton cupboard.


I read it as "does your ten-year-old still run code", and was wondering whether this was a challenge for scientists to get their kids to do better things than coding.


Bah, even the STEPS project (from Alan Kay) was a total failure from a reproducibility POV, and that was a software research project :-(


How do you do reproducible builds in R? It seems like a huge PITA to specify versions of R and especially the packages used...


Does code from ten years ago ever run? Try running something that runs on Python 2 on the current Python interpreter today.


Python is an extremely bad example.

Try twenty years old Common Lisp code. Or Fortran.


What are we doing then? We can choose bad scientific code but use "good examples" for other types of code? Seems convenient.


Pfff. Does my 3 month old code still run? Uh, nope. And I don't remember what it was supposed to do!


Sure it does. On a 10-year old machine.


It always runs if you use the same computer with the same environment you ran it in last time. So yes.


Axiom is a computer algebra system written in the 1970s-80s. It still runs (and is open source).


Yes, it is all pasted into my thesis, comments and all, like all code should.


Almost certainly not, because it would have been written in Python 2.


If you wrap your code into Docker, I would say... probably.


18-year old code still runs. And generates revenue.


Depends on context and competition


Yes, it still runs, of course.



