This article brings up scientific code from 10 years ago, but how about code fro...

djaque · on Aug 24, 2020

I am all for open science, but you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

By the way, yes I tested my ten year old code and it does still work. What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.

solatic · on Aug 24, 2020

Let's be clear - scientific-grade code is a substandard of production-grade code. But it is still a real standard.

Does scientific-grade code need to handle a large number of users running it at the same time? Probably not a genuine concern, since those users will run their own copies of the code on their own hardware, and it's not necessary or relevant for users to see the same networked results from the same instance of the program running on a central machine.

Does scientific-grade code need to publish telemetry? Eh, usually no. Set up alerting so that on-call engineers can be paged when (not if) it falls over? Nope.

Does scientific-grade code need to handle the authorization and authentication of users? Nope.

Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, when there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation, and what matters for cryptographic purposes is that the seed numbers for the pseudorandom generation are sufficiently hidden / unknown. For scientific purposes, the seed numbers should be published on purpose, so that a) the exact results you found, sufficiently random as they are for the purpose of your experiment, can still be independently verified by a peer reviewer, b) a peer reviewer can intentionally decide to pick a different seed value, which will lead to different results but should still lead to the same conclusion if your decision to reject / refuse to reject the null hypothesis was correct.

dllthomas · on Aug 24, 2020

> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes.

I agree that this is a good property for scientific code to have, but I think we need to be careful not to treat re-running of existing code the same way we treat genuinely independent replication.

Traditionally, people freshly constructed any necessary apparatus, and people walked through the steps of the procedures. This is an interaction between experiment and human brain meats that's missing when code is simply reused (whether we consider it apparatus or procedure).

Once we have multiple implementations, if there is a meaningful difference between them, at that point replayability is of tremendous value in identifying why they differ.

But it is not reproducibility, as we want that term to be used in science.

ajford · on Aug 24, 2020

This! I struggled with this topic in university. I was studying pulsar astronomy, and there was only one or two common tools used at the lower levels of data processing, and had been the same tools used for a couple of decades.

The software was "reproducible" in that the same starting conditions produced the same output, but that didn't mean the _science_ was reproducible, as every study used the same software.

I repeatedly brought it up, but I wasn't advanced enough in my studies to be able to do anything about it. By the time I felt comfortable with that, I was on my way out of the field and into an non-academic career.

I have kept up with the field to a certain extent, and there is now a project in progress to create a fully independent replacement for that original code that should help shed some light (in progress for a few years now, and still going strong).

allenofthehills · on Aug 24, 2020

> The software was "reproducible" in that the same starting conditions produced the same output, but that didn't mean the _science_ was reproducible, as every study used the same software.

This is the difference between reproducibility and replicability [1]. Reproducibility is the ability to run the same software on the same input data to get the same output; replication would be analyzing the same input data (or new, replicated data following the original collection protocol) with new software and getting the same result.

I've experienced the same lack of interest with established researchers in my field, but I can at least ensure that all my studies are both reproducible and replicable by sharing my code and data.

[1] Plesser HE. Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Front Neuroinform. 2018;11:76.

improbable22 · on Aug 24, 2020

This is almost an argument for not publishing code. If you publish all the equations, then everybody has to write their own implementation from that.

Something like this is the norm in some more mathematical fields, where only the polished final version is published, as if done by pure thought. To build that, first you have to reproduce it, invariably by building your own code -- perhaps equally awful, but independent.

dllthomas · on Aug 25, 2020

Maybe gate release of the code by some number of attempted replications.

7thaccount · on Aug 24, 2020

Should this be surprising? I'm not saying it is correct, but it is similar to the response many managers give concerning a badly needed rewrite of business software. Doing so is very risky and the benefits aren't always easy to quantify. Also, nobody wants to pay you to do that. Research is highly competitive, so no researcher is going to want to spend valuable time making a new tool that already exists even if needed if no other researchers are doing that.

BadInformatics · on Aug 24, 2020

Conversely though, it is often impossible to obtain the original code to replay and identify differences once that step is reached without some sort of strong incentive or mandate for researchers to publish it. When the only copy is lost in the now-inaccessible home folder of some former grad student's old lab machine, there is a strong disincentive to try replicating at all because one has little to consult on whether/how close the replicated methods are to the original ones.

dllthomas · on Aug 24, 2020

And so we find ourselves in the same situation as the rest of the scientific process, throughout history. When I try to replicate your published paper and I fail, it's completely unclear whether it's "your fault" or "my fault" or pure happenstance, and there's a lot of picking apart that needs to be done with usually no access to the original experimental apparatus and sometimes no access to the original experimenters.

The fact that we can have that option is an amazing opportunity that a confluence of attributes of software (specificity, replayability, easy of copying) afford us. Where we are not exploiting this like we could be, it is a failure of our institutions! But it is different-in-kind from traditional reproducibility.

BadInformatics · on Aug 24, 2020

Of course, but the flip side is that same confluence of attributes has also exacerbated issues of reproducibility. Just as science and the methods/mediums by which we conduct/disseminate it have changed, so too should the standard of what is considered acceptable to reproduce. This is especially relevant given how much broader the societal and policy implications have become.

More concretely, it is 100% fair (and I might argue necessary) to demand more of our institutions and work to improve their failures. I'm sure many researchers have encountered publications of the form "we applied <proprietary model (TM)> (not explained) to <proprietary data> (partially explained) after <two sentence description of preprocessing> and obtained SOTA results!" in a reputable venue. Sure, this might be even less reproducible 200 years ago than now, but the authors would also be less likely to be competing with you for limited funding! Debating about the traditional definition of reproducibility has its place, but we should also be doing as much as possible to give reviewers and replicators a leg up. This is often flies in the face of many incentives the research community faces, but shifting blame to institutions by default (not saying you're doing this, but I've seen many who do) is taking the easy road out and does little to help the imbalanced ratio of discussion:progress.

kkylin · on Aug 24, 2020

This. I absolutely agree there needs to be more transparency, and scientific code should be as open as possible. But this should not replace replication.

hobofan · on Aug 24, 2020

But "rerunning reproducability" is mostly a neccessary requirement for independent reproducability. If you can't even run the original calculations against the original data again how can you be sure that you are not comparing apples to oranges?

jabirali · on Aug 24, 2020

In some simulations, each rerun produces different results as you’re simulating random events (like lightning formation) or using a non-deterministic algorithm (like Monte Carlo sampling). Just “saving the random seed” might not be sufficient to make it deterministic either, as if you do parallelized or concurrent actions in your code (common in scientific code) the same pseudorandom numbers may be used in different orders each time you run it.

But repeating the simulation a large number of times, with different random seeds, should produce statistically similar output if the code is rigorous. So even if each simulation is not reproducible, as long as the statistical distribution of outputs is reproducible, that should be sufficient.

dllthomas · on Aug 24, 2020

Very interesting. I was thinking of software as most similar to apparatus, and secondarily to procedure. You raise a third possible comparison: calculations, which IIUC would be expected to be included in the paper.

There are some kinds of code (a script that controls a sensor or an actuator) where I think that doesn't match up well at all. There are plenty of kinds of code where they are, in fact, simply crunching numbers produced earlier. For the latter, I'm honestly not sure the best way to treat it, except to say that we should be sure that enough information is included in some form that replication should be possible, and that we keep in mind the idea that replication should involve human interaction.

kkylin · on Aug 24, 2020

This is not clear at all. It depends on the "result" in question. If I wrote a paper describing a super numerical algorithm for inverting matrices, and no one is able to replicate the superior performance of my algorithm despite following the recipe in my paper, then whether they can run my code or not doesn't seem to be of the highest priority.

Edit: more careful phrasing.

a1369209993 · on Aug 25, 2020

> whether they can run my code or not doesn't seem to be of the highest priority.

On the contrary; in that case there are four possibilities:

a: your algorithm doesn't work at all, and your observations are a artifact of convenient inputs or inept measurements.

b: your algorithm works, but the description in the paper is wrong or incomplete

c: your algorithm works as described, but the replicater implemented it incorrectly

d: other

Having the original implementation code is necessary to distinguish between cases a and b versus case c, and if the former, the code for the test harness is likely to help distinguish a versus b. (Case d is of course still a problem, but that doesn't mean it's reasonable to just give up.)

kkylin · on Aug 25, 2020

I agree with the case analysis, but disagree with the implication that the code needs to be runnable (which is seems to be the point of the discussion at hand). In many cases having source code, even if it no longer runs, should be sufficient.

kkylin · on Aug 25, 2020

I complete agree with your case analysis, but disagree with the conclusion that the code needs to be runnable for it to be useful -- I thought this was the point of the discussion at hand? In most situations, having source code, even if it no longer runs, would be sufficient to conduct the analysis you describe.

I'm all for more transparency, and this includes making codes and data public as much as is reasonable. But the real test is if someone can independently replicate the result, and how to incentivize replication studies (in both computational and experimental science) is also important, and in my view should not be divorced from discussions of reproducibility.

Edit: rewrote to clarify my position.

sriku · on Aug 25, 2020

I would also place as a requirement that the code be comprehensible to someone familiar with the domain - .i.e. a "peer".

a_zaydak · on Aug 24, 2020

I do agree with you on publishing seeds for Monte Carlo simulations however the argument against it is also very strong. Usually when you run a monte carlo simulation you are quoting the results in terms of statistics. I think it would be sufficient to say that you can 'reproduce' the results as long as your statistics (over many simulations with different seeds) is consistent with the published results. If you run a single simulation with are particular seed you should get the same results however this might be cherry picking a particular simulation result. This is good for code testing but probably not for scientific results. I think by running the code with new seeds is a better way to test the science.

dekhn · on Aug 24, 2020

As an ex-scientist who used to run lots of simulations, I really fail to see a truly compelling reason why most numerical results (for publication purposes) truly need to publish (and support) deterministic seeding.

We've certainly done a lot, scientifically speaking (in terms of post-validated studies), without that level of reproducibility.

jnxx · on Aug 24, 2020

If nothing else, it helps debugging code which tries to reproduce your findings.

dekhn · on Aug 24, 2020

The code I work with is not debuggable in that way under most circumstances. It's a complex distributed system. You don't attempt to debug it by being deterministic- you debug it by sampling its properties.

kag0 · on Aug 24, 2020

> there is no such thing as truly random number generation on contemporary computers

well that's just not true. there's no shortage of noise we can sample to get true random numbers. we just often stretch the random numbers for performance purposes.

James_Henry · on Aug 25, 2020

There is a shortage of truly random easily sampled noise though.

konjin · on Aug 25, 2020

You can run the script multiple times and get a statistical representation of what the results should be. That's the point of science.

This reminds me of being in gradschool and the comp-sci people complaining that we don't get bit-wise equal floats when we solve DEs.

Having to re-implement a library from scratch for a project is much more valuable than running the same code in two places. The same way that getting the same results from two different machines is a lot more significant than getting the same result from two cloned machines.

In short: code does not need to be reproducible because scientists know how to average.

andrewprock · on Aug 24, 2020

> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes

This is definitely not correct. The experiment as a whole needs to be reproducible independently. This is very different, and more robust, from requiring that a particular portion of a previous version of the experiment to be reproducible in isolation.

jnxx · on Aug 25, 2020

> Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, [...]

Reproducibility in the scientific sense is different from running the same program with the same input, and getting exactly the same result. Repreducibility means that if you repeat the measurements in another environment, getting somewhat different data, and apply the same theory and methods, you get to the same conclusion.

The property of a computer program that when you run it again with the same input, you get the same output, is nice and very helpful for debugging. But the fact that you can run the same program does not mean that it is bug-free, as much as the fact that you can copy a paper with a mathematical proof does not mean that the proof is correct.

Also, multi-threaded and parallel code is inherently non-deterministic.

> when there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation,

That is wrong. Linux for example uses latency measurements from drivers such as HDD drive seek latencies or keyboards to generate entropy. While it might not the best thing to rely on for purposes of cryptography, it is surely not deterministic. If it would matter, you could download real-time astronomical noise measurements and use them to seed your Mersenne Twister generator.

K0balt · on Aug 25, 2020

As a software developer, I tgi k maybe you misunderstand scientific reproducibility.

Other scientists should be building their own apparatus, writing and running their own code. That the experiment is actually different is what validates the hypothesis, which specifies the salient conditions leading to the outcome.

That an identical experiment leads to an identical outcome fails to validate the hypothesis, because the casual factors may have been misidentified.

Precise reproducibility still matters if generalized reproduction fails, however, because the differences in the experimental implementation may lead to new and more accurate hypothesis about causality.

Izkata · on Aug 25, 2020

> Other scientists should be building their own apparatus, writing and running their own code. That the experiment is actually different is what validates the hypothesis, which specifies the salient conditions leading to the outcome.

The problem here is a step before this: The results of these "identical" experiments are so wildly different there is nothing valid to propose, let alone for a recreation to compare against.

logifail · on Aug 25, 2020

> That an identical experiment leads to an identical outcome [..]

So what happens if/when an identical experiment fails to lead to an identical outcome?

throwaway287391 · on Aug 24, 2020

Controlling randomness can be extremely difficult to get right, especially when there's anything asynchronous about the code (e.g. multiple worker threads populating a queue to load data). In machine learning, some of the most popular frameworks (e.g. TensorFlow [0]) don't offer this as a feature, and in other frameworks that do (PyTorch [1]) it will cripple the speed you get as a result as GPU accelerators rely on non-deterministic accumulation for reasonable speed.

Scientific reproducibility does not mean, and has never meant, you rerun the code and the output perfectly matches bit-for-bit every time. If you can achieve that, great -- it's certainly a useful property to have for debugging. But a much stronger and more relevant form of reproducibility for actually advancing science is running the same study e.g. on different groups of participants (or in computer science / applied math/stats / etc., with different codebases, with different model variants/hyperparameters, on different datasets) and the overall conclusions hold.

To paraphrase a comment I saw from another thread on HN: "Plenty of good science got done before modern devops came to be."

[0] https://github.com/tensorflow/tensorflow/issues/12871 https://github.com/tensorflow/tensorflow/issues/18096

[1] https://pytorch.org/docs/stable/notes/randomness.html

==========

EDIT to reply to solatic's replies below (I'm being rate-limited):

The social science arguments are probably fair (or at least I'll leave it to someone more knowledgeable to defend them if they want) -- perhaps I shouldn't have led with the example of "different groups of participants".

> If you can achieve that, for the area of study in which you conduct your experiment, it should be required. Deciding to forego formal reproducibility should be justified with a clear explanation as to why reproducibility is infeasible for your experiment, and peer-review should reject studies that could have be reproducible but weren't in practice.

This might be a reasonable thing to enforce if everyone in the field were using the same computing platform. Given that they're not (and that telling everyone that all published results have to be done using AWS with this particular machine configuration is not a tenable solution) I don't see how this could ever be a realistic requirement. Or if you don't want to enforce that the results remain identical across different platforms, what's the point of the requirement in the first place? How would it be enforced if nobody else has the exact combination of hardware/software to do so? And then even if someone does, almost inevitably there'll be some detail of the setup that the researcher didn't think to report and results will differ slightly anyway.

Besides, if you're allowing for exemptions, just about every paper in machine learning studying datasets larger than MNIST (where asynchronous prefetching of data is pretty much required to achieve decent speeds) would have a good reason to be exempt. It's possible that there are other fields where this sort of requirement would be both useful and feasible for a large amount of the research in that field, but I don't know what they are.

> Also, reading through the issues you linked points to: https://github.com/NVIDIA/framework-determinism which is a relatively recent attempt by nVidia to support deterministic computation for TensorFlow. Not perfect yet, but the effort is going there.

(From your other comment.) Yes, there exists a $300B company with an ongoing-but-incomplete funded effort of so far >6 months' work (and that's just the part they've done in public) to make one of its own APIs optionally deterministic when it's being used through a single downstream client framework. If this isn't a perfect illustration that it's not realistic to expect exact determinism from software written by individual grad students studying chemistry, I'm not sure what to say.

SilasX · on Aug 24, 2020

You're right about bit-for-bit reproducibility possibly being overkill, but I don't think that invalidates the parent's point that Monte Carlo randomization doesn't obviate reproducibility concerns. It just means that e.g. your results shouldn't be hypersensitive to the details of the randomization. That is, reviewers should be able to take your code, feed it different random data from a similar distribution to what you claimed to use (perhaps by choosing a different seed), and get substantively similar results.

throwaway287391 · on Aug 24, 2020

That brings up a separate issue that I didn't comment on above: the expectation that the code runs in a completely different development/execution environment (e.g. the one the reviewer is using vs. the one that the researcher used). That means making it run regardless of the OS (Windows/OSX/Linux/...) and hardware (CPU/GPU/TPU, and even within those, which one) the reviewer is using. This would be an extremely difficult if not impossible thing for even a professional software engineer to achieve. It could easily be a full time job. There are daily issues on even the most well-funded projects in machine learning by huge companies (ex: TF, PyTorch) that the latest update doesn't work on GPU X or CUDA version Y or OS Z. It's not a realistic expectation for a researcher even in computer science, let alone researchers in other fields, most of whom are already at the top of the game programming-wise if they would even think to reach for a "script" to automate repetitive data entry tasks etc.

==========

EDIT to reply to BadInformatics' reply below (I'm being rate-limited): I fully agree that a lot of ML code releases could be better about this, and it's even reasonable to expect them to do some of these more basic things like you mention. I don't agree that bit-for-bit reproducibility is a realistic standard that will get us there.

BadInformatics · on Aug 24, 2020

I don't think that removes the need to provide enough detail to replicate the original environment though. We write one-off scripts with no expectation that they will see outside usage, whereas research publications are meant for just that! The bar isn't terribly high either: for ML, a requirements.txt + OS version + CUDA version would go a long way, no need to learn docker just for this.

konjin · on Aug 25, 2020

Have you tried running a specific Cuda version from 10 years ago?

Because I have and I pity anyone who tries and builds a kernel that can run it.

jbay808 · on Aug 24, 2020

It does seem like a valid response to OP's objection to the imperial college's COVID model, though. Doesn't it?

SilasX · on Aug 24, 2020

Reviewing the original comment, I think so (that the original comment is overcritical). For purpose of reproducibility, it's enough that you can validate that you can run the model with different random data and see that their results aren't due to pathological choices of initial conditions. If the race conditions and non-determinism just transform the random data into another set of valid random data, that doesn't compromise reproducibility.

throwawaygh · on Aug 24, 2020

> or in computer science / applied math/stats / etc., with different codebases, with different model variants, on different datasets) and the overall conclusions hold

A lot of open sourced CS research is not reproducible.

"the code still runs and gives the same output" is not the same as reproducibility.

throwaway287391 · on Aug 24, 2020

> A lot of open sourced CS research is not reproducible.

I'm not sure if this was meant to be a counter-argument to me, but I completely agree!

> "the code still runs and gives the same output" is not the same as reproducibility.

Yes, bit-for-bit identical results are neither necessary nor sufficient for reproducibility in the usual scientific sense.

throwawaygh · on Aug 24, 2020

> I'm not sure if this was meant to be a counter-argument to me

It wasn't :)

solatic · on Aug 24, 2020

> But a much stronger and more relevant form of reproducibility for actually advancing science is running the same study e.g. on different groups of participants (or in computer science / applied math/stats / etc., with different codebases, with different model variants/hyperparameters, on different datasets) and the overall conclusions hold

> Plenty of good science got done before modern devops came to be

This isn't as strong of an argument as you think. This is more-or-less the underlying foundation behind the social sciences, which argues that no social sampling can ever be entirely reproduced since no two people are alike, and even the same person cannot be reliably sampled twice as people change with time.

Has there been "good science" done in the social sciences? Sure. I don't think that you're going to find anybody arguing that the state of the social sciences today is about the same as it was in the Dark Ages.

With that said, one of the reasons why so many laypeople look at the social sciences as a kind of joke is because so many contradictory studies come out of these peer-reviewed journals that their trustworthiness is quite low. One of the reasons why there's so much confusion surrounding what constitutes a healthy diet and how people should best attempt to lose weight is precisely because diet-and-exercise studies are more-or-less impossible to reproduce.

> If you can achieve that, great -- it's certainly a useful property to have for debugging

If you can achieve that, for the area of study in which you conduct your experiment, it should be required. Deciding to forego formal reproducibility should be justified with a clear explanation as to why reproducibility is infeasible for your experiment, and peer-review should reject studies that could have be reproducible but weren't in practice.

jbay808 · on Aug 24, 2020

Plenty of good physics got done before modern devops came to be, too! Maybe the pace of advancement was slower when the best practice was to publish a cryptographic hash of your discoveries in the form of a poetic latin anagram rather than just straight-up saying it, but it's not like Hooke's law is considered unreproducible today because you can't deterministically re-instantiate his experimental setup with a centuries-old piece of brass and get the same result to n significant figures.

mnl · on Aug 24, 2020

And physicists have been writing code for a while simply because the number of software engineers with a working knowledge of physics (as in ready for research), have been trained in numerical analysis (as in being able to read applied mathematics) and then are willing to help you with your paper for peanuts is about zero.

I don't understand why it is so hard to see that you need either a pretty big collaboration where somebody else has isolated the specifications so you don't need to know anything about the problem your code solves really, or becoming a physics graduate student yourself for this line of work.

a1369209993 · on Aug 25, 2020

I wouldn't argue for it, but I would be extremely reluctant to argue against the assertion that the state of the social 'sciences' today is about the same as it was in the Dark Ages. To the extent that any "good science" gets done in the social 'sciences', it is entirely in spite of the entire (social-'science'-specific) underlying foundation thereof. If your results aren't reproducible with (the overwhelming majority of[0]) other samples collected based on the same criteria, your results aren't.

0: specifically, for results claiming a 95% confidence level (p<0.05), if nineteen replications are attempted, you should encounter a replication failure exactly once. I would accept perhaps four or five out of nineteen (or one out of two or three) under the reasoning that the law of large numbers hasn't kicked in yet, but anything with zero successful replications is not science, it's evidence (in this, against the entire field of study).

solatic · on Aug 24, 2020

Also, reading through the issues you linked points to: https://github.com/NVIDIA/framework-determinism which is a relatively recent attempt by nVidia to support deterministic computation for TensorFlow. Not perfect yet, but the effort is going there.

dnautics · on Aug 24, 2020

the correct way to control randomness in scientific code is to have the RNG be seeded with a flag and have the result check out with a snapshot value. Almost no one does this, but that doesn't mean it shouldn't be done.

ska · on Aug 24, 2020

This is not correct on several levels. Reproducibility is not achievable in many real world scenarios, but worse it's not even very informative.

Contra your assertion, many people do some sort of regression testing like this but it's isn't terribly useful for verification or validation - but it is good at catching bad patches.

throwaway287391 · on Aug 24, 2020

Did you read my post? I know what a seed is. Setting one is typically not enough to ensure bit-for-bit identical results in high-performance code. I gave two examples of this: CUDA GPUs (which do non-deterministic accumulation) and asynchronous threads (which won't always run operations in the same order).

dnautics · on Aug 24, 2020

Most scientific runs are scaled where you run multiple replicates. And not all scientific runs are high-performance in the HPC sense. Even if your code is HPC in the HPC sense, and requires CUDA, and 40,000 cores, you should consider creating a release flag where an end user can do at least single "slow" run on a CPU on a reduced dataset, in single threaded mode, to sanity check the results and at least verify that the computational and algorithmic pipeline is sound at the most basic level.

I used to be a scientist. I get it, getting scientists to do this is like pulling teeth, but it's the least you could do to give other people confidence in your results.

throwaway287391 · on Aug 24, 2020

> consider creating a release flag where an end user can do at least single "slow" run on a CPU on a reduced dataset, in single threaded mode, to sanity check the results and at least verify that the computational and algorithmic pipeline is sound at the most basic level.

Ok, that's a reasonable ask :) But yeah as you implied, good luck getting the average scientist, who in the best case begrudgingly uses version control, to care enough to do this.

mmmBacon · on Aug 24, 2020

Monte-Carlo can and should be deterministic and repeatable. It’s a matter of correctly initializing you random number generators and providing a known/same random seed from run to run. If you aren’t doing that, you aren’t running your Monte-Carlo correctly. That’s a huge red flag.

Scientists need to get over this fear about their code. They need to produce better code and need to actually start educating their students on how to write and produce code. For too long many in the physics community have trivialized programming and seen it as assumed knowledge.

Having open code will allow you to become better and you’ll produce better results.

Side note: 25 years ago I worked in accelerator science too.

djaque · on Aug 24, 2020

Hello fellow accelerator physicist!

Yes I understand how seeding PRNGs work and I personally do that for my own code for debugging purposes. My point was that not using a fixed seed doesn't invalidate their result. It's just a cheap shot and, to me, demonstrates that the lockdownskeptics author doesn't have a real understanding of the methods being used.

Also, to be clear, I support open science and have some of my own open-source projects out in the wild (which is not the norm in my own field yet). I'm not arguing against releasing code, I'm arguing against OP arguing against this particular piece of code.

SiempreViernes · on Aug 24, 2020

Indeed it was a cheap shot, the code does give reproducible results: https://www.nature.com/articles/d41586-020-01685-y

The main issue is if it used sensible inputs, but that's entirely different from code quality and requires subject matter expertise, so programmers don't bother with such details -_-

tgvaughan · on Aug 24, 2020

I write M-H samplers for a living. While I agree that being able to rerun a chain using the same seed as before is crucial for debugging, and while I'm very strongly in favour of publishing the code used for a production analysis, I'm generally opposed to publishing the corresponding RNG seeds. If you need the seeds to reproduce my results, then the results aren't worth the PDF they're printed on. [edit: typo]

improbable22 · on Aug 24, 2020

> Monte-Carlo can and should be deterministic and repeatable

I guess it can be made so, but not necessarily easy / fast (if it's parallel, and sensitive to floating point rounding). And sounds like the kind of engineering effort GP is saying isn't worth it. Re-running exactly the same monte-carlo chain does tell you something, but is perhaps the wrong level to be checking. Re-running from a different seed, and getting results that are within error, might be much more useful.

jbay808 · on Aug 24, 2020

I guess the best thing would be that it uses a different random seed every time it's run (so that, when re-running the code you'll see similar results which verifies that the result is not sensitive to the seed), but the particular seed that produced the particular results published in a paper is noted.

But still, for code running on different machines, especially for numeric-heavy code that might be running on a particular GPU setup, distributed big data source (where you pull the first available data rather than read in a fixed order), or even on some special supercomputer, it's hard to ask that it be totally reproducible down to the smallest rounding error.

neutronicus · on Aug 24, 2020

Then you need to re-imagine the system in such a way that junior scientific programmers (i.e. Grad Students) can at least imagine having enough job security for code maintainability to matter, and for PIs to invest in their students' knowledge with a horizon longer than a couple person-years.

jnxx · on Aug 24, 2020

> Monte-Carlo can and should be deterministic and repeatable.

That's a nitpick, but if the computation is executed in parallel threads (e.g. on multicore, or on a multicomputer), and individual terms are, for example, summed in a random order, caused by the non-determinism introduced by the parallel computation, then the result is not strictly deterministic. This is a property of floating-point computation, more specifically, the finite accuracy of real floating-point implementations.

So, it is not deterministic, but that should not cause large qualitative differences.

jnxx · on Aug 25, 2020

> Monte-Carlo can and should be deterministic and repeatable. It’s a matter of correctly initializing you random number generators and providing a known/same random seed from run to run.

Perhaps if you use only single-threaded computation, you are interested in averages, and the processes you are interested in behave well and mostly linear.

But

- running code in parallel easily introduces non-determinism, even if your result computation is as simple as summing up results from different threads

- the processes one is examining might be highly non-linear - like lightning, weather forecasts, simulation of wildfires, and also epidemic simulations

- especially for all kind of safety research, you might actually be interested not only in averages, but in freak events, like "what is the likelihood that you have two or three hurricanes at the same time in the Gulf of Mexico", or "what happens if your nuclear plant gets struck by freak lightning in the first second of a power failure".

What should be reproducible are the conclusions you come to, not the hashed bits of program output.

> If you aren’t doing that, you aren’t running your Monte- Carlo correctly. That’s a huge red flag.

No, it does not follow from that.

jack_h · on Aug 24, 2020

Since I have a bit of experience in this area, quasi-Monte Carlo methods also work quite well and ensure deterministic results. They're not applicable for all situations though.

ivanbakel · on Aug 24, 2020

Doesn't it concern you that it would be possible for critics to look at your scientific software and find mistakes (some of which the OP mentioned are not "minor") so easily?

Given that such software forms the very foundation of the results of such papers, why shouldn't it fall under scrutiny, even for "minor" points? If you are unable to produce good technical content, why are you qualified to declare what is or isn't minor? Isn't the whole point that scrutiny is best left to technical experts (and not subject experts)?

smnrchrds · on Aug 24, 2020

> Doesn't it concern you that it would be possible for critics to look at your scientific software and find mistakes (some of which the OP mentioned are not "minor") so easily?

A non-native English speaker may make grammatical mistakes when communicating their research in English—it does not in any way invalidate their results or hint that there is anything amiss. It is simply what happens when you are a non-native speaker.

Some (many?) code critiques by people unfamiliar with the field of study the research will be about superficial mistakes that do not invalidate the results. They are the code equivalents of grammatical mistakes. That's what the OP is talking about.

stult · on Aug 24, 2020

Journals employ copy editors to address just those sorts of mistakes, why should we not hold software to the same standard as academic language? But more importantly, these software best practices aren't mere "grammatical mistakes," they exist because well-organized, well-tested code has fewer bugs and is easier for third parties to verify. Third-parties validating that the code underlying an academic paper executes as expected is no different than third-parties replicating the results of a physical experiment. You can be damn sure that an experimental methodology error invalidates a paper, and you can be damn sure that bad documentation of the methodology dramatically reduces the value/reliability of the paper. Code is no different. It's just been the wild west because it is a relatively new and immature field, so most academics have never been taught coding as a discipline nor held to rigorous standards in their own work. Is it annoying that they now have to learn how to use these tools properly? I'm sure it is. That doesn't mean it isn't a standard we should aim for, nor that we shouldn't teach the relevant skills to current students in sciences so that they are better prepared when they become researchers themselves.

labcomputer · on Aug 24, 2020

> Third-parties validating that the code underlying an academic paper executes as expected is no different than third-parties replicating the results of a physical experiment.

First, it's not no different--it's completely different. Third parties have always constructed their own apparatus to reproduce an experiment. They don't go to the original author's lab to perform the experiment!

Second, a lot of scientific code won't run at all outside the environment it was developed in.

If it's HPC code, it's very likely that the code makes assumptions about the HPC cluster that will cause it to break on a different cluster. If it's experiment control / data-acquisition code, you'll almost certainly need the exact same peripherals for the program to do anything at all sensible.

I see a lot of people here on HN vastly over-estimating the value of bit-for-bit reproducibility of one implementation, and vastly underestimating the value of having a diversity of implementations to test an idea.

Bukhmanizer · on Aug 24, 2020

I’m glad someone else feels this way. It’s an expectation that scientists can share their with other scientists using language. Scientists aren’t always the best writers, but there are standards there. Writing good code is a form of communication. It baffles me that there are absolutely no standards there.

garden_hermit · on Aug 24, 2020

I agree with your overall point, but I just want to point out that many (most?) journals don't employ copy-editors, or if they do, then they overlook many errors, especially in the methods section of papers.

ryandrake · on Aug 24, 2020

On the contrary: If I'm (in industry) doing a code review and see simple, obvious mistakes like infinite loops, obvious null pointer exceptions, ignored compiler warnings, etc., in my mind it casts a good deal of doubt over the entire code. If the author is so careless with these obvious errors, what else is he/she being careless about?

Same with grammatical or spelling errors. I don't review research but I do review resumes, and I've seen atrocious spelling on resumes. Here's the candidate's first chance to make an impression. They have all the time in the world to proofread, hone, and have other eyes edit it. Yet, they still miss obvious mistakes. If hired, will their work product also be sloppy?

SiempreViernes · on Aug 24, 2020

This sort of scrutiny only matters once someone else has a totally different code that gives incompatible results, before that point there's no sense in looking for bugs because all you're proving is that there are no obvious mistakes: you don't say anything about the interesting questions since you only bother with codes for things with non-obvious answers.

James_Henry · on Aug 24, 2020

When you say OP, do you mean djsumdog? If so, what mistakes does he mention that aren't minor?

gowld · on Aug 24, 2020

How is it possible to know the difference between minor and major, if the mistakes are kept secret?

If we're supposed to accept scientific results on faith, why bother with science at all?

sitkack · on Aug 24, 2020

> exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

If code is what is substantiating a scientific claim, then code needs to stand up to scientific scrutiny. This is how science is done.

I came from physics, but systems and computer engineering was always an interest of mine, even before physics, I thought it was kooky-dooks that CS people can release papers w/o code, fine if the paper contains all the proofs but otherwise it shouldn't even be looked at. PoS (proof-of-science) or GTFO.

We are the point in human and scientific civilization that knowledge needs to prove itself correct. Papers should be self contained execution environments that generate PDFs and resulting datasets. The code doesn't need to be pretty, or robust, but it needs to be sealed inside of a container so that it can be re-run, re-validated and someone else can confirm the result X years from now. And it isn't about trusting or not trusting the researcher, we need to fundamentally trust the results.

snowwrestler · on Aug 24, 2020

The history of physics is full of complex, one-off custom hardware. Reviewers have not been expected to take the full technical specs and actually build and run the exact same hardware, just to verify correctness for publication.

I doubt any physicist believes we need to get the Tevatron running again just to check decade-old measurements of the top quark. I don't understand why decade-old scientific software code must meet that bar.

martingab · on Aug 26, 2020

They didn't rebuild the Tevatron but still were able to rediscover the top within a different experimental environment (i.e. LHC with tons of different discovery channels) and have lots of fits for it properties from indirect measurements (LEP, Belle). Physics is not an exact science. If you have only one measurement (no matter if its software- or hardware-based), no serious physicist would fully trust in the result as long as it wasn't confirmed by an independent research group (by doing more than just rebuilding/copying the initial experiment but maybe using slightly different approximations or different models/techniques). I'm not so much in computer science, but I guess here it might be a bit different ones a prove is based on rigorous math. However even if so, I guess, it's sometimes questionable if the prove is applicable to real-world systems and then one might be in a similar situation.

Anyways, in physics they always require several experimental proves for our theory. They also have several "software experiments" for e.g. predicting the same observables. Therefore, researchers need to be able to compile and run the code of their competitors in order to compare and verify the results in detail. In this place, bug-hunting/fixing is sometimes also taking place - of course. So applying the articles suggestions would have the potential to accelerate scientific collaboration.

btw; I know some people who do still work with the data taken at the LEP experiment which was shut down almost 20 (!) years ago and they have a hard time in combining old detector-simulations, monte-carlos etc. with new data-analysis techniques for the exact same reasons mentioned in the article. For large-scale experiments it is a serious problem which nowadays has much more attention than at LEP ages, since LHC has anyways obvious big-data problems to solve before their next upgrade, including also software-solutions.

reitzensteinm · on Aug 25, 2020

If you could have spun up a Tevatron at will for $10, would the culture be the same today?

I suspect that software really is different in this way, and treating it like it's complex, one off hardware is cultural inertia that's going to fade away.

matthewdgreen · on Aug 24, 2020

All of my 2010 scientific code runs on the then-current edition of Docker. /s

sitkack · on Aug 24, 2020

I made no mention of Docker, VMs or any virtualization system. Those would be an implementation detail and would obviously change over time.

A container can be a .tar.gz, a zip or a disk image of artifacts, code, data and downstream deps. The generic word has been co-opted to mean a specific thing which is very unfortunate.

matthewdgreen · on Aug 24, 2020

My point, which I guess I did not make clearly enough, is that container systems don't necessarily exist or remain supported over the ten-year period being discussed. The idea of ironing over long-term compatibility issues using a container environment seems like a great one! (For the record, .tgz -- the "standard" format for scientific code releases in 2010, does not solve these problems at all.)

But the "implementation detail" of which container format you use, and whether it will still be supported in 10 years, is not an implementation detail at all -- since this will determine whether containerization actually solves the problem of helping your code run a decade later. This gets worse as the number, complexity and of container formats expands.

Of course if what you mean is that researchers should provide perpetual maintenance for their older code packages, moving them from one obsolete platform to a more recent one, then you're making a totally different and very expensive suggestion.

sitkack · on Aug 25, 2020

Of course of course. I am not trying to boil the ocean here, or we would have a VM like wasm and a execution env like wasi and run all our 1000 year code inside of that.

The first step is just having your code, data and deps in an archive. Depending on the project and the age, more stuff makes it into the archive. I have been on projects where the source to the compiler toolchain was checked into the source repo and the first step was to boostrap the tooling (from a compiler binary checked into the repo).

We aren't even to the .tar.gz stage yet.

jnxx · on Aug 24, 2020

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

Specifically, to that point, I want to cite the saying:

"The dogs bark, but the caravan passes."

(There is a more colorful German variant which is, translated: "What does it bother the mighty old oak tree if a dog takes a piss...").

Of course, if you publish your code, you expose it to critics. Some of this will be unqualified. And as we have seen in the case e.g. of climate scientists, some might be even nasty. But who cares? What matters is open discussion which is a core value of science.

throwaway7281 · on Aug 24, 2020

That's not how the game is played. If you cannot the release the code because the code is too ugly or untested or has bugs, how do you expect anyone with the right expertise to assess your findings?

It reminds me of Kerckhoffs's principle in cryptography, which states: A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.

labcomputer · on Aug 24, 2020

In GIS, there's a saying "the map is not the terrain". It seems like HN is in a little SWE bubble, and needs to understand "the code is not the science".

In science, code is not an end in-and-of-itself. It is a tool for simulation, data reduction, calculation, etc. It is a way to test scientific ideas.

> how do you expect anyone with the right expertise to assess your findings

I would expect other experts in the field to write their own implementation of the scientific ideas expressed in a paper. If the idea has any merit, their implementations should produce similar results. Which is exactly what they would do if it were a physical experiment.

RandoHolmes · on Aug 24, 2020

No one is saying that code is the science.

If I'm given bad information and I act on that information, then problems can occur.

Similarly, if the software is giving the scientist bad information, problems can occur.

How many more stories do we have to read about some research getting published in a journal only to have to retract it down the road because they had a bug in the software before we start asking if maybe there needs to be more rigor in the software portion of the research as well?

There was a story on HN a while back about a professor who had written software, had come to some conclusions, and even had a Ph.D. student working on research based on that work. Only to find out that a software flaw meant the conclusions weren't useful to anyone and that student ended up wasting years of their life.

---

This stuff matters. This isn't a model of reality, it's an exploration of reality. It would be like telling a hiker that terrain doesn't matter. They would, rightfully, disagree with you.

kalenx · on Aug 24, 2020

> How many more stories do we have to read about some research getting published in a journal only to have to retract it down the road because they had a bug in the software before we start asking if maybe there needs to be more rigor in the software

We will always hear stories like that, as we will always hear stories about major bugs in stable software releases. Asking a scientist to do better than whole teams of software engineers makes little sense to me.

Of course, a bug that was introduced or kept with the counscious intention of fooling the reviewers and the readers is another story.

RandoHolmes · on Aug 24, 2020

> Asking a scientist to do better than whole teams of software engineers makes little sense to me.

This is not what is being asked, shame on you for the strawman.

Your entire post can be summed up with the following sentence: "if we can't be perfect then we may as well not try to be better".

kalenx · on Aug 25, 2020

I was reacting to the part of your post I quoted.

The thing is that it has little to do about rigor -- or if I may sin again, it is equivalent to say that software developers lack rigor: sure, some of them do (as some scientists do), but even among the most significant and severe bugs of the history of software, it is seldom the case that we can tell "right, definitely the guy who wrote that lacked rigor and seriousness".

Of course this is not a blank forgiveness for every bad scientist out there. Of course we should aim at getting better. But we should make the difference between the ideal science process and the science as performed by a human, prone to errors, misunderstandings and mistakes, and realize that these things will always happen, however many times we call for "more rigor because bugs have consequences".

RandoHolmes · on Aug 25, 2020

All you did was restate the argument that I've already rejected.

And stop comparing scientists to software developers, it's a hidden argument by authority, and it isn't needed.

kalenx · on Aug 25, 2020

I don't even understand the point you are making then (apart from me apparently arguing solely with sophisms, which is kind of a prowess).

How this "rigor" you are calling for should manifest, then? Put bluntly, my point was that every software has bug, so how "more rigor" would help? What should we do, what should we ask for in _practical_ terms?

Also, please do not rephrase this last sentence as "oh so since every software has bugs, then you obviously say that we shouldn't fix bugs, anyway other bugs will remain!".

RandoHolmes · on Aug 25, 2020

> Also, please do not rephrase this last sentence as "oh so since every software has bugs, then you obviously say that we shouldn't fix bugs, anyway other bugs will remain!".

That's exactly what I'm going to do. Point out that we can demand better even in the face of a lack of perfection.

There are two problems here with your stance.

1. The assumption that all bugs are created equal, and 2. The assumption that the truth isn't the overriding concern of science.

It's real easy to define the set of bugs that are unacceptable in science. Any bug that would render the results inaccurate is unacceptable.

The fact that some jackass web developer wrote a bug that deleted an entire database in no way obviates that responsibility of the scientists.

ufmace · on Aug 24, 2020

I don't entirely disagree, but haven't there also been cases of experimental results being invalidated due to subtle mechanical, electrical, chemical, etc complications with the test equipment, when none of the people involved in the experiment were experts in those fields?

I think that, while we could use a bit more training in software engineering best-practices in the science, the thesis is still that science is hard and we need real replication of everything before reaching important conclusions, and over-focusing on one specific type of errors isn't all that helpful.

RandoHolmes · on Aug 24, 2020

If they're setting up experiments whose correct results require electrical expertise, then yes, they should either get better training or bring in someone who has it.

It's not clear to me why you think I would argue that inaccuracies should be avoided in software but accept that they're ok for electrical systems.

yjftsjthsd-h · on Aug 24, 2020

> In GIS, there's a saying "the map is not the terrain". It seems like HN is in a little SWE bubble, and needs to understand "the code is not the science".

And if you're a map maker, it's a bit rich to start claiming that the accuracy of your maps is unimportant. If code is "a way to test scientific ideas", then it kinda needs to work if you want meaningful results. Would you run an experiment with thermometers that were accurate to +-30° and reactants from a source known for contamination?

jnxx · on Aug 24, 2020

In many parts of scientific research, researchers are, to stay in your metaphor, more travelers using a map, than map makers.

Of course, it is a difference whether you make a clinical study on drugs, and use a pocket calculator to compute a mean, or whether you research in numerical analysis, or are presenting a paper in how to use Coq to more efficiently prove the four-color theorem or Fermat's last theorem.

In short, much of science is not computer science, and for it, computation is just a tool.

kazagistar · on Aug 25, 2020

Mathematicians are expected to publish their proofs. Not so that people can do the proof again independently, but so that other mathematicians can find and point out if they have a tangible error in their proof that tangibly invalidates the result.

Sure, some people might point out spurious bugs and "design issues" or whatever, boo hoo. But others might actually find flaws in the code that meaningfully affect science itself: true bugs.

Sure, they could do this by doing a full replication in a lab and then custom coding everything from scratch. But even then, all you have is two conflicting results, with no good way yet to determine which one is more right or why they disagree. Technically, you can use the scientific progress to eventually find bugs in the scientific process, but why waste so much time when publishing the code will allow for reviews to find bugs so much faster. Its pure benefit to science to not obscure its proofs and rigor.

booleandilemma · on Aug 24, 2020

If you’re saying you produced certain results with code, then the code is indeed the science. Not being able to vouch for the code is like believing a mathematical theorem without seeing the proof.

avasthe · on Aug 26, 2020

You are missing the point.

How many actually try to reproduce the results by writing corresponding code themselves? Apparently lot of papers with slightly wrong findings because code errors have passed the peer review (all of us in the SWE bubble know how often bugs occur), at least in less prestigious journals.

There is nothing wrong with mandating the code to be supplied with the paper. Because, many time code is somewhere between the experimental setup and proof / result.

sjburt · on Aug 24, 2020

The findings really should be independent of the code. Reproduction should occur by taking the methodology and re-implementing the software and running new experiments.

martingab · on Aug 24, 2020

That's exactly the philosophy we follow e.g. in particle physics and its a common excuse to dismiss all guidelines made in the article. However, this kind of validation/falsification is often done between different research groups (maybe using different but formally equivalent approaches) while people within the same group have to deal with the 10 years old code base.

I myself had very bad experience with extending the undocumented Fortran 77 code (lots of gotos and common blocks) of my supervisor. Finally, I decided to rewrite the whole thing including my new results instead of just somehow embedding my results into the old code for two reasons: (1) I'm presumably faster in rewriting the whole thing including my new research rather than struggling with the old code and (2) I simply would not trust in the numerical results/phenomenology produced by the code. After all, I'm wasting 2 months of my PhD for the marriage of my own results with known results which -in principle- could have been done within one day if the code base would allow for it.

So yes, If it's a one-man-show I would not give too much on code quality (though unit tests and git can safe quite a lot of time during development) but if there is a chance that someone else is going to touch the code in near future it will save time to your colleagues and improve the overall (scientific) productivity.

PS: quite excited about my first post here

MaxBarraclough · on Aug 24, 2020

> If it's a one-man-show I would not give too much on code quality

This makes me a little uneasy, as I'm not too worried about code quality can easily translate into Yes I know my code is full of undefined behaviour, and I don't care.

> PS: quite excited about my first post here

Welcome to HN! reddit has more cats, Slashdot has more jokes about sharks and laserbeams, but somehow we get by.

ByteJockey · on Aug 25, 2020

Are we talking actual undefined behavior or just behavior that's undefined by the language standard?

The latter isn't great practice, but if your environment handles behavior deterministically, and you publish the version of the compiler you're using, it doesn't seem to be a problem for this type of code.

MaxBarraclough · on Aug 25, 2020

> Are we talking actual undefined behavior or just behavior that's undefined by the language standard?

'Undefined behaviour' is a term-of-art in C/C++ programming, there's no ambiguity.

> if your environment handles behavior deterministically, and you publish the version of the compiler you're using, it doesn't seem to be a problem for this type of code.

Code should be correct by construction, not correct by coincidence. Results from such code shouldn't be considered publishable. Mathematicians don't get credit for invalid proofs that happen to reach a conclusion which is correct.

Again, this isn't some theoretical quibble. There are plenty of sneaky ways undefined behaviour can manifest and cause trouble. [0][1][2]

In the domain of safety-critical software development in C, extreme measures are taken to ensure the absence of undefined behaviour. If scientists adopt a sloppier attitude toward code quality, they should expect to end up publishing invalid results. Frankly, this isn't news, and I'm surprised the standards seem to be so low.

Also, of all the languages out there, C and C++ are among the most unforgiving of minor bugs, and are a bad choice of language for writing poor-quality code. Ada and Java, for instance, won't give you undefined behaviour for writing int i; int j = i;.

[0] https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

[1] https://blog.regehr.org/archives/213

[2] https://cryptoservices.github.io/fde/2018/11/30/undefined-be...

See also my longer ramble on this topic at https://news.ycombinator.com/item?id=24264376

kazagistar · on Aug 25, 2020

I think its poor practice, but undefined behavior shouldn't instantly invalidate results. In fact, this mindset is what keeps people from publishing the code in the first place.

Let the scientists publish UB code, and even the artifacts produced, the executables. Then, if such problems are found in the code by professionals, they can investigate it fully and find if it leads to a tangible flaw that invalidates the research or not.

You would drive yourself mad pointing out places in math proofs where some steps, even seemingly important ones, were skipped. But the papers are not retracted unless such a gap actually holds a flaw that invalidates the rest of thr proof.

Let thdm publish their gross, awful, and even buggy code. Sometimes the bugs don't effect the outcomes.

MaxBarraclough · on Aug 25, 2020

> undefined behavior shouldn't instantly invalidate results

Granted, it's not a guarantee that the results are wrong, but it's a serious issue with the experiment. I agree it wouldn't generally make sense to retract a publication unless it can be determined that the results are invalid. It should be possible to independently investigate this, if the source-code and input data are published, as they should be.

(It isn't universally true that reproduction of the experiment should be practical given that the source and data are published, as it may be difficult to reproduce supercomputer-powered experiments. iirc, training AlphaGo cost several million dollars of compute time, for instance.)

> this mindset is what keeps people from publishing the code in the first place

As I explained in [0], this attitude makes no sense at all. It has no place in modern science, and it's unfortunate the publication norms haven't caught up.

Scientific publication is meant to enable critical independent review of work, not to shield scientists from criticism from their peers, which is the exact opposite.

> Let the scientists publish UB code, and even the artifacts produced, the executables. Then, if such problems are found in the code by professionals, they can investigate it fully and find if it leads to a tangible flaw that invalidates the research or not.

I'm not sure what to make of 'professionals', but otherwise I agree, go ahead and publish the binaries too, as much as applicable. Could be a valuable addition. (In some cases it might not be possible/practical to publish machine-code binaries, such as when working with GPUs, or Java. These platforms tend to be JIT based, and hostile to dumping and restoring exact binaries.)

I agree with your final two paragraphs.

[0] https://news.ycombinator.com/item?id=24264376

ByteJockey · on Aug 25, 2020

> Code should be correct by construction, not correct by coincidence.

Glad we agree, if you're aware of how your compiler handles these things, you can construct it to be correct in this way.

It won't be portable at all (even to the next patch version of the compiler), I would never let it pass a code review, but that doesn't sound like an issue that's relevant here.

MaxBarraclough · on Aug 26, 2020

> if you're aware of how your compiler handles these things, you can construct it to be correct in this way.

I presume we agree but I'll do my usual rant against UB: Deliberately introducing undefined behaviour into your code is playing with fire, and trying to outsmart the compiler is generally a bad idea. Unless the compiler documentation officially commits to a certain behaviour (rollover arithmetic for signed types, say), then you should take steps to avoid undefined behaviour. Otherwise, you're just going with guesswork, and if the compiler generates insane code, the standards documents define it to be your fault.

It might be reasonable to make carefully disciplined and justified exceptions, but that should be done very cautiously. JIT relies on undefined behaviour, for instance, as ultimately you're treating an array as a function pointer.

> It won't be portable at all (even to the next patch version of the compiler)

Right, doing this kind of thing is extremely fragile. Does it ever crop up in real-life? I've never had cause to rely on this kind of thing.

It would be possible to use a static assertion to ensure my code only compiles on the desired compiler, preventing unpleasant surprises elsewhere, but I've never seen a situation where it's helpful.

This isn't the same thing as relying on 'ordinary' compiler-specific functionality, such as GCC's fixed-point functionality. Such code will simply refuse to compile on other compilers.

> I would never let it pass a code review, but that doesn't sound like an issue that's relevant here.

Disagree. It should be possible to independently reproduce the experiment. Robust code helps with this. Code shouldn't depend on an exact compiler version, there's no good reason code should.

jnxx · on Aug 24, 2020

> After all, I'm wasting 2 months of my PhD for the marriage of my own results with known results which -in principle- could have been done within one day if the code base would allow for it.

Sounds like it is quite good science to do that, because it puts the computation on a pair of independent feet.

Otherwise, it could just be that the code you are using as a bug and nobody notes until it is too late.

martingab · on Aug 25, 2020

I see your and MaxBarraclough concerns. In my case, there exist 5-6 codes which do -at their core- the same thing as ours does and they all have been cross-checked against each other within either theoretical or numerical precision (where possible). That's the spirit that sjburt was referring to, I guess, and which triggered me because it is only true to a certain extend.

The cross-checking is anyways good scientific practise, not only because of bugs in the code (that's actually a sub-leading problem imho), but because of the degree of difficulty of the problems and the complexity of their solutions (and their reproducibility). In that sense, cross-checking should discover both, scientific "bugs" and programming-bugs. The "debugging" is partly also done at the community level - at least in our field of research.

However, it is also a matter of efficiency. I -and many others too- need to re-implement not because of bug-hunting/cross-checking but simply because we do not understand the "ugly" code of our colleagues and instead of taking the risk to break existing code we simply write new one which is extremely inefficient (others may take the risk and then waste months on debugging and reverse-engineering which is also inefficient). So my point on writing "good code" is not so much about avoiding bugs but about being kind to you colleagues, saving them nerves and time (which they can then spend on actual science) and thus also saving taxpayers money...

jnxx · on Aug 24, 2020

> If you cannot the release the code because the code is too ugly or untested or has bugs, how do you expect anyone with the right expertise to assess your findings?

Yes, that should be this way.

Also all cases where some company research team goes to a scientific conference and presents a nifty solution for problem X without telling how it was purportedly done, it should be absolutely required to publish code and data for this.

*And that's also something which is broken about software patents - patents are about open knowledge, software which uses such patents is not open - this combination should not be allowed at all).

jnxx · on Aug 24, 2020

With the caveat that while in some cases, like computational science, numerical analysis, machine learning algorithms, computer-assisted proofs, and so on, details of the code could be crucial, in other cases, they should not matter that much. I too have the impression that the HN public tends to over-value the importance of code in these cases when it is mostly a tool for evaluating a scientific result.

Jabbles · on Aug 24, 2020

I am interested to know the distinction between "production-ready" and "science-ready" code.

I do not think "non-experts" should be able to use your code, but I do think an expert who was not involved in writing it should be.

dmlorenzetti · on Aug 24, 2020

Hard-coded file paths for input data. File paths hard-coded to use somebody's Google Drive so that it only runs if you know their password. Passwords hard-coded to get around the above problem.

In-code selection statements like `if( True ) {...}`, where you have no idea what is being selected or why.

Code that only runs in the particular workspace image that contains some function that was hacked out to make things work during a debugging session 5 years ago.

Distributed projects where one person wrote the preprocessor, another wrote the simulation software, and a third wrote the analysis scripts, and they all share undocumented assumptions worked out between the three researchers over the course of two years.

Depending on implementation-defined behavior (like zeroing out of data structures).

Function and variable names, like `doit()` and `hold`, which make it hard to understand the intention.

Files that contain thousands of lines of imperative instructions with documentation like "Per researcher X" every 100 lines or so.

Code that runs fine for 6 hours, then stops because some command-line input had the wrong value.

I've seen all of these over the years. Even as a domain expert who has spoken directly with authors and project leads, this kind of stuff makes it very hard to tease out what the code actually does, and how the code corresponds to the papers written about the results.

mroche · on Aug 24, 2020

You’re giving me flashbacks! I spent a year as an admin on an HPC cluster at my university building tools/software and helping researchers get their projects running and re-lead the implementation of container usage. The amount of scientific code/projects that required libraries/files to be in specific locations, or assumed that everything was being run from a home directory, or sourced shell scripts at run time (that would break in containers) was staggering. A lot of stuff had the clear “this worked on my system so...” vibe about it.

As an admin it was quite frustrating, but I understand it sometimes when you know the person/project isn’t tested in a distributed environment. But when it’s the projects that do know how they’re used and still do those things...

petschge · on Aug 24, 2020

One example: My code used to crash for a long time if you set the thermal speed to something greater than the speed if light. Should the code crash? No. And by now I have found the time to write extra code to catch the error and midly insult the user (It says "Faster than light? Please share that trick with me!") Does it matter? No. It didn't run and give plausible-but-wrong results. So that is code that I would call "science-ready" but I wouldn't want it criticized by people outside my domain.

jnxx · on Aug 24, 2020

I don't think that would be any problem (why should it?).

Code exhibiting undefined behavior is a different kettle of fish...

petschge · on Aug 24, 2020

Which is why I run valgrind on my code (with a parameter file containing physically valid inputs) to get rid of all undefined behavior. But I gave up on running afl-fuzz, because all it found was crashes following from physically invalid inputs. I fixed the obvious once to make the code nicer for new users, but once afl started to find only very creative corner cases I stopped.

jnxx · on Aug 24, 2020

Well done!

gowld · on Aug 24, 2020

Then you publish your work and critics publish theirs and the community decides which claims have proven their merit. This is the fundamental structure of the scientific community.

How is "your code has error and I rebuke you" a more painful critique than "you are hiding your methodology and so I rebuke you"?

petschge · on Aug 24, 2020

Nothing limits the field of critics to people who have written their own code and know what they are doing.

lemmsjid · on Aug 24, 2020

There's a ton of overlap, because science code might be a long running, multi-engineer distributed system and production code might be a script that supports a temporary business process. But let's assume production ready is a multi customer application and science ready is computations to reproduce results in a paper.

Here's a quick pass, I'm sure I'm missing stuff, but I've needed to code review a lot of science and production output and below is how I tend to think of it, especially taking efficiency of engineer/scientist time into account.

Production Ready?

* code well factored for extensibility, feature change, and multi-engineer contribution

* robust against hostile user input

* unit and integration tested

Science Ready?

* code well factored for readability and reproducibility (e.g. random numbers seeded, time calcs not set against 'now')

* robust against expected user input

* input data available? testing optional but desired, esp unit tests of algorithmic functions

* input data not available? a schema-correct facsimile of input data available in a unit test context to verify algorithms correct

Both?

* security needs assessed and met (science code might be dealing with highly secure data, as might production code)

* performance and stability needs met (production code more often requires long term stability, science sometimes needs performance within expected Big O to save compute time if it's a big calculation)

PeterisP · on Aug 24, 2020

Your requirements seem to push 'Science ready' far into what I'd consider "worthless waste of time", coming from the perspective of code that's used for data analysis for a particular paper.

The key aspect of that code is that it's going to be run once or twice, ever, and it's only ever going to be run on a particular known set of input data. It's a tool (though complex) that we used (once) to get from A to B. It does not need to get refactored, because the expectation is that it's only ever going to be used as-is (as it was used once, and will be used only for reproducing results), it's not intended to be built upon or maintained. It's not the basis of the research, it's not the point of research, it's not a deliverable in that research, it's just a scaffold that was temporarily neccessary to do some task - one which might have been done manually earlier through great effort, but that's automated now. It's expected that the vast majority of the readers of that paper won't ever need to touch that code, they care only about the results and a few key aspects of the methodology, which are (or should be) all mentioned in the paper.

It should be reproducible to ensure that we (or someone else) can obtain the same B from A in future, but that's it, it does not need to be robust to input that's not in the input datafile - noone in the world has another set of real data that could/should be processed with that code. If after a few years we or someone else will obtain another dataset, then (after those few years, if that dataset happens) there would be a need to ensure that it works on that dataset before writing a paper about that dataset, but it's overwhelmingly likely that you'd want to modify that code anyway both because that new dataset would not be 'compatible' (because the code will be tightly coupled to all the assumptions in the methodology you used to get that data, and because it's likely to be richer in ways you can't predict right now) and you'd want to extend the analysis in some way.

It should have a 'toy example' - what you call 'a schema-correct facsimile of input data' that's used for testing and validation before you run it on the actual dataset, and it should have test scenarios and/or unit tests that are preferably manually verifiable for correctness.

But the key thing here is that no matter what you do, that's still in most cases going to be "write once, run once, read never" code, as long as we're talking about the auxiliary code that supports some experimental conclusions, not the "here's a slightly better method for doing the same thing" CS papers. We are striving for reproducible code, but actual reproductions are quite rare, the incentives are just not there. We publish the code as a matter of principle, knowing all well that most likely noone will download and read it. The community needs the possibility for reproduction for the cases where the results are suspect (which is the main scenario where someone is likely to attempt reproducing that code), it's there to ensure that if we later suspect that the code is flawed in a way where the flaws affect the conclusions then we can go back to the code and review it - which is plausible, but not that likely. Also, if someone does not trust our code, they can (and possibly should) simply ignore it and perform a 'from scratch' analysis of the data based what's said in the paper. With a reimplementation, some nuances in the results might be slightly different, but all the conclusions in the paper should still be valid, if the paper is actually meaningful - if a reimplementation breaks the conclusions, that would be a successful, valuable non-reproduction of the results.

This is a big change from industry practice where you have mantras like "a line of code is written once but read ten times", in a scientific environment that ratio is the other way around, so the tradeoffs are different - it's not worth investing refactoring time to improve readability, if it's expected that most likely noone will ever read that code; it makes sense to spend that effort only if and when you need it.

lemmsjid · on Aug 24, 2020

Yep! I don't disagree with anything you're saying when I think from a particular context. It's really hard to generalize about the needs of 'science code', and my stab at doing so was certain to be off the mark for a lot of cases.

PeterisP · on Aug 24, 2020

Yes, there are huge differences between the needs of various fields. For example, some fields have a lot of papers where the authors are presenting a superior method for doing something, and if code is a key part of that new "method and apparatus", then it's a key deliverable of that paper and its accessibility and (re-)usability is very important; and if a core claim of their paper is that "we coded A and B, and experimentally demonstrated that A is better than B" then any flaws in that code may invalidate the whole experiment.

But I seem to get the vibe that this original Nature article is mostly about the auxiliary data analysis code for "non-simulated" experiments, while Hacker News seems biased towards fields like computer science, machine learning, etc.

dandelion_lover · on Aug 24, 2020

> the distinction between "production-ready" and "science-ready" code

In the first case, you must take into account all (un)imaginable corner cases and never allow the code to fail or hang up. In the second case it needs to produce a reproducible result at least for the published case. And do not expect it to be user-friendly at all.

arethuza · on Aug 24, 2020

I would regard (from experience) "science ready" code as something that you run just often enough to get the results to create publications.

Any effort to get code working for other people, or documented in any way would probably be seen as wasted effort that could be used to write more papers or create more results to create new papers.

This kind of reasoning was one of the many reasons I left academic research - I personally didn't value publications as deliverables.

chriswarbo · on Aug 24, 2020

My experience has been similar.

Still, there's plenty of room to encourage good(/better) practices which cost essentially nothing, e.g. using $PWD rather than /home/bob/foo

gowld · on Aug 24, 2020

If your experiment is not repeatable, it's an anecdote not data.

Any effort to write a paper readable for other people, or document the experiment in any way would probably be seen as wasted effort that could be used to create more results.

The "don't show your work" argument only makes sense if you are doing PR, not science.

neutronicus · on Aug 24, 2020

If it's repeatable by you then it's a trade secret, not an anecdote

arethuza · on Aug 25, 2020

I specifically got told off by my supervisor for trying to "improve" some of the software we were working on!

qppo · on Aug 24, 2020

Disclaimer, I'm a professional engineer and not a researcher.

The kind of code I'll ship for production will include unit testing designed around edge or degenerate cases that arose from case analysis, usually some kind of end to end integration test, aggressive linting and crashing on warnings, and enforcing of style guidelines with auto formatting tools. The last one is more important than people give it credit for.

For research it would probably be sufficient to test that the code compiles and given a set of known valid input the program terminates successfully.

searine · on Aug 24, 2020

>I am interested to know the distinction between "production-ready" and "science-ready" code.

In general, scientists don't care how long it takes or how many resources the code uses. It is not a big deal to run a script for an extra hour, or use up a node of supercomputer. Extravagent solutions or added packages to make the code run smoother or faster is only wasting time. It speed/elegance only really matters when you know the code is going to be distributed to the community.

Basically scientists only care if the result, is true. If the result it outputs is sensible, defensible, reliable, reproducible. It would be considered a dick move to criticism someones code, if the code was proven to produce the correct result.

jnxx · on Aug 24, 2020

> It would be considered a dick move to criticism someones code, if the code was proven to produce the correct result.

Formal proof is much much harder than making code understandable and reviewable. It can be done but it is not easy, and can yield surprising results:

https://en.wikipedia.org/wiki/CompCert

http://envisage-project.eu/proving-android-java-and-python-s...

Jabbles · on Aug 24, 2020

Do you know how you could get to the state that "the code was proven to produce the correct result"?

If not by unit tests, code review or formal logic, then what?

jabirali · on Aug 24, 2020

Not all scientific code is amenable to unit testing. From my own experience from a PhD in condensed matter physics, the main issue was that how important equations and quantities “should” behave by themselves was often unknown or undocumented, so very often each such component could only be tested as part of a system with known properties.

You can then use unit testing for low-level infrastructure (e.g. checking that your ODE solver works as expected), but do the high-level testing via scientific validation. The first line of defense is to check that you don’t break any laws of physics, e.g. that energy and electric charge is conserved in your end results. Even small implementation mistakes can violate these.

Then you search for related existing publications of a theoretical or numerical nature, trying to reproduce their results; the more existing research your code can reproduce, the more certain you can be that it is at least consistent with known science. If this fails, you have something to guide your debugging; or if you’re very lucky, something interesting to write a paper about :).

The final validation step is of course to validate against experiments. This is not suited for debugging though, since you can’t easily say whether a mismatch is due to a software bug, experimental noise, neglected effects in the mathematical model, etc.

searine · on Aug 24, 2020

>If not by unit tests, code review or formal logic, then what?

Cross referencing independent experiments and external datasets.

Science doesn't work like software. The code can be perfect and still not give results that reflect reality. The code can be logical and not reflect reality. Most scientists I know go in with the expectation that "the code is wrong" and its results must be validated by at least one other source.

analog31 · on Aug 24, 2020

I'm a scientist in a group that also includes a software production team. For me, the standard of scientific reproducibility is that a result can be replicated by a reasonably skilled person, who might even need to fill in some minor details themselves.

Part of our process involves cleaning up code to a higher state of refinement as it gets closer to entering the production pipeline.

I've tested 30 year old code, and it still runs, though I had to dig up a copy of Turbo Pascal, and much of it no longer exists in computer readable form but would have to be re-entered by hand. Life was actually simpler back then -- with the exception of the built-ins of Turbo Pascal, it has no dependencies.

My code was in fact adopted by two other research groups with only minor changes needed to suit slightly different experimental conditions. It contained many cross-checks, though we were unaware of modern software testing concepts at the time.

For a result to have broader or lasting impact, replication is not enough. The result has to fit into a broader web of results that reinforce one another and are extended or turned into something useful. That's the point where precise replication of minor supporting results becomes less important. The quality of any specific experiment done in support of modern electromagnetic theory would probably give you the heebie jeebies, but the overall theory is profoundly robust.

The same thing has to happen when going from prototype to production. Also, production requires what I call push-button replication. It has to replicate itself at the click of a mouse, because the production team doesn't have domain experts who can even critique the entirety of their own code, and maintaining their code would be nearly impossible if it didn't adhere to standards that make it maintainable by multiple people at once.

Jabbles · on Aug 24, 2020

This sounds great. In your opinion, do you think your team is unusual in those aspects? Do you have any knowledge of the quality of code in other branches of physics or other sciences?

analog31 · on Aug 24, 2020

Well, I know the quality of my own code before I got some advice. And I've watched colleagues doing this as well.

My own code was quite clean in the 1980s, when the limitations on the machines themselves tended to keep things fairly compact with minimal dependencies. And I learned a decent "structured programming" discipline.

As I moved into more modern languages, my code kind of degenerated into a giant hairball of dependencies and abstractions. "Just because you can do that, doesn't mean you should." I've kind of learned that the commercial programmers limit themselves to a few familiar patterns, and if you try to create a new pattern for every problem, your code will be hard to hand off.

Scientists would benefit from receiving some training in good programming hygiene.

chrchang523 · on Aug 24, 2020

Nit: implementations of Monte Carlo methods are not necessarily nondeterministic. Whenever I implement one, I always aim for a deterministic function of (input data, RNG seed, parallelism, workspace size).

petschge · on Aug 24, 2020

It really helps with debugging if your MC code is deterministic for a given input seed. And then you just run for a sufficient number of different seeds to sample the probability space.

vngzs · on Aug 24, 2020

Alternatively: seed the program randomly by default, but allow the user to specify a seed as a CLI argument or function argument (for tests).

In the common case, the software behaves as expected (random output), but it is reproducible for tests. You can then publish your RNG seed with the commit hash when you release your code/paper, and others may see your results and investigate that particular code execution.

petschge · on Aug 24, 2020

Sure that works too. But word of advice from real life: Print the random seed at the beginning of the run so you can find out which seed caused it to crash or do stupid things.

jnxx · on Aug 24, 2020

And it seems that the people from Imperial College have done that with their epidemiological simulation. What critics claim is that their code produces non-deterministic results when given deterministic input and random seeds, i.e. that their code is seriously broken. Which would be a serious issue if true.

jnxx · on Aug 25, 2020

To be more specific, the critics claim the code would yield completely different results.

halfdan · on Aug 24, 2020

I have done research on Evolutionary Algorithm and numerical optimization. It was nigh impossible to reproduce poorly described algorithms from state of the art research at the time and researchers would very often not bother to reply to inquiries for their code. Even if you did get the code it would be some arcane C only compatible with a GCC from 1996.

Code belongs with the paper. Otherwise we can just continue to make up numbers and pretend we found something significant.

genewitch · on Aug 25, 2020

In 2006 or 2008 a university in England published some fluff about genetic/evolutionary algorithms that were evolving circuits on an fpga, specifically the published stuff regarded an fpga without a clock was able to differentiate between two tones.

I've spent the intervening years trying to find a way to implement this myself, going as far as to buy things like the ice40 fpga because the bitstreams are supposedly unlocked; this is a pre-req for modifying the actual gate/logic on the chip.

I've emailed the professor listed as the headliner in the articles published about it to no avail.

Nearly my entire adult life has been spent reading some interesting article, chasing down the paper, finding out if any code was published, and seeing if I could run the code myself.

It wasn't until machine learning with pytorch became mainstream that I started having luck replicating results. Just some more data points for this discussion.

jacobwilliamroy · on Aug 25, 2020

Our first job as scientists is to make sure we're not fooling ourselves. I wouldn't just use any old scale to take a measurement. I want a calibrated scale, adjusted to meet a specific standard of accuracy. Such standards and calibrations ensure we can all get "the same" result doing "the same" thing, even if we use different equipment from different vendors. The concerns about code are exactly the same. It's even scarier to me because I realize that unlike a scale, most scientists have no idea how to calibrate their code to ensure accurate, reproducible results. Of course with the scales, the calibration is done by a specialized professional who's been trained to calibrate scales. Not sure how we solve this issue with the code.

woah · on Aug 24, 2020

I’m very puzzled by this attitude. As an accelerator physicist, would you want you accelerator to be held together by duct tape, and producing inconsistent results? Would you complain that you’re not a professional machinist when somebody pointed it out? Why is software any different than hardware in this respect?

enriquto · on Aug 24, 2020

> I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.

> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

I guess you would not need to engage personally with the idiots at "acceleratorskeptics.com", but likely most of their critique would be easily shut off by a simple sentence such as this one. Since most of your readers would not be idiots, they could scrutinize your code and even provide that reply on your behalf. This is called the scientific method.

I agree that you produce science, not merely code. Yet, the code is part of the science and you are not really publishing anything if you hide that part. Criticizing scientific code because it is bad software engineering is like criticizing it because it uses bad typography. You should not feel attacked by that.

spamizbad · on Aug 24, 2020

> In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.

How would a layperson identify a faulty critique? It would be picked up by the media who would do their usual “both sides” thing.

enriquto · on Aug 24, 2020

Not that they abstain from doing that shit today, when code is not often published.

An educated and motivated layperson at least would have the chance to learn whether the critique is faulty. Today, with secret code, it is impossible to verify for almost everybody.

shirakawasuna · on Aug 24, 2020

Race conditions and certain forms of non-determinism could invalidate the results of a given study. Code is essentially a better-specified methods section, it just says what they did. Scientists are expected to include a methods section for exactly this reason, and any scientist worried about including a methods section in their paper would be rightly rejected.

However, a methods section is always under-specified. Code provides the unique opportunity to actually see the full methods on display and properly review their work. It should be mandated by all reputable journals and worked into the peer review process.

ativzzz · on Aug 24, 2020

While you're running experiments, it doesn't matter, but publishing any sort of result or using your code in parts of other publishable code IS production code, and you should treat it as such.

pbalau · on Aug 24, 2020

> people claiming that their non-software engineering grade code invalidates the results of their study.

But that's exactly the problem.

Are you familiar with that bug in early Civ games where an overflow was making Ghandi nuke the crap out of everyone? What if your code has a similar issue?

What if you have a random value right smack in the middle of your calculations and you just happened to be lucky when you run your code?

I'm not that familiar with Monte Carlo, my understanding is that this is just a way to sample the data. And I won't be testing your data sampling, but I will expect that given the same data to your calculations part (eg, after the sampling happens), I get exactly the same results every time I run the code and on any computer. And if there are differences I expect you to be able to explain why they don't matter, which will show you were aware of the differences in the first place and you were not just lucky.

And then there is the matter of magic values that plaster research code.

Researchers should understand that the rules for "software engineering grade code" are not there just because we want to complicate things, but because we want to make sure the code is correct and does what we expect it to do.

/edit: The real problem is not getting good results with faulty code, is ignoring good solutions because faulty code.

touristtam · on Aug 25, 2020

> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.

If the proof on which the paper is based is in the code that produced the evidence, you absolutely need to be able to let a lambda user run it without specific knowledge to abide to the reproducible principle. Asking a reviewer to fiddle about like a IT professional to get something working is bound to promote lazy reviewing and either will result into dismissing the result or approval without real review.

And by the way producing a paper could be argued it isn't really science either, but if you are working with MSFT Office, you know there is a fair amount of non science work hours that has been put into that as well.

eru · on Aug 25, 2020

> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

Not so fast. Monte Carlo code turns arbitrary RNG seeds into outputs. That process can, and arguably should be, deterministic.

To do your study, you feed your Monte Carlo code 'random enough' seeds. Coming up with the seeds does not need to be deterministic. But once the seeds are fixed, the rest can be deterministic. Your paper should probably also publish the seeds used, so that people can reproduce everything. (And so they can check whether your seeds are carefully chosen, or really produce typical outcomes.)

CamperBob2 · on Aug 25, 2020

I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

Sure, and that rationale works OK when your code operates in a limited, specialized domain.

But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.

jnxx · on Aug 25, 2020

> Sure, and that rationale works OK when your code operates in a limited, specialized domain.

There are a lot of domains which do have a deep impact on society but are completely underfunded. For example research on ecology and declining insect populations. Or research on education. And domains like epidemiology, climate change research, cancer research or such are not known to earn better salaries to scientists as on average. Most scientists earn a pity.

> But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.

What one can expect from scientists, in whatever subject they work, is honesty, integrity, a full account of their findings. What you can't expect is that they just turn into expert software engineers or make their working code beautiful. You can't expect them to work for free. What the academic system demands from them is that they work on their next paper instead, so if you want pretty code, you need at least in part to change the system.

spott · on Aug 27, 2020

>when that is the entire point of Monte Carlo methods and doesn't change their result.

Two nitpicks: a) it shouldn't change the conclusions, but MC calculations will get different results depending on the seed. and b) it is considered good practice in reproducible science to fix the seed so that the results of subsequent runs give exactly the same results.

Ultimately, I think there is a balance: really poor code can lead to incorrect conclusions... but you don't need production ready code for scientific exploration.

selectionbias · on Aug 25, 2020

Sorry to be pedantic, but although Monte Carlo simulations are based on pseudo-randomness, I still think it is good practice that they have deterministic results (i.e., use a given seed) so that the exact results can be replicated. If the precise numbers can be reproduced then a) it helps me as a reviewer see that everything is kosher with their code and b) it means that if I tweak the code to try something out my results will be fully compatible with theirs.

oliver101 · on Aug 24, 2020

Why is "doing software engineering" not "doing science"?

Anybody who has conducted experimental research will say they spent 80% of the time using a hammer or a spanner. Repairing faulty lasers or power supplies. This process of reliable and repeatable experimentation is the basis of science itself.

Computational experiments must be held to the same standards as physical experiments. They must be reproducible and they should be publicly available (if publicly funded).

mikemotherwell · on Aug 25, 2020

What are the frameworks used in scientific endeavours? Given that scaling is not an issue, something like Rails for science seems like it could potentially return many $(B/M)illions of dollars for humanity.

booleandilemma · on Aug 24, 2020

What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.

Sounds like I should just become a scientist then.

Do you guys write unit tests or is that beneath you too?

jnxx · on Aug 24, 2020

edit: please read the grandchild comment before going off on the idea that some random programmer on the Internet dares to criticize scientific code he does not understand. What is crucial in the argument here is indeed the distinction between methods employing pseudo-randomness, like Monte Carlo simulation, and non-determinism caused by undefined behavior.

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.

The person which wrote the linked blog post writes that it was a software engineer at google. Unfortunately, that claim is not falsifiable as the person decided to remain anonymous.

> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.

The claim is that even with the same random seed for the random generator, the program produces different results, and this is explained by the allegation that it runs non-deterministic (in the sense of undefined behavior) in multiple threads. It claims also that it produces significantly different results depending on which output file format is chosen.

If this is true, the code would have race conditions, and as being impacted by race conditions is a form of undefined behavior, this would make any result of the program questionable, as the program would not be well-defined.

Personally, I am very doubtful whether this is true, this would be incredibly sloppy by the imperial college scientists. Some more careful analysis by a recognized programmer might be warranted.

However it underlines well the importance of the main topic that scientific code should be open to analysis.

> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.

Fully agree with this. But it should try to document its limitations.

aspaceman · on Aug 24, 2020

> If this is true, the code would have race conditions, and as being impacted by race conditions is a form of undefined behavior, this would make any result of the program questionable, as the program would not be well-defined.

That’s not at all what that means. What are you talking about? As long as a Monte Carlo process works towards the same result it’s equivalent.

You’re speaking genuine nonsense as far as I’m concerned. Randomness doesn’t imply non deterministic. Non-determinitism in no way implies race conditions or undefined behavior. We care that the random process reaches the same result, not that the exact sequence of steps is the same.

This is what scientists are talking about. A bunch of (pretty stupid) nonexperts want to criticize your code, so they feel smart on the internet.

jnxx · on Aug 24, 2020

I am referring to this blog post:

https://lockdownsceptics.org/code-review-of-fergusons-model/

It says, word-by-word:

> Clearly, the documentation wants us to think that, given a starting seed, the model will always produce the same results.

>

>Investigation reveals the truth: the code produces critically different results, even for identical starting seeds and parameters.

> I’ll illustrate with a few bugs. In issue 116 a UK “red team” at Edinburgh University reports that they tried to use a mode that stores data tables in a more efficient format for faster loading, and discovered – to their surprise – that the resulting predictions varied by around 80,000 deaths after 80 days: ...

The bugs which the blog post implies here are such ones as described by Jens Regehr: https://blog.regehr.org/archives/213

Not that I do not endorse these statements in the blog - I am rather skeptical whether they are true at all.

What the authors of the blob post mean is clearly "undefined behaviour" in the sense of non-deterministic program execution of a program that is not well-formed. It is clear that many non-experts could confuse that with the pseudo-randomness implicit in Monte-Carlo simulations, but this is a very different thing. The first is basically a broken, invalid, and untrustworthy program. The second is the established method to produce a computational result by introducing stochastic behavior, which is for example how modern weather models work.

These are wildly different things. I do not understand why your comment just adds to the confusion between these two things??

> A bunch of (pretty stupid) nonexperts want to criticize your code, so they feel smart on the internet.

As said, I don't endorse the critique in the blog. However, critique in a software implementation, as well as in scientific matters, should never carry a call on authority - it should logically explain what is the problem, with concrete points. Unfortunately, the cited blog post remains very vague about this, while claiming:

> My background. I have been writing software for 30 years. I worked at Google between 2006 and 2014, where I was a senior software engineer working on Maps, Gmail and account security. I spent the last five years at a US/UK firm where I designed the company’s database product, amongst other jobs and projects. I was also an independent consultant for a couple of years.

It would be much better if, instead claiming that there could be race conditions, it could point to lines in the code with actual race conditions, and show how the results of the simulation are different when the race conditions are fixed. Otherwise, it just looks like he claims that the program is buggy, because he is in no position to question the science, and does not like the result.

jnxx · on Aug 24, 2020

There is something I need to add, it is a subtle but important point:

Non-determinism can be caused by

a) random seeds derived from hardware, such as seek times in a HDD controller, which is fed into pseudo random number (PRNG) generation. This is not a problem. For debugging, or comparison, it can make sense to switch it off, though.

b) data race conditions, which is a form of undefined behavior. This not only can dramatically change results of a program run, but also invalidates the program logic, in languages such as C and C++. This is what he blog post in "lockdownskeptics.org" suggests. For the application area and its consequences, this would be a major nightmare.

c) What I had forgotten is that parallel execution (for example in LAM/MPI, map/reduce or similar frameworks) is inherently non-deterministic and, in combination with properties of floating-point computation, can yield different but valid results.

Here an example:

A computation is carried out on five nodes and they return the values 1e10, 1e10, 1e-20, -1e10, -1e10, in random order. The final result is computed by summing these up.

Now, the order of computation could be:

((((1e10 + 1e10) + 1e-20) + -1e10) + -1e10)

or it could be:

(((1e10 + -1e10) + 1e-20) + (+1e10 + -1e10))

In the first case, the result would be zero, in the second case, 1e-20, because of the finite length of floating point representation.

_However_... if the numerical model or simulation or whatever is stable, this should not lead to a dramatic qualitative difference in the result (otherwise, we have a stability problem with the model).

Finally, I want to cite one last paragraph from the post on lockdownskeptics.org:

> Conclusions. All papers based on this code should be retracted immediately. Imperial’s modelling efforts should be reset with a new team that isn’t under Professor Ferguson, and which has a commitment to replicable results with published code from day one.

> On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector. Insurers employ modellers and data scientists, but also employ managers whose job is to decide whether a model is accurate enough for real world usage and professional software engineers to ensure model software is properly tested, understandable and so on. Academic efforts don’t have these people, and the results speak for themselves.

UncleMeat · on Aug 24, 2020

Race conditions aren't undefined behavior in C/C++. Data races are. Lots and lots of real systems contain race conditions without catastrophe.

jnxx · on Aug 24, 2020

> Race conditions aren't undefined behavior in C/C++. Data races are.

You are right with the distinction, I had data race conditions in mind.

Race conditions can well happen in a correct C/C++ multi-threaded program in the sense that the order of specific computation steps is sometimes random. And for operations such as floating-point addition, where order of operations does matter, the exact result can be random as a consequence. But the end result should not depend dramatically on it (which is what the poster at lockdownskeptics.org claims).

beefee · on Aug 24, 2020

I want science to be held to a very high standard. Maybe even higher than "software engineering grade". Especially if it's being used as a justification for public policy.

MaxBarraclough · on Aug 24, 2020

Perhaps just a nitpick: software engineering runs the gamut from throwing together a GUI in a few hours, all the way up to avionics software where a bug could kill hundreds. There's no such thing as 'software engineering grade'.

kordlessagain · on Aug 24, 2020

> people that don't understand the material making low effort critiques of minor technical points

GPT-3 FTW!

MaxBarraclough · on Aug 24, 2020

At the risk of just mirroring points which have already been made:

> you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.

It's profoundly unscientific to suggest that researchers should be given the choice to withhold details of their experiments that they fear will not withstand peer review. That's much of the point of scientific publication.

Researchers who are too ashamed of their code to submit it for publication, should be denied the opportunity to publish. If that's the state of their code, their results aren't publishable. Unpublishable garbage in, unpublishable garbage out. Simple enough. Journals just shouldn't permit that kind of sloppiness. Neither should scientists be permitted to take steps to artificially make it difficult to reproduce (in some weak sense) an experiment. (Independently re-running code whose correctness is suspect, obviously isn't as good as comparing against a fully independent reimplementation, but it still counts for something.)

If a mathematician tried to publish the conclusion of a proof but refused to show the derivation, they'd be laughed out of the room. Why should we hold software-based experiments to such a pitifully low standard by comparison?

It's not as if this is a minor problem. Software bugs really can result in incorrect figures being published. In the case of C and C++ code in particular, a seemingly minor issue can result in undefined behaviour, meaning the output of the program is entirely unconstrained, with no assurance that the output will resemble what the programmer expects. This isn't just theoretical. Bizarre behaviour really can happen on modern systems, when undefined behaviour is present.

A computer scientist once told me a story of some students he was supervising. The students had built some kind of physics simulation engine. They seemed pretty confident in its correctness, but in truth it hadn't been given any kind of proper testing, it merely looked about right to them. The supervisor had a suggestion: Rotate the simulated world by 19 degrees about the Y axis, run the simulation again, and compare the results. They did so. Their program showed totally different results. Oh dear.

Needless to say, not all scientific code can so easily be shown to be incorrect. All the more reason to subject it to peer review.

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.

Why would you care? Science is about advancing the frontier of knowledge, not about avoiding invalid criticism from online communities of unqualified fools.

I sincerely hope vaccine researchers don't make publication decisions based on this sort of fear.

RandoHolmes · on Aug 24, 2020

> people claiming that their non-software engineering grade code invalidates the results of their study.

How exactly is this a bad thing?

> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.

But it should be noted that what you didn't say is that you're here to turn out accurate science.

This is the software version of statistics. Imagine if someone took a random sampling of people at a Trump rally and then claimed that "98% of Americans are voting for Trump". And now imagine someone else points out that the sample is biased and therefore the conclusion is flawed, and the response was "Hey, I'm just here to do statistics".

---

Do you see the problem now? The poster above you pointed out that the conclusions of the software can't be trusted, not that the coding style was ugly. Most developers would be more than willing to say "the code is ugly, but it's accurate". What we don't want is to hear "the conclusions can't be trusted and 100 people have spent 10+ years working from those unreliable conclusions".

auntienomen · on Aug 24, 2020

Oh, he didn't say 'accurate science', nice gotcha!

This is exactly the sort of pedantic cluelessness that scientists are seeking to avoid by not publishing their code.

RandoHolmes · on Aug 24, 2020

I don't consider accuracy in science to be pedantic, and I suspect most others don't either.

To paraphrase what the other developer said: "I don't want my work to be checked, I'm not here for accuracy, just the act of doing science".

When I was young, the ability to invalidate was the core aspect of science, but apparently that's changed over the years.

dandelion_lover · on Aug 24, 2020

As a theoretical physicist doing computer simulations, I am trying to publish all my code whenever possible. However all my coauthors are against that. They say things like "Someone will take this code and use it without citing us", "Someone will break the code, obtain wrong results and blame us", "Someone will demand support and we do not have time for that", "No one is giving away their tools which make their competitive advantage". This is of course all nonsense, but my arguments are ignored.

If you want to help me (and others who agree with me), please sign this petition: https://publiccode.eu. It demands that all publicly funded code must be public.

P.S. Yes, my 10-year-old code is working.

SilasX · on Aug 24, 2020

>"Someone will demand support and we do not have time for that",

Well ... that part isn't nonsense, though I agree it shouldn't be a dealbreaker. And it means we should work towards making such support demands minimal or non-existent via easy containerization.

I note with frustration that even the Docker people, whose entire job is containerization, can get this part wrong. I remember when we containerized our startup's app c. 2015, to the point that you should be able to run it locally just by installing docker and running `docker-compose up`, and it still stopped working within a few weeks (which we found when onboarding new employees), which required a knowledgeable person to debug and re-write.

(They changed the spec for docker-compose so that the new version you'd get when downloading Docker would interpret the yaml to mean something else.)

onhn · on Aug 24, 2020

As a theoretical physicist your results should be reproducible based on the content of your papers, where you should detail/state the methods you use. I would make the argument that releasing code in your position has the potential to be scientifically damaging; if another researcher interested in reproducing your results reads your code, then it is possible their reproduction will not be independent. However they will likely still publish it as such.

pthread_t · on Aug 24, 2020

> "No one is giving away their tools which make their competitive advantage"

This hits close to home. Back in college, I developed software, for a lab, for a project-based class. I put the code up on GitHub under the GPL license (some code I used was licensed under GPL as well), and when the people from the lab found out, they lost their minds. A while later, they submitted a paper and the journal ended up demanding the code they used for analysis. Their solution? They copied and pasted pieces of my project they used for that paper and submitted it as their own work. Of course, they also completely ignored the license.

bumby · on Aug 24, 2020

I’m curious, are dedicated software assurance teams a thing in your research area? Or is quality left up to the primary researchers?

BeetleB · on Aug 24, 2020

> Or is quality left up to the primary researchers?

Individual researchers, and in many disciplines (like physics), there is almost no emphasis on quality.

I left academia a decade ago, but at the time all except one of my colleagues protested when version control was suggested to them. Some of these have code in the 30-40K lines.

bumby · on Aug 24, 2020

I formerly worked in research, left and am now back in a quasi-research organization.

It’s bit disconcerting seeing how much quality is brushed aside particularly in software. Researchers seem to intuitively grasp how they need quality hardware to do their job, yet software rarely gets the same consideration. I’ve never been able to get many to come around to the idea that software should be treated the same as any other engineered product that enables their research

core-questions · on Aug 24, 2020

> protested when version control was suggested

Academics are strange like this. The root reason is fear: fear that you're complicating their process, that you're going to interrupt their productivity or flow state, that you're introducing complication that has no benefit. They then build up a massive case in their minds for why they shouldn't do this; good luck fighting it.

Doubly so if you're IT staff and don't have a PhD. There's a fundamental lack of respect on behalf of (a vocal minority) of academics about bit plumbers, until of course when they need us to do something laughably basic. It's the seeds of elitism; in reality we should be able to work together, each of us understanding our particular domain and working to help the other.

BeetleB · on Aug 24, 2020

> The root reason is fear: fear that you're complicating their process, that you're going to interrupt their productivity or flow state, that you're introducing complication that has no benefit.

Yes, but how does it compare to all the complicated processes that exist in academic institutions currently? Almost all of which originated from academics themselves, mind you.

core-questions · on Aug 24, 2020

It's not that complicated. No one individual process is that bad. The problem is that there's so many that you need to steep in it for ages to pick everything up.

This means it makes most sense to pick up processes that are portable and have longevity. Learning Git is a pretty solid example.

gowld · on Aug 24, 2020

I think this is why industry does better science than academia, at least in any area where there are applications. Generally, they get paid for being right, not just for being published, so they put respect and money into people that help get correct results.

jack_h · on Aug 24, 2020

I think this is a much wider problem than just in academia/research. Really any area where software isn't the primary product tends to have fairly lax software standards. I work in the embedded firmware field and best practices are often looked at with skepticism and even derision by the electrical engineers who are often the ones doing the programming^[1].

I think software development as a field is incredibly vast and diverse. Programming is an amazing tool, but it's a tool that requires a lot of knowledge in a lot of different areas.

^[1] This isn't universally true of course, I'm not trying to be insulting here.

gowld · on Aug 24, 2020

"quality" is a subjectit word. Let's be clear what this means:

Individual researchers, and in many disciplines (like physics), there is almost no emphasis on correct results, merely on believable results.

bumby · on Aug 24, 2020

There are a few standardized definitions. The most succinct bring “quality is the adherence to requirements”.

As an example, if your science has the requirement of being replicable (as it should) there are a host of best practices that should flow down to the software development requirements. Not implementing those best practices would be indicative of lower quality

dandelion_lover · on Aug 24, 2020

Most of the codes I am developing alone. No one else looks at them ever. My supervisor also develops the code alone and never shows it to anyone (not even members of the group).

In other cases, a couple of other researchers may have a look at my code or continue its development. I worked with 4+ research teams and only saw one professional programmer in one of them helping the development. Never heard about a "dedicated software assurance team".

SiempreViernes · on Aug 24, 2020

To clarify, nobody sees the code because they aren't allowed, or nobody ever ask to see it?

dandelion_lover · on Aug 24, 2020

The second case. However I am hesitating to ask to look at the code of my supervisor. How would I explain why I need it (if it's not needed for my research)? It's also unlikely user-friendly, so it would take a lot of time to understand anything.

bumby · on Aug 24, 2020

I think you touched on something important. Researchers are most concerned with “getting things working”.

One of my favorite points from the book Clean Code was that professional developers aren’t satisfied with “working code”, they aim to make it maintainable. Which may mean writing it in a way that is more clear and concise than we are used to

throwaway287391 · on Aug 24, 2020

> I’m curious, are dedicated software assurance teams a thing in your research area?

Are these a thing in any research area? I've heard of exactly one case of an academic lab (one that was easily 99th+ percentile in terms of funding) hiring one software engineer not directly involved in leading a research effort, and when I tell other academics about this they're somewhat incredulous. (I admittedly have a bit of trouble believing it myself -- I can't imagine the incentive to work for low academic pay in an environment where you're inevitably going to feel a sense of inferiority to first year PhD students who think they're hot shit because they're doing "research".)

bumby · on Aug 24, 2020

>Are these a thing in any research area

I can say there are some that have the explicit intent but it can often fall to the wayside due to cost pressure. For example, government funded research from large organizations (think DoD or NASA) have these quality requirements but they can often be hand-waved away or just plain ignored due to cost concerns

Vinnl · on Aug 24, 2020

Interestingly each of those arguments also applies to publishing an article describing your work.

arcanus · on Aug 24, 2020

> Scientists really need to publish their code artifacts, and we can no longer just say "Well they're scientists or mathematicians" and allow that as an excuse for terrible code with no testing specs.

You are blaming scientists but speaking from my personal experience as a computational scientist, this exists because there are few structures in place that incentivize strong programming practices.

* Funding agencies do not provide support for verification and validation of scientific software (typically)

* Few journals require assess code reproducibility and few require public code (few require even public data)

* There are few funded studies to reproduce major existing studies

Until these structural challenges are addressed, scientists will not have sufficient incentive to change their behavior.

> Scientific code needs to have tests, a minimal amount of test coverage, and code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer science journals.

I completely agree.

geoalchimista · on Aug 24, 2020

Second this. Research code is already hard, and with misaligned incentives from the funding agencies and grad school pipelines, it's an uphill battle. Not to mention that professors with an outdated mindset might discourage graduate students from committing too much time to work on scientific code. "We are scientists, not programmers. Coding doesn't advance your career" is often an excuse for that.

In my opinion, enforcing standards without addressing this root cause is not gonna fix the problem. Worse, students and early career researchers will bear the brunt of increased workload and code compliance requirements from journals. Big, well-funded labs that can afford a research engineer position is gonna have an edge over small labs that cannot do so.

bartvbl · on Aug 24, 2020

The graphics community has started an interesting initiative at this end: http://www.replicabilitystamp.org/

After a paper has been accepted, authors can submit a repository containing a script which automatically replicates results shown in the paper. After a reviewer confirms that the results were indeed replicable, the paper gets a small badge next to its title.

While there could certainly be improvements, I think it's a step in the right direction.