
A Code Glitch May Have Caused Errors in More Than a Hundred Published Studies - daddylonglegs
https://www.vice.com/en_us/article/zmjwda/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies
======
ken
I've implemented algorithms I dug out of original research papers, which often
included sample code. (That's why I first learned to read Fortran!) I've
almost never gotten results that match the authors' exactly. Sometimes the
bugs are obvious, and sometimes they're subtle. I'd estimate that 100% of
sample implementations in published research papers have bugs. Researchers,
even in computer science, are usually not skilled programmers. The product for
them is the paper, not a program.

It's the same category of problem as "enterprise software". Whenever the
_customer_ is not the _user_ , the user gets screwed. With research, the
customer is the journal.

~~~
catalogia
Something I don't get is why universities don't do a better job of having
professional statisticians and programmers on staff for the explicit purpose
of providing support to researchers.

I guess grad students are cheaper and "good enough."

~~~
analog31
As I understand it, most large universities have something like this. At
smaller schools it might be less formal. Where I went to college, the students
were expected to have their methodology approved by the "stats guy" in the
department.

As for programmers, it's not just a matter of cost. Most scientists don't
understand software engineering, but most software engineers don't understand
science or math. Also, programmers who belong to their own organizational
structure can't keep up with requirements that could change from one hour to
the next. Despite "agile," realistically most software development is
comfortable with timelines of months or years.

Don't get me wrong, I'm an R&D scientist at an industrial business that makes
commercial software. The programmers are brilliant, and the software they make
is amazing. But I have no choice but to be self-sufficient for my own software
needs in R&D. It's two different worlds.

~~~
catalogia
If given a good spec, a clearly written formula, I don't think it's too
unreasonable to expect a seasoned programmer to implement it despite not
understanding the science that prompted the math. Maybe I'm wrong for some
kinds of exotic math, but I've certainly implemented a few mathematical
formulas that looked like Greek to me, at least when I started.

Moreover, what I propose is that every researcher have a programmer available
to consult with, who would perhaps write small amounts of code (and make sure
the code builds before it's published?). I don't suggest that researchers not
write code; I think they should write code. But I think an expert in writing
code should be available to help keep the standard of the code high.

~~~
analog31
In my view, it would help a lot to provide scientists with resources for
learning how to write quality software. Most of us don't even know how to
write a specification, except by writing the code, and don't have an idea of
what we want until we see something begin to work. Today, a lot of the math we
use is written directly in code.

This stuff evolves over multiple iterations. Many of my programs never run
more than once. Testing them requires being connected to the equipment if any
kind of closed-loop control is involved.

I work in an R&D setting in industry. If something I make threatens to become
a product, the thing I hand over to the software team is a proof of concept,
that could include a hardware design and working code.

------
throwaway57023
Writing code in research is quite different from writing code in industry. For
starters, in research, absolutely no one will read your code. Not your boss,
not your peer reviewers, not your colleagues, not your tech-savvy users. No
one. They just care about getting the right results and will complain if they
don't, which _most of the time_ steers you into producing correct code.

Also, most programs have a maintainer team that's exactly _one_ (1)
person strong, and that person is usually an underpaid grad student or postdoc
with a billion other tasks at hand and not much time left for that issue you
opened three months ago.

On the other hand, hey, you never have to fear the dreaded 'code review', so
it's not all bad.

~~~
GordonS
> Writing code in research is quite different from writing code in industry

Hah, not always.

A short while back I was tech lead for an AI project, the goal of which was to
reduce the weight (and, ergo, cost) of large steel structures.

The megacorp consultancy I work for decided to staff this project _only_ with
AI people, about half of whom were fairly fresh graduates. Now, they all had
a great grasp of AI, both old-skool (neural networks, genetic algorithms,
particle swarm optimisation etc) and more recent innovations. But these were
_not_ developers.

The customer was a Microsoft shop, and mandated that we use C# (which was
cool, it's my favourite language!), but it was immediately apparent that the
"devs" had only basic training in Java. The code was a f _cking mess, and we
had numerous issues around software engineering concerns such as source
control, DevOps and processes - the customer eventually canned the project due
to the bugginess of the platform. They really did have some brilliant minds on
the project, but those were not the minds of software engineers.

On a more recent AI project, unusually I was consulted first on staffing, and
I insisted on a mix of AI research types _and* software engineers -
unsurprisingly, this project was far more successful.

~~~
dEnigma
Seems like your formatting was messed up by putting an asterisk in "f*cking",
which caused everything to be rendered in italics between that word and the
"and" which you actually wanted to be in italics.

~~~
GordonS
Bah, on mobile, apologies :/

------
jacquesm
A couple of years ago I looked over the alignment software used in a bunch of
genetics studies and spotted a pretty trivial mistake that caused misalignment
of the strings. This in turn led to wrong conclusions about the order in which
mutations took place. This sort of thing is probably quite common: biologists
are not computer programmers (though the combination does occur), and they
tend to treat computer programs in roughly the same way as they would treat
lab equipment: stuff goes in, other stuff comes out, and if it looks good and
seems to work then it probably is good.

But the complexity of the software is such that you need to check _very_
carefully whether or not the software operates in the way you expect it to.

~~~
jaddood
You could also choose to use formally verified software for sensitive
research.

~~~
rhinoceraptor
Full-time developers don't (to a first approximation) use formal verification,
much less research scientists.

~~~
stevenhuang
Other way around. Research scientists are the ones that may (or should)
require additional rigor in their results. Formal verification would
presumably help in that regard.

~~~
rhinoceraptor
I don't disagree, I'm just saying that we as full-time software developers
have not made formal verification standard practice, or even approachable. I
would bet the average software developer has never even heard of it. So
expecting people who do not write software full-time to use it is a little
crazy.

------
nanis
You have to dig really hard to find out what the specific problem was. It
turns out this was the fix:

    
    
        import glob

        def read_gaussian_outputfiles():
            # collect all Gaussian output files in the working directory
            list_of_files = []
            for file in glob.glob('*.out'):
                list_of_files.append(file)
            # the fix: glob makes no ordering guarantee, so sort explicitly
            list_of_files.sort()
            return list_of_files
    

That is, the original code never sorted the files; it trusted that a simple
`glob` would return them in whatever order was desirable.

It is a reasonable mistake for a total noob to make (just like the hundreds of
simulations that use the built-in `rand` function of whatever language they
are using without so much as mentioning this pertinent fact), but it should be
called a "programming error" rather than a "code glitch". Also, the error
should be publicized up-front.
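
To make the failure mode concrete, here's a minimal sketch (the `*.out`
pattern is from the script above, everything else is hypothetical):

        import glob

        # The raw order is whatever the OS/filesystem enumeration produces;
        # Linux, macOS and Windows are all free to disagree here.
        files_os_order = glob.glob('*.out')

        # Sorting makes the order deterministic on every platform.
        files_sorted = sorted(glob.glob('*.out'))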

~~~
mxcrossb
So many comments in this thread are saying that scientists are bad
programmers, but even a great programmer can make mistakes like this. The
solution isn't better programmers; it's for users to treat codes as
instruments, and hence add a verification and calibration step.

~~~
fauigerzigerk
I would say it depends on the thought process that caused the bug. There are a
couple of possibilities I can think of:

Did the author of the code simply forget to sort? That's equally likely to
happen to anyone.

Did they assume that the sort order of glob.glob() was reliable across
different OSs and file systems? I don't think this would have happened to an
experienced software developer. At least there would have been enough doubt to
go read the first line of the docs.

Did they not care whether the code worked anywhere outside their own personal
setup? This is perhaps slightly more likely in an academic environment but I'm
not completely sure about that. We'd have to know more about the specific
circumstances.

What I find more astonishing is that it took so long for this bug to be found.
It's not exactly an edge case or a rounding error. It must have caused wildly
incorrect results in many cases.

So I think the more important differences may be on an organizational level
rather than a question of individual competence.

------
boulos
It’s too bad that the code and correction seem to be locked down, but the
linked Twitter discussion has an explanation [1]: the code expects Python’s
glob to return a sorted list of files, but it doesn’t guarantee this (the main
fix is to sort the result). I had thought reading the Vice article that this
would be about case-sensitive vs insensitive filesystems.

[1]
[https://mobile.twitter.com/bmarwell/status/11818211438352015...](https://mobile.twitter.com/bmarwell/status/1181821143835201536)

~~~
smnrchrds
Are they locked down? The article links to the paper [0], which has two zip
files to download, first of which is the code, including an explanation of the
corrections made [1].

[0]
[https://pubs.acs.org/doi/10.1021/acs.orglett.9b03216](https://pubs.acs.org/doi/10.1021/acs.orglett.9b03216)

[1]
[https://pubs.acs.org/doi/suppl/10.1021/acs.orglett.9b03216/s...](https://pubs.acs.org/doi/suppl/10.1021/acs.orglett.9b03216/suppl_file/ol9b03216_si_002.zip)

~~~
boulos
Ahh, I failed to click those. They seemed large enough to be “not the code,
but the full dataset”. Thanks for the correction!

------
saboot
Oh man, a sorting error of files due to filesystem differences... A very
similar issue happened to me while doing astronomy during undergrad. This was
a software pipeline for a specific telescope and I followed the guide they
provided. Essentially there were two directories, one for images and another
for error estimates. Files in both directories had the same filenames. The
software needed text files listing the images and the errors. The guide said
to do a basic `ls images/ > images_list.txt` and `ls errors/ >
errors_list.txt`.

HOWEVER, HYPERTHREADING WASN'T ACCOUNTED FOR. Between images_list.txt and
errors_list.txt, about every 20 or 30 lines the file order would be swapped,
invalidating the analysis and producing poor images.

It only took this undergrad two months to learn what was happening.
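
For what it's worth, the defensive version is short (directory names from the
story above, the rest a sketch): pair files by shared name rather than
trusting two separate listings to come out in the same order.

        from pathlib import Path

        # name -> path maps for both directories
        images = {p.name: p for p in Path('images').iterdir()}
        errors = {p.name: p for p in Path('errors').iterdir()}

        # fail loudly if the directories ever drift out of sync
        assert images.keys() == errors.keys()

        # deterministic pairing, independent of listing order
        pairs = [(images[name], errors[name]) for name in sorted(images)]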

~~~
liability
'Piles-of-files' is itself a bug in an era where sqlite exists and has
bindings for virtually every language. Concurrent writing can almost always be
reasonably avoided by keeping the processing parallel but serializing the
writes. In cases where that earnestly isn't sufficient, a 'proper' RDBMS is a
good option.

Pile-of-files is 1960s-era tech. We've learned much since then and our
hardware is much more capable. Unless you're shooting for the retro-UNIX
aesthetic for artistic reasons, it should be avoided.
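
A minimal sketch of what that can look like in Python (table and function
names invented): the workers compute in parallel, while a single process owns
the sqlite connection and serializes every write.

        import sqlite3
        from concurrent.futures import ProcessPoolExecutor

        def simulate(param):
            # stand-in for an expensive, parallelizable computation
            return param, param ** 2

        def main():
            conn = sqlite3.connect('results.db')
            conn.execute('CREATE TABLE IF NOT EXISTS results'
                         ' (param REAL, value REAL)')
            # workers run in parallel; all writes funnel through this process
            with ProcessPoolExecutor() as pool:
                conn.executemany('INSERT INTO results VALUES (?, ?)',
                                 pool.map(simulate, range(1000)))
            conn.commit()
            conn.close()

        if __name__ == '__main__':
            main()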

~~~
kardos
I don't think this is one-size-fits-all advice. Surely once your results are
measured in TBs this will fall apart.

------
609venezia
This is another great example illustrating the need for reproducible research
practices even in the hard sciences, in this case so that the papers in
question could be easily checked after Luo's excellent finding.

For an earlier discussion on the same topic, referencing other discussions:
[https://news.ycombinator.com/item?id=17819420](https://news.ycombinator.com/item?id=17819420)

(Top comment: "This article harkens back to discussions of the
"reproducibility crisis" in science, as discussed extensively here just
recently (see link below). Where, in this case, not coughing up the code used
in the simulations in a timely manner led to an apparently unnecessary multi-
year dispute.")

------
BeetleB
Copy-pasting a comment I left almost a year ago:

> A few weeks ago I had a conversation with a friend of mine who is wrapping
> up his PhD. He pointed out that not one of his colleagues is concerned
> whether anyone can reproduce their work. They use a home grown simulation
> suite which only they have access to, and is constantly being updated with
> the worst software practices you can think of. No one in their team believes
> that the tool will give the same results they did 4 years ago. The troubling
> part is, no one sees that as being a problem. They got their papers
> published, and so the SW did its job.

My own experiences when I was in grad school (engineering, not CS): No one
cares about code or code quality. In those days, no one used version control.
The attitude all my fellow grad students had was: "I don't need to learn this
stuff. Just need to publish this paper. When I become a PI, I'll simply make
it my student/post doc's job"

------
veezbo
I've always felt like there should be a way to connect experienced software
developers with research groups. Something like this, for example, should
never have happened with code review in place.

For example, one could imagine a program that connects developers with
research groups to help develop good code, or with conferences to test
submitted code.

Does anyone know if there already exists a program like that?

~~~
currymj
there is an organization called Software Carpentry that is basically like
this.

however, the real problem has more to do with incentives and funding than it
does knowledge of software best practices. right now, if you take the time to
write thoroughly-tested clean good code, all you're doing is handicapping your
research career against people who can churn out papers faster than you. you
can see this because code from computer science departments is nearly as bad
as in other fields, even from students who have had real industry jobs and
learned how to do the right thing.

it does seem like in this case, though, because the scripts were actually
intended as a tool to be used by others, the investment in more careful
testing might have been warranted.

------
peter303
Most scientists are not software engineers, so they wouldn't know about
comprehensive testing. This was an eye-opener to me when I moved from academia
to a scientific software company.

The situation is improving somewhat as many grad students open-source their
software on GitHub, and then include their makefiles, documentation, and some
tests. In earlier days, we software engineers would roll our eyes when
management suggested using 'free' university code to save money. Sometimes
bringing such software up to professional industry standards took more work
than starting from scratch from the published papers.

------
nabla9
The Reinhart-Rogoff error is probably the most famous Excel error in
economics, with huge consequences.

Their 2010 paper, Growth in a Time of Debt
[https://www.nber.org/papers/w15639](https://www.nber.org/papers/w15639),
used a convincing dataset to show that when external debt reaches 60 percent
of GDP, annual growth declines by about two percent.

This one paper was used by top-level politicians in the US and Europe to show
that austerity was the only option. It helped start widely damaging
pro-cyclical austerity policy during the recession. The corrected results
don't show any change in growth when the debt-to-GDP ratio goes above 60 or 90
percent.

~~~
dmix
The paper claimed a -0.1% decline at 90%, not -2% at 60% as you mentioned, and
only in advanced countries; data in emerging countries was much worse. The
Keynesian economists who countered showed that their data, when extended over
a longer timeframe and without the weighting, was actually +2% at 90%; this
too seems to be limited to advanced economies. High debt ratios in emerging
markets still seem to have a negative correlation AFAIK.

Regardless, in practice, it looks like the study was completely ignored in the
US anyway:

[https://fred.stlouisfed.org/fredgraph.png?g=FHS&nsh=1&width=...](https://fred.stlouisfed.org/fredgraph.png?g=FHS&nsh=1&width=600&height=400&trc=1)

Around 2010 the debt dramatically increased to above 100% instead of
declining, and it has continued to grow.

In the EU, comparing debt to gdp in 2012 vs 2019 shows that there has been
little decline there either:

[https://i0.wp.com/factsmaps.com/wp-content/uploads/2018/02/d...](https://i0.wp.com/factsmaps.com/wp-content/uploads/2018/02/debt-to-gdp-ratio-european-countries.png?fit=2400%2C1800)

[https://www.statista.com/graphic/1/269684/national-debt-in-e...](https://www.statista.com/graphic/1/269684/national-debt-in-eu-countries-in-relation-to-gross-domestic-product-gdp.jpg)

~~~
nabla9
That's because the Democratic Party had majorities in both chambers. The paper
was at the center of Republican policy led by Paul Ryan.

~~~
dmix
Oh apologies I misunderstood your comment:

> to show that austerity was the only option. It helped start widely damaging
> pro-cyclical austerity policy during the recession

I've found much of the fear mongering around austerity to be pretty overblown.
It's been pretty rare to find examples of austerity, except for some cases in
the EU, such as the attempts to rein in debt in countries like Greece, where
it was 180% (and still is today). Otherwise almost all of the advanced
countries have kept it around 60-100%, even almost a decade since the
recession.

Moderate Keynesian and monetarist policy remains incredibly popular following
the recession and in many places higher debt only became more popular even
during the good times.

Even looking at the American right, besides Paul Ryan's failed proposal 9
years ago, there's little evidence of the debt being lowered even during
Republican majorities in the House and Senate.

Yet if you follow Krugman and the like you’d think austerity has been a giant
problem in the west.

The world's a lot more boring and moderate than the fearmongers like to
admit.

~~~
nabla9
> Yet if you follow Krugman and the like you’d think austerity has been a
> giant problem in the west.

I don't follow Krugman closely since he is not a macroeconomist, but Krugman's
back-of-the-envelope calculation was correct in retrospect.

Obama's stimulus was roughly half of what was needed, and half of what was
needed was the result (tax benefits are not proper stimulus). According to the
CBO, the US economy was 6.8 percent below its potential, translating into $2.1
trillion of lost production. Lives were destroyed permanently due to the
prolonged recession.

------
Gimpei
There was a rumor when I was in econ grad school that there was a huge bug in
Stata in the nineties affecting standard error calculations, rendering a large
fraction of the studies from that time invalid.

~~~
Dayshine
Stata doesn't even version its packages, so it's literally impossible to
reproduce research as you can't retrieve the packages as they were then.

The maintainer of a popular package could choose to change their API today and
all code using it would break with no way to revert.

Stata is insanity.

~~~
Gimpei
I agree. And it's expensive whereas R is free and a much much much better
language.

------
burntoutfire
When I mentioned here some time ago that maybe we shouldn't be trusting our
climate science models so much, as they consist of millions of lines of poorly
tested code, I was heavily criticized...

~~~
moultano
Not trusting them is reasonable, but then where does that leave you? You don't
know then whether they're underestimating or overestimating things. I'm
strongly in favor of building a prior from the geological record and leaning
on that as much or more than the models, but that doesn't lead to a very
different conclusion.

~~~
burntoutfire
I guess that, at least compared to society in general, I'm a cognitive
nihilist? I highly doubt that it is in our power to predict with useful
accuracy something as complex as the climate. If I'm not mistaken, we're still
having problems modelling stuff a billion times less complex than the climate
(such as air flow in turbines, etc.)...

~~~
moultano
Generally, more uncertainty around climate change means you should be more
supportive of climate change mitigation rather than less, because it means we
can't rule out things that look like the P-T extinction.

~~~
burntoutfire
There's infinite number of events which we can't rule out. My approach is
basically the opposite - if the evidence for something is super-sketchy, maybe
we shouldn't spend trillions (or even quadrillions?) on preparing for it.

~~~
moultano
The evidence is completely airtight. The problem is with the precision of
predictions, not the evidence for the mechanism and rough magnitude.

------
crocowhile
I am a neuroscientist and a good portion of my laboratory's scientific output
is code. I can offer a perspective.

Every few months, a case like this one emerges. A published piece of software
contained a scientifically relevant bug, and this may have had an influence on
other studies that used that software. The reaction is always the same:
blaming the naivete of the researchers because they are not professional
coders.

While I would certainly love to see more interaction between professional
coders and scientists, I don't agree with the overall sentiment. I find it
counterproductive and even naive for several reasons:

1) Mistakes happen. Wikipedia has a long and certainly not complete list of
software bugs that had major consequences
([https://en.wikipedia.org/wiki/List_of_software_bugs](https://en.wikipedia.org/wiki/List_of_software_bugs)),
and most of them come from professional coders in all kinds of industries.
There is no doubt whatsoever that code produced by scientists is in general
less streamlined, less reviewed, and way uglier, but is it on average more
buggy _where it matters_? Unless we answer this question, the discussion is
moot.

2) Scientists who produce software by themselves do so according to the Open
Source philosophy, meaning that the code can be scrutinized by anyone. I'd
rather use possibly buggy but open-source software over software that I cannot
even scrutinize. I think published results should only be accepted if they
rely on open-source software: we would never accept a figure in a paper unless
the protocol was 100% disclosed. This is one aspect where we can all
immediately act. As a reviewer, I will never accept a paper that made use of
new software unless the code is an integral part of the paper. I urge all my
colleagues to do the same. Also, as authors, if you have to write software
from scratch, take code readability into account for exactly this reason.
Python is a good choice in this respect; R much less so.

3) We must encourage scientists to take on more coding, not the opposite. If
we keep telling them that they are not going to be good enough, we are not
helping. At the moment the trade-off is very clear: either
not-professionally-coded software or no software whatsoever. Fields like
crystallography have shown that the former option can still be extremely
favorable.

------
todd8
I've been asked to consult on a number of programs used for scientific
research or environmental modeling.

In one case a science student was expected to work on a C++ simulation for his
dissertation. The existing code base had been written by other grad students
from previous years. It was clear that no one involved had any experience with
C++ and it wasn't going to be possible for him to do a good job with the
existing source code. Fortunately, in this case, the C++ based simulation was
eventually dropped.

Another time, a team of professional civil engineers couldn't understand why
their model was able to make predictions that perfectly matched their
experimental data. It predicted a poorly understood physical phenomenon (algae
blooms) so perfectly that they asked me for help. In a couple of minutes I was
able to explain to them how they had overfitted the model to the small sample
of experimental data and that the model likely had no predictive value at all.

I was asked once to debug a large program used to understand the dynamics of
floods. The program contained some of the worst coding practices I've ever
seen. For example, the same global variables were reused for completely
different purposes in different parts of the program: sometimes X2DDRATE meant
one thing, and later, on some paths, it contained a completely different kind
of measurement with entirely different units. This was done in an effort to
"save memory" by not having too many variables.

A different ecological model I helped repair needed to compute the integral of
a function at one point and used the following method: pretend the function
was a straight line so that the integral could be computed as the area of a
triangle. _The function was a simple, uncomplicated exponential, something
that a high school math student could integrate exactly._
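
For the curious, a minimal sketch of that last one (parameters invented): the
straight-line shortcut next to the exact answer, using the antiderivative
(a/k)e^(kx) of f(x) = a*e^(kx).

        import math

        a, k = 1.0, 0.5  # made-up exponential f(x) = a * exp(k * x)

        def f(x):
            return a * math.exp(k * x)

        def exact(x1, x2):
            # exact integral via the antiderivative (a / k) * exp(k * x)
            return (a / k) * (math.exp(k * x2) - math.exp(k * x1))

        def straight_line(x1, x2):
            # the shortcut from the story: treat f as a straight line between
            # the endpoints (a trapezoid; a triangle when f(x1) is ~0)
            return 0.5 * (f(x1) + f(x2)) * (x2 - x1)

        print(exact(0.0, 4.0))          # ~12.78
        print(straight_line(0.0, 4.0))  # ~16.78; exp is convex, so it overshoots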

------
Tistel
Researchers should consider making their papers runnable; Jupyter notebooks
are a great tool for this. Also, check out Knuth's literate programming.

~~~
gwd
I was going to say that would almost make things worse: right now these sorts
of bugs are discovered by attempting to _re-implement_ the described
algorithm, and discovering that the described algorithm was wrong. If the code
was available but wrong, most people would just run the code and find the same
wrong answers as the original researchers.

But actually I think that's wrong; far more people would run the code than
currently try to re-implement the code; and it's likely that _someone_ will
notice something fishy about the code and report it.

~~~
erichocean
Well, we don’t have to wonder since this was done with PBRT.

Result? An incredibly high quality codebase with many hundreds of subtle fixes
over time coming from readers and researchers.

------
WalterBright
> “We all kind of assume that a computer program always spits out the correct
> answer.”

I'm a bit surprised at this.

~~~
_Wintermute
I'm not at all. Googling your problem and running your dataset through the
first random R package on GitHub you managed to install seems to be an
accepted approach in the biomedical fields. If you're lucky they might even
list which package it was in the paper's methods.

------
mjcohen
Retraction Watch has many, many, many examples of this type of thing.

[https://retractionwatch.com](https://retractionwatch.com)

------
vinni2
> "This simple glitch in the original script calls into question the
> conclusions of a significant number of papers on a wide range of topics in a
> way that cannot be easily resolved from published information because the
> operating system is rarely mentioned."

In computer science it is common practice to mention the experimental setup,
along with the OS, hardware configuration, and third-party libraries used, for
reproducibility purposes. Still, many papers suffer from reproducibility
problems. Now conferences are demanding Docker containers with all
dependencies and configs, which improves things a bit. But lack of data is
another reason why we can't reproduce papers.

------
cameldrv
On this specific bug, I really wish Python just returned a sorted directory
listing on every platform. The current behavior goes back to Python's original
philosophy of passing through OS-level behavior rather than standardizing it
the way Java does.
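
In the meantime, a tiny wrapper (just a sketch, nothing standard) gets you the
standardized behavior:

        import os

        def listdir_sorted(path='.'):
            # os.listdir makes no ordering promise; sort for determinism
            return sorted(os.listdir(path))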

------
TallGuyShort
Most of the comments on here are talking about repeatability, but I'm concerned
about correctness too. It's $40 to have 48-hour access to the paper, and I'm
not going to do that right now, but hear me out. The order in which the
operating system listed the files made a difference. So just having it be the
same everywhere doesn't necessarily fix the problem. What if it's the same
everywhere, but still isn't the correct order for what the researcher
intended? Everyone gets the same error, and that error isn't necessarily 0.

------
sunstone
Oh don't tell me it's a spreadsheet app.

~~~
TallGuyShort
It says clearly it was a Python script.

------
dr_j_
Should have used MATLAB.

------
jaddood
Why not use formally-verified software?

~~~
contravariant
Usually because the people capable of writing formally-verified software do
not understand the specific subject area, and the people that understand the
subject area are not generally capable of writing formally-verified software.

~~~
Nzen
For the perspective of a mathematician who came to evangelize theorem provers,
I recommend Kevin Buzzard's September 2019 Microsoft Research presentation [0]
about LEAN. He highlights cultural misunderstanding and apathy on both sides
of the domain divide. He also references the idea that the people who might
make the appropriate tools may not have stayed in academia. So, he's
structured his courses around using LEAN, with the indirect consequence that
power users (undergrads) may choose to become open-source committers.

[0] [https://www.youtube.com/watch?v=Dp-mQ3HxgDE](https://www.youtube.com/watch?v=Dp-mQ3HxgDE)
One hour of presentation and then 15 min of Q&A. My favorite moment is around
1:04:00, when someone asks a second time why he disprefers Coq, and Buzzard
complains that it can't represent some advanced quotient type that he'd have
to work around. I'm reminded of [1]

[1] [https://prog21.dadgum.com/160.html](https://prog21.dadgum.com/160.html)
Dangling by a trivial feature

~~~
abstractcontrol
Having been inspired by that video by Kevin Buzzard and finally finding
something that would be worth formalizing, I am trying out Lean at the moment.
I have about 2-3 months of Coq experience, so I can say that even without the
quotient type, Lean is much better designed than Coq. I can't vouch for how it
will do at scale since I've only just started with it, but from what I can see, Lean
fixes all the pain points that I had with Coq while going through Software
Foundations.

It has things like structural recursion (similar to Agda), dependent pattern
matching (the biggest benefit of which would be proper variable naming),
unicode, `calc` blocks, a good IDE experience (it actually has autocomplete)
with VS Code (I prefer it over Emacs, and the inbuilt CoqIDE is broken on
Windows), mutually recursive definitions and types, and various other things
that I can't recall off the top of my head.
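
For anyone who hasn't seen one, here's what a `calc` block looks like; a toy
Lean 3 example of my own, chaining equalities with a justification for each
step:

        -- each step states the next equality and names the proof term for it
        example (a b c : ℕ) (h₁ : a = b) (h₂ : b = c) : a = c :=
        calc a = b : h₁
           ... = c : h₂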

If I were to sum it up, the biggest issue with Coq is that it does not allow
you to structure your code properly. This is kind of a big thing for me as a
programmer.

------
m3kw9
It's kind of like if there is a CPU glitch.

~~~
shagie
From the article:

> Luo’s results did not match up with the NMR values that Williams’ group had
> previously calculated, and according to Sun, when his students ran the code
> on their computers, they realized that different operating systems were
> producing different results. Sun then adjusted the code to fix the glitch,
> which had to do with how different operating systems sort files.

------
ch33zer
I feel like, where feasible, scientists should adopt Kubernetes or something
so that the software they use is repeatable.

~~~
fwip
There's definitely a big push in that direction in the scientific community
right now. More and more tools and pipelines are getting distributed as
containerized workflows.

Big projects have realized the need to make their code available and versioned
just as they do their input data, side by side with hashes recorded all the
way along and reproducibility made as simple as possible. Now we're starting
to see it trickle down into less organized/large/disciplined projects as well.

~~~
ryukafalz
Containers as a technology are nice, but it's easy to fall back into the same
traps that make software non-reproducible using containers as well. You _can_
precisely specify all your dependencies, but it often takes a lot of effort to
make that happen.

I'm a fan of the approach Guix developers are taking for scientific computing,
because it makes reproducible software simple enough for people to use without
too many headaches: [https://hpc.guix.info/blog/2019/10/towards-reproducible-jupy...](https://hpc.guix.info/blog/2019/10/towards-reproducible-jupyter-notebooks/)

~~~
CamperBob2
Isn't the whole idea behind containers to eliminate external dependencies?

~~~
ryukafalz
In a sense. You're right that once a container is built, it has few external
dependencies. But you need to get those dependencies from _somewhere_ at
build-time, and if you're not careful it's easy to do that in a way that makes
it extremely difficult to rebuild that container in the future.

To use a slightly more concrete example: let's say you're using a library in
your container that has a severe bug. This bug results in incorrect
computations, so you would like to upgrade to a fixed version.

Now let's say that when you built that container initially, you installed
packages in the Dockerfile by running e.g. "pip install <package>". The
problem is that once this image is built, it's nontrivial to rebuild this
image and ensure you're using the same dependencies you were the first time.
In a sense, you've lost that information (though you can probably start to
figure it out with close inspection of the image).

Yes, there are usually ways around this with the language-specific package
managers; Node has package-lock.json, Python has Pipfile.lock, etc. But it's
not even close to being the default.
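
One cheap habit that helps (a sketch, assuming Python 3.8+ for
`importlib.metadata`; the lock-file name is made up): snapshot the exact
version of every installed package next to your results, so the environment
can be reconstructed even if the original image is lost.

        from importlib.metadata import distributions

        def freeze_environment(path='environment.lock'):
            # write one name==version line per installed distribution
            with open(path, 'w') as fh:
                for dist in sorted(distributions(),
                                   key=lambda d: (d.metadata['Name'] or '').lower()):
                    fh.write(f"{dist.metadata['Name']}=={dist.version}\n")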

