
Software updates: the “unknown unknown” of the replication crisis - minikites
http://blogs.lse.ac.uk/impactofsocialsciences/2018/06/07/software-updates-the-unknown-unknown-of-the-replication-crisis/
======
adamson
I'll concede that the story is different when you're relying on commercial
software that you've purchased to implement an algorithm correctly, but in the
limited case of replicating results (whether they're flawed or not), the onus
really should be on the user to record what version of software they're using
and how they configured it. The article does a lot of gymnastics to make
software correctness and research reproducibility feel like the same issue.
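
As a minimal sketch of what that user-side record might look like in Python (the package names, configuration keys, and output filename here are placeholders, not anything from the article):

    import json
    import platform
    import sys
    from importlib import metadata

    def record_environment(packages, config, out="analysis_environment.json"):
        # Dump the interpreter, OS, package versions, and the non-default
        # configuration used for a run alongside the results.
        info = {
            "python": sys.version,
            "platform": platform.platform(),
            "packages": {p: metadata.version(p) for p in packages},
            "config": config,
        }
        with open(out, "w") as f:
            json.dump(info, f, indent=2)

    # Hypothetical usage: the solver name and tolerance are made-up settings.
    record_environment(["numpy", "scipy"], {"solver": "lbfgs", "tol": 1e-8})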

~~~
s-shellfish
Versioning/Configuration in industry that has automated some aspect of their
build process to include external dependencies leads to label accumulation
(I'm intuiting that this is greater than exponential growth - assuming there
exist dependencies for a system A that are dependent on other dependencies
that system A is dependent on).
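
If the goal is just to capture every label rather than reason about them, something like this (standard library only, nothing assumed about the project) at least freezes the whole installed set, transitive dependencies included:

    from importlib import metadata

    # Record every installed distribution, not just the top-level tool,
    # so the full "label" set travels with the results.
    labels = sorted(f"{d.metadata['Name']}=={d.version}"
                    for d in metadata.distributions())
    with open("frozen_environment.txt", "w") as f:
        f.write("\n".join(labels))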

The problem is you can't document everything while also trying to automate
everything. It's always a balance of figuring out which information is most
important, and if 'oneself' were to attempt to construct a model that
represents everything, 'oneself' would simply be stuck in a catatonic state of
analysis paralysis, in perpetuity and invariably, until the heat death of the
universe.

Yes, the onus is on the user, but who is the 'user' in the industry of
development? How do you maintain structure in records for information systems
that are fundamentally, at their core, completely open%? I believe this problem
maps back to the science of collecting and verifying the veracity of
information, up to isomorphism (quite literally, this is the definition of the
graph isomorphism problem).

% The intrinsic nature of science and discovery is 'we discover new things
because we assume there is more to know (~ we don't know everything)'.
Therefore, all components of such a system are subject to change. There is
nothing that can be said to be permanent in knowledge discovery systems where
the premise of discovery rests on 'assertions' that cannot be provably
demonstrated to be factual AND absolute.

~~~
emiliobumachar
"Versioning/Configuration in industry that has automated some aspect of their
build process to include external dependencies leads to label accumulation"

Okay, but most software shops shove all those labels into their product
version, usually something simple like "ToolFoo 3.4.11".

I think that this numbering and basic "here's the binary, run at your own risk
and don't call me" legacy support are enough on the vendor's part. And I agree
with gp that if the researcher fails to document the top-level tool version
they're using, plus the changes they made to default configuration, then it's
the researcher's fault.

Don't you agree?

~~~
rspeer
There are many software stacks that are not versioned only at the top level,
especially in scientific software.

As an example: If I tell you I calculated a result on SciPy 1.1.0, do you know
what version of what code is doing the linear algebra? No, because there's no
single answer. It depends how SciPy was built and what's installed on the rest
of my system. It could be OpenBLAS, ATLAS, MKL...

So if there were a bug in, say, a particular version of OpenBLAS, and it
manifested itself through SciPy, there would be no way to tell whether the bug
is present based on the SciPy version alone.
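
For what it's worth, NumPy and SciPy can at least report what they were built against; a quick check looks roughly like this (the exact output format varies between releases):

    import numpy
    import scipy

    print("numpy", numpy.__version__)
    print("scipy", scipy.__version__)

    # Prints the BLAS/LAPACK libraries this particular build is linked
    # against (OpenBLAS, MKL, ATLAS, ...).
    numpy.show_config()
    scipy.show_config()

Even then, this reports what the build was configured with, not necessarily which point release of the library gets loaded at runtime, so the point stands.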

------
_bxg1
People are latching onto different individual takeaways from this, but it
really requires a multi-pronged approach (and can never be fixed completely):

Companies making scientific software should test at a higher level of rigor
than others. There's no such thing as bug-free software, but the software on
missile launch systems and NASA satellites comes pretty close, so scientific
software can too.

For algorithms that go beyond just math, researchers should consider writing
their own code.

As applicable, software version numbers (and perhaps binary hashes) should be
listed as part of the reproduction conditions in papers.
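
For the hash part, something as small as this would do; the library path is purely illustrative:

    import hashlib
    from pathlib import Path

    def sha256_of(path):
        # Hash the exact binary so the build, not just the advertised
        # version number, can be cited in the reproduction conditions.
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    # Hypothetical path; point this at the actual executable or shared
    # library the analysis loaded.
    print(sha256_of("/usr/lib/x86_64-linux-gnu/libopenblas.so.0"))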

------
s-shellfish
> Furthermore, perhaps before releasing a new version of the software for a
> broader usage, these companies should ensure it is bug-free by pre-testing
> it and thus guaranteeing the correctness of the produced estimations.

This is f-king ridiculous. This will never happen. There will always be bugs.
"Bug-free."

If you don't want bugs, write the code yourself, or read all of the code
yourself. After you do that, try to figure out whether your mind can still
parse information 'correctly' relative to your respective science.

~~~
_bxg1
Suggesting bug-free software demonstrates an ignorance of how software
engineering works, but requesting more rigorous testing, not just for absolute
correctness but perhaps even for consistency with past versions (a novel
concept), is perfectly reasonable.

~~~
s-shellfish
> not just for absolute correctness, but perhaps even for consistency with
> past versions (a novel concept)

Sorry, can you explain what you mean by this further?

~~~
_bxg1
Normally, software functions are only tested against what is considered to be
the universally correct outcome. The wiggle room - i.e., undefined aspects of
the behavior - within that definition of "correct" is not tested. It would be
interesting for testing of scientific software to incorporate a second metric:
"how close is it to what it was last time (assuming last time was also
'correct')?" For the sake of reproducibility, maybe it's meaningful whether or
not the current version of correctness is in line with the previous version of
correctness.

There's virtually no other domain in which you'd want to test software this
way, which is what makes it novel and interesting.
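
A sketch of what that second metric could look like in a test suite (the file name, tolerances, and values are all invented for illustration):

    import json
    from pathlib import Path

    GOLDEN = Path("golden_results.json")  # snapshot saved from the previous release
    CORRECTNESS_TOL = 1e-6    # "is the answer right at all?"
    CONSISTENCY_TOL = 1e-12   # "does it match what we shipped last time?"

    def check(name, new_value, true_value):
        old_value = json.loads(GOLDEN.read_text())[name]
        # Ordinary correctness test against the known-good answer.
        assert abs(new_value - true_value) <= CORRECTNESS_TOL
        # Second metric: consistency with the previous version's output.
        assert abs(new_value - old_value) <= CONSISTENCY_TOL

    # Demo only: pretend the previous release produced this value.
    GOLDEN.write_text(json.dumps({"mean_estimate": 0.30000000000000004}))
    check("mean_estimate", new_value=0.30000000000000004, true_value=0.3)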

For example, maybe a floating point operation rounds the very last digit.
Then, through code changes, it starts truncating the very last digit instead.
Your tests may still consider this result "correct", but it might be worth
considering the fact that it differed from the previous "correct" result.
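
Concretely, with the decimal module standing in for the two behaviors (the value and quantization step are just for illustration):

    from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

    x = Decimal("2.34567")
    old = x.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)  # 2.3457 (rounds)
    new = x.quantize(Decimal("0.0001"), rounding=ROUND_DOWN)     # 2.3456 (truncates)

    # Both are within one unit in the last place of the true value, so a
    # tolerance-based correctness test passes either way - yet they differ.
    assert abs(old - x) <= Decimal("0.0001") and abs(new - x) <= Decimal("0.0001")
    assert old != new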

~~~
perl4ever
I have the impression that issues with a software version not matching a
previous version are not only not novel, but have been plentiful with any kind
of software since the beginning of time. Microsoft is proverbial for Windows
being under the constraint of maintaining past behavior however imperfect.

To wit, it's such a common thing that there is an xkcd for it:

[https://imgs.xkcd.com/comics/workflow.png](https://imgs.xkcd.com/comics/workflow.png)

P.S. I was momentarily repressing my first-hand experience, but in a past job
there were _many_ times I wanted to fix things and wasn't allowed to, because
if the results changed, it would upset someone. The cardinal CYA rule was: if
we've been reporting something wrong for years, it had better stay consistent
forever. Which is really painful if you care about correctness and
compulsively investigate it.

~~~
_bxg1
That sounds awful.

------
tpeo
If the statistical significance of your results is algorithm-dependent,
shouldn't they be regarded as suspect? Perhaps it's just a failure of
imagination on my part, but I find it odd to think that changing a software
package might budge estimates far enough to push them outside the zone of
statistical significance unless they were only marginally significant in the
first place.

~~~
_bxg1
I could imagine algorithmic differences adding bias in error margins. Both
versions might be accurate approximations of the answer, but one might lean
towards one end of the error space and one might lean towards the other.
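
A toy illustration of two "correct" results leaning different ways, here coming just from summation order rather than a library change (the seed and size are arbitrary):

    import math
    import random

    random.seed(0)
    xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

    naive = sum(xs)          # left-to-right accumulation
    careful = math.fsum(xs)  # correctly rounded sum

    # Both are fine approximations of the same quantity, but they typically
    # differ in the trailing digits - exactly the kind of gap a version bump
    # in a numerics library could introduce or remove.
    print(naive, careful, naive - careful)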

It's like when fixing a bug in library code breaks application code. Usually
it's because there was some undefined behavior in the library - which wasn't
part of the contract - which the application (knowingly or unknowingly) relied
upon, and then the updated version produces a different undefined behavior.

------
recursive
> Yet, the company does not justify which version of the program is the
> correct one to use in order to get as close as possible to the underlying
> true relationship.

Even if they did, that's still not good enough. I guess it's my own ignorance
of how science is done, but I'd expect a higher standard of certainty for
which algorithm you're using than "some vendor said so".

~~~
RobertRoberts
One project I have worked with was analyzing data in the education industry.
Mainly public schools, think student grades and educational performance in
general.

You'd think a simple percentage grade would have meaning, but it doesn't.
Oftentimes federal and state rules alter how data is to be "perceived" and,
therefore, analyzed and reported on.

There are a few major players in the data collection and analysis game for
public school systems, and not only do they all have their own way of
determining skill level (e.g., reading skill level) with their own testing
systems, they also analyze their custom data with their own algorithms. This
leads to quite a variation in their ultimate recommendations/conclusions
between vendors.

Then take into account that some schools have many students with English as a
second language, and based on how the state views that school, it may be
analyzed as "all equal" or with the language barriers taken into account.

Then mix in that you have to analyze students over multiple years, with
students entering and leaving districts, sometimes with data vendors changing,
etc. Then there are state and federal laws mixing with multiple vendors' data
collection _and_ analysis.

So the next time you hear that a particular school district is under- or
over-performing based on these vendors' analysis (especially related to
budgets), keep in mind how far the actual reality can diverge from what is
being reported.

Maybe every complex system has the same core problems?

------
nonbel
>"The replication crisis is largely concerned with known problems, such as the
lack of replication standards, non-availability of data, or p-hacking. One
_hitherto unknown_ problem is the potential for software companies’ changes to
the algorithms used for calculations to cause discrepancies between two sets
of reported results." - June 7th, 2018

They didn't know about this?

~~~
s-shellfish
No, they expect developers to know everything, apparently. At least to know
everything that doesn't give them a dopamine hit when two result sets match,
validating their theories/beliefs.

This is what developers deal with all the time, so this is obvious. Code
should work this way, but it doesn't always. Yes, we could all make heroes of
ourselves by thinking of ourselves as saviors for those edge cases of
catastrophic failure, but we don't, because we are aware that bugs happen -
they are a mathematical certainty, assuming one doesn't have access to all
information, including every single use case of a program in the future.

~~~
nonbel
I mean, almost the first thing that happened to me when I tried to use
commercial stats software was being unable to reproduce a calculation
manually. Eventually I found out they had a known bug in their code that had
been left in for ~20 years, and when they fixed it they called it the
"enhanced version". The calculation was based on the original version of a
paper that was later retracted and corrected. I would never trust commercial
stats software after that.

I really mean this was close to the very first thing, so I would assume it has
to be very common.

------
CyberDildonics
It seems to me a lot could be solved by making data available and compiling
software to webasm, then packaging up a single web page that demonstrates any
algorithms used.

------
djmips
Maybe they should also be required to publish their software environment as a
container or other virtual package.

------
interfixus
> _Speaking in 2002 about weapons of mass destruction, United States Secretary
> of Defense Donald Rumsfeld infamously distinguished between the “known
> unknowns” and the “unknown unknowns”_

Rumsfeld rather _famously_ distinguished, as the world has proved by endlessly
quoting and paraphrasing him.

The _infamous_ appellation here is presumably based solely on the authors'
dislike of Rumsfeld. I stopped reading.

