

Can we make accountable research software? - urish
http://bytesizebio.net/index.php/2012/08/24/can-we-make-research-software-accountable/

======
bravura
I've commented on this topic before (open notebook science).

Academic code should be released. In fact, I believe that you should host
everything on a public github/bitbucket from day 1.

I don't like the proposal in this blog post, i.e. that we insist code be
vetted in a distributed fashion under some consortium's guidelines.
Increasing the friction of releasing code is the wrong approach.

I believe that open notebook science should be incentivized by aligning it
with _career_ goals. Journals and conferences should have separate tracks to
which you cannot submit unless you commit to releasing your code. Then,
academics who want to release their code have an edge, because those tracks
are less competitive.

The feeling that you _must_ release polished code is the wrong idea. Pushing
everything to github removes the friction from sharing code, and that
low-friction approach has worked well in open source generally.

The author complains that neither he nor his students have the time to polish
code or write documentation. Here's my solution: I've released research code
under a "pay it forward" support license. If you ask me for support and I
help you over email, please document what you learned from your investigation
and our discussion, and then push those changes back to me.
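
Something like this at the top of each file gets the idea across (a sketch,
not my exact wording; the file name is made up):

    # analysis.py -- "pay it forward" support notice
    #
    # If you email me for help and I answer, please write up what you
    # learned as docstrings, comments, or a README patch, and send those
    # changes back, so the next person doesn't have to ask.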

[edit: formatting]

~~~
etal
We should be clear which of two kinds of scientific code we're talking about:

1. A program that implements a new technique which forms an important part of
a research project. Maybe a program that _is_ the research project, which will
be described in a paper.

No doubt this code should be included with the publication, no matter how
"ugly" it is. Some journals, e.g. Bioinformatics, already require that an
article about software must include the software itself. This is the stuff the
Bioinformatics Testing Consortium would run a smoke test on, because
amazingly, a lot of programs that have been written up as journal articles
just don't compile or work at all on somebody else's machine; many articles
don't include the source code, and some don't even say how to get a
redistributable binary. That's wrong, and we can fix it.

2. The mountain of single-use scripts and shell commands that are used in a
research project that's not really about software at all, only a small
fraction of which produce some output that the scientist follows up on.

Key points: (1) this code is very unlikely to work on anyone else's machine
as-is; (2) crucial parts of these pipelines are lost in the Bash history, or
were executed on a 3rd-party web server, or depend on a data set on loan from
a collaborator who is not ready to release the data yet; (3) almost all of the
code is dead; (4) whatever comments or notes exist are usually misleading or
completely wrong.
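
To make that concrete, here's an invented but representative example of the
kind of script I mean; every name, path, and number in it is hypothetical:

    # step7_redo.py -- typical single-use analysis script (entirely made up)
    import csv

    # Hardcoded path: dead on arrival on anyone else's machine (point 1).
    INFILE = "/home/me/data/collab_v2_FINAL_fixed/counts.csv"

    # def normalize(rows): ...  <- superseded weeks ago, i.e. dead code (point 3)

    THRESHOLD = 0.01  # NB: notes elsewhere still say "filter at 0.05" (point 4)

    with open(INFILE) as f:
        rows = [r for r in csv.DictReader(f) if float(r["pval"]) < THRESHOLD]

    # The upstream step that produced counts.csv was a one-off command on a
    # collaborator's web server, so it isn't recorded anywhere (point 2).
    print("%d hits" % len(rows))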

As an example of what can go wrong when this code is released as-is, remember
when the East Anglia Climate Research Unit "hide the decline" stuff hit the
fan? It wasn't clear which code was dead, the comments made no sense, and
people freaked because they couldn't be sure how the published results came
out of that godawful mess. The eventual solution, way too late, was to make a
proper open-source, openly developed software project out of the important
bits. That, in a nutshell, is why scientists won't release ALL the code --
even the hard drive itself is not the whole story; the scientist still needs
to be available to explain it and navigate over the red herrings. And getting
code into a state where it's self-explanatory takes time.

~~~
anamax
> That, in a nutshell, is why scientists won't release ALL the code -- even
> the hard drive itself is not the whole story; the scientist still needs to
> be available to explain it and navigate over the red herrings.

If said scientist can't do that, how does anyone know what was actually run?

~~~
etal
That's why we write papers. Plain English can be more coherent than a pile of
code.

~~~
anamax
> That's why we write papers. Plain English can be more coherent than a pile
> of code.

"Plain english" doesn't analyze data - software does.

If the software is a mess, how likely is it that the "plain English"
description is correct? How do you know? Why should anyone believe it?

Code is truth.

~~~
etal
Right, which is why the novel parts should get more attention and undergo code
review, which is the goal of the Bioinformatics Testing Consortium.
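
I don't know the Consortium's exact protocol, but the minimal version of such
a check is easy to sketch: on a clean machine, run the published tool on its
example input and see whether it exits cleanly (the tool name and arguments
below are placeholders):

    import subprocess

    def smoke_test(cmd):
        """Pass if the command runs at all and exits with status 0."""
        try:
            status = subprocess.call(cmd)
        except OSError:
            return "FAIL: program not found -- does it even install?"
        return "PASS" if status == 0 else "FAIL: exit status %d" % status

    print(smoke_test(["sometool", "--input", "example_data.txt"]))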

To be clear, I'm all for open science and even open notebooks where it's a
good fit for the project. I just don't think a pile of single-use scripts is a
sufficient replacement for a clear English description of the analysis
workflow and the reasons for each step. If I can't understand how an analysis
was done from the article itself and the documentation for any associated
software, I would not trust the article. Including more code, particularly the
code further down the Pareto curve of relevance to the final article, does not
make the article more correct -- most journal articles are wrong or flawed in
some way, even if the code works as advertised.

------
freyrs3
The barrier is convincing researchers that their code doesn't have to be
beautiful to be worth open-sourcing; it just has to be checkable by someone
else with the same domain knowledge.

Fortunately, a lot of the scientific Python community seems to be getting
behind the idea of building IPython Notebooks that document scientific
workflow. That seems to be a step in the right direction.
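
For instance, a single notebook cell keeps the code, its output, and the
prose rationale together in one shareable document. A sketch of the kind of
cell I mean (the dataset and column names are invented):

    # The markdown cell above this one explains *why* the cutoff is 3.0;
    # the output below is saved inline, so readers see what the author saw.
    import pandas as pd

    df = pd.read_csv("measurements.csv")        # hypothetical dataset
    good = df[df["signal_to_noise"] > 3.0]      # cutoff justified in the text
    good.describe()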

------
bo1024
Some of these are simply excuses. Publishing your code doesn't mean you have
to support it or maintain it. Just make the code available and be done with
it. The important thing is that it's out there.

------
tikhonj
To me, it seems that many of the problems with producing reasonable code could
be remedied by writing better code the first time. In my experience, writing
somewhat less hacky code is usually not _that_ much slower than writing
completely hacky code. Moreover, it often _saves_ time -- there have been
many occasions when I spent far longer chasing a stupid bug than I would have
spent writing cleaner code in the first place.
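
To illustrate what "somewhat less hacky" means in practice: often it's
nothing fancier than naming your constants and pulling repeated logic into a
function (a made-up example):

    # Hacky: copy-pasted per dataset, with the column index and cutoff
    # buried inline, free to silently diverge between copies.
    #   hits_a = [l for l in open("/data/run1.txt") if float(l.split()[2]) > 0.8]
    #   hits_b = [l for l in open("/data/run2.txt") if float(l.split()[2]) > 0.8]

    # Barely slower to write, and the stupid-bug surface shrinks:
    def load_hits(path, score_col=2, cutoff=0.8):
        """Return lines whose score column exceeds the cutoff."""
        with open(path) as f:
            return [l for l in f if float(l.split()[score_col]) > cutoff]

    hits = dict((run, load_hits("/data/%s.txt" % run)) for run in ["run1", "run2"])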

Now, learning to do this well _does_ take some time. However, if you're going
to be writing a fair bit of code, I think this time will more than pay for
itself. The amount of effort you can save down the road by spending a little
bit more time at the very beginning is not negligible.

So: I think researchers _should_ try writing cleaner code (rather than
cleaning up poorly written code at the end). I would not be surprised in the
least if, after they've gotten used to it, they're actually _more_ efficient
than before!

------
randlet
There's an aspect this article neglects to mention: rightly or wrongly, some
scientists believe that keeping their source code private gives them a
competitive advantage over other researchers.

Unfortunately, science is an extremely competitive business, and giving your
competitors tools that help them publish more papers faster doesn't always
make good business sense.

