
Publish your computer code: it is good enough - pama
http://www.nature.com/news/2010/101013/full/467753a.html
======
whyenot
This is such a fundamentally important issue in the sciences. Without the
source code, you cannot truly do peer review, and you may not be able to
replicate someone else's work. Without peer review or replication, you are no
longer doing science.

I've seen source code for software used to produce results in major
publications like Nature that was of such poor quality it is surprising it
even compiled. You don't generally get tenure for producing well-written
software, so there needs to be another incentive for scientists to spend the
time to write well-thought-out and well-documented code. I think sunlight
helps tremendously.

Warren DeLano, the author of PyMOL, the very popular molecular visualization
software, was an early voice about the importance of open source software in
the sciences. Unfortunately, he is no longer with us, but it is heartening to
see others now taking up the cause.

 _The only way to publish software in a scientifically robust manner is to
share source code, and that means publishing via the internet in an open-
access/open-source fashion._ \- Warren DeLano (2005)

~~~
jonhendry
I'm not sure sharing code is such a great idea. If you use the other guy's
code, you might be replicating the other guy's mistakes. And having the code
available encourages that.

It would be better to describe your methods, so someone else can implement
them in their preferred tools. If they still replicate your results, that
strikes me as much stronger than just rerunning the same, possibly buggy,
code.

I mean, if you're working with monkeys, other scientists can't reuse your
monkey. They have to follow the procedures you describe on their own animals.

Another issue is that the code might not be very useful to other labs as-is.
The code might be for unique, custom-made hardware, or an unusual
configuration of equipment, or in an unusual language.

~~~
randomtask
Sharing code doesn't prevent someone else from reimplementing an idea, and
indeed they should if they rely on results from another publication. If
everyone open-sources the code they wrote during their research, then it
ought to be easy to tell whether someone reimplemented an idea correctly
(their results will differ if not). If they copied the original
implementation instead, that should be evident too.
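
To make that concrete, here's a minimal sketch of such a check in Python.
Both functions and their names are hypothetical stand-ins, one for a
published implementation and one for an independent rewrite:

    import numpy as np

    def original_method(data):
        # Stand-in for the published implementation.
        return np.mean(data ** 2)

    def reimplementation(data):
        # Stand-in for an independent rewrite of the same idea.
        return np.sum(data * data) / len(data)

    data = np.random.default_rng(42).normal(size=1000)

    # If the rewrite is faithful, the two results agree to numerical
    # tolerance; a discrepancy points at a bug in one of the versions.
    assert np.allclose(original_method(data), reimplementation(data))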

~~~
almost
But people are people and when there's an easy way...

~~~
randomtask
I agree, but having implementations in the wild should at least bring more
attention to an idea, and more scrutiny of the code, than having none at all.

~~~
jonhendry
If the same results are obtained using different code (or
subjects/animals/chemical batches), does it matter, as long as the code is
doing the same things?

I get the impression that software types are inclined to place way too much
significance on the source code. I'm not surprised the author of the linked
item in Nature is a software engineer.

I wonder, did scientists in the 1930s publish their scratchpads full of
calculations?

~~~
nervechannel
There is a massive difference between a scratchpad of calculations, and a
simulation involving thousands or millions of data points.

------
bravura
I debate this constantly with my office-mate, who--unfortunately--represents
the conservative status quo in science. His concerns are that he might be
scooped, and that it's bad science to release things before they are finished
products.

I am a proponent of open notebook science, and strongly believe that the
benefits outweigh the negatives. Besides the feel-good arguments that this
advances science and reproducibility, I'll point out a selfish motivation for
releasing code: _It makes it more likely that other researchers actually try
your methodology and cite you._

I used to get mired in trying to do formal releases. Now I realize that
releases are a hindrance, and by default all my research code lives in a
github repo from day one:

<http://github.com/turian>

For example, here is the page I published on my word representation research,
with links to my github code: <http://metaoptimize.com/projects/wordreprs/>

~~~
nervechannel
Whether or not he's an open science fan, that's no excuse for him to refuse to
publish code when the paper is out.

You should point out that refusing to publish code is like being intentionally
hazy about your experimental protocol.

~~~
rue
I am not sure where the idea comes from that you have to release the source
before publication, but it does seem oddly prevalent, given that the solution
(release the code when the paper comes out) is so easy.

------
substack
I once worked on a power grid simulation whose authors wanted to release the
code but couldn't, because it used algorithms from _Numerical Recipes: The
Art of Scientific Computing_, which has a pretty asinine copyright policy
with respect to openness: <http://www.nr.com/com/info-permissions.html>

Even worse are the researchers who keep their models and data sets secret due
to paranoia that some colleague will publish first, which admittedly does
happen.

Cultivating a spirit of genuine cooperation and sharing in academia would do
wonders for the progress of the sciences, but there are so many hurdles that
need to be removed. It's not just a matter of knowing that it's good to
release source code, or of feeling confident enough to include it, as the
article suggests.

~~~
Locke1689
IANAL, but as far as I understand copyright law, it is impossible to
copyright "an algorithm." What one can copyright is source code, but as long
as you do not copy the source verbatim, you are not infringing that
copyright. I would suggest you translate the algorithms from the book into
mathematical expressions and then implement your own code from there. That
should be legal to publish.

~~~
substack
In this case it was a derivative work based upon source code from the book,
which is permitted under the terms but precludes open dissemination. Further,
the rest of the codebase was very tightly coupled to the algorithms, since
the algorithms updated state in-place instead of returning values, so
dropping in an open version of them was very non-trivial.
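
As a sketch of that coupling problem (with hypothetical routines, not the
actual power grid code): an in-place update bakes a side effect into every
caller, while a pure function would let an open replacement be dropped in
behind the same signature.

    import numpy as np

    # Coupled style, as described above: the routine overwrites the
    # caller's array in place, so callers depend on that side effect.
    def relax_inplace(grid):
        grid[1:-1] = 0.5 * (grid[:-2] + grid[2:])

    # Decoupled style: a pure function returns a new array instead.
    def relax_pure(grid):
        out = grid.copy()
        out[1:-1] = 0.5 * (grid[:-2] + grid[2:])
        return out

    g = np.linspace(0.0, 1.0, 5)
    relax_inplace(g)    # g is silently modified
    g2 = relax_pure(g)  # g is left intact; the result is explicit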

------
jesusabdullah
I'm currently an engineering masters student, and the code I've written is
available on github, if not properly licensed (I'm seriously considering
slapping the CRAPL onto it now that I'm aware of it). I don't really care too
much about polish, since, I mean, it does what _I_ want, I have the excuse of
working around other people's software, and at worst someone finds out that I
did something wrong, and we all benefit. Plus, having it on github makes it
really easy to work on it from different locations, and version control helps
me document just where all my friggin' time went.

The biggest concern I've heard from other researchers (i.e., professors) has
to do with being beaten to publication. I think that as long as your paper
_itself_ isn't easy to find pre-submission, you're okay, especially if nobody
will really understand what you're doing anyway. So I'm not too nervous about
the prospect. It's just code for now.

Unfortunately, even if scientists release their own code, it's often just a
small part of the big picture. In engineering, at least, MATLAB, Mathematica
and commercial Finite Element packages are everywhere. My own project uses
MATLAB and COMSOL in tandem, meaning that what I wrote only really serves to
glue a bunch of completely closed algorithms together. Personally, I'd love to
see a completely open, usable and documented FEA stack (which would ideally
include an adaptive mesher, some FEM algorithms, a post-processing
visualizer, and both a GUI and a not-shitty scripting API).

~~~
kd0amg
_I'm currently an engineering masters student, and the code I've written is
available on github_

Did you have to clear this with your university/advisor? AFAIK, my university
holds the copyright on code I produce on paid time or using university
facilities/equipment, which covers my research and even a good chunk of my
homework.

 _meaning that what I wrote only really serves to glue a bunch of completely
closed algorithms together_

This is consistent with what I've run across. A lot of research revolves
around modifications to an existing system, but in CS the established code
base is more likely to be open source (e.g. Jikes RVM) or proprietary but
with source code readily available (e.g. SimpleScalar).

~~~
jesusabdullah
> Did you have to clear this with your university/advisor?

I don't know if I _had_ to, but I did discuss it with my advisor, mostly due
to the "someone could steal my thunder" issue, and we were basically in
agreement. I actually have a really cool advisor--I lucked out there.

~~~
kd0amg
Eh... good enough. I'd do it if I had my advisor's approval (I really doubt
he'd give an OK that would get me in trouble).

------
splat
While I strongly believe that scientists should publish their code along with
their papers, I do think it has one potential disadvantage. Suppose a
scientist writes a big piece of code to do some complicated calculations, but
makes a subtle mistake somewhere in it (perhaps some parentheses are nested
incorrectly). If another scientist comes along to extend the results of the
first paper, the first thing he will do is try to replicate them. If the code
is unpublished, he will have to write it from scratch, and in doing so will
likely catch the original error. But if the code is published, he will just
use the original code and probably won't catch the error. Consequently, the
mistake will take much longer to catch, if it's ever caught at all.
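
A toy instance of that failure mode, in Python and purely illustrative: one
dropped pair of parentheses changes the operator precedence, yet the code
still runs and produces a plausible-looking number.

    # Intended: the midpoint (a + b) / 2.
    def midpoint_correct(a, b):
        return (a + b) / 2

    # One pair of parentheses dropped: precedence turns this into
    # a + (b / 2), which still runs and looks plausible.
    def midpoint_buggy(a, b):
        return a + b / 2

    print(midpoint_correct(1.0, 3.0))  # 2.0
    print(midpoint_buggy(1.0, 3.0))    # 2.5 -- silently wrong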

The advantages still outweigh the disadvantages, but it's important to
remember that there are always trade-offs.

~~~
code_duck
I think that, to the contrary, it would allow people to determine, by
analyzing the code, whether the original results were due to flawed software.
Subsequent efforts would remain free to create their own implementations from
scratch if desired.

~~~
splat
True, and that's the definite advantage of publishing code. My point is just
that if there's a lot of code and the error is obscure, it will take a long
time to identify, and in the meantime the mistake will be propagated in
further research. Most scientists using the code will just skim it, see if it
gets the results published in the original paper, and then move on. That has
been my experience with my own research, anyway.

~~~
jerf
I disagree, because your mental model of a large code base that has only
_one_ error is absurd. Scientific code is not magically easier to write than
real code. Get two random coders to write two significantly sized codebases,
say, about the size of a decent web framework, and you won't get one perfect
code base and one with a subtle bug that has ramifications down the line.
You'll have two code bases so shot through with bugs that they will never
reconcile, ever.

The quality issues aren't all that different than a web framework, either.
Release one or a small number of code bases, and have everybody pound on and
improve them, and you might get somewhere. Have everybody write their own code
bases from scratch every time and you'll get yourself the scientific
equivalent of
<http://osvdb.org/search?search[vuln_title]=php&search[text_type]=alltext>.

To be honest, at this point, when I see a news article that says anything
about a "computer model" I _almost_ immediately tune out. The exception is
when I find some sign that the model has been verified against the real
world; protein folding models, for instance, don't bother me for that reason.
But this is the exception, not the rule. I'm not exactly sure when it became
acceptable "science" to build immense computer models and announce
world-shattering results with no particular need to check them against
reality first, but it was a great loss to humanity.

------
bioinformatics
I must admit that I have mixed feelings about this. I usually post on my
Python blog code that might not be perfect or good enough, but that does the
job for me. I post it to show other people with no CS background, like me,
that they can achieve things with the language. The programming crowd that
visits my blog helps a lot; the science (biology, bioinformatics) crowd calls
me dumb (not directly). I don't know if I would publish my code. I know my
limitations and what I lack in knowledge, but in a vain environment like the
academic/scientific one, I prefer to hide my shortcomings.

~~~
jessor
Why don't you put your blog url in your profile?

~~~
bioinformatics
just did, if you want to check

python.genedrift.org

------
nervechannel
As a scientific programmer/data scientist at a university, I wish I could
upvote this twice. I'll be forwarding this one around my lab.

Another issue, as well as code quality, transparency, reproducibility, etc.,
is simply reuse. There's a lot of wasted effort in academia, with people
constantly reimplementing from scratch simple things that do basically the
same thing as their peers' code.

Okay, this is true in other fields too, but in our field it's public money
getting wasted.

------
pama
In addition to the common excuses that Nick Barnes mentions in his article in
Nature, one reason why scientists did not freely provide their code was the
fear of potential misuse of their software (since erratic publications using
their software could harm their reputation). The common solution to this
problem was to impose a barrier to entry by charging a fee. Many of the early
examples of complicated scientific software used this policy:

<http://yuri.harvard.edu/>

<http://ambermd.org/#obtain>

[http://cms.mpi.univie.ac.at/vasp/vasp/How_obtain_VASP_packag...](http://cms.mpi.univie.ac.at/vasp/vasp/How_obtain_VASP_package.html)

Due to advances in computer literacy and the creation of competing projects,
the landscape has been changing in recent years:

[http://en.wikipedia.org/wiki/List_of_software_for_molecular_...](http://en.wikipedia.org/wiki/List_of_software_for_molecular_mechanics_modeling)

[http://en.wikipedia.org/wiki/Quantum_chemistry_computer_prog...](http://en.wikipedia.org/wiki/Quantum_chemistry_computer_programs)

~~~
jonhendry
Those packages are rather different from a given scientist's code for a given
paper. These are big, mature projects, probably multi-lab or department-wide,
long-term efforts to develop a shared foundation for work in a given field.

The code actually corresponding to a given paper is likely to be a small
amount that builds on packages such as those above, or products like Matlab,
whatever. There could be code to implement an experiment, and Matlab code to
analyze the data.

~~~
pama
Agreed. Sharing these big, mature, foundational frameworks is essential for
the reproducibility of published research. The additional convenience scripts
that accompany a paper are often described or provided in the methods and
supplement. As a reviewer, I block papers that lack unambiguous information
allowing their reproduction, but I don't penalize the lack of explicit
convenience scripts, since these can be reproduced, or provided by the
authors upon request.

------
_corbett
I go even further: I am of the opinion that data provenance is the
overarching issue. Any series of results should be able to be regenerated
quickly (as measured in scientist-time, not computer-time) based solely on
metadata provided (and this is the key point) as part of the results
themselves. A few simple guiding principles could go a long way toward
achieving this goal.

    
    
  1. In the absence of a well-defined standard, it's the individual
     scientist's/consortium's responsibility to define and actively use an
     organized metadata standard.
  2. If it's not open source, it's not science.
  3. A snapshot of the source code used to generate results should be given,
     or pointed to, when the results are presented (see the sketch after
     this list).
  4. Minimizing reproduction time is an integral part of science.
  5. Principles 1-4 should be demanded by funding agencies, program heads,
     and research advisors.
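
Here is a minimal sketch of principle 3 in Python. It assumes the analysis
lives in a git repository, and the field names are illustrative, not a
proposed standard:

    import json
    import subprocess
    from datetime import datetime, timezone

    def provenance_record(params):
        # Pointer to the exact code snapshot that produced the result.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        # A dirty working tree means the snapshot is not reproducible.
        dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"], text=True).strip())
        return {
            "code_version": commit,
            "uncommitted_changes": dirty,
            "parameters": params,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    # Ship the metadata as part of the result itself.
    result = {"value": 42.0, "provenance": provenance_record({"seed": 1})}
    print(json.dumps(result, indent=2))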

------
tel
See also the CRAPL: <http://matt.might.net/articles/crapl/>

~~~
devmonk
The pronounced acronym is appropriate for academia. It contains both words
that come to mind.

------
merraksh
Scientists sometimes do publish their software. An example: in Operations
Research, the COIN-OR initiative (<http://www.coin-or.org>) is a growing set
of (Open Source) tools for solving OR problems.

Also, in certain disciplines, not only papers, but also the software used to
obtain the published result is peer reviewed. For instance, the journal
Mathematical Programming Computation
(<http://www2.isye.gatech.edu/~wcook/mpc/index.html>) accepts papers
accompanied by the software used by the authors, which is tested and reviewed
by technical editors.

------
yummyfajitas
Some journals already require this.

[http://www.econometricsociety.org/submissions.asp#Replicatio...](http://www.econometricsociety.org/submissions.asp#Replication)

