
If you want reproducible science, the software needs to be open source - llambda
http://arstechnica.com/science/news/2012/02/science-code-should-be-open-source-according-to-editorial.ars
======
rgejman
A Github for scientific code doesn't go nearly far enough. The transition from
paper journals to electronic publications has only converted dead paper into
"electronic" paper. With some exceptions, e.g. video recordings, animations
and supplementary figures/documents/spreadsheets/code, the document that you
download from any major science publisher is a PDF that looks almost exactly
like the printed publication. Most don't even include links to referenced
publications[1]!

Today, we know a lot about how to make documents with complex formatting
(think microformats, links) and even more about making abstract document
formats that can be presented and styled in different ways (think XML with
stylesheet-style separation of data and presentation). Having a standardized
scientific publication format (with open-source user or publisher generated
extensions as needed) would completely change the way we produce and consume
the literature. Imagine the possibilities for meta-analysis!
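
To make the meta-analysis point concrete: with a standardized format, a
survey of citation patterns across a whole corpus becomes a few lines of
code. A rough sketch in Python, assuming a hypothetical XML schema (the tag
and attribute names here are invented):

    import xml.etree.ElementTree as ET
    from collections import Counter
    from pathlib import Path

    # Count which DOIs are cited across a corpus of articles stored in a
    # hypothetical standardized XML format (tag/attribute names invented).
    citation_counts = Counter()
    for path in Path("corpus").glob("*.xml"):
        article = ET.parse(path).getroot()
        for ref in article.iter("reference"):
            doi = ref.get("doi")
            if doi:
                citation_counts[doi] += 1

    for doi, n in citation_counts.most_common(10):
        print(f"{n:4d}  {doi}")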

Yes, code should (in most cases) be released together with a paper. But even
better would be if the code were released as part of a standardized data
format that would allow you to, for instance, selectively download raw data
and re-run the computational experiments on your own computer (think: re-
running simulations in Mekentosj's Papers as you read the paper).

Even simpler (and possibly more useful): provide both low and original (high)
resolution versions of figures that can be examined separately from the main
document. I can't tell you how many times I've been annoyed by the low quality
of the published images and wished I could zoom in to the level of detail I
know was in the original image. Even more frustrating: why should I have to
take screenshots of images in Preview just to add them to a figure for my lab
meeting?
Separate the presentation and the data!
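
This would be trivial for publishers to automate; a rough sketch with the
Pillow library (the filenames are invented):

    from PIL import Image

    # Generate a low-res preview for the article body while the untouched
    # original stays downloadable as separate data.
    original = Image.open("figure3_original.png")  # hypothetical filename
    preview = original.copy()
    preview.thumbnail((800, 800))  # downscale in place, preserving aspect ratio
    preview.save("figure3_preview.png")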

[1]although some now include intra-document links from a citation in the text
to the reference in the coda

~~~
toufka
biochemist here...

I would love to release my data in an open format. But the software I'm using
is proprietary - and the format is therefore closed.

Here's the problem: Scientific software sucks. It's stuck 20 years in the past.
Usually a pain in the ass. Barely works. Crashes. Proprietary. Hacked
together. And worst of all - ridiculously expensive. And if it's not
proprietary, then it was written by me. The script is just good enough to do
exactly what I need it to. I've hardcoded file directories and tab-delimiters.
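
(To illustrate the pattern with an invented example: the gap between my
throwaway script and a shareable one is often just lifting the hardcoded
assumptions into parameters.)

    import argparse
    import csv

    # Throwaway version (the kind of thing I actually write):
    #   rows = list(csv.reader(open("/Users/me/data/run7.txt"), delimiter="\t"))
    #
    # Shareable version: the same script with the assumptions lifted out.
    parser = argparse.ArgumentParser(description="Sum the second column of a table.")
    parser.add_argument("path", help="input file")
    parser.add_argument("--delimiter", default="\t", help="field separator")
    args = parser.parse_args()

    with open(args.path, newline="") as f:
        rows = list(csv.reader(f, delimiter=args.delimiter))
    print(sum(float(row[1]) for row in rows))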

If I use a microscope to take images, they can come out as a Nikon file, a
Canon file, a Zeiss file, or any of a number of other formats, none of which
are interoperable. TIFF, in general, doesn't cut it: it doesn't communicate
with the scope and generally can't store all the metadata that must be kept
with the image itself. If I were to dump my raw Nikon files it would help
nothing; no one could read them without the $10k software that comes with the
scope.
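
(The standard TIFF side is at least inspectable; a rough sketch using the
tifffile Python library, assuming it can open your file at all, which is
exactly what the proprietary formats break:)

    import tifffile

    # Dump whatever tags the acquisition software embedded in a TIFF.
    # Proprietary formats (Nikon ND2, Zeiss CZI, ...) need other readers.
    with tifffile.TiffFile("scope_image.tif") as tif:  # hypothetical filename
        page = tif.pages[0]
        print("shape:", page.shape, "dtype:", page.dtype)
        for tag in page.tags.values():
            print(f"{tag.name}: {tag.value!r}")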

Open standards, even for the more commonly used scientific formats, simply
don't really exist. And when they do, they're still hacky and ugly.

If someone would write some good open-source microscopy, genetics, and basic
mathematical software, it would make thousands of grad students' lives easier.
Someone PLEASE write a good plasmid viewer (an editor if you're feeling kind)
for GenBank files. An iPhoto/iTunes for microscopy images. A GUI for
basic/common Perl/Python scripts. Some mathematical software that's not
impenetrable (a la MATLAB/Mathematica). If this software were out there and
used, it'd be trivial to dump the raw data and allow anyone to use it.
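
(For the GenBank half, the parsing is honestly already solved by Biopython;
what's missing is the interface on top. A minimal sketch, with a hypothetical
input file:)

    from Bio import SeqIO

    # Read a GenBank plasmid file and list its annotated features; a viewer
    # would draw these as arcs on a circular map.
    record = SeqIO.read("my_plasmid.gb", "genbank")  # hypothetical filename
    print(record.id, len(record.seq), "bp")
    for feature in record.features:
        names = feature.qualifiers.get("label") or feature.qualifiers.get("gene") or ["?"]
        print(f"{feature.type:12s} {feature.location}  {names[0]}")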

(I'd love to have you guys make me some sweet software. It would take you a
relatively short time - these things are standard and already spec'ed out. But
sorry - we have no money - we can't really pay you.)

~~~
yummyfajitas
Just curious, some of these seem like really interesting potential OSS
projects. In particular, I'd consider writing a couple of them just for
practice - i.e., I might write your plasmid viewer/editor just as a project to
learn Clojure. I imagine I'm not the only one.

Why don't you post a more detailed list? The OSS community might surprise you.

~~~
toufka
I guarantee that the best open-source plasmid editor/viewer will become a
ubiquitous piece of software in every university. And eventually, even, homes.

<http://www.mekentosj.com/science/enzymex> \- made in part by the guys who
made Papers. Nicest interface, but missing lots of features.

<http://biologylabs.utah.edu/jorgensen/wayned/ape/> \- is what everyone uses.
It's old. Functional, but missing SO much of what's possible.

<https://www.dna20.com/genedesigner2/> \- made by the best in the industry.
Look at that shit.

<http://bioclipse.net/screenshots> \- open source, but look how difficult. How
not right.

The rules are straightforward. It's a lot like writing out an API. A good
manipulator of well-documented data structures.
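
(Something like this as the core model, I mean; the field names are invented,
but the shape is roughly it:)

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Feature:
        # An annotated region on the plasmid, e.g. a promoter or ORF.
        name: str
        start: int          # 0-based, inclusive
        end: int            # exclusive
        strand: int = 1     # +1 or -1
        kind: str = "misc"  # promoter, CDS, origin, ...

    @dataclass
    class Plasmid:
        name: str
        sequence: str       # circular DNA, A/C/G/T
        features: List[Feature] = field(default_factory=list)

        def features_at(self, position: int) -> List[Feature]:
            # Which annotations cover a given base? (Ignores features that
            # span the origin, for simplicity.)
            return [f for f in self.features if f.start <= position < f.end]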

~~~
yummyfajitas
Please write to me, I'd love to discuss this with you in more detail. Contact
info is in my profile.

------
reitblatt
There is a difference between reproducibility and repeatability. Reproduction
is an independent experiment producing commensurate results. Repetition is the
same lab repeating the experiment and finding the same results. Sharing code
actually reduces the independence of experiments. Worse, sharing buggy code
introduces systematic errors across "independent experiments". Scientists
already deal with similar issues due to a small number of vendors of various
tools, but software is pretty different. Systematic measuring biases can be
detected and calibrated, but software bugs rarely lend themselves to such
corrections. Because science depends upon independent reproducibility and NOT
repeatability, there's an argument to be made that blindly sharing code is
actually detrimental to scientific reproducibility.

The real question we should be asking is whether opening and sharing these
code bases will result in an increase in quality that offsets the loss of
experimental independence.

------
jgrahamc
Nice to see that my (co-authored) paper is top news on Hacker News. Direct
link to the paper:
[http://www.nature.com/nature/journal/v482/n7386/full/nature1...](http://www.nature.com/nature/journal/v482/n7386/full/nature10836.html)

------
rflrob
There's also the fact that a lot of the code scientists write is hacky, one-
off, and fragile. The kinds of people who care about releasing their code also
feel at least a little embarrassed about the code quality. There's at least one
license that recognizes and embraces this fact:
<http://matt.might.net/articles/crapl/>

~~~
_delirium
A more serious problem with that than embarrassment is that such code
sometimes really _shouldn't_ be uncritically reused, if we care about
reproducibility. An independent reimplementation that reaches the same results
is more convincing to me than a 2nd scientist getting the same results when
they re-run the 1st scientist's hacky code. It's even more of a problem in the
case of code that gets passed around and slowly accumulates ad-hoc additions
because nobody wants to reimplement it.

~~~
mjwalshe
Precisely: you have to be able to reproduce the experiment without using the
first lab's code.

~~~
bo1024
This is true on its own, but seems like a bad standard to hold science to.
With normal experiments, it's not enough to say "we ran an experiment that
tested the pliability of different materials and found X is most pliable",
then say, "you should be able to find that X is most pliable without reusing
our method."

The point of publishing a method is so it can be critiqued; I think the same
should hold with source code. This should _not_ at all excuse people from
trying to reproduce simulations with separate code.

Also, source code lies kind of halfway between experimental measures and
mathematical proofs. Again, you are usually expected to give proofs of non-
obvious mathematical results, at least in the supplementary section.
Similarly, saying "there exists code which produces this result" shouldn't be
sufficient unless it's very obvious.

~~~
jonhendry
"The point of publishing a method is so it can be critiqued; I think the same
should hold with source code. "

Except that source code can sometimes obfuscate the intent.

It's probably better to provide pseudocode. Don't provide source code for your
binary sort; say you sorted the data and what it was sorted on, and let other
people use their own preferred sort implementation.

Especially since other labs may not use the same equipment, libraries,
languages, etc., the source code may be useless to them anyway.

~~~
rflrob
"Except that source code can sometimes obfuscate the intent."

The source code, no matter how opaque and poorly written, can never really
make things less clear, because it must be interpretable by a computer. Good
pseudocode and high-level descriptions can help illuminate the
code, but, as the saying goes, "If the code and the comments disagree, then
both are probably wrong."

------
thomasballinger
After three years of writing CRAPL-worthy code at an academic institution, I'm
convinced this needs to be required of academic research. I've made plenty of
mistakes that could have dramatically upset experimental conclusions - I
assert that I've caught all the important bugs, but the odds will always say I
haven't.

------
luriel
And the journals it is published in should be open too.

I know it's off-topic, but it makes my blood boil that we allow scientific
research, in great part paid for with tax dollars, to be locked up in what are
basically proprietary journals that only a privileged few have access to, when
it should be freely accessible to absolutely everyone.

------
Irishsteve
I've a few publications out there, and if I had to release my code I would.
However, the reason I don't instantly publish the code is that it's kinda
embarrassing. My code works and it has some level of unit-test coverage to
make sure the numbers make sense, etc. But the code itself has a number of
inefficiencies, ridiculous variable names, and in some cases serious examples
of breaking DRY.

However if everyone had to publish their code, I know the elements of my code
which cause me distress would be nothing compared to a variety of other
implementations people create.

Oh, also: trying to reproduce someone else's algorithm from a paper is so
painful. There are a number of experimental values that aren't really
mentioned in papers because they're deemed trivial, so you have to do no small
amount of tinkering to get similar results.

~~~
AkThhhpppt
Linked to above, and sounds perfect for you: the CRAPL licence. ",)

<http://matt.might.net/articles/crapl/>

------
larsberg
The soon-to-be-released data-retention policies for the NSF's CISE (basically,
the arm of the National Science Foundation that funds computer science
research) will most
likely require complete and free access to not only the code for your
implementation but also all scripts, input data, and configuration settings
required to completely reproduce the experiments.

I can't wait. I've been doing some GPGPU research, and less than 10% of the
authors of _published_ papers are willing to release their code or even a
binary for benchmark comparisons.

------
eykanal
It's worth noting that there already _are_ many open-source research packages.
My graduate and postdoc work used magnetoencephalography in neuroscience,
and the majority of the packages are open source. The authors were happy to
welcome bug reports and source code contributions, and any code used for an
analysis can be easily re-used.

By way of example, my postdoc work was all completed using FieldTrip
(<http://fieldtrip.fcdonders.nl/>), which is free and runs on both MATLAB and
Octave. All my source code is on GitHub
(<https://github.com/eykanal/EEGexperiment>), and anyone could reproduce the
majority of my analysis on their own dataset.

------
spitfire
Fantastic! I'll do that just as soon as someone gives me an open-source
Mathematica, ANSYS, or risk-modelling package.

On a serious note, I agree source should be available. But it isn't, because
these sorts of specialized packages are very, very hard to write.

~~~
ec429
Try R. It's a powerful open source statistical language, also fairly good at
linear algebra.

------
altxwally
I'm not a researcher myself, but I find the efforts of the org-babel project
in this field very interesting. It is a literate-programming mode for Emacs
that attempts to make conducting this style of research more straightforward.

Here are some links and examples of work done in the reproducible-research
style of org-mode.

"A Multi-Language Computing Environment for Literate Programming and
Reproducible Research"

<http://www.jstatsoft.org/v46/i03/paper>

"THE EMACS ORG-MODE" Reproducible Research and Beyond

[http://www.warwick.ac.uk/statsdept/user-2011/TalkSlides/Cont...](http://www.warwick.ac.uk/statsdept/user-2011/TalkSlides/Contributed/16Aug_1115_FocusI_4-ReportingWorkflows_3-Leha.pdf)

Example work: <https://github.com/tsdye/hawaii-colonization>

Org-babel wiki: <http://orgmode.org/worg/org-contrib/babel/uses.html>

------
bipolarla
Health and code are much like any business: the talent wants to maintain its
edge. Many health companies, universities, and even non-profits find value in
holding onto what they create. I understand we want everyone to be healthy,
but it will never work that way. So much cost and competition is involved that
it will never be a "free" open-source world. I would bet that if Bill Gates
and Warren Buffett each put 5 billion dollars toward finding a cure, or toward
sharing code and paying creators, more people would be willing to share. Try
asking Coca-Cola for their secret recipe. Oh, and tell them you won't pay them
and will be using the recipe to make your own sodas to compete against them. I
believe you will be waiting a long time for them to call back with the info.

------
docmarionum1
Open-sourcing the code would only be the first step. To make the experiments
truly reproducible, you also need to know the hardware and software
configurations used to run it. Different package versions could lead to
different results. And, for instance, if you're running your code on one of
those old Pentium chips with an error in the FPU, that needs to be known.

I'm currently working on a platform for scientific programming. One of the
ultimate goals is to include a provenance system which will be able to tell
you everything about what generated the final results, including the
provenance of input data if it was derived on the system. That way you might
be able to have a complete history of where a particular result came from.
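
A minimal sketch of the capture side in Python (a real provenance system
would record far more, but even this much would settle a lot of arguments;
the input filename is hypothetical):

    import hashlib
    import json
    import platform
    import sys
    from importlib import metadata

    def provenance(input_paths):
        # Record enough to reconstruct (or rule out) the environment later:
        # platform, interpreter, installed packages, and input-file hashes.
        return {
            "platform": platform.platform(),
            "python": sys.version,
            "packages": {d.metadata["Name"]: d.version
                         for d in metadata.distributions()},
            "inputs": {p: hashlib.sha256(open(p, "rb").read()).hexdigest()
                       for p in input_paths},
        }

    with open("provenance.json", "w") as f:
        json.dump(provenance(["raw_data.csv"]), f, indent=2)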

------
antirez
Same code == less reliable independent verification. So open code is good but
independent verifiers should try to reimplement the software needed to verify
an experiment.

~~~
snowwrestler
Totally agree, and I'll add that in my experience non-scientists frequently
conflate "reproducible" with "verifiable". Simply downloading a data set and
code, and rerunning it, is not really a scientific endeavor. To verify a
scientific conclusion, other scientists need to design and run their own,
independent experiments aimed at testing the same hypothesis. That said, it
seems true that open source code can go a long way toward making that possible
by reducing ambiguity about what, exactly, was being tested, and how.

------
zerostar07
A successful example of code sharing: ModelDB
(<http://senselab.med.yale.edu/modeldb/>) , a database of neuronal models and
mechanisms. It contains lots of validated, reviewed simulations that are now
commonly shared in the comp-neuro community, making it extremely valuable.

------
Create
Some of it is available, like <https://svnweb.cern.ch/trac/>, or sometimes
locally, as a <http://gitorious.org/> instance combined with other tools, like
<http://dtk.inria.fr/>.

------
singingfish
I deal with this problem in the social sciences, where the problem is even
worse: data analysis by convention, with an overwhelming reliance on expensive
proprietary software... I'm actually talking to a bunch of academics on this
topic later this week, so this article is very timely.

~~~
disgruntledphd2
I work in the social sciences, and I have to say: no one cares about
reproducibility or replication.

I write all my papers in LaTeX and R, using Sweave to ensure that my code
matches my analysis. Typically, when I send PDFs or .tex files to anyone else,
they ask me for Word files. No one ever cares about the code (even though I
send it every time).

In fact, I (and other colleagues) have been asked to replicate our analyses,
done in R, in SPSS because (apparently) R is open source, so it can't possibly
be right. The sad part is that I started using R because many of the most
useful psychometric models are not available in SPSS (and probably never will
be).

To the second point, no one cares about replications. They aren't sexy enough,
so they don't get published. If you find something strange, you'll get
published in a good journal. The ten failed replications won't be published
anywhere near as good, so scientists don't bother to replicate.

------
bryanh
Is there a niche out there for the GitHub of science? My cofounder (mikeknoop
on HN) puts a lot of his scholarly research code on GitHub, but perhaps
a more specialized place with emphasis on peer review would be more
appropriate.

~~~
bbgm
There are many scientific groups where even today, no form of version control
is used, even for internal work, so Github is way ahead of what is current
practice in many places. There is a lot of good scientific code in various
repositories and I don't see why anything special is required. As computation
becomes even more important across many scientific areas, there is a lot of
need for discipline. If scientists learn how to use version control and
repositories just by default, that will go a long way towards reproducibility.

The Galaxy project does a great job of trying to foster such an environment:

<http://genomebiology.com/2010/11/8/R86>

<http://galaxy.psu.edu/>

<https://bitbucket.org/galaxy>

<http://wiki.g2.bx.psu.edu/Tool%20Shed>

~~~
kaarlo_n
> There are many scientific groups where even today, no form of version
> control is used, even for internal work ...

Where I work (government research lab) people think of Subversion as a
sporadic backup target, typically doing a commit every month or so, despite
making frequent changes to operational code.

Paradoxically, one scientist I spoke to was scared about overwriting code if
he did an incorrect commit, but he's perfectly happy to have mycode.py,
mycode2.py, mycode_this_one_works.py, mycode_this_one_works3.py, ...

------
cwhittle
Speaking as a scientist who deals with genomic data, I wholeheartedly agree
with many of the comments here. Code and raw data should be available at
publication. I shouldn't have to try to figure out what you did from the
three lines of text and the poorly documented software you mention (which has
been updated several times since you used it, with no mention of the version).
Personally, I
think pseudo-code would be most useful for reproducibility and for
illustrating exactly what your program does.

Let me add a few points about the practical obstacles to this.

1) Journals don't support this data (raw data or software).

  * You can barely include the directly relevant data in your paper, let alone anything additional you might have done. Methods sections are fairly restricted and there is no format for supplemental data/methods. Unless your paper is about a tool, they don't want the details; they just want benchmarks. Yes, you can put it on your website, but websites change; there are so many broken links to data/software in even relatively new articles.

* As many people have said, lots of scientific processing is one-off type scripting. I need this value or format or transform, so I write a script to get that.

2) Science turns over as fast as or faster than the lifetimes of most
development projects.

  * A postdoc or grad student wrote something to deal with their dataset at the time. Both the person and the data have since moved on. The sequencing data has become higher resolution or changed chemistry and output, so it's all obsolete. The publication timeline of the linked article illustrates this: for just an editorial, it took 8 1/2 months from submission to publication. Now add the time it took to handle the data and write the paper before that, and you're several years back. The languages and libraries that were used have all been through multiple updates, and your program only works with Python 2.6 and some library that is no longer maintained. Even data repositories such as GEO (<http://www.ncbi.nlm.nih.gov/geo/>) are constantly playing catch-up with the newest datatypes, and even their required descriptions of data-processing methodology are lacking.

3) Many scientists (and their journals and funding institutions, which drive
most changes) don't respect the time or resources it takes to be better coders
and release that data/code in a digestible format.

  * Why should I make my little program accept dynamic input, or properly version it with commentary, if that work is just seen as a means to an end rather than as an integral part of the conclusions drawn? The current model of science encourages these problems. This last point might be specific to the biology-CS gap.

------
mjwalshe
Erm, as an ex technical programmer and research assistant for a world-leading
R&D organization, I'm not sure I buy this for all experiments. An experiment
needs to be reproducible, yes, but…

Most science is based on physical observation of the experiment; the code is
just an offshoot of the test equipment.

In cases where you are modelling something, you do experiments to prove your
mathematical model. I once spent a sweltering afternoon in a bunny suit,
rubber gloves, and a mask helping prepare a dummy fuel rod from a breeder
reactor so that we could do experiments to see if our model of two-phase flow
was valid.

And surely everyone can see the danger in "you can reproduce my experiment,
but only by using my code": you would want to repeat the experiment and
implement your own version of the maths behind it.

~~~
bo1024
> Most science is based on physical observation of the experiment; the code
> is just an offshoot of the test equipment.

Even if we accept this as true, I don't see why it's an argument against
publishing code for that science which does directly depend on the simulations
you run.

> you would want to repeat the experiment and implement your own version of
> the maths behind it.

That's a good point, but only valid if the exact mathematics and methods are
clearly explained elsewhere. But as the article states, usually there's
ambiguity. And if I try to reproduce your simulation and get different
results, it's very difficult for me to get enough confidence to call you out
on it (perhaps I'm the one who screwed up). If I find a bug in your code, it's
easy.

------
Craiggybear
All scientific software should be totally open and transparent -- indeed a lot
of it already is. For years, for example, people made the mistake of trusting
Excel's statistical functions without being aware that they were deeply
flawed.

Software that is in use for scientific purposes must be open to review, and
assumptions about its efficacy or correctness should not just be taken for
granted. It needs to be checked and its outputs verified for correctness.

Even when flaws in commercial proprietary code are found, it can take years
before they are corrected, if they ever are. Chances are that if the same
flaws show up in open-source software, they'll be fixed sooner. Failing that,
you can fix them yourself, or at least be in a position to detect them and
alert other users.

