
The unsung heroes of scientific software - bootload
http://www.nature.com/news/the-unsung-heroes-of-scientific-software-1.19100
======
s_q_b
Personally I believe that open code, raw data, and cleaned datasets should be
requirements for publication in any peer-reviewed journal in which a paper
claims that computational mechanisms were involved.

Failure to distribute the data and code not only greatly devalues the
contributions of many scientists; it also makes replication far more
difficult and opens the door to outright fabrication.

If computation is a necessary element of the research, it ought to be a
necessary element of publication as well.

~~~
noname123
>Failure to distribute the data and code not only greatly devalues the
contributions of many scientists; it also makes replication far more
difficult and opens the door to outright fabrication.

Reasons why labs don't produce code:

(1) Fear of being scooped; suppose you put tons of man-hours into a custom
population-genetics association study for malaria in Africa. You don't want a
competitor lab to sequence a bunch of samples in SE Asia, run your code
as-is, and publish a paper when you have the sequencers to do the same.

(2) Fear of competitors not being able to replicate stochastic ML results;
some machine-learning and neural-network papers applied to biology are
stochastic, and past authors have been accused of "cherry-picking" the most
"optimistic" runs that show positive results, e.g.,
[https://liorpachter.wordpress.com/2014/02/11/the-network-nonsense-of-manolis-kellis/](https://liorpachter.wordpress.com/2014/02/11/the-network-nonsense-of-manolis-kellis/).
(A sketch of the seeded, full-distribution reporting that addresses this
follows the list.)

(3) Focus on science, not tool production; most labs' focus is producing
publications, not open-source software. The user base for scientific software
is very niche, unlike Web MVC frameworks, which makes the payoff calculus of
packaging and supporting external users unattractive. Furthermore, most labs
run custom databases, which makes integrating external software, or
externalizing internal software, difficult.

~~~
pvnick
Just to follow up on your last point, I have found that folks are often happy
to send code, but code written in research labs is generally horrendous. I
don't think the "publish or die" culture and a culture of releasing all
relevant code are compatible. It takes time to refactor and produce clean
code - time that could instead be spent chasing the next publication.

~~~
kwhitefoot
Then requiring the code to be published and runnable would have a salutary
effect on the whole circus.

~~~
SapphireSun
Not really... I don't see how that would be anything more than adding one more
task to a million other tasks. It doesn't change incentives.

------
jcoffland
I've been developing research software for 10+ years and I experience this
all the time.

As one example, over a decade ago I wrote a biological simulation program
called CompuCell3D, the successor to CompuCell, a 2D simulator of cells as
cellular automata. Since then many papers have been written based on research
utilising this software. Not only have I not been credited in any of these
publications, but my name has been removed from the software, the website
never mentions my original contribution, and the current maintainers of said
software are not responding to my emails.

Granted, this code has changed a lot over the years, but large parts of it
remain verbatim from my original software. The researchers in this project
see it as totally irrelevant that I wrote the original code, because they
value code creation MUCH less than their research.

This particular case is especially egregious, but it is a good anecdotal
example of the problem. Sometimes code is not as significant or as
groundbreaking as the raw research; it really matters what kind of coding you
are talking about. You don't necessarily credit the construction crew when
dedicating a new building, but you do credit the architect. Coders are
undervalued in today's research environment.

~~~
coliveira
> Coders are undervalued in today's research environment.

Exactly, because it is a research environment. Coders are valued in a tech-
company environment because that's where they're the stars. In any other
organization, such as a transportation company or a government bureau, a
coder is just an assistant to the main tasks, and there is no reason it
should be different.

~~~
spacecowboy_lon
And in scientific computing the senior researchers are not paid very much, so
the supporting developers/technicians are badly paid.

When I was a Research Assistant/Experimental Officer at a world-leading R&D
organization, I was paid about a third of what other jobs with similar entry
requirements paid.

~~~
jcoffland
That depends on who you work for. I've been paid well for research
programming, but only by top-level universities and research institutions
with money. Working for a state university usually won't pay well, if at all.

------
dikaiosune
Very neat, but they're only indexing CRAN and PyPI right now. In my short
time doing scientific computing (I'm a staff member at a university right
now), I've seen a LOT of C and C++, but those languages don't have canonical
package repositories. Finding all of those codebases (even if just limited to
GitHub) and indexing them would be very interesting, and would be necessary
for some fields to be represented at all.

~~~
lutorm
I strongly oppose "just limiting to GitHub". If this is to be used as some
sort of merit factor, inclusion can't be contingent on using a particular
centralized repo. For a start, there are several such providers, and larger
projects might well prefer self-hosting.

~~~
dikaiosune
I would oppose it too in the long run, but it's the lowest-hanging fruit for
getting the ball rolling on including a wider variety of research software,
and I was mostly making a suggestion for next steps in developing their
(pretty cool) tool.

You have to start somewhere, and if you get GitHub working it would
(hopefully) be much easier to then include BitBucket, GitLab, etc. Indexing
self-hosted repos would be pretty tricky, I imagine, if only because you'd
need to maintain a list of all the servers to clone from.

Further, there's the challenge of including software from all the research
groups who don't appear to use version control at all, or who keep it locked
away on a private server or service. How does one attribute authorship to
those people without a source history?

Anyways, this is a long ramble now, but I'm mostly trying to illustrate that
it's difficult to do what Depsy does for software written in languages
without a canonical package repository. That doesn't mean they shouldn't try
to expand their reach just because "GitHub isn't enough."
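
As a starting point, even the public GitHub search API would probably be
enough to seed such an index. A minimal sketch, assuming the requests library
and the standard REST search endpoint; the query terms are illustrative only:

    import requests

    # Search public GitHub for C++ repositories that look like research
    # software; "simulation" is just an example query term.
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "simulation language:c++", "per_page": 10},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        print(repo["full_name"], repo["clone_url"])

Attribution would still need the commit history from a clone, but a crawl
like this at least bounds the discovery problem.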

~~~
lutorm
I was kind of figuring that people would register their own software, rather
than the index somehow trying to find it. Accessing different repos on
different servers doesn't seem much harder than accessing just GitHub; either
way you have to keep a URL.

Of course, if it's not on a public server then you can't do anything, but
then again, in that case the authors can't really complain that they don't
get credit for what they do.

------
chrisdbaldwin
When I think about unsung software heroes I think of Rob Scharein[1]. He is
the creator of KnotPlot[2], which has enabled topological research to
flourish. He receives regular acknowledgement among topology researchers, and
he's an awesome guy. Do yourself a favor and play with some knots.

[1] www.hypnagogic.net/rob/

[2] www.knotplot.com

------
jamesblonde
I fully agree with the principle that experiments should be reproducible. In
systems research conferences (SOSP, EuroSys, OSDI) very few of the systems
publish their algorithms. The worst culprits are the companies, such as
Google, Facebook, and Microsoft, who don't even publish all of their
algorithms or system properties. Google have published significant papers
that leave out the key algorithms for how parts of their systems work (e.g.,
in Borg, how does the scheduler synchronize the cluster state with the
replicas? In Spanner, what are the properties of the underlying storage
system, Colossus?).

I would love to see every distributed-systems paper reproduce its results
from a single downloadable file. We can do it with the help of virtualization
or containers plus software configuration frameworks (to parameterize the
experiments): specify the hardware and network programmatically, install the
software automatically, parameterize the system software (with Chef or Puppet
attributes), and run the experiments reproducibly. With cloud computing and
tools like Docker, we could specify systems with reproducible
hardware/network/software without huge investment.

The first attempt I've seen at this is www.karamel.io, which lets you design
reproducible experiments by using JClouds to spawn VMs and set up the
virtualized hardware, then orchestrating Chef to install software that can be
parameterized. I hope to see more platforms along these lines gain adoption
in our community. It would be a boon for research.
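
As a rough illustration of the container half of this idea (a sketch only:
the image name and experiment flags are hypothetical, and this is not how
Karamel itself works):

    import subprocess

    # Pin the environment to an exact image digest so every run sees the
    # same software stack, then pass the experiment parameters explicitly.
    IMAGE = "example.org/paper-artifact@sha256:..."  # hypothetical digest
    PARAMS = {"replicas": "3", "workload": "ycsb-a"}

    cmd = ["docker", "run", "--rm", IMAGE, "run-experiment"]
    for key, value in sorted(PARAMS.items()):
        cmd += ["--" + key, value]
    subprocess.run(cmd, check=True)

The same dictionary of parameters can feed Chef/Puppet attributes for the
non-containerized parts, so the whole experiment is defined by data rather
than by hand.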

~~~
dekhn
Lots of people have proposed and implemented VM-based reproducibility
environments.

1) it's actually harder to get reproducible computational experiments than
they expected. For example, you can run the same VM on a different processor
and get different results, which makes bitwise reproduction hard, and
statistical tests for non-bitwise equality are harder still (see the sketch
below)

2) developing and maintaining the VMs and the environment takes a fair amount
of effort from skilled people

3) the resulting improvements to science don't seem to exceed the costs
implied by #1 and #2, and nobody's volunteering their time.

My conclusion: good idea, but probably not critically necessary.
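
On point 1, the root cause is that floating-point arithmetic is not
associative, so any change in evaluation order can change the bits. A toy
Python illustration of why bitwise checks fail where tolerance checks
survive:

    # Reordering a floating-point sum, as a different CPU, compiler, or
    # thread count might, changes the result at the bit level:
    a = (0.1 + 0.2) + 0.3
    b = 0.1 + (0.2 + 0.3)

    print(a == b)               # False: bitwise reproduction fails
    print(abs(a - b) < 1e-12)   # True: a tolerance-based check passes

Choosing that tolerance is exactly the "statistical tests for non-bitwise
equality" problem: too tight and legitimate reruns fail, too loose and real
bugs slip through.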

~~~
jamesblonde
That's good feedback. Although I'm considering mostly distributed systems,
where full reproducability is not possible. For single-threaded, deterministic
programs (discrete event simulations) where performance is not a measurement
point, you can do it with VMs. The nicest thing would be to be able to
parameterize experiments. Not just re-run them, but change the parameters for
different runs.
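
A minimal sketch of what that could look like, with a hypothetical driver
script and made-up parameter names; the point is that every run is fully
determined by the grid plus a seed:

    import itertools
    import subprocess

    grid = {
        "nodes": [3, 5, 9],
        "workload": ["read-heavy", "write-heavy"],
    }
    # Re-running the paper's experiment is one grid; a variation is
    # another grid, with nothing else changed.
    for nodes, workload in itertools.product(grid["nodes"],
                                             grid["workload"]):
        subprocess.run(
            ["./run_experiment.sh",  # hypothetical driver script
             "--nodes", str(nodes), "--workload", workload, "--seed", "42"],
            check=True)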

------
chenning
Soooo... should people get recognition, or not?

Re:
[https://news.ycombinator.com/item?id=10838166](https://news.ycombinator.com/item?id=10838166)

------
pdm55
My unsung software hero is Dr. Jeffrey Lewis Fox (deceased in 1999 at age 51),
Associate Professor, Department of Pharmaceutics and Pharmaceutical Chemistry,
University of Utah
[http://pharmacy.utah.edu/pharmaceutics/news/1999.html#fox](http://pharmacy.utah.edu/pharmaceutics/news/1999.html#fox)
He wrote MINSQ (now called Scientist), the only piece of software I have ever
fallen in love with. He sold the first version for $10 (if I recall
correctly) in 1990. It was so user-friendly that I was using it within 10
minutes to solve a system of differential equations that was part of my PhD
research.
(Unfortunately, I don't find the more recent versions, "updated" by others,
anywhere near as user-friendly.) He called his company, at the University of
Utah, MicroMath. I assume he used the algorithms from "Numerical Recipes" to
write MINSQ - I know that was the alternative facing me till his excellent
software appeared. I really think the U of Utah should have kept the rights to
the software and maintained it in his honour.
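
For anyone curious what that kind of task looks like today, here is a rough
modern equivalent of what I used MINSQ for (least-squares fitting of an ODE
model), using scipy and a made-up one-compartment elimination model with
invented data:

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    # dC/dt = -k*C, observed at a few time points (data invented here).
    t_obs = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
    c_obs = np.array([9.0, 8.1, 6.6, 4.4, 1.9])

    def residuals(params):
        c0, k = params
        sol = solve_ivp(lambda t, c: -k * c, (0.0, 8.0), [c0],
                        t_eval=t_obs)
        return sol.y[0] - c_obs

    fit = least_squares(residuals, x0=[10.0, 0.1])
    print("C0 = %.2f, k = %.3f per hour" % tuple(fit.x))

MINSQ made this kind of fit approachable in 1990, long before anything like
scipy existed, which is exactly why it inspired such loyalty.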

------
abawany
I will use this opportunity to mention Maxima
([https://en.wikipedia.org/wiki/Maxima_%28software%29](https://en.wikipedia.org/wiki/Maxima_%28software%29))
that was developed by one of the best math professors that I have had the
privilege to learn from: Bill Schelter. It was forked from Macsyma and
released under the GPL and still appears to be actively maintained.

------
amelius
What about generic tools, such as web browsers? The web was invented at CERN
by a researcher, so I guess web browsers must have some value to the
scientific process :)

