

What's the point of writing good scientific software? - michaelbarton
http://www.bioinformaticszen.com/post/whats-the-point-of-writing-good-scientific-software/

======
sophacles
A large portion of my job, and my team's job, is programming for research at a
university. We get "researcher code" that was used to write a paper, and turn
it into something for the next set of researchers to build on. There are
several other groups at this university that do the same thing.

I'm starting to think there may be a field of study here. I regularly take
software that "proves" some hypothesis or "shows some good results", tear it
down, and throw some engineering at it, only to find that the results are not
reproduced, or the benefits are severely diminished. Then I have to go track
down the why... because until I can show otherwise, it is assumed to be my
fault.[1] The results of some of this are probably paper-worthy themselves.

I think a useful field, or at least a useful conference, could be built for
people in this type of position, studying the meta effects of software on
research. How can we report the issues found, or the updates to the numbers,
without putting black marks on the reputations of people who are actually
doing good work?

Another interesting phenomenon worth studying is that the
refactoring/rewriting process often gets real results, but it turns out the
mechanism for the improvement isn't what the original researcher
thought/claimed. It is something perhaps related, perhaps a side effect, and
so on. There needs to be a way to recognize the original researcher, the
programmer who found the issues, and the follow-up researchers who further
pinned down the problem.

[1] This isn't as antagonistic as it sounds. It is actually a nice check on my
own mistakes. Did the differences between what the researcher did and what I
did introduce some strange side effect? Did I remove a shortcut that wasn't
actually a shortcut because I misunderstood it? A hundred other things on both
sides... Research has a large component of "we don't know what we're doing,
axiomatically so", and as such this is a decent way of finding out more.

~~~
plg
case in point:

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating
System Version on Anatomical Volume and Cortical Thickness Measurements

<http://dx.plos.org/10.1371/journal.pone.0038234>

------
Xcelerate
I've been debating about whether I want to open-source the scientific code
I've been writing. A lot of it could be useful to other people in the
molecular dynamics field.

I recently introduced my advisor to Github, and he thought it was a good idea;
however, there were a few hesitations. The first, and most important, is the
likelihood of a bug. If you put your code on a very public website like
Github, there is a chance it's going to be scrutinized by everyone in your
field.

Now, unless you are one of the best programmers who has ever lived, there are
bound to be bugs in your software, and when someone discovers them, it could
have a deleterious effect on any journal articles you've written that used
that code. The issue is that even though most bugs do not lead to significant
changes in results, you would still need to redo all of your analyses to make
sure that is the case. The software industry has long recognized buggy
software as a reality, but I don't think the scientific community is as
tolerant of it (hence the reason a lot of people hide their code).

For my MD simulations, I use the well-known LAMMPS package. Bugs in it are
discovered all the time! (<http://lammps.sandia.gov/bug.html>). So I think
there needs to be a collective realization among the scientific community that
these are bound to occur, and authors of journal articles can't be persecuted
every time for it. A lot of computational work is the art of approximation, so
I would just lump "human incompetence" in as one of those approximation
factors.

Despite this risk, I think I'm still going to release my code at some point as
I would personally welcome critique and improvement suggestions. I'd like to
think I'm a better coder than most scientists, since I've been coding since I
was twelve across multiple language paradigms and have won a major hackathon,
but eh, who knows. I'm quite sure my _environment_ isn't up to industry
standards, because I've always coded solo rather than in a team.

~~~
plg
holy moly. It sounds like you don't want to release your code because someone
else might find a bug that you missed (and that might have a "deleterious"
effect on your journal articles).

You would rather leave potentially incorrect work standing than have a bug
corrected?

What the f*ck has happened to science.

~~~
nagrom
_What the fuck has happened to science._

This is fairly standard in the science industry (I use the term deliberately).
I get rated (in nuclear and particle physics) on how many papers I've
published, what indications of recognition I get from my peers, and at which
conferences I've been invited to speak. Short of something like a Millennium
Prize or a Nobel (one-in-a-million, and both still very political), there are
almost no direct rewards for accuracy or importance.

Academics tend not to be the people with the most understanding or the best
direct insight - they're the people with the most friends on committees and
who do the most fashionable things (usually badly). And so they appoint new
academics who are a little bit worse than them (no point in opening yourself
to competition - instead bring in people who are going to be grateful to you!)
and the cycle continues.

~~~
mgillett
This is what turns me off so much from academia. Is this really that typical
at most universities? I know that there's always been a large emphasis placed
on publishing in high-impact journals (and often), but isn't the open
publishing movement beginning to have an effect, albeit a rather small one?

~~~
nagrom
Not that I have seen. In fact, in my experience, even publications are not all
that important! In both the UK and Germany, it seems to be 30% what you know,
60% who you know, and 10% luck.

------
ylem
I would say that you should look at what stage you are in your career and what
your goals are. If your goal is to become research faculty, you should focus
on getting high impact papers out of the door--software is a tool for helping
you do so.

If you find yourself re-using that bit of code, then it may be worth cleaning
it up and making it maintainable. If people start sending you requests for it,
then it may be worthwhile open sourcing it, documenting it, maintaining it,
etc.--but only if you have time.

I do make open source scientific software as part of my job, but I'm at a
later stage in my career and it's not something I would have a science postdoc
work on--it's just not fair to them and their career prospects within
science...

Recently, someone asked for some reduction code that I've developed and I
realized that while it was documented, I didn't have time to refactor it and
clean it up--finally, I just put it on github and told them to contact me if
they had questions--they were happy to have it as a starting point for what
they wanted to work on. So, if you believe that you've made something
worthwhile, but don't have the bandwidth to maintain it and other people might
find it useful, sometimes it might be better to just put it out there and let
people play with it--no guarantees, but it may help someone else get
started...

You can get a large number of citations in some subfields for writing commonly
used software--but it may or may not help your career. For example, I have
friends at various institutions around the world that tell me that their
management gives them no credit for developing useful software (complete with
lectures, updates, documentation, etc.)--they just release it because they
feel they should and most of them are also already tenured in their positions.

Good luck!!!

~~~
_delirium
I agree there's little direct credit given for developing software, but it can
be helpful for early-career researchers to gain name recognition. Even if
nobody knows your actual work, if you wrote some software lots of people use,
it makes people feel like they know you from somewhere.

Admittedly, it's tricky to do, since that name recognition only really matters
if you can _also_ manage to publish enough papers. Realistically, grad school
is a more likely time for that than a postdoc or an assistant professorship.
Some grad students manage to release some widely used software (well, usually
"widely" in a particular niche), which I think does help them build up more
prominence than someone in their career stage might otherwise have had.

On a different angle, having produced some reasonably decent software can be a
nice thing to have in your back pocket if you ever consider moving to
industry. Having N papers and one decent software package is probably a better
academia-to-industry transition CV than N+2 papers and no software packages.

~~~
michaelbarton
I agree that producing a piece of software many people use is great for your
career. I think, on average, most software is never downloaded or used. So,
given this prior, is it a good investment of my time to flesh out
documentation and examples, rather than doing just enough to get it published?

------
JohnBooty
"I have previously believed that converting any code you've created into an
open-source library benefits the community and prevents reinvention of the
wheel [...]

I have however started to realise that perhaps something I thought would be
very useful may be of little interest to anyone else. Furthermore the effort I
have put into testing and documentation may not have been the best use of my
time if no one but I will use it. As my time as a post doc is limited, the
extra time and effort spent on improving these tools could instead have been
spent elsewhere."

From a purely selfish perspective, I've found that documenting and cleaning up
my own code benefits me in the future. Even if it's a one-off, single-purpose
utility that I'll never use again in the future, I often find myself needing
to borrow bits of code from my old projects. ("Oh, I solved this problem
before. How did I do it? Let's dig up that old, old project...") At which
point, present-day me benefits if my past self bothered to actually document
things and make sure they're reasonably robust.

There are countless other reasons (moral and pragmatic) to document, test, and
open-source one's code, of course! Many of them more important than the
ability to crib one's old code, I'd argue.

But the author seems to have considered (and discarded) them...

~~~
michaelbarton
I am the author of this post. I don't disagree with anything you wrote.
Several posts on my blog say exactly this. The software I wrote, however, is
open source on github [1], has its own website [2], has man pages for each
command [3], example projects [4], and three screencasts showing how to
install and use it [5].

I did this because I wished that all bioinformatics software had this
attention to usability and documentation. However, now I wonder what the point
of all this was if no one ever ends up using it. I could have done the minimum
for publication, then spent this time finishing the other manuscripts I have
waiting.

As I wrote though, I agree with what you wrote in your comment. I just don't
think there is any incentive for post-docs in academia to prioritise writing
good software over pushing out additional papers.

[1]: <https://github.com/michaelbarton/genomer>

[2]: <http://next.gs>

[3]: <https://github.com/michaelbarton/genomer-plugin-view/tree/master/man>

[4]: <https://github.com/michaelbarton/chromosome-pfluorescens-r124-genome>

[5]: <http://www.youtube.com/user/BioinformaticsZen>

~~~
JohnBooty
Instead of, "But the author seems to have considered (and discarded) them"

I ought to have written, "But the author has considered them and concluded
that their benefits don't outweigh their costs in his case."

I totally support what you're saying! IMHO it's definitely not worth it to
document/open-source/etc code at the cost of one's career or happiness,
especially when the code is of questionable utility to others.

------
tmarthal
I used to call the scientific software that I was writing, "paper-ware".

You aren't building a system for other users; you aren't really doing anything
other than one-off analysis to create charts, which will be explained in a
paper.

Things have changed somewhat since the early 2000's, but the concept remains
the same. Nowadays, for interesting or controversial results other scientists
want to be able to verify your results. However, that is usually more related
to your data and how you processed it, rather than your software algorithms
(which should be explained in the paper, and can be recreated from that).

So do these systems need reams of documentation? Probably not. However, if you
leave the system for two years and come back to work on it, or to figure out
how it used to work, then you'd best have enough comments, along with a
thorough readme covering some of the decisions you made and why. It's more
analogous to scripting than to software engineering.

~~~
shardling
On the other hand, if you put good software out that people use, _it counts as
a citation_. The most cited prof in the dept I graduated from maintained a
widely used program for astronomical simulations.

~~~
Xcelerate
Seriously? That's intriguing... If you wrote a very popular scientific
library, how much would that impact your career compared to, for example, a
Nature publication?

(I ask this because, for me, I think the former is much more likely than the
latter.)

~~~
epistasis
They serve somewhat different purposes, but an influential library will have
far, far more citations. At least in bioinformatics, that's the only way to
get a large number of citations, because at their core bioinformaticians are
tool builders! For example, samtools has ~1400 citations [1]. The original
BLAST paper has ~44k [2].

However, there are a ton of bioinformatics libraries released every year, and
almost none of them gain any traction. Nature publications are far more
frequent than important new libraries, and you need more political clout to
get the library popular than you need to get a Nature paper.

Really, you need to gauge usefulness and interest of your library before you
devote a ton of time to it. It's a lot like a startup's product.

[1] <http://scholar.google.com/scholar?q=samtools>

[2] <http://scholar.google.com/scholar?q=BLAST>

~~~
mgillett
I think this is more a comment on how the system is broken. Researchers should
be notified of new libraries in their area. At the very least, they should be
able to consult a single site that everyone uploads their code to (think
Github for science with more emphasis on exploration). Academic journals are
not the only channels carrying useful information.

~~~
epistasis
It's not the discoverability of the libraries that's the problem; it's that
the utility of these libraries is generally not that great for anyone except
the authors. One common type of library handles data transformation,
normalization, and maybe even workflows. These abound. But they are rarely
useful in other people's hands, because to extend them and actually get any
work done, you need to spend as much time learning them as it would take to
write the thing from scratch. And the advantage of writing it from scratch is
that you know it intimately, all of its assumptions and flaws, which you don't
know about somebody else's code, even if it's extremely well documented.

Take something like Taverna [1], which is probably very useful to some people,
and had been recommended enthusiastically to me by many people. After spending
three hours reading documents and searching the web, I could not get it to do
what I needed, so I wrote a simple one-off bash script that interfaced with
our cluster system. Alternatively, I could have tried to hack in loops, but
that would have taken me 10x as long, would have required me to interact with
many other people who obviously don't understand my problem (since they did
not consider it a fundamental need), and might not even have been accepted
back into the mainline, at which point I'm off on my own fork and lose the
benefit of using a common code base. Waiting 1-10 hours to hear back from the
dev mailing list is unacceptable when you're trying to get work done.

Is it more important to get the result, or to use other people's code?
Reinventing the wheel is a minor sin compared to not getting results.

[1] <http://www.taverna.org.uk>

~~~
mgillett
I just think that very much depends on the field and the problem domain.
Taverna seems like it's targeted more towards academics who don't know how to
code, and most people who use it are comfortable staying within its limits. I
mean, you are definitely going to have a level of project specificity much
higher than, say, that found in the web development world. In science, many
people are searching for the existence of new problems, not just the answers.
Why build a gem for email integration if the next best method of communication
will likely come out next week?

The problem with this thinking is that it perpetuates itself. I don't write
the library that only you would find useful because I don't think it's worth
my time. In return, I never receive anything useful because everyone else has
adopted the same mindset.

As some others pointed out, I think the problem rests in the lack of best
practices and poor comp sci education among researchers. Teach proper library
construction and a test-driven philosophy, and I think you'll see a lot more
people become comfortable writing and publishing libraries. Cobble together
some basic documentation, keep an eye on its use, and contribute more
accordingly. You're never going to escape writing custom scripts, but there
are more well-defined problems out there that could use standard solutions.

------
lmm
Like everything else in software, code quality should be feature-driven. Write
the minimum to do what you need to. If you find that your code's poor quality
is becoming a problem (whether because it's slowing your own development down,
or other people aren't using it and you want them to, or whatever reason), do
something about it then, but not before.

~~~
michaelbarton
I agree. As an individual, this is what everyone should do: just write the
software to get to the next publishable unit. I think, however, that this
leads to poor-quality software for the field as a whole.

------
abraxasz
"I have begun to think now that the most important thing when writing software
is to write the usable minimum. If the tool then becomes popular and other
people begin to use it, then I should work on the documentation and
interface."

That. Like someone pointed out, I find that documenting and testing the key
parts (that is, those I know at least I will reuse) is always a good
investment of my time and prevents major headaches down the road. I've been
experimenting with project structures that clearly separate the tools and
functions that will be reusable from those that are one-shot. I focus all my
testing effort on the former, and cut myself some slack on the latter.

Btw, I speak from a "scientist" perspective, and nothing I say applies to
professional software engineering (I mean, I don't think it does).

------
roadnottaken
This is debatable, but IMHO your job as a post-doc is to learn new things
about biology and publish papers on what you've learned. If you can document
your code along the way, that's great. But if it's taking up a bunch of your
time then it's probably a misguided effort.

~~~
robotresearcher
No, your job is to contribute to the science of biology (or whatever). If you
can have a large impact building specialist tools, that's a legitimate
scientific contribution. This is commonplace in e.g. astronomy, where a PhD or
postdoc could be part of a team building an instrument. The student is long
graduated before the thing gets first light, so doesn't directly discover a
damn thing about galaxies or whatever, but their instrument is a fine
contribution. Science software is the telescope of tomorrow. (And today, but
it doesn't scan so nicely.)

My bona fides: I helped create a widely-used software system in my field, and
have received reasonable credit for it as a scientific contribution.

------
ozataman
A big concern for me has always been correctness. You're more likely to make
mistakes and miss edge conditions in sloppy code. There's nothing worse than
communicating some positive/inspiring results, only to find out later that you
had an elusive computational bug in there that invalidates the results.

This reminds me of the story: a man was seen cutting down a tree with a
dull-bladed axe. A bystander asked him, "Why not sharpen your axe first?" The
cutter responded, "I don't have the time!"

~~~
michaelbarton
I agree with respect to writing correct scientific software. In many cases,
though, bugs are not found, because who is going to download the software and
test the conclusions? Unless it's a very big result, e.g. an arsenic backbone
for DNA, few academics will spend the time to validate others' results at this
level of detail.

------
wallerj77
I'm curious, is there a place where you can submit your software to the
community and tag it as relevant for doing A, B, C, so that others can use it
for the same purpose or even build on it further? I have limited experience
with software in your field, but it seems like there isn't a good way to find
tools already built to address your needs, or at least come close. Am I wrong
or missing something?

~~~
michaelbarton
As far as I know, nothing like this exists. It could be useful, though. There
are a few publications that do critical comparisons of scientific software;
the Assemblathon is one example.

------
elchief
I was just thinking the other day about how good academic software is getting,
and how useful it is to society that masters and PhD students are making
software for their research.

Look at RapidMiner (developed at U. Dortmund), Stanford's CoreNLP, and the
brat rapid annotation tool. These are better than a lot of commercial tools.
They are more text-analytics than bioinformatics, but same diff.

------
gwern
Funny, I was just comparing the incentives for releasing scientific software
to those of releasing data:
<http://multiplecomparisons.blogspot.com/2013/02/making-data-sharing-count.html>

And now I see this post questioning the value of writing up and polishing
scientific software!

