
The myths of bioinformatics software - bbgm
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
======
roel_v
Much of this is equally applicable to other fields, but I very much disagree
with point 2. Every lab should have one or more _programmers_ , people who are
professionals at writing software, and who guide researchers in their software
development. For both efficiency and accuracy reasons.

But of course a software developer at a university is, at best, a 'lab
assistant', and more likely regarded as being on the same level as the janitor
(both in respect and in pay). With the result being thousands upon thousands
of shitty programs you wouldn't wish working on to your worst enemy.

But hey, cool with me, I carved out a consulting niche in cleaning up such
messes in exactly that environment. But man could a lot of money be saved, and
a lot of much better work be done, if only researchers (and the hierarchy
above them) would recognize that software development is both _critical_ to
pretty much any research today, _as well_ as something they cannot just pick
up on the side.

~~~
_Wintermute
Problem is that scientists get paid pretty terrible wages, though this is
expected and pretty much drilled into you since undergrad: 'If you wanted
to be paid for your brains, you should have gone into finance,' and so on.

On the other hand, programmers and software developers are not likely to put
up with being paid peanuts when there are plenty of well paying jobs out
there. I can't see much reason why many of them would stick around.

~~~
roel_v
"Problem is that scientists get paid pretty terrible wages"

I can't claim to know the situation everywhere, but apart from the postdocs
(and I guess the 'adjunct' temp positions endemic in the US), assistant and
full professors actually get paid quite well (I know of Western Europe, AU/NZ,
Canada, and I think it's the same in most places in the US too - although I
don't know first hand there). Funding agencies _would_ fund software people.
There is no _will_ to change it though, because hey, I wrote 5000 lines of C or
Fortran back in the '80s when I was a grad student, so why can't you do the
same today? I have a CD-ROM with a very good C++ compiler I've been using
since the early 1990s, I'll send you a copy, you should try it out! (I
paraphrased that last line, but I was actually told this less than 4 months
ago).

------
meeper16
And the most long-standing myth, which first started after the Human Genome
was sequenced by Francis Collins and Craig Venter (now working on human
longevity) over 14 years ago: that bioinformatics software will single-handedly
be responsible for discovering billion-dollar drug targets. This is in large
part why most early bioinformatics companies failed - a failure to deliver on
this front, along with jangled software approaches that were being moved into
the commercial world from academia. The reality is that most bioinformatics
software relies on old formal methods and is not geared toward truly
innovative discovery and data interpretation.

I do think, however, that we are entering a new age of bioinformatics and its
associated data mining, interpretation, visualization and discovery tools,
which hopefully will push the boundaries of being less formal and more
experimental.
We need to make discoveries faster when it comes to Life Sciences.

I think new approaches in bioinformatics/datamining/data science/visualization
will have the greatest impact in the areas of extending human lifespan. This
is what Craig Venter, Google's Calico Labs, SENS, GenoPharmix, and the Buck
Institute are all working on now.

~~~
kodisha
I also think it has to happen on the web (open or closed), instead of the
desktop software.

Also, d3 won't be able to pull it off. If we do a comparison, I think that d3
is MooTools; we need to get to ES6/Angular/React-class libraries.

~~~
collyw
What do you feel that ES6/Angular/React do that d3 won't? They are for
building user interfaces. d3 is for charting.

(Most bioinformaticians I know seem to use R).

~~~
kodisha
It was a comparison. Once MooTools was the library to use; now it's dead.

------
coliveira
An important point in this article is the vital distinction between academic
software and "general purpose" software. The goal of research software is to
prove or exemplify a point explained in one or more papers. It should do so in
the most direct and economical way for the researcher(s). It makes no sense to
create multi-platform, software-engineering-friendly software for this kind of
use. In the last few years we have seen an untold number of complaints that
research software is not robust, user-friendly, etc. - criticism that entirely
misses this important fact.

~~~
angersock
The problem is that the "direct and economical way" tends to basically enforce
bit rot and a lack of reproducibility. It also tends to screw whoever inherits
the project.

~~~
coliveira
Reproducibility is an issue in all scientific fields, it is not something
unique to computer science. Can someone easily reproduce experiments performed
in a particle accelerator? What about an extremely complex chemical reaction?
The answer to these problems is to have more people doing research in that
area to make sure that the problem is well understood, instead of requiring
that scientists slow down the research because it needs to first achieve some
kind of "engineering reproduction" standard.

~~~
bbgm
Speaking for the chemical reaction - yes. That is exactly what should be
possible. Some reactions are hard to reproduce and labs spend years trying to
figure out how to do so. Without that you really don't have good science.

The key is .. can someone read your paper and reproduce your experiment and
the results. If not, it's not valid; it's just a paper.

~~~
dalke
That's not true. Some types of chemistry cannot (legally) be reproduced, but
they are still good and valid science. For a trivial example, consider the
many chemistry papers based on measuring fallout effects from nuclear testing.

More prosaically, some things can be verified even when a chemical process
cannot be reproduced. Protein crystallization is a tricky process with many
irreproducible results. Yet if you have a crystal, and can use the crystal to
determine the protein structure, then you can verify that the crystal indeed
contains that protein - even if you are unable to reproduce the chemical
process used to create the crystal in the first place. Others can use the same
sample to re-verify the resulting x-ray structure even though they also might
be unable to reproduce the crystallization step.

------
arca_vorago
The first opportunity to comment on bioinformatics since my non-compete/nda is
over!

This seems like a very sloppily put together list of _myths_ , but I'll bite
anyway.

1\. Not true, but I think that's largely because much of the software used is
closed so the FOSS community is largely anemic in the bio world. For the tools
that are FOSS or BSD, I saw plenty of contributions, but the other thing to
keep in mind is that it's not just about the programming. You have to have a
certain level of understanding of the application domain to program a solution
for it properly, and there are very few of these people around. I predict a
huge uptick in demand and salaries for bioprogrammers.

2\. Is true. You need your own people on salary to program for your needs. I
was the sysadmin part of a PhD/sysadmin/programmer team, and we were doing
stuff that no one else was going to do for us. You need to have your own
programmer, and a good sysadmin, full stop.

3\. Is also true. Picking the right license is important because many labs are
pretty tight on cash flow. Sure, they probably have millions going through
them a month, but operating costs are super high and margins are lower than
you may think. It was during my time in the genetics lab that I fully realized
why FOSS was so important, and I think it's the future (with a few key
proprietary exceptions that no FOSS has matched yet - think Elmer vs. COMSOL).

4\. Using a FOSS license makes this a moot point to address. Use GPLv3 code
people, stop using BSD!

5-9: not worth addressing.

Anyway, my overall view of the field is this: with sequencing getting cheaper,
the problem is in managing the levels of data being generated (sysadmin issue)
and in interpreting the data for meaningful results (programmer/phd issue).
Personally, I think that machine learning is going to be the right
breakthrough to follow and apply to bio, and once we do that I expect it to
take off to crazy levels. I'm talking a sequencer in every doctor's office, and
artificial genetic manipulation becoming much easier and with more accurate
predictions.

Also, the other thing everyone underestimates is the microbiome as an entity.
You are more the bacteria that live in you than you are you. Of course, I
struggle to understand the science sometimes, I'm just a sysadmin, so take
what I say with a grain of salt.

~~~
dalke
1\. "the FOSS community is largely anemic in the bio world"

I'm shocked by that. I think of the bio world as having a lot of FOSS
software. With BOSC it even has its own yearly (satellite) conference. By
comparison, I work in chemical informatics, and think that FOSS availability
in chemistry is rather less than biology. I've blamed money - chemists have a
longer history of making money from their research than biologists.

Which research field do you think has a robust FOSS community and also has
for-profit companies developing commercial software in the same field?

2\. "You need to have your own programmer, and a good sysadmin".

You say that myth is true, while Pachter thinks it's false. Your disagreement
is really more one of team size. Pachter's myth concerns "a large team." You
have a team made up of a handful of people. That is, Pachter wrote: "I agree
with James Taylor, who ... stated that '... Scientific software is often
developed by one or a handful of people,'" which is what you have.

I can agree that it's not clear from the text that the myth concerns having
~6 or more people on the team.

3\. Could you explain why your explanation makes the myth true?

Pachter gave the example of the UCSC genome browser. From its web site, "The
Genome Browser, Blat, and liftOver source are freely downloadable for
academic, noncommercial, and personal use. For information on commercial
licensing, see the Genome Browser and Blat licensing requirements." Were you
working in a for-profit genetics lab when you realized that FOSS was so
important? Otherwise, how would a restriction like the UCSC one have affected
you?

4\. I actually think FOSS and licensing costs are somewhat orthogonal
issues. I distribute my software under the MIT license, but only to those
people who pay me about US $25,000. This is something the FSF encourages, as a
way to bring revenue to a project. However, it only works because I'm not
trying to establish some sort of "community" but prefer a vendor/customer
relationship.

BTW, I think points #1, #2, #3, and #5 can be recast as "you need to build a
community around your project." I think that's a myth in its own right.

~~~
infinite8s
"4\. I actually think FOSS and licensing costs are somewhat orthogonal
issues. I distribute my software under the MIT license, but only to those
people who pay me about US $25,000. This is something the FSF encourages, as a
way to bring revenue to a project. However, it only works because I'm not
trying to establish some sort of "community" but prefer a vendor/customer
relationship."

Why the MIT license then? Do you figure that people who paid $25k for the
software aren't going to go upload it to github for all to see?

~~~
dalke
Correct. My clients are drug development companies. Very few of them
distribute any sort of software, in part because often they would need to get
legal to sign off on it. Even fewer provide support for the software.

------
cjbprime
As
[https://twitter.com/madprime/status/619503684838387716](https://twitter.com/madprime/status/619503684838387716)
points out, the argument that you're cheating the US Government out of public
money by releasing without a non-commercial clause is bizarre -- everything
the US Government releases is required by law to be released into the public
domain.

~~~
maaku
Except that's not true? Most things released "by the government" had a
contractor involved somehow, and government contractors have a different set
of rules.

~~~
i000
Are you suggesting scientists are "government contractors"? I never thought of
myself as one.

~~~
dekhn
What did you think you were doing when you got a grant from the NIH, NSF, or
DOE? You entered into a research contract with the government.

------
jerven
Your data is more important than your code - an often neglected fact in
bioinformatics. Whatever you do, document your file formats.
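As a toy illustration of the point (the format and field names here are made
up, not any real standard), a few lines of documentation plus a tiny parser go
a long way toward keeping the data usable after the original code is gone:

```python
import csv
import io

# Hypothetical format "scores.tsv", documented up front (illustrative only):
#   column 1: gene_id - string, no whitespace
#   column 2: score   - float, higher is better
#   column 3: pvalue  - float in [0, 1]
# Lines starting with '#' are comments and are ignored.

def parse_scores(text):
    """Parse the documented tab-separated format into a list of dicts."""
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines, per the format doc
        gene_id, score, pvalue = row
        records.append({
            "gene_id": gene_id,
            "score": float(score),
            "pvalue": float(pvalue),
        })
    return records

example = "# gene_id\tscore\tpvalue\nBRCA2\t4.2\t0.001\n"
print(parse_scores(example))
# → [{'gene_id': 'BRCA2', 'score': 4.2, 'pvalue': 0.001}]
```

The comment block is the important part: someone who finds only the data file
years later can still read it, even if the parser itself has bit-rotted.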

------
danieltillett
As someone who actually makes a living selling bioinformatics software, the
problem is mainly due to how scientists view software. The code you write is
seen the same way lab books are - basically as raw data. Nobody publishes their
lab books, and all too often software is thought of as just an electronic lab
book. It would be great if this changed, but it needs a change in how
scientists look at software.

------
bmir-alum-007
Disclaimer: I used to work at a Stanford bioinformatics shop.

There's a clear need for AWS-like features for bio/biomedical informatics,
specifically enabling sharing, security, reuse and anonymization of data
(PHI), libraries (like R's Bioconductor) and infrastructure (IaaS/PaaS/SaaS).

The issue is that some labs still archive their data on actual hard drives
(USB and bare drives), making their data much less useful than it would be
somewhere readily available and sharable.

I think it's a huge (billion+) opportunity where the right execution would
need loads of smart, consultingish customer service reps (huge overhead costs)
to help researchers with coding, sysadmining and bio to some degree.
Basically, a full-service (with self-service, a-la carte features) hosting
company for bio / medical.

This space is only going to grow deeper and wider as more is discovered and
confirmed about each gene, protein, pathway and each accompanying expansion in
nosology. This sort of research knowledge is vital and unlikely to shrink. The
main issues are that it would be a cash-intensive and indefensible business
model, because it requires paying lots of consultant/scientist brains and
anyone can copy the model.

~~~
macarthy12
> There's a clear need for AWS-like features for bio/biomedical informatics,
> specifically enabling sharing, security, reuse and anonymization of data
> (PHI), libraries (like R's Bioconductor) and infrastructure (IaaS/PaaS/SaaS).

I tried to build a startup like this, basically a Heroku for bioinformatics,
with a bunch of experienced biologist/genome folks. They just didn't get it,
and for my part, I guess I couldn't sell it to them. Part of it was snobbery
and institutionalized thinking. It was a big disappointment. Someone will do
it, but until then it will be crappy BioPerl scripts, with no version control,
etc.

------
rch
The article mentions code quality and license issues, and one of my favorites
(MEME) seems to suffer a bit from both. I believe the first aspect is simply
the result of being developed in a sequential fashion by different
contributors (which is reasonable given the environment).

The main problem is that the license rules out using the software for
'commercial purposes' except under unspecified terms that would need to be
hashed out with the tech transfer office. I completely support the spirit of
that
construct, but it makes it difficult to advocate for in practice. At least in
this case, GPL or LGPL would be a significant improvement.

~~~
dalke
The MEME commercial license is at
[http://techtransfer.universityofcalifornia.edu/NCD/Media/MEM...](http://techtransfer.universityofcalifornia.edu/NCD/Media/MEME%20suite%20site%20license%20template%2004June2015.pdf)
. How is this "unspecified terms that would need to be hashed out with the tech
transfer office"? The license costs US$2,500, which as Pachter correctly
points out is a small cost for most companies.

Also, while it would be a significant improvement to you, would it be a
significant improvement to the science?

For example, I can't speak to MEME but I know of a couple other projects where
the software is at low/no cost to academics and has a license fee for
commercial use. This money is used to fund future development, which gives a
funding source that is independent of grant funding. Pachter also points out
this possibility.

It can be frustrating if some people cannot use a package due to license terms
or costs. But it can also be frustrating to use a package where no one is
available to answer questions or fix bugs - which is something that funding
can address.

~~~
bbgm
If you are bootstrapped it is a huge barrier. And MEME is not that bad; there
are others that are far worse. I once tried using Modeller purely for hobby
projects (science was not even my day job) but was denied. That's a problem.

~~~
dalke
Not all business models are economically viable, and others are not obligated
to make it easy to support your choice of a bootstrap business model.

My experience is that people who get a piece of software, even if for free,
often want some support. If you don't give them support, some will complain. A
company may decide that it's easier to deal with complaints about the lack of
a free version for hobbyists than to deal with complaints about a user not
getting sufficient (unpaid) support.

~~~
shiggerino
That's news to me, most companies seem to understand full well that if they
want support on a free program they have to pay for it.

~~~
dalke
bbgm made two comments; one about a company under bootstrap, the other about
"using Modeller purely for hobby projects".

My first paragraph was a response to the first comment. My second paragraph
was about the second.

~~~
shiggerino
Oh, you were talking about the hobbyists?

That's easy. Tell them to RTFM, and if something doesn't work, have them
submit a patch. That's not a heavy burden.

~~~
dalke
There are very few people who do bioinformatics as a hobby. The term
"hobbyist" generally refers to people who are in a related field and want to
dabble with a different set of tools, with no firm intention to do serious
research in that field.

However, these hobbyists tend to be experts in their (related) field, and talk
with actual customers or potential customers of the company. Word of mouth is
important, and if the company tells someone to RTFM and submit a patch, then
this may lead to a bad vibe. Or it might not. But it's easier to say "only
paying customers" and have _no_ burden than to take on even the light burden
of responding to non-paying users.

Also, some hobbyists will use their "hobby" status to do research on the
cheap. (For a related example, some professors have their own company, doing
work related to their research, and use their educational status to get
software that ends up being used for work that helps the company.) They might,
for example, have an idea that they want to "play around with", which might be
competitive with the product. It's not a serious idea, "just a hobby", but
they are curious to see if it's something interesting, so get a copy of the
software, and use that to judge if the idea should be investigated more
seriously. Or start a competitive company. Or publish a paper demonstrating
that when the tool is used poorly, by someone who doesn't understand it, it
produces poor results.

Is that really a hobby, or is it self-delusion used to avoid committing
oneself to a project?

I can't tell, and neither can a company. A policy of "paying customers only"
makes it easy to decide. There are certainly other solutions, but is there a
concrete advantage to taking on that burden, no matter how light?

------
shiggerino
Insisting that bioinformatics software be non-free pretty much rules out any
possibility that anyone is going to build on your code. If that is the case,
it's regrettable - but why seal its fate?

If they are afraid of companies using and abusing the software, just put it
under the GPL, and companies will at least have to repay the favour to the
users and the community.

~~~
dalke
The essay gave an example of a non-free project that others have built upon:

> One of the most widely used software suites in bioinformatics (if not the
> most widely used) is the UCSC genome browser and its associated tools. The
> software is not free, in that even though it is free for academic, non-
> profit and personal use, it is sold commercially. ... As far as development
> of the software, it has almost certainly been hacked/modified/developed by
> many academics and companies since its initial release (e.g. even within my
> own group).

Therefore, by demonstration, using a non-free license does not "seal the fate"
and rule out others from building on your code.

Also, GPL does not require anyone to "repay the favor." There's no requirement
to distribute modifications upstream or to "the community."

~~~
shiggerino
>There's no requirement to distribute modifications upstream or to "the
community."

No, that would obviously be an onerous requirement. The GPL strikes a
reasonable balance between the individual user and the community of users.

I'm just saying the customers should be allowed to get the derived software on
the same generous terms as the company received the original on. The customers
are obviously not required to redistribute anything, but it encourages good
behaviour.

~~~
dalke
If it's "obviously [an] onerous requirement," then what does "they will at
least have to repay the favour to the users and the community" mean?

You clarify that "customers should be allowed to get the derived software on
the same generous terms as the company received the original on", but that
assumes that the companies _have_ customers. Very few companies that use
academically produced bioinformatics software have downstream customers of
that software.

For the vast majority of companies that only use the software in-house, which
is likely 99+% of all companies, how does using the GPL or any other free
software license lead to the company repaying the favor? What's wrong with
asking for payment in cash instead of other, more nebulous contributions?

~~~
shiggerino
>If it's "obviously [an] onerous requirement" then what does "they will at
least have to repay the favour to the users and the community" mean?

The users will get the sources and the permission to use those sources so they
do not have to be powerless and dependent on the company's good will to
provide patches. I can't imagine why this is a difficult concept to
comprehend.

>You clarify that "customers should be allowed to get the derived software on
the same generous terms as the company received the original on", but that
assumes that the companies have customers. Very few companies that use
academically produced bioinformatics software have downstream customers of
that software.

Yes, if they have customers, no matter how few, those customers should have
all the permissions associated with free software. I don't see the problem
here. It's not an unreasonable assumption that a nonzero number of
bioinformatics companies have a nonzero number of customers.

>For the vast majority of companies that only use the software in-house, which
is likely 99+% of all companies, how does using the GPL or any other free
software license lead to the company repaying the favor? What's wrong with
asking for payment in cash instead of other, more nebulous contributions?

If nothing else, the burden of maintaining a private fork, and integrating
upstream changes with your own, can be an incentive to just send your changes
upstream.

~~~
dalke
I have studied how this happens in my field, which is cheminformatics, not
bioinformatics.

One company decided to 'sell' a GPLv2 product, that is, provide the product
for free and sell a support contract. They had about 12 paying support
customers. They ran into a problem in that some of their customers made local
changes in order to add new features. This is one of the well-known advantages
of having the source code.

However, it was difficult for the customers to contribute the code upstream.
One reason is that some of the changes were considered, or could be
considered, proprietary. This means they would have to deal with legal in
order to get permission to send upstream, and they didn't want to do that.

As a result, when the vendor distributed new releases, the customer would have
to integrate the changes each time. Or rather, they would _not_ integrate the
new changes at all, because it was too much work each time. They ended up with
code that was out of date (and buggy) because of the decision to make local
changes that were difficult to integrate upstream.

The "right" solution would have been to work with the vendor to change the
code and/or provide new APIs for what the customer wants in order to integrate
the proprietary methods, but without fully integrating those methods. However,
when you have the code Right There it's very easy to just go ahead and do
everything. Talking with upstream to get consensus on changes, even from a
vendor that you are paying, is more difficult than cranking out code.

As you can tell, this short-term decision to get code done NOW can lead to
long-term problems. Frankly, most of the customers don't have the software
development experience to handle those problems. They are chemists who learned
to program, not software developers.

The vendor polled all of their customers and found that while the customers
_liked_ having the source code, they had no real _need_ for it, and were more
willing to use the traditional vendor/customer route to add new features than
the free software/contribute-upstream/"community" route. Another way to say it
is that they had the money to pay for support, but didn't have the time to go
through the community process.

