
A farewell to bioinformatics (2012) - emcl
http://madhadron.com/?p=263
======
kevinalexbrown
John Graham-Cumming (jgrahamc here) co-authored a piece on making scientific
code open. It was received well enough that Nature published it [0]. This
approach has inspired others to do better work by describing a concrete
problem, then outlining steps to fix it on an individual and institutional
level.

When someone finds fault with the way a field conducts itself, I would implore
them to constructively influence that field. You might be surprised how many
are actually sympathetic to your concerns.

I'm not dismissing this author's concerns: to do that would really require
knowing the molecular biology field (which is more than sequencing, it turns
out). I do neuroscience right now, and programming can be a problem for some.
But a constructive suggestion to change can have much more impact than a long
rant.

[0]
[http://www.runmycode.org/data/MetaSite/upload/nature10836.pd...](http://www.runmycode.org/data/MetaSite/upload/nature10836.pdf)

~~~
chewxy
Off topic, but since you mentioned jgrahamc's article in Nature,
interestingly, this was what I read last night on Simply Statistics:
[http://simplystatistics.org/2013/01/23/statisticians-and-
com...](http://simplystatistics.org/2013/01/23/statisticians-and-computer-
scientists-if-there-is-no-code-there-is-no-paper/)

It's a similar issue. I think statisticians are taking constructive steps to
correct their path, since you know, ML is the new sexy thing. Bioinformatics
could take a much longer time to self-correct though.

Although, as I mentioned in an earlier comment, Fred seems to be in a prime
position to disrupt the bioinformatics field since he seems to know all the
problems that afflict it.

~~~
troymc
Regarding "ML is the new sexy thing," check out these graphs:

[http://books.google.com/ngrams/graph?content=machine+learnin...](http://books.google.com/ngrams/graph?content=machine+learning&year_start=1950&year_end=2012&corpus=15&smoothing=3&share=)

<http://www.google.com/trends/explore#q=machine%20learning>

~~~
nopinsight
From your second graph, Iran and Pakistan have stronger interests in Machine
Learning than the US. (I am not surprised about India, South Korea, and China
though).

Is the interest in advanced Info Tech that widespread in those countries, or
is it simply because the only people who can use Google in those countries are
government-sanctioned researchers? Could anyone familiar with the reason shed
some light for the rest of us?

~~~
abdullahkhalids
I am not sure what you based your conclusions on.

Pakistan's internet is generally open (except youtube and pornography). But
there is no widespread interest in ML particularly. Only a few companies -
most of them outsourcing from the US.

------
zerohp
> the software is written to be inefficient, to use memory poorly, and the cry
> goes up for bigger, faster machines! When the machines are procured, even
> larger hunks of data are indiscriminately shoved through black box
> implementations of algorithms in hopes that meaning will emerge on the far
> side. It never does, but maybe with a bigger machine…

I spent five years working in bioinformatics, and this is exactly the attitude
of both the researchers and the other developers on the projects I worked on.
It was very frustrating.

~~~
michaelhoffman
Hi, I'm a bioinformatics researcher. Apparently I work for this guy's
ex(?)-employer although I have never heard of him before.

My single most limited resource is programmer time. My time and the time of
other people who work with me. I have access to loads of computers that sit
idle all the time, even if it is on nights and weekends. There is zero
opportunity cost to me in using these computers more fully. I have enough
human work to do that I can wait for the results without having any wait
states.

There can be a big opportunity cost in trying to rework a workflow so that it
is more efficient and then test it thoroughly to ensure correctness. Doing this
may seem more appealing to someone who is interested primarily in
computational efficiency. But I am more interested in research efficiency, and
so are my employers and funders.

~~~
epistasis
>There can be a big opportunity cost in trying to rework a workflow so that it
is more efficient and then test it thoroughly to ensure correctness.

Hi, I recognize your name as a legit bioinformatician, am a huge fan of the
lab that you're currently in, and others should listen to you.

I'd like to add that for many projects, general reusable software engineering
is not necessarily a huge advantage. Instead of verifying a single
implementation, it's often better for somebody to reimplement the idea from
scratch; if a second implementation in a different language written by a
different programmer gets the same results, this is a much more thorough
validation of the software than going over prototype software line by line.
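A minimal sketch of that kind of cross-check, assuming a toy task (GC content of a DNA string) and two deliberately different implementations. The function names and the random-input harness are made up for illustration; in practice the second version would come from a different programmer, ideally in a different language.

```python
import random


def gc_content_loop(seq: str) -> float:
    """First implementation: explicit loop over the bases."""
    if not seq:
        return 0.0
    count = 0
    for base in seq:
        if base in ("G", "C"):
            count += 1
    return count / len(seq)


def gc_content_count(seq: str) -> float:
    """Second, independently written implementation using str.count."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cross_validate(trials: int = 1000, seed: int = 0) -> int:
    """Return how many random inputs make the two versions disagree."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        length = rng.randint(0, 50)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if abs(gc_content_loop(seq) - gc_content_count(seq)) > 1e-12:
            mismatches += 1
    return mismatches
```

Agreement on random inputs does not prove either version correct, but a disagreement pinpoints a bug cheaply, which is the point of the second implementation.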

Also, I've seen way too many software engineers come in with an enterprisey
attitude of establishing all sorts of crazy infrastructure and get absolutely
no work done. If Java is your idea of a good time, it's unlikely that you'll
be an effective researcher (though it's not unheard of), because it's not good
at maximizing single-programmer output, and not good at maximizing I/O or CPU
or string processing. In research it's best to get results, fail fast fast
_fast_, and move on to the next idea. If you're lucky, 1 in 20 will work out.
Publish your crap, and if it's a good idea, it will be worth polishing the
turd later, but it's better to explore the field than to spend too much time
on an uninteresting area.

The only time you worry about efficiency is when it enables a whole other
level of analysis. So, for example, UCSC does most of their work in C,
including an entire web app and framework written in C, because when they were
doing the draft assembly of human genome a decade ago on a small cluster of
computers that they scrounged from secretaries' desks over the summer, Perl
wouldn't cut it.

~~~
michaelhoffman
Software engineering is important for bioinformatics, in my opinion. But it's
important to identify which things are important and which aren't:

Reproducible code: extremely important. Correct code: extremely important.
Readable code: very important. Efficient code: often not as important.

Even today, the UCSC Genome Browser is an example where efficient code is
important. It is interactive software and has _many_ human users who can work
much more efficiently when the browser is responsive. And with projects like
ENCODE, there are now incredible amounts of data available from the browser
that would not be easily possible with a less efficient system.

Very different from an analysis system that will be run a handful of times in
batch mode.

~~~
jk4930
>Reproducible code: extremely important. Correct code: extremely important.
Readable code: very important. Efficient code: often not as important.

You want Haskell. :)

------
MattRogish
I have some experience working at a genomics research company and I'll broadly
+1 Fred's experience about the industry, although in less negative terms. I
got out before I got jaded, so my perspective is a bit more "oh, that's a
shame" than his. I really like genetics, bioinformatics, hardware, deep-
science, and all that but the timing and fit wasn't right.

The tools are written by (in my experience) very smart bioinformaticians who
aren't taught much computer science in school (you get a smattering, but
mostly it's biology, math, chemistry, etc.). Ex:

[http://catalog.njit.edu/undergraduate/programs/bioinformatic...](http://catalog.njit.edu/undergraduate/programs/bioinformatics.php)

[http://www.bme.ucsc.edu/bioinformatics/curriculum#LowerDivis...](http://www.bme.ucsc.edu/bioinformatics/curriculum#LowerDivisionRequirements)

[http://advanced.jhu.edu/academic/biotechnology/ms-in-
bioinfo...](http://advanced.jhu.edu/academic/biotechnology/ms-in-
bioinformatics/course-requirements/index.html)

The tools themselves are written by smart non-programmers (a very dangerous
combination) and so you get all sorts of unusual conventions that make sense
only to the author or organization that wrote it, anti-patterns that would
make a career programmer cringe, and a design that looks good to no one and is
barely useable.

Then, as he said, they get grants to spend millions of dollars on giant
clusters of computers to manage the data that is stored and queried in a
really inefficient way.

There's really no incentive to make better software because that's not how the
industry gets paid. You get a grant to sequence genome "X". After it's done?
You publish your results and move on. Sure, you carve out a bit for overhead
but most of it goes to new hardware (disk arrays, grid computing, oh my).

I often remarked that if I had enough money, there would be a killing to be
made writing genome software with a proper visual and user experience design,
combined with a deep computer science background. My perfect team would be a
CS person, a geneticist, a UX designer, and a visual designer. Could crank out
a really brilliant full-stack product that would blow away anything else out
there (from sequencing to assembly to annotation and then
cataloging/subsequent search and comparison).

Except, I realized that most folks using this software are in non-profits,
research labs, and universities, so - no, there in fact is _not_ a killing to
be made. No one would buy it.

~~~
ank286
Why wouldn't anyone buy your product? If it is easy to use, and SPEEDS UP
RESEARCH TIME, your researcher/PI who is spending thousands on computing
clusters will buy your software for their graduate students. Hell, my PI keeps
asking me if I need a faster computer so I can run Matlab better/quicker.
Really, if I had a software that helped me perform research
faster/better/quicker and compare my results to ground truth or gold-
standards, that is a much more useful tool than a bunch of hardware for my
research. You push out papers fast.

So I disagree with you on your very last sentence (agree with the rest)

~~~
gabeiscoding
Ahh the efficiency argument.

The trick is, academics often have excess manpower capacity in the form of
grad students and post-docs. Even though personnel is usually one of the
highest expenses on any given grant, they often don't look at ways to improve
the efficiency of their research man-hours.

That's not a blanket rule, as we have definitely had success with the value
proposition of research efficiency, but in general, a lot of things businesses
adopt to improve project time (like Theory of Constraints project management,
Mindset/Skillset/Toolset matching of personnel, etc.) are of no interest to
academic researchers.

~~~
jk4930
After researching this field (biomedical R&D) a bit, I found that the mindset
and workflow are mostly pre-computer. The relevant decision makers in the labs
usually don't see a need to change anything because "it works" and "it's
always done this way".

~~~
ank286
"It's always done this way" is the ultimate motivation of any startup. If
everyone just accepted that, we wouldn't have any competing startups, and
probably no entrepreneurs or a better world, for that matter. The fitness
function of the world would flatline.

------
aheilbut
I sympathize with the author, but this piece fails because many of the
specific criticisms are off-base, and he's not trying to be at all
constructive.

For example, it isn't true at all that microarray data is worthless. The early
data was bad, and it was very over-hyped, but with a decade of optimization of
the measurement technologies, better experimental designs, and better
statistical methods, genome-wide expression analysis became a routine and
ubiquitous tool.

The claim that sequencing isn't important is ridiculous. It's the scaffold to
which _all_ of biological research can be attached.

However:

There is a great deal of obfuscation, and reinventing well-known algorithms
under different names (perhaps often inadvertently). There's also a lot of
low-quality drivel on tool implementations or complete nonsense. This is
driven largely by the need in academia to publish.

The other side of this problem is that in general, CS and computer scientists
don't get much respect in biology. People care about Nature/Science/Cell
papers, not about CS conference abstracts. Despite
bioinformatics/computational biology not really being a new field anymore, the
cultures are still very different.

~~~
east2west
No kidding about reinventing wheels. I once saw a manuscript based entirely on
dot-product as 1-D least-square. I don't know what happened to it, but one
reviewer called it a seminal event in GWAS.

Bioinformatics is hard, but too many careerists take advantage of difficulties
and uncertainty to publish as many papers as they can get away with.

------
vsbuffalo
I agree with him, and have been complaining about the same shit for ages (I
work in bioinformatics too). Sadly, biologists don't care. We're treated as
the number crunchers. The real problem isn't that we waste computational
resources, it's that many biologists download programs, run their data through
it, and if it spits out an answer rather than an error, they trust it. Since
that program probably has zero unit test coverage, and the results may be fed
into pharmaceutical decisions, disease diagnostics, etc, you're basically
fucked if something went wrong. Lots of us have said this[0].
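As a minimal sketch of what even basic coverage looks like, here is a known-answer test, a round-trip test, and a test that bad input produces an error rather than an answer. The `reverse_complement` function is a hypothetical stand-in for whatever the downloaded tool computes, not code from any real package.

```python
import unittest

# Complement table for unambiguous DNA bases only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string, refusing characters it does not know."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as exc:
        raise ValueError(f"unexpected base {exc.args[0]!r} in sequence") from None


class ReverseComplementTest(unittest.TestCase):
    def test_known_answer(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_round_trip(self):
        seq = "GATTACA"
        self.assertEqual(reverse_complement(reverse_complement(seq)), seq)

    def test_bad_input_is_an_error_not_an_answer(self):
        with self.assertRaises(ValueError):
            reverse_complement("ATXG")
```

Saved as a module, this runs with `python -m unittest`. The third test is the one that matters for the complaint above: unexpected input should fail loudly instead of spitting out a plausible-looking result.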

Minor quibble: genome assembly is definitely still an open problem that's
computationally difficult. So is robust high dimension inference, but that
falls more under statistics.

I've wanted to leave at least a dozen times too, for the better pay, for
working with programmers that can teach me something, and to not have my work
be interrupted by academic politics. But the people pissed at the status quo
are the ones that are smart enough to see it's broken and try to fix it, and
if we all leave, science is really fucked.

[0] [http://www.johndcook.com/blog/2010/10/19/buggy-simulation-
co...](http://www.johndcook.com/blog/2010/10/19/buggy-simulation-code-is-
biased/)

------
FreeKill
If you really want to get a feel for how deluded the Bioinformatics community
is, look for a job in the field as an outsider. It's not uncommon to see
requirements like:

"Must be an expert in 18 technologies" "Must have a PHD in Computer Science or
Molecular Biology" "Must have 12 years experience and post doctoral training"
"Pay: $30,000"

It's delusional because they apply the requirements it took for themselves to
get a job in Molecular Biology (long PHD, post doc, very low pay for first
jobs) and just apply it carte blanche to all fields that may be able to aid in
their pursuits. Especially when it comes to software engineering where it can
often be extremely difficult to explain why you did not pursue a PHD.

~~~
michaelhoffman
I assume these are separate requirements. I have not seen any doctoral-level
positions advertised for a salary of $30,000. The minimum NIH salary for
postdoctoral _trainees_ is more than that.

It's only delusional if they can't find people to fill the jobs. The idea
that, as an outsider, you know what requirements they should use in their
hiring process better than they do is perhaps more delusional.

~~~
FreeKill
I'm not an outsider and the 30K was a bit of an exaggeration, and I apologize
for that. The point I was trying to make was that if you look in as an
outsider, you would see the requirements being extremely daunting compared to
what you might see elsewhere with a pay scale that is very low and unappealing
to anyone who might match it. Unless, of course, you just finished your degree
in some biological discipline where the jobs are scarce. They are absolutely
delusional (and so am I, most likely) because in most cases what they really
need to solve the problems they have, is the same type of person most
companies would need in a similar situation, a quality software engineer with
experience building quality applications that are both extensible and
maintainable.

I worked in bioinformatics for more than 10 years before I moved on, and in my
experience they do have a lot of trouble finding people to fill positions,
especially outside of massive government funded groups like the NIH. This
often results in passing on competent software engineers with a B.Sc. that
don't meet the requirements in favor of PHD level biology graduates who have
taken a year or so of undergrad computer science courses. In my experience,
this leads to many of the problems discussed (and exaggerated) by the OP.
While some of these people are smart and produce good work, much of the time
they produce poor quality software that gets the job done, but as
inefficiently as possible and they leave a code base that is virtually
unusable. Overall, I mostly just wanted to say that it's a mindset they REALLY
need to get past for the long term success of the industry.

~~~
kevin_rubyhouse
If 30k is the inaccurate number, what's the accurate one? I'm curious as to
what the realistic requirements are from your experience with the field.

~~~
FreeKill
I've seen a lot of job listings, at very large companies and in academia, in
the 45-50K range. Keep in mind, these are jobs requiring a PHD, 10 years of
experience, and a dozen or so technologies.

It's not really the money that's skewed, it's their idea about the person they
need for the job. They don't need someone with that background (most of the
time), they just need a junior level software engineer in which case the pay
scale may not be too bad. There's a problem in realizing this, however, when
the standards for your own field (molecular biology for example) are extremely
high, so you expect it of all others as well...

------
stiff
This is pretty hilarious, from my brief experience with bioinformatics I can
very well imagine someone writing the opposite rant, about CS people getting
into bioinformatics not knowing sh*t about biology. I mean, browse through
bioinformatics textbooks: they are either written by computer scientists, in
which case they are little more than string algorithm textbooks, or by
biologists, in which case the layer of jargon is just impenetrable for someone
coming from CS. Same
with bioinformatics teachers, I come from a CS background, but spent one solid
month seriously trying to understand the basics of molecular biology and my
bioinformatics seminar instructor sometimes seemed to know less about it than
me. Terrifying, no wonder nonsense results are produced.

~~~
sampo
My friend said: Bioinformatics means that computer scientists – who don't know
mathematics and don't know biology – are trying to do mathematical biology.

------
chris_wot
I always feel awkward reading these rants, mainly because I've burned my
bridges before and it really wasn't worth it. Even if it is true, it's better
to leave it and move on.

If you really feel strongly about something, write it dispassionately
(normally some time after the event) and treat it like a dissertation, backed
with case studies and citations.

------
jostmey
Basic science moves forward slowly limited by the pace of fortuitous
discoveries. I have found that many people from the field of computer
programming have unrealistic expectations of what can be done in biology and
other sciences.

------
jmspring
Sounds like a fed up academic with a stick up his backside.

Sh*tty data? Comes from the community. If the data and algorithms are so poor,
and the author so superior, he should have been able to improve the
circumstances.

This whole screed reads like an entitled individual who entered a profession,
didn't get the glory, oh and yeah, academia doesn't pay well.

In the realm of bioinformatics, let's ignore the work done on the human genome
and the like.

~~~
gwern
> Sh*tty data? Comes from the community. If the data and algorithms are so
> poor, and the author so superior, he should have been able to improve the
> circumstances.

Why? Aren't you assuming a lot about the incentives? What if the ground truth
is simply that all the results are false due to a melange of bad practices? Do
you think he'll get tenure for that? (That was a rhetorical question to which
the answer is 'no'.) Then you know there's at least one very obvious way in
which he could not improve the circumstances of poor data & algorithms.

~~~
michaelhoffman
He's not getting tenure because he doesn't have a PhD. According to LinkedIn,
he has a master's degree awarded after four years of study [1], which often
indicates someone who did not complete a PhD.

[1] <http://www.linkedin.com/pub/frederick-ross/13/81a/47>

~~~
droithomme
According to his 2009 CV he was working on a PhD in biology back then and
expecting to finish in 2011.

Given that he is not a professor it is not clear why he would be expected to
be seeking tenure.

------
ChristianMarks
My experience working as a scientific programmer is this: my colleagues aren't
forthcoming. I could list case after case of failure to document or
communicate crucial details that cost me days, weeks and even months of
effort. But I won't, until I have another job lined up. If I were in the
author's position (I'm in another field), I would insist that my colleagues--all
of them, in whatever field I ended up working--were forthcoming about
their work. This is non-negotiable. Being over-busy is no excuse. (It may be
an excuse for not being forthcoming, but right or wrong, I couldn't care
less--I would not work with such people if I could avoid it, for whatever reason.)

Academia rewards journal publication and does not adequately reward
programming and data collection and analysis, although these are indispensable
activities that can be as difficult and profound as crafting a research paper.
At least the National Science Foundation has done researchers a small favor by
changing the NSF biosketch format in mid-January to better accommodate the
contributions of programmers and "data scientists": the old category
_Publications_ has been replaced with _Products_.

Naming is important to administrators and bureaucrats. It can be easy to
underestimate the extent to which names matter to them. Now there is a
category under which the contribution of a programmer can be recognized for
the purpose of academic advancement. Previously one had to force-fit
programming under _Synergistic Activities_ or otherwise stretch or violate the
NSF biosketch format. This is a small step, but it does show some
understanding that the increasingly necessary contributions of scientific
programmers ought to be recognized. The alternative is attrition. Like the
author of the article, programmers will go where their accomplishments are
recognized.

Still, reforming old attitudes is like retraining Pavlov's dogs. Scientific
programmers are lumped in with "IT guys." IT as in ITIL: the platitudinous,
highly non-mathematical service as a service as a service Information
Technocracy Indoctrination Library. There is little comprehension that
computer science has specialized. For many academics, scientific programmers
are interchangeable IT guys who do help desk work, system and network
administration, build websites, run GIS analyses, write scientific software
and get Gmail and Google Calendar synchronization running on Blackberries. It
is as if scientists themselves could be satisfied if their colleagues were
hired as "scientists" or "natural philosophers" with no further qualification,
as opposed to "vulcanologist" or "meteorologist" (to a first order of
approximation).

~~~
drosophila
Right now experimentalists generate data and then try to find computer people
to analyse their data. However, in the not too distant future computer models
will drive experimental research as hypothesis generation tools. Then the
computer people will be seeking biology people (or robots) to run experiments
to validate their hypotheses and there will be more respect for the field.

~~~
ChristianMarks
This seems to presume that scientific programming is merely a service to the
important and more deserving persons who generate scientific hypotheses, from
whom it can be decoupled and isolated, instead of being the collaborative
effort that it is--if elevating the professional standing of scientific
programmers must wait for the widespread adoption of automated hypothesis
generation software. For example, the computation of ecosystem service
indicators--what you might call the interface between biogeophysical models of
Earth systems and economic and policy modeling--is an interdisciplinary and
collaborative activity that relies heavily on computational technique and
technology.

------
CrLf
"I’m leaving bioinformatics to go work at a software company [...]"

"[bioinformatics] software is written to be inefficient, to use memory poorly,
and the cry goes up for bigger, faster machines! [...]"

Well, the author is heading for a very bitter surprise...

------
kylemaxwell
You know, I'd be more inclined to listen to him if he didn't also completely
decry almost all of modern biology, which (in my view) has been to the late
20th and early 21st centuries what physics was to the late 19th and early to
mid 20th centuries.

------
skittles
I spent a year in a bioinformatics PhD program and got the feeling I was
studying to be science's version of the business analyst. Not knowing enough
about the biology or computation, but expected to speak the language of both.
And what would my research consist of in such an applied science? Luckily I
had another opportunity and became a software developer (which I'm happy
with). The worst thing about the experience was listening to so many research
presentations where I could tell the presenter didn't understand the science
and could barely explain it.

------
chrisamiller
Some thoughts on this article:

\- This guy clearly has a limited understanding of the field. This quote is
laughable: "There are only two computationally difficult problems in
bioinformatics, sequence alignment and phylogenetic tree construction."

\- As a bioinformatician, I feel sorry for this guy. Just like any other
field, there are shitty places to work. If I was stuck in a lab where a
demanding PI with no computer skills kept throwing the results of poorly
designed experiments at me and asking for miracles, I'd be a little bitter
too.

\- Just like any other field, there are also lots of places that are great
places to work and are churning out some pretty goddamn amazing code and
science. I'm working in cancer genomics, and we've already done work where the
results of our bioinformatic analyses have _saved people's lives_. Here's one
high-profile example that got a lot of good press.
([http://www.nytimes.com/2012/07/08/health/in-gene-
sequencing-...](http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-
treatment-for-leukemia-glimpses-of-the-future.html?pagewanted=all&_r=0))

\- I'm in the field of bioinformatics to improve human health and understand
deep biological questions. I care about reproducibility and accuracy in my
code, but 90% of the time, I could give a rat's ass about performance. I'm
trying to find the answer to a question, and if I can get that answer in a
reasonable amount of time, then the code is good enough. This is especially
true when you consider that 3/4 of the things I do are one-off analyses with
code that will never be used again. (largely because 3/4 of experiments fail -
science is messy and hard like that). If given a choice between dicking around
for two weeks to make my code perfect, or cranking out something that works in
2 hours, I'll pretty much always choose the latter. ("Premature optimization
is the root of all evil (or at least most of it) in programming." --Donald
Knuth)

\- That said, when we do come up with some useful and widely applicable code,
we do our best to optimize it, put it into pipelines with robust testing, and
open-source it, so that the community can use it. If his lab never did that,
they're rapidly falling behind the rest of the field.

\- As for his assertion that bad code and obscure file formats are job
security through obscurity, I'm going to call bullshit. For many years, the
field lacked people with real CS training, so you got a lot of biologists
reading a perl book in their spare time and hacking together some ugly, but
functional solutions. Sure, in some ways that was less than optimal, but hell,
it got us the human genome. The field is beginning to mature, and you're
starting to see better code and standard formats as more computationally-savvy
people move in. No one will argue that things couldn't be improved, but
attributing it to unethical behavior or malice is just ridiculous.

tl;dr: Bitter guy with some kind of bone to pick doesn't really understand or
accurately depict the state of the field.

~~~
Inufu
Out of curiosity, what other computationally difficult problems are there?

I'm very interested in bioinformatics, but sadly don't know as much about the
field as I'd like.

~~~
zmmmmm
One that comes immediately to mind is genome assembly, which is a hugely
complex problem, and essential to a variety of fields that rely on re-piecing
together the genome without a reference (or with a reference that is highly
divergent from the sequence data).

~~~
sampo
Genome assembly relies heavily on sequence alignment. So: Is genome assembly
hard just because sequence alignment is hard? Or would genome assembly present
separate algorithmic problems even if there was a super-efficient solution to
sequence alignment?

~~~
alephnil
It is far more difficult than sequence alignment. Sequence alignment has
quadratic complexity, while fragment assembly is NP-hard. See for example:

[http://scholar.google.com/scholar?cluster=131745416915434219...](http://scholar.google.com/scholar?cluster=13174541691543421945&hl=en&as_sdt=0,5)

~~~
dalke
Yes, for pairwise sequence alignment. The globally optimized multiple sequence
alignment problem is NP-complete.
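To make the quadratic cost of the pairwise case concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; real aligners use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b in O(len(a) * len(b)) time."""
    # prev[j] is the best score for aligning the prefix of a processed so far
    # with b[:j]; before any of a is consumed that is j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]
```

The two nested loops over the sequence lengths are where the quadratic time comes from. Only two rows are kept, so memory is linear, at the price of not being able to trace back the alignment itself.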

------
adambratt
Really makes me want to learn more about molecular biology.

Any solid factual resources besides the references mentioned in this justified
rant?

~~~
BioGeek
Biostars.org is a stackexchange-like site for bioinformaticians.

See there for answers to your question, eg:

* Best resources to learn molecular biology for a computer scientist. [1]

* What are the best bioinformatics course materials and videos (available online)? [2]

[1] <http://www.biostars.org/p/3066/>

[2] <http://www.biostars.org/p/10766/>

------
singingfish
Also, yes molecular biologists with few exceptions know little more than fuck
all about ecology. Hence the mostly gung-ho attitudes to GM of crop foods for
example. Honestly. I've done real molecular biology work (simple commercial
protein chemistry and molecular phylogenetics of mitochondrial DNA) and tried
to start a PhD in ecology (failed due to funding issues and realising it was a
dead end job wise).

------
sciencerobot
There are a lot of problems in bioinformatics. Mainly, lack of reproducibility
(ie "custom perl scripts"), poorly organized and characterized data and plenty
of wheel reinvention (I heard Jim Kent, who first assembled the human genome,
created his own version of wc [word of mouth, citation needed]).

The fact of the matter is that through high-throughput sequencing,
microarrays, what have you, generation of biologically-meaningful results is
possible.

There are a lot of problems in bioinformatics that need to be solved. Github
has helped. More bioinformaticians are learning about good software
development practices, and journal reviewers are becoming more enlightened
about the merits of sharing source code.

------
BioGeek
Also see the discussion at the bioinformatics subreddit:
[http://www.reddit.com/r/bioinformatics/comments/179e9k/a_far...](http://www.reddit.com/r/bioinformatics/comments/179e9k/a_farewell_to_bioinformatics_since_i_am_about_to/)

------
Agathos
Interesting to read since I made the same career move last year. I agree with
about half of it but don't see a lot of value or useful advice here.

I find it curious that he stops to salute ecologists, since I was in an
ecology lab. I liked my labmates and our perspective, but we didn't have any
magical ability to avoid the problems he alludes to here.

I think a lot of his frustration comes down to not being more involved in the
planning process. That's not a new problem. R.A. Fisher put it this way in
1938: “To consult the statistician after an experiment is finished is often
merely to ask him to conduct a post mortem examination. He can perhaps say
what the experiment died of.”

Perhaps the idea that we can have bioinformatics specialists who wait for data
is just wrong. Should we blame PIs who don't want to give up control to their
specialists, or the specialists who don't push harder, earlier? Ultimately the
problem will only be solved as more people with these skills move up the
ranks. But the whole idea that we need more specialists working on smaller
chunks of the problem may be broken from the start
(<http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1183512/>).

------
sbassi
OK, I agree that there is some shitty work in this field, but he can't think
that we are all in the same boat. For example, "Irene Pepperberg’s work with
Alex the parrot dwarfs the scientific contributions of all other sequencing to
date put together": this is not true. Bioinformatics is not just blindly
sequencing new DNA, but analyzing data, and almost every new breakthrough in
medicine is based on a direct (or indirect) bioinformatics analysis. I used to
work in an agrobiotech company and the sequencer was the first source of data
for any breeding program. Bioinformatics was used to design primers for PCR to
find molecular markers. Is there bad software out there? Yes, but I see this
as more of an opportunity than a problem. And the cause is not the need to
hide something, but the lack of ability of biologists with no CS background in
the field.

------
neilk
Maybe overblown, but it echoes complaints I've heard from other bioinformatics
people.

Surely this means there's a goldmine waiting there for someone to produce a
non-broken toolchain for bioinformatics?

Or is it even possible to produce standard tools? Maybe all the labs are too
bespoke?

------
jerryhuang100
i totally disagree with Fred's negative view of bioinformatics. as "software is
eating the world", it's actually bioinformatics that is eating biology. today's
main-stream biology is dealing with an exploding amount of data from modern
instruments, images or clinical data collected every day and mostly machine
readable. to stay up-to-date a modern biologist / bioinformatician needs to
think about biological problems in a "big-data" (i know, cliche) way, then try
to gain some insight from the data with (computational) tools. today it's the
algorithms, mathematical models and software packages on top of databases that
pinpoint cancer SNPs and drive drug discovery. and today it's these same
algorithms and math models driving how wet bench work is designed. if you
think biological data are "shitty", i guess you have never seen the other kinds
of unstructured data out there. so many scholars in other fields envy biologists
and medical scientists for something called "PubMed". on the other hand, for
those purely wet bench "biologists" who think computers are magic boxes that
give answers, insights, models with one push of the button, i do feel sorry
for them. they are so last-gen as they just don't have the essential
techniques nowadays (just like a molecular biologist not knowing pcr).

------
lemming
This is a little discouraging - BioInformatics was my top choice for a
Master's program I'm planning to start this year. The program at Melbourne Uni
looks really good (accepts from three streams, Math/Stats, Biology or
Computing and tailors the course based on your background). Maybe I should go
for a more generic Machine Learning one and try to apply that to healthcare in
some other field if things are really this bad.

~~~
chrisamiller
As someone in the field, let me assure you: This article does not accurately
reflect the state of the field.

~~~
lemming
Thanks for the reply. I wasn't basing this just on the article, there seem to
be a fair number of comments here supporting a less-extreme version of what
he's saying.

------
dderiso
Some things are going to suck in academia, as this guy points out. But it's a
necessary step, and today's progress is almost always going to be tomorrow's
shit. So quit bitching.

Biologists are almost never good coders, if they can code at all. But that's
not what they do; they signed up for pipettes, not python.

It's the programmers who wrote said shitty code that are to be blamed, but you
can't hate under-paid and over-worked phd students who write this code even
though it usually has nothing to do with their thesis (the math/algorithm is
the main part, the deployable implementation is usually not the most
important).

If you want good code and organized/accountable databases, go to industry.
There's nothing new about this transition. The IMPORTANT part is that industry
gives back to academia. So when you get an office with windows and a working
coffee machine, remember to help make some phd student's life a little easier
by making part of your code open source.

------
SilasX
Where does the Rosalind project (rosalind.info) fit into all of this, I'm
wondering? It seems to be written by people who have actual understanding of
the mappings between biology and informatics, with clear explanations of
problems in terms of the programming challenge involved.

Surely they can't get that far without having some kind of sensible method?

------
dinkumthinkum
Why is this on the front page or why is it relevant? It's kind of a rant. I
did some work on a publication in this field and was published once; I don't
think it is a horrible research program. There may exist some of the issues in
bioinformatics described here but I don't think it is terribly productive.

------
ascotan
Having worked in the bioinformatics industry as an SE for 9 years I can both
agree and disagree.

1\. I agree that SE standards and good coding practice are completely absent
in the bioinformatics world. I remember being asked to improve the speed of
some sequence alignment tools and realizing that the source code was originally
Delphi that had been run through a C++ converter. No comments, single
monolithic file. The vast majority of the bioinformatics code I worked with
was poorly written/documented Perl. In addition, a lot of bioinformatics guys
don't understand SE process, and so rather than having a coordinated
engineering effort, you end up with a lot of "cowboy coding" with guys writing
the same thing over and over.

2\. I agree that productivity is very slow. This is a side product of research
itself though. In the "real world" (quoted) where people need to sell
software, time is the enemy. It's important to work together quickly to get a
good product to market. In the research world, you get a 2/5 year grant and
no one seems to have much of a fire under them to get anything done (Hey we're
good for 5 years!). You would think that people would be motivated to cure
cancer quickly (etc), but it's not really the case. Research moves at a snail's
pace, and consequently so do the productivity expectations of the
bioinformatics group.

3\. I disagree that research results from the scientists are garbage. Yes, it's
true that some experiments get screwed up. However, if you have a lot of
people running those experiments over and over, the bad experiments clearly
become outliers. Replication in the scientific community is good because it
protects against bad data this way. Somehow the author must have had a
particularly bad experience.

4\. Something the author didn't mention that I think is important to
understand: most scientists have no idea how to utilize software engineering
resources. The pure biologists are many times the boss, and don't really
understand how to run a software division like bioinformatics. Many times
a bioinformatics group is run by PhDs in CS who have never worked in industry
and don't know anything about good SE practice or how to run a software
project. A lot of the problems in the bioinformatics industry are directly
related to poor management. Wherever you go you're going to have team members
that have trouble programming, trouble with their work ethic, trouble with
following direction. However, in a bioinformatics environment where these
individuals are given free rein and are not working as a cohesive unit, you
can see why there is so much terrible code and duplication.

------
caseybergman
This piece seems to have touched a nerve in the bioinformatics community,
though I have no idea why. Much of what is said here is obvious to anyone
working in academic research that requires programming expertise.

Yes, industry typically pays more than academia. Yes, most molecular
biologists cannot code and rely on bioinformatics support. Yes, biological
data is often noisy. Yes, code in bioinformatics is often research grade
(poorly implemented, poorly documented, often not available). These are all
good points that have been made many times more potently by others in the
field like C. Titus Brown
(<http://ivory.idyll.org/blog/category/science.html>). But they are not
universal truths and exceptions to these trends abound. Show me an academic
research software system in any field outside of biology that is as functional
and robust as the UCSC genome browser (serving >500,000 requests a day) or the
NCBI's pubmed (serving ~200,000 requests a day). To conclude from common
shortcomings of academic research programming that bioinformatics is a
"computational shit heap" is unjustified and far from an accurate assessment
of the reality of the field.

From looking into this guy a bit (who I've never heard of before today in my
10+ years in the field), my take on what is going on here is that this is the
rant of a disgruntled physicist/mathematician and self-proclaimed
perfectionist (<https://documents.epfl.ch/users/r/ro/ross/www/values.html>),
who moved into biology but did not establish himself in the field. From what I
can tell contrasting his CV
(<https://documents.epfl.ch/users/r/ro/ross/www/cv.pdf>) to his linkedin
profile (<http://www.linkedin.com/pub/frederick-ross/13/81a/47>), it does not
appear that he completed his PhD after several years of work, which is always
a sign of something going awry and that someone has had a bad
personal experience in academic research. I think this is the most important
light to interpret this blog post in, rather than as an indictment of the field.

That said, I would also like to see bioinformatics die (or at least wither)
and be replaced by computational biology (see differences in the two fields
here: [http://rbaltman.wordpress.com/2009/02/18/bioinformatics-
comp...](http://rbaltman.wordpress.com/2009/02/18/bioinformatics-
computational-biology-same-no/)). Many of the problems that apparently Ross
has experienced come from the fact that most biologists cannot code, and
therefore two brains (the biologist's and the programmer's) are required to
solve problems that require computing in biology. This leads to an abundance
of technical and social problems, which as someone who can speak fluently to
both communities pains me to see happen on a regular basis. Once the culture
of biology shifts to see programming as an essential skill (like using a
microscope or a pipette), biological problems can be solved by one brain and
the problems that are created by miscommunication, differences in
expectations, differences in background, etc. will be minimized and situations
like this will become less common.

I for one am very bullish that bioinformatics/computational biology is still
the biggest growth area in biology, which is the biggest domain of academic
research, and highly recommend students to move into this area
([http://caseybergman.wordpress.com/2012/07/31/top-n-
reasons-t...](http://caseybergman.wordpress.com/2012/07/31/top-n-reasons-to-
do-a-ph-d-or-post-doc-in-bioinformaticscomputational-biology/)). Clearly,
academic research is not for everyone. If you are unlucky, can't hack it, or
greener pastures come your way, so be it. Such is life. But programming in
biology ain't going away anytime soon, and with one less body taking up a job
in this domain, it looks like prospects have just gotten that little bit
better for the rest of us.

------
ejain
I agree that a lot of effort that is put into bioinformatics is wasted. But
it's silly to say that bioinformatics hasn't contributed much to science, and
naive to think that dysfunctional software development is less widespread
outside of bioinformatics.

------
julienchastang
Fascinating HN thread. I work in the geoinformatics domain where many of the
same comments apply. I agree scientists turned programmers are often poor
software developers. Moreover, this group often belittles industry established
best practices in software development. But in truth, the "pure" software
engineer/computer scientist lacks sufficient domain expertise to accomplish
something useful. Learning fluid dynamics requires many years of education.
Ideally, you would like these two groups to work closely together and with
mutual respect.

------
iharris
I largely agree with Fred's opinion on the shortcomings of bioinformaticians
and the general attitude in the industry, but my personal experience was
actually pretty positive. My past research was on building visualizations of
the complicated biochemical processes, for use in educating undergrads. It was
certainly more interesting than slogging through mounds of crappy data.

Just another data point for someone contemplating a career in BINF, although
some purists might say that my work did not really fall under the same
category.

------
chewxy
Spelling error: 'technically apt', not 'ept'.

"Ept" means effective. As in "inept"

I don't understand this part:

> No one seems to have pointed out that this makes your database a reflection
> of your database, not a reflection of reality. Pull out an annotation in
> GenBank today and it’s not very long odds that it’s completely wrong.

In fact this entire article seems to be a rant on why bioinformatics as a
field is rotting. But instead of ranting, surely something can be done about
it?

Shouldn't we as hackers see this as an opportunity to revolutionize the field?

~~~
saraid216
As a general rule, the people on the short end of the stick are the people
least capable of producing change. Worse, change that they bring about tends
to be good from a strict, technical viewpoint but has huge negative side
effects that go unnoticed or deliberately ignored until it becomes difficult
to distinguish the resultant system as a better one.

Rants like this, and providing interviews to third parties, are actually one
of the more positive things that he could bring to the table: it provides
information to people who aren't aware and inspires motivation in people who
aren't entangled.

~~~
chewxy
I don't know, but I think Fred is in a prime position to disrupt
bioinformatics. He knows all the flaws, he knows all the problems. If I were
him, I'd have seized the opportunity and worked on a hard problem.

Then again, I am in no position to judge what Fred should or should not do

------
mvanveen
Say for the purposes of argument that this thesis were true. What is there (if
anything) to be done about it? I ask as a naive interested party with a CS
background.

------
ElliotH
That's a shame. I just finished a uni module about bioinformatics. It seemed
like a cool field where progress was being made, and as an undergraduate I
could generate meaningful looking results by following very recent papers. I
hope the field has some saving graces even if this is all true. The idea of
CompSci folk working with biology folk to solve human problems inspired me a
lot.

------
jmgao
The author is exactly right about the quality of data in bioinformatics. There
are datasets with genes named MAR1, DEC1, etc. getting mangled to 1-Mar,
1-Dec, because of Microsoft Excel autoformatting.

[http://nsaunders.wordpress.com/2012/10/22/gene-name-
errors-a...](http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-
excel-lessons-not-learned/)
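
A quick way to check a gene list for this kind of damage is to look for day-month strings where symbols should be. This is a minimal sketch: the function name is made up for illustration, and the pattern assumes the English "1-Mar" style shown above, which may differ from what a given Excel locale actually writes.

```python
import re

# Day-month strings such as "1-Mar" or "1-Dec" that can appear where
# symbols like MAR1 or DEC1 used to be. The English month abbreviations
# are an assumption; other locales would need a different list.
_DATE_LIKE = re.compile(
    r"\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)",
    re.IGNORECASE,
)

def find_mangled_gene_names(names):
    """Return the entries that look like dates rather than gene symbols."""
    return [n for n in names if _DATE_LIKE.fullmatch(n.strip())]
```

It only flags suspects; recovering the original symbol still needs the source data, since the date form does not say which gene it came from.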

------
pjotrp
The bio in bioinformatics is the important bit. Informatics plays second
fiddle, even in the name. Very few will appreciate your beautiful code, but
many will appreciate you finding a cure for cancer. That is the reality of
bioinformatics: most of the code has a short shelf life. If you luck out, your
software may live longer, as is the case with samtools. It is true that the
samtools code is crappy; still, the much cleaner alternatives, sambamba and
bamtools, are not much used! Go figure.

Maybe bioinformatics is not the place to aim for great informatics. We do
bioinformatics because of love of science first and foremost. This is frontier
land, the wild west, and it pays to play quick and dirty. I would suggest
hanging on to some best practices, e.g. modularity, TDD and BDD, but forget about
appreciation. Dirty Harry, as a bioinformatician you are on your own.

To be honest, in industry it is not much different. These days, coders are
carpenters. If you really want to be a diva, learn to sing instead.

------
thornad
molecular biology has been dead for years now, but the amount of money poured
into it makes it impossible to publish its death certificate. Here is why and
how it happened (among other things):
<http://www.youtube.com/watch?v=Y0b11S1FjXY>

------
datz
Come work with me in my genomic interpretation company. Fun application
building, no data mess, big money!

------
mscarborough
>> I’m leaving bioinformatics to go work at a software company with more
technically ept people and for a lot more money.

More money, good on you. Starting off your critique of your former colleagues
with "technically ept people"... not going to get a lot of sympathy for the
correctness of your work.

~~~
aheilbut
Everyone is jumping on that, but (while I had to look it up too) 'ept'
actually is a real word:

from the OED:

ept, adj. Pronunciation: /ɛpt/ Etymology: Back-formation < inept adj.

    Used as a deliberate antonym of ‘inept’: adroit, appropriate, effective.

1938 E. B. White Let. Oct. (1976) 183, I am much obliged..to you for your
warm, courteous, and ept treatment of a rather weak, skinny subject.

1966 Time 30 Sept. 7/1 With the exception of one or two semantic twisters, I
think it is a first-rate job—definitely ept, ane and ert.

1976 N.Y. Times Mag. 6 June 15 The obvious answer is summed up by a White
House official's sardonic crack: ‘Politically, we're not very ept.’

~~~
christiangenco
That was…surprisingly thorough.

~~~
gwern
That _is_ the point of the OED: to be comprehensive and include real usages.

~~~
saraid216
That's the point of any half-decent dictionary.

The OED is a gold standard, though.

~~~
bdr
James Murray was the true Scotsman.

------
retrogradeorbit
Someone's got a bad case of God Complex.

------
helloamar
i'm not into bio, but i read articles on the latest developments. my sister
also took bioinformatics, but the scope in India seems very limited.

have you checked out synthetic biology? will it be easy to understand when you
have a degree in bioinformatics?

