A farewell to bioinformatics (2012) (madhadron.com)
337 points by emcl on Jan 27, 2013 | 170 comments

John Graham-Cumming (jgrahamc here) co-authored a piece on making scientific code open. It was received well enough that Nature published it [0]. This approach has inspired others to do better work by describing a concrete problem, then outlining steps to fix it on an individual and institutional level.

When someone finds fault with the way a field conducts itself, I would implore them to constructively influence that field. You might be surprised how many are actually sympathetic to your concerns.

I'm not dismissing this author's concerns: to do that would really require knowing the molecular biology field (which is more than sequencing, it turns out). I do neuroscience right now, and programming can be a problem for some. But a constructive suggestion to change can have much more impact than a long rant.

[0] http://www.runmycode.org/data/MetaSite/upload/nature10836.pd...

Off topic, but since you mentioned jgrahamc's article in Nature, interestingly, this was what I read last night on Simply Statistics: http://simplystatistics.org/2013/01/23/statisticians-and-com...

It's a similar issue. I think statisticians are taking constructive steps to correct their path, since you know, ML is the new sexy thing. Bioinformatics could take a much longer time to self-correct though.

Although, as I mentioned in an earlier comment, Fred seems to be in a prime position to disrupt the bioinformatics field, since he seems to know all the problems that afflict it.

From your second graph, Iran and Pakistan have stronger interests in Machine Learning than the US. (I am not surprised about India, South Korea, and China though).

Is interest in advanced info tech really that widespread in those countries, or is it simply that the only people who can use Google there are government-sanctioned researchers? Could anyone familiar with the reason shed some light for the rest of us?

I am not sure what you based your conclusions on.

Pakistan's internet is generally open (except youtube and pornography). But there is no widespread interest in ML particularly. Only a few companies - most of them outsourcing from the US.

The problem is that the people who have the experience and knowledge to really make it better are not tempted to go in and fix it, because there's so much political and non-technical work involved in doing that. If someone wants to solve hard computational problems, they might as well go into another field. If they really care about doing biology, chances are that they aren't the greatest programmers (I understand this may be changing, but it still seemed to be the case 4 years ago when I left bioinformatics). This leaves the people who are happy with the status quo staying in bioinformatics, and the people who are dissatisfied going to other fields where they feel their work can have more of an impact.

In my experience, what happens is that biologists define the science, and they depend on the computer scientists / engineers to implement solutions to their computational problems. The computational people depend on the biologists to validate whatever results they produce. The iteration cycle can be painfully slow, especially for people used to telling machines what they want them to do, and getting results immediately. The proposition of changing that dynamic is not alluring to most people, but I still hope there will be some who try.

> the software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! When the machines are procured, even larger hunks of data are indiscriminately shoved through black box implementations of algorithms in hopes that meaning will emerge on the far side. It never does, but maybe with a bigger machine…

I spent five years working in bioinformatics, and this is exactly the attitude of both the researchers and the other developers on the projects I worked on. It was very frustrating.

Hi, I'm a bioinformatics researcher. Apparently I work for this guy's ex(?)-employer although I have never heard of him before.

My single most limited resource is programmer time. My time and the time of other people who work with me. I have access to loads of computers that sit idle all the time, even if it is on nights and weekends. There is zero opportunity cost to me in using these computers more fully. I have enough human work to do that I can wait for the results without having any wait states.

There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly to ensure correctness. Doing this may seem more appealing to someone who is interested primarily in computational efficiency. But I am more interested in research efficiency, and so are my employers and funders.

>There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly ensure correctness.

Hi, I recognize your name as a legit bioinformatician, am a huge fan of the lab that you're currently in, and others should listen to you.

I'd like to add that for many projects, general reusable software engineering is not necessarily a huge advantage. Instead of verifying a single implementation, it's often better for somebody to reimplement the idea from scratch; if a second implementation in a different language written by a different programmer gets the same results, this is a much more thorough validation of the software than going over prototype software line by line.
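As a minimal sketch of that kind of cross-check (not anyone's actual pipeline), here are two independently written reverse-complement routines in Python, compared on random input. The function names and the choice of reverse complement as the task are assumptions made purely for illustration.

```python
import random

def revcomp_table(seq):
    """First implementation: complement with a lookup table, then reverse."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(table)[::-1]

def revcomp_loop(seq):
    """Second implementation: walk the sequence backwards base by base."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    out = []
    for base in reversed(seq):
        out.append(pairs[base])
    return "".join(out)

def cross_check(n_trials=1000, max_len=50, seed=0):
    """Return the first input on which the two versions disagree, or None."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        length = rng.randint(0, max_len)
        seq = "".join(rng.choice("ACGTN") for _ in range(length))
        if revcomp_table(seq) != revcomp_loop(seq):
            return seq
    return None
```

Agreement only shows that the two versions behave the same on the inputs tried; both could still share the same misunderstanding of the problem.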

Also, I've seen way too many software engineers come in with an enterprisey attitude of establishing all sorts of crazy infrastructure and get absolutely no work done. If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing. In research it's best to get results, fail fast fast fast, and move on to the next idea. If you're lucky, 1 in 20 will work out. Publish your crap, and if it's a good idea, it will be worth polishing the turd later, but it's better to explore the field than to spend too much time on an uninteresting area.

The only time you worry about efficiency is when it enables a whole other level of analysis. So, for example, UCSC does most of their work in C, including an entire web app and framework written in C, because when they were doing the draft assembly of the human genome a decade ago on a small cluster of computers that they scrounged from secretaries' desks over the summer, Perl wouldn't cut it.

Software engineering is important for bioinformatics, in my opinion. But it's important to identify the things that are important and aren't:

Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

Even today, the UCSC Genome Browser is an example where efficient code is important. It is interactive software with many human users, who can work much more efficiently when the browser is responsive. And with projects like ENCODE, there are now incredible amounts of data available from the browser that would not be easily possible with a less efficient system.

Very different from an analysis system that will be run a handful of times in batch mode.

>Reproducible code: extremely important. Correct code: extremely important. Readable code: very important. Efficient code: often not as important.

You want Haskell. :)

> If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing.

FWIW, I have in the past gotten good results out of Java and C# (it's a lot easier in C#) by writing programs that generate bytecode at runtime, so they can use the JIT to further optimize performance. Getting the same results out of C would require a lot more work. This includes string processing - I wrote a regex compiler for Java at one point, easily outperforming java.util.regex.

And such things are not difficult - or at least, not difficult to me now, knowing all I know - perhaps 10 hours' work for a simple regex compiler. And that is how I would use tools like Java to optimize my own performance: adapt them to interpret or compile a language that is close to the problem domain. A slightly higher constant cost, with the aim of a much lower per-idea cost.

You describe N-version programming (though not by name). In actuality two different from-scratch implementations are likely to re-make the same mistakes, see the following. http://scholar.google.com/scholar?q=An+Experimental+Evaluati...

> and if it's a good idea, it will be worth polishing the turd later,

Which of the released turds do you consider to be polished?

Pretty much anything that gets used by many people ends up getting polished (the exception being the RNA-seq field, it's still pretty rough out there, but the research is still taking quite a while). And if you're writing software, your tool isn't going to get used until it's somewhat polished, or is so unique and essential in its purpose that people have to use it.

In terms of next-generation sequence analysis, Heng Li's BWA mapper and Samtools libraries are fairly good. His coding style is a bit terse for my tastes, but it keeps out people who don't know what they're doing, it's very clear code for semi-complicated algorithms, and BWA is some of the most reliable software I use every day.

On the infrastructure side, Galaxy [https://main.g2.bx.psu.edu] is getting fairly good.

The BioConductor repository of R packages is extremely mixed. I don't like some of their architectural choices, but it's ended up working out OK.

I still use Michael Eisen's Cluster from a decade ago, along with Java TreeView.

Regarding samtools, it doesn't sound very good from what I'm hearing:

"Look at the disgusting state of the samtools code base. Many more cycles are being used because people write garbage. For a tool that is intimately tied to research, the absence of associated code commentary and meaningful commit messages is very poor. The code itself is not well self documenting either."

commit log:




I can't find that critique with Google.

As I said, the style is very terse, and I have my suspicions that this is by design to minimize the number of less-qualified programmers trying to submit sub-standard code back to the project. (Edit: since it's been 10 minutes and I still can't reply to tomalsky's comment, I should point out that my "suspicions" are a joke; read the linked code sample and judge its quality for yourself.)

I have dived deep into the samtools code, rewritten chunks of file I/O inside it, messed with alternate formats, and my personal experience is that it's been easier for me to change, adapt, and understand it than any other open-source C project I've tried to dive into, such as, say, GNU join.

If anybody can point to where samtools is using many more cycles than it has to, please let me know! The worst part about it is that the compression and decompression is not multithreaded, but that is being worked out, I believe.

> I can't find that critique with Google.

I didn't link the source because I am ashamed to admit that I clicked on the reddit link that was posted in this thread.


samtools is among the better software in sequencing-data analysis. It is also great in formally defining the file format it uses. The same cannot be said of many other tools. For my recent work in RNA-Seq, samtools is the most robust and trustworthy tool I looked at and used, and I looked at almost every popular tool, the major exception being RSEM. If only all bioinformatics tools were more like samtools.

It works just fine. And iterating through files is not rocket science, anyway; it's not hard to follow what is going on.

Well, I can call myself a bioinformatics researcher, I guess, as I have a CS Ph.D. and work in genetics/genomics. I see your point that throwing computers at simple solutions is cheaper than throwing good programmers at them. I do that too. We are very fortunate in that we write run-once programs that only have to work in one environment using one input. However, bad programmers write incorrect programs, which give wrong conclusions that lead to faulty clinical trials (look up Duke University facing a class-action lawsuit). I have seen people parsing gigabyte files with one line of Awk. People seem to forget that good engineering practice is learned with blood. Is it any wonder academic research is looked on with suspicion by the pharmaceutical companies?

>I have seen people parsing Gigabytes-files with one line of Awk

I feel exactly the opposite. I'm suspicious of anyone that does not use AWK (or other Unix text utilities) as a standard tool for checking the integrity of multi-gigabyte files, or generating summaries. AWK is super-fast, allows highly flexible checks, and allows quick and reliable interaction with huge amounts of data in the way that a script can not.

I love awk. I once had to search a multi-megabyte hunk of data that was made up of 25-bit data items packed into 32-bit words. Instead of doing bit packing and unpacking, I converted the words into 32 character strings of 1's and 0's. I ended up with a string 300,000,000 (that's three hundred million) characters long!!! Awk had no problems handling it.

To build the string, I had to concatenate 1024 of the 32-character strings into an intermediate string, and then concatenate these into the final monster, because concatenating just the 32-character strings took too long - a reallocation after every concatenation.

That was fun.

I believe this is an example of the sort of thing that the essay author complained about.

Bit-packing is simple. You spent a lot of time working around problems that shouldn't have existed in the first place. Even when using the approach you described, here is Python code which does what you described:

    >>> byte_to_bits = dict((chr(i), bin(i)[2:].zfill(8)) for i in range(256))
    >>> byte_to_bits["A"]
    >>> as_bits = "".join(byte_to_bits[c] for c in open("benzotriazole.sdf").read())
    >>> as_bits[:16]
    >>> chr(int(as_bits[:8], 2))
    >>> chr(int(as_bits[8:16], 2))
    >>> open("benzotriazole.sdf").read(2)
This keeps everything in memory, since 300MB is not a lot of memory. If it was in the GB range then I would have written to a file instead of building an in-memory string.

The run-time was small enough that I didn't notice it.

The thing is, you succeeded in solving the problem, and are justly proud of your success. This is how a lot of scientists feel. But a lot of CS people look at the wasted work when there are simpler, better, more maintainable ways.
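For comparison, a minimal sketch of pulling the 25-bit items out directly with integer shifts, with no intermediate string of 1s and 0s. The original data layout was not described, so this assumes the items are packed back to back, most significant bit first, across big-endian 32-bit words; that layout and both function names are assumptions.

```python
import struct

def unpack_items(data, item_bits=25):
    """Yield fixed-width unsigned items from bytes holding big-endian 32-bit words."""
    n_words = len(data) // 4
    words = struct.unpack(">%dI" % n_words, data[:n_words * 4])
    mask = (1 << item_bits) - 1
    buffer = 0      # bits read from words but not yet emitted
    available = 0   # number of valid low-order bits in `buffer`
    for word in words:
        buffer = (buffer << 32) | word
        available += 32
        while available >= item_bits:
            available -= item_bits
            yield (buffer >> available) & mask
        buffer &= (1 << available) - 1  # drop bits that were already emitted

def pack_items(items, item_bits=25):
    """Inverse helper for testing: pack items into whole 32-bit words, zero padded."""
    bits = 0
    count = 0
    for item in items:
        bits = (bits << item_bits) | item
        count += item_bits
    pad = -count % 32
    bits <<= pad
    return bits.to_bytes((count + pad) // 8, "big")
```

Note that when the padding at the end is 25 bits or longer, it decodes as extra zero items, so under this layout the caller would need to know the real item count.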

I wasn't appalled by the AWK language but by the blind switching of two lines that were assumed to be paired reads, when many reads are not paired. It is precisely the lack of checking that is the problem. I have nothing against AWK, although personally I use Python for massaging data.
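A minimal sketch of the kind of check that was missing, in Python: before treating two consecutive records as a pair, confirm that their names actually match. It assumes that read names ending in /1 and /2 mark mates, which is one common convention but not a universal one; the function names are hypothetical.

```python
def base_name(read_name):
    """Strip a trailing /1 or /2 mate suffix, if present."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name

def pair_up(read_names):
    """Split an ordered list of read names into (pairs, singletons).

    Two consecutive reads count as a pair only if the first ends in /1,
    the second ends in /2, and the rest of the name is identical.
    """
    pairs, singletons = [], []
    i = 0
    while i < len(read_names):
        first = read_names[i]
        second = read_names[i + 1] if i + 1 < len(read_names) else None
        if (second is not None
                and first.endswith("/1") and second.endswith("/2")
                and base_name(first) == base_name(second)):
            pairs.append((first, second))
            i += 2
        else:
            singletons.append(first)
            i += 1
    return pairs, singletons
```

Only the reads that land in `pairs` would be safe to swap or otherwise process as mates; the singletons are left alone.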

The Duke situation to which you refer was fraud, not just a result of programmer error or poor engineering practice.

It was initially a programming error, but the Duke researchers refused to acknowledge it and reanalyze their data, because that might mean retracting their prominent paper. From there it just snowballed. It was certainly fraud after the error was pointed out to them. There might also be other elements of fraud in their paper. I watched a presentation by the MD Anderson researchers who spotted the error and spent more than two years trying to call attention to it.

The smoking gun was an error, but there were something like 9 Potti papers that ended up getting retracted. There's no way that someone could have accidentally made that many mistakes...

Interestingly, the fraudsters were caught because of a false claim on a CV, and that finally destroyed their credibility.

It is intentional fraud, no doubt about it; they restarted halted clinical trials. I was just pointing out they did sloppy work too.

They were caught due to their bad behavior in the case you listed earlier. But Duke refused to do anything about it until Anil Potti's false claim of a Rhodes Scholarship came to light.

Wow, that's messed up.

Fundamental methodological error -> "Come on, these are competent people, you have to trust that whatever error they made didn't affect the final result."

False claim of accolade -> "How dare you fucking try to pass off this garbage as legitimate science?!?!?"

Welcome to academia

Why is that?

>We are very fortunate in that we write run-once programs that only have to work in one environment using one inputs.

If you work with that mentality, you're asking for trouble. Well, not so much asking for trouble, but sending Trouble a voicemail that says "We're over here, you lazy bastard, just see if you can mess something up!"

That was a tongue-in-cheek thingy. I write extensive tests for all my code. But when I look for a job, people count papers; they don't weigh software quality. It is not easy.

I am a bit clueless here. What is bad about parsing large files with Awk?

Nothing. What is bad is changing a file's content without really parsing it, because you assume the file format mandates something when it doesn't.

How can you leverage all of us really good programmers with tons of time, who are dying to work on something "important" and meaningful?

If you want to improve the software engineering quality of bioinformatics software, then find an open source project you are interested in, and offer to submit patches to improve really unsexy but important stuff for bioinformatics user experience. Things like documentation, deployment, user interface, and testing. Some of these things require little domain knowledge but no one wants to do them.

Edited to add: some projects might even have a bug tracker that will already have problems you can tackle.

Where to start? Any list of these projects?

I don't do programming for fun, but I'll be visiting a local university soon, and could share this with students.

Our Genomedata[1] storage format/API should be readily comprehensible, and has a Google Code tracker[2]:

[1] http://noble.gs.washington.edu/proj/genomedata/

[2] http://code.google.com/p/genomedata/issues/list

Got any examples?

Thank you!!

Part of the problem is grant money. Sometimes it's faster to buy more machines and get more results as opposed to rewriting entire algorithms. But the author does correctly identify, I think, some tendencies of some academic bioinformaticists.

I have enough experience to know if this is true or not. Many times it was faster to buy more machines, but often it was not. We already had 10000 cores.

I proposed, implemented, and tested an 8 line change to our alignment tool that saved 6% cpu time. It took me two days, most of which was my spare time at home. This one program was using 15 cpu years every month. Nobody cared. It never went into production. I started interviewing for a new job and left shortly after that.
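For scale, a back-of-the-envelope calculation using only the figures quoted above (15 CPU-years per month, 6% saved); the 365-day year used for unit conversion is the one added assumption.

```python
CPU_YEARS_PER_MONTH = 15.0   # usage quoted in the comment
SAVING_FRACTION = 0.06       # 6% of CPU time saved by the change
DAYS_PER_YEAR = 365          # assumption, for unit conversion only

saved_cpu_years_per_month = CPU_YEARS_PER_MONTH * SAVING_FRACTION
saved_cpu_days_per_month = saved_cpu_years_per_month * DAYS_PER_YEAR
saved_cpu_years_per_year = saved_cpu_years_per_month * 12

print(f"{saved_cpu_years_per_month:.2f} CPU-years saved per month")
print(f"{saved_cpu_days_per_month:.1f} CPU-days saved per month")
print(f"{saved_cpu_years_per_year:.2f} CPU-years saved per year")
```

That works out to about 0.9 CPU-years every month, or roughly 11 CPU-years over a year, in exchange for two days of work.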

How complicated was the bureaucracy that you couldn't push the change into production yourself after verifying that it is a strict speed-up and doesn't break anything? I think such barriers are incompatible with the word 'research', where the first thing you need is freedom.

Research is a highly competitive business mixed with industry involvement (or government involvement). You have to publish and fast. You have to develop your discoveries into something that can be monetized. You have to collaborate with industry to get funded. You have to cut costs to keep doing what you want to do. And so on. The idea of freedom in (fundamental) research seems long dead. How I long for the freedom in the research labs in the first half of the 20th century. To really explore an idea without regard for cost, returns, (publishable) results. A researcher can dream :-(

> The idea of freedom in (fundamental) research seems long dead

Is it so in the US? Or where? Here in Russia it is far from true, at least in the top institutes. As long as you produce publishable results, you may do virtually whatever you want, and nowadays pretty much anything is publishable. And this way you get funding, too, because the funding agency doesn't seem to want you to solve some particular problem, it just wants to be sure your science keeps up with the world.

The downside here is that the academy usually pays badly. Thus it seems most successful labs work like 70/30 on commercial projects and "pure science". Anyway, when you work on commercial projects you usually get much more interesting results than you'd care to publish.

Here in the Netherlands it is. I assumed it to be the same in the Western world, but those kinds of generalizations often turn around to bite me in the ass. We researchers in the Netherlands have to produce, as funding depends on it. Furthermore, as the government funds less and less, we have to get more funding from industry. And finally we have to try to market our research more. This all means that we cannot afford to just do whatever we think is best for the sole purpose of extending our knowledge. We have to think about our career and the sustainability of our research (strand) in the long run.

That does not mean that we're just lapdogs for industry or Mammon, but it does mean that we're selective in what we do and how we do it.

Some labs are conservative because they are worried that they will not be able to reproduce the same analysis. For example, imagine that a lab had been collecting samples and executing the current code as they came in. Now imagine that, two years later, someone starts drawing conclusions based on an aggregation of the results over a petabyte of that data. On one hand, you could just say: no, we can't reproduce the same analysis, but we can use all of our computational power for a month and reanalyze all of the data using the current packages and code. On the other hand, a more conservative idea might be to record the entire state of the environment when particular samples are recorded, so that in theory you could replay all of that analysis: fire up the VM from 2 years ago, install the same versions of all the packages, install the code with the same tags, and analyze that data set, then do the same thing for every other data set. Smaller labs, I think, are just hoping that no one tries to replicate their studies or asks them if they can reproduce their results.
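A minimal sketch of a much lighter version of that idea, in Python: write a small manifest next to each result recording the interpreter, platform, installed package versions, a code tag, and checksums of the inputs. This is a stand-in for illustration, not a substitute for a full VM snapshot; the function names, the manifest layout, and the `code_version` argument are all assumptions.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata

def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(input_paths, code_version):
    """Collect interpreter, platform, package versions, and input checksums."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "code_version": code_version,  # e.g. a VCS tag, supplied by the caller
        "packages": dict(sorted(packages.items())),
        "inputs": {path: sha256_of_file(path) for path in input_paths},
    }

def write_manifest(manifest, out_path):
    """Save the manifest as JSON alongside the analysis output."""
    with open(out_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A manifest like this does not make an old analysis rerunnable by itself, but it at least records which versions and inputs produced a given result.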

I don't think the problem is the people (researchers, developers) but the infrastructure for research. Researchers are constantly thinking about getting new grants and renewing old ones the way politicians are constantly worried about their corporate sponsors and getting reelected. The result is that we only get a little science and we only get a little good governance. The internal organizations that form as a result of this environment are artificial. In the lean times researchers make short term decisions aimed at generating marketing and taking mindshare. In the fat times researchers ensure that all computational and lab space are used and come up with new reasons for growth. A friend working in a large research institution once suggested a refactoring that would greatly improve efficiency of an application. Instead, she was handed back down a recommendation that would make the application less efficient with the same functionality. The reason was that the computational usage was about to be audited and the rule was that there would be no improvements in efficiency until after it was complete. The system is an old house with hundred year old plumbing. The people you pour through the system are going to flow through the pipes abiding by the laws of physics. Blaming them for a leak is about as useful as blaming water: while you may win the moral argument, you will not solve the problem. The best you can do is replace them with new people who will react largely in the same manner.

I have some experience working at a genomics research company and I'll broadly +1 Fred's experience about the industry, although in less negative terms. I got out before I got jaded, so my perspective is a bit more "oh, that's a shame" than his. I really like genetics, bioinformatics, hardware, deep-science, and all that but the timing and fit weren't right.

The tools are written by (in my experience) very smart bioinformaticians who aren't taught much computer science in school (you get a smattering, but mostly it's biology, math, chemistry, etc.). Ex:




The tools themselves are written by smart non-programmers (a very dangerous combination) and so you get all sorts of unusual conventions that make sense only to the author or organization that wrote it, anti-patterns that would make a career programmer cringe, and a design that looks good to no one and is barely useable.

Then, as he said, they get grants to spend millions of dollars on giant clusters of computers to manage the data that is stored and queried in a really inefficient way.

There's really no incentive to make better software because that's not how the industry gets paid. You get a grant to sequence genome "X". After it's done? You publish your results and move on. Sure, you carve out a bit for overhead but most of it goes to new hardware (disk arrays, grid computing, oh my).

I often remarked that if I had enough money, there would be a killing to be made writing genome software with a proper visual and user experience design, combined with a deep computer science background. My perfect team would be a CS person, a geneticist, a UX designer, and a visual designer. Could crank out a really brilliant full-stack product that would blow away anything else out there (from sequencing to assembly to annotation and then cataloging/subsequent search and comparison).

Except, I realized that most folks using this software are in non-profits, research labs, and universities, so - no, there in fact is not a killing to be made. No one would buy it.

I live in this field, as a computer scientist learning the biology, and trying to make a living with a bootstrapped company.

I wrote a post about why GATK - one of the most popular bioinformatic tools in Next Generation Sequencing should not be put into a clinical pipeline:


In terms of your ideal software strategy, I can speak to that as well, as I am actually attempting to do almost exactly what you are suggesting. My team is all masters in CS & Stats, with focus on kick-ass CG visualization and UX.

We released a free genome browser (visualization of NGS data and public annotations) that reflects this:


But you're right, selling software in this field is a very weird thing. It's almost B2B, but academics are not businesses and their alternative is always to throw more Post-Doc man-power at the problem or slog it out with open source tools (which many do).

That said, we've been building our business (in Montana) over the last 10 years through the GWAS era selling statistical software and are looking optimistically into the era of sequencing having a huge impact on health care.

> I wrote a post about why GATK - one of the most popular bioinformatic tools in Next Generation Sequencing should not be put into a clinical pipeline:

I've seen you link to your blog post a couple of times now, and I still think it's misleading. I do wonder whether your conflict of interest (selling competing software) has led you to come to a pretty unreasonable conclusion. (My conflict of interest is that I have a Broad affiliation, though I'm not a GATK developer.)

In your blog post, you received output from 23andme. The GATK was part of the processing pipeline that they used. What you received from 23andme indicated that you had a loss of function indel in a gene. However, it turns out that upon re-analysis, that was not present in your genome; it was just present in the genome of someone else processed at the same time as you.

Somehow, the conclusion that you draw is that the GATK should not be used in a clinical pipeline. This is hugely problematic:

1) It's not clear that there were any errors made by the GATK. Someone at 23andme said it was a GATK error, but the difference between "user error" and "software error" can be blurred for advantage. It's open source, so can someone demonstrate where this bug was fixed, if it ever existed?

2) Now let's assume that there was truly a bug. Is it not the job of the entity using the software to check it to ensure quality? An appropriate suite of test data would surely have caught this error yielding the wrong output. Wouldn't it be as fair, if not more so, to say that 23andme should not be used for clinical purposes since they don't do a good job of paying attention to their output?

Your blog post shows, for sure, a failure at 23andme. Depending on whether the erroneous output was purely due to 23andme or if the GATK had a bug in production code, your post shows an interesting system failure: an alignment of mistakes at 23andme and in the GATK. But I really don't think it remotely supports the argument that the GATK is unsuitable for use in a clinical sequencing pipeline.
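A minimal sketch of the kind of check point 2 has in mind, in Python: run the pipeline on a sample whose variants are already known and flag any call that is not in the expected set. The tuple representation and the function name are hypothetical, and this is not a description of how 23andMe or the GATK team actually test.

```python
def compare_calls(called, expected):
    """Compare variant calls, each a (chrom, pos, ref, alt) tuple, to a truth set."""
    called, expected = set(called), set(expected)
    return {
        "unexpected": sorted(called - expected),  # possible false positives
        "missed": sorted(expected - called),      # possible false negatives
    }
```

A call that shows up under `unexpected` for a well-characterized control sample is the sort of signal that would prompt a closer look before results go out.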

On your first point, my post detailed that 23andMe confirmed it was a GATK bug that introduced the bogus variants and the bug was fixed in the next minor release of the software. There are comments on the post from members of 23andMe and the GATK team that go into more details as well.

On your second point: 23andMe had every incentive to pay attention to their output, but it is fair to say they are responsible for letting this slip through. But it's worth noting, in the context of the OP rant, that 23andMe probably paid much more attention to their tools than most academics, who often treat alignment and variant calling as a black box that they trust works as advertised.

So what I actually argue in the post (and should have stated more clearly in my summary here) was that GATK is incentivised, as an academic research tool, to quickly advance their set of features with the cost of bugs being introduced (and hopefully squashed) along the way.

This "dev" state of a tool is inappropriate for a clinical pipeline, and the GATK team's answer to that is a "stable" branch of GATK that will be supported by their commercial software partner. Good stuff.

Finally, I actually have no conflict of interest here, as Golden Helix does not sell commercial secondary analysis tools (like CLC Bio does). I wrote this from the perspective of someone who is a 23andMe consumer and who also needs to stay informed, since I give our users recommendations on upstream tools (I might add that I would still recommend and use GATK for research use, with the caution to potentially forgo the latest release for a more stable one).

You know though, the conflict of interest dismissal is something I run into more than I would expect. I'm not sure if some commercial software vendor has acted in bad faith in our industry to deserve the cynicism or if this is inherited by default from the "academic" vs "industry" ethos.

> So what I actually argue in the post (and should have stated more clearly in my summary here) was that GATK is incentivised, as an academic research tool, to quickly advance their set of features with the cost of bugs being introduced (and hopefully squashed) along the way.

Sure, I agree with that. And I would agree if you said "Using bleeding-edge nightly builds of %s for production-level clinical work is a bad idea," whether %s was the GATK or the Linux kernel. I would be in such complete agreement that I wouldn't even feel compelled to respond to your posts if that's what you had said originally, rather than saying, "the GATK ... should not be put into a clinical pipeline". The former is accepted practice industry-wide; the latter reads like FUD and cannot be justified by one anecdote.

> You know though, the conflict of interest dismissal is something I run into more than I would expect.

Regarding conflict of interest, my point was to understand your potential interests, and also to disclose my own so that you can see where I'm coming from. That's not a dismissal, it's a search for a more complete picture. Interested parties are often the most qualified commenters, anyway, but their conclusions merit review.

Hopefully people wouldn't dismiss my views because of my Broad connection, any more than they would dismiss yours if you sold a competing product.

The key is 23andMe was not using bleeding-edge nightly builds but official "upgrade-recommended" releases.

GATK currently has no concept of a "stable" branch of their repo (Appistry is going to provide quarterly releases in the future, which is great).

The flag I am raising is that a "stable" release is needed before it gets integrated into a clinical pipeline. Because the Broad's reputation is so high, it is important to raise this flag, as otherwise researchers and even clinical bioinformaticians assume choosing the latest release of GATK for their black-box variant caller is as safe as an IT manager choosing IBM.

Good call. Much like an Ubuntu LTS, having stable freezes of the GATK (now that it's relatively mature) that only get bug-fixes but no new (possibly bug-prone) features is a great idea.

This is an old story. Every domain I've worked in featured a chasm between the domain experts and the software folks. Experts write terrible software that somehow mostly works. Software folks misunderstand the problem and create overwrought monstrosities.

In my experience, this applies to accounting software, sensor data, computer-aided design, print manufacturing, healthcare, etc.

I imagine there's phases of maturity, something akin to CMM/SEI. Eventually there's enough people with a foot on both sides to bridge the gap.

It just takes time.

Hrrm, I was in a genetics research lab myself and got annoyed at the inefficiencies. In particular, it got frustrating to write & use in-house scripts to run pipelines for compute clusters and then not know what the state of the execution was, where the files were, etc. It's sort of a meta-problem, but I decided to do a startup based on writing good software w/ a good UI to make the problem better (problem = running, monitoring, managing pipelines on clusters that have job schedulers like Grid Engine): http://www.palmyrasoftware.com/workflowcommander/

Maybe it's still in the early going, but I do see how it's going to be real difficult making a living doing this. OTOH, companies like CLC Bio seem like they're doing well for themselves...

Why wouldn't anyone buy your product? If it is easy to use, and SPEEDS UP RESEARCH TIME, your researcher/PI who is spending thousands on computing clusters will buy your software for their graduate students. Hell, my PI keeps asking me if I need a faster computer so I can run Matlab better/quicker. Really, if I had a software that helped me perform research faster/better/quicker and compare my results to ground truth or gold-standards, that is a much more useful tool than a bunch of hardware for my research. You push out papers fast.

So I disagree with you on your very last sentence (agree with the rest)

Ahh the efficiency argument.

The trick is, academics often have excess manpower capacity in the form of grad students and post-docs. Even though personnel is usually one of the highest expenses on any given grant, they often don't look at ways to improve the efficiency of their research man-hours.

That's not a blanket rule, as we have definitely had success with the value proposition of research efficiency, but in general, a lot of things businesses adopt to improve project time (like Theory of Constraints project management, Mindset/Skillset/Toolset matching of personnel, etc.) are of no interest to academic researchers.

I disagree with you. If there was excess manpower, graduate students wouldn't be stressed out with overwhelming work. Obviously, there is a lot more work to go around and fewer bodies to give it to. Most of the research man-hours are spent trying to implement other people's research methods so you have a 'baseline.' A complete waste of time just to have one graph in the Results section of your publication. The height of research inefficiency is to replicate someone else's results and hope (fingers crossed) that you followed their 8-page paper (that took them 10 months to develop) meticulously. Academic researchers only care about results; it is the graduate students that need to be efficient. The efficiency software should be bought by the PIs for their graduate students.

After researching this field (biomedical R&D) a bit, I found that the mindset and workflow are mostly pre-computer. The relevant decision makers in the labs usually don't see a need to change something because "it works" and "it's always done this way".

"It's always done this way" is the ultimate motivation of any startup. We wouldn't have any competing startups if everyone just accepted that; we probably wouldn't have any entrepreneurs, or a better world for that matter. The fitness function of the world would flatline.

I'd be happy for you to be right. At least back when I worked there it wasn't clear the total addressable market was there. It's not that they couldn't buy it, it's that they didn't see the need. Perhaps that has changed. :)

There are companies out there that offer commercial sequencing software. DNANexus is one.

As for whether there's "a killing to be made", it's kind of unclear so far.

I sympathize with the author, but this piece fails because many of the specific criticisms are off-base, and he's not trying to be at all constructive.

For example, it isn't true at all that microarray data is worthless. The early data was bad, and it was very over-hyped, but with a decade of optimization of the measurement technologies, better experimental designs, and better statistical methods, genome-wide expression analysis became a routine and ubiquitous tool.

The claim that sequencing isn't important is ridiculous. It's the scaffold to which all of biological research can be attached.


There is a great deal of obfuscation, and reinventing well-known algorithms under different names (perhaps often inadvertently). There's also a lot of low-quality drivel on tool implementations or complete nonsense. This is driven largely by the need in academia to publish.

The other side of this problem is that in general, CS and computer scientists don't get much respect in biology. People care about Nature/Science/Cell papers, not about CS conference abstracts. Despite bioinformatics/computational biology not really being a new field anymore, the cultures are still very different.

No kidding about reinventing wheels. I once saw a manuscript based entirely on dot-product as 1-D least-square. I don't know what happened to it, but one reviewer called it a seminal event in GWAS.

Bioinformatics is hard, but too many careerists take advantage of difficulties and uncertainty to publish as many papers as they can get away with.
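For anyone who hasn't seen why "dot-product as 1-D least-square" counts as a reinvented wheel: the manuscript itself isn't available here, so this is only a minimal sketch of the textbook identity the remark presumably points at. Fitting y ≈ b·x through the origin by least squares gives b = (x·y)/(x·x), i.e. a ratio of two dot products. The function names and the toy numbers are made up for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slope_through_origin(x, y):
    """Least-squares slope b minimizing sum((y_i - b * x_i) ** 2)."""
    return dot(x, y) / dot(x, x)


def squared_error(x, y, b):
    """Sum of squared residuals for the line y = b * x."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))


# Toy data (hypothetical): y is exactly 2 * x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
b = slope_through_origin(x, y)
```

Nudging b in either direction only increases the squared error, which is all that "least squares" means in one dimension.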

I agree with him, and have been complaining about the same shit for ages (I work in bioinformatics too). Sadly, biologists don't care. We're treated as the number crunchers. The real problem isn't that we waste computational resources, it's that many biologists download programs, run their data through it, and if it spits out an answer rather than an error, they trust it. Since that program probably has zero unit test coverage, and the results may be fed into pharmaceutical decisions, disease diagnostics, etc, you're basically fucked if something went wrong. Lots of us have said this[0].

Minor quibble: genome assembly is definitely still an open problem that's computationally difficult. So is robust high dimension inference, but that falls more under statistics.

I've wanted to leave at least a dozen times too, for the better pay, for working with programmers that can teach me something, and to not have my work be interrupted by academic politics. But the people pissed at the status quo are the ones that are smart enough to see it's broken and try to fix it, and if we all leave, science is really fucked.

[0] http://www.johndcook.com/blog/2010/10/19/buggy-simulation-co...

If you really want to get a feel for how deluded the Bioinformatics community is, look for a job in the field as an outsider. It's not uncommon to see requirements like:

"Must be an expert in 18 technologies"

"Must have a PHD in Computer Science or Molecular Biology"

"Must have 12 years experience and post doctoral training"

"Pay: $30,000"

It's delusional because they apply the requirements it took for themselves to get a job in Molecular Biology (long PHD, post doc, very low pay for first jobs) and just apply it carte blanche to all fields that may be able to aid in their pursuits. Especially when it comes to software engineering where it can often be extremely difficult to explain why you did not pursue a PHD.

As someone on a bioinformatics team in a public research institution, salaries range from $75k to $100K for developers on our team. This includes a number of people, including myself, who do primarily normal IT things (data management, small webapps for various clinical and research needs) and also for devs doing pipeline/workflow mgmt software, novel dev (e.g. new research code for sequencing), and variant calling work.

In my geographic area, this salary range is somewhat below corporate IT work (say 10% to 15%), but generally higher than the typical university software dev job listing. The university is really bad about listing jobs and job requirements with laughable salaries. I have seen (in other departments) web app dev jobs that require significant front-end and back-end skillsets/experience and then pop a salary that is a full 50% less than entry level jobs for CS undergrads.

One problem is that hiring departments in that position will find someone to hire at that rate, so they think it was correct. From personal experience, I can verify that "good on-paper" candidates with exceptional credentials (say MS in CS, bunch of experience) from other depts who look to join our team are unable to write any code at the whiteboard at all (say a for loop in java to println something). But to be fair, a recent job interview cycle one of my teammates performed produced exactly two candidates out of 16 who could do this and only one of those could write a SQL statement that required a simple inner-join. Most of those folks were external, so it's not just a problem inside the institution.

I have a number of cynical and embarrassing opinions about this situation.

I don't disagree with most of your post, though I cannot resist commenting on the whiteboard interview test. Writing anything other than rough pseudo code or algorithm sketches on the whiteboard is a silly exercise. It's not reflective of any sort of reality, probably indicates to candidates that you are not working on any interesting problems, and people won't remember exact syntax or library functions for any language that they don't use fairly regularly.

The whiteboard is only useful as an aid in explaining an algorithm. If a candidate can do that without the whiteboard, even better.

I'm kind of with you on the whiteboard code issue (I was sitting in on the interview in question), especially for a "hard" coding exercise.

My bigger concern is that for a job that specifically highlighted the need for at least some SQL skills and some Java expertise, a candidate that cannot, even after prompting, write a for loop in Java (or in any language, when offered the chance to do so in a "favorite" language) or write a SQL statement that joins two tables probably can't do much of anything, let alone work on interesting problems.
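To give a sense of the bar being described, here is roughly the level of the two questions, sketched in Python rather than Java since any "favorite" language was allowed. The table and column names are invented for illustration and are not taken from the actual interview.

```python
import sqlite3

# Question 1: a for loop that prints something.
for i in range(3):
    print("sample", i)

# Question 2: a simple inner join across two (hypothetical) tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE samples (id INTEGER PRIMARY KEY, patient_id INTEGER, tissue TEXT);
    INSERT INTO patients VALUES (1, 'A'), (2, 'B');
    INSERT INTO samples VALUES (10, 1, 'blood'), (11, 1, 'tumor');
""")

# INNER JOIN keeps only patients that have at least one matching sample,
# so patient 'B' (no samples) does not appear in the result.
rows = conn.execute("""
    SELECT p.name, s.tissue
    FROM patients AS p
    INNER JOIN samples AS s ON s.patient_id = p.id
    ORDER BY s.id
""").fetchall()
conn.close()
```

Nothing here needs a library beyond the standard one or any syntax a working developer would have to look up.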

Here is the cold, hard truth - I know, both because of my own limitations and the opportunity of the job, that we are not going to get top % hackers. But if you apply to a job where the primary need is coding in blub, I think it's fair to expect a simple question or two about basic blub constructs. I myself would be nervous about whiteboard coding for something complex, but also generally offer (in a cover letter) to provide some code examples to talk through at an interview ahead of time.

I think it behooves us all to have at least some baseline expectation to demonstrate some competence. Remember, I'm not thinking of whiteboard coding of algorithms or anything.

I think a very fair (and concerning to me) insight might be: if you can use Google and an IDE, can you do all that this job requires?

I assume these are separate requirements. I have not seen any doctoral-level positions advertised for a salary of $30,000. The minimum NIH salary for postdoctoral trainees is more than that.

It's only delusional if they can't find people to fill the jobs. The idea that, as an outsider, you know what requirements they should use in their hiring process better than they do is perhaps more delusional.

I'm not an outsider and the 30K was a bit of an exaggeration, and I apologize for that. The point I was trying to make was that if you look in as an outsider, you would see the requirements being extremely daunting compared to what you might see elsewhere with a pay scale that is very low and unappealing to anyone who might match it. Unless, of course, you just finished your degree in some biological discipline where the jobs are scarce. They are absolutely delusional (and so am I, most likely) because in most cases what they really need to solve the problems they have, is the same type of person most companies would need in a similar situation, a quality software engineer with experience building quality applications that are both extensible and maintainable.

I worked in bioinformatics for more than 10 years before I moved on, and in my experience they do have a lot of trouble finding people to fill positions, especially outside of massive government funded groups like the NIH. This often results in passing on competent software engineers with a B.Sc. that don't meet the requirements in favor of PHD level biology graduates who have taken a year or so of undergrad computer science courses. In my experience, this leads to many of the problems discussed (and exaggerated) by the OP. While some of these people are smart and produce good work, much of the time they produce poor quality software that gets the job done, but as inefficiently as possible, and they leave a code base that is virtually unusable. Overall, I mostly just wanted to say that it's a mindset they REALLY need to get past for the long term success of the industry.

If 30k is the inaccurate number, what's the accurate one? I'm curious as to what the realistic requirements are from your experience with the field.

I've seen a lot of job listings, at very large companies and in academia, in the 45-50k range. Keep in mind, these are jobs requiring a PHD, 10 years of experience, and a dozen or so technologies.

It's not really the money that's skewed, it's their idea about the person they need for the job. They don't need someone with that background (most of the time), they just need a junior level software engineer in which case the pay scale may not be too bad. There's a problem in realizing this, however, when the standards for your own field (molecular biology for example) are extremely high, so you expect it of all others as well...

50k-60k starting out.

The field of bioinformatics will be fine even if there aren't any changes. We'll just continue to muddle through as we have. I'll agree that things would be much better if software quality were to improve, but changing that will require a change in incentives. Namely, journals or funding agencies will have to start requiring quality software.

Check the Sanger Institute's job page (https://jobs.sanger.ac.uk/wd/plsql/wd_portal.show_page?p_web...). They offer «£29,750 to £37,525» for a "senior bioinformatician", for example.

He's exaggerating about the 30k of course but it's true that these positions don't pay very well compared to what experienced programmers can get elsewhere.

I have seen this in many bio fields. As the biology research becomes more of a computational problem, requiring unique solutions for bleeding edge research, I imagine the field is going to have huge pains before actually paying for the work vs. using seniority and degree level as the sole determinant of pay scale.

This is pretty hilarious. From my brief experience with bioinformatics I can very well imagine someone writing the opposite rant, about CS people getting into bioinformatics not knowing sh*t about biology. I mean, browse through bioinformatics textbooks: they are either written by computer scientists, in which case they are little more than string algorithm textbooks, or by biologists, in which case the layer of jargon is just impenetrable for someone coming from CS. Same with bioinformatics teachers. I come from a CS background, but spent one solid month seriously trying to understand the basics of molecular biology, and my bioinformatics seminar instructor sometimes seemed to know less about it than me. Terrifying, no wonder nonsense results are produced.

My friend said: Bioinformatics means that computer scientists – who don't know mathematics and don't know biology – are trying to do mathematical biology.

I always feel awkward reading these rants, mainly because I've burned my bridges before and it really wasn't worth it. Even if it is true, it's better to leave it and move on.

If you really feel strongly about something, write it dispassionately (normally some time after the event) and treat it like a dissertation, backed with case studies and citations.

Basic science moves forward slowly limited by the pace of fortuitous discoveries. I have found that many people from the field of computer programming have unrealistic expectations of what can be done in biology and other sciences.

Sounds like a fed up academic with a stick up his backside.

Sh*tty data? Comes from the community. If the data and algorithms are so poor, and the author so superior, he should have been able to improve the circumstances.

This whole screed reads like an entitled individual who entered a profession, didn't get the glory, oh and yeah, academia doesn't pay well.

In the realm of bioinformatics, let's ignore the work done on the human genome and the like.

> Sh*tty data? Comes from the community. If the data and algorithms are so poor, and the author so superior, he should have been able to improve the circumstances.

Why? Aren't you assuming a lot about the incentives? What if the ground truth is simply that all the results are false due to a melange of bad practices? Do you think he'll get tenure for that? (That was a rhetorical question to which the answer is 'no'.) Then you know there's at least one very obvious way in which he could not improve the circumstances of poor data & algorithms.

He's not getting tenure because he doesn't have a PhD. According to LinkedIn, he has a master's degree awarded after four years of study [1], which often indicates someone who did not complete a PhD.

[1] http://www.linkedin.com/pub/frederick-ross/13/81a/47

According to his 2009 CV he was working on a PhD in biology back then and expecting to finish in 2011.

Given that he is not a professor it is not clear why he would be expected to be seeking tenure.

> In the realm of bioinformatics, let's ignore the work done on the human genome and the like.

He discusses this specifically in the rant. Are you saying he's wrong?

Depends. Subtle corruption of institutional research processes is unfortunately far too common. It means that there's nice low hanging fruit if you know where to look and have access to funding. But that, especially the latter, is a tall ask in almost every field.

Perhaps the algorithms aren't within his grasp. They could very well be paying for an out-of-the-box solution.

> he should have been able to improve the circumstances

Was anyone asking him to? Was anyone paying him to? No? Then it's an uphill battle and also not his responsibility. Leaving is saner.

My experience working as a scientific programmer is this: my colleagues aren't forthcoming. I could list case after case of failure to document or communicate crucial details that cost me days, weeks and even months of effort. But I won't, until I have another job lined up. If I were in the author's position (I'm in another field), I would insist that my colleagues--all of them, in whatever field I ended up working--be forthcoming about their work. This is non-negotiable. Being over-busy is no excuse. (It may be an excuse for not being forthcoming, but right or wrong, I couldn't care less--I would not work with such people if I could avoid it, for whatever reason.)

Academia rewards journal publication and does not adequately reward programming and data collection and analysis, although these are indispensable activities that can be as difficult and profound as crafting a research paper. At least the National Science Foundation has done researchers a small favor by changing the NSF biosketch format in mid-January to better accommodate the contributions of programmers and "data scientists": the old category Publications has been replaced with Products.

Naming is important to administrators and bureaucrats. It can be easy to underestimate the extent to which names matter to them. Now there is a category under which the contribution of a programmer can be recognized for the purpose of academic advancement. Previously one had to force-fit programming under Synergistic Activities or otherwise stretch or violate the NSF biosketch format. This is a small step, but it does show some understanding that the increasingly necessary contributions of scientific programmers ought to be recognized. The alternative is attrition. Like the author of the article, programmers will go where their accomplishments are recognized.

Still, reforming old attitudes is like retraining Pavlov's dogs. Scientific programmers are lumped in with "IT guys." IT as in ITIL: the platitudinous, highly non-mathematical service as a service as a service Information Technocracy Indoctrination Library. There is little comprehension that computer science has specialized. For many academics, scientific programmers are interchangeable IT guys who do help desk work, system and network administration, build websites, run GIS analyses, write scientific software and get Gmail and Google Calendar synchronization running on Blackberries. It is as if scientists themselves could be satisfied if their colleagues were hired as "scientists" or "natural philosophers" with no further qualification, as opposed to "vulcanologist" or "meteorologist" (to a first order of approximation).

Right now experimentalists generate data and then try to find computer people to analyse their data. However, in the not too distant future computer models will drive experimental research as hypothesis generation tools. Then the computer people will be seeking biology people (or robots) to run experiments to validate their hypotheses and there will be more respect for the field.

This seems to presume that scientific programming is merely a service to the important and more deserving persons who generate scientific hypotheses, from whom it can be decoupled and isolated, instead of being the collaborative effort that it is--if elevating the professional standing of scientific programmers must wait for the widespread adoption of automated hypothesis generation software. For example, the computation of ecosystem service indicators--what you might call the interface between biogeophysical models of Earth systems and economic and policy modeling--is an interdisciplinary and collaborative activity that relies heavily on computational technique and technology.

"I’m leaving bioinformatics to go work at a software company [...]"

"[bioinformatics] software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! [...]"

Well, the author is heading for a very bitter surprise...

You know, I'd be more inclined to listen to him if he didn't also completely decry almost all of modern biology, which (in my view) has been to the late 20th and early 21st centuries what physics was to the late 19th and early to mid 20th centuries.

I spent a year in a bioinformatics PhD program and got the feeling I was studying to be science's version of the business analyst. Not knowing enough about the biology or computation, but expected to speak the language of both. And what would my research consist of in such an applied science? Luckily I had another opportunity and became a software developer (which I'm happy with). The worst thing about the experience was listening to so many research presentations where I could tell the presenter didn't understand the science and could barely explain it.

Some thoughts on this article:

- This guy clearly has a limited understanding of the field. This quote is laughable: "There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction."

- As a bioinformatician, I feel sorry for this guy. Just like any other field, there are shitty places to work. If I was stuck in a lab where a demanding PI with no computer skills kept throwing the results of poorly designed experiments at me and asking for miracles, I'd be a little bitter too.

- Just like any other field, there are also lots of places that are great places to work and are churning out some pretty goddamn amazing code and science. I'm working in cancer genomics, and we've already done work where the results of our bioinformatic analyses have saved people's lives. Here's one high-profile example that got a lot of good press. (http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-...)

- I'm in the field of bioinformatics to improve human health and understand deep biological questions. I care about reproducibility and accuracy in my code, but 90% of the time, I could give a rat's ass about performance. I'm trying to find the answer to a question, and if I can get that answer in a reasonable amount of time, then the code is good enough. This is especially true when you consider that 3/4 of the things I do are one-off analyses with code that will never be used again. (largely because 3/4 of experiments fail - science is messy and hard like that). If given a choice between dicking around for two weeks to make my code perfect, or cranking out something that works in 2 hours, I'll pretty much always choose the latter. ("Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth)

- That said, when we do come up with some useful and widely applicable code, we do our best to optimize it, put it into pipelines with robust testing, and open-source it, so that the community can use it. If his lab never did that, they're rapidly falling behind the rest of the field.

- As for his assertion that bad code and obscure file formats are job security through obscurity, I'm going to call bullshit. For many years, the field lacked people with real CS training, so you got a lot of biologists reading a perl book in their spare time and hacking together some ugly, but functional solutions. Sure, in some ways that was less than optimal, but hell, it got us the human genome. The field is beginning to mature, and you're starting to see better code and standard formats as more computationally-savvy people move in. No one will argue that things couldn't be improved, but attributing it to unethical behavior or malice is just ridiculous.

tl;dr: Bitter guy with some kind of bone to pick doesn't really understand or accurately depict the state of the field.

" I could give a rat's ass about performance. I'm trying to find the answer to a question, and if I can get that answer in a reasonable amount of time, then the code is good enough"

This is the only bad point that a lot of people are aligned with.

The more time a program needs to finish, the more time you will need to run it again with some other dataset, and in turn - more time to find the right answer.

I really feel that people with a scientific and mathematics background should learn proper programming (not take a course in some language - but have actual experience). Design patterns, data structures, best practices, memory consumption, are all things that should be known before a person starts submitting code for these kinds of projects.

Spending time optimizing a program that you will use once is a waste. Sitting idle while waiting for a program to finish is also a waste. So I think it's reasonable to optimize for programmer time the first time, and then re-visit the design if you discover the code is getting reused and fed larger data sets.

Want to teach us? A bunch of us work right near AT&T Park in Mission Bay and would love to learn. Even a long day or two from you guys would be awesome. But as was alluded to, we can't pay you - we're poor as shit - especially when compared with you all.

What's the backstory on the author's tangent about the human genome? It sounded like the human genome project didn't actually do what the name implies.

Tell that to the tens of thousands of researchers who make use of the human reference genome daily. I don't even know what the guy is talking about there - imagining modern genetics or genomics without it is pretty much impossible.

The problem with bioinformatics is not "premature optimization", but rather no optimization at all.

Out of curiosity, what other computationally difficult problems are there?

I'm very interested in bioinformatics, but sadly don't know as much about the field as I'd like.

1. Gene networks is a big one: some proteins turn genes on or off. Some of those genes get translated into other proteins that turn genes on or off. How can you infer the interactions from experimental data? How can you figure out what these complex networks DO?

2. Predicting gene expression: where do proteins bind to the DNA? How can you predict what these proteins do once they are bound (add chemical tags to structural proteins, knock off structural proteins by bending DNA, etc)? How can you predict how frequently the gene will be transcribed? How does the 3D shape of the DNA affect this?

These are just two of many questions (biased towards my research interests of course). It is really funny that he mentions sequence alignment and phylogenetics as the two big problems, because people generally consider these to be boring, uncool, solved-well-enough-for-our-purposes problems nowadays and just trust the algorithms described by Durbin decades ago. It sounds like the writer really doesn't know bioinformatics that well...

One that comes immediately to mind is genome assembly, which is a hugely complex problem, and essential to a variety of fields that rely on re-piecing together the genome without a reference (or with a reference that is highly divergent from the sequence data).

Genome assembly relies heavily on sequence alignment. So: Is genome assembly hard just because sequence alignment is hard? Or would genome assembly present separate algorithmic problems even if there was a super-efficient solution to sequence alignment?

It is far more difficult than sequence alignment. Sequence alignment has quadratic complexity, while fragment assembly is NP-hard. See for example


Yes, for pairwise sequence alignment. The globally optimized multiple sequence alignment problem is NP-complete.
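To make the "quadratic" part concrete, here is a minimal sketch of global pairwise alignment scoring in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not something from the thread; real tools use substitution matrices and affine gaps.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a linear gap penalty.

    Fills a (len(a) + 1) x (len(b) + 1) dynamic-programming table one row
    at a time, so time is O(len(a) * len(b)) and memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

The two nested loops are the whole story for two sequences. The same recurrence over N sequences needs an N-dimensional table, which is one way to see why exact multiple alignment blows up.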

These are different sorts of alignments, with different sorts of math behind them.

Genome assembly is the shortest common superstring problem. It involves finding the arrangement and overlap of reads that minimizes the length of the overall sequence, given the expected errors in the read technology. It would still be hard even if all of the reads were perfect.

Sequence alignment looks at two or more sequences in their entirety, and does a best fit alignment using a given model of how substitutions and gaps can occur. This model may be based on chemical or evolutionary knowledge.
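For comparison, a minimal sketch of pairwise global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The scores used here (+1 match, -1 mismatch, -2 per gap) are illustrative assumptions, not values from any real substitution model, and the function name is made up. It only returns the optimal score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b.

    Fills the dynamic programming table row by row, so it takes
    O(len(a) * len(b)) time and O(len(b)) memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

The two nested loops are where the quadratic cost for the pairwise case comes from. Nothing in this table says anything about which of thousands of reads should be placed next to which, which is the separate problem assembly has to solve.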

A "super-efficient solution to sequence alignment" doesn't lead to a way to tell how the reads should be assembled into a single large sequence, even ignoring possible read errors.

An extra difficulty with genome assembly is that DNA often has lots and lots of repeated junk sequences that can confuse the algorithms. I don't work in bioinformatics, though, so I don't know how they usually get around this.

Repeats aren't necessarily junk (e.g. TAL Effectors http://en.wikipedia.org/wiki/TAL_effector#DNA_recognition). Resolving them requires long reads. PacBio is currently of interest as an alternative to Sanger sequencing for this, although the error rate of PacBio reads is a bit of an issue.

PacBio is dead, they just don't know it yet. BGI (or somebody, doesn't matter, BGI is just the obvious candidate) would need to buy 50 SMRT sequencers a year just for PacBio to stay in business. That seems unlikely given the lower cost and complexity of Illumina and Life sequencers.

I do PhD research in metabolomics -- one of the latest omics in bioinfo-- with the CS department in my university. At the moment, we're working on alignment and identification of metabolite data. The data is not big in the sense of genomics data, but messy and complex due to the nature of the instruments (mass spectrometer), which will not get better THAT much in the foreseeable future.

Definitely a computationally difficult problem, because while naive approaches work, they produce crappy results, wasting the output of tens of thousands of dollars of experiments. I see a big move towards applying statistical/machine learning methods and graph theory in our field.

A lot of the rants in the original article are correct with regard to prototyping and throwaway code. That's because researchers are rushing to get an MVP out. The truly good ones get turned into (usually open-source) products, where the code quality hopefully improves a fair bit.

If you're a CS person who's interested or considering a move into bioinfo, I wrote a blog post about it recently: http://www.joewandy.com/2013/01/getting-into-bioinformatics....

Protein folding is an interesting and computationally challenging task. So challenging that some groups have sort of given up on it and moved to other fields. Look up David Baker and Rosetta for more info. This is just an example; there are many, many problems to work on. I feel sorry for the author of the post: bioinformatics is only getting more interesting as our capacity to make experimental measurements grows. There have been so many interesting findings that are just the product of bioinformaticians digging into existing databases and analyzing them to come up with new theories that have since been experimentally validated.

Any type of network reconstruction (gene-gene, protein-protein, or gene-protein interaction networks): these are all very challenging and important computational problems in biology.

Really makes me want to learn more about molecular biology.

Any solid factual resources besides the references mentioned in this justified rant?

Biostars.org is a stackexchange-like site for bioinformaticians.

See there for answers to your question, eg:

* Best resources to learn molecular biology for a computer scientist. [1]

* What are the best bioinformatics course materials and videos (available online)? [2]

[1] http://www.biostars.org/p/3066/

[2] http://www.biostars.org/p/10766/

If you're interested in Next Generation Sequencing (the new "technology" OP referred to to replace microarrays), I wrote a 3-part series on my blog:

"A Hitchhikers Guide to Next Generation Sequencing"

Part1: http://blog.goldenhelix.com/?p=423

Part2: http://blog.goldenhelix.com/?p=490

Part3: http://blog.goldenhelix.com/?p=510

Also, yes molecular biologists with few exceptions know little more than fuck all about ecology. Hence the mostly gung-ho attitudes to GM of crop foods for example. Honestly. I've done real molecular biology work (simple commercial protein chemistry and molecular phylogenetics of mitochondrial DNA) and tried to start a PhD in ecology (failed due to funding issues and realising it was a dead end job wise).

There are a lot of problems in bioinformatics. Mainly, lack of reproducibility (ie "custom perl scripts"), poorly organized and characterized data and plenty of wheel reinvention (I heard Jim Kent, who first assembled the human genome, created his own version of wc [word of mouth, citation needed]).

The fact of the matter is that through high-throughput sequencing, microarrays, what have you, generation of biologically-meaningful results is possible.

There are a lot of problems in bioinformatics that need to be solved. GitHub has helped. More bioinformaticians are learning about good software development practices, and journal reviewers are becoming more enlightened about the merits of sharing source code.

Also see the discussion at the bioinformatics subreddit: http://www.reddit.com/r/bioinformatics/comments/179e9k/a_far...

Interesting to read since I made the same career move last year. I agree with about half of it but don't see a lot of value or useful advice here.

I find it curious that he stops to salute ecologists, since I was in an ecology lab. I liked my labmates and our perspective, but we didn't have any magical ability to avoid the problems he alludes to here.

I think a lot of his frustration comes down to not being more involved in the planning process. That's not a new problem. R.A. Fisher put it this way in 1938: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

Perhaps the idea that we can have bioinformatics specialists who wait for data is just wrong. Should we blame PIs who don't want to give up control to their specialists, or the specialists who don't push harder, earlier? Ultimately the problem will only be solved as more people with these skills move up the ranks. But the whole idea that we need more specialists working on smaller chunks of the problem may be broken from the start (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1183512/).

OK, I agree that there is some shitty work in this field, but he can't think that we are all in the same boat. For example, "Irene Pepperberg’s work with Alex the parrot dwarfs the scientific contributions of all other sequencing to date put together." This is not true. Bioinformatics is not just blindly sequencing new DNA, but analyzing data, and almost every new breakthrough in medicine is based on a direct (or indirect) bioinformatics analysis. I used to work in an agrobiotech company, and the sequencer was the first source of data for any breeding program. Bioinformatics was used to design primers for PCR to find molecular markers. Is there bad software out there? Yes, but I see this more as an opportunity than a problem. And the cause is not the need to hide something, but the lack of ability of biologists with no CS background in the field.

Maybe overblown, but it echoes complaints I've heard from other bioinformatics people.

Surely this means there's a goldmine waiting there for someone to produce a non-broken toolchain for bioinformatics?

Or is it even possible to produce standard tools? Maybe all the labs are too bespoke?

i totally disagree with Fred's negative view of bioinformatics. as "software is eating the world", it's actually bioinformatics that is eating biology. today's mainstream biology is dealing with an exploding amount of data from modern instruments, images, or clinical data collected every day, mostly machine readable. to stay up to date, a modern biologist / bioinformatician needs to think about biological problems in a "big-data" (i know, cliche) way, then try to gain some insight from the data with (computational) tools. today it's the algorithms, mathematical models and software packages on top of databases that pinpoint cancer SNPs and drive drug discovery. and today it's these same algorithms and math models driving how wet bench work is designed. if you think biological data are "shitty", i guess you have never seen the other kinds of unstructured data out there. so many scholars in other fields envy biologists and medical scientists for something called "PubMed". on the other hand, for those purely wet bench "biologists" who think computers are magic boxes that give answers, insights, and models with one push of a button, i do feel sorry for them. they are so last-gen, as they just don't have the essential techniques nowadays (just like a molecular biologist not knowing pcr).

This is a little discouraging - BioInformatics was my top choice for a Master's program I'm planning to start this year. The program at Melbourne Uni looks really good (accepts from three streams, Math/Stats, Biology or Computing and tailors the course based on your background). Maybe I should go for a more generic Machine Learning one and try to apply that to healthcare in some other field if things are really this bad.

As someone in the field, let me assure you: This article does not accurately reflect the state of the field.

Thanks for the reply. I wasn't basing this just on the article, there seem to be a fair number of comments here supporting a less-extreme version of what he's saying.

I'm just starting a PhD at Melbourne Uni in bioinformatics after working in the field for several years. Don't pay any attention to this is my advice. Bioinformatics is a field currently pulling itself up by its own bootstraps out of the realm of research into the clinic. That's a painful process to be sure, but IMHO it's the most profoundly exciting time to be part of any discipline. You are literally being a part of and watching history in the making. It's going to be messy, but there are chances to contribute here like no other field going around.

Interesting, thanks for the point of view. If you'd be interested in getting a beer or a coffee at any point, let me know - contact details in my profile.

Could you add an email to your profile? I'd like to email you regarding Masters courses at UniMelb.

Sure, done.

I am also in the field, and IMHO we are starting to get away from the worst excesses code quality wise i.e. things are getting better.

6 years ago using CVS or something like that was novel. Now not using GIT is. Big improvement!

Problems are still interesting and challenging.

Some things are going to suck in academia, as this guy points out. But it's a necessary step, and today's progress is almost always going to be tomorrow's shit. So quit bitching.

Biologists are almost never good coders, if they can code at all. But that's not what they do; they signed up for pipettes, not Python.

It's the programmers who wrote said shitty code that are to be blamed, but you can't hate underpaid and overworked PhD students who write this code even though it usually has nothing to do with their thesis (the math/algorithm is the main part; the deployable implementation is usually not the most important).

If you want good code and organized/accountable databases, go to industry. There's nothing new about this transition. The IMPORTANT part is that industry gives back to academia. So when you get an office with windows and a working coffee machine, remember to help make some PhD student's life a little easier by making part of your code open source.

Where does the Rosalind project (rosalind.info) fit into all of this, I'm wondering? It seems to be written by people who have actual understanding of the mappings between biology and informatics, with clear explanations of problems in terms of the programming challenge involved.

Surely they can't get that far without having some kind of sensible method?

Why is this on the front page or why is it relevant? It's kind of a rant. I did some work on a publication in this field and was published once; I don't think it is a horrible research program. There may exist some of the issues in bioinformatics described here but I don't think it is terribly productive.

Having worked in the bioinformatics industry as an SE for 9 years, I can both agree and disagree.

1. I agree that SE standards and good coding practice are completely absent in the bioinformatics world. I remember being asked to improve the speed of some sequence alignment tools and realizing that the source code was originally Delphi that had been run through a C++ converter. No comments, single monolithic file. The vast majority of the bioinformatics code I worked with was poorly written/documented Perl. In addition, a lot of bioinformatics guys don't understand SE process, so rather than having a coordinated engineering effort, you end up with a lot of "cowboy coding" with guys writing the same thing over and over.

2. I agree that productivity is very slow. This is a side product of research itself though. In the "real world" (quoted) where people need to sell software, time is the enemy. It's important to work together quickly to get a good product to market. In the research world, you get 2- or 5-year grants and no one seems to have much of a fire under them to get anything done (Hey, we're good for 5 years!). You would think that people would be motivated to cure cancer quickly (etc.), but it's not really the case. Research moves at a snail's pace - and consequently so do the productivity expectations of the bioinformatics group.

3. I disagree that research results from the scientists are garbage. Yes, it's true that some experiments get screwed up. However, if you have a lot of people running those experiments over and over, the bad experiments clearly become outliers. Replication in the scientific community is good because it protects against bad data this way. Somehow the author must have had a particularly bad experience.

4. Something the author didn't mention that I think is important to understand: most scientists have no idea how to utilize software engineering resources. The pure biologists are many times the boss, and don't really understand how to run a software division like bioinformatics. Many times a bioinformatics group is run by PhDs in CS who have never worked in industry and don't know anything about good SE practice or how to run a software project. A lot of the problems in the bioinformatics industry are directly related to poor management. Wherever you go you're going to have team members that have trouble programming, trouble with their work ethic, trouble with following direction. However, in a bioinformatics environment where these individuals are given free rein and are not working as a cohesive unit, you can see why there is so much terrible code and duplication.

This piece seems to have touched a nerve in the bioinformatics community, though I have no idea why. Much of what is said here is obvious to anyone working in academic research that requires programming expertise.

Yes, industry typically pays more than academia. Yes, most molecular biologists cannot code and rely on bioinformatics support. Yes, biological data is often noisy. Yes, code in bioinformatics is often research grade (poorly implemented, poorly documented, often not available). These are all good points that have been made many times more potently by others in the field like C. Titus Brown (http://ivory.idyll.org/blog/category/science.html). But they are not universal truths and exceptions to these trends abound. Show me an academic research software system in any field outside of biology that is as functional and robust as the UCSC genome browser (serving >500,000 requests a day) or the NCBI's PubMed (serving ~200,000 requests a day). To conclude from common shortcomings of academic research programming that bioinformatics is a "computational shit heap" is unjustified and far from an accurate assessment of the reality of the field.

From looking into this guy a bit (who I've never heard of before today in my 10+ years in the field), my take on what is going on here is that this is the rant of a disgruntled physicist/mathematician and self-proclaimed perfectionist (https://documents.epfl.ch/users/r/ro/ross/www/values.html), who moved into biology but did not establish himself in the field. From what I can tell contrasting his CV (https://documents.epfl.ch/users/r/ro/ross/www/cv.pdf) to his LinkedIn profile (http://www.linkedin.com/pub/frederick-ross/13/81a/47), it does not appear that he completed his PhD after several years of work, which is always a sign of something going awry and that someone has had a bad personal experience in academic research. I think this is the most important light to interpret this blog post in, rather than as an indictment of the field.

That said, I would also like to see bioinformatics die (or at least wither) and be replaced by computational biology (see differences in the two fields here: http://rbaltman.wordpress.com/2009/02/18/bioinformatics-comp...). Many of the problems that Ross has apparently experienced come from the fact that most biologists cannot code, and therefore two brains (the biologist's and the programmer's) are required to solve problems that require computing in biology. This leads to an abundance of technical and social problems, which, as someone who can speak fluently to both communities, pains me to see happen on a regular basis. Once the culture of biology shifts to see programming as an essential skill (like using a microscope or a pipette), biological problems can be solved by one brain, the problems that are created by miscommunication, differences in expectations, differences in background, etc. will be minimized, and situations like this will become less common.

I for one am very bullish that bioinformatics/computational biology is still the biggest growth area in biology, which is the biggest domain of academic research, and highly recommend students to move into this area (http://caseybergman.wordpress.com/2012/07/31/top-n-reasons-t...). Clearly, academic research is not for everyone. If you are unlucky, can't hack it, or greener pastures come your way, so be it. Such is life. But programming in biology ain't going away anytime soon, and with one less body taking up a job in this domain, it looks like prospects have just gotten that little bit better for the rest of us.

I agree that a lot of effort that is put into bioinformatics is wasted. But it's silly to say that bioinformatics hasn't contributed much to science, and naive to think that dysfunctional software development is less widespread outside of bioinformatics.

Fascinating HN thread. I work in the geoinformatics domain where many of the same comments apply. I agree scientists turned programmers are often poor software developers. Moreover, this group often belittles industry established best practices in software development. But in truth, the "pure" software engineer/computer scientist lacks sufficient domain expertise to accomplish something useful. Learning fluid dynamics requires many years of education. Ideally, you would like these two groups to work closely together and with mutual respect.

I largely agree with Fred's opinion on the shortcomings of bioinformaticians and the general attitude in the industry, but my personal experience was actually pretty positive. My past research was on building visualizations of the complicated biochemical processes, for use in educating undergrads. It was certainly more interesting than slogging through mounds of crappy data.

Just another data point for someone contemplating a career in BINF, although some purists might say that my work did not really fall under the same category.

Spelling error: 'technically apt', not 'ept'.

"Ept" means effective. As in "inept"

I don't understand this part:

> No one seems to have pointed out that this makes your database a reflection of your database, not a reflection of reality. Pull out an annotation in GenBank today and it’s not very long odds that it’s completely wrong.

In fact this entire article seems to be a rant on why bioinformatics as a field is rotting. But instead of ranting, surely something can be done about it?

Shouldn't we as hackers see this as an opportunity to revolutionize the field?

As a general rule, the people on the short end of the stick are the people least capable of producing change. Worse, change that they bring about tends to be good from a strict, technical viewpoint but has huge negative side effects that go unnoticed or deliberately ignored until it becomes difficult to distinguish the resultant system as a better one.

Rants like this, and providing interviews to third parties, are actually one of the more positive things that he could bring to the table: it provides information to people who aren't aware and inspires motivation in people who aren't entangled.

I don't know, but I think Fred is in a prime position to disrupt bioinformatics. He knows all the flaws, he knows all the problems. If I were him, I'd have seized the opportunity and worked on a hard problem.

Then again, I am in no position to judge what Fred should or should not do

It all begins with a rant.

Say for the purposes of argument that this thesis were true. What is there (if anything) to be done about it? I ask as a naive interested party with a CS background.

That's a shame. I just finished a uni module about bioinformatics. It seemed like a cool field where progress was being made, and as an undergraduate I could generate meaningful looking results by following very recent papers. I hope the field has some saving graces even if this is all true. The idea of CompSci folk working with biology folk to solve human problems inspired me a lot.

The author is exactly right about the quality of data in bioinformatics. There are datasets with genes named MAR1, DEC1, etc. getting mangled to 1-Mar, 1-Dec, because of Microsoft Excel autoformatting.
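A rough sketch of how one might flag this kind of damage in a gene-symbol column. The pattern is an assumption based only on the two examples above ("1-Mar", "1-Dec"); other spreadsheet locales or date formats would need different patterns, and the function name is made up. It can only flag suspicious values; recovering the original symbol still needs the source data.

```python
import re

# Assumed shape of a date-mangled symbol: day number, hyphen, English month
# abbreviation, e.g. "1-Mar" or "1-Dec". Other formats are not covered.
_DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def find_mangled_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


if __name__ == "__main__":
    column = ["TP53", "1-Mar", "DEC1", "1-Dec", "BRCA1"]
    print(find_mangled_symbols(column))
```

Running a check like this before analysis is cheap, and it at least tells you which rows to go back and verify.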


The bio in bioinformatics is the important bit. Informatics plays second fiddle, even in the name. Very few will appreciate your beautiful code, but many will appreciate you finding a cure for cancer. That is the reality of bioinformatics, most of the code has a short shelf life. If you luck out, your software may live longer, as is the case with samtools. That samtools code is crappy is true, still the much cleaner code alternatives, sambamba and bamtools, are not much used! Go figure.

Maybe bioinformatics is not the place to aim for great informatics. We do bioinformatics because of love of science first and foremost. This is frontier land, the wild west, and it pays to play quick and dirty. I would suggest to hang on to some best practices, e.g. modularity, TDD and BDD, but forget about appreciation. Dirty Harry, as a bioinformatician you are on your own.

To be honest, in industry it is not much different. These days, coders are carpenters. If you really want to be a diva, learn to sing instead.

molecular biology has been dead for years now, but the amount of money poured into it makes it impossible to publish its death certificate. Here is why and how it happened (among other things): http://www.youtube.com/watch?v=Y0b11S1FjXY

Come work with me in my genomic interpretation company. Fun application building, no data mess, big money!

>> I’m leaving bioinformatics to go work at a software company with more technically ept people and for a lot more money.

More money, good on you. Starting off your critique of your former colleagues with "technically ept people"... not going to get a lot of sympathy for the correctness of your work.

Everyone is jumping on that, but (while I had to look it up too) 'ept' actually is a real word:

from the OED:

ept, adj. Pronunciation: /ɛpt/ Etymology: Back-formation < inept adj.

  Used as a deliberate antonym of ‘inept’: adroit, appropriate, effective.
1938 E. B. White Let. Oct. (1976) 183, I am much obliged..to you for your warm, courteous, and ept treatment of a rather weak, skinny subject.

1966 Time 30 Sept. 7/1 With the exception of one or two semantic twisters, I think it is a first-rate job—definitely ept, ane and ert.

1976 N.Y. Times Mag. 6 June 15 The obvious answer is summed up by a White House official's sardonic crack: ‘Politically, we're not very ept.’

We have the term "adept" though, which is actually in common usage and fits the intended meaning here...

That was…surprisingly thorough.

That is the point of the OED: to be comprehensive and include real usages.

That's the point of any half-decent dictionary.

The OED is a gold standard, though.

James Murray was the true Scotsman.

Isn't it more likely that he just misspelled "apt"?

Well, ept is obviously a back-formation and a clever and amusing one.

Etymology is straight from Latin: ineptus, which is prefix in- plus aptus (fitting or suitable). Interestingly there's also inapt which is quite similar.

edit: aheilbut's research on this is much more thorough.

I used a similar back figuring when describing a co-worker who was in the wrong job... "He's not inept, he is inapt"

Someone's got a bad case of God Complex.

i'm not into bio, but i read articles on the latest developments. my sister also took bioinformatics, but the scope for it in India seems very limited.

have you checked out synthetic biology? will it be easy to understand when you have a degree in bioinformatics?
