When someone finds fault with the way a field conducts itself, I would implore them to constructively influence that field. You might be surprised how many are actually sympathetic to your concerns.
I'm not dismissing this author's concerns: to do that would really require knowing the molecular biology field (which is more than sequencing, it turns out). I do neuroscience right now, and programming can be a problem for some. But a constructive suggestion to change can have much more impact than a long rant.
It's a similar issue. I think statisticians are taking constructive steps to correct their path, since you know, ML is the new sexy thing. Bioinformatics could take a much longer time to self-correct though.
Although, as I mentioned in an earlier comment, Fred seems to be in a prime position to disrupt the bioinformatics field, since he seems to know all the problems that afflict it.
Is interest in advanced info tech really that widespread in those countries, or is it simply that the only people who can use Google there are government-sanctioned researchers? Could anyone familiar with the reason shed some light for the rest of us?
Pakistan's internet is generally open (except YouTube and pornography). But there is no widespread interest in ML particularly. Only a few companies - most of them outsourcing from the US.
In my experience, what happens is that biologists define the science, and they depend on the computer scientists / engineers to implement solutions to their computational problems. The computational people depend on the biologists to validate whatever results they produce. The iteration cycle can be painfully slow, especially for people used to telling machines what they want them to do, and getting results immediately. The proposition of changing that dynamic is not alluring to most people, but I still hope there will be some who try.
I spent five years working in bioinformatics, and this is exactly the attitude of both the researchers and the other developers on the projects I worked on. It was very frustrating.
My single most limited resource is programmer time. My time and the time of other people who work with me. I have access to loads of computers that sit idle all the time, even if it is on nights and weekends. There is zero opportunity cost to me in using these computers more fully. I have enough human work to do that I can wait for the results without having any wait states.
There can be a big opportunity cost in trying to rework a workflow so that it is more efficient and then test it thoroughly to ensure correctness. Doing this may seem more appealing to someone who is interested primarily in computational efficiency. But I am more interested in research efficiency, and so are my employers and funders.
Hi, I recognize your name as a legit bioinformatician, am a huge fan of the lab that you're currently in, and others should listen to you.
I'd like to add that for many projects, general reusable software engineering is not necessarily a huge advantage. Instead of verifying a single implementation, it's often better for somebody to reimplement the idea from scratch; if a second implementation in a different language written by a different programmer gets the same results, this is a much more thorough validation of the software than going over prototype software line by line.
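A toy sketch of that kind of cross-check, assuming a deliberately trivial task (reverse-complementing DNA). The function names are made up for illustration, and in practice the second implementation would come from a different programmer, ideally in a different language:

```python
import random

def revcomp_table(seq):
    """First implementation: complement via a translation table, then reverse."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def revcomp_loop(seq):
    """Second implementation, written independently: walk the sequence backwards."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    out = []
    for base in reversed(seq):
        out.append(pairs[base])
    return "".join(out)

def cross_check(trials=1000, seed=0):
    """Run both implementations on random inputs and count disagreements."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        length = rng.randint(0, 50)
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if revcomp_table(seq) != revcomp_loop(seq):
            mismatches += 1
    return mismatches

print("mismatches:", cross_check())
```

Agreement on random inputs doesn't prove either version correct, but a shared bug in two independent implementations is much less likely than a bug in one.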
Also, I've seen way too many software engineers come in with an enterprisey attitude of establishing all sorts of crazy infrastructure and get absolutely no work done. If Java is your idea of a good time, it's unlikely that you'll be an effective researcher (though it's not unheard of), because it's not good at maximizing single-programmer output, and not good at maximizing I/O or CPU or string processing. In research it's best to get results, fail fast fast fast, and move on to the next idea. If you're lucky, 1 in 20 will work out. Publish your crap, and if it's a good idea, it will be worth polishing the turd later, but it's better to explore the field than to spend too much time on an uninteresting area.
The only time you worry about efficiency is when it enables a whole other level of analysis. So, for example, UCSC does most of their work in C, including an entire web app and framework written in C, because when they were doing the draft assembly of the human genome a decade ago on a small cluster of computers that they scrounged from secretaries' desks over the summer, Perl wouldn't cut it.
Reproducible code: extremely important.
Correct code: extremely important.
Readable code: very important.
Efficient code: often not as important.
Even today, the UCSC Genome Browser is an example where efficient code is important. It is interactive software with many human users who can work much more efficiently when the browser is responsive. And with projects like ENCODE, there are now incredible amounts of data available from the browser that would not be easily possible with a less efficient system.
Very different from an analysis system that will be run a handful of times in batch mode.
You want Haskell. :)
FWIW, I have in the past gotten good results out of Java and C# (it's a lot easier in C#) by writing programs that generate bytecode at runtime, so they can use the JIT to further optimize performance. Getting the same results out of C would require a lot more work. This includes string processing - I wrote a regex compiler for Java at one point, easily outperforming java.util.regex.
And such things are not difficult - or at least, not difficult to me now, knowing all I know - perhaps 10 hours' work for a simple regex compiler. And that is how I would use tools like Java to optimize my own performance: adapt them to interpret or compile a language that is close to the problem domain. A slightly higher constant cost, with the aim of a much lower per-idea cost.
Which of the released turds do you consider to be polished?
In terms of next-generation sequence analysis, Heng Li's BWA mapper and Samtools libraries are fairly good. His coding style is a bit terse for my tastes, but it keeps out people who don't know what they're doing, it's very clear code for semi-complicated algorithms, and BWA is some of the most reliable software I use every day.
On the infrastructure side, Galaxy [https://main.g2.bx.psu.edu] is getting fairly good.
The BioConductor repository of R packages is extremely mixed. I don't like some of their architectural choices, but it's ended up working out OK.
I still use Michael Eisen's Cluster from a decade ago, along with Java TreeView.
"Look at the disgusting state of the samtools code base. Many more cycles are being used because people write garbage. For a tool that is intimately tied to research, the absence of associated code commentary and meaningful commit messages is very poor. The code itself is not well self documenting either."
As I said, the style is very terse, and I have my suspicions that this is by design to minimize the number of less-qualified programmers trying to submit sub-standard code back to the project. (Edit: since it's been 10 minutes and I still can't reply to tomalsky's comment, I should point out that my "suspicions" are a joke; read the linked code sample and judge its quality for yourself.)
I have dived deep into the samtools code, rewritten chunks of file I/O inside it, messed with alternate formats, and my personal experience is that it's been easier for me to change, adapt, and understand it than any other open-source C project I've tried to dive into, such as, say, GNU join.
If anybody can point to where samtools is using many more cycles than it has to, please let me know! The worst part about it is that the compression and decompression is not multithreaded, but that is being worked out, I believe.
I didn't link the source because I am ashamed to admit that I clicked on the reddit link that was posted in this thread.
I feel exactly the opposite. I'm suspicious of anyone that does not use AWK (or other Unix text utilities) as a standard tool for checking the integrity of multi-gigabyte files, or generating summaries. AWK is super-fast, allows highly flexible checks, and allows quick and reliable interaction with huge amounts of data in the way that a script can not.
To build the string, I had to concatenate 1024 of the 32-character strings into an intermediate string, and then concatenate these into the final monster, because concatenating just the 32-character strings took too long - a reallocation after every concatenation.
That was fun.
Bit-packing is simple. You spent a lot of time working around problems that shouldn't have existed in the first place. Even when using the approach you described, here is Python code which does what you described:
>>> byte_to_bits = dict((chr(i), bin(i)[2:].zfill(8)) for i in range(256))
>>> as_bits = "".join(byte_to_bits[c] for c in open("benzotriazole.sdf").read())
>>> chr(int(as_bits[:8], 2))
>>> chr(int(as_bits[8:16], 2))
The run-time was small enough that I didn't notice it.
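A Python 3 variant of the same idea, as a sketch. It works on a bytes object rather than the file (the sample bytes below are a stand-in, not the actual SDF contents, and the helper names are mine). It builds the bit string with a single join, so there is no reallocation after every concatenation, then packs it back to check the round trip:

```python
def to_bits(data):
    """Expand each byte into its 8-character binary form and join once."""
    return "".join(format(byte, "08b") for byte in data)

def from_bits(bits):
    """Pack a string of '0'/'1' characters back into bytes, 8 bits at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sample = b"benzotriazole"      # stand-in data for illustration
as_bits = to_bits(sample)
assert len(as_bits) == 8 * len(sample)
assert from_bits(as_bits) == sample
```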
The thing is, you succeeded in solving the problem, and are justly proud of your success. This is how a lot of scientists feel. But a lot of CS people look at the wasted work when there are simpler, better, more maintainable ways.
It is intentional fraud, no doubt about it; they restarted halted clinical trials. I was just pointing out they did sloppy work too.
Fundamental methodological error -> "Come on, these are competent people, you have to trust that whatever error they made didn't affect the final result."
False claim of accolade -> "How dare you fucking try to pass off this garbage as legitimate science?!?!?"
If you work with that mentality, you're asking for trouble. Well, not so much asking for trouble, but sending Trouble a voicemail that says "We're over here, you lazy bastard, just see if you can mess something up!"
Edited to add: some projects might even have a bug tracker that will already have problems you can tackle.
I don't do programming for fun, but I'll be visiting a local university soon, and could share this with students.
I proposed, implemented, and tested an 8 line change to our alignment tool that saved 6% cpu time. It took me two days, most of which was my spare time at home. This one program was using 15 cpu years every month. Nobody cared. It never went into production. I started interviewing for a new job and left shortly after that.
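To put a number on that, here is a back-of-the-envelope sketch using only the figures above. The assumptions are mine: that the 6% applies to the whole 15 CPU-years a month, and that usage stays flat over a year.

```python
# Figures quoted in the comment; flat usage over 12 months is an assumption.
cpu_years_per_month = 15
percent_saved = 6

saved_per_month = cpu_years_per_month * percent_saved / 100
saved_per_year = saved_per_month * 12

print("CPU-years saved per month:", round(saved_per_month, 2))
print("CPU-years saved per year:", round(saved_per_year, 2))
```

That works out to roughly 0.9 CPU-years a month, or about 10.8 a year, against two days of programmer time.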
Is it so in the US? Or where? Here in Russia it is far from true, at least in the top institutes. As long as you produce publishable results, you may do virtually whatever you want, and nowadays pretty much anything is publishable. And this way you get funding, too, because the funding agency doesn't seem to want you to solve some particular problem, it just wants to be sure your science keeps up with the world.
The downside here is that the academy usually pays badly. Thus it seems most successful labs work like 70/30 on commercial projects and "pure science". Anyway, when you work on commercial projects you usually get much more interesting results than you'd care to publish.
That does not mean that we're just lapdogs for industry or Mammon, but it does mean that we're selective in what we do and how we do it.
The tools are written by (in my experience) very smart bioinformaticians who aren't taught much computer science in school (you get a smattering, but mostly it's biology, math, chemistry, etc.). Ex:
The tools themselves are written by smart non-programmers (a very dangerous combination) and so you get all sorts of unusual conventions that make sense only to the author or organization that wrote it, anti-patterns that would make a career programmer cringe, and a design that looks good to no one and is barely useable.
Then, as he said, they get grants to spend millions of dollars on giant clusters of computers to manage the data that is stored and queried in a really inefficient way.
There's really no incentive to make better software because that's not how the industry gets paid. You get a grant to sequence genome "X". After it's done? You publish your results and move on. Sure, you carve out a bit for overhead but most of it goes to new hardware (disk arrays, grid computing, oh my).
I often remarked that if I had enough money, there would be a killing to be made writing genome software with a proper visual and user experience design, combined with a deep computer science background. My perfect team would be a CS person, a geneticist, a UX designer, and a visual designer. Could crank out a really brilliant full-stack product that would blow away anything else out there (from sequencing to assembly to annotation and then cataloging/subsequent search and comparison).
Except, I realized that most folks using this software are in non-profits, research labs, and universities, so - no, there in fact is not a killing to be made. No one would buy it.
I wrote a post about why GATK - one of the most popular bioinformatic tools in Next Generation Sequencing - should not be put into a clinical pipeline:
In terms of your ideal software strategy, I can speak to that as well, as I am actually attempting to do almost exactly what you are suggesting. My team is all masters in CS & Stats, with focus on kick-ass CG visualization and UX.
We released a free genome browser (visualization of NGS data and public annotations) that reflects this:
But you're right, selling software in this field is a very weird thing. It's almost B2B, but academics are not businesses and their alternative is always to throw more Post-Doc man-power at the problem or slog it out with open source tools (which many do).
That said, we've been building our business (in Montana) over the last 10 years through the GWAS era selling statistical software and are looking optimistically into the era of sequencing having a huge impact on health care.
I've seen you link to your blog post a couple of times now, and I still think it's misleading. I do wonder whether your conflict of interest (selling competing software) has led you to come to a pretty unreasonable conclusion. (My conflict of interest is that I have a Broad affiliation, though I'm not a GATK developer.)
In your blog post, you received output from 23andme. The GATK was part of the processing pipeline that they used. What you received from 23andme indicated that you had a loss of function indel in a gene. However, it turns out that upon re-analysis, that was not present in your genome; it was just present in the genome of someone else processed at the same time as you.
Somehow, the conclusion that you draw is that the GATK should not be used in a clinical pipeline. This is hugely problematic:
1) It's not clear that there were any errors made by the GATK. Someone at 23andme said it was a GATK error, but the difference between "user error" and "software error" can be blurred for advantage. It's open source, so can someone demonstrate where this bug was fixed, if it ever existed?
2) Now let's assume that there was truly a bug. Is it not the job of the entity using the software to check it to ensure quality? An appropriate suite of test data would surely have caught this error yielding the wrong output. Wouldn't it be as fair, if not more so, to say that 23andme should not be used for clinical purposes since they don't do a good job of paying attention to their output?
Your blog post shows, for sure, a failure at 23andme. Depending on whether the erroneous output was purely due to 23andme or if the GATK had a bug in production code, your post shows an interesting system failure: an alignment of mistakes at 23andme and in the GATK. But I really don't think it remotely supports the argument that the GATK is unsuitable for use in a clinical sequencing pipeline.
On your second point. 23andMe had every incentive to pay attention to their output, but it is fair to say it's their responsibility for letting this slip through. But, it's worth noting in the context of the OP rant, that 23andMe probably paid much more attention to their tools than most academics who often treat alignment and variant calling as a black box that they trust works as advertised.
So what I actually argue in the post (and should have stated more clearly in my summary here) was that GATK is incentivised, as an academic research tool, to quickly advance their set of features with the cost of bugs being introduced (and hopefully squashed) along the way.
This "dev" state of a tool is inappropriate for a clinical pipeline, and the GATK team's answer to that is a "stable" branch of GATK that will be supported by their commercial software partner. Good stuff.
Finally, I actually have no conflict of interest here, as Golden Helix does not sell commercial secondary analysis tools (like CLC Bio does). I wrote this from the perspective of someone who is a 23andMe consumer and who also needs to stay informed because I recommend upstream tools to our users (and, I might add, I would still recommend and use GATK for research use, with the caution to potentially forgo the latest release for a more stable one).
You know though, the conflict of interest dismissal is something I run into more than I would expect. I'm not sure if some commercial software vendor has acted in bad faith in our industry to deserve the cynicism or if this is inherited by default from the "academic" vs "industry" ethos.
Sure, I agree with that. And I would agree if you said "Using bleeding-edge nightly builds of %s for production-level clinical work is a bad idea," whether %s was the GATK or the Linux kernel. I would be in such complete agreement that I wouldn't even feel compelled to respond to your posts if that's what you had said originally, rather than saying, "the GATK ... should not be put into a clinical pipeline". The former is accepted practice industry-wide; the latter reads like FUD and cannot be justified by one anecdote.
> You know though, the conflict of interest dismissal is something I run into more than I would expect.
Regarding conflict of interest, my point is to understand your potential interests, and also to disclose my own so that you can see where I'm coming from. That's not a dismissal, it's a search for a more complete picture. Interested parties are often the most qualified commenters, anyway, but their conclusions merit review.
Hopefully people wouldn't dismiss my views because of my Broad connection, any more than they would dismiss yours if you sold a competing product.
GATK currently has no concept of a "stable" branch of their repo (Appistry is going to provide quarterly releases in the future, which is great).
The flag I am raising is that a "stable" release is needed before it gets integrated into a clinical pipeline. Because the Broad's reputation is so high, it is important to raise this flag, as otherwise researchers and even clinical bioinformaticians assume choosing the latest release of GATK for their black-box variant caller is as safe as an IT manager choosing IBM.
In my experience, this applies to accounting software, sensor data, computer-aided design, print manufacturing, healthcare, etc.
I imagine there's phases of maturity, something akin to CMM/SEI. Eventually there's enough people with a foot on both sides to bridge the gap.
It just takes time.
Maybe it's still in the early going, but I do see how it's going to be real difficult making a living doing this. OTOH, companies like CLC Bio seem like they're doing well for themselves...
So I disagree with you on your very last sentence (agree with the rest)
The trick is, academics often have excess manpower capacity in the form of grad students and post-docs. Even though personnel is usually one of the highest expenses on any given grant, they often don't look at ways to improve the efficiency of their research man-hours.
That's not a blanket rule, as we have definitely had success with the value proposition of research efficiency, but in general, a lot of the things businesses adopt to improve project time (like Theory of Constraints project management, Mindset/Skillset/Toolset matching of personnel, etc.) are of no interest to academic researchers.
As for whether there's "a killing to be made", it's kind of unclear so far.
For example, it isn't true at all that microarray data is worthless. The early data was bad, and it was very over-hyped, but with a decade of optimization of the measurement technologies, better experimental designs, and better statistical methods, genome-wide expression analysis became a routine and ubiquitous tool.
The claim that sequencing isn't important is ridiculous. It's the scaffold to which all of biological research can be attached.
There is a great deal of obfuscation, and reinventing well-known algorithms under different names (perhaps often inadvertently). There's also a lot of low-quality drivel on tool implementations or complete nonsense. This is driven largely by the need in academia to publish.
The other side of this problem is that in general, CS and computer scientists don't get much respect in biology. People care about Nature/Science/Cell papers, not about CS conference abstracts. Despite bioinformatics/computational biology not really being a new field anymore, the cultures are still very different.
Bioinformatics is hard, but too many careerists take advantage of difficulties and uncertainty to publish as many papers as they can get away with.
Minor quibble: genome assembly is definitely still an open problem that's computationally difficult. So is robust high dimension inference, but that falls more under statistics.
I've wanted to leave at least a dozen times too, for the better pay, for working with programmers that can teach me something, and to not have my work be interrupted by academic politics. But the people pissed at the status quo are the ones that are smart enough to see it's broken and try to fix it, and if we all leave, science is really fucked.
"Must be an expert in 18 technologies"
"Must have a PHD in Computer Science or Molecular Biology"
"Must have 12 years experience and post doctoral training"
It's delusional because they apply the requirements it took for themselves to get a job in Molecular Biology (long PHD, post doc, very low pay for first jobs) and just apply it carte blanche to all fields that may be able to aid in their pursuits. Especially when it comes to software engineering where it can often be extremely difficult to explain why you did not pursue a PHD.
In my geographic area, this salary range is somewhat below corporate IT work (say 10% to 15%), but generally higher than the typical university software dev job listing. The university is really bad about listing jobs and job requirements with laughable salaries. I have seen (in other departments) web app dev jobs that require significant front-end and back-end skillsets/experience and then pop a salary that is a full 50% less than entry level jobs for CS undergrads.
One problem is that hiring departments in that position will find someone to hire at that rate, so they think it was correct. From personal experience, I can verify that "good on-paper" candidates with exceptional credentials (say MS in CS, bunch of experience) from other depts who look to join our team are unable to write any code at the whiteboard at all (say a for loop in Java to println something). But to be fair, a recent job interview cycle one of my teammates performed produced exactly two candidates out of 16 who could do this, and only one of those could write a SQL statement that required a simple inner join. Most of those folks were external, so it's not just a problem inside the institution.
I have a number of cynical and embarrassing opinions about this situation.
The whiteboard is only useful as an aid in explaining an algorithm. If a candidate can do that without the whiteboard, even better.
My bigger concern is that for a job that specifically highlighted the need for at least some SQL skills and some Java expertise, a candidate that cannot, even after prompting, write a for loop in Java (or in any language, when offered the chance to do so in a "favorite" language) or write a SQL statement that joins two tables probably can't do much of anything, let alone work on interesting problems.
Here is the cold, hard truth - I know, both because of my own limitations and the opportunity of the job, that we are not going to get top % hackers. But if you apply to a job where the primary need is coding in blub, I think it's fair to expect a simple question or two about basic blub constructs. I myself would be nervous about whiteboard coding for something complex, but also generally offer (in a cover letter) to provide some code examples to talk through at an interview ahead of time.
I think it behooves us all to have at least some baseline expectation to demonstrate some competence. Remember, I'm not thinking of whiteboard coding of an algorithm or anything.
I think a very fair (and concerning to me) insight might be: if you can use Google and an IDE, can you do all that this job requires?
It's only delusional if they can't find people to fill the jobs. The idea that, as an outsider, you know what requirements they should use in their hiring process better than they do is perhaps more delusional.
I worked in bioinformatics for more than 10 years before I moved on, and in my experience they do have a lot of trouble finding people to fill positions, especially outside of massive government funded groups like the NIH. This often results in passing on competent software engineers with a B.Sc. that don't meet the requirements in favor of PHD level biology graduates who have taken a year or so of undergrad computer science courses. In my experience, this leads to many of the problems discussed (and exaggerated) by the OP. While some of these people are smart and produce good work, much of the time they produce poor quality software that gets the job done, but as inefficiently as possible, and they leave a code base that is virtually unusable. Overall, I mostly just wanted to say that it's a mindset they REALLY need to get past for the long term success of the industry.
It's not really the money that's skewed, it's their idea about the person they need for the job. They don't need someone with that background (most of the time), they just need a junior level software engineer in which case the pay scale may not be too bad. There's a problem in realizing this, however, when the standards for your own field (molecular biology for example) are extremely high, so you expect it of all others as well...
If you really feel strongly about something, write it dispassionately (normally some time after the event) and treat it like a dissertation, backed with case studies and citations.
Sh*tty data? Comes from the community. If the data and algorithms are so poor, and the author so superior, he should have been able to improve the circumstances.
This whole screed reads like an entitled individual who entered a profession, didn't get the glory, oh and yeah, academia doesn't pay well.
In the realm of bioinformatics, let's ignore the work done on the human genome and the like.
Why? Aren't you assuming a lot about the incentives? What if the ground truth is simply that all the results are false due to a melange of bad practices? Do you think he'll get tenure for that? (That was a rhetorical question to which the answer is 'no'.) Then you know there's at least one very obvious way in which he could not improve the circumstances of poor data & algorithms.
Given that he is not a professor it is not clear why he would be expected to be seeking tenure.
He discusses this specifically in the rant. Are you saying he's wrong?
Was anyone asking him to? Was anyone paying him to? No? Then it's an uphill battle and also not his responsibility. Leaving is saner.
Academia rewards journal publication and does not adequately reward programming and data collection and analysis, although these are indispensable activities that can be as difficult and profound as crafting a research paper. At least the National Science Foundation has done researchers a small favor by changing the NSF biosketch format in mid-January to better accommodate the contributions of programmers and "data scientists": the old category Publications has been replaced with Products.
Naming is important to administrators and bureaucrats. It can be easy to underestimate the extent to which names matter to them. Now there is a category under which the contribution of a programmer can be recognized for the purpose of academic advancement. Previously one had to force-fit programming under Synergistic Activities or otherwise stretch or violate the NSF biosketch format. This is a small step, but it does show some understanding that the increasingly necessary contributions of scientific programmers ought to be recognized. The alternative is attrition. Like the author of the article, programmers will go where their accomplishments are recognized.
Still, reforming old attitudes is like retraining Pavlov's dogs. Scientific programmers are lumped in with "IT guys." IT as in ITIL: the platitudinous, highly non-mathematical service as a service as a service Information Technocracy Indoctrination Library. There is little comprehension that computer science has specialized. For many academics, scientific programmers are interchangeable IT guys who do help desk work, system and network administration, build websites, run GIS analyses, write scientific software and get Gmail and Google Calendar synchronization running on Blackberries. It is as if scientists themselves could be satisfied if their colleagues were hired as "scientists" or "natural philosophers" with no further qualification, as opposed to "vulcanologist" or "meteorologist" (to a first order of approximation).
"[bioinformatics] software is written to be inefficient, to use memory poorly, and the cry goes up for bigger, faster machines! [...]"
Well, the author is heading for a very bitter surprise...
- This guy clearly has a limited understanding of the field. This quote is laughable: "There are only two computationally difficult problems in bioinformatics, sequence alignment and phylogenetic tree construction."
- As a bioinformatician, I feel sorry for this guy. Just like any other field, there are shitty places to work. If I was stuck in a lab where a demanding PI with no computer skills kept throwing the results of poorly designed experiments at me and asking for miracles, I'd be a little bitter too.
- Just like any other field, there are also lots of places that are great places to work and are churning out some pretty goddamn amazing code and science. I'm working in cancer genomics, and we've already done work where the results of our bioinformatic analyses have saved people's lives. Here's one high-profile example that got a lot of good press. (http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-...)
- I'm in the field of bioinformatics to improve human health and understand deep biological questions. I care about reproducibility and accuracy in my code, but 90% of the time, I could give a rat's ass about performance. I'm trying to find the answer to a question, and if I can get that answer in a reasonable amount of time, then the code is good enough. This is especially true when you consider that 3/4 of the things I do are one-off analyses with code that will never be used again. (largely because 3/4 of experiments fail - science is messy and hard like that). If given a choice between dicking around for two weeks to make my code perfect, or cranking out something that works in 2 hours, I'll pretty much always choose the latter. ("Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth)
- That said, when we do come up with some useful and widely applicable code, we do our best to optimize it, put it into pipelines with robust testing, and open-source it, so that the community can use it. If his lab never did that, they're rapidly falling behind the rest of the field.
- As for his assertion that bad code and obscure file formats are job security through obscurity, I'm going to call bullshit. For many years, the field lacked people with real CS training, so you got a lot of biologists reading a perl book in their spare time and hacking together some ugly, but functional solutions. Sure, in some ways that was less than optimal, but hell, it got us the human genome. The field is beginning to mature, and you're starting to see better code and standard formats as more computationally-savvy people move in. No one will argue that things couldn't be improved, but attributing it to unethical behavior or malice is just ridiculous.
tl;dr: Bitter guy with some kind of bone to pick doesn't really understand or accurately depict the state of the field.
This is the only bad point that a lot of people are aligned with.
The more time a program needs to finish, the more time you will need to run it again with some other dataset, and in turn - more time to find the right answer.
I really feel that people with a scientific or mathematics background should learn proper programming (not just take a course in some language, but get actual experience). Design patterns, data structures, best practices, and memory consumption are all things a person should know before they start submitting code for these kinds of projects.
I'm very interested in bioinformatics, but sadly don't know as much about the field as I'd like.
These are just two of many questions (biased towards my research interests, of course). It is really funny that he mentions sequence alignment and phylogenetics as the two big problems, because people generally consider these to be boring, uncool, solved-well-enough-for-our-purposes problems nowadays and just trust the algorithms described by Durbin decades ago. It sounds like the writer really doesn't know bioinformatics that well...
Genome assembly is the shortest common superstring problem. It involves finding the arrangement and overlap of reads that minimizes the length of the overall sequence, given the expected errors in the read technology. It would still be hard even if all of the reads were perfect.
Sequence alignment looks at two or more sequences in their entirety, and does a best fit alignment using a given model of how substitutions and gaps can occur. This model may be based on chemical or evolutionary knowledge.
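To make that concrete, here is a minimal sketch of global pairwise alignment scoring (Needleman-Wunsch style). The function name and the scoring values (match, mismatch, gap) are made up for illustration; a real substitution and gap model would come from chemical or evolutionary data.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Illustrative linear gap penalty; the scores are assumptions,
    not values taken from any particular tool.
    """
    # prev[j] is the best score for aligning the current prefix of a
    # against b[:j]; the first row aligns the empty prefix to gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            ))
        prev = curr
    return prev[-1]
```

Note that this compares two whole sequences against each other. It says nothing about the order in which many short reads should be stitched together.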
A "super-efficient solution to sequence alignment" doesn't lead to a way to tell how the reads should be assembled into a single large sequence, even ignoring possible read errors.
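For contrast, the assembly side can be sketched as greedy overlap merging. This is only a toy heuristic under strong assumptions: error-free reads, a single strand, no repeats. The helper names are hypothetical, and greedy merging is not guaranteed to find the shortest result in general.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap.

    Assumes perfect reads; real assemblers must also handle read
    errors, reverse complements and repeats.
    """
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""
```

Even in this toy form the search is over orderings of all reads at once, which is a different problem from aligning two sequences, no matter how fast the alignment step is.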
Definitely a computationally difficult problem because while naive approaches work, they produce crappy results, wasting the result of tens of thousands of dollars of experiments. I see a big move towards applying statistical/machine learning methods and graph theory in our field.
A lot of the rants in the original article are correct with regard to prototyping and throwaway code. That's because researchers are rushing to get an MVP out. The truly good ones get turned into (usually open-source) products, where the code quality hopefully improves a fair bit.
If you're a CS person who's interested or considering a move into bioinfo, I wrote a blog post about it recently: http://www.joewandy.com/2013/01/getting-into-bioinformatics....
Any solid factual resources besides the references mentioned in this justified rant?
See there for answers to your question, eg:
* Best resources to learn molecular biology for a computer scientist. 
* What are the best bioinformatics course materials and videos (available online)? 
"A Hitchhikers Guide to Next Generation Sequencing"
The fact of the matter is that through high-throughput sequencing, microarrays, what have you, generation of biologically-meaningful results is possible.
There are a lot of problems in bioinformatics that need to be solved. GitHub has helped. More bioinformaticians are learning about good software development practices, and journal reviewers are becoming more enlightened about the merits of sharing source code.
I find it curious that he stops to salute ecologists, since I was in an ecology lab. I liked my labmates and our perspective, but we didn't have any magical ability to avoid the problems he alludes to here.
I think a lot of his frustration comes down to not being more involved in the planning process. That's not a new problem. R.A. Fisher put it this way in 1938: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
Perhaps the idea that we can have bioinformatics specialists who wait for data is just wrong. Should we blame PIs who don't want to give up control to their specialists, or the specialists who don't push harder, earlier? Ultimately the problem will only be solved as more people with these skills move up the ranks. But the whole idea that we need more specialists working on smaller chunks of the problem may be broken from the start (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1183512/).
Surely this means there's a goldmine waiting there for someone to produce a non-broken toolchain for bioinformatics?
Or is it even possible to produce standard tools? Maybe all the labs are too bespoke?
6 years ago, using CVS or something like that was novel. Now not using Git is. Big improvement!
Problems are still interesting and challenging.
Biologists are almost never good coders, if they can code at all. But that's not what they do; they signed up for pipettes, not Python.
It's the programmers who wrote said shitty code who are to blame, but you can't hate underpaid and overworked PhD students who write this code even though it usually has nothing to do with their thesis (the math/algorithm is the main part, the deployable implementation is usually not the most important).
If you want good code and organized/accountable databases, go to industry. There's nothing new about this transition. The IMPORTANT part is that industry gives back to academia. So when you get an office with windows and a working coffee machine, remember to help make some PhD student's life a little easier by making part of your code open source.
Surely they can't get that far without having some kind of sensible method?
1. I agree that SE standards and good coding practice are completely absent in the bioinformatics world. I remember being asked to improve the speed of some sequence alignment tools and realizing that the source code was originally Delphi that had been run through a C++ converter. No comments, single monolithic file. The vast majority of the bioinformatics code I worked with was poorly written/documented Perl.
In addition, a lot of bioinformatics guys don't understand SE process, so rather than having a coordinated engineering effort, you end up with a lot of "cowboy coding" with guys writing the same thing over and over.
2. I agree that productivity is very slow. This is a side product of research itself though. In the "real world" (quoted) where people need to sell software, time is the enemy. It's important to work together quickly to get a good product to market. In the research world, you get 2/5 year grants and no one seems to have much of a fire under them to get anything done (Hey, we're good for 5 years!). You would think that people would be motivated to cure cancer quickly (etc.), but it's not really the case. Research moves at a snail's pace, and consequently so do the productivity expectations of the bioinformatics group.
3. I disagree that research results from the scientists are garbage. Yes, it's true that some experiments get screwed up. However, if you have a lot of people running those experiments over and over, the bad experiments clearly become outliers. Replication in the scientific community is good because it protects against bad data this way. Somehow the author must have had a particularly bad experience.
4. Something the author didn't mention that I think is important to understand: most scientists have no idea how to utilize software engineering resources. The pure biologists are often the boss, and don't really understand how to run a software division like bioinformatics. Often a bioinformatics group is run by PhDs in CS who have never worked in industry and don't know anything about good SE practice or how to run a software project. A lot of the problems in the bioinformatics industry are directly related to poor management. Wherever you go you're going to have team members that have trouble programming, trouble with their work ethic, trouble with following direction. However, in a bioinformatics environment where these individuals are given free rein and are not working as a cohesive unit, you can see why there is so much terrible code and duplication.
Yes, industry typically pays more than academia. Yes, most molecular biologists cannot code and rely on bioinformatics support. Yes, biological data is often noisy. Yes, code in bioinformatics is often research grade (poorly implemented, poorly documented, often not available). These are all good points that have been made many times more potently by others in the field like C. Titus Brown (http://ivory.idyll.org/blog/category/science.html). But they are not universal truths and exceptions to these trends abound. Show me an academic research software system in any field outside of biology that is as functional and robust as the UCSC genome browser (serving >500,000 requests a day) or the NCBI's PubMed (serving ~200,000 requests a day). To conclude from common shortcomings of academic research programming that bioinformatics is a "computational shit heap" is unjustified and far from an accurate assessment of the reality of the field.
From looking into this guy a bit (I'd never heard of him before today in my 10+ years in the field), my take on what is going on here is that this is the rant of a disgruntled physicist/mathematician and self-proclaimed perfectionist (https://documents.epfl.ch/users/r/ro/ross/www/values.html), who moved into biology but did not establish himself in the field. From what I can tell from contrasting his CV (https://documents.epfl.ch/users/r/ro/ross/www/cv.pdf) with his linkedin profile (http://www.linkedin.com/pub/frederick-ross/13/81a/47), it does not appear that he completed his PhD after several years of work, which is always a sign of something going awry and that someone has had a bad personal experience in academic research. I think this is the most important light to interpret this blog post in, rather than as an indictment of the field.
That said, I would also like to see bioinformatics die (or at least wither) and be replaced by computational biology (see differences in the two fields here: http://rbaltman.wordpress.com/2009/02/18/bioinformatics-comp...). Many of the problems that Ross has apparently experienced come from the fact that most biologists cannot code, and therefore two brains (the biologist's and the programmer's) are required to solve problems that require computing in biology. This leads to an abundance of technical and social problems, which, as someone who can speak fluently to both communities, pains me to see happen on a regular basis. Once the culture of biology shifts to see programming as an essential skill (like using a microscope or a pipette), biological problems can be solved by one brain, the problems created by miscommunication, differences in expectations, differences in background, etc. will be minimized, and situations like this will become less common.
I for one am very bullish that bioinformatics/computational biology is still the biggest growth area in biology, which is the biggest domain of academic research, and highly recommend students to move into this area (http://caseybergman.wordpress.com/2012/07/31/top-n-reasons-t...). Clearly, academic research is not for everyone. If you are unlucky, can't hack it, or greener pastures come your way, so be it. Such is life. But programming in biology ain't going away anytime soon, and with one less body taking up a job in this domain, it looks like prospects have just gotten that little bit better for the rest of us.
Just another data point for someone contemplating a career in BINF, although some purists might say that my work did not really fall under the same category.
"Ept" means effective. As in "inept"
I don't understand this part:
> No one seems to have pointed out that this makes your database a reflection of your database, not a reflection of reality. Pull out an annotation in GenBank today and it’s not very long odds that it’s completely wrong.
In fact this entire article seems to be a rant on why bioinformatics as a field is rotting. But instead of ranting, surely something can be done about it?
Shouldn't we as hackers see this as an opportunity to revolutionize the field?
Rants like this, and providing interviews to third parties, are actually one of the more positive things that he could bring to the table: it provides information to people who aren't aware and inspires motivation in people who aren't entangled.
Then again, I am in no position to judge what Fred should or should not do
Maybe bioinformatics is not the place to aim for great informatics. We do bioinformatics because of love of science first and foremost. This is frontier land, the wild west, and it pays to play quick and dirty. I would suggest to hang on to some best practices, e.g. modularity, TDD and BDD, but forget about appreciation. Dirty Harry, as a bioinformatician you are on your own.
To be honest, in industry it is not much different. These days, coders are carpenters. If you really want to be a diva, learn to sing instead.
More money, good on you. Starting off your critique of your former colleagues with "technically ept people"... not going to get a lot of sympathy for the correctness of your work.
from the OED:
Etymology: Back-formation < inept adj.
Used as a deliberate antonym of ‘inept’: adroit, appropriate, effective.
1966 Time 30 Sept. 7/1 With the exception of one or two semantic twisters, I think it is a first-rate job—definitely ept, ane and ert.
1976 N.Y. Times Mag. 6 June 15 The obvious answer is summed up by a White House official's sardonic crack: ‘Politically, we're not very ept.’
The OED is a gold standard, though.
Etymology is straight from Latin: ineptus, which is prefix in- plus aptus (fitting or suitable). Interestingly there's also inapt which is quite similar.
edit: aheilbut's research on this is much more thorough.
have you checked out synthetic biology? will it be easy to understand when you have a degree in bioinformatics?