I do genome mapping where our indexes won't entirely fit in memory. It would be very handy to be able to spin up a few of these instances, load the indexes from an EBS volume onto the local SSDs, then run for a couple of hours or so. This is a very I/O-intensive job that we need to run about once a week; the rest of the time the machines would sit idle.
SSDs would make our jobs run significantly faster. So much so that we've toyed with the idea of adding SSDs to our in-house cluster, but couldn't quite justify the costs. This might actually shift the cost savings to get our lab to migrate to EC2 as opposed to our in-house or university cluster.
I'm working on a data visualisation app, which is getting a lot of interest from biologists and bioinformaticians. I'd like to learn a bit more about your work. Can I email you somewhere? Or please drop me an email at hrishi@prettygraph.com. Thanks!
Based on this, it shouldn't take more than half a day under the worst circumstances (a single EBS volume with poor performance), and if you RAID together enough volumes, you can do it in about an hour. Correct me if I'm wrong, but you pay for EBS by size, not by physical disks, so the more volumes you can stripe your data across, the more throughput you're going to get.
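For the arithmetic behind those estimates, here's a rough back-of-the-envelope sketch; the index size and per-volume throughput numbers are illustrative assumptions on my part, not measured figures.

```python
# Rough estimate of how long it takes to copy an index from EBS onto local SSD.
# The 500GB index size and 25 MB/s per-volume throughput below are assumptions
# for illustration only, not measured EBS performance.

def copy_time_hours(index_gb, mb_per_sec_per_volume, n_volumes):
    """Time to stream index_gb gigabytes when striping reads across n_volumes."""
    total_mb = index_gb * 1024
    aggregate_throughput = mb_per_sec_per_volume * n_volumes  # MB/s
    return total_mb / aggregate_throughput / 3600

# A single sluggish EBS volume vs. an 8-volume RAID-0 stripe.
print(copy_time_hours(500, 25, 1))  # ~5.7 hours on one slow volume
print(copy_time_hours(500, 25, 8))  # ~0.7 hours across eight striped volumes
```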
You can get some of the data from the 1000 genomes project directly from Amazon, so you don't need to pay to download it. There's about 200TB of data there (so far).
What I'm working on is mapping those short sequences (50-75 bases) to the genome and then either looking for mutations or expression levels (how many of those reads map to a particular location). There are a couple of ways to do the mapping, but most of them these days use either a big hash table or a Burrows-Wheeler transform.
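As a very rough illustration of the hash-table approach (not how production aligners are actually implemented, and the k-mer size and mismatch tolerance are arbitrary), the idea is: index every k-mer of the reference, use a read's first k-mer as a seed, then verify candidate locations by counting mismatches.

```python
# Toy sketch of hash-table based read mapping. Real aligners (and the
# Burrows-Wheeler based ones) are far more sophisticated than this.
from collections import defaultdict

K = 11  # seed length; arbitrary choice for this sketch

def build_index(reference):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def map_read(read, reference, index, max_mismatches=2):
    """Seed with the read's first k-mer, then verify candidates by mismatch count."""
    hits = []
    for pos in index.get(read[:K], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) == len(read):
            mismatches = sum(a != b for a, b in zip(read, candidate))
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTGGCTAGCTAGGACCTTGACGTACGTATCGATCG"
index = build_index(reference)
print(map_read("GGCTAGCTAGGACCTTGACGTACGT", reference, index))  # [(8, 0)]
```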
Well, the raw output of a typical so-called "next-gen sequencing" machine (which is actually very current gen) is around 1TB (at least for the ones we use here).
That's the raw file, though; once processed (but not yet analyzed) I believe the sizes are around 50 to 100GB (but that's not really what I work on, so don't quote me on this).
The next steps vary depending on what you want to do exactly, but they usually involve alignment of base pairs (basically, trying to tie sequences of DNA together by their ends and seeing if they "fit").
Essentially you sequence tons of short bits of DNA and then either fit them together (assemble) or fit them to a reference (align). You can find example data sets in the Short Read Archive:
http://www.ncbi.nlm.nih.gov/sra/
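To make the "fit them together by their ends" idea concrete, here's a toy sketch of the overlap step of assembly; the minimum-overlap cutoff and the example sequences are made up for illustration.

```python
# Toy illustration of fitting reads together by their ends: find the longest
# suffix of one read that exactly matches a prefix of another. Real assemblers
# handle sequencing errors, reverse complements, and millions of reads.

def longest_overlap(left, right, min_overlap=5):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for n in range(max_len, min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0

a = "ACGTTGCAAGGCT"
b = "AAGGCTTTACGGA"
n = longest_overlap(a, b)
print(n, a + b[n:])  # 6 ACGTTGCAAGGCTTTACGGA
```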
CloudBurst (a Hadoop-based aligner) has a good description of an algorithm:
http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.p...
Though they can get much more sophisticated, and there are a number of open- and closed-source implementations... I only link this one because of the quality of the figure.
The data sets we work with in my group can be up to 400GB of compressed text for the reads from a single individual.
Another example from biology with a similar computational profile would be searching through a huge number of mass spectrometer outputs to identify the components in a new sample.