
Big data in genomics: The $1k genome has arrived - cedricr
http://www.nature.com/nature/journal/v527/n7576_supp/full/527S2a.html
======
searine
The bottleneck in genomics hasn't been cost since about 2012.

The chokepoint is analysis. Most biologists don't know how to program, most
programmers don't know the biological context, and neither group knows
statistics well.

I work with so many scientists whose only thought is to sequence first and ask
questions later. Usually all the real work ends up falling on the shoulders of
one skilled researcher while the rest look on like some unionized road crew.

It's only going to get worse, but the good news is that if you are one of the
biologists who can program and use statistics, you're in good shape. There is
already so much idle data out there that you'll never have to spend a dime on
sequencing.

~~~
adenadel
People say this all the time, but for some of the most common applications of
high-throughput sequencing there are very good canned solutions (built on open
source software) that you can pay for. DNAnexus, Seven Bridges, and Illumina
BaseSpace all provide cloud storage and analysis. Unless you are doing a
custom prep for your sequencing, one of these probably has an analysis
solution for you.

~~~
BKPetkov
Is the time required to upload data to the cloud ever a problem with these
solutions? Of course, it depends on what you are trying to do, but suppose you
were working with thousands of genomes?

~~~
epistasis
The sequencers can stream to a data analysis center as the data is being
generated.

It takes about a 100 Mbit stream per $1M of sequencing capital, so the network
connectivity to transfer to a data center is a tiny, tiny cost of the whole
ordeal.
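
A rough back-of-envelope sketch of that ratio (the per-run output and run time
below are illustrative assumptions, not exact instrument specs):

```
# Back-of-envelope: can a 100 Mbit/s link keep up with a sequencer?
# The per-run output and run time are assumptions for illustration.

run_output_gb = 1000      # assumed: ~1 TB of output per run
run_time_days = 3         # assumed: ~3-day run

link_mbit_per_s = 100
link_gb_per_day = link_mbit_per_s / 8 / 1000 * 86400   # ~1080 GB/day

produced_gb_per_day = run_output_gb / run_time_days    # ~333 GB/day

print(f"link capacity:    {link_gb_per_day:.0f} GB/day")
print(f"sequencer output: {produced_gb_per_day:.0f} GB/day")
print(f"link utilisation: {produced_gb_per_day / link_gb_per_day:.0%}")
```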

However, paying for AWS storage is pretty prohibitive, unless you're at a
small scale. So big centers will build their own storage facilities.

Small data producers like the ones the thread author talks about can often use
AWS more cost-efficiently than building a compute cluster. However, they need
to budget for that, which is not always considered. They may also need to
fight their institute's core center so that they can use DNAnexus.

~~~
laurencerowe
S3 storage is pretty cheap; it's the data egress that really costs.

For academic centers, though, there is often an incentive to move things
in-house due to the different treatment of capital expenditures and the
opportunity to externalize some of your costs from your grant onto central
services.

~~~
epistasis
Data transfer costs less than a single year of Glacier storage, so while it's
pricey I wouldn't call egress a major portion of the cost.

Keeping this data for less than 5-10 years is pretty questionable, since it's
so expensive to generate. Eventually it may be cheaper to store the DNA itself
and resequence if it needs to be looked at again. However, if you're storing
petabytes, it's going to be much more economical to have your own storage and
compute than to use AWS, particularly at the rate that academic centers pay
for sysadmins.

~~~
laurencerowe
Running a public data portal, our egress is higher than our storage costs. (We
now proxy downloads through a direct connect to our university network...)

Remember to account for future reductions in storage costs. S3 has come down
from $0.1500/GB month in 2010 to $0.0300/GB month today. And the recently
introduced infrequent access storage tier is under half that again at
$0.0125/GB month. It's now significantly cheaper to use S3/Azure/Google than
running the storage ourselves.
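
A rough sketch of the arithmetic, using the S3 prices quoted above (the
Glacier and egress rates are assumptions for illustration only):

```
# Rough monthly storage cost comparison for sequencing data in the cloud.
# S3 standard and infrequent-access prices are the ones quoted above;
# the Glacier and egress rates are assumed for illustration.

PRICES_PER_GB_MONTH = {
    "S3 standard (2010)": 0.1500,
    "S3 standard (now)":  0.0300,
    "S3 infrequent":      0.0125,
    "Glacier (assumed)":  0.0070,
}
EGRESS_PER_GB = 0.09   # assumed internet egress rate

def monthly_cost(terabytes):
    gb = terabytes * 1000
    return {tier: gb * price for tier, price in PRICES_PER_GB_MONTH.items()}

# One ~200 GB whole genome vs. a petabyte-scale archive:
for tb in (0.2, 1000):
    print(f"\n{tb} TB stored:")
    for tier, cost in monthly_cost(tb).items():
        print(f"  {tier:>20}: ${cost:,.2f}/month")
    print(f"  one full download at the assumed egress rate: "
          f"${tb * 1000 * EGRESS_PER_GB:,.2f}")
```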

------
epistasis
Though that $1000 produces 100GB of data, after processing there's probably
only about 100MB of features left for machine learning, at most, and most of
that will be incidental. With enough data we can hope to build a better filter
between signal and noise.

Until now, biology has had a huge problem that most big-data settings don't:
far, far more features than labels. With enough patients' data the matrix will
become squarish, but that's still a long way off.
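
A toy sketch of why that matters: with far more features than samples, even
pure noise contains features that look predictive (synthetic data, for
illustration only):

```
# Toy illustration of the "far more features than labels" problem:
# with p >> n, some feature will look strongly associated with the labels
# even when every feature is pure noise.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 100, 100_000   # e.g. 100 patients, 100k variant features
X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, 2, n_samples)      # random labels, no real signal

# Pick the feature that covaries most strongly with the labels.
scores = np.abs(X.T @ (y - y.mean()))
best = scores.argmax()
print(f"best 'marker' correlates at {np.corrcoef(X[:, best], y)[0, 1]:.2f} "
      "with labels that are pure noise")
```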

~~~
A_Beer_Clinked
I found this link: [https://medium.com/precision-medicine/how-big-is-the-human-g...](https://medium.com/precision-medicine/how-big-is-the-human-genome-e90caa3409b0)

In summary:

> 1\. In a perfect world (just your 3 billion letters): ~700 megabytes
> 2\. In the real world, right off the genome sequencer: ~200 gigabytes
> 3\. As a variant file, with just the list of mutations: ~125 megabytes
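
A rough sanity check of those three numbers (the coverage and bytes-per-base
figures below are assumptions for illustration):

```
# Quick sanity check of the three sizes quoted above.

bases = 3_000_000_000                  # ~3 billion letters, haploid genome

# 1. "Perfect world": 2 bits per base (A/C/G/T), no quality scores or overhead.
print(f"2-bit genome: ~{bases * 2 / 8 / 1e6:.0f} MB")                      # ~750 MB

# 2. Raw sequencer output: every base is read ~30x over, with a quality score,
#    so roughly 2 bytes per sequenced base (both numbers assumed here).
coverage = 30
print(f"raw reads at {coverage}x: ~{bases * coverage * 2 / 1e9:.0f} GB")   # ~180 GB

# 3. Variant file: only the few million positions where you differ from the
#    reference, at a few dozen bytes per record (again, rough assumptions).
variants, bytes_per_record = 4_500_000, 30
print(f"variant (VCF) file: ~{variants * bytes_per_record / 1e6:.0f} MB")  # ~135 MB
```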

~~~
BKPetkov
How long does it take (and with what computational bandwidth) to produce a
125MB variant file from 200GB raw sequence data?

~~~
adenadel
Depending on the pipeline you use and the compute resources available, you
could have a full workflow done in anywhere from several hours to a couple of
days. Illumina BaseSpace is free (for now) and has some example data sets with
a bunch of canned pipelines for analysis if you're interested in trying it for
yourself. [https://basespace.illumina.com/](https://basespace.illumina.com/)
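
For a sense of what those pipelines do under the hood, here's a minimal sketch
of one common open-source route from FASTQ to VCF (bwa + samtools + bcftools).
This is not what BaseSpace runs, the file names are placeholders, and the
wall-clock time depends heavily on hardware:

```
# Minimal sketch of a FASTQ -> VCF workflow using bwa, samtools and bcftools.
# File names are placeholders; the reference must already be bwa-indexed.
import subprocess

ref = "GRCh38.fa"
fq1, fq2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# 1. Align paired-end reads and sort the alignments.
subprocess.run(
    f"bwa mem -t 16 {ref} {fq1} {fq2} | samtools sort -@ 8 -o sample.bam -",
    shell=True, check=True)
subprocess.run(["samtools", "index", "sample.bam"], check=True)

# 2. Pile up and call variants, writing a compressed VCF.
subprocess.run(
    f"bcftools mpileup -f {ref} sample.bam"
    " | bcftools call -mv -Oz -o sample.vcf.gz",
    shell=True, check=True)
```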

~~~
jghn
You're not going to get from raw reads to a VCF on a whole genome in several
hours.

~~~
adenadel
With the right hardware and software you can. Edico's Dragen claims 20 minutes
for BCL -> VCF [1]. With Microsoft Research's SNAP aligner and 450GB of memory
you can get whole-genome alignment in ~30 minutes, and then variant calling
can be done in a couple of hours.

1\. [http://www.edicogenome.com/dragen/dragen-
gp/](http://www.edicogenome.com/dragen/dragen-gp/)

------
mschuster91
What I find most worrying about mass-collecting fully sequenced DNA (or DNA at
all) is that law enforcement, the military, spy agencies, or politicians will
inevitably want access to the data.

Given enough sequenced DNA and a free-fall in sequencing costs, it might
become feasible to identify exactly who took a dump in public just from the
DNA. And well, the Israelis are already doing this with dog dumps
([http://uk.reuters.com/article/2008/09/16/uk-israel-dogs-
idUK...](http://uk.reuters.com/article/2008/09/16/uk-israel-dogs-
idUKLG37942520080916)), so applying the same tech to humans is not far away.

Fucking scary if you ask me.

------
bayesianhorse
I think the currently more interesting application of sequencing is pathogen
detection. Pathogens have smaller and (mostly) simpler genomes, and tracking
them and their features improves epidemiology and the choice of treatments.

~~~
damurdock
Pathogen identification is indeed a very exciting application for NGS. In case
you're interested, here[0] is a paper about a tool called SURPI (Sequence-
based Ultra-Rapid Pathogen Identification) which was designed for that
purpose. Also, here[1] is a case report from the NEJM where SURPI was used to
diagnose a patient with Neuroleptospirosis, which allowed him to be treated
quickly and eventually recover. SURPI isn't the only horse in this game, of
course, but I've worked with it before so it immediately came to mind.

[0]: "A cloud-compatible bioinformatics pipeline for ultrarapid pathogen
identification from next-generation sequencing of clinical samples"
[http://genome.cshlp.org/content/24/7/1180.long](http://genome.cshlp.org/content/24/7/1180.long)

[1]: "Actionable Diagnosis of Neuroleptospirosis by Next-Generation
Sequencing"
[http://www.nejm.org/doi/full/10.1056/NEJMoa1401268](http://www.nejm.org/doi/full/10.1056/NEJMoa1401268)
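
For a flavor of the core idea (this is not SURPI's actual code): discard reads
that look like host sequence, then match what's left against pathogen
reference k-mers. A toy sketch with made-up sequences:

```
# Toy sketch of metagenomic pathogen identification: drop host-like reads,
# then classify the rest by k-mer overlap with pathogen references.
# Sequences below are made up; real pipelines index whole genomes.

def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

host_ref = "ACGT" * 50
pathogen_refs = {"leptospira_toy": "GATTACAGGCTTACCGGTAA" * 10}

host_kmers = kmers(host_ref)
pathogen_kmers = {name: kmers(ref) for name, ref in pathogen_refs.items()}

def classify(read, min_hits=3):
    rk = kmers(read)
    if len(rk & host_kmers) >= min_hits:
        return "host"
    for name, pk in pathogen_kmers.items():
        if len(rk & pk) >= min_hits:
            return name
    return "unclassified"

for read in ["ACGTACGTACGTACGTACGTACGTACGT",
             "GATTACAGGCTTACCGGTAAGATTACAGGCTTACC"]:
    print(read[:20], "->", classify(read))
```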

~~~
tridint
The clinical work they're doing is great, but the code is problematic. It's a
bunch of Perl and Python duct-taped together with shell scripts. From the
GitHub repo:

Shell 84.3%, Perl 8.9%, Python 6.3%, C 0.5%

Check out the source
[https://github.com/chiulab/surpi](https://github.com/chiulab/surpi)

------
mjpuser
A quick search for public genome data led me here:
[http://www.completegenomics.com/public-data/69-Genomes/](http://www.completegenomics.com/public-data/69-Genomes/).
I wonder if there will be a time when you can search millions of genomes
through a rich UI, comparing your own genome with others, etc.

~~~
jhull
For more genomic data check out www.solvebio.com

You'll need to register for a (free) API key to access public data, although
we're removing that requirement in the next few days (I work here.)

