
Google Genomics: store, process, explore and share genomic data - fitzwatermellow
https://cloud.google.com/genomics/
======
i000
I work at an academic institution that generates and analyzes large amounts of
genomic data (clinical sequencing of cancer patients). We were really
impressed with the Google cloud and executed several large proof-of-concept
projects on it. Honestly, being able to spin up 1,000 machines to process tens
of thousands of samples within hours (using cheap preemptible machines) is a
huge step forward.

The biggest challenge we face is data protection. Sequencing data is not
considered protected health information, so theoretically HIPAA regulations do
not apply, but this may change in the future, so nobody is willing to take
responsibility. If sequencing data is considered PHI, HIPAA requires a BAA
between the institution and Google. The process of getting these in place was
so convoluted and difficult that it was just easier to use another provider
more closely associated with our institution.

~~~
timeu
I am wondering if you did a cost analysis. Last time I checked, a dedicated
on-premises HPC cluster was still cheaper than using the cloud. Actually, the
CPU hours are cheaper in the cloud, but data transfer (bandwidth) and storage
make it more expensive (this might not apply to storage anymore, but AFAIK
data-transfer costs are still an issue). What is your experience?

~~~
boulos
Just a note: you don't pay for sending data to any provider (ingress), only
for pulling data out / serving it (egress). You can put your GCS buckets
behind Cloudflare to take advantage of "interconnect" pricing of $0.04/GB.
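To make the difference concrete, here is a back-of-the-envelope sketch. The $0.04/GB interconnect rate is from the comment above; the ~$0.12/GB standard egress rate is an assumption for illustration only, so check the current pricing pages before relying on either number:

```python
# Rough egress cost comparison for pulling a cohort of genomes out of
# cloud storage. $0.04/GB is the "interconnect" rate quoted above; the
# $0.12/GB standard rate is an assumed figure for illustration.

def egress_cost(gb, rate_per_gb):
    """Cost in dollars to transfer `gb` gigabytes at `rate_per_gb`."""
    return gb * rate_per_gb

# e.g. 100 whole genomes at ~100 GB of BAM each = 10,000 GB
cohort_gb = 100 * 100

standard = egress_cost(cohort_gb, 0.12)      # assumed standard rate
interconnect = egress_cost(cohort_gb, 0.04)  # rate quoted above

print(f"standard:     ${standard:,.2f}")      # standard:     $1,200.00
print(f"interconnect: ${interconnect:,.2f}")  # interconnect: $400.00
```

The point is just that at cohort scale the per-GB rate dominates, which is why the ingress-free / egress-priced asymmetry matters.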

Disclaimer: I work on Compute Engine.

~~~
timeu
Good to know. Thanks for the info!

------
united893
Google is one of the lead investors in DNAnexus [1]; this must suck for them.

[1] - [http://www.fiercebiotech.com/story/google-ventures-backs-dnanexuss-15m-round-rd-cloud-platform/2014-01-03](http://www.fiercebiotech.com/story/google-ventures-backs-dnanexuss-15m-round-rd-cloud-platform/2014-01-03)

~~~
bitmapbrother
Google is also one of the major investors in Uber.

~~~
toomuchtodo
Google seems to be the adult who says, "Here kid, have some cash and go work
out the kinks for us."

Not that I blame them.

~~~
beambot
It's more like, "We know this is important. Someone will succeed. We should
diversify, and we don't have so much hubris as to believe we will always be
better than everyone else."

------
theophrastus
The majority of this data is generated via public funding. So over the crusty
decades there have arisen public databases of notably varied quality,
curation, and stability; e.g. GenBank, NCBI, SwissProt, EMBL, PIR, SBKB, PDB,
...quite a long list. (I've been involved in direct dealings with several of
these sites, and sometimes the histories and policies get very opaque.)

Finally, the question: this is public data (sometimes held in confidence, but
technically only for a "short" period), and Google is a private company. Why
should we place our trust in a for-profit organization to keep our public data
publicly available without charging access fees? Having dealt for decades with
the old model, I freely admit I have an open mind about this approach.

~~~
roye
I don't think their intent is to replace existing database servers. My
impression is that they have copied the data onto their servers to make
analysis on their cloud service more attractive. For example, the fact that
the 1000 Genomes data is there saves researchers from having to go to the
trouble of uploading it themselves.

~~~
jghn
Having worked with them a bit, my impression is that at a high level their
intent is to get people to spend money using Google cloud services. With that
in mind, it is not in their interest to annoy half of the field by playing
shenanigans with things like data ownership.

------
gabeiscoding
While I think it's great to have Google putting their weight behind
standardization efforts like the Global Alliance for Genomics and Health (GA4GH), I
really don't get the need to replace VCF and BAM files with API calls.

Ultimately, the "hard part" about genomics is not big-data requiring Spanner
and BigTable to get anything done. I actually wrote a blog post about this
this week:

[http://blog.goldenhelix.com/grudy/genomic-data-is-big-data-but-that-is-not-the-hard-part/](http://blog.goldenhelix.com/grudy/genomic-data-is-big-data-but-that-is-not-the-hard-part/)

Both BAM and VCF files can be hosted on a plain HTTP file server and
meaningfully queried through their BAI/TBI indexes. Visualization tools like
our GenomeBrowse or the Broad's IGV can already read S3-hosted genomic files
directly, without an API layer, and very efficiently (gzip-compressed blocks
of binary data). So I see the translation of the exact same data into an
API-only-accessible storage system, where I can't download the VCF and do
quick, iterative analysis on it, as more of a downside than a plus.
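For readers unfamiliar with how indexed queries over plain HTTP work: BAI/TBI indexes hand back 64-bit "virtual file offsets" whose high 48 bits are the byte offset of a BGZF block in the compressed file and whose low 16 bits are the offset within that block once decompressed (per the SAM/BAM spec). That is what lets a client fetch only the byte ranges it needs. A minimal sketch; `range_header` is an illustrative helper, not a real library call:

```python
# Decode BAI/TBI virtual file offsets and build the HTTP Range header
# a genome browser would issue against a plain file server.

def split_virtual_offset(voffset):
    """Split a BGZF virtual offset into (compressed_offset, within_block_offset)."""
    return voffset >> 16, voffset & 0xFFFF

def range_header(start_voffset, end_voffset):
    """HTTP Range header covering the compressed blocks for one index chunk."""
    cstart, _ = split_virtual_offset(start_voffset)
    cend, _ = split_virtual_offset(end_voffset)
    # Over-fetch to the end of the final block; BGZF blocks are <= 64 KiB.
    return {"Range": f"bytes={cstart}-{cend + 65535}"}

# Example: a chunk starting at compressed byte 1,000,000, offset 100 within the block
voff = (1_000_000 << 16) | 100
print(split_virtual_offset(voff))  # (1000000, 100)
```

Nothing here requires a server-side API; any HTTP server that honors Range requests will do, which is the point being made above.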

Disclaimer: I build variant interpretation software for NGS data at Golden
Helix. Our customers are often small clinical labs whose data size and volume
are not driving them to the cloud.

~~~
flux988
How do you think this compares against
[http://basepair.io](http://basepair.io) ?

~~~
gabeiscoding
Looks great, but I can't comment more as I haven't used it.

It looks to be solving the same problems as DNAnexus, Seven Bridges, BaseSpace
etc as a way to wrap open source tools in more user-friendly ways.

But it's orchestrating the production of a smaller set of data that still
needs the next step: human interpretation, report writing, family-aware
algorithms, and more complex annotations (the problem space Golden Helix is
in).

In other words, the automatable bits, which are not the hard part I mentioned
in my blog post.

------
halite
Google is getting a lot better at design, but I can't help noticing little
things. Like the "Get Started" section has three boxes with the first one
slightly bigger; how hard was it to make all of them the same size
([http://i.imgur.com/eGpjRDv.png](http://i.imgur.com/eGpjRDv.png))? It is a
minute detail, I know.

------
FreedomToCreate
So whose first thought was, "So this is how Google is going to accelerate
their own internal genetics research"?

------
astazangasta
At first glance I'm not super into this... this seems like a bunch of opaque,
hard-to-use utilities wrapped around already opaque, hard-to-use utilities
(like GATK). What am I getting for my effort? I'd rather just use the low-
level tool directly.

This seems more a way to pipe people into using Google Cloud Services for
storing data than it seems a useful way to do bioinformatics. As with most of
these cloud-genomics things, it's not useful to have canned analyses that I
can't extend (for example, I am so far from giving a shit about
transition/transversion ratios and would never think to store that information
in my variant table). What's the big plus?

~~~
davidglazer
We've found that different users want different levels of wrapping -- some
(like you) are experts on the low-level tools, and choose to use our low-level
compute, storage, and analysis services directly -- I'm biased, but I believe
Google Cloud's low-level services have features, like preemptible VMs and
nearline storage, that are a great fit for working with genomic data. Others
want gateways into bioinformatics-friendly environments like RStudio and
IPython; others want wrapped, packaged services; others want point-and-click
UIs, etc.

Disclaimer: I work on Google Genomics.

~~~
laurencerowe
The problem with spot-market and preemptible VMs is that existing
bioinformatics code often has run times of days to weeks at current
sequencing depths. What's really needed is a way to have a VM suspend instead
of shut down when costs get high, so it can resume once prices go down.

~~~
jghn
It depends. If you picture an entire pipeline what often happens is that you
end up with both long and short running tasks. Tuning your pipeline to an
environment like this involves figuring out what can go where. As an example
we turn around a whole genome in just over a day but that's comprised of
hundreds of individual units of work.

~~~
jsolson
Assuming you're running multiple pipelines at once you can add robustness in
this sort of scenario by running the long-running tasks on guaranteed VMs
(complete with live migration on GCE) and the short-running tasks on cheap
preemptible VMs.

Another feature that comes to mind for enabling this sort of work is GCE's
support for attaching a persistent disk volume to multiple VMs (read-write for
one, read-only for others) -- have each pipeline stage consume from a read-
only PD containing the output of the previous stage and write its results to a
PD shared with the VM(s) that handle the next stage. Have the VMs communicate
with each other for control plane operations ("hey, this directory contains
the complete results of stage <x>, one of you stage <y> workers can pick it
up" or "I have completed processing the stage <x> outputs for job <5>; they
can be removed or archived to GCS."). This strategy also allows independent
scaling of the workers responsible for each stage based on the computational
requirements of that stage, since a given PD can be shared read-only with many
VMs (with limited performance impact in the many-reader case, especially if
the readers are not hammering the same parts of the volume).
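The control-plane handoff described above can be sketched locally in a few lines. This is an illustrative pattern, not a GCE API: each stage writes its results into a directory on the shared disk and drops a sentinel file last, so a downstream worker never consumes a partially written directory.

```python
# Minimal sketch of the stage-handoff protocol: write results, then seal
# the directory with a sentinel; downstream workers only consume sealed
# directories. File names and layout are illustrative.

import os
import tempfile

SENTINEL = "_STAGE_COMPLETE"

def publish_stage_output(stage_dir, results):
    """Write a stage's results, then mark the directory complete."""
    os.makedirs(stage_dir, exist_ok=True)
    for name, data in results.items():
        with open(os.path.join(stage_dir, name), "w") as f:
            f.write(data)
    # The sentinel is written last, so a reader never sees partial output.
    open(os.path.join(stage_dir, SENTINEL), "w").close()

def ready_for_next_stage(stage_dir):
    """A downstream worker may consume the directory only once it is sealed."""
    return os.path.exists(os.path.join(stage_dir, SENTINEL))

with tempfile.TemporaryDirectory() as root:
    stage1 = os.path.join(root, "stage1")
    os.makedirs(stage1)
    print(ready_for_next_stage(stage1))            # False: still being written
    publish_stage_output(stage1, {"sample1.vcf": "..."})
    print(ready_for_next_stage(stage1))            # True: safe to pick up
```

On GCE the `stage_dir` would live on the read-only-shared persistent disk, and the "hey, this directory is complete" message above becomes the sentinel check.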

Another option for truly pipeline oriented processing (assuming the underlying
workers can generate data suitable for it) is to leverage something like Cloud
Dataflow -- then they'll handle the work balancing for you.

(note: I work on GCE)

~~~
i000
I have to say that PDs attached read-only to multiple VMs is a _very_
important part of my pipelines. And also a distinguishing feature of GCE.

Cloud Dataflow, on the other hand, is difficult to use in practice; one of the
main reasons is the large footprint of intermediate files (sometimes 10x the
size of the input).

Currently we are running monolithic pipelines on the cloud, where one instance
does the job from start to finish (if it gets preempted, I restart). I plan
for the next version to be "split apart" in some way, but I have not found a
good strategy/tooling. Any advice?
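One low-tech first step toward "split apart" (an illustrative sketch, not a recommendation of any particular tool): give each pipeline step an idempotent done-marker on the persistent disk and skip completed steps on restart, so a preempted instance resumes from the last finished step instead of from scratch. Step names and the marker scheme here are made up:

```python
# Resumable runner: each (name, fn) step gets a ".done" marker written
# only after it succeeds; on restart, marked steps are skipped.

import os
import tempfile

def run_resumable(steps, state_dir):
    """Run (name, fn) steps in order, skipping any already marked done."""
    executed = []
    for name, fn in steps:
        marker = os.path.join(state_dir, name + ".done")
        if os.path.exists(marker):
            continue  # finished before the last preemption
        fn()
        executed.append(name)
        open(marker, "w").close()  # mark done only after success
    return executed

with tempfile.TemporaryDirectory() as state:
    steps = [("align", lambda: None), ("call_variants", lambda: None)]
    print(run_resumable(steps, state))  # first run: ['align', 'call_variants']
    print(run_resumable(steps, state))  # "restart": [] -- nothing to redo
```

The same idea scales up if the markers live on a persistent disk or in GCS, so a fresh preemptible VM can pick up where the last one left off.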

~~~
jkff
Hi! I'm an engineer on the Dataflow team. I'd love to understand the issue you
mentioned about the size of intermediate files and help make Dataflow work well
for you. Could you send a note to dataflow-feedback@google.com and elaborate?

------
jamesblonde
Here are some good videos on the latest research on Hadoop and NGS data:
[https://www.youtube.com/playlist?list=PL5oElY7F-znCU_Ppb7YWJ...](https://www.youtube.com/playlist?list=PL5oElY7F-znCU_Ppb7YWJ8jifqbDttxOt)

------
vonklaus
Larry Page has been talking about this for a while. Interesting to see what
Alphabet does.

------
rhema
I don't have first-hand experience in genetics. However, some professors I met
who work with genes complained about the high cost of running AWS services. I
don't know how much of a market there is for this, but there is some.

------
awinter-py
Is there a privacy policy? The only search results are G's standard policy and
scarewords like 'G's expertise in privacy'. Are they mining this cross-client?

~~~
davidglazer
Short answer: it's your data. Period.

You decide who can see the data you store, and the default is private. That's
spelled out in lots of words in the legal terms somewhere; sorry it wasn't
clearer on the website.

Disclaimer: I work for Google Genomics.

~~~
hxrts
This needs to be front and center. Data protections / PHI are the #1 issue for
academic groups in the field. – bioinformatician

