Google Genomics: store, process, explore and share genomic data (cloud.google.com)
162 points by fitzwatermellow on Jan 9, 2016 | 53 comments



I work at an academic institution that generates and analyzes large amounts of genomic data (clinical sequencing of cancer patients). We were really impressed with the Google cloud and executed several large proof-of-concept projects on it. Honestly, being able to spin up 1000 machines to process 10s of thousands of samples within hours (using cheap preemptible machines) is a huge step forward.

The biggest challenge we face is data protection. Sequencing data is not considered protected health information, so theoretically HIPAA regulations do not apply, but this may change in the future, so nobody is willing to take responsibility. If sequencing data is considered PHI, HIPAA requires a BAA between the institution and Google. The process of getting these in place was so convoluted and difficult that it was just easier to use another provider more closely associated with our institution.


Glad to hear this worked out for you (particularly preemptible VMs, which I worked on). Sorry about the HIPAA/BAA issues. Which institution are you with, so I can have someone do the work and give you a better experience next time?

Disclaimer: I work on Compute Engine.


Yes please -- if you're able to share which institution you're with, I'd like to follow up on the BAA hassles. (We do regularly sign BAAs with institutions, but I know that, as with many things legal, simple things don't always stay simple.)

Disclaimer: I work on Google Genomics.


I'd love to do more with Google at my university but we couldn't get Google to even start talking about signing our BAA or other agreements. It was a non-starter for our legal department.

We ended up moving towards AWS because they were willing to sign. It sucked from my perspective because we already had such a tie-in with Google with our email/collaboration platform.

In the couple years since then our agreements have only gotten stronger.

The best way for Google to address this is to sign something with Internet2 and then the institutions can use that contract instead of negotiating separately.


In our case Google was very proactive and willing to sign a BAA. But the terms of those agreements have to satisfy both partners, which for some reason ultimately did not happen.


For the layman, is a BAA this? https://en.wikipedia.org/wiki/Broad_Agency_Announcement

(The other terms were a bit easier to google)

Here are the other terms:

HIPAA: the acronym for the Health Insurance Portability and Accountability Act that was passed by Congress in 1996

PHI: Protected health information


BAA = business associate agreement. A contract between an organization that handles protected health information (e.g. a research lab at a university) and a business that does work with that data (e.g. Google Genomics).

Edit: link to relevant definition http://searchhealthit.techtarget.com/definition/HIPAA-busine...


No, it's a Business Associate Agreement, part of HIPAA. (See a sample here: http://www.hhs.gov/hipaa/for-professionals/covered-entities/...)

Basically an agreement between organizations that there is an understanding that PHI (protected health information) is at stake and that both parties are aware of their obligations to protect it.


Just curious, was this primarily hotspot or other smaller panels? We looked at this internally as well but the available machine types have very limited ram for things like the exome panels we're starting to work with.


Please take another look; we introduced custom machine types back in November:

  http://googlecloudplatform.blogspot.com/2015/11/introducing-Custom-Machine-Types-the-freedom-to-configure-the-best-VM-shape-for-your-workload.html
That lets you adjust your RAM-to-vCPU ratio (as low as ~1 GiB per vCPU and up to 6.5 GiB).
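
For example, the machine type is just a string of the form custom-<vCPUs>-<memory in MB>; here's a rough sketch (unofficial, helper name made up) that composes one and checks the documented constraints:

  # Rough sketch, not official tooling. Custom machine types are named
  # "custom-<vCPUs>-<memory MB>", e.g. "custom-4-5120" for 4 vCPUs / 5 GiB.
  # Constraints as documented: 1 or an even number of vCPUs, memory a
  # multiple of 256 MB, and roughly 0.9-6.5 GB of memory per vCPU.
  def custom_machine_type(vcpus, memory_mb):
      if vcpus != 1 and vcpus % 2 != 0:
          raise ValueError("vCPU count must be 1 or an even number")
      if memory_mb % 256 != 0:
          raise ValueError("memory must be a multiple of 256 MB")
      per_vcpu_gb = memory_mb / 1024.0 / vcpus
      if not 0.9 <= per_vcpu_gb <= 6.5:
          raise ValueError("memory per vCPU must be ~0.9-6.5 GB")
      return "custom-%d-%d" % (vcpus, memory_mb)

  # Pass it as zones/<zone>/machineTypes/custom-16-106496 when creating a VM.
  print(custom_machine_type(16, 106496))  # 16 vCPUs, 104 GB RAM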

Disclaimer: I work on Compute Engine.


I am wondering if you did a cost analysis. Last time I checked, a dedicated on-premises HPC cluster was still cheaper than using the cloud. Actually, the CPU hours are cheaper in the cloud, but data transfer (bandwidth) and storage make it more expensive (this might not apply to storage anymore, but AFAIK data transfer costs are still an issue). What is your experience?


Of course we did this, but cost is not the only consideration.

Frankly, CPU-hours and TB-years are still within 5-10% of the total cost of providing sequencing results to the patients. What is really important is that the cloud allows us to determine exactly our cost-per-analysis without paying for excess capacity or maintenance.

What also matters is reliability, redundancy, and peak capacity, where the cloud (both Google and the other provider) wins hands down.


Just a note: you don't pay for sending data to any provider (ingress), only for pulling data out / serving it (egress). You can put your GCS buckets behind Cloudflare to take advantage of "interconnect" pricing of $0.04/GB.

Disclaimer: I work on Compute Engine.


Good to know. Thanks for the info!


As boulos points out, ingress is free -- the other question is how much you actually need to egress. Transfer within a GCP region is generally free as well, so unless you really need your giant result set in its raw form, you can potentially just leave it in GCP and do any analysis/summarizing work you might otherwise have done externally within GCP.

Taking this to the extreme, you might even fire up a Windows VM so you could run some analysis or reporting tool (say, Tableau) within GCP, and then only need to egress the dashboard or report it generated. You'd still have the raw data, but you would only be paying to store it. Especially if you're only keeping it around for archival purposes, you could put it in a Nearline bucket and really minimize your costs.
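
For the archival piece, a minimal sketch with the Python google-cloud-storage client (bucket and object names are made up):

  # Minimal sketch; assumes the google-cloud-storage Python client and
  # default application credentials. Names below are placeholders.
  from google.cloud import storage

  client = storage.Client()
  bucket = client.bucket("my-lab-archive")
  bucket.storage_class = "NEARLINE"  # cheaper per-GB class for rarely read data
  bucket.create(location="US")

  # Park the raw result set in Nearline; only the summarized report ever
  # needs to leave GCP (and incur egress).
  blob = bucket.blob("runs/2016-01/variants.vcf.gz")
  blob.upload_from_filename("variants.vcf.gz")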

(note: I work on GCE)


You can use https://aws.amazon.com/directconnect/ to reduce egress costs from AWS down to $0.02/GB.


You could already do the same on AWS. And AWS spot instances are way, way cheaper than Google's offerings.

I bet your university has hundreds of computers idling at night - why not use them?


Sounds like exactly what NPS (NetApp Private Storage) was designed for.

I'm confident a Google version could be spun up if there were the appropriate demand.


Google is one of the lead investors in DNAnexus [1]; this must suck for them.

[1] - http://www.fiercebiotech.com/story/google-ventures-backs-dna...


Google is also one of the major investors in Uber.


Google seems to be the adult who says, "Here kid, have some cash and go work out the kinks for us."

Not that I blame them.


It's more like, "We know this is important. Someone will succeed. We should diversify, and we don't have so much hubris as to believe we will always be better than everyone else."


I don't see it that way. The objective of Alphabet's GV is to make as much ROI as possible, regardless of what the other companies within Google are doing. So the fact that Google or some other business within Alphabet has a similar offering is really irrelevant to how GV will be measured. I view it more along the lines of hedging your bets. GV could win or lose with their investment, and Google could win or lose with their solution, but in the end Alphabet is usually going to win.


The majority of this data is generated via public funding. So over the crusty decades there have arisen public databases of notably varied quality, curation, and stability; e.g. GenBank, NCBI, SwissProt, EMBL, PIR, SBKB, PDB, ...quite a long list. (I've been involved in direct dealings with several of these sites and sometimes the histories and policies get very opaque)

Finally, the question: this is public data (sometimes held in confidence, but technically only for a "short" period), and Google is a private company. Why should we place our trust in a for-profit organization to keep our public data publicly available without charging access fees? Admitting freely that, having dealt for decades with the old model, I do have an open mind about this approach.


I don't think their intent is to replace existing database servers. My impression is that they have copied the data onto their servers to make analysis on their cloud service more attractive. For example, the fact that the 1000 Genomes data is there saves researchers from having to go to the trouble of uploading it themselves.


Having worked with them a bit, my impression is that at a high level their intent is to get people to spend money using Google cloud services. With that in mind, it is not in their interest to annoy half of the field by playing shenanigans with things like data ownership.


> I don't think their intent is to replace existing database servers.

We thought the same with Freebase and the Knowledge Graph. It will happen again.


While I think it's great to have Google putting their weight behind standardization efforts like the Global Alliance for Genomics and Health (GA4GH), I really don't get the need to replace VCF and BAM files with API calls.

Ultimately, the "hard part" about genomics is not big data that requires Spanner and BigTable to get anything done. I actually wrote a blog post about this just this week:

http://blog.goldenhelix.com/grudy/genomic-data-is-big-data-b...

Both BAM and VCF files can be hosted on a plain HTTP file server and be meaningfully queried through their BAI/TBI indexes. Visualization tools like our GenomeBrowse or the Broad's IGV can already read S3-hosted genomic files directly, without an API layer, and very efficiently (gzip-compressed blocks of binary data). So I see the translation of the exact same data into an API-only-accessible storage system, where I can't download the VCF and do quick and iterative analysis on it, as more of a downside than a plus.
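
To make that concrete, here's a minimal sketch with pysam (htslib underneath), assuming the .bai index sits next to the BAM and htslib can fetch it; the URL and region are placeholders:

  # Sketch only: pysam/htslib range-requests a remote BAM over HTTP(S) using
  # its .bai index, so only the gzip-compressed blocks covering the region
  # are downloaded. URL and coordinates are placeholders.
  import pysam

  bam = pysam.AlignmentFile("https://example-bucket.s3.amazonaws.com/sample.bam")
  # Count reads overlapping an ~80 kb region on chromosome 17.
  n = sum(1 for _ in bam.fetch("17", 41196312, 41277500))
  print("reads in region:", n)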

Disclaimer: I build variant interpretation software for NGS data at Golden Helix. Our customers are often small clinical labs whose data size and volume are not driving them to the cloud.


How do you think this compares against http://basepair.io ?


Looks great, but I can't comment more as I haven't used it.

It looks to be solving the same problems as DNAnexus, Seven Bridges, BaseSpace etc as a way to wrap open source tools in more user-friendly ways.

But it's orchestrating the production of a smaller set of data that still needs the next step of human interpretation, report writing, family-aware algorithms, and the most complex annotations (the problem space Golden Helix is in).

In other words, the automatable bits, which are not the hard part that I mentioned in my blog post.


Google is getting a lot better at design, but I can't help noticing little things. For example, the "Get Started" section has three boxes with the first one slightly bigger; how hard was it to make all of them the same size? (http://i.imgur.com/eGpjRDv.png) It's a minute detail, I know.


So whose first thought was, "So this is how Google is going to accelerate their own internal genetics research"?


At first glance I'm not super into this... this seems like a bunch of opaque, hard-to-use utilities wrapped around already opaque, hard-to-use utilities (like GATK). What am I getting for my effort? I'd rather just use the low-level tool directly.

This seems more a way to pipe people into using Google Cloud services for storing data than a useful way to do bioinformatics. As with most of these cloud-genomics things, it's not useful to have canned analyses that I can't extend (for example, I am so far from giving a shit about transition/transversion ratios and would never think to store that information in my variant table). What's the big plus?


We've found that different users want different levels of wrapping -- some (like you) are experts on the low-level tools and choose to use our low-level compute, storage, and analysis services directly -- I'm biased, but I believe Google Cloud's low-level services have features, like preemptible VMs and Nearline storage, that are a great fit for working with genomic data. Others want gateways into bioinformatics-friendly environments like RStudio and IPython; others want wrapped, packaged services; others want point-and-click UIs, etc.

Disclaimer: I work on Google Genomics.


I'm really excited about moving to R/Python computing for my bioinformatics work. It's a much better workflow than the typical method of emailing MATLAB files to an in-house Xeon workstation, and it'll be easier to virtualize. This is something I've seen in dozens of labs, and there's a huge amount of room to streamline workflows.

Disclaimer: I work in a neurobio lab but I wish I worked for Google Genomics.


The problem with spot-market and preemptible VMs is that much of the existing bioinformatics code has run times of days to weeks with current sequencing depths. What's really needed is a way to have a VM suspend instead of shut down when costs get high, so it can resume once prices go down.
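
The usual workaround I've seen (sketched below with made-up checkpoint helpers) is to poll GCE's preemption flag and checkpoint rather than lose the whole run, but that only works if your tool can checkpoint at all:

  # Hedged sketch: poll the metadata server's preemption flag and checkpoint
  # before the ~30 s shutdown window closes. job_done, run_next_unit, and
  # checkpoint_to_gcs are hypothetical stand-ins for the actual pipeline.
  import requests

  PREEMPTED_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                   "instance/preempted")

  def preempted():
      r = requests.get(PREEMPTED_URL, headers={"Metadata-Flavor": "Google"})
      return r.text.strip() == "TRUE"

  while not job_done():
      run_next_unit()            # keep individual units small (minutes, not days)
      if preempted():
          checkpoint_to_gcs()    # persist state so a fresh VM can resume
          break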


It depends. If you picture an entire pipeline, what often happens is that you end up with both long- and short-running tasks. Tuning your pipeline to an environment like this involves figuring out what can go where. As an example, we turn around a whole genome in just over a day, but that's comprised of hundreds of individual units of work.


Assuming you're running multiple pipelines at once you can add robustness in this sort of scenario by running the long-running tasks on guaranteed VMs (complete with live migration on GCE) and the short-running tasks on cheap preemptible VMs.

Another feature that comes to mind for enabling this sort of work is GCE's support for attaching a persistent disk volume to multiple VMs (read-write for one, read-only for others) -- have each pipeline stage consume from a read-only PD containing the output of the previous stage and write its results to a PD shared with the VM(s) that handle the next stage. Have the VMs communicate with each other for control plane operations ("hey, this directory contains the complete results of stage <x>, one of you stage <y> workers can pick it up" or "I have completed processing the stage <x> outputs for job <5>; they can be removed or archived to GCS."). This strategy also allows independent scaling of the workers responsible for each stage based on the computational requirements of that stage, since a given PD can be shared read-only with many VMs (with limited performance impact in the many-reader case, especially if the readers are not hammering the same parts of the volume).
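
The attach step itself is just an API call; a rough sketch via the compute v1 API (project, zone, disk, and instance names are placeholders):

  # Rough sketch using googleapiclient against the compute v1 API; all names
  # are placeholders. Note a standard PD must be detached from its read-write
  # writer before it can be attached read-only elsewhere.
  from googleapiclient import discovery

  compute = discovery.build("compute", "v1")
  disk_url = "projects/my-project/zones/us-central1-f/disks/stage1-output"

  # Stage-1 writer gets the disk read-write while it produces its output.
  compute.instances().attachDisk(
      project="my-project", zone="us-central1-f", instance="stage1-writer",
      body={"source": disk_url, "mode": "READ_WRITE"}).execute()

  # Later, once stage 1 is done and the disk is detached from the writer,
  # each stage-2 worker mounts the same disk read-only.
  for worker in ["stage2-worker-1", "stage2-worker-2"]:
      compute.instances().attachDisk(
          project="my-project", zone="us-central1-f", instance=worker,
          body={"source": disk_url, "mode": "READ_ONLY"}).execute()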

Another option for truly pipeline oriented processing (assuming the underlying workers can generate data suitable for it) is to leverage something like Cloud Dataflow -- then they'll handle the work balancing for you.

(note: I work on GCE)


I have to say that PDs attached read-only to multiple VMs is a very important part of my pipelines. And also a distinguishing feature of GCE.

Cloud Dataflow, on the other hand, is difficult to use in practice. One of the main reasons is the large footprint of intermediate files (sometimes 10x the size of the input).

Currently we are running monolithic pipelines on the cloud, where one instance does the job from start to finish (if it gets preempted, I restart). I plan for the next version to be "split apart" in some way but have not found a good strategy/tooling. Any advice?


Hi! I'm an engineer on the Dataflow team. I'd love to understand the issue you mentioned about size of intermediate files, and help make Dataflow work well for you. Could you send a note to dataflow-feedback@google.com and elaborate?


@jsolson agreed on your points, although I will say from experience that Cloud Dataflow doesn't work well with most bioinformatics workflows unless one wanted to rewrite the tools from scratch. You end up fighting the system too much to make it work. Perhaps as the space evolves beyond the software wild west it is now, this will change; I hope so.


Agreed. I'm more interested in the cheap semi-quick data storage ($0.01 / GB / month) than any of the workflows they offer. I'm also not interested in learning a new syntax to replace what I can do with UNIX / python / R (ggplot2) already.


If storage is cheap enough, and abstract, we can expect there to be a thriving ecosystem of tools to apply to it. They don't even need to be embedded in a single platform, but could be distributed, streaming, and on-demand.

Here are two examples from the iobio project.

A heads-up display describing a BAM (alignment) file streaming between two remote servers:

http://bam.iobio.io/?bam=http://s3.amazonaws.com/iobio/NA128...

A demo that streams another BAM from the 1000 Genomes project through two variant callers, with dynamically updated statistics about the results. You can modify the variant quality threshold for each caller, and the results are recomputed in real time:

http://iobio.io/demo/variantcomparer/

Edit:

Very nice visualization of interesting statistics from a VCF file, provided by URL or uploaded:

http://vcf.iobio.io/?vcf=http://s3.amazonaws.com/vcf.files/A...


I agree the API at this point is rather limited, and it is difficult to use it in a way that would conveniently or efficiently replace any part of a genomic pipeline.

But I think Google is onto something. If they manage to deconstruct BAM files and store them in a "format-free" way that allows search, aggregation, and computation/reasoning across hundreds of thousands of patients/samples, it may accommodate the next big wave of personal genomics data.


I strongly agree with the idea of deconstructing legacy files (BAM, VCF, etc.) and moving to an API-centric (vs. file-format-centric) world. The Global Alliance for Genomics and Health (GA4GH.org) is working on that, with good early success -- for example, IGV (genome browser) can now work with either files or standard APIs, as can Picard (one of the popular suites of genome analysis tools). See https://www.youtube.com/watch?v=d8EvXtz2uiA for details on the Picard support.
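
As a taste of the API style, a reads search is just a paginated POST. The sketch below is illustrative only -- field names follow the Google Genomics v1 reads.search shape as I recall it, and the read group set ID, region, and OAuth token are placeholders, so check the GA4GH schemas for the exact contract:

  # Illustrative sketch only; field names and response shape are from memory,
  # and all IDs/credentials are placeholders.
  import requests

  resp = requests.post(
      "https://genomics.googleapis.com/v1/reads/search",
      headers={"Authorization": "Bearer <oauth-access-token>"},
      json={
          "readGroupSetIds": ["<read-group-set-id>"],
          "referenceName": "17",
          "start": "41196312",   # int64 fields are strings in the JSON mapping
          "end": "41277500",
          "pageSize": 256,
      })
  for alignment in resp.json().get("alignments", []):
      # Each entry is a read alignment in the GA4GH-style JSON representation.
      print(alignment.get("fragmentName"))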

Disclaimer: I work on Google Genomics, and on the GA4GH APIs.


Here are some good videos on the latest research on Hadoop and NGS data: https://www.youtube.com/playlist?list=PL5oElY7F-znCU_Ppb7YWJ...


Larry Page has been talking about this for a while. Interesting to see what Alphabet does.


I don't have first-hand experience in genetics. However, some professors I met who work with genes complained about the high cost of running AWS services. I don't know how much of a market there is for this, but there is some.


Is there a privacy policy? The only search results are Google's standard policy and scare words like "Google's expertise in privacy". Are they mining this cross-client?


Short answer: it's your data. Period.

You decide who can see the data you store, and the default is private. That's spelled out in lots of words in the legal terms somewhere; sorry it wasn't more clear on the website.

Disclaimer: I work for Google Genomics.


This needs to be front and center. Data protections / PHI are the #1 issue for academic groups in the field. – bioinformatician


If this is subpoena-proof as you imply, I'm going to start storing all my data in this service.


There seems to be: https://cloud.google.com/terms/ . Section 5.2 seems relevant, but I'm not a lawyer, so I don't know whether it means much.



