200 TB of 1000 genomes data on Amazon S3 as a public dataset (aws.typepad.com)
173 points by bbgm 2063 days ago | 55 comments

I know very little about genomics and biology and would love it if someone who knows more could answer my silly questions to help me understand this more:

- This has genome data from 1,700 people. In theory could a company build a machine that read one of these "files" as input and made a baby as output?

- It says the dataset is 200TB. If 1 person has 3 billion DNA bases (each of which, as I understand it, can be 1 of 4 values, or 2 bits), it seems that even uncompressed the dataset would be 6 billion bits (<1GB per person) and so only ~1.7TB uncompressed for 1,700 people. What am I missing?
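For reference, the arithmetic in the question can be spelled out as a rough sketch. The base count and head count are the figures quoted in the question; the 2-bit packing is the question's own assumption, not how the dataset is actually stored. The ~1.7TB figure comes from rounding each person up to 1GB, so the packed total comes out a little lower.

```python
# Back-of-envelope size of fully assembled genomes at 2 bits per base.
BASES_PER_GENOME = 3_000_000_000  # figure quoted in the question
BITS_PER_BASE = 2                 # A, C, G, T: 4 values fit in 2 bits
PEOPLE = 1_700                    # figure quoted in the question


def packed_genome_bytes(bases: int = BASES_PER_GENOME,
                        bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store one genome with fixed-width packing."""
    return bases * bits_per_base // 8


per_person = packed_genome_bytes()
total = per_person * PEOPLE
print(f"per person: {per_person / 1e9:.2f} GB")
print(f"all {PEOPLE} people: {total / 1e12:.2f} TB")
```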

I haven't looked at the data yet, but I suspect the large data size is due to full sequencer output. Next-gen sequencing methods produce a LOT of noisy overlapping data that can later be reduced to high quality contiguous sequences.

When compressed relative to a reference sequence, a ~3.3Gbp human genome can be reduced to ~3.3MiB of data[1].

For more information about the methods currently being used, check out these videos from the WUSTL Genome Center: http://gep.wustl.edu/curriculum/course_materials_WU/introduc...

1. http://dhruvbird.com/genome_compression.pdf
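As a toy illustration of why compressing against a reference helps so much: if two genomes are nearly identical, you only need to store the positions where they differ. The sketch below handles substitutions only and assumes equal-length sequences; it is not the method from the linked paper, and real tools also have to handle insertions and deletions. The function names are made up for this example.

```python
from __future__ import annotations


def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this toy version only handles substitutions")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus its stored differences."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = diff_against_reference(reference, sample)
print(diffs)  # [(4, 'T'), (9, 'A')]
assert apply_diff(reference, diffs) == sample
```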

1000 genomes includes many datasets, including sequencing. That said, I'm sure most of this is sequencing data. Sequencing is done using billions of reads (usually between 50 and 100 nucleotides long) that are mapped back to the genome. Average coverage of any individual nucleotide is usually 30-50x. What people usually share are compressed versions of the mapped reads that show all the sequencing information. This is done because, given the limitations of the sequencing technologies, how you call particular mutations, structural variations, and deletions depends on the algorithms you use to reconstruct the genome. Thus the raw data could be in the form of compressed reads, or the compressed mapped reads (or in this case, likely both). Even this, though, is a vast reduction in size from the truly raw data, which are usually the images used to call each base. Those are usually 1000x larger than this data set and usually aren't stored now because of the obvious space constraints.
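To make the coverage numbers concrete, here is a rough sketch of the usual relationship: average coverage is the number of reads times the read length, divided by the genome size. The 3 billion base genome size is the figure from the question upthread; the function names are just for illustration.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 3_000_000_000  # approximate figure quoted upthread

# 100-base reads at 30x coverage works out to about 900 million reads.
print(reads_needed(30, 100, GENOME_SIZE))
```

This only gives an average; actual coverage varies a lot from region to region.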

An analogy is like making a panorama photograph by taking lots of individual photographs and then stitching them together at the parts that appear to overlap to make one big picture, right?

Sort of; though there are more complications.

First, you have to deal with sequencing noise, which can be as high as 1%. Thus you need many reads to get consensus even for any individual base.

Second, you have to deal with the fact that a human genome has two copies of any individual chromosome, so you have to get enough reads to tell if your mutations are diploid or haploid for instance.

Third, the human genome is repetitive, so there are many mapping errors. This is dealt with by having high coverage and usually by having different size jumps between paired sequencing reads (e.g., read 100bp on each end of a 200bp, 2000bp, and 10,000bp parts of the genome) to help algorithms stitch together the proper sequence.

There are even more complications to complicate the complications ...

"you need many reads to get consensus even for any individual base"

Not necessarily. If you know that the minor allele frequency is close to 50% in related populations and you only have a few reads, say 6, and three are minor allele (non-reference) and three are "major" allele (reference), and the quality is good and there's otherwise low regional noise, then it's a good bet it's HETEROZYGOUS. You don't necessarily need a lot of reads.
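A minimal sketch of that reasoning, using a simple binomial model: each read shows the non-reference allele with probability 0.5 at a heterozygous site, or with probability equal to the error rate at a homozygous reference site. The 1% error rate is the figure mentioned upthread. The model ignores the population allele frequency and per-base quality scores, so treat it as an illustration rather than how real variant callers work.

```python
from math import comb


def genotype_likelihoods(alt_reads: int, total_reads: int,
                         error_rate: float = 0.01) -> dict:
    """Binomial likelihood of the observed reads under each genotype."""
    # Probability that a single read shows the non-reference allele.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    ref_reads = total_reads - alt_reads
    return {
        genotype: comb(total_reads, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for genotype, p in p_alt.items()
    }


likelihoods = genotype_likelihoods(alt_reads=3, total_reads=6)
for genotype, value in likelihoods.items():
    print(f"{genotype}: {value:.3g}")
```

With 3 of 6 reads non-reference, the heterozygous likelihood is 20/64, roughly four orders of magnitude above either homozygous alternative under this model.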

"second, you have to deal with the fact that a human genome has two copies of any individual chromosome"

Well. Sort of. Men only have one Y chromosome and one X chromosome, for instance, and women have zero Y's (a nitpick, granted). Disease samples often have 0, 1, or 3 or more copies of a genomic area.

That works if you're only performing a few such tests. But if you have an extremely accurate test, say 99.999% correct, and you apply it at 3 billion base pairs, then you're still going to have tens of thousands of incorrect calls, and no way to distinguish the good calls from the bad calls, without more data.
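The arithmetic behind that, as a quick sketch (assuming every site is called independently with the same accuracy, which is a simplification):

```python
def expected_wrong_calls(sites: int, accuracy: float) -> float:
    """Expected number of incorrect calls across all sites."""
    return sites * (1.0 - accuracy)


wrong = expected_wrong_calls(3_000_000_000, 0.99999)
print(f"expected incorrect calls: about {wrong:,.0f}")
```

That is on the order of 30,000 bad calls, which matches the "tens of thousands" figure.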

> In theory could a company build a machine that read one of these "files" as input and made a baby as output?

Not quite. We can read DNA sequences, but we can't reliably write them yet beyond a few dozen base pairs. In fact, we can't even take a fully-assembled human genome and "make a baby as output." That process is called cloning, and it is fraught with all manner of practical and ethical difficulties. But there's nothing that prevents it in principle, and given the pace at which technology is advancing, "printing a baby" may well be possible within the lifetimes of some of the people reading this.

I think that current technology also has problems sequencing the bp near the start of a sequence.

Sort of. Current DNA sequencing technology breaks the DNA up into short pieces, sequences the pieces, and then uses software to reassemble them by looking for places where the fragmentary sequences overlap. It's basically the mother of all DIFFs. Telomeric DNA (the DNA at the ends of chromosomes) consists of many repetitions of the same short (<10) sequence of base pairs, so there is no pattern for the reassembly process to latch on to. So it's not the ends per se that are problematic, it's any long sequence that is devoid of structure. Chimps, for example, have telomeric DNA in the middle of one of their chromosomes. This is an example of "smoking gun" evidence for macroevolution. See http://www.gate.net/~rwms/hum_ape_chrom.html
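Here is a toy sketch of why repeats defeat overlap-based reassembly. It lists every length at which the end of one fragment matches the start of another. With ordinary sequence there is one sensible overlap; with a repeated short unit there are several equally good ones, so the assembler cannot tell how many copies there really are. The function name is made up, TTAGGG is the human telomere repeat unit, and real assemblers are far more sophisticated than this.

```python
from __future__ import annotations


def suffix_prefix_overlaps(left: str, right: str, min_overlap: int = 4) -> list[int]:
    """Every length k >= min_overlap where left's last k bases equal right's first k."""
    longest = min(len(left), len(right))
    return [k for k in range(min_overlap, longest + 1) if left[-k:] == right[:k]]


# Ordinary sequence: exactly one way to join the two fragments.
print(suffix_prefix_overlaps("GATTACAGGCTA", "AGGCTATCCGAT"))  # [6]

# Repeated unit: three equally plausible joins, so the true length is ambiguous.
telomere_like = "TTAGGG" * 3
print(suffix_prefix_overlaps(telomere_like, telomere_like))  # [6, 12, 18]
```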

If chimps have 24 chromosomes and humans have 23, that implies 2 chromosomes have fused in humans. This fusion would put telomeric DNA in the middle of a human chromosome, not an ape/chimp chromosome.

I wonder if nanopore sequencing will ever be good enough to fill in those gaps?

It depends. There are several sequencing technologies and they all have strengths and weaknesses. At my job, we use 454 sequencing from Roche, and since it relies on PCR it has problems with tandem repeats and homopolymers that might not trip up other methods (Sanger, for example).

- Could you make a baby?

No, there's more to us than just DNA. For example, methylation (the addition of methyl chemical groups to some bases), which isn't tracked in "normal" DNA sequencing, controls which genes get expressed by which cells. Plus there's a reasonable chance of errors in the sequencing due to the need to copy the DNA repeatedly in order to identify the bases.

- Data size

Most bioinformatics data formats are plain ASCII. So even the reference data would be 3GB per person. But 1,000 genomes contains sequencing reads where each DNA base is sampled multiple times (20-40 is a typical "read depth") so that errors in identifying bases can be minimised. Each base of each of these samples has an associated quality score (which is about a 6 bit value). Plus there are identifiers for all the billion-odd reads per person.
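A rough sketch of how those factors multiply out. It assumes one ASCII byte per base and one per quality score, a read depth of 30 (inside the 20-40 range above), and the 3 billion base and 1,700 people figures from the question. It ignores read identifiers and compression, so it is an order-of-magnitude estimate only, not a breakdown of the actual 200TB.

```python
def raw_read_bytes(genome_bases: int, read_depth: int,
                   bytes_per_base: int = 1, bytes_per_quality: int = 1) -> int:
    """Uncompressed ASCII size of the bases plus quality scores for one person."""
    return genome_bases * read_depth * (bytes_per_base + bytes_per_quality)


per_person = raw_read_bytes(3_000_000_000, read_depth=30)
everyone = per_person * 1_700
print(f"per person: {per_person / 1e9:.0f} GB uncompressed")
print(f"1,700 people: {everyone / 1e12:.0f} TB uncompressed")
```

That lands in the hundreds of terabytes before compression, so a 200TB dataset is the right order of magnitude.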

For fairly simple organisms we have 'cold booted' DNA and swapped a cell's DNA with artificial DNA from a closely related organism, and it still worked. So, while humans have a lot of genetic information, I suspect you could get it to work if you're willing to accept fairly low chances of success.

As for read errors, each one is effectively just a 'mutation', and those are generally fairly harmless. If you stay below, say, 1,000 mutations, which would still take very high accuracy, you have not significantly reduced your chances of success.

True, but there's also the fact we don't know the whole human genome - only about 98% of it.

We have the raw data on the actual DNA for several people which is what you need to make a copy. What we don't know is what the data means and what all the mutations are in the wild. Which is what you want to know if we are going to start making changes.

Actually we don't have the full DNA sequence for any human. For example, if you look at the data from say the Genome Reference Consortium the first 10,000 bases on Chromosome are designated as N - unknown.

True that we don't have a full sequence, but that's not the best example. The telomeres (ends) consist of the same set of bases repeated thousands of times. Recent research suggests that the length is probably super important. We're good at approximating the length, but not at measuring it exactly.

There are a bunch of regions of 'N' in the reference sequence, most are just repeats.

The genome is incredibly complex, and yes, much we still can't represent accurately. As one example, some genes are given a location in the reference genome, while every person actually has multiple copies that are scattered across the genome.

Maybe some of the people contributing answers to this comment thread could also suggest some literature on bioinformatics? (specifically for Software Engineers with limited biology backgrounds)

More specifics about what you're interested in? I work in the space (human genetics - we analyse the output of 1000 Genomes) and happy to help, but "bioinformatics" is pretty broad.

I'll also note that so much of the human genetics space is so new that most knowledge is still cooped up in academic papers - there aren't very good resources for the general public. A colleague helped edit a textbook in 2007 on genetic analysis (which would still be considered "new" in textbook space), and none of the tools in that book can be used with 1000 genomes data.

Maybe some broad-sweeping entry level books or literature reviews would be good. For me (and possibly many others on hn), the extent of my knowledge in the field as a whole is pretty small so I think some introductory resources geared towards someone coming from a strong software or math background would be best.

(The following are all for genetics, others can chime in on other areas or bioinformatics in general.)

-- I'm biased, but I think the best entry level book is still from 1998 - Genome by Matt Ridley [1]. It's what got me interested in genomics as a CS undergrad (I read it in 2009). Another decent one is The $1000 Genome [2]; it gives a good cross section of genetics in 2010.

-- Our group has a series of intro lecture videos from last academic year [3]. They are fairly up to date, and this year's videos will probably be posted soon.

-- Genomes Unzipped is great if you prefer a blog [4]

-- I think the best way to actually learn this stuff is to just play with the software tools. They all point to open data in the tutorials. Biopython's tutorial is particularly good - just google the biological terms as you go. Bioconductor has some good (though more targeted) tutorials too.

-- Going to talks can be a great way to get a broad overview of the space. Drop me a note if you happen to be located in Boston.

-- Finally, this goes without saying, but don't take articles in the mainstream media at face value. (Including, and in fact in particular, the NYT.) Every time I get together with relatives I have to argue against the latest grand prediction.

[1] http://www.amazon.com/Genome-The-Autobiography-Species-Chapt...

[2] http://www.amazon.com/The-000-Genome-Revolution-Personalized...

[3] http://www.broadinstitute.org/scientific-community/science/p...

[4] http://www.genomesunzipped.org/

Thanks a lot for these - been looking for a solid set of resources on this for a while.

"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by Durbin, Eddy, Krogh, and Mitchison is a classic, very readable intro to sequence analysis algorithms (dynamic programming, HMMs) that would be interesting for someone coming from a math/cs background.

I'm not a biology expert, but from my understanding, they wouldn't be able to "make a baby" as output. But they could make a cell that contains the specified DNA that would grow into a baby under the right conditions. Of course, we are still a ways out from having that technology.

What I think you're alluding to is something like this:


I haven't been following the progress for the past year and a half or so, but it was state of the art then.

Long story short, we're not synthesizing de-novo babies any time soon! Well, at least not in the lab.

There are technological issues and then there are ethical issues. I can say we are several decades away from making a baby just by stitching together raw genome data. There are 30,000 genes in the human genome. Think of the number of possible combinations that can arise just from choosing a few good genes when optimizing for a tailor-made human being. And the ethical challenge is huge too.

The best dataset for this audience is probably the most recent curated set of variant calls on 1000genomes.org. (Incidentally, the data access links appear down at the moment...)

This provides data in VCF format [1], which I would argue is the lowest level you want to go with this data unless you are doing variant calling methods development.

One tool you can use to analyze these data is PLINK/SEQ [2] (disclaimer: I work on the project). If any C++ devs are interested in contributing, let me know. (We'll have the source on Github soon...still pushing my group to learn Git :)

[1] http://www.1000genomes.org/node/101

[2] atgu.mgh.harvard.edu/plinkseq/
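For software folks who haven't seen VCF before: it is tab-separated text with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by per-sample genotype columns. Below is a minimal sketch of parsing one data line in plain Python. The record is made up for illustration and is not from the 1000 Genomes files; for real work you would want a proper VCF library or a tool like the one mentioned above.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: list
    info: dict


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed columns of one VCF data line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    if info != ".":  # "." means the INFO column is empty
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without "=" are flags; record them as True.
            info_dict[key] = value if sep else True
    return Variant(chrom, int(pos), ref, alt.split(","), info_dict)


# Hypothetical record, for illustration only.
example = "1\t10583\trs0000000\tG\tA\t100\tPASS\tAF=0.14;DB"
variant = parse_vcf_line(example)
print(variant.chrom, variant.pos, variant.ref, variant.alts, variant.info)
```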

Since you are involved with PLINK/SEQ, you might be able to answer this. Can you recommend any reading for learning how to perform admixture analysis/visualization? I'm interested in seeing the workflow behind starting with raw genomes and producing visualizations similar to Dodecad.

I was just having a conversation with my supervisor yesterday about how amazing the PLINK project is, and how much easier it makes my work. I use it every day and I can't count the number of times it's saved me from writing embarrassingly messy R scripts to accomplish something.

The burden has shifted from the bio to the informatics. We need to be able to sort through this much data without needing access to university-scale computer facilities.

At Ion Torrent we are working on this problem as well. We are always hiring too: http://iontorrent.com
All of our software is open source: http://github.com/iontorrent

The link at: http://www.iontorrent.com/career/

just re-directs me to the homepage.

Which language is the software written in?

This is a question you can answer yourself by simply clicking on their github link. Their four public repositories are in C, Lisp, C++, and Java.

All of my code is Python and JavaScript

Agreed. At EuPathDB, we're trying to provide just that for the genomic data of parasites that cause diseases like sleeping sickness and malaria. And we're hiring! See http://jobs.eupathdb.org/

I think it's important to give clear examples here - research questions that biologists have posed but are struggling to answer with existing software tools. They certainly exist, but I think some overestimate the bottleneck. For example, I would argue that the bioinformatics challenges at Ion Torrent are less important than, say, chemistry that creates longer reads :)

That's exactly what I'm doing at my day job: http://nextbio.com

How many people's curated genomes does NextBio have? I am just curious.

I'm still not 100% sold on these big sequencing projects. Sure, you get the genomes of a bunch of individuals, and then you do a bunch of GWAS on it to see if you can find some mutations that are linked to any phenotypes. It just doesn't seem very exciting. I love the idea that if you're working on any cell line/individual, you can just get it sequenced for decreasing amounts of money, but I'd love to be sold on some big picture stuff as to why 1000 genomes matters (and if GWAS actually has lived up to its promise).

That said, I've got a paper coming out real soon now that looked at the relationship between mutations across individual strains of Arabidopsis, and how (surprise surprise) you have fewer mutations in areas of proteins that potentially carry function, so I do think it's somewhat useful!

The real reason for a lot of this is personalized medicine, or rather more specific medicine. A gene translocation or even a SNP may be able to differentiate cancer types and change the course of treatment.

> why 1000 genomes matters

Gotta start somewhere. And 1000 data points is more than enough to get some statistically significant results.

Unfortunately, as far as I know there is very little useful phenotypic data along with these genomes to work with. I would rather have fewer genome sequences with better phenotypic information. Even with just 35 genome sequences of "individual" yeast genomes I could do a lot more interesting analysis because they are well studied in terms of phenotypes.

The point is to look at genomic variations across the entirety of the human genome, not so much to match genomic variation to phenotype. What you're saying makes sense, but I think that is more a next step.

I dig this. However, unless I'm misunderstanding how S3 works, to use it you essentially have to clone that entire S3 bucket and pay for the monthly storage of 200TB yourself, correct?

Not the biggest deal in the world if you're seriously crunching on it... but something to consider.

The 1000 genomes bucket is open to the public, so you don't need to clone it for your own purposes. Any data you generate will need to be in your own bucket.

And that bucket that you put your intermediate data in will be by itself pretty large. Or any processing that you do on this data will be computationally expensive. Either way, Amazon is attracting some new customers.

Oh... that does change things.

Has anyone tried writing a life emulator that could read this gene code and then virtually generate a baby? This would be really cool for injecting one's self into an MMORPG, at least one where you could be a baby.

Similar to other questions, but can anyone explain what exactly this data is and what analysis is being done on it?


Here's the project explanation, but the data itself is 1000 complete (hopefully!) genome sequences. The hope is to find the variants, be they single nucleotide differences, copy number differences, etc. Where I work we use it to help identify SNPs (single nucleotide polymorphisms), to see if something we sequenced is a mutation or just a variant with no effect.

I would be interested in how much this dataset costs to run every month. Anyone able to shed some light on this?
