I know very little about genomics and biology and would love it if someone who knows more could answer my silly questions to help me understand this more:
- This has genome data from 1,700 people. In theory could a company build a machine that read one of these "files" as input and made a baby as output?
- It says the dataset is 200TB. If 1 person has 3 billion DNA bases (which I understand can each be 1 of 4 values, or 2 bits), it seems that even uncompressed one genome would be 6 billion bits (<1GB per person), and so only ~1.7TB uncompressed for 1,700 people. What am I missing?
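For what it's worth, the naive arithmetic in the question works out like this (a toy sketch assuming a perfect 2-bit encoding and nothing else stored per person):

```python
# Back-of-envelope: 2 bits per base, uncompressed haploid genome.
BASES_PER_GENOME = 3_000_000_000   # ~3 billion bases
BITS_PER_BASE = 2                  # A/C/G/T -> 2 bits
PEOPLE = 1_700

bytes_per_genome = BASES_PER_GENOME * BITS_PER_BASE // 8
total_bytes = bytes_per_genome * PEOPLE

print(f"{bytes_per_genome / 1e9:.2f} GB per person")   # ~0.75 GB
print(f"{total_bytes / 1e12:.2f} TB for 1,700 people") # ~1.3 TB
```

So the question's ~1.7TB estimate is in the right ballpark; the 200TB figure comes from what's actually stored, as the replies below explain.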
I haven't looked at the data yet, but I suspect the large data size is due to full sequencer output. Next-gen sequencing methods produce a LOT of noisy overlapping data that can later be reduced to high quality contiguous sequences.
When compressed relative to a reference sequence, a ~3.3Gbp human genome can be reduced to ~3.3MiB of data.
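A toy sketch of that idea (purely illustrative strings standing in for 3.3 Gbp genomes): store only the positions where the sample differs from the shared reference, and reapply them to decompress.

```python
# Sketch: reference-based compression stores only the differences
# (variants) between an individual and a shared reference sequence.
reference = "ACGTACGTACGTACGT"
sample    = "ACGTACCTACGTACGA"  # differs at positions 6 and 15

# "Compress": record (position, sample_base) only where they differ.
variants = [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

# "Decompress": apply the variants back onto the reference.
decoded = list(reference)
for pos, base in variants:
    decoded[pos] = base
assert "".join(decoded) == sample

print(variants)  # [(6, 'C'), (15, 'A')]
```

Since any two humans differ at only a few million positions, storing the diff instead of the sequence is what gets you from gigabytes down to megabytes.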
1000 Genomes includes many datasets, but I'm sure most of this is sequencing data. Sequencing works by taking billions of short reads (usually between 50 and 100 nucleotides long) and mapping them back to the genome. Average coverage of any individual nucleotide is usually 30-50x. What people usually share are compressed versions of the mapped reads, which preserve all the sequencing information. The reason is that, due to the limitations of the sequencing technologies, how you call particular mutations, structural variations, and deletions depends on the algorithms you use to reconstruct the genome. Thus the raw data could be in the form of compressed reads, or compressed mapped reads (or in this case, likely both). Even this is a vast reduction from the truly raw data (the images used to call each base), which would be ~1000x larger than this data set; those usually aren't stored now, for obvious space reasons.
First, you have to deal with sequencing noise, which can be as high as 1%. Thus you need many reads to get a consensus for any individual base.
Second, you have to deal with the fact that a human genome has two copies of each chromosome, so you need enough reads to tell whether a mutation is heterozygous (on one copy) or homozygous (on both), for instance.
Third, the human genome is repetitive, so there are many mapping errors. This is dealt with by having high coverage, and usually by having different-size jumps between paired sequencing reads (e.g., read 100bp at each end of 200bp, 2,000bp, and 10,000bp fragments of the genome) to help algorithms stitch together the proper sequence.
There's even more complications to complicate the complications ...
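As a rough illustration of the consensus point, here's a toy majority-vote call over the reads covering one position (not how real variant callers work, but it shows why depth helps against ~1% noise):

```python
from collections import Counter

def consensus(pileup):
    """Majority-vote base call from the reads covering one position."""
    counts = Counter(pileup)
    base, n = counts.most_common(1)[0]
    return base, n / len(pileup)

# 30 reads covering one position; one read carries a sequencing error.
reads = ["A"] * 29 + ["G"]
base, support = consensus(reads)
print(base, round(support, 3))  # A 0.967
```

With a single read you couldn't tell the error from a real variant; with 30x coverage the lone "G" is obviously noise.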
"you need many reads to get consensus even for any individual base"
Not necessarily. If you know that the minor allele frequency is close to 50% in related populations and you only have a few reads, say 6, and three are minor-allele non-reference and three are "major"-allele reference (and the quality is good and there's otherwise low regional noise), then it's a good bet it's HETEROZYGOUS. You don't necessarily need a lot of reads.
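A toy version of that reasoning, using a simple binomial model of the alt-allele read count (illustrative only; real genotype callers fold in priors, base qualities, and mapping qualities):

```python
from math import comb

def genotype_likelihoods(alt_reads, total_reads, error=0.01):
    """P(data | genotype) for hom-ref, het, hom-alt under a simple
    binomial model of how many reads show the non-reference allele."""
    def binom(p):
        return comb(total_reads, alt_reads) * p**alt_reads * (1 - p)**(total_reads - alt_reads)
    return {
        "hom_ref": binom(error),      # alt reads would all be sequencing errors
        "het":     binom(0.5),        # each read is 50/50 ref or alt
        "hom_alt": binom(1 - error),  # ref reads would all be errors
    }

# 3 of 6 reads show the non-reference allele:
lik = genotype_likelihoods(3, 6)
print(max(lik, key=lik.get))  # het
```

Three errors out of six reads at a 1% error rate is wildly unlikely, so even this tiny pileup strongly favors heterozygous, as the comment says.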
"second, you have to deal with the fact that a human genome has two copies of any individual chromosome"
Well. Sort of. Men only have one Y chromosome and one X chromosome, for instance, and women have zero Y's (a nitpick, granted). Disease samples often have 0, 1, or 3 or more copies of a genomic region.
That works if you're only performing a few such tests. But if you have an extremely accurate test, say 99.999% correct, and you apply it at 3 billion base pairs, then you're still going to have tens of thousands of incorrect calls, and no way to distinguish the good calls from the bad calls, without more data.
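The arithmetic behind that point (a sketch with assumed round numbers):

```python
# Even a 99.999%-accurate caller makes many absolute errors at genome scale.
accuracy = 0.99999
genome_positions = 3_000_000_000

expected_errors = genome_positions * (1 - accuracy)
print(f"{expected_errors:,.0f} expected incorrect calls")  # 30,000
```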
> In theory could a company build a machine that read one of these "files" as input and made a baby as output?
Not quite. We can read DNA sequences, but we can't reliably write them yet beyond a few dozen base pairs. In fact, we can't even take a fully-assembled human genome and "make a baby as output." That process is called cloning, and it is fraught with all manner of practical and ethical difficulties. But there's nothing that prevents it in principle, and given the pace at which technology is advancing, "printing a baby" may well be possible within the lifetimes of some of the people reading this.
Sort of. Current DNA sequencing technology breaks the DNA up into short pieces, sequences the pieces, and then uses software to reassemble them by looking for places where the fragmentary sequences overlap. It's basically the mother of all DIFFs. Telomeric DNA (the DNA at the ends of chromosomes) consists of many repetitions of the same short (<10bp) sequence, so there is no pattern for the reassembly process to latch on to. So it's not the ends per se that are problematic, it's any long sequence that is devoid of structure. Chimps, for example, have telomeric DNA in the middle of one of their chromosomes. This is an example of "smoking gun" evidence for macroevolution. See http://www.gate.net/~rwms/hum_ape_chrom.html
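A toy version of the overlap ("diff") step, which also shows why pure repeats defeat it (illustrative code, not a real assembler):

```python
def best_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

# Unique sequence: the fragments overlap in exactly one way.
print(best_overlap("ACGTTGCA", "TGCACCGT"))  # 4 ("TGCA")

# Telomere-like repeats (TTAGGG is the human telomere motif): the
# fragments overlap at many offsets, so the assembler can't tell how
# many copies of the repeat the real region contains.
print(best_overlap("TTAGGGTTAGGG", "TTAGGGTTAGGG"))  # 12
```

In the repeat case every candidate overlap is equally plausible, which is exactly the "no pattern to latch on to" problem described above.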
Depends, there are several sequencing technologies and they all have strengths/weaknesses. At my job, we use 454 sequencing from Roche and since it relies on PCR it has problems with tandem repeats and homopolymers that might not trip up other methods. (Sanger for example)
No, there's more to us than just DNA. For example, methylation (the addition of methyl chemical groups to some bases) controls which genes get expressed by which cells, and it isn't tracked in "normal" DNA sequencing. Plus there's a reasonable chance of errors in the sequencing, due to the need to copy the DNA repeatedly in order to read it.
- Data size
Most bioinformatics data formats are plain ASCII, so even the reference data would be ~3GB per person. But 1000 Genomes contains sequencing reads, in which each DNA base is sampled multiple times (20-40x is a typical "read depth") so that errors in identifying bases can be minimised. Each base of each read also has an associated quality score (about a 6-bit value), plus there are identifiers for all billion-odd reads per person.
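A rough back-of-envelope for why ASCII reads blow the size up (the read length, coverage, and header size here are assumed round figures, not the project's actual parameters):

```python
# A FASTQ record stores each read as ASCII text: a header line, the
# bases, a '+' separator, and one quality character per base.
read_len = 100
coverage = 30
genome = 3_000_000_000

n_reads = genome * coverage // read_len            # ~900 million reads
record_bytes = (40 + 1) + (read_len + 1) * 2 + 2   # header + bases + '+' + quals, with newlines
total = n_reads * record_bytes
print(f"{total / 1e12:.1f} TB per person, uncompressed")  # ~0.2 TB
```

So one person's reads are hundreds of gigabytes before compression, and that's before you add the mapped (BAM) copies of the same data.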
For fairly simple organisms we have 'cold booted' DNA: a cell's DNA has been swapped with artificial DNA from a closely related organism, and it still worked. So while humans have a lot of genetic information, I suspect you could get it to work if you're willing to accept fairly low chances of success.
As for read errors, each one is effectively just a 'mutation', and those are generally fairly harmless. If you stay below, say, 1,000 mutations (which would still take very high accuracy), you haven't significantly reduced your chances of success.
We have the raw data on the actual DNA for several people which is what you need to make a copy. What we don't know is what the data means and what all the mutations are in the wild. Which is what you want to know if we are going to start making changes.
Actually, we don't have the full DNA sequence for any human. For example, if you look at the data from, say, the Genome Reference Consortium, the first 10,000 bases of chromosome 1 are designated as N - unknown.
True that we don't have a full sequence, but that's not the best example. The telomeres (ends) consist of the same set of bases repeated thousands of times. Recent research suggests that the length is probably super important. We're good at approximating length, but not detecting exactly.
There are a bunch of regions of 'N' in the reference sequence, most are just repeats.
The genome is incredibly complex, and yes, much we still can't represent accurately. As one example, some genes are given a location in the reference genome, while every person actually has multiple copies that are scattered across the genome.
More specifics about what you're interested in? I work in the space (human genetics - we analyse the output of 1000 Genomes) and happy to help, but "bioinformatics" is pretty broad.
I'll also note that so much of the human genetics space is so new that most knowledge is still cooped up in academic papers - there aren't very good resources for the general public. A colleague helped edit a textbook on genetic analysis in 2007 (which would still be considered "new" in textbook space), and none of the tools in that book can be used with 1000 Genomes data.
Maybe some broad-sweeping entry level books or literature reviews would be good. For me (and possibly many others on hn), the extent of my knowledge in the field as a whole is pretty small so I think some introductory resources geared towards someone coming from a strong software or math background would be best.
(The following are all for genetics, others can chime in on other areas or bioinformatics in general.)
-- I'm biased, but I think the best entry-level book is still from 1998: Genome by Matt Ridley. It's what got me interested in genomics as a CS undergrad (I read it in 2009). Another decent one is The $1000 Genome; it gives a good cross-section of genetics in 2010.
-- Our group has a series of intro lecture videos from last academic year. They are fairly up to date, and this year's videos will probably be posted soon.
-- Genomes Unzipped is great if you prefer a blog.
-- I think the best way to actually learn this stuff is to just play with the software tools. They all point to open data in the tutorials. Biopython's tutorial is particularly good - just google the biological terms as you go. Bioconductor has some good (though more targeted) tutorials too.
-- Going to talks can be a great way to get a broad overview of the space. Drop me a note if you happen to be located in Boston.
-- Finally, this goes without saying, but don't take articles in the mainstream media at face value. (Including, and in fact in particular, the NYT.) Every time I get together with relatives I have to argue against the latest grand prediction.
"Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by Durbin, Eddy, Krogh, and Mitchison is a classic, very readable intro to sequence analysis algorithms (dynamic programming, HMMs) that would be interesting for someone coming from a math/cs background.
I'm not a biology expert, but from my understanding, they wouldn't be able to "make a baby" as output. But they could make a cell that contains the specified DNA that would grow into a baby under the right conditions. Of course, we are still a ways out from having that technology.
There are technological issues, and then there are ethical issues. I'd say we are several decades away from making a baby just by stitching together raw genome data. There are ~30,000 genes in the human genome; think of the number of possible combinations that arise when choosing even a few good genes to optimize for in developing a tailor-made human being. And the ethical challenge is huge too.
The best dataset for this audience is probably the most recent curated set of variant calls on 1000genomes.org. (Incidentally, the data access links appear down at the moment...)
This provides data in VCF format , which I would argue is the lowest level you want to go with this data unless you are doing variant calling methods development.
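For the curious, a VCF data line is just tab-separated ASCII, so a minimal parse looks like this (the example line and its values are illustrative; for real files use a proper library rather than hand-parsing):

```python
# One VCF data line. Fixed columns per the VCF spec:
# CHROM POS ID REF ALT QUAL FILTER INFO ...
line = "1\t10583\trs58108140\tG\tA\t100\tPASS\tAF=0.14"

fields = line.split("\t")
record = {
    "chrom": fields[0],
    "pos": int(fields[1]),
    "id": fields[2],
    "ref": fields[3],
    "alt": fields[4].split(","),  # ALT may list several alleles
    "info": dict(kv.split("=") for kv in fields[7].split(";") if "=" in kv),
}
print(record["pos"], record["alt"], record["info"]["AF"])  # 10583 ['A'] 0.14
```

The INFO column is where the interesting population-level data lives (allele frequencies, etc.), which is why VCF is such a convenient level to work at for this dataset.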
One tool you can use to analyze these data is PLINK/SEQ (disclaimer: I work on the project). If any C++ devs are interested in contributing, let me know. (We'll have the source on GitHub soon... still pushing my group to learn Git :)
Since you are involved with PLINK/SEQ, you might be able to answer this.. Can you recommend any reading for learning how to perform admixture analysis/visualization? I'm interested in seeing the workflow behind starting with raw genomes and producing visualizations similar to Dodecad.
I was just having a conversation with my supervisor yesterday about how amazing the PLINK project is, and how much easier it makes my work. I use it every day and I can't count the number of times it's saved me from writing embarrassingly messy R scripts to accomplish something.
I think it's important to give clear examples here - research questions that biologists have posed but are struggling to answer with existing software tools. They certainly exist, but I think some overestimate the bottleneck. For example, I would argue that the bioinformatics challenges at Ion Torrent are less important than, say, chemistry that creates longer reads :)
I'm still not 100% sold on these big sequencing projects. Sure, you get the genomes of a bunch of individuals, and then you do a bunch of GWAS on it to see if you can find some mutations that are linked to any phenotypes. It just doesn't seem very exciting. I love the idea that if you're working on any cell line/individual, you can just get it sequenced for decreasing amounts of money, but I'd love to be sold on some big picture stuff as to why 1000 genomes matters (and if GWAS actually has lived up to its promise).
That said, I've got a paper coming out real soon now that looked at the relationship between mutations across individual strains of Arabidopsis, and how (surprise surprise) you have fewer mutations in areas of proteins that potentially carry function, so I do think it's somewhat useful!
The real reason for a lot of this is personalized medicine, or rather more specific medicine. A gene translocation or even a SNP may be able to differentiate cancer types and change the course of treatment.
Unfortunately, as far as I know there is very little useful phenotypic data along with these genomes to work with. I would rather have fewer genome sequences with better phenotypic information. Even with just 35 genome sequences of "individual" yeast genomes I could do a lot more interesting analysis, because they are well studied in terms of phenotypes.
The point is to look at genomic variations across the entirety of the human genome, not so much to match genomic variation to phenotype. What you're saying makes sense, but I think that is more a next step.
And that bucket that you put your intermediate data in will be by itself pretty large. Or any processing that you do on this data will be computationally expensive. Either way, Amazon is attracting some new customers.
Has anyone tried writing a life emulator that could read this gene code and then virtually generate a baby? This would be really cool for injecting one's self into an MMORPG, at least one where you could be a baby.
Here's the project explanation, but the data itself is 1000 complete (hopefully!) genome sequences. The hope is to find the variants, be it single nucleotide differences, copy number difference, etc. Where I work we use it to help figure out SNP (single nucleotide polymorphisms) to see if something we sequenced is a mutation or just a variant with no effect.
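That "known variant or novel mutation?" check boils down to a set lookup once the catalogue is loaded (a toy sketch; the positions and alleles here are made up for illustration, not taken from the actual VCFs):

```python
# known_snps would be loaded from the 1000 Genomes variant calls;
# here it's a hand-written toy set of (chrom, pos, ref, alt) tuples.
known_snps = {("1", 10583, "G", "A"), ("2", 500, "C", "T")}

def is_known_variant(chrom, pos, ref, alt):
    """True if this exact base change appears in the catalogued variants."""
    return (chrom, pos, ref, alt) in known_snps

print(is_known_variant("1", 10583, "G", "A"))  # True  -> known common variant
print(is_known_variant("1", 99999, "A", "C"))  # False -> possibly a novel mutation
```

A hit means the change is probably a common polymorphism with no particular effect; a miss means it's worth a closer look as a genuine mutation.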