- This has genome data from 1,700 people. In theory, could a company build a machine that read one of these "files" as input and made a baby as output?
- It says the dataset is 200TB. If 1 person has 3 billion DNA bases (which I understand can be 1 of 4 values, or 2 bits), it seems that even uncompressed the dataset would be 6 billion bits per person (<1GB), and so only ~1.7TB uncompressed for 1,700 people. What am I missing?
When compressed relative to a reference sequence, a ~3.3Gbp human genome can be reduced to ~3.3MiB of data.
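For intuition on why that lands in the megabyte range: a person differs from the reference at only a few million positions, so you only need to store the differences. A rough back-of-envelope in Python (the variant count and bytes-per-variant figures are assumptions for illustration, not numbers from the project):

    # Why reference-based compression gets a genome down to a few MiB.
    # Assumed figures: ~4-5 million variant sites per person vs. the reference,
    # and roughly a byte per variant after entropy-coding position deltas + alleles.
    variants_per_person = 4_500_000
    bytes_per_variant = 1

    compressed = variants_per_person * bytes_per_variant
    print(f"~{compressed / 2**20:.1f} MiB per genome")           # a few MiB

    # Compare with storing all ~3.3 Gbp explicitly at 2 bits per base:
    raw = 3_300_000_000 * 2 / 8
    print(f"~{raw / 2**30:.2f} GiB uncompressed (2 bits/base)")  # under 1 GiB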
For more information about the methods currently being used, check out these videos from the WUSTL Genome Center:
First, you have to deal with sequencing noise, which can be as high as 1%. Thus you need many reads to get a consensus even for any individual base (there's a toy calculation after the third point showing how quickly depth helps).
Second, you have to deal with the fact that a human genome has two copies of any individual chromosome, so you have to get enough reads to tell, for instance, whether a mutation is heterozygous or homozygous.
Third, the human genome is repetitive, so there are many mapping errors. This is dealt with by having high coverage and usually by having different size jumps between paired sequencing reads (e.g., read 100bp from each end of 200bp, 2,000bp, and 10,000bp fragments of the genome) to help algorithms stitch together the proper sequence.
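A toy calculation for the first point: assuming independent errors at 1%, the chance that erroneous reads outnumber correct ones at a site drops off very fast with depth. Real errors are correlated and partly systematic, so this understates the problem, but the shape of the curve is the reason for 20-40x coverage.

    from math import comb

    def p_majority_wrong(depth, err=0.01):
        """Chance that erroneous reads outnumber correct ones at one site,
        assuming independent errors at rate `err` (a simplification)."""
        return sum(comb(depth, k) * err**k * (1 - err)**(depth - k)
                   for k in range(depth // 2 + 1, depth + 1))

    for depth in (1, 5, 10, 30):
        print(depth, f"{p_majority_wrong(depth):.2e}")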
"you need many reads to get consensus even for any individual base"
Not necessarily. If you know that the minor allele frequency is close to 50% in related populations and you only have a few reads, say 6, and three are minor-allele non-reference and three are "major" allele reference (and the quality is good and there's otherwise low regional noise), then it's a good bet it's HETEROZYGOUS. You don't necessarily need a lot of reads.
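Rough sketch of that intuition (the 1% error rate is an assumed figure, and this ignores the allele-frequency prior the parent comment mentions): with 3 reference and 3 alternate reads, the het genotype is already orders of magnitude more likely than either homozygous call.

    from math import comb

    def likelihood(n_ref, n_alt, p_alt):
        """Binomial likelihood of seeing n_alt alt reads out of n_ref + n_alt,
        if each read shows the alt allele with probability p_alt."""
        n = n_ref + n_alt
        return comb(n, n_alt) * p_alt**n_alt * (1 - p_alt)**n_ref

    err = 0.01  # assumed per-base error rate
    for name, p_alt in [("hom ref", err), ("het", 0.5), ("hom alt", 1 - err)]:
        print(name, f"{likelihood(3, 3, p_alt):.3g}")
    # het comes out ~0.31, both homozygous genotypes ~2e-5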
"second, you have to deal with the fact that a human genome has two copies of any individual chromosome"
Well. Sort of. Men only have one Y chromosome and one X chromosome, for instance, and women have zero Y's (a nitpick, granted). Disease samples often have 0, 1, or 3 or more copies of a genomic region.
Not quite. We can read DNA sequences, but we can't reliably write them yet beyond a few dozen base pairs. In fact, we can't even take a fully-assembled human genome and "make a baby as output." That process is called cloning, and it is fraught with all manner of practical and ethical difficulties. But there's nothing that prevents it in principle, and given the pace at which technology is advancing, "printing a baby" may well be possible within the lifetimes of some of the people reading this.
No, there's more to us than just DNA. For example, methylation, the addition of methyl chemical groups to some bases, isn't tracked in "normal" DNA sequencing, yet it controls which genes get expressed by which cells. Plus there's a reasonable chance of errors in the sequencing, due to the need to copy the DNA repeatedly in order to identify the bases.
- Data size
Most bioinformatics data formats are plain ASCII, so even the reference data would be ~3GB per person. But 1000 Genomes contains sequencing reads, where each DNA base is sampled multiple times (20-40x is a typical "read depth") so that errors in identifying bases can be minimised. Each base in each of these reads has an associated quality score (roughly a 6-bit value), plus there are identifiers for all billion-odd reads per person.
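Rough arithmetic (coverage, bytes per base, and the overhead factor are assumed typical values, not the project's exact numbers); compression then brings the stored total back down toward the quoted 200TB.

    genome_len = 3_000_000_000   # bases in one haploid genome
    read_depth = 30              # each position sampled ~30 times
    bytes_per_base = 2           # FASTQ: ~1 ASCII char for the base + 1 for its quality
    overhead = 1.1               # assumed ~10% extra for read identifiers etc.

    per_person = genome_len * read_depth * bytes_per_base * overhead
    print(f"~{per_person / 1e9:.0f} GB per person")                 # ~200 GB
    print(f"~{per_person * 1700 / 1e12:.0f} TB for 1,700 people")   # a few hundred TB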
As to read errors, that's effectively just a 'mutation', and those are generally fairly harmless. If you stay below, say, 1,000 mutations, which would still take very high accuracy, you have not significantly reduced your chances of success.
There are a bunch of regions of 'N' in the reference sequence; most are just repeats.
The genome is incredibly complex, and yes, there's much we still can't represent accurately. As one example, some genes are given a single location in the reference genome, while every person actually has multiple copies that are scattered across the genome.
I'll also note that so much of the human genetics space is so new that most knowledge is still cooped up in academic papers - there aren't very good resources for the general public. A colleague helped edit a textbook on genetic analysis in 2007 (which would still be considered "new" in textbook space), and none of the tools in that book can be used with 1000 Genomes data.
-- I'm biased, but I think the best entry-level book is still from 1998 - Genome by Matt Ridley. It's what got me interested in genomics as a CS undergrad (I read it in 2009). Another decent one is The $1000 Genome, which gives a good cross-section of genetics in 2010.
-- Our group has a series of intro lecture videos from last academic year. They are fairly up to date, and this year's videos will probably be posted soon.
-- Genomes Unzipped is great if you prefer a blog.
-- I think the best way to actually learn this stuff is to just play with the software tools. They all point to open data in the tutorials. Biopython's tutorial is particularly good - just google the biological terms as you go (there's a tiny example after this list). Bioconductor has some good (though more targeted) tutorials too.
-- Going to talks can be a great way to get a broad overview of the space. Drop me a note if you happen to be located in Boston.
-- Finally, this goes without saying, but don't take articles in the mainstream media at face value. (Including, and in fact in particular, the NYT.) Every time I get together with relatives I have to argue against the latest grand prediction.
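On the Biopython suggestion above, here's about the smallest useful sketch of what its tutorial starts with (the filename is a placeholder; any FASTA file from the tutorial's open data works):

    from Bio import SeqIO

    # Iterate over sequences in a FASTA file and print a simple per-record summary.
    for record in SeqIO.parse("example.fasta", "fasta"):
        seq = record.seq
        gc = 100 * (seq.count("G") + seq.count("C")) / len(seq)
        print(record.id, len(seq), f"GC {gc:.1f}%")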
I haven't been following the progress for the past year and a half or so, but it was state of the art then.
Long story short, we're not synthesizing de-novo babies any time soon! Well, at least not in the lab.
This provides data in VCF format, which I would argue is the lowest level you want to go with this data unless you are doing variant calling methods development.
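For a sense of what that level looks like, a minimal sketch that reads only the fixed VCF columns with the standard library (the filename is a placeholder; for real work you'd want a proper parser or the tools below):

    import gzip

    def iter_vcf(path):
        """Yield (chrom, pos, id, ref, alts, filter, info) from a VCF file."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            for line in fh:
                if line.startswith("#"):   # skip header lines
                    continue
                chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
                yield chrom, int(pos), vid, ref, alt.split(","), flt, info

    # e.g. (placeholder path): for rec in iter_vcf("chr20.vcf.gz"): ...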
One tool you can use to analyze these data is PLINK/SEQ (disclaimer: I work on the project). If any C++ devs are interested in contributing, let me know. (We'll have the source on Github soon...still pushing my group to learn Git :)
just redirects me to the homepage.
That said, I've got a paper coming out real soon now that looked at the relationship between mutations across individual strains of Arabidopsis, and how (surprise surprise) you have fewer mutations in areas of proteins that potentially carry function, so I do think it's somewhat useful!
Gotta start somewhere. And 1000 data points is more than enough to get some statistically significant results.
Not the biggest deal in the world if you're seriously crunching on it... but something to consider.
Here's the project explanation, but the data itself is 1,000 complete (hopefully!) genome sequences. The hope is to find the variants, be they single nucleotide differences, copy number differences, etc. Where I work we use it to help identify SNPs (single nucleotide polymorphisms), to see if something we sequenced is a mutation or just a variant with no effect.
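A toy version of that "mutation or known variant?" check, assuming the known sites have been pulled out of a sites-only VCF (filenames and coordinates are made up for illustration):

    # Build a lookup of known variant sites, then test a candidate call against it.
    known = set()
    with open("known_sites.vcf") as fh:              # placeholder filename
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt = line.split("\t")[:5]
            for allele in alt.split(","):
                known.add((chrom, int(pos), ref, allele))

    candidate = ("20", 1234567, "G", "A")            # made-up coordinates
    print("known variant" if candidate in known else "possibly a novel mutation")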