
200 TB of 1000 genomes data on Amazon S3 as a public dataset - bbgm
http://aws.typepad.com/aws/2012/03/the-1000-genomes-project.html
======
breck
I know very little about genomics and biology and would love it if someone who
knows more could answer my silly questions to help me understand this more:

\- This has genome data from 1,700 people. In theory could a company build a
machine that read one of these "files" as input and made a baby as output?

\- It says the dataset is 200 TB. If one person has 3 billion DNA bases (each
of which, as I understand it, can take 1 of 4 values, i.e. 2 bits), it seems
that even uncompressed the dataset would be 6 billion bits (<1 GB) per person,
and so only ~1.7 TB uncompressed for 1,700 people. What am I missing?
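To sanity-check that arithmetic, here is the same estimate as a few lines of Python. It sticks to the idealized 2-bits-per-base model from the comment; the published 200 TB figure presumably also covers raw sequencing reads, per-base quality scores, and many-fold coverage per sample, none of which this model counts.

```python
# Back-of-envelope check of the comment's estimate (assumption: one
# idealized genome = 3 billion bases, 2 bits per base, no metadata).
bases_per_person = 3_000_000_000
bits_per_base = 2  # A, C, G, T -> 4 values -> 2 bits

bytes_per_person = bases_per_person * bits_per_base / 8
print(f"{bytes_per_person / 1e9:.2f} GB per person")  # 0.75 GB

people = 1_700
total_tb = bytes_per_person * people / 1e12
print(f"{total_tb:.3f} TB for {people} people")  # 1.275 TB
```

Rounding up to 1 GB per person gives the ~1.7 TB figure in the comment; either way the idealized total is more than 100x smaller than 200 TB.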

~~~
mlwarren
Maybe some of the people contributing answers to this comment thread could
also suggest some literature on bioinformatics? (Specifically for software
engineers with limited biology backgrounds.)

~~~
bthomas
Can you be more specific about what you're interested in? I work in the space
(human genetics - we analyse the output of 1000 Genomes) and I'm happy to
help, but "bioinformatics" is pretty broad.

I'll also note that much of the human genetics space is so new that most
knowledge is still cooped up in academic papers - there aren't very good
resources for the general public. A colleague helped edit a textbook on
genetic analysis in 2007 (which would still be considered "new" in textbook
terms), and none of the tools in that book can be used with 1000 Genomes data.

~~~
mlwarren
Maybe some broad-sweeping entry level books or literature reviews would be
good. For me (and possibly many others on hn), the extent of my knowledge in
the field as a whole is pretty small so I think some introductory resources
geared towards someone coming from a strong software or math background would
be best.

~~~
bthomas
(The following are all for genetics, others can chime in on other areas or
bioinformatics in general.)

\-- I'm biased, but I think the best entry level book is still from 1998 -
Genome by Matt Ridley [1]. It's what got me interested in genomics as a CS
undergrad (I read it in 2009). Another decent one is The $1000 Genome [2]; it
gives a good cross section of genetics in 2010.

\-- Our group has a series of intro lecture videos from last academic year
[3]. They are fairly up to date, and this year's videos will probably be
posted soon.

\-- Genomes Unzipped is great if you prefer a blog [4]

\-- I think the best way to actually learn this stuff is to just play with the
software tools. They all point to open data in the tutorials. Biopython's
tutorial is particularly good - just google the biological terms as you go.
Bioconductor has some good (though more targeted) tutorials too.

\-- Going to talks can be a great way to get a broad overview of the space.
Drop me a note if you happen to be located in Boston.

\-- Finally, this goes without saying, but don't take articles in the
mainstream media at face value. (Including, and in fact in particular, the
NYT.) Every time I get together with relatives I have to argue against the
latest grand prediction.

[1] <http://www.amazon.com/Genome-The-Autobiography-Species-Chapters/dp/0060932902>

[2] <http://www.amazon.com/The-000-Genome-Revolution-Personalized/dp/1416569596>

[3] <http://www.broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/primers/primer-medical-and-pop>

[4] <http://www.genomesunzipped.org/>

~~~
joeroot
Thanks a lot for these - been looking for a solid set of resources on this for
a while.

------
bthomas
The best dataset for this audience is probably the most recent curated set of
variant calls on 1000genomes.org. (Incidentally, the data access links appear
down at the moment...)

This provides data in VCF format [1], which I would argue is the lowest level
you want to go with this data unless you are doing variant calling methods
development.
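For anyone who hasn't seen the format, here is a minimal pure-Python sketch of reading VCF data lines. The two inline records are toy examples adapted from the VCF spec's own sample file; real 1000 Genomes VCFs are gzipped, much wider, and carry a genotype column per sample.

```python
# Minimal VCF reader sketch using only the standard library.
# A VCF file has "##" meta lines, one "#CHROM ..." header line, then
# tab-separated data lines with 8 fixed columns (plus per-sample columns).
import io

vcf_text = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
20\t14370\trs6054257\tG\tA\t29\tPASS\tAF=0.5
20\t17330\t.\tT\tA\t3\tq10\tAF=0.017
"""

def parse_vcf(handle):
    for line in handle:
        if line.startswith("#"):  # skip meta lines and the column header
            continue
        chrom, pos, vid, ref, alt, qual, flt, info = \
            line.rstrip("\n").split("\t")[:8]
        yield {"chrom": chrom, "pos": int(pos), "id": vid,
               "ref": ref, "alt": alt, "filter": flt, "info": info}

for record in parse_vcf(io.StringIO(vcf_text)):
    print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])
```

Each record is one variant site: reference allele, alternate allele(s), and a FILTER column telling you whether the call passed quality thresholds.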

One tool you can use to analyze these data is PLINK/SEQ [2] (disclaimer: I
work on the project). If any C++ devs are interested in contributing, let me
know. (We'll have the source on Github soon...still pushing my group to learn
Git :)

[1] <http://www.1000genomes.org/node/101>

[2] <http://atgu.mgh.harvard.edu/plinkseq/>

~~~
apaprocki
Since you are involved with PLINK/SEQ, you might be able to answer this: can
you recommend any reading for learning how to perform admixture
analysis/visualization? I'm interested in seeing the workflow behind starting
with raw genomes and producing visualizations similar to Dodecad.

------
rickyconnolly
The burden has shifted from the bio to the informatics. We need to be able to
sort through this much data without needing access to university-scale
computer facilities.

~~~
gourneau
At Ion Torrent we are working on this problem as well. We are always hiring,
too: <http://iontorrent.com>. All of our software is open source:
<http://github.com/iontorrent>.

~~~
biopharma_guy
Which language is the software written in?

~~~
vedant
This is a question you can answer yourself by simply clicking on their github
link. Their four public repositories are in C, Lisp, C++, and Java.

------
hirenj
I'm still not 100% sold on these big sequencing projects. Sure, you get the
genomes of a bunch of individuals, and then you run a bunch of GWAS on them to
see if you can find some mutations that are linked to any phenotypes. It just
doesn't seem very exciting. I love the idea that if you're working on any cell
line/individual, you can just get it sequenced for ever-decreasing amounts of
money, but I'd love to be sold on some big-picture reasons why 1000 genomes
matters (and whether GWAS has actually lived up to its promise).

That said, I've got a paper coming out real soon now that looked at the
relationship between mutations across individual strains of Arabidopsis, and
how (surprise surprise) you have fewer mutations in areas of proteins that
potentially carry function, so I do think it's somewhat useful!

~~~
lisper
> why 1000 genomes matters

Gotta start somewhere. And 1000 data points is more than enough to get some
statistically significant results.

~~~
pedrobeltrao
Unfortunately, as far as I know there is very little useful phenotypic data to
work with alongside these genomes. I would rather have fewer genome sequences
with better phenotypic information. Even with just 35 individual yeast genome
sequences I could do much more interesting analysis, because those strains are
well studied in terms of phenotypes.

~~~
gpurkins
The point is to look at genomic variations across the entirety of the human
genome, not so much to match genomic variation to phenotype. What you're
saying makes sense, but I think that is more a next step.

------
tibbon
I dig this. However, unless I'm misunderstanding S3 usage, to use it you
essentially have to clone that entire S3 bucket and pay the monthly storage
cost for 200 TB yourself - correct?

Not the biggest deal in the world if you're seriously crunching on it... but
something to consider.

~~~
bbgm
The 1000 genomes bucket is open to the public, so you don't need to clone it
for your own purposes. Any data you generate will need to be in your own
bucket.

~~~
hirenj
And the bucket you put your intermediate data in will itself be pretty large.
Or any processing you do on this data will be computationally expensive.
Either way, Amazon is attracting some new customers.

------
pruman
Has anyone tried writing a life emulator that could read this genetic code and
then virtually generate a baby? This would be really cool for injecting
oneself into an MMORPG, at least one where you could be a baby.

------
apawloski
Similar to other questions, but can anyone explain what exactly this data is
and what analysis is being done on it?

~~~
gpurkins
<http://en.wikipedia.org/wiki/1000_Genomes_Project>

Here's the project explanation, but the data itself is 1,000 complete
(hopefully!) genome sequences. The hope is to find the variants, be they
single nucleotide differences, copy number differences, etc. Where I work, we
use it to help identify SNPs (single nucleotide polymorphisms) - to see if
something we sequenced is a mutation or just a variant with no effect.
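That lookup can be sketched in a few lines. Everything here is invented for illustration - the positions and the tiny dictionary standing in for the 1000 Genomes variant catalog; a real pipeline would query the project's VCFs instead.

```python
# Toy version of "is this a known population variant or a novel mutation?"
# (assumption: hypothetical positions and a made-up miniature catalog).
known_variants = {  # (chrom, pos) -> alternate alleles seen in the population
    ("20", 14370): {"A"},
    ("20", 17330): {"A"},
}

def classify(chrom, pos, allele):
    alts = known_variants.get((chrom, pos))
    if alts and allele in alts:
        return "known population variant"
    return "not in catalog (candidate novel mutation)"

print(classify("20", 14370, "A"))  # known population variant
print(classify("20", 99999, "C"))  # not in catalog (candidate novel mutation)
```

The point is just the shape of the check: a variant seen at appreciable frequency across 1,000+ unrelated people is unlikely to be a damaging de novo mutation.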

------
ayers
I would be interested in how much it costs to host this dataset every month.
Can anyone shed some light on this?

