

Download the whole genome of Cannabis Sativa - scottshea
http://csativa.elasticbeanstalk.com/

======
wisty
To people who make big datasets available:

Please provide a sample of the dataset, so people don't just download the
whole lot (costing you bandwidth), then realize they don't have the ability to
do _anything_ with it.

------
rickdale
I am a novice programmer, but a full-time gardener. Can someone please explain
to me what the significance of this is? I can't understand what data is in the
datasets and what I will be able to figure out if I do go for the 400 GB
download.

I have lots of marijuana/entrepreneurial posts and this one seems to have
gained some good interest. I am curious whether someone can explain more. Thanks.

------
yuvadam
Nice stuff.

It would be really nice if someone with some background in bioinformatics
could provide examples of operations that could be done on such a huge data
set, and what could be learned from such computations.

~~~
epistasis
I don't have much interest in this dataset, but I can describe what would
generally be done.

The data here comes from tiny fragments of the genome. On each end of the
fragment, 100 base pairs of DNA are sequenced, and that's what is in the text
files, along with a confidence estimate for each called base pair. These bits
of sequence are called "reads."
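
As a rough illustration of what one read looks like, here is a minimal sketch
that assumes the text files are in FASTQ format with Phred+33 quality encoding
(the thread does not say which format is used, so treat that as an
assumption). The function name `parse_fastq` and the example record are made
up for the illustration.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a FASTQ text stream.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        # Phred+33: Q = ord(char) - 33, error probability = 10 ** (-Q / 10)
        quals = [ord(c) - 33 for c in qual]
        yield header.strip()[1:], seq, quals

# Hypothetical single record, held in memory instead of a real file.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
```

A quality of 20 corresponds to an estimated 1 in 100 chance that the base call
is wrong, and a quality of 40 to 1 in 10,000.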

The first step would be to take all these fragments and assemble them into a
full genome. The page notes that they have 327x coverage, meaning that on
average, there will be 327 reads that overlap a given region of the genome. To
assemble the genome, you find these parts that overlap, and try to expand out
from there.
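
To make "coverage" and "find the parts that overlap" concrete, here is a toy
sketch. It is a naive greedy overlap merger, not how production assemblers
work, and the reads and the numbers passed to `coverage` are invented for the
example. It also ignores sequencing errors and reverse complements.

```python
def coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Invented toy reads that tile a 12-base sequence.
toy_reads = ["GATTAC", "TTACAG", "ACAGGT", "AGGTCC"]
contigs = greedy_assemble(toy_reads)

# Invented numbers: 1,000 reads of 100 bases over a 10,000-base genome.
depth = coverage(1000, 100, 10000)
```

The all-pairs comparison is quadratic in the number of reads, which is one
reason this naive approach does not scale to a dataset of this size.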

Genome assembly is complicated by many things, among them regions that get
very few fragments and repetitive regions. If you have a 150bp segment
repeated many times, often you won't know how long the repeated stretch is,
because your reads are only 100bp; unless you have really long fragments or
varying fragment sizes, you won't be able to estimate the length of the repeat.
Also, if the repeat occurs in other places, likely many, you have additional
difficulty matching what's on the left and right side of each copy. This is
an active area of research, and people are still building new assemblers and
assembly techniques to keep up with changes in sequencing technology. It's
unlikely that you'd get much of an assembly from a dataset like this alone;
one would likely have to perform many more targeted sequencing projects to
fill in the gaps.
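
The repeat problem can be shown with a toy case. In this sketch (all sequences
invented, with the read length shrunk to 6 so it fits on screen), two
"genomes" differ only in how many times a 4-base unit is repeated, yet they
produce exactly the same set of distinct reads.

```python
def kmers(seq, k):
    """Set of all distinct length-k substrings, standing in for error-free reads."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Invented flanking sequence and repeat unit.
LEFT, UNIT, RIGHT = "TTGCA", "ACGT", "GGATC"
genome_a = LEFT + UNIT * 5 + RIGHT   # repeat is 20 bases long
genome_b = LEFT + UNIT * 8 + RIGHT   # repeat is 32 bases long

READ_LEN = 6  # shorter than either repeat
same_reads = kmers(genome_a, READ_LEN) == kmers(genome_b, READ_LEN)
different_genomes = genome_a != genome_b
```

Both flags come out true: no read is long enough to span the repeat, so the
distinct reads alone cannot tell the two lengths apart. This toy compares sets
only and ignores how often each read occurs.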

However, with the beginnings of an assembly, the first thing to do is compare
it to the genomes of related species. This will identify both the genes, and
what has changed the most in the genes. One could attempt to identify which
genes are responsible for interesting metabolism or other features of the
species. If you have a trait of interest, and sequence other variants of the
species, you can perform QTL (quantitative trait locus) mapping to find out
which parts of the genome (and which genes) are most highly associated with
your traits of interest. This has been used extensively in agricultural
genetics.
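
For a feel of what "associated with a trait" means, here is a deliberately
simplified single-marker scan. It is not real QTL mapping (which uses
statistical models and significance tests); it just compares the mean trait
value between plants carrying allele 0 and allele 1 at each marker. The marker
names, genotypes, and trait values are all invented toy data.

```python
from statistics import mean

def marker_scan(genotypes, trait):
    """Per marker: absolute difference in mean trait between allele-1 and allele-0 plants."""
    scores = {}
    for marker, alleles in genotypes.items():
        group0 = [t for a, t in zip(alleles, trait) if a == 0]
        group1 = [t for a, t in zip(alleles, trait) if a == 1]
        if group0 and group1:  # need both groups to compare
            scores[marker] = abs(mean(group1) - mean(group0))
    return scores

# Invented data: six plants, three markers, one measured trait.
toy_genotypes = {
    "m1": [0, 0, 0, 1, 1, 1],
    "m2": [0, 1, 0, 1, 0, 1],
    "m3": [1, 1, 0, 0, 1, 0],
}
toy_trait = [2.0, 2.5, 1.5, 6.0, 5.5, 6.5]

scores = marker_scan(toy_genotypes, toy_trait)
top_marker = max(scores, key=scores.get)
```

In this toy data the marker `m1` separates the low and high trait values most
cleanly, so it gets the largest score.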

After that, my interest would be to get gene expression levels, to learn how
genes act in combination.

All this is a huge undertaking that takes a lot of manual intervention, either
from an expert or a highly-motivated and highly-talented non-expert. DNA
sequencing costs are falling dramatically, but the cost of using that sequence
productively is not falling much. This dataset probably cost < $20k to
generate, but it's going to cost at least that much in person time before
anything useful can come from it.

------
jamesbkel
Since they also make the data available as a public snapshot on AWS, it would
seem more practical to just look at the data that way instead of downloading.

------
Luyt
Prepare yourself for a 400 GB download. The software to process this quantity
of data must be truly massively scaled.

~~~
tommi
It depends on how you want to process it. You don't always want to apply
complex operations to the whole data set. You may want to process only a
certain subset of it.
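
As one example of working on a subset, this sketch reads only the first few
records from a compressed stream and stops there, so nothing else is
decompressed. It assumes gzip-compressed text with four lines per record (a
guess about the files, not something stated on the page); `head_records` is a
made-up helper and the demo data is generated in memory.

```python
import gzip
import io
from itertools import islice

def head_records(path_or_file, n_records, lines_per_record=4):
    """Return the first n_records records, reading only as many lines as needed."""
    with gzip.open(path_or_file, "rt") as handle:
        wanted = n_records * lines_per_record
        lines = [line.rstrip("\n") for line in islice(handle, wanted)]
    return [lines[i:i + lines_per_record]
            for i in range(0, len(lines), lines_per_record)]

# In-memory stand-in for a large compressed file (1,000 invented records).
raw = "".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(1000)).encode()
demo = io.BytesIO(gzip.compress(raw))

sample = head_records(demo, 2)
```

Note that the first records of a file are not a random sample; they are only
good for checking the format and trying out a pipeline.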

------
desaila
This is the sort of project Steve Yegge was talking about in his presentation
where everyone thought he quit Google. A lot of good could come from data
mining this data, but everyone's too busy sharing cat pictures.

------
mannicken
Whoa, dude.

