
Big Data: Astronomical or Genomical? - chuckcode
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
======
glofish
Genomic data storage is very redundant: the SRA stores the raw data the way
it is collected, and that is extremely inefficient.

For example, if scientists were to collect data that matched the reference
human genome perfectly, they would still store all of it - tens or even
hundreds of gigabytes - yet the information content of all that is actually
one bit: the data matches the reference perfectly, and nothing more needs to
be stored at all.

In addition, current genomic data collection is a "prisoner" of the
technology: it produces lots of small "reads" (DNA fragments) instead of a
single genomic measurement. Because the data is so fragmented, we need to
measure the same position many times over (hundreds of times, even). Hence
even more redundancy - at least 100-fold.

Now magnify this over tens or hundreds of thousands of samples and naturally
we have a problem.

If we stored one sequenced human genome at one byte per base, it would take
3x10^9 bytes --> roughly 2.8 gigabytes.

But if instead we stored only the changes relative to the reference human
genome, it would take about 5x10^6 bytes --> roughly 5 megabytes. Hardly a
big challenge to store.
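
To make that concrete, here is a minimal Python sketch of reference-based
delta encoding; the function names and toy sequences are mine, and real
pipelines store variants in formats like VCF rather than ad-hoc tuples:

    def delta_encode(sample, reference):
        # Keep only (position, base) pairs where the sample differs.
        return [(i, s) for i, (s, r) in enumerate(zip(sample, reference)) if s != r]

    def delta_decode(deltas, reference):
        # Rebuild the sample by patching the reference.
        genome = list(reference)
        for pos, base in deltas:
            genome[pos] = base
        return "".join(genome)

    reference = "ACGTACGTACGT"
    sample    = "ACGTACCTACGA"
    deltas = delta_encode(sample, reference)      # [(6, 'C'), (11, 'A')]
    assert delta_decode(deltas, reference) == sample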

But here is another thing: why store the data at all when you could
re-sequence it? All we need is a faster sequencing methodology, and suddenly
the data does not need to be stored - only the sample does.

~~~
collyw
Lots of stuff genomics researchers do is inefficient. Generally they store
sequences as strings and compare those, when each base can be represented by
two bits (so four bases fit in one byte). Then you can compare them using
bitwise operators, which is very close to the way processors work and orders
of magnitude faster than comparing strings.
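
A rough sketch of what that could look like in Python; the 2-bit code
assignments below are arbitrary, and this handles only unambiguous A/C/G/T:

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2 bits per base

    def pack(seq):
        # Pack a sequence into one integer, 2 bits per base.
        n = 0
        for base in seq:
            n = (n << 2) | CODE[base]
        return n

    # Equality is now one integer operation instead of a string walk.
    print(pack("ACGTACGT") ^ pack("ACGTACGT") == 0)   # True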

~~~
mbreese
Sometimes genomics researchers do things inefficiently on purpose. You very
rarely want to check whether two sequences are exactly the same. Instead, you
usually look at how similar the sequences are, taking into account
ambiguities or mismatches. In that scenario, storing sequences in a
compressed 2-bit encoding isn't significantly faster (if it is at all).

~~~
collyw
It can be if you represent them correctly.

I count mismatches between sequences using bitwise operators by representing
each string as two binary numbers: one number stores a 1 for A/C and a 0 for
G/T, and the other stores a 1 for A/G and a 0 for C/T. It was at least an
order of magnitude faster; I benchmarked it. (It wasn't just comparing one
sequence to another - there were other factors too, such as having a red or
green laser at a particular position.)
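
As I read that scheme, it could be sketched like this in Python (the exact
plane assignments and names are my reconstruction, not the original code):

    def planes(seq):
        # Plane 1 has a 1 bit for A/C, plane 2 a 1 bit for A/G.
        p1 = p2 = 0
        for base in seq:
            p1 = (p1 << 1) | (base in "AC")
            p2 = (p2 << 1) | (base in "AG")
        return p1, p2

    def mismatches(seq_a, seq_b):
        a1, a2 = planes(seq_a)
        b1, b2 = planes(seq_b)
        # A position mismatches iff it differs in at least one plane.
        return bin((a1 ^ b1) | (a2 ^ b2)).count("1")

    print(mismatches("ACGT", "ACCT"))   # 1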

~~~
mbreese
And how do you represent an unknown base in that representation? Or, how
would you do the comparison with a gap in one of the sequences (which is
incredibly common)? I've benchmarked it all too, and it is really difficult
to represent DNA with just two bits. Once you move beyond the simplest
use-cases, it gets significantly more difficult. The most efficient
mechanisms require three bits, or at least some mechanism for denoting where
the ambiguous calls are (see UCSC's 2-bit format). And once you're dealing
with three bits, the numbers get funky quickly - 3-bit codes only line up
with byte boundaries every 24 bits, i.e. every 8 consecutive bases - and you
still have the problem of representing gaps efficiently... have you ever
tried to insert 4 bits into the middle of another number?
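
For illustration, one workaround, loosely in the spirit of UCSC's 2-bit
format (which records runs of Ns in separate blocks), is to keep the 2-bit
codes plus a mask flagging ambiguous positions; this flat-mask layout is a
simplification of mine:

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_with_mask(seq):
        bits = mask = 0
        for base in seq:
            bits = (bits << 2) | CODE.get(base, 0)   # an N packs as A...
            mask = (mask << 1) | (base == "N")       # ...but is flagged here
        return bits, mask

    bits, mask = pack_with_mask("ACNT")
    print(bin(bits), bin(mask))   # 0b10011 0b10

Gaps are still not handled, which is the harder problem.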

~~~
collyw
I didn't need to for the problem I was solving.

------
etrain
For anyone else confused about the Twitter numbers reported in Table 1, I
believe it should read "0.5-15 billion tweets/day" as opposed to "0.5-15
billion tweets/year" on the first line, which is consistent with the claim
that Twitter produces 500m tweets/day currently. When you make this
adjustment, you recover their annual storage estimates.
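
A quick sanity check of that corrected rate, using the figures above:

    tweets_per_day = 0.5e9            # low end, tweets/day
    bytes_per_tweet = 3e3             # 3KB/tweet, per the paper
    per_year = tweets_per_day * bytes_per_tweet * 365
    print(per_year / 1e15)            # ~0.55 PB/year at the low end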

Still - this estimate is based on 3KB/tweet, which is probably derived from
the raw Twitter XML feed. I'd expect this to compress easily to 10-30x
smaller than the authors' claimed numbers, and more with proper data
modeling.

Nevertheless - huge problems remain to be dealt with in the sciences and in
video.

------
chuckcode
Seems like better representation could save a few orders of magnitude in data
size. Humans differ at only about 1 in 1,000 nucleotides, so storing just the
differences rather than all genomic positions should greatly alleviate the
issue. In theory you would only have to store less than a megabyte of
nucleotides per personal genome. We will probably have to get rid of one-off
sequencing errors and quality scores at some point as we scale up to
planet-wide genomic sequencing.
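
Back-of-envelope, under those assumptions (note this counts only the
differing nucleotides themselves; storing their positions would add a few
bytes per site):

    genome_len = 3e9          # bases in a human genome
    diff_rate = 1 / 1000      # ~1 in 1,000 nucleotides differ
    variants = genome_len * diff_rate       # ~3 million sites
    print(variants * 2 / 8 / 1e6)           # ~0.75 MB at 2 bits per nucleotide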

~~~
temujin
Better representations are already used for downstream data analysis and
continue to be improved; see e.g. http://arxiv.org/abs/1506.08452 .

However, algorithms for producing variant calls from raw sequencer output are
also improving over time, and the way to get the most benefit from them is to
save the old raw data so newer algorithms can be applied to them. That's where
the storage challenges come into play.

------
meeper16
Analyzing the source that asks about analyzing itself would seem to be
connected to the largest data set: genomics and all the associated biology,
including the brain and consciousness. It's the most complex form of data we
can collect and get our hands on, so to speak.

~~~
collyw
What data do we have about consciousness?

