Big Data: Astronomical or Genomical? (plos.org)
24 points by chuckcode on July 10, 2015 | 16 comments

Genomic data storage is very redundant: the SRA stores the raw data the way it is collected, and that is extremely inefficient.

For example, if scientists were to collect data that matched the reference human genome perfectly, they would still store all of it, tens or even hundreds of gigabytes - yet the information content of all that is effectively one bit: the data matches the reference perfectly; nothing more needs to be stored at all.

In addition, current genomic data collection is a "prisoner" of the technology. It produces lots of small "reads" (DNA fragments) instead of a single genomic measurement. Because the data is so fragmented, we need to measure the same position many times over (hundreds of times, even), adding at least another 100-fold redundancy.

Now magnify this over tens or hundreds of thousands of samples and naturally we have a problem.

If we could sequence one human genome in a single pass at one byte per base, it would take 3x10^9 bytes --> about 2.8 GiB.

But if we were instead to store only the changes relative to the reference human genome, that would take 5x10^6 bytes --> about 4.8 MiB. Hardly a big challenge to store.
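
To put rough numbers on all of the above in one place (Python, using the figures from this comment plus an assumed 100x coverage - illustrative only):

    # Back-of-the-envelope storage estimates from the figures above
    # (assumptions for illustration, not measurements).
    bases = 3e9         # ~3 billion bases in one human genome
    raw = bases * 1     # one byte per base, single pass
    coverage = 100      # each position measured ~100 times over
    diff = 5e6          # bytes of changes relative to the reference

    print(f"single pass : {raw / 2**30:.1f} GiB")             # ~2.8 GiB
    print(f"at 100x     : {raw * coverage / 2**30:.0f} GiB")  # ~279 GiB
    print(f"diff only   : {diff / 2**20:.1f} MiB")            # ~4.8 MiB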

But here is another thing: why store that data at all when you could re-sequence it? All we need is a faster sequencing methodology, and all of a sudden the data does not need to be stored, only the sample does.

Note that the CRAM format does use the reference genome to compress sequences.


CRAM files are alignment files like BAM files. They represent a compressed version of the alignment. This compression is driven by the reference the sequence data is aligned to.

Benchmarks here: http://www.htslib.org/benchmarks/CRAM.html
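
For anyone wondering what "driven by the reference" means in practice, the principle fits in a few lines of Python (the idea only - the real CRAM encoding is considerably more involved):

    # Store only the positions where an aligned read disagrees with the
    # reference; reconstruct the read from the reference plus those diffs.
    def encode(reference, read, offset):
        return [(offset + i, b)
                for i, (r, b) in enumerate(zip(reference[offset:], read))
                if r != b]

    def decode(reference, offset, length, diffs):
        read = list(reference[offset:offset + length])
        for pos, base in diffs:
            read[pos - offset] = base
        return "".join(read)

    ref = "ACGTACGTACGT"
    read = "ACGAACGT"  # one mismatch at reference position 3
    diffs = encode(ref, read, 0)
    assert diffs == [(3, "A")]
    assert decode(ref, 0, len(read), diffs) == read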

Classic compression problem. There is no free lunch in that you'd still need to compute the 'difference', which is a combinatorial search. If you could indeed compute the difference, technically you would throw away the raw data and store only differences. But you'd still need to temporarily have access to the raw data.

Lots of what genomics researchers do is inefficient. Generally they store sequences as strings and compare those, when 4 bases can be represented by two bytes. Then you can compare them using bitwise operators, which is very close to the way processors work, and orders of magnitude faster than comparing strings.

Where "orders of magnitude" here means "four times"? (and one suspects the additional complexity will not do good things to your instruction cache)

ETA: wait, hold on. To compare two pairs of bits, you can do an xor, mask out the bits you care about, compare that against 0, and jump based on the flag. To compare two bytes, you can do an xor, mask out the bits you care about if you're packing them into something larger, compare that against 0, and jump based on the flag. Totally not seeing the advantage here.

Sometimes genomics researchers do things inefficiently on purpose. You very rarely want to look to see if two strings are exactly the same. Instead, you usually try to look at how similar the sequences are, taking into account ambiguities or mismatches. In this scenario, storing sequences compressed as 2-bit encoding isn't significantly faster (if it is at all).

It can be if you represent them correctly.

I count mismatches between sequences using bitwise operators by representing each string as two binary numbers: one number stores a 1 for A/C and a 0 for G/T, and the other stores a 1 for A/G and a 0 for C/T. It was at least an order of magnitude faster; I benchmarked it. (It wasn't just comparing one sequence to another, but also other factors, such as having a red or green laser in a particular position.)
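
Roughly, in Python (a sketch of the representation; the names and encoding details here are illustrative, not my actual code):

    # Two bit-plane encoding: one integer has a 1 bit per position for A/C
    # (0 for G/T), the other a 1 bit for A/G (0 for C/T).
    # Each base becomes a (plane1, plane2) pair: A=11, C=10, G=01, T=00.
    PLANES = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}

    def pack(seq):
        p1 = p2 = 0
        for base in seq:
            b1, b2 = PLANES[base]
            p1 = (p1 << 1) | b1
            p2 = (p2 << 1) | b2
        return p1, p2

    def mismatches(a, b):
        """Count positions where two equal-length sequences differ."""
        a1, a2 = pack(a)
        b1, b2 = pack(b)
        # A position differs if either bit plane differs there; popcount
        # the combined difference mask (int.bit_count needs Python 3.10+).
        return ((a1 ^ b1) | (a2 ^ b2)).bit_count()

    assert mismatches("ACGT", "ACGT") == 0
    assert mismatches("ACGT", "ACGA") == 1  # T vs A: both planes differ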

And how do you represent an unknown base in that representation? Or how would you do the comparison with a gap in one of the sequences (which is incredibly common)? I've benchmarked it all too, and it is really difficult to represent DNA with just two bits per base. Once you get past the simplest use cases, it gets significantly more difficult. The most efficient mechanisms require three bits, or at least some mechanism for denoting where the ambiguous calls are (see UCSC's 2bit format). And once you're dealing with three bits, the numbers get funky quickly - you end up dealing with 24 bits for 8 consecutive bases before you're byte-aligned again - and you still have the problem of representing gaps efficiently... have you ever tried to insert 4 bits in the middle of another number?
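
(If you haven't: it looks something like this sketch - manageable for one integer, miserable across a packed array, since everything after the insertion point has to move.)

    # Inserting `width` new bits into the middle of a packed integer:
    # split at the insertion point, shift the high half, and recombine.
    def insert_bits(value, pos, bits, width=4):
        low = value & ((1 << pos) - 1)  # bits below the insertion point
        high = value >> pos             # bits at and above it
        return (high << (pos + width)) | (bits << pos) | low

    # 0b11110000 with 0b1010 inserted at bit 4 -> 0b111110100000
    assert insert_bits(0b11110000, 4, 0b1010) == 0b111110100000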

I didn't need to for the problem I was solving.

I believe you have a typo and meant 2 bits.

Yes, you are right.

For anyone else confused about the Twitter numbers reported in Table 1, I believe it should read "0.5-15 billion tweets/day" as opposed to "0.5-15 billion tweets/year" on the first line, which is consistent with the claim that Twitter produces 500m tweets/day currently. When you make this adjustment, you recover their annual storage estimates.
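
A quick sanity check of that adjustment, using the paper's 3KB/tweet figure (just arithmetic):

    # 0.5-15 billion tweets/day at ~3 KB each, accumulated over a year:
    KB, PB = 1e3, 1e15
    for tweets_per_day in (0.5e9, 15e9):
        per_year = tweets_per_day * 3 * KB * 365
        print(f"{tweets_per_day:.1e} tweets/day -> {per_year / PB:.1f} PB/year")
    # ~0.5 and ~16.4 PB/year, in the ballpark of the table's annual estimates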

Still - this estimate is based on 3KB/tweet, which is probably derived from looking at the raw twitter XML feed - I'd expect this to compress easily down to 10-30x smaller than the author's claimed numbers - more with proper data modeling.

Nevertheless - huge problems to be dealt with in the sciences and video.

Seems like better representation could save a few orders of magnitude in data size. Humans only differ at about 1 in 1000 nucleotides, so storing just the differences rather than all genomic positions should greatly alleviate the issue. In theory you would only have to store less than a megabyte of nucleotides per personal genome. We'd probably have to get rid of one-off sequencing errors and quality scores at some point as we scale up to planet-wide genomic sequencing.
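
Worked out (ignoring variant positions, indels, and metadata, which add real overhead in practice):

    # ~1 differing nucleotide per 1000, over a 3-billion-base genome:
    differing = 3e9 / 1000         # ~3 million differing nucleotides
    raw_2bit = differing * 2 / 8   # at two bits per nucleotide
    print(f"{raw_2bit / 1e6:.2f} MB")  # ~0.75 MB of nucleotides per genome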

Better representations are already used for downstream data analysis and continue to be improved; see e.g. http://arxiv.org/abs/1506.08452.

However, algorithms for producing variant calls from raw sequencer output are also improving over time, and the way to get the most benefit from them is to save the old raw data so newer algorithms can be applied to them. That's where the storage challenges come into play.

Analyzing the source that is asking about analyzing itself would seem to be connected to the largest data set: genomics and all the associated biology, including the brain and consciousness. It's the most complex form of data we can collect and get our hands on, so to speak.

What data do we have about consciousness?

