For example, if scientists were to collect data that matched the human genome perfectly, they would still store all of it, tens or even hundreds of gigabytes, yet the information content of all that data is effectively one bit: the data matches the human genome perfectly, and nothing more needs to be stored at all.
In addition, current genomic data collection is a "prisoner" of the technology: it produces lots of small "reads" (DNA fragments) instead of a single genomic measurement. Because the data is so fragmented, we need to measure the same thing many times over (hundreds of times, even), which adds at least another 100-fold of redundancy.
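To put rough numbers on that redundancy, here is a quick back-of-the-envelope calculation; the coverage depth and read length below are illustrative assumptions for short-read sequencing, not figures from the comment above:

```python
# Back-of-the-envelope: raw read data produced when sequencing one genome.
# COVERAGE and READ_LENGTH are illustrative assumptions, not fixed facts.
GENOME_BASES = 3e9    # ~3 billion bases in a human genome
COVERAGE = 100        # each position read ~100 times over (the "100 fold" redundancy)
READ_LENGTH = 150     # a typical short-read length, in bases

raw_bases = GENOME_BASES * COVERAGE   # total bases actually coming off the sequencer
num_reads = raw_bases / READ_LENGTH   # how many small fragments that turns into

print(f"raw bases sequenced: {raw_bases:.0e}")  # ~3e11, i.e. 100x the genome itself
print(f"number of reads:     {num_reads:.0e}")  # ~2e9 short fragments
```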
Now magnify this over tens or hundreds of thousands of samples and naturally we have a problem.
If we were to store one complete human genome (about 3x10^9 bases at one byte each), it would take roughly 3x10^9 bytes --> about 3 gigabytes.
But if instead we stored only the changes relative to the reference human genome, that would take about 5x10^6 bytes --> roughly 5 megabytes. Hardly a big challenge to store.
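Just to restate that arithmetic (decimal units, and the byte counts above taken at face value):

```python
# Rough storage arithmetic for one genome, decimal units (1 GB = 1e9 bytes).
bytes_full_genome = 3e9     # whole genome, roughly one byte per base
bytes_variants_only = 5e6   # only the differences from the reference

print(f"full genome:   {bytes_full_genome / 1e9:.0f} GB")    # ~3 GB
print(f"variants only: {bytes_variants_only / 1e6:.0f} MB")  # ~5 MB
print(f"savings:       {bytes_full_genome / bytes_variants_only:.0f}x")  # ~600x smaller
```

That roughly 600x gap is the whole argument for storing diffs against a reference rather than full sequences.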
But here is another thing: why store that data when you could just re-sequence it? All we need is a faster sequencing methodology, and all of a sudden the data does not need to be stored, only the sample does.
CRAM files are alignment files like BAM files, but they store a further-compressed representation of the alignment, where the compression is driven by the reference the sequence data is aligned to.
Benchmarks here: http://www.htslib.org/benchmarks/CRAM.html
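For what it's worth, here is roughly what writing a CRAM looks like with pysam (a wrapper around the same htslib those benchmarks come from). The file names are placeholders; the key point is that a reference FASTA must be supplied, since that is what the reference-based compression works against:

```python
import pysam

# Convert an existing BAM to CRAM; the reference FASTA drives the compression,
# because read bases get stored as differences against it.
with pysam.AlignmentFile("sample.bam", "rb") as bam, \
     pysam.AlignmentFile("sample.cram", "wc", template=bam,
                         reference_filename="reference.fa") as cram:
    for read in bam:
        cram.write(read)
```

The equivalent samtools command is roughly `samtools view -C -T reference.fa -o sample.cram sample.bam`.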
ETA: wait, hold on. To compare two pairs of bits, you can do an xor, mask out the bits you care about, compare that against 0, and jump based on the flag. To compare two bytes, you can do an xor, mask out the bits you care about if you're packing them into something larger, compare that against 0, and jump based on the flag. Totally not seeing the advantage here.
I count mismatches between sequences using bitwise operators by representing each string as two binary numbers: one number stores a 1 for A/C and a 0 for G/T, and the other stores a 1 for A/G and a 0 for C/T. It was at least an order of magnitude faster; I benchmarked it. (It wasn't just comparing one sequence to another, but also other factors, such as having a red or green laser at a particular position.)
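A minimal sketch of that two-plane encoding and XOR-based mismatch count, in Python; the lookup tables and helper names here are mine, not the poster's, and it leaves out the extra per-position factors (like the laser channel) mentioned above:

```python
# Each base gets a distinct 2-bit code spread across two "planes":
# A=(1,1), C=(1,0), G=(0,1), T=(0,0). Two bases match iff both planes match.
PLANE1 = {"A": 1, "C": 1, "G": 0, "T": 0}   # 1 for A/C, 0 for G/T
PLANE2 = {"A": 1, "G": 1, "C": 0, "T": 0}   # 1 for A/G, 0 for C/T

def encode(seq):
    """Pack a DNA string into two integers, one bit per base per plane."""
    p1 = p2 = 0
    for base in seq:
        p1 = (p1 << 1) | PLANE1[base]
        p2 = (p2 << 1) | PLANE2[base]
    return p1, p2

def count_mismatches(seq_a, seq_b):
    """Count differing positions between two equal-length sequences."""
    a1, a2 = encode(seq_a)
    b1, b2 = encode(seq_b)
    diff = (a1 ^ b1) | (a2 ^ b2)    # one set bit per mismatching position
    return bin(diff).count("1")     # popcount

print(count_mismatches("ACGTACGT", "ACGTACGA"))  # 1 mismatch (T vs A at the end)
```

The win over the per-byte compare-and-branch approach is that a single XOR/OR plus a popcount handles a whole packed word of bases at once, with no per-base branching, which is where the speedup comes from.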
Still, this estimate is based on 3 KB/tweet, which is probably derived from looking at the raw Twitter XML feed. I'd expect this to compress easily to 10-30x smaller than the author's claimed numbers, and more with proper data modeling.
Nevertheless, there are huge data problems to be dealt with in the sciences and in video.
However, algorithms for producing variant calls from raw sequencer output are also improving over time, and the way to get the most benefit from them is to save the old raw data so that newer algorithms can be applied to it. That's where the storage challenges come into play.