
The Desperate Quest for Genomic Compression Algorithms - yaseen-rob
https://spectrum.ieee.org/computing/software/the-desperate-quest-for-genomic-compression-algorithms
======
evandijk70
There seems to be confusion in this thread on what exactly is stored.

The human genome is about 3.25 billion bases, or about 6.5 billion bases if
you determine both copies of each chromosome. This corresponds to 13 billion
bits, or about 1.5 gigabytes, and storing this would be no problem at all,
even if a lot of genomes were sequenced.

However, the way human genomes are sequenced is different: the genome is
fragmented into random pieces around 500 base pairs long. Illumina sequencing
machines typically read 300 base pairs of these fragments, with a ~1% error
rate.

To be able to determine the entire genome, these reads are mapped to a
reference genome, and differences with the reference genome are marked as
mutations. However, to distinguish real mutations from random read errors, and
to make sure that mutations can be confidently called even in regions where,
through random variation, fewer reads happen to map (i.e. regions of lower
coverage), a ~30x average coverage of the genome is needed.

If you want to store the raw sequencing data and not just the called results,
a single genome no longer requires 1.5 GB of storage but roughly 45 GB. It
gets even worse when you consider samples from tumours, where biologically
relevant mutations are often found in only a subset of cells. To discover
these mutations, a coverage of 100x is often recommended, leading to roughly
150 GB of storage for a single genome.
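
A back-of-the-envelope sketch of that arithmetic; the bytes-per-sequenced-base
constant is an assumption chosen to reproduce the 45 GB / 150 GB figures
above, not a property of any particular file format:

    # Rough sketch of the storage figures above. BYTES_PER_SEQUENCED_BASE is
    # an assumed constant (roughly compressed-BAM territory); raw FASTQ with
    # quality strings would be larger.
    GENOME_BASES = 3.25e9            # haploid human genome length
    BYTES_PER_SEQUENCED_BASE = 0.46  # assumed on-disk bytes per base read

    def raw_storage_gb(coverage):
        """Approximate on-disk size of a whole-genome sequencing run."""
        return GENOME_BASES * coverage * BYTES_PER_SEQUENCED_BASE / 1e9

    diploid_gb = 2 * GENOME_BASES * 2 / 8 / 1e9
    print(f"called diploid genome (2 bits/base): {diploid_gb:.1f} GB")  # ~1.6 GB
    print(f"30x coverage run:  ~{raw_storage_gb(30):.0f} GB")           # ~45 GB
    print(f"100x coverage run: ~{raw_storage_gb(100):.0f} GB")          # ~150 GB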

~~~
mmt
> It gets even worse when you consider samples from tumours, where
> biologically relevant mutations are often found in only a subset of cells.
> To discover these mutations, a coverage of 100x is often recommended

This point brings up another, implicit, scale multiplier that the average
person might assume doesn't exist: that a single individual could need
multiple sequences stored.

One can't just estimate required total (uncompressed) storage by multiplying
population (mod monozygotic multiple births and chimerism) by the size of one
instance of raw sequencing data.

~~~
breckuh
> a single individual could need multiple sequences stored.

And multiple sequences could be hundreds, thousands, or more. Example: single-
cell RNAseq pipelines that generate sequences for 10k cells at once.

Perhaps in the future--as tech improves--the devices themselves might be
equipped to run more real-time, in-memory QC and emit less--but more
accurate--data? I believe this is somewhat how CERN does it, where they only
record a filtered subset of the data that comes in.

------
Scaevolus
This reads like a puff piece for MPEG-G, which will presumably be as patent-
encumbered as every other MPEG standard. The double power law for mutation
distance is interesting, but the application is feeding an entropy coder more
accurate probabilities of match distances. Any decent entropy coder would
implicitly discover the distribution after a short period.

There are already _good_ compressors for genetic data. The article briefly
mentions them, but oddly doesn't name them explicitly. Here's a 2013 paper
summarizing the research (like compressing 6.3GB to 1.1GB or 750MB with a
reference):
[http://journals.plos.org/plosone/article?id=10.1371/journal....](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190)

------
chrchang523
It's necessary to distinguish between the raw output of the sequencing
instrument and the actual genomic variables of interest here.

Adequate compression of the latter, with appropriate confidence estimates
attached, is a solved problem for the most common applications, even when
hundreds of thousands of samples are involved: the UK Biobank has released a
dataset describing whole-genome variation for almost 500,000 British people,
and many research groups around the world are successfully analyzing this data
today with freely available software running on commodity hardware.

It's the raw measurements that aren't compressing as well as we'd want today.
But there are multiple other solutions to this problem besides better
compression. We could become good enough at inferring the relevant genomic
variables from the measurements that, once we've done so, we can discard the
raw data (we don't do this today because the inference process is still very
much a work in progress, so the option of running tomorrow's algorithms on
today's measurements is too valuable). Or measurement could become cheap
enough, and biological sample storage stable enough, that the measurement
process can simply be repeated when necessary. Or entirely different
measurement processes could be developed that yield adequate information
without simultaneously generating so much nuisance data to preserve.

------
beagle3
The A, G, C, T system is apparently insufficient to completely describe the
genome; see [0]. I would not be surprised if, in 20-30 years, there will be a
Y2K-style "4 bases" effort to fix all software and databases to account for
the missing bases. It's obvious that the extra ones play (at most) a small
part, but at some point that small part might be what is stalling progress.

[0]
[https://www.sciencedaily.com/releases/2011/07/110721142408.h...](https://www.sciencedaily.com/releases/2011/07/110721142408.htm)

~~~
no_identd
This seems like a misleading press release. This article basically talks about
(newish[2011], additional) epigenetic mechanisms.

Epigenetic sequencing would indeed require even more storage.

------
T-A
Or we could just store it all in DNA:

[https://www.wired.com/story/the-rise-of-dna-data-storage/](https://www.wired.com/story/the-rise-of-dna-data-storage/)

~~~
lainga
I'll have you know that I sequenced _and_ compressed both my parents' DNA at a
very young age, although the process was about 50% lossy.

~~~
fubar2018aug
hasn't everyone?

~~~
adyavanapalli
That's the joke XD

------
mjevans
I am completely unaware of all prior art in this specific field.

However, the most obvious starting idea to me is that most humans will have
mostly the same DNA, with some small 'configuration' changes and maybe
variations in some smaller sections.

Thus, the pre-filter would find a good fit against the reference copy, compute
a diff that's sparse, and compress that.

Someone more versed in the field could probably provide a better estimate of
how to address that. Taking the 3.25 billion bases (addresses) mentioned in
another post, 32 bits of storage is sufficient for an unsigned index. The
reference variation should also be stored, but I don't know how many
variations each section might have.
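
A minimal, purely illustrative sketch of that idea (hypothetical helper names;
real reference-based formats such as CRAM are far more sophisticated): store
each mismatch as a 32-bit position plus the substituted base, then compress
the sparse diff.

    # Toy diff-against-reference encoder: 32-bit position + 1-byte base per
    # mismatch, followed by general-purpose compression of the sparse diff.
    import struct
    import zlib

    def encode_diff(sample, reference):
        diff = b"".join(
            struct.pack("<IB", i, ord(s))
            for i, (s, r) in enumerate(zip(sample, reference))
            if s != r
        )
        return zlib.compress(diff)

    def decode_diff(blob, reference):
        seq = list(reference)
        raw = zlib.decompress(blob)
        for off in range(0, len(raw), 5):
            pos, base = struct.unpack_from("<IB", raw, off)
            seq[pos] = chr(base)
        return "".join(seq)

    ref = "ACGT" * 250                    # toy 1,000-base reference
    sample = ref[:500] + "T" + ref[501:]  # one substitution
    assert decode_diff(encode_diff(sample, ref), ref) == sample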

Probably, after a sufficiently large collection is amassed, the variants
could be sorted so that the most common ones are assigned a more compressible
prefix (and are also tested first by the compressor).

~~~
rishav_sharan
I don't understand this field, but wouldn't creating a reference model be an
ethical and cultural nightmare? I would assume that researchers would be
forced to make a very diluted version of this reference model which cannot be
tied to any race, and which at the end of the day may not be very useful.

~~~
shusson
On average there is only a 0.6% difference between human genomes [1]. And the
reference genome already exists and is constantly being updated [2].

[1]
[https://en.wikipedia.org/wiki/Human_genetic_variation](https://en.wikipedia.org/wiki/Human_genetic_variation)
[2]
[https://en.wikipedia.org/wiki/Reference_genome](https://en.wikipedia.org/wiki/Reference_genome)

------
bra-ket
You can either compress the data for general use or optimize it for a
particular application, e.g. a search index; in the latter case the
compression ratio can be orders of magnitude better than general-purpose
compression.

I think improving general compression is of marginal importance given the
abundance of existing compression algorithms; it is mostly a solved problem
with many decades of productive research behind it.

The biggest ‘bang for the buck’ is in optimizing for particular applications
such as search, gene detection, SNP calling, alignment, etc.

------
TD-Linux
This article totally failed to justify why compression algorithms are so
important for this application. How big is the human genome, how much can we
compress it by, and why does this matter?

I took a look, and zipped FASTA files are about 800 MB. Large, sure, but
people are streaming Netflix movies far bigger than this every night.

It's really hard for me to see the MPEG-G effort as anything but a money grab
for compression patents and licensing.

~~~
mjburgess
1 human = 100 GB.

500,000 humans is a small sample of the UK population that one company is
looking to sequence.

All this is in the article.

~~~
epicureanideal
It seems to be more like 1.5 GB rather than 100 GB, and someone else
commented that compressed it is around 800 MB.

[https://bitesizebio.com/8378/how-much-information-is-stored-...](https://bitesizebio.com/8378/how-much-information-is-stored-in-the-human-genome/)

Even at 1.5 GB, this doesn't seem like an insurmountable problem at the
current cost of storage.

750,000 GB = 750 TB, and you can buy an 8 TB hard drive for $150. So for
$15,000 you can store 500,000 human genomes.
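
The arithmetic behind that figure, as a quick sketch (drive size and price are
the numbers assumed above):

    # Bare-drive cost of storing 500,000 called genomes at ~1.5 GB each.
    genomes = 500_000
    gb_per_genome = 1.5
    drive_tb, drive_price = 8, 150

    total_tb = genomes * gb_per_genome / 1000      # 750 TB
    drives = int(-(-total_tb // drive_tb))         # ceiling division: 94 drives
    print(total_tb, drives, drives * drive_price)  # 750.0 94 14100 (~$15,000)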

If $15,000 is too expensive for a company, that company is doing something
very wrong, when any salary will be many multiples of that.

~~~
mmt
> 750,000 GB = 750 TB, and you can buy an 8 TB hard drive for $150. So for
> $15,000 you can store 500,000 human genomes.

As others have mentioned [1], the raw sequencing data (still) has to be
stored, so 1.5GB is off by about 30x. Even at 6:1 compression, that's still
5x.

More importantly, though, the cost of storage, especially at scale, is
significantly higher than just the price of bare drives. Backblaze famously
complained about this [2] in 2009.

Even their particularly cheap, particularly low-performance solution claimed
to add almost 45% on top of the cost of the bare drives. A higher-performance
storage system (using, say, standard SAS expander backplanes instead of their
non-standard SATA multipliers) has, in my experience, added 70% on top of the
bare drives (and that's being frugal, which seems remarkably rare, even among
startups).

Your $15k is now $128k, minimum. That's just purchase cost, and, while
operating cost isn't huge for that little storage, it's not zero. The cost
gets much, much worse if the organization can't/won't hire someone like me or
won't allow that someone to implement such a frugal DIY assembly (which is the
vast majority of organizations today).
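
A sketch of that adjustment (the multipliers are the ones argued above, not
hard numbers):

    # From the $15k bare-drive estimate: raw reads are ~30x the called genome,
    # compression claws back ~6x, and a realistic DIY system adds ~70% on top
    # of the bare drives.
    bare_drives = 15_000     # $ for ~750 TB of bare 8 TB drives (from above)
    raw_multiplier = 30 / 6  # 30x raw data, compressed roughly 6:1
    system_overhead = 1.70   # chassis, backplanes, CPUs, power, etc.

    print(bare_drives * raw_multiplier * system_overhead)  # 127500.0, i.e. roughly the $128k quoted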

That leaves "enterprise" storage or cloud, which are both approximately as
expensive. I recently heard an estimate of half a million dollars per petabyte
for NetApp, which would translate to $1,875,000 here.

[1] Well detailed by
[https://news.ycombinator.com/item?id=17821523](https://news.ycombinator.com/item?id=17821523)

[2] famous on HN, anyway. [https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-...](https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/)

~~~
wenc
Cloud: I just did a quick calculation for AWS. 75 PB (= 150 GB * 500,000) of
storage costs $300,000 on AWS Glacier, and $1.57M on S3 Standard. And that's
before accounting for any read/write costs.

Wow.

~~~
mmt
I was using 3.75PB (assuming 45GB compressed 6:1 so 7.5GB per person). I'm not
_certain_ , but my intuition is that, even paying AWS for the CPU time for
compression, that would be cheaper than paying for the uncompressed transfer
charges.

But, yes, even for my lower storage numbers, for "EU (London)", even for just
3 years (a conservative enough lifespan for an HDD), S3-Standard would be
almost $2.97M, S3-Standard-IA would be $1.77M, S3-OneZone-IA $1.4M, and
Glacier would be _merely_ $608k. Of course, that excludes transfer charges.

------
sjg007
I thought one idea was to store only the differences from a reference genome.

~~~
phkahler
That is mentioned and IMHO seems like the best possible approach given the
~1/1,000 mutation/variation rate. But they also have a quality score
indicating confidence that the data is correct, and they want to compress the
quality data as well.

~~~
tonto
Here is one approach for that
[https://github.com/jkbonfield/crumble](https://github.com/jkbonfield/crumble)
[https://blog.dnanexus.com/2018-07-23-breaking-down-crumble/](https://blog.dnanexus.com/2018-07-23-breaking-down-crumble/)

------
phkahler
>> And it turns out that a histogram plot of the distance between adjacent
genetic variations, measured in DNA base pairs, looks like a double power law,
with the crossover point between the two power laws happening at around 1,000
DNA base pairs [see graph, “Double Power Law”]. It’s an open question as to
what evolutionary process created this distribution, but its existence could
potentially enable improved compression.

My first thought was that it's due to some issue/error/phenomenon in the
sequencing process and not inherent in the underlying data. Hopefully it's
known not to be that, but it would be interesting.

------
ziga
> Typically, a DNA sequencing machine that’s processing the entire genome of a
> human will generate tens to hundreds of gigabytes of data. When stored, the
> cumulative data of millions of genomes will occupy dozens of exabytes.

100GB * 1M = 100 petabytes, not exabytes.

It's also worth quantifying the cost of storing this data. Storing 100GB on
Amazon Glacier costs ~$5/year, which is still a small fraction of the total
cost of whole genome sequencing.

~~~
tofof
But researchers don't need to simply back up the data - that is to say, write
the data once and have it sitting there as a recovery option. That's what
Glacier is good for.

We need to be able to interact and compute with the data, search it, compare
it, etc. We need to be able to practically store several hundred individuals'
genomes for even a modest (n=100) GWAS study. And as the article explains,
it's not simply a question of storage, but also of compressing in a way that
we can still quickly do computations on the data without having to resort to
just completely uncompressing everything.

I'm going to actually run the numbers to show why Glacier would be an
exceptionally poor option, but I'll disclaim up front: the poor fit will stem
from the fact that Glacier is optimal for store-forever read-never workflows,
which is not even close to what our workflow will be.

Let's actually do the math on that modest n=100 study. At 150 GB per
individual (see evandijk70's comment), you seem to be thinking 'ok, 15 TB[1],
~$0.05/GB/year = $750, no problem'. Now let's actually add the cost of not
just the size of the data stored, but the retrievals and data transfers.
Let's even assume we're willing to wait the 6 hours (an entire workday!) for a
standard Glacier retrieval. If we're extremely conservative about how we use
the data, we can possibly get by with only 10 retrievals of each genome: 10 x
15 TB = 150,000 gigabytes retrieved, at $0.01/GB = $1,500. Plus the data
transfer charge: $0.09/GB for the first 10,000 GB, $0.085/GB for the next
40,000 GB, and $0.07/GB for the remainder, for another $11,300 [2].

So for a small study, we're talking about not $750 but $13,550 using your
proposal of Amazon Glacier. And then the real kicker - if we actually tried to
do this, moving 165 TB (1 write, 10 reads) of data around would take _more
than 5 straight months_ of 24/7 uploading or downloading at 100 Mbps. And at
$13,550, that's literally more than half a biology grad student's salary here
at the University of Illinois. For two of these studies, you could instead
just hire a third scientist and pay them to do nothing but drive hard drives
around.
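
The same arithmetic as a quick sketch (prices are the 2018-era figures quoted
above; current AWS pricing differs):

    # n=100 genomes at 150 GB each, stored on Glacier and retrieved 10 times.
    storage_gb = 100 * 150                          # 15 TB
    storage_per_year = storage_gb * 0.05            # ~$0.05/GB/year -> $750
    retrieval = 10 * storage_gb * 0.01              # $0.01/GB       -> $1,500
    xfer_gb = 10 * storage_gb
    transfer = (10_000 * 0.09 + 40_000 * 0.085
                + (xfer_gb - 50_000) * 0.07)        # tiered egress  -> $11,300
    print(storage_per_year + retrieval + transfer)  # 13550.0

    # Moving 165 TB (1 write + 10 reads) at 100 Mbps:
    months = 165e12 * 8 / 100e6 / 86400 / 30
    print(round(months, 1), "months")               # ~5.1 months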

Obviously that's ridiculous and in no way a realistic solution to anything.
Admittedly, today, small GWAS studies are probably closer to n=25, with 30x
coverage rather than 100x. But lopping a zero off the costs and transfer times
doesn't change anything - it would still be an absurd use of Glacier. But the
example should hopefully be eye-opening to the importance of being able to
effectively manage this volume of data. The example is the output of a single
scientist working for perhaps a month on the actual sequencing, and a couple
more months in preparation to get a really good/narrow selection of subjects,
etc. So for a 10 person lab working at high efficiency, you can imagine how
quickly the data could stack up.

All our data operations -- search, compute on, transfer, and store -- would be
improved dramatically if we had compression schemes that approached the
efficiency of, say, HEVC or AV1.

As a comparison - uncompressed video, 1920x816 pixels (standard cinema 2.35:1
widescreen), at 23.976 fps, for 1 hr 45 minutes, in 24 bit color = 1920x816 x
23.976 x (60x60+60x45) x 24/8 = 700 GB of data. HEVC at reasonable settings
will chop that to about... 2 GB. That sort of 350-fold compression is what
this article is talking about hoping to achieve, which is a far cry from the
20-fold you can get from gzipping a genome.
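
That comparison, spelled out (frame size, frame rate, and runtime as assumed
above):

    # Uncompressed size of a 105-minute film at 1920x816, 24-bit colour.
    width, height = 1920, 816     # 2.35:1 cinema frame
    fps = 23.976
    seconds = 60 * 60 + 60 * 45   # 1 h 45 min
    bytes_per_pixel = 3           # 24-bit colour

    raw_gb = width * height * bytes_per_pixel * fps * seconds / 1e9
    print(round(raw_gb), "GB uncompressed")           # ~710 GB (the ~700 GB above)
    print(round(raw_gb / 2), ": 1 vs ~2 GB of HEVC")  # ~355:1, i.e. the ~350-fold ratio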

[1] An individual archive on Glacier, by the way, is limited to 40 terabytes,
so we're already pushing against that boundary.

[2] The limit on data transfers per month is 500 TB, so with our single study
at 150 we're already pushing against that boundary as well.

~~~
ziga
The article mentioned storing data for decades, which implies a backup use-
case with infrequent access (of course, data would only be deposited in
Glacier after the initial analysis is performed). If you expect to retrieve
the data frequently, that's clearly not the right storage tier to use. The S3
Infrequent Access tier is ~3x the cost, which still supports my point about
the relative cost compared to the total cost of sequencing.

To address some of your other points: the 40 TB limit per archive is not a
limit on the amount of data you can store in Glacier. And assuming 100 Mbps
throughput implies you'd use a single node to analyze the data, which does not
make sense at this scale.

------
otakucode
It is fairly reassuring that the sum total of that which results in human
development and complexity does not easily fit on a USB stick.

~~~
DoctorOetker
are you talking about all our DNA together, with all the genetic diversity of
the global human population? or for one specific individual?

I will assume you are talking about one specific individual. Would you
consider the genome of a freshly (before any cell division) fertilized egg
cell to contain this "sum total of that which results in human development and
complexity" ?

Unless you don't, I will assume so. The article states ~3.25 billion base
pairs; that's ~6.5 billion bits worst case (either without compression, or for
a white-noise genome). That's less than a gigabyte... it fits easily on a USB
stick!
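
A tiny sketch of the 2-bits-per-base arithmetic (illustrative packing only, no
error handling):

    # Pack A/C/G/T into 2 bits each; a single called haploid genome then fits
    # comfortably under a gigabyte, as stated above.
    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack(seq):
        out = bytearray((len(seq) + 3) // 4)
        for i, base in enumerate(seq):
            out[i // 4] |= CODE[base] << (2 * (i % 4))
        return out

    print(len(pack("ACGTACGT")), "bytes for 8 bases")    # 2 bytes
    print(3.25e9 * 2 / 8 / 1e9, "GB for ~3.25e9 bases")  # 0.8125 GB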

Of course as the article mentions this does not apply to diversity in whole
populations [of humans, or of "body parts" that keep evolving within the human
body (such as white blood cells), or of cells from cancerous or virally
infected or irradiated tissues]

The reason a single genome (i.e. the freshly fertilized egg cell) does not
yet fit in 1 GB is that we cannot yet sequence DNA both reliably and cheaply
at the same time (so we need "error bars").

------
trhway
>The files are also very redundant, which stems from the fact that any two
human genomes are nearly identical. On average, they differ in about one
nucleotide per 1,000, and it’s typically these genetic differences that are of
interest.

git diff

