
New businesses eye the opportunities in managing genome data - roye
http://www.economist.com/news/business/21701143-new-businesses-eye-opportunities-managing-genome-data-all-about-base?fsrc=scn/tw_ec/all_about_the_base
======
greenleafjacob
At least on the storage side, the data is very compressible. An LZ77 variant
for genomic data called GDC [1] achieved a compression ratio of ~9500, reducing
the incremental size from 100 GB to about 10 MB.
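For scale, a quick back-of-the-envelope check (assuming decimal GB/MB) shows the quoted ratio and sizes are consistent:

```python
# Rough sanity check on the quoted figures (assuming decimal GB/MB units).
incremental_bytes = 100e9          # ~100 GB of incremental genome data
ratio = 9500                       # compression ratio reported for GDC
compressed_mb = incremental_bytes / ratio / 1e6
print(f"{compressed_mb:.1f} MB")   # ~10.5 MB, i.e. "about 10 MB"
```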

[1]:
[http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&proj...](http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about)

~~~
_ihaque
GDC isn't directly relevant for most of the large-scale storage needs in
genomics. There are a handful of "most important" file formats in the kind of
genomics discussed by the article:

- FASTA: a sequence or set of sequences representing an entire genome.
Usually used (in human genomics) to represent the "reference genome", but very
rarely individual genomes.

- FASTQ: a set of sequences with associated probabilities at each position
(representing the probability that the base at that position is correct). Used
to represent the output from a sequencing experiment, where you may get
hundreds of millions to billions of short reads (order of 100 to 10000 bases
in length) from a biological sample. (A toy FASTQ record is sketched after
this list.)

- BAM: "binary alignment map". Stores the data from the FASTQ (sequence and
quality) in a way that "aligns" it to a reference genome -- identifying where
the read "came from" (more formally, mapping each base in a read to the most
likely corresponding base in the reference genome).

- VCF/gVCF: "variant call format". You can think of this as a diff between
the individual sequenced and the reference genome.
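To make the FASTQ format concrete, here's a toy record and a minimal parser (illustrative only; real code would use something like Biopython's SeqIO and handle the many edge cases):

```python
# Toy FASTQ record and a minimal parser (not production code).
# Each record is 4 lines: @name, bases, '+', then one ASCII quality
# character per base (Phred score + 33).

fastq_example = """@read1
ACGTTACA
+
IIIIFF::
"""

def parse_fastq(text):
    lines = [l for l in text.splitlines() if l]
    for i in range(0, len(lines), 4):
        name = lines[i][1:]                          # strip leading '@'
        seq = lines[i + 1]
        quals = [ord(c) - 33 for c in lines[i + 3]]  # decode Phred+33
        yield name, seq, quals

for name, seq, quals in parse_fastq(fastq_example):
    print(name, seq, quals)  # read1 ACGTTACA [40, 40, 40, 40, 37, 37, 25, 25]
```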

In most cases in human genomics you wouldn't construct a full FASTA from the
individual (doing so would require a process called assembly). Instead, you
would sequence the sample (producing FASTQ), align it to a FASTA reference
genome (producing BAM), and call variants (producing VCF). The VCF is _much_
smaller than the other formats, and that size differential is where GDC would
get most of its performance: all in all, we're usually not that different from
the reference.

The big storage problem usually ends up being storing and manipulating FASTQ
and BAM files, because these are the (almost complete) original data from the
sequencing run, and occasionally there's a need to keep them around:

- you may want to run a new analysis that wasn't done and so wasn't
encompassed in the original VCF

- you may want to know the underlying quality of the data that created a
variant call. Sequencing is a stochastic process, subject to a variety of
types of error. Even though VCF calls typically have an estimate of quality,
in many cases there's no substitute for looking at the original underlying
data.

- you may have a legal or contractual obligation to maintain this data (e.g.,
under CLIA regulations, laboratories offering clinical sequencing may be
required to store the raw data underlying a clinical result for a number of
years).

So, how do you store FASTQ or BAM more compactly? BAM is already compressed --
block-gzipped, to be specific -- but it still stores all the information
explicitly. The obvious first step is reference-based compression (e.g., CRAM:
[http://www.ebi.ac.uk/ena/software/cram-toolkit](http://www.ebi.ac.uk/ena/software/cram-toolkit)),
which elides sequence data that is identical to the reference genome and
eliminates almost all of the space needed to store the sequences.
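The core idea can be sketched in a few lines (a heavy simplification of what CRAM actually does; it assumes perfectly aligned reads and ignores indels, soft-clipping, and the rest):

```python
# Simplified sketch of reference-based read compression: for an aligned read,
# store only its position, length, and the bases that differ from the reference.

reference = "ACGTACGTACGTACGT"

def compress_read(ref, pos, read):
    """Return (pos, length, mismatches), where mismatches lists
    (offset, base) pairs that differ from the reference."""
    mismatches = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return (pos, len(read), mismatches)

def decompress_read(ref, record):
    pos, length, mismatches = record
    read = list(ref[pos:pos + length])
    for i, b in mismatches:
        read[i] = b
    return "".join(read)

record = compress_read(reference, 4, "ACGTACTT")  # one mismatch vs. reference
print(record)                                     # (4, 8, [(6, 'T')])
print(decompress_read(reference, record))         # ACGTACTT
```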

Other than some easily compressible stuff, like read names, the main thing
left over is the quality scores -- the sequence of numbers telling you how
"high-quality" each base sequenced was. In raw form, the qscores take an equal
number of bytes as the sequence reads, but once we've used reference -based
compression to elide the bases, they're by far the dominant component.
Unfortunately, compressing quality scores well is a difficult problem. There's
been a lot of work done in this area (e.g.,
[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3832420/](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3832420/)),
but it's by no means a solved problem.
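To get a feel for why the qscores dominate once the bases are elided, here's a rough order-0 entropy estimate on a made-up quality string (toy data; real qscore strings have position and context structure that the cited work tries to exploit):

```python
import math
from collections import Counter

# Rough order-0 entropy estimate (bits per symbol) of a quality string.
# The string below is invented for illustration.
quals = "IIIIIIHHHHGGGFF::FFFGGHHII" * 100

counts = Counter(quals)
n = len(quals)
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
print(f"{entropy:.2f} bits/score")  # ~2.2 bits/score even for this repetitive
                                    # toy string, vs. ~0 left for the bases
```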

So if you're into data compression, coming up with a better way to compress
FASTQ/BAM files is still a valuable line of work.

~~~
danieltillett
As someone who works on DNA sequencing software, the dirty secret is that
qscores are excessively precise. They technically span the range 0 to 99, but
due to limitations in how well qscores can be predicted (you can't really get
closer than ±5), the real range is more like (0 to 7)*10. With better
bit-packing it should be possible to compress them much better.
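A rough sketch of that idea: bin scores into 8 levels of width 10 (matching the (0 to 7)*10 observation) and pack 3 bits per score instead of a byte. The bin width and example scores here are made up:

```python
# Toy sketch of coarse quantization + bit-packing for quality scores.

def quantize(qscores, step=10, levels=8):
    """Map each Phred score to one of `levels` bins of width `step`."""
    return [min(q // step, levels - 1) for q in qscores]

def pack3(bins):
    """Pack a list of 3-bit values into bytes."""
    out, acc, nbits = bytearray(), 0, 0
    for b in bins:
        acc = (acc << 3) | b
        nbits += 3
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

qscores = [38, 40, 37, 35, 22, 12, 5, 40]  # hypothetical Phred scores
bins = quantize(qscores)                   # [3, 4, 3, 3, 2, 1, 0, 4]
packed = pack3(bins)
print(len(qscores), "raw bytes ->", len(packed), "packed bytes")  # 8 -> 3
```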

~~~
_ihaque
Range isn't an issue (any reasonable scheme won't assign codes to scores that
never occur), but you're right that qscores are precise without necessarily
being accurate. Quantization of quality scores is a common solution (it is
optional in CRAM, mentioned above). There's also recent work on better methods
for doing it [1,2].

The problem is that if you have to maintain BAMs for regulatory reasons, lossy
q-scores may not be sufficient for compliance, because you 1) have lost part
of the original data, and 2) may not be able to exactly reconstruct your
analysis results (unless, of course, you did the analysis on the quantized
scores).

Thus, it would still be interesting to see better lossless compression
methods.

[1]
[http://bioinformatics.oxfordjournals.org/content/early/2014/...](http://bioinformatics.oxfordjournals.org/content/early/2014/05/02/bioinformatics.btu183.full)

[2]
[http://web.stanford.edu/~iochoa/publishedPublications/2015_q...](http://web.stanford.edu/~iochoa/publishedPublications/2015_qvz_paper.pdf)

~~~
danieltillett
The qscores do occur, it's just that the error bars on them are quite large in
practice. It certainly would be interesting to see if the analysis can be
reconstructed from quantized qscores.

