
Building Better Compression Together with DivANS – Dropbox Tech Blog - apu
https://blogs.dropbox.com/tech/2018/06/building-better-compression-together-with-divans/
======
phrh8
How does DivANS compare to state-of-the-art compression algorithms? Is Brotli
widely acknowledged as the state of the art, or do you just compare to it
because it is fast?

My understanding is that Zstd was designed to be fast, not necessarily
space-efficient, and the other algorithms listed (e.g. 7-zip, gzip) are quite
ancient.

~~~
daniel_rh
Blog co-author here: With the settings in the blog post, the DivANS algorithm
skews towards saving space over speed. Brotli skews heavily towards
decompression speed without sacrificing much ratio. The reason we focus so
much on comparing to Brotli is that Brotli does extremely well on data stored
at Dropbox.

However, I don't think this performance gap is a fundamental law, and there
are clearly some clever optimizations yet to be done.

I was very surprised that the lzma-brotli mashup outperformed both. This
leads me to think that with enough community involvement we could discover
some really clever compression heuristics and algorithms together.

------
BeeOnRope
It is important to note that the IR here isn't exactly generic: it embeds the
assumption of an LZ77 compressor.

Now, LZ77 compressors have the advantage of being really popular, and many of
the leading compressors at various points on the space/time frontier use it,
but there are a large number of other interesting compression algorithms as
well, which this IR cuts out.

A big downside of LZ77 is that it is simple enough that we had optimal parsing
for it a long time ago, and it isn't likely to be the subject of any big
breakthrough in the future: much of the interesting work will probably happen
in non-LZ77-style compressors.

One of the interesting areas for LZ77, however, is coding-sensitive parsing:
i.e., basing your match-finding decisions on the actual code lengths needed to
represent specific matches, literal runs, etc. Since these code lengths are
usually dynamic, set by the entropy-coding backend, you often want a back-and-
forth type of approach between the two components: either multiple passes or
two-way feedback.

The approach of generating IR once and then feeding it into the DivANS entropy
coding backend seems to preclude this.
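To make the coding-sensitive idea concrete, here is a toy Python sketch of the decision an LZ77 parser would make under it: compare the estimated bit cost of coding a run of literals against the estimated cost of coding an equivalent match, using the entropy model's current probabilities. The function names and the flat `match_cost_bits` estimate are illustrative assumptions, not DivANS's actual interface.

```python
import math

def code_cost(symbol, freq, total):
    """Ideal bits to code `symbol` under the current model: -log2(p)."""
    return -math.log2(freq.get(symbol, 1) / total)

def choose(literal_run, match, freq, total, match_cost_bits):
    """Pick whichever encoding of the same bytes is cheaper in bits.

    `literal_run` is the raw bytes; `match` is a (distance, length) pair;
    `match_cost_bits` is a (hypothetical) estimate of coding that match.
    """
    literal_bits = sum(code_cost(b, freq, total) for b in literal_run)
    if match_cost_bits < literal_bits:
        return ("match", match)
    return ("literals", literal_run)
```

Since `freq` shifts as the entropy coder adapts, the "right" answer can change after the fact, which is why the comment above argues for multiple passes or two-way feedback rather than a single fixed IR pass.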

------
bitwave
Reminds me of zpaq [1]. It is a public-domain compression tool which saves
the decompression algorithm as IR into the compressed archive. It was written
by Matt Mahoney, best known for the PAQ archiver series.

[1]:
[http://mattmahoney.net/dc/zpaq.html](http://mattmahoney.net/dc/zpaq.html)

~~~
daniel_rh
It has some similarities to zpaq, especially with the -mixing=2 parameter.
With -mixing=2, DivANS runs 2 models to estimate the upcoming nibble or byte
and dynamically chooses the best model based on past performance. One of the
two models is good for more diffuse patterns and the other is good for more
pronounced patterns. I believe the PAQ technology runs dozens of models and
uses similar but slightly more advanced techniques to mix between them. This
results in excellent compression but means you have to evaluate those models
to decompress the data when you need it later.

Without -mixing=2 (e.g. for the results presented in the blog post), DivANS
actually relies on a static determination of which models to use in which
situations, serialized into the file in the mixing map. It only ever evaluates
a single model per nibble.
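The past-performance selection described above can be sketched in a few lines of Python: two predictors are each charged the ideal `-log2(p)` bits for every symbol, and the next symbol is coded with whichever has the lower decayed running cost. The class name, the decay factor, and the two-model setup are illustrative assumptions, not DivANS's actual scheme.

```python
import math

class ModelSwitcher:
    """Toy sketch: pick between two models by their past coding cost."""

    def __init__(self, model_a, model_b, decay=0.9):
        # Each model maps a symbol to a probability in (0, 1].
        self.models = [model_a, model_b]
        self.costs = [0.0, 0.0]   # exponentially decayed bits spent so far
        self.decay = decay

    def code(self, symbol):
        # Code with the historically cheaper model.
        best = min((0, 1), key=lambda i: self.costs[i])
        bits = -math.log2(self.models[best](symbol))
        # Charge both models, so the loser can win back if the data shifts.
        for i, model in enumerate(self.models):
            self.costs[i] = self.decay * self.costs[i] - math.log2(model(symbol))
        return best, bits
```

A sharp model (high probability on common symbols) quickly wins on skewed data, while the flat model takes over on more diffuse stretches, which matches the "two models, choose by past performance" description above.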

The idea of mashing up the intermediate representation with a bunch of
different compressors isn't present in zpaq as far as I know, but the spirit
of the idea is similar, just at the per-bit level rather than the IR level.

~~~
bitwave
Big thanks for the clarification.

------
ksoong2
Nice article!

I don't understand how probability tables can compress AND be deterministic.
Intuitively, I would think there's a trade-off between the two competing
effects.

~~~
hellcatv
Here's an analogy that may help when thinking about probability tables: many
standard codes like Morse code are like Huffman coding, where you assign a
variable-length code to each letter. In Morse, E and T are a single dot and a
single dash respectively, but X and Z each get four symbols.

You can view this Huffman table or Morse codebook as a probability
distribution on the likely letters to follow. A symbol with a small number of
bits == high probability. A symbol needing a large number of bits == low
probability.

The only innovation with Arithmetic coding and ANS is that they allow for
"fractional bits:" you don't need a whole dot for a very likely next-letter.
But the idea is the same: you are taking your understanding of the probability
distribution table and using it to come up with codes for the following
letters.

One caveat is that instead of agreeing to a fixed code upfront, you agree to
an algorithm to dynamically estimate the future probability tables as you read
through the file.
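The probability-to-code-length correspondence above can be made concrete with a toy calculation: each probability implies an ideal (possibly fractional) code length of `-log2(p)` bits, which arithmetic coding and ANS can approach, while a whole-bit prefix code like Huffman must round up. The symbols and probabilities here are illustrative, chosen to echo the Morse analogy.

```python
import math

def code_lengths(probs):
    """Map each symbol to (ideal fractional bits, rounded whole bits).

    The ideal length -log2(p) is what arithmetic coding / ANS can approach;
    a Huffman/Morse-style prefix code must spend a whole number of bits.
    """
    return {s: (-math.log2(p), math.ceil(-math.log2(p)))
            for s, p in probs.items()}

# Illustrative table: common letters get short codes, rare ones long codes.
table = code_lengths({"e": 0.5, "t": 0.25, "a": 0.125,
                      "x": 0.0625, "z": 0.0625})
```

Here "e" at probability 1/2 costs exactly 1 bit, while "x" and "z" at 1/16 each cost 4 bits, mirroring the short-vs-long codes in the Morse example; the savings from fractional bits show up when a probability is not an exact power of two.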

~~~
ksoong2
Makes sense. Thanks!

------
Machyume
Cool approach. Was wondering if you could chime in a bit about the compression
speeds observed? Thanks!

Will pass this to a few colleagues.

