
Improving compression with a preset DEFLATE dictionary - jgrahamc
https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/
======
polyfemos76
We use this technique in Just Dance Now, and I wholeheartedly recommend doing
it for use cases where you need to compress lots of small packets
individually. We save at least 50% on data.

For our game, this consists of the normal communication going back and forth
between server and client, where individual messages are normally <100b.

By sampling some user sessions we built a dictionary, fed it into
deflateSetDictionary/inflateSetDictionary, and use that in both Node modules
for the server, as well as throughout the ecosystem of Android and iOS.

It's very simple to do; we base all communication on standard JSON (for ease
of portability and debugging) but then have the preset dictionary (a simple
large string!) fed into the compressor to counter the weight of JSON.

Needless to say, binary packing the protocol would give additional gains; but
the productivity gain of using dictionaries to get good compression of pure
JSON made it a great choice for us. And it's really _supersimple_ to do.

One nice benefit is that the protocol is extendable by any developer without
worrying about the wire transport; the compression won't break if someone
decides to add a parameter to a message, it just most likely won't be
compressed as well as the rest of the message.

------
JoeAltmaier
Hey, somebody has a patent on pre-loading the compression tables with
frequently-encountered strings. So heads up.

It was patented in the context of headers for text messages, which are sent
with every message, are generally larger than the message itself, but appear
exactly once in each text, so they never end up in the compression table. Each
side of the conversation runs the zip algorithm on the header strings,
discards the output up to that point, then compresses the text message from
that point. On receive they preload the algorithm with the checkpointed data
structures from the preload compression.

This Google technique is probably different enough to be off their radar. But
who knows.

I think it's the guys who do DirectPC

~~~
makmanalp
Wow, do you have a link to the patent? It'd be surprising to me if there
wasn't prior art. Might be worth asking
[http://patents.stackexchange.com/](http://patents.stackexchange.com/) !

~~~
JoeAltmaier
Came across it 7 years ago when new at my current project. Did research into
efficient compression schemes. Had to change our scheme to avoid the patent.
It was Hughes I think.

~~~
JoeAltmaier
Ok, I searched patents using 'compression dictionary' and came up with a
boatload. Folks have been using dictionaries since at least 2005 to replace
common words (e.g. XML tags) with tokens. E.g. patent 20050027731: Daniel
Revel

------
dantillberg
I love the explanation of how LZ77 works; compression algorithms often seem
like they must use some sort of arcane magic, so it's great to lift the veil a
bit.

I wonder: how many folks have Little Bunny Foo Foo stuck in their heads after
reading this?

~~~
userbinator
The "arcane magic" impression likely comes from the heavy use of information
theory concepts and maths that are encountered in the traditional literature.
Probably necessary to analyse in detail, but quite daunting for the beginner.
I find LZ to be particularly intuitive since it embodies the natural notion of
"abbreviation" by referring to something we've seen before, and writing
compressors/decompressors for the algorithm is relatively easy (especially
decompressors).

On the other hand, Huffman is somewhat less easy to grasp intuitively, but its
principle of reducing bits per symbol based on frequency shows up in natural
languages: the more common a word is, the shorter it tends to be. It's also a
bit more difficult to implement.
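That frequency principle can be shown with a toy Huffman tree builder; the symbol frequencies below are made up for the example.

```javascript
// Toy illustration of Huffman's principle: more frequent symbols end up
// with shorter codes. Not production code -- a sketch of the idea.
function huffmanCodes(freqs) {
  // Build the tree bottom-up by repeatedly merging the two rarest nodes.
  let nodes = Object.entries(freqs).map(([sym, f]) => ({ sym, f }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.f - b.f);
    const [a, b] = nodes.splice(0, 2);
    nodes.push({ f: a.f + b.f, left: a, right: b });
  }
  // Walk the tree, assigning '0'/'1' per branch taken.
  const codes = {};
  (function walk(node, code) {
    if (node.sym !== undefined) { codes[node.sym] = code || '0'; return; }
    walk(node.left, code + '0');
    walk(node.right, code + '1');
  })(nodes[0], '');
  return codes;
}

const codes = huffmanCodes({ e: 12, t: 9, q: 1, z: 1 });
// frequent 'e' and 't' get shorter bit strings than rare 'q' and 'z'
```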

~~~
eru
And as words become more common (over time or in specific parts of the
population), they often become shorter, too.

E.g. see the evolution from phone / mobile phone to landline / phone.

------
gopalv
An annoying detail of DEFLATE + dictionary is that the dictionary has to be
provided to every reader (setDictionary in Java).

The output from the DEFLATE stream is not standalone, which makes it really
hard to use - a more ideal approach would've been to provide training to the
Huffman tree & guess the best window sizing from a sample.
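That "not standalone" problem is easy to demonstrate with Node's zlib (the dictionary and data below are invented): a zlib stream written with a preset dictionary cannot be inflated by a reader that doesn't supply the identical bytes.

```javascript
// Sketch: the dictionary must travel out-of-band to every reader.
const zlib = require('zlib');

const dictionary = Buffer.from('<entry><title></title></entry>');
const data = Buffer.from('<entry><title>hello</title></entry>');

// zlib-format stream; the FDICT flag records that a dictionary was used.
const compressed = zlib.deflateSync(data, { dictionary });

// Inflating works only with the identical dictionary supplied:
const roundTrip = zlib.inflateSync(compressed, { dictionary });

// A reader without the dictionary hits Z_NEED_DICT and Node throws,
// so the compressed bytes alone are useless.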

But SDCH solves that problem by storing an external named dictionary -
clients will fetch it from a common HTTP location if it's missing from the
cache.

Underneath, SDCH uses a pretty neat standard - RFC 3284 (VCDIFF).

I ran into it while trying to optimize holding a few billion serialized blobs
within an in-memory store with transactional updates - the issue was mostly
the in-memory sizes of the serialized objects.

I ended up using a versioned external dictionary for compression, which was an
extension to VCDIFF with 1 additional instruction in its streams - COPY_DICT.

That concept can be pushed further, to get to something more generic like
femtozip - [https://github.com/gtoubassi/femtozip/wiki/How-femtozip-
work...](https://github.com/gtoubassi/femtozip/wiki/How-femtozip-works)

------
ctz
I wonder if compression with just a preset dictionary (i.e. no learning from
the plaintext) would produce reasonable compression ratio. That would be
interesting because it would be entirely BREACH resistant.

------
halflings
Never thought I'd ever understand an article about compression :) Great
article (loved the "hopping bunny" example) and it almost makes me think that
this could replace the need for dynamically loading content in some cases
(when network latency, and not DOM rendering and such, is the issue).

------
jamix
Unfortunately, the efficiency of the proposed approach is limited to the
first 32 KB of the page being compressed. Once we move past the first 32 KB,
LZ77's lookbehind window no longer includes the dictionary, and the
compression ratio of the rest of the page stays the same.

Pre-seeding the dictionary could have a better effect in an LZ78 scheme where
there is an actual dictionary built and stored in memory as opposed to a
moving lookbehind window.

This also makes me wonder how the dictionary is delivered to the end client.
If it's sent with each page request, then it hardly makes any sense, as you
are hauling an extra 16 KB or 32 KB of dictionary just to have a better
chance of compressing the page's first 32 KB.

I wonder if the size of the dictionary itself is factored into the
compression benchmarks provided in the article.

------
ck2
I've wondered why, with storage being so cheap, there isn't a compression
method with a 1 TB dictionary that everyone stores locally, so you just send
the "map" or "tree" instead.

~~~
pestaa
But with computing becoming cheaper, you only need 2 integers to compress any
data.

[http://penduin.blogspot.hu/2006/10/pi-
compression.html](http://penduin.blogspot.hu/2006/10/pi-compression.html)

~~~
chriswarbo
Two integers and a load of computation? How wasteful.

Just treat your bits as the binary representation of a Natural number and
you're done. No computation required for encoding/decoding, and you only need
to store one number :)

------
bhouston
I'd just prefer something like LZHAM to be implemented as an alternative to
gzip/Deflate for HTTP compression. It would be amazing for serving content
over the web to browser-based 3D applications like
[https://Clara.io](https://Clara.io)

Details:
[https://github.com/richgel999/lzham_codec](https://github.com/richgel999/lzham_codec)

~~~
anon4
Can't you just implement that in javascript, then load the entire webpage that
way?

~~~
bhouston
You can of course use things like:

[https://github.com/nmrugg/LZMA-JS](https://github.com/nmrugg/LZMA-JS)

But nothing beats native code for decompression speed. Also, if it is native,
it can happen asynchronously outside of JavaScript's limited threading model
(although that is always getting better) and, most importantly, it can happen
outside of the JavaScript memory space.

------
acqq
How is the new deflate preset dictionary passed to the browser? If the
dictionary is normally empty and the protocol even supports sending different
presets, how is that shorter than before, if each new message has to carry a
new dictionary?

------
jkot
Some databases are creating compression dictionary during compaction.

~~~
ot
Interesting, can you mention them?

------
mamcx
I wonder if this trick could be used against large binary files (like a
database backup)?

