
The Elegance of Deflate - ingve
http://www.codersnotes.com/notes/elegance-of-deflate/
======
beagle3
There were a lot of encoders at the time using this general scheme (with a
few more values to indicate match lengths or distances). PKZIP won at the
time because it was faster, and PK had the name recognition from his PKARC,
which was a superfast implementation of SEA ARC (the dominant archiver on PCs
at the time).

PK had to stop working on PKARC because of a C&D request from SEA. He wrote
the first algorithms of PKZIP, which were on par with SEA ARC on compression
(and with PKARC on speed), but weren't much better. (And they have been
deprecated since the early 1990s, if I'm not mistaken.)

Then the Japanese encoders started to rule on compression ratio (with
comparably reasonable compression times) - LHArc, LZAri, I don't remember the
rest of the names. LHArc or LHA (I don't remember which) had basically the
same scheme that PKZIP converged on, except it used adaptive arithmetic
coding. PK replaced that with static Huffman coding, trading a little
compression for a lot of speed, and the format we now know and love as "zlib
compression" was born (and quickly took the world by storm, being in a sweet
spot of compression and speed).

There's another non-trivial thing that PKZIP had going for it - it put the
directory at the end, which meant you could see the list of files in the
archive without reading the entire archive! This sounds simple, but back then
everyone had adopted the tar-style "file header then data" appendable layout,
which meant that just listing the files inside a 300KB zip file (almost a
whole floppy disk!) meant reading that entire floppy (30 seconds at least).
PKZIP could do it in about 2 seconds.
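
A minimal sketch (not PKZIP's actual code) of how a lister exploits this; it
assumes an archive with no trailing comment, so the 22-byte End of Central
Directory record sits at a fixed offset from the end - real tools scan
backwards for the signature:

    #include <stdio.h>

    /* Read little-endian 16- and 32-bit values from a byte buffer. */
    static unsigned le16(const unsigned char *p)
    {
        return p[0] | (p[1] << 8);
    }
    static unsigned long le32(const unsigned char *p)
    {
        return p[0] | (p[1] << 8) | ((unsigned long)p[2] << 16)
                    | ((unsigned long)p[3] << 24);
    }

    int main(int argc, char **argv)
    {
        unsigned char eocd[22];
        FILE *f = fopen(argv[1], "rb");
        if (!f)
            return 1;

        /* With no trailing comment, the 22-byte End of Central
           Directory record is the very last thing in the file. */
        fseek(f, -22L, SEEK_END);
        if (fread(eocd, 1, 22, f) != 22 || le32(eocd) != 0x06054b50UL) {
            fclose(f);          /* 0x06054b50 is the "PK\5\6" signature */
            return 1;
        }

        printf("entries: %u\n", le16(eocd + 10));
        printf("central directory: %lu bytes at offset %lu\n",
               le32(eocd + 12), le32(eocd + 16));
        /* A lister now seeks to that offset and reads just the
           directory, never the compressed file data. */
        fclose(f);
        return 0;
    }

On a floppy that means two short reads instead of a full scan of the disk.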

~~~
creshal
> There's another non-trivial thing that PKZIP had going for it - it put the
> directory at the end, which meant you could see the list of files in the
> archive without reading the entire archive!

At the beginning _and_ the end, which continues to bite us in the ass to this
day: we regularly stumble over bugs where one part of a toolchain uses one
entry and the rest use the other.

~~~
acqq
The directory is only at the end in PKZIP; there is additional info before
each compressed file, as in TAR. It's actually good to have both, as it
allows recovering individual files even when there's corruption in other
files or in the directory.

~~~
dalke
Yes, it's possible to use the information to recover from corruption.

The problem is: what happens if the two are different? This can happen by
accident or maliciously. If one tool uses the first and another uses the
second, then you'll end up with different results, and it can be hard to
figure out why they differ.

It's like a CD, where audio players interpret the disc differently than
CD-ROM drives do. Some anti-copying techniques tried to take advantage of
this to produce un-rippable CDs. A problem, however, was that car CD players
used CD-ROM drives, so these CDs weren't playable in those cars.
([https://en.wikipedia.org/wiki/Compact_Disc_and_DVD_copy_prot...](https://en.wikipedia.org/wiki/Compact_Disc_and_DVD_copy_protection))

Or some HTTP header attacks based on duplicate headers. Let's say there's a
firewall which allows only requests with "Content-Type: text/plain". If the
attacker includes the header twice, once as text/plain and the other as
image/gif, then the firewall might only check the first while the back-end
web server interprets the second. This is a silly example; I can't remember
the real attacks that take advantage of the same mechanism.
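
A toy illustration of the disagreement - both resolution policies here are
made up, but real servers and firewalls really do split along this line:

    #include <stdio.h>
    #include <string.h>

    /* Two made-up policies for resolving a duplicated header. */
    static const char *first_match(const char **h, int n, const char *name)
    {
        for (int i = 0; i < n; i++)
            if (strncmp(h[i], name, strlen(name)) == 0)
                return h[i] + strlen(name);
        return NULL;
    }

    static const char *last_match(const char **h, int n, const char *name)
    {
        const char *v = NULL;
        for (int i = 0; i < n; i++)
            if (strncmp(h[i], name, strlen(name)) == 0)
                v = h[i] + strlen(name);
        return v;
    }

    int main(void)
    {
        const char *hdrs[] = {
            "Content-Type: text/plain",   /* what the firewall approves */
            "Content-Type: image/gif",    /* what the back end acts on */
        };
        printf("firewall sees: %s\n", first_match(hdrs, 2, "Content-Type: "));
        printf("back end sees: %s\n", last_match(hdrs, 2, "Content-Type: "));
        return 0;
    }

Same request, two different answers, and neither parser is "wrong".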

~~~
creshal
Zip inconsistency vulnerability:
[https://nakedsecurity.sophos.com/2013/08/09/android-master-k...](https://nakedsecurity.sophos.com/2013/08/09/android-master-key-vulnerability-more-malware-found-exploiting-code-verification-bypass/)

Duplicate HTTP header vulnerabilities:

• [https://bugzilla.mozilla.org/show_bug.cgi?id=376756](https://bugzilla.mozilla.org/show_bug.cgi?id=376756)

• [https://splash.riverbed.com/thread/7772](https://splash.riverbed.com/thread/7772)

If a spec is open to interpretation, you can be sure that some software gets
it wrong.

~~~
acqq
The "zip inconsistency" linked is not a problem of the Zip file format
specification. The goal of the zip format was never to provide a tamper-proof
cryptographically secure binary package. Whoever uses this structure for some
more security-wise complex demands is responsible to make sure that the
security assumptions he needs are respected by the code he implements. One
simple method were signing the resulting archive, then no modification is
possible, and therefore also isn't possible adding a second entry with the
same name by the attacker.

I really don't care about HTTP headers or Reed-Solomon, as they are
completely irrelevant to the Zip format, designed in the eighties for very
specific purposes (being fast on a floppy disk and a 4 MHz 8088 processor,
allowing recovery of as much as possible even after a floppy disk sector
failure). I just claim, through the whole conversation here, that the Zip
format is not a bad format for having a central directory and the metadata
outside of it too, and I gave my arguments for that. NTFS also keeps some
metadata in more than one place, and it's an intentional design decision
solving real problems: in the Zip case, allowing fast access to the list of
files in the archive but also easy recovery of individual files when some
part of the archive gets corrupted.

I also claim that the Zip format is not in any way "bad" because its design
and specification don't make it impossible to have two identical file names
in the archive. I can even imagine use cases where allowing exactly that
is/was beneficial.

------
sebular
Really enjoyed reading this; I think I'd like to look more closely at an
implementation to understand it further.

Reading the article, I was reminded of a nagging question I've had in the back
of my mind for a bit.

The ASCII table was invented in the '60s, when the cost of disk storage was
much higher than it is today. But it's a shared dictionary, a standard that
we can all agree upon, something that's already on every computer.

The thing I've wondered about is whether there could be any advantage to
creating new generations of shared dictionaries that are domain-specific, and
much larger.

For example, in the specific (and over-simplified) case of transmitting large
quantities of English text from a server to a client, you could reduce the
amount of data sent over the wire if both parties shared a lookup table that
contained the entire English dictionary. In that case, you wouldn't transmit a
book as a collection of characters, but rather as a collection of words.
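
A toy sketch of what I mean, with a tiny made-up word list standing in for
the full shared dictionary (in practice the indices would be fixed-width or
entropy-coded):

    #include <stdio.h>
    #include <string.h>

    /* Made-up shared dictionary; both sides would ship the same list,
       so only an index needs to cross the wire. */
    static const char *shared_dict[] = {
        "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
    };
    enum { DICT_SIZE = sizeof shared_dict / sizeof *shared_dict };

    static int word_index(const char *w)
    {
        for (int i = 0; i < DICT_SIZE; i++)
            if (strcmp(shared_dict[i], w) == 0)
                return i;
        return -1;   /* unknown word: fall back to spelling it out */
    }

    int main(void)
    {
        const char *msg[] = { "the", "quick", "brown", "fox" };

        /* Instead of several bytes of characters per word, send one
           small index each; even a 100,000-word lexicon needs only
           17 bits per word. */
        for (int i = 0; i < 4; i++)
            printf("%d ", word_index(msg[i]));
        printf("\n");   /* prints: 0 1 2 3 */
        return 0;
    }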

Furthermore, it would seem like you could apply traditional compression
methods like the ones described in the article to further reduce the amount
of data being sent. Rather than identifying repeating patterns of letters,
you would identify repeating patterns of words.

Of course, the obvious drawback is that an English lookup table is useless for
transmitting anything other than English text. But again, disk storage being
as cheap as it is, I wonder if it wouldn't be such a monumental problem to
store many domain-specific dictionaries.

Of course, you'd always want to keep the ASCII table as a fallback - much in
the same way that a conversation in sign language (ASL, specifically) is
largely composed of hand gestures referring to words or concepts, but the
language still includes the complete alphabet as a fallback.

The thing I don't understand well enough is whether modern compression
already incorporates concepts similar enough that a shared-word dictionary
would be useless. It seems like the "LZXX" style of compression essentially
builds a similar but dynamically-generated sort of dictionary out of the
transmission itself, and subsequently refers to that in order to express the
rest of the message. Would enormous shared dictionaries just duplicate the
gains of that approach, only in a much more "wasteful" way?

~~~
Lerc
I considered a project loosely along those lines. My idea was a
One-Meg-of-Data project: define a megabyte of data that everyone would have
in a bit-identical form.

The appeal of this to me would be in the code-golf style of approach to
specifying the actual data. Apart from plainly useful raw data, it would
include a bare-minimum VM that can be used to generate larger outputs from
the data. More complex VMs would be permitted provided a reference
implementation can be run on the minimal VM. The code for the reference
implementation would, of course, be included in the one megabyte.

A program using the data could request

* raw bytes

* output from the VM given code and parameters (with perhaps a cycle limit to stop endless-loop hangs)

* output from the VM given code and parameters specified at an index in the data

* output from the VM given code and parameters specified at an index in the data, plus user parameters

The system would be stateless, the VM would only exist for the duration of the
data generation call. The same parameters would always produce the same
output.

I have doubts as to its fundamental usefulness, but it would certainly make a
fun collaborative project at the very least.

~~~
woliveirajr
Not exactly your idea, I think, but perhaps similar:

- A patent from IBM for some previous dynamic dictionary [1]

- Brotli: 120kb predefined dictionary [2]

[1]
[http://www.google.com.br/patents/US8610606](http://www.google.com.br/patents/US8610606)

[2]
[https://en.m.wikipedia.org/wiki/Brotli](https://en.m.wikipedia.org/wiki/Brotli)

------
Dylan16807
It's a good system. Though the full DEFLATE spec has quirks, like storing
literal blocks prefixed by their length and then the one's complement of
their length.

The next step in improving a simple compression algorithm is to replace
Huffman coding with arithmetic/range coding. Huffman coding proceeds one bit
at a time, treating every symbol as having a probability that's a power of
1/2. Arithmetic/range coding uses more accurate probabilities and then packs
them together, letting you save a fraction of a bit with every symbol. As an
analogy, consider how to encode 3 decimal digits. You could spend 4 bits on
each digit, and it would take 12 bits total. Or you could combine them into a
single 10-bit number.
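
The analogy in code (a trivial sketch):

    #include <stdio.h>

    int main(void)
    {
        int d0 = 9, d1 = 8, d2 = 7;

        /* Naive: 4 bits per digit, 12 bits total. */
        unsigned naive = (d0 << 8) | (d1 << 4) | d2;

        /* Combined: a number in 0..999, which fits in 10 bits,
           because 2^10 = 1024 >= 1000. */
        unsigned packed = d0 * 100 + d1 * 10 + d2;

        printf("naive:  %u (needs 12 bits)\n", naive);
        printf("packed: %u (needs 10 bits)\n", packed);

        /* Unpacking recovers the digits exactly. */
        printf("digits: %u %u %u\n",
               packed / 100, (packed / 10) % 10, packed % 10);
        return 0;
    }

Arithmetic coding does the same trick, except with arbitrary probabilities
instead of equal tenths.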

------
brian-armstrong
I really enjoyed this article. It has just enough detail to leave you
thinking, "Hey, I should write my own DEFLATE library."

~~~
bbcbasic
Would be fun to do that in Haskell.

Edit: [https://github.com/GaloisInc/pure-zlib?files=1](https://github.com/GaloisInc/pure-zlib?files=1).
Beautiful - they've done just the decompression! Challenge awaits.

------
xxs
The article mentions LZW. In terms of algorithmic elegance, LZW should be
head and shoulders above 'deflate'. LZW got patented, which (GIFs aside)
screwed up its popularity. 'deflate' allows custom dictionaries, which adds
some extra complexity.

I'd strongly disagree with _In many people's eyes compression was still
synonymous with run-length encoding_, as ARJ was extremely popular, to the
point it was used to cure full-stealth viruses (v512, for example). _hard to
provide references now but willing to search some old BBSes_

Other than that - fairly nice/nostalgic article.

~~~
gpvos
I very recently had to look at an LZW decoder, and it is indeed more elegant
than Deflate as described here. However, LZW doesn't have a Huffman post-
encoding step, so it does not compress as well as Deflate.
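
For the curious, the core of an LZW encoder really is tiny. This is a
minimal sketch with a linear-search dictionary and codes printed as decimal
numbers instead of packed bits - nothing like production code, but it is the
whole algorithm:

    #include <stdio.h>
    #include <string.h>

    #define MAX_CODES 4096

    static int prefix[MAX_CODES]; /* code of the sequence minus last byte */
    static int suffix[MAX_CODES]; /* the last byte of the sequence */
    static int ncodes;

    /* Return the code for sequence (w, k), or -1 if it isn't known. */
    static int find(int w, int k)
    {
        for (int i = 256; i < ncodes; i++)
            if (prefix[i] == w && suffix[i] == k)
                return i;
        return -1;
    }

    static void lzw_encode(const unsigned char *in, size_t n)
    {
        ncodes = 256;                 /* codes 0..255 are single bytes */
        int w = in[0];
        for (size_t i = 1; i < n; i++) {
            int k = in[i], c = find(w, k);
            if (c >= 0) {
                w = c;                /* keep extending the match */
            } else {
                printf("%d ", w);     /* emit the match we had so far */
                if (ncodes < MAX_CODES) {
                    prefix[ncodes] = w;   /* grow the dictionary: */
                    suffix[ncodes] = k;   /* (w, k) gets the next code */
                    ncodes++;
                }
                w = k;
            }
        }
        printf("%d\n", w);
    }

    int main(void)
    {
        const char *s = "TOBEORNOTTOBEORTOBEORNOT";
        lzw_encode((const unsigned char *)s, strlen(s));
        return 0;
    }

Note there is no explicit dictionary transmission at all: the decoder can
rebuild the same table from the codes themselves.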

~~~
beagle3
Asymptotically it doesn't matter - that's the genius of all LZ schemes (LZ77
=~ LZSS, LZ78 =~ LZW). All the other details just take care of the early,
pre-asymptotic stage.

~~~
TD-Linux
No, it still does - LZW assumes flat probabilities for each entry in its
dictionary. After a while, both the encoder and decoder will know more
accurate probabilities, but LZW has no way to take advantage of that.

~~~
beagle3
But asymptotically it doesn't matter because the LZ78/LZW code words represent
longer and longer sequences, with the more probable sequences getting shorter
codes. The additional letter (per each code word, in the case of LZ78) doesn't
matter asymptotically. Practically, of course, it does.

~~~
gpvos
How do the more probable sequences get shorter codes?

~~~
beagle3
That's basically the objective of every compression algorithm. Huffman coding
does it directly by construction; Lempel-Ziv does it indirectly - in
LZ78/LZW, you get, essentially and asymptotically (though randomly, per
individual sequence), a code length of about -log_2 P(sequence), because the
sequences added to the dictionary are added in order of probability.

In LZ77/LZSS, it is basically the same, except the dictionary is not
constructed explicitly but rather implicitly from the history. In both cases,
you need to refer to Lempel and Ziv's proof that the codeword length indeed
converges to -log_2 P(sequence), which is optimal by Shannon's source coding
theorem.

I highly recommend Cover & Thomas's book on information theory if you find
this interesting and are not afraid of math. And Shannon's original 1948
paper ("A Mathematical Theory of Communication") is surprisingly and
exceptionally readable - though it doesn't cover LZ, for obvious reasons.
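
In symbols, a standard form of the optimality result (this is roughly how
Cover & Thomas state it, from memory) is that for a stationary ergodic
source, the LZ78 code length per symbol converges to the entropy rate:

    \lim_{n \to \infty} \frac{1}{n} \, \ell_{LZ}(X_1, \ldots, X_n)
        = H(\mathcal{X}) \quad \text{almost surely}

which is just the per-symbol form of codeword lengths approaching
-log_2 P(sequence).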

~~~
gpvos
Okay, so it's because the most probable sequences are more likely to be
encountered first, which assumes the data is more or less uniform. (I.e., it
doesn't work as well when the data starts with a large blob of random data,
and the more regular data comes afterwards. But I guess that's true for most
general compression algorithms. At least LZW has a reset code.)

I'll see if I can find the C&T book. Shannon's paper was part of the course
material at uni, IIRC; at least I think I read it long ago.

~~~
beagle3
It's not just a matter of being encountered first; it's a matter of being
encountered more often ("encountered first" is a special case of "more
often"). But you really need to delve into the proof to figure out why that's
enough for optimality. At the very least, _I personally_ don't know of a
simple, intuitive way to explain it.

It is important to remember what each compression algorithm is optimal for.

LZ77 and LZ78 are optimal for observable Markov sources -- that is, sources
in which, given the x most recent consecutive outputs (where x is the Markov
order), the distribution of the next symbol is fixed. While this is rarely
ever the case, it is reasonable to assume that e.g. English text is a
10th-order Markov model (with respect to characters) or 3rd-order (with
respect to words).

The source you described is NOT Markov overall, although it might be Markov
in the tail (past the random blob). Asymptotically, of course, it is Markov :)

~~~
gpvos
Hmm... I thought that in LZW, once a sequence has been assigned a code, it
can never be assigned a shorter one; codes are assigned in order, and they
only get longer over time (until a reset is triggered). So the frequency of a
particular sequence cannot influence the length of the code assigned to it. I
guess I'll read up on it a bit more before I comment further. I do see that
it could end up being optimal in the asymptotic case.

~~~
beagle3
You are, of course, correct. However, consider that you cannot reassign the
codes for the letters A-Z either, and yet you can still compress by assigning
a _longer_ code to a _sequence_ (the code is longer, but still shorter than
the individual codes needed to represent the sequence). In the same way, even
though you cannot reassign a code, you will get a new code for a longer
sequence that includes the shorter sequence, represents more letters, and
overall is more likely to occur than the shorter sequence followed by
something else.

Assuming no reset, over time the shorter codes fall out of use, because the
longer ones eventually include everything they represent, extended by all
possibilities (with the more probable extensions getting shorter codes by
virtue of appearing more often / earlier).

------
jamesfisher
> This paragraph has five-hundred and sixty-one lower-case letters in

It must have taken some trial-and-error to make this sentence correct, since
the number itself adds some letters.

~~~
to3m
Pick a value that's close enough, and tweak everything else until it's true :)
This post has one hundred and thirty-two chars in it.

~~~
vintermann
This, ten

------
shenberg
The most clever part of deflate is actually how it stores the Huffman trees:
not by storing the frequencies, but rather the length in bits of each
symbol's code. Combining the length of each symbol with its lexicographical
ordering is enough to uniquely specify the actual tree.

Ok, some research showed me that the idea can probably be traced back to
Schwartz & Kallick 1964 (can't find the text), popularized by TAOCP vol. 1...

~~~
userbinator
The technique is known as "Canonical Huffman" and is quite elegant indeed. It
is best understood by considering the codes as binary numbers; then the tree
structure naturally appears. There's a very short algorithm to generate the
appropriate encoding and decoding tables in the JPEG standard, and it
involves not much more than shifts and adds.
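
The DEFLATE spec itself (RFC 1951, section 3.2.2) gives essentially the same
procedure; a rough C transcription (a sketch, not zlib's actual code):

    #include <stdio.h>

    #define MAX_BITS 15

    /* Given each symbol's code length, assign the canonical codes:
       numerically smaller codes to shorter lengths, and codes of equal
       length in symbol order. Follows RFC 1951, section 3.2.2. */
    static void canonical_codes(const int *len, unsigned *code, int nsyms)
    {
        int bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1] = {0};

        /* 1) Count how many codes there are of each length. */
        for (int n = 0; n < nsyms; n++)
            bl_count[len[n]]++;
        bl_count[0] = 0;

        /* 2) Compute the smallest code value for each length. */
        unsigned c = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            c = (c + bl_count[bits - 1]) << 1;
            next_code[bits] = c;
        }

        /* 3) Hand out consecutive values to symbols in lexical order. */
        for (int n = 0; n < nsyms; n++)
            if (len[n] != 0)
                code[n] = next_code[len[n]]++;
    }

    int main(void)
    {
        /* The worked example from RFC 1951: lengths (3,3,3,3,3,2,4,4)
           for symbols A..H yield F=00, A=010, ..., H=1111. */
        int len[8] = {3, 3, 3, 3, 3, 2, 4, 4};
        unsigned code[8];
        canonical_codes(len, code, 8);
        for (int n = 0; n < 8; n++)
            printf("%c: %d bits, code %u\n", 'A' + n, len[n], code[n]);
        return 0;
    }

The punchline is that the decoder only ever needs the len[] array; the codes
themselves never have to be transmitted.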

------
Jasper_
I wouldn't say that entropy coding is "separate" from dictionary / window
compression. DEFLATE surely isn't the first algorithm to combine LZ77 and some
form of entropy coding, and surely won't be the last.

I highly recommend that everybody go out and build an INFLATE implementation;
it's a lot of fun.

Mine can be found at
[https://github.com/magcius/libt2/blob/master/t2_inflate.h](https://github.com/magcius/libt2/blob/master/t2_inflate.h)

~~~
dunham
I did this in javascript years ago (I had done snappy first, but wanted
smaller data). It's fun and not too difficult.

The deflate side seems a little more complex. I played with it while on
vacation about a month ago, but stalled after doing the LZ77 and Huffman
table generation, because I really just wanted to learn how the LZ77 bits
worked, and I don't actually have a project that needs it.

------
flohofwoe
Anyone know the history behind using setjmp/longjmp as 'poor man's
exceptions' for error handling? Since C works perfectly fine without
exceptions, why add such a hack? I've seen this in libpng, and I think
libjpeg has it too (one of many reasons I'm avoiding those libs).
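
For reference, the pattern I mean looks roughly like this (a minimal sketch
of the libpng-style approach, not any library's actual code; real code also
has to be careful with local variables modified between setjmp and longjmp):

    #include <setjmp.h>
    #include <stdio.h>

    /* One jump buffer per decoding context. Any function deep in the
       call stack can longjmp straight back to the caller's setjmp. */
    static jmp_buf on_error;

    static void parse_chunk(int bad)
    {
        if (bad)
            longjmp(on_error, 1);   /* "throw": unwinds in one jump */
        puts("chunk ok");
    }

    static int decode(int bad)
    {
        if (setjmp(on_error))       /* returns nonzero after a longjmp */
            return -1;              /* the "catch" block */
        parse_chunk(0);
        parse_chunk(bad);           /* may "throw" */
        return 0;
    }

    int main(void)
    {
        printf("clean file:   %d\n", decode(0));
        printf("corrupt file: %d\n", decode(1));
        return 0;
    }

The appeal is that none of the intermediate parsing functions need to check
and propagate error returns by hand.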

~~~
johncolanduoni
C lacks "zero-cost" exceptions, where the code should be just as fast as it
would be if no exceptions were involved as long as one isn't thrown. Although
they are _much_ more expensive when thrown than checking a return value and
using indirection for any logical return values, they are a bit cheaper when
you don't expect the exception to happen very often, or don't care about
performance if it does (e.g. a corrupt png or jpeg file).

My dream is something like the UX of Swift's exceptions (basically normal C
error handling except for it does all the tedious pointer work for you) with
the ability to tell the compiler to either implement the exceptions as return
values or stack unwinds for a particular code path with the flick of a switch.

~~~
vog
My big hope in this regard is Rust, which has a very good track record of
implementing good, usable, zero-cost abstractions.

If they manage to implement zero-cost futures [1][2], maybe they will also
implement some kind of zero-cost error handling abstraction (either via
exceptions, or via some other useful abstraction for that purpose).

[1]
[https://aturon.github.io/blog/2016/08/11/futures/](https://aturon.github.io/blog/2016/08/11/futures/)

[2]
[https://news.ycombinator.com/item?id=12268988](https://news.ycombinator.com/item?id=12268988)

~~~
barrkel
Rust's abstractions are only zero-cost if you're looking in the wrong place
for your costs. I would be very surprised if the Result / try! stuff in
particular is faster than a good exception approach, particularly on the
happy path. Its cost is amortised throughout, though, while different
exception-throwing mechanisms have dramatically different performance, so
people's intuitions aren't reliable.

I like Rust a lot. Zero cost is not the reason, though; my reason is no GC
and not being as bad as C or C++.

~~~
steveklabnik
Don't forget exceptions aren't free themselves; you need landing pads, etc.,
none of which are required otherwise.

------
mozumder
Deflate is so ubiquitous (being a web standard) that I'd like to see a
hardware Deflate encoder/decoder.

It really needs to be in the core instruction set of x86 & ARM.

~~~
userbinator
[http://www.xilinx.com/products/intellectual-property/1-7aisy...](http://www.xilinx.com/products/intellectual-property/1-7aisy9.html)

------
banachtarski
Using additional bits to encode lengths isn't really that different from
having a marker indicating whether the payload is a match vector or not. In
fact, it could be argued to be less efficient for certain types of data,
depending on the Huffman distribution, since all "words" are larger than a
byte. It also prevents you from encoding longer matches. I treat the decision
as a compromise between the two extremes, one that "worked" for the majority
of data types and was "good enough" - for some definition of both quoted
words. The entire idea works well enough, but I wouldn't go out of my way to
declare it a marvel of engineering or anything.

~~~
HisGraceTheDuck
"it could be argued to be less efficient for certain types of data" This is
true of all lossless compression.

~~~
banachtarski
You're invoking a stronger, "technically correct" interpretation of the
statement I'm making. Let me qualify: in a blob with many short repeated
sequences, the algorithm as presented is better. In a blob with many long
repeated sequences, a marking bit is better. Honestly, neither example is
that far-fetched.

