
Dissecting the GZIP format (2011) - siromoney
http://www.infinitepartitions.com/art001.html
======
vog
I would love to see a Python (or Haskell) implementation of gzip (and others)
that is optimized for:

* _readability_ and

* _introspection_

rather than speed and memory usage. That way, you could step into every
primitive, and see intermediate results in a human-readable representation (as
provided by the programming language). Encoding that stuff to binary would
always happen in a separate step.

That way, those compression algorithms, with all details, would be easier to
learn and to understand. Also, development of new, customized compression
schemes would be a lot easier than it is today, where you either have to take
existing ones, or have to start from scratch.

A custom implementation might be as simple as a rearrangement and
recombination of existing building blocks, or may introduce completely new
(possibly domain specific) building blocks.

~~~
joosters
GNU gzip is surprisingly well commented (as in, I was surprised to find some
multi-paragraph comments at all!)

For understanding the algorithm, surely it's better to look at sources like
this document rather than a code interpretation of it?

~~~
EdwardCoffin
I'd say that being able to interact with a readable implementation designed
for pedagogy, while it works on data chosen by me, would be an excellent
supplement to reading this document.

------
jvns
This article is pretty great. I used it as a reference to write gunzip in
Julia: [https://github.com/jvns/gzip.jl](https://github.com/jvns/gzip.jl)

~~~
epsylon
Have to mention that your blog is pretty awesome as well:
[http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works/](http://jvns.ca/blog/2013/10/23/day-15-how-gzip-works/)

~~~
siromoney
I agree, I spent yesterday evening on the blog (and that's where I found the
link).

------
notlisted
Nice write-up. Brings back good memories. (Pak, LHarc and above all ARJ).
BiModem FTW! I feel ancient.

~~~
tssva
You feeling ancient thinking back about Pak, lharc, arj and bimodem makes me
feel really ancient. I remember dialing the BBS on the phone and then sticking
the handset in the acoustic coupler of my Novation CAT which was connected to
the LNW System Expansion attached to my TRS-80 Model 1.

~~~
AceJohnny2
You had 300b/s both ways and you were _grateful_ for it!

To compare, apparently the download bandwidth with the Mars Reconnaissance
Orbiter is between 3500b/s and 12000b/s:
[http://www.astrosurf.com/luxorion/qsl-mars-communication3.htm](http://www.astrosurf.com/luxorion/qsl-mars-communication3.htm)

------
conexions
Slightly related. This was one of my favorite assignments from my algorithms
class on coursera. We implemented part of bzip compression and decompression.
[http://coursera.cs.princeton.edu/algs4/assignments/burrows.html](http://coursera.cs.princeton.edu/algs4/assignments/burrows.html)

------
koralatov
I hadn't seen this before, but it does a great job of explaining how
compression works in a way that the less technically minded can understand. I
think I finally have an understanding _how_ compression works, rather than
treating it as a semi-magical process as I've done previously.

~~~
dunham
I have a pretty solid understanding of compression algorithms and the BW
transform in bzip2 still seems semi-magical to me:

[ftp://apotheca.hpl.hp.com/gatekeeper/pub/dec/SRC/research-reports/SRC-124.pdf](ftp://apotheca.hpl.hp.com/gatekeeper/pub/dec/SRC/research-reports/SRC-124.pdf)
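For anyone who wants to poke at it, the Burrows-Wheeler transform itself fits in a few lines. This is a naive sketch (sorted rotations with a sentinel byte, nothing like bzip2's suffix-sorting performance), just to show why it is invertible at all:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations.
    Naive O(n^2 log n) version; a sentinel marks the original string's end."""
    s = s + "\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last):
    """Invert by repeatedly prepending the last column and re-sorting.
    After n rounds the table holds the sorted rotations again."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("\x00"))[:-1]
```

The "magic" is that the transform is a permutation (same characters, reordered so that similar contexts cluster), which is what makes the following move-to-front and Huffman stages in bzip2 so effective.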

------
SanjayUttam
This link was on HN a few weeks ago, but it's a pretty neat visualization of
gzip at work: [http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-
equal...](http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-
awesome/)

------
joosters
A great article. I wonder if there are similarly thorough descriptions of the
compression part of gzip, and how to write that well?

The decompression is perhaps the easy part of the process, and while you can
understand the principles of compression from it, you can easily miss the
massive complexity involved. A naive search for backpointers is O(n^2) or
worse, so a usable gzip implementation has to do cunning tricks to find
repeating data efficiently.
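To make the cost concrete, here is a sketch of that naive backpointer search (names and limits are my own; 3/258 are the DEFLATE match-length bounds):

```python
def longest_match_naive(data, pos, min_len=3, max_len=258):
    """Try every earlier position as a match start: O(n) work per
    position, O(n^2) overall. This is what real gzip must avoid."""
    best_len, best_dist = 0, 0
    for start in range(pos):  # every candidate back-pointer
        length = 0
        # Matches may overlap pos (valid in LZ77: a run-length-style copy).
        while (length < max_len and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_dist = length, pos - start
    return best_len, best_dist
```

Correct, but hopeless on megabytes of input, which is why the hash-table scheme described below exists.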

~~~
gizmo686
I just peeked at the gzip source (actually, the algorithm.doc file in the root
of the source). It looks like they limit back-pointers to the previous 32k
bytes, immediately capping the naive scan at a constant amount of work per
position (O(n) overall).

From my brief reading, it looks like they use a hash table to speed up
lookups. Each 3-byte string is entered into the hash table, with a pointer to
the relevant location in the original text. Each hash index has a linked list
of entries. To find what backpointer to use for a given string, gzip finds the
relevant linked list from the hash table, and scans it looking for the longest
match.

More recent entries in the list are preferred (as smaller distances make the
pointers smaller because of the encoding). Additionally, the linked lists are
arbitrarily truncated to a certain length (determined by the '-1' through '-9'
run-time argument).

gzip also has a 'lazy evaluation mechanism'. This is where, after it matches a
string, it checks the next byte anyway (even if this byte is contained in the
previous match). If a longer match is found, gzip discards the previous
pointer and uses the new match. This is disabled in the fast speed modes.

If they are doing anything particularly clever, it is in the implementation
details. The algorithm itself seems relatively straightforward.
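The scheme above can be sketched in a few lines of Python. This is an assumed simplification, not gzip's actual code (gzip uses a rolling hash and parallel arrays rather than a dict of deques), but the structure is the same: hash the 3-byte prefix, walk the chain newest-first, stop at the window edge, truncate the chain length:

```python
from collections import defaultdict, deque

WINDOW = 32 * 1024   # gzip's 32 KB back-pointer limit
MAX_CHAIN = 128      # chain truncation, akin to the -1..-9 trade-off

def make_chains():
    # One bounded deque per 3-byte key; appendleft keeps newest first
    # and silently drops the oldest entry when the chain is full.
    return defaultdict(lambda: deque(maxlen=MAX_CHAIN))

def insert(data, pos, chains):
    """Register the 3-byte string starting at `pos`."""
    if pos + 3 <= len(data):
        chains[bytes(data[pos:pos + 3])].appendleft(pos)

def find_match(data, pos, chains):
    """Scan the chain for this position's 3-byte prefix, newest entries
    first, and return (length, distance) of the longest in-window match."""
    if pos + 3 > len(data):
        return 0, 0
    best_len, best_dist = 0, 0
    for prev in chains[bytes(data[pos:pos + 3])]:
        if pos - prev > WINDOW:
            break                     # older entries are all out of range
        length = 0
        while (pos + length < len(data) and length < 258
               and data[prev + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - prev
    return best_len, best_dist
```

Because chains are scanned newest-first, ties naturally go to the smallest distance, which is exactly the preference the encoding rewards.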

