

How gzip uses Huffman coding - StylifyYourBlog
http://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/

======
sbilstein
When I was in college at Rice, one of the harder projects in a sophomore level
software engineering class was implementing Gzip according to RFC 1951 and RFC
1952. We had about an hour lecture explaining what huffman coding was and then
were left to our own devices for three weeks.

Half the challenge was just understanding what the RFCs were talking about,
teaching ourselves huffman coding, and then figuring out how to build this low
level stuff in Java in a reasonable way. This was definitely one of the
hardest projects I had up until that point.

We had a class discussion afterwards about what was so difficult about
implementing GZIP and a few of us mentioned the RFCs were tough to figure out.
Professor laughed and said something along the lines of "Besides just learning
computer science, we're hoping to teach you the value of good documentation!"

~~~
andrewf
I've implemented gzip... I found that implementing the decompressor first
(with a very verbose --debug flag) was critical to understanding the bitstream
well enough to implement compression :)

~~~
voltagex_
Would you recommend it as a learning exercise?

~~~
andrewf
Maybe. There are lots of finicky details, like inconsistent bit ordering and
arbitrary lookup tables. I think it's a long week of work at least, probably
more.

On the other hand, it is probably the simplest compression format in
widespread use that uses variable-length encoding, and there's a few
implementation around for you to look at. Things like bzip2 and LZMA2 really
don't have independent specifications and multiple implementations, and are
more complicated to boot.

Byte-oriented Lempel-Ziv formats like LZ4, LZO and LZJB would be a good way to
dip your toes into a production compression format. The LZW encoding used by
GIF files is also "simpler" in principle but I personally find it harder to
wrap my head around.

If you want to experiment with binary formats, it might also be interesting to
write a decoder for a simple image format, like PCX files, or even for old
game file formats like Doom's
[[http://doom.wikia.com/wiki/WAD](http://doom.wikia.com/wiki/WAD)].

~~~
unwind
LZJB is fantastic. Here's my take on a Python implementation of both
compression and decompression: [https://github.com/unwind/python-
lzjb](https://github.com/unwind/python-lzjb). It's ~150 lines for both, and
BSD 2-clause-licensed. It's not very high-performant (I wrote it for fairly
small amounts of data) but hopefully clear enough to learn from.

------
jdonaldson
Huffman coding is a gateway drug for data science. Once you see raw input as
data, you see as much benefit from optimizing as you do from analyzing.

