
An In-Depth Look At Huffman Encoding - Anon84
http://www.dreamincode.net/forums/blog/324/entry-3150-an-in-depth-look-at-huffman-encoding/
======
pohl
I recently went through the exercise of implementing a Huffman codec. Instead
of the heap (or priority queue) I used the method that uses two queues to
build the tree. See the second algorithm given here:

<http://en.wikipedia.org/wiki/Huffman_coding#Basic_technique>

Even though I was writing it in Java, it was a lot of fun. I ended up with
something that I could use on the server side, or compiled down to javascript
in the client with GWT.
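The two-queue build from the linked Wikipedia section can be sketched roughly like this (class and method names are mine, not pohl's; it assumes the leaves arrive sorted by ascending frequency, so each merge is O(1) and the whole build is O(n)):

```java
import java.util.ArrayDeque;
import java.util.Queue;

class TwoQueueHuffman {
    static class Node {
        final char symbol;      // '\0' for internal nodes
        final int freq;
        final Node left, right;
        Node(char symbol, int freq, Node left, Node right) {
            this.symbol = symbol; this.freq = freq;
            this.left = left; this.right = right;
        }
    }

    // freqs must be sorted ascending; the merged queue then stays sorted
    // automatically, which is what makes the build linear.
    static Node build(char[] symbols, int[] freqs) {
        Queue<Node> leaves = new ArrayDeque<>();
        Queue<Node> merged = new ArrayDeque<>();
        for (int i = 0; i < symbols.length; i++)
            leaves.add(new Node(symbols[i], freqs[i], null, null));
        while (leaves.size() + merged.size() > 1) {
            Node a = popMin(leaves, merged);
            Node b = popMin(leaves, merged);
            merged.add(new Node('\0', a.freq + b.freq, a, b));
        }
        return popMin(leaves, merged);
    }

    // Take the smaller head of the two queues; both are in ascending order.
    static Node popMin(Queue<Node> q1, Queue<Node> q2) {
        if (q1.isEmpty()) return q2.remove();
        if (q2.isEmpty()) return q1.remove();
        return q1.peek().freq <= q2.peek().freq ? q1.remove() : q2.remove();
    }
}
```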

~~~
SeveredCross
I found the two-queue technique to be much more intuitive than the binary heap
technique, which definitely resulted in it being more fun to code.

As an aside, the two-queue technique is also more performant, as you can build
the tree in O(n) time instead of O(n log n) time.

~~~
eru
You still have to sort to build the queues. So you won't get below O(n log n)
for the whole algorithm.

If you squint just about right, you can probably see the equivalence between
sorting plus queues and the tree based techniques. Especially if you use a
tree-structured sorting algorithm.

~~~
pohl
In my particular case, the character frequencies are (more or less) fixed and
the sorting could be done ahead-of-time before feeding the leaf nodes into the
first queue. So I really did get O(n) at instantiation time. However, my n is
so small that it didn't matter much.

~~~
eru
You can also fix your whole encoding ahead of time, then.

~~~
pohl
You can if it's absolutely fixed. But what about when it's only "more-or-less-
fixed"? :-)

~~~
eru
Then you need the sorting, again. You can't have your cake and eat it, too.

~~~
pohl
Being able to relegate the cost of sorting to be a one-time cost on a server
and allowing clients to build the tree they need for decoding in O(n) allows
the client to [something about cake].

~~~
eru
I don't get it. If the probabilities are fixed, you might as well do the
complete preprocessing on the server side.

~~~
pohl
I guess I chose to build the tree on the client side because the list of
character frequencies is a fairly compact representation of the encoding, and
I didn't want to serialize the whole tree and send it across the wire to the
client. Are you suggesting that the client doesn't need the tree at all? If
so, how would the client do the decoding? I confess I'm doing the naive
crawling down the tree to the left or right as a 0 or 1 is read. Is there some
other way the client could be doing this?
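The naive decode pohl describes could look something like this: start at the root, walk left on 0 and right on 1, and emit a symbol each time a leaf is reached. (`Node` here is a hypothetical stand-in for pohl's `BinaryTree<HuffmanNode>`; the string-of-bits input is just for illustration.)

```java
class TreeWalkDecoder {
    static class Node {
        final Character symbol;   // null for internal nodes
        final Node left, right;
        Node(Character s, Node l, Node r) { symbol = s; left = l; right = r; }
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (int i = 0; i < bits.length(); i++) {
            cur = bits.charAt(i) == '0' ? cur.left : cur.right;
            if (cur.symbol != null) {   // reached a leaf: emit and restart
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }
}
```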

~~~
eru
If you send the tree, you don't need to send character frequencies.

~~~
pohl
Yes that's true, but they don't have equal cost to transmit over the wire. Why
should I transmit the larger, more expensive tree when the character
frequencies are a smaller representation of the same information?

~~~
eru
Why should the frequencies be more compact than the tree? If at all, they
should take more space.

~~~
pohl
I was explaining my tradeoff within the context of the GWT RPC object
serialization. My tree is a graph with nodes of type BinaryTree<E>, where E in
my case is a HuffmanNode object containing a char and an int. My alphabet has
66 characters plus an EOM character. The number of total nodes in the tree is
133. Although the extra internal nodes have HuffmanNode instances that do not
contain a char, they do still contain the int sum of the frequencies of all
the characters under that subtree. The BinaryTree class itself has left,
right, and parent references. So, the way the GWT serialization format works
out, this beast ends up being larger than the characters and frequencies
alone.

I know what you're thinking: why not transform the tree into an Ahnentafel
list and marshal it to a string containing only the alphabet characters in
the appropriate slots, and use another chosen character to represent null
slots, and write another implementation of the decoder that's driven off of
the array? _(Edit: Or, better still, transform it to a canonical Huffman
codebook)_

Yes, of course I could do that — and I actually want to. Alas, I have bigger
fish to fry than writing the tree transformation code, the marshaling code,
and the additional decoder implementation. So, when it's all said & done,
according to the tradeoffs I was willing to make, my clients get to build the
tree in O(n). I can tell that you're vexed that it worked out that way, but
that's life.
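The canonical-codebook idea pohl edits in is worth a sketch: ship only each symbol's code length, and both sides regenerate identical codes by assigning consecutive values within each length. (Names here are hypothetical, not from pohl's codec.)

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class CanonicalHuffman {
    // lengths[i] is the Huffman code length of symbols[i].
    static Map<Character, String> codebook(char[] symbols, int[] lengths) {
        // Sort indices by (length, symbol) so both ends agree on the order.
        Integer[] order = new Integer[symbols.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> lengths[x] != lengths[y]
                ? lengths[x] - lengths[y] : symbols[x] - symbols[y]);

        Map<Character, String> codes = new LinkedHashMap<>();
        int code = 0, prevLen = 0;
        for (int idx : order) {
            code <<= (lengths[idx] - prevLen);  // widen when length grows
            String bits = Integer.toBinaryString(code);
            while (bits.length() < lengths[idx]) bits = "0" + bits;
            codes.put(symbols[idx], bits);
            code++;
            prevLen = lengths[idx];
        }
        return codes;
    }
}
```

This is the same trick DEFLATE uses: the code lengths alone are enough to rebuild the whole codebook, so the tree never has to cross the wire.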

~~~
eru
Oh, you have practical considerations! Then by all means do what works best.

I was just thinking about purely theoretical optima.

------
xtacy
Related: Arithmetic coding, which achieves compression rates much closer to
the theoretical limit.

<http://en.wikipedia.org/wiki/Arithmetic_coding>

EDIT: Note that in all cases, we assume that the symbol occurrences are
independent. LZW outperforms Huffman because it doesn't make any such
assumption.

~~~
sqrt17
Arithmetic coding is patent-encumbered, and a bit harder to implement. The
point where arithmetic coding outperforms Huffman coding is when you have very
frequent symbols (i.e., symbols that would ideally take less than two bits or
so); you can circumvent this by applying Huffman coding to larger units such
as words or tokens (e.g., instead of the characters '4' and '2', a single
integer token "42").

Another thing that's fun to implement is the Burrows-Wheeler transform, which
is at the heart of bzip2 (you then need another compressor for the
BW-transformed text, but still...)
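A toy version of the Burrows-Wheeler transform sqrt17 mentions: sort all rotations of the input (with a `'\0'` sentinel so the transform is invertible) and take the last column. Real bzip2 uses suffix sorting; this naive O(n² log n) version just shows the idea.

```java
import java.util.Arrays;

class NaiveBWT {
    static String transform(String s) {
        String t = s + '\0';                 // sentinel, lexically smallest
        int n = t.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++)
            rotations[i] = t.substring(i) + t.substring(0, i);
        Arrays.sort(rotations);
        StringBuilder last = new StringBuilder();
        for (String r : rotations)
            last.append(r.charAt(n - 1));    // last column of the sorted matrix
        return last.toString();
    }
}
```

The output clusters equal characters together, which is why a simple follow-up compressor (move-to-front plus Huffman, in bzip2's case) does so well on it.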

~~~
eru
Shouldn't the patents on arithmetic coding be at the end of their lives by now?

You can probably view Huffman-coding as a very special case of arithmetic
coding.

~~~
dalke
Yep. The original paper was in 1979 or so, and pretty much all the US patents
have expired.

------
eru
Somebody should tell the author not to use JPEG for diagrams. There are ugly
artefacts. How do you comment on that site?

