In any case, an LZ decompressor fits in less than 100 bytes of machine instructions and a simple compressor can be implemented in not more than several hundred, all the while providing extremely high compression for its simplicity. It will easily outperform order-0 static or dynamic Huffman on practical files like English text, and would probably make a good assignment in an undergraduate-level beginning data structures/algorithms/programming course; yet it seems more popular to give an assignment on Huffman using trees, which is somewhat ironic since in the real world Huffman is implemented using bit operations and lookup tables, not actual tree data structures.
To give a trivial example, LZ will easily compress ABCDABCDABCDABCD while order-0 Huffman can't do much since each individual symbol has the same frequency.
My guess is that the "late" development of LZ is due to mainly two reasons:
i) At that moment the pattern matching algorithms were not so advanced. E.g. suffix tree was very recent, and in the next years lots of advances occurred in that area...
ii) Although LZ can appear easier or more intuitive than Huffman, I think it is much less intuitive to prove a good bound in the compression achieved by LZ. OTOH, Huffman is build in a way that shows that it achieves zeroth-order compression.
And Huffman isn't optimal unless you are lucky, unlike arithmetic coding.
To start your range is [0, 1). For each symbol you want to encode you take your range and split it up according to your probabilities. E.g. if your symbols are 25% A, 50% B and 25% C, then you split up that range in [0, 0.25) for A, [0.25, 0.75) for B and [0.75, 1) for C.
Encoding multiple symbols is just applying this recursively. So to encode the two symbols Bx we split up [0.25, 0.75) proportionally just like we did [0, 1) before to encode x (where x is A, B or C).
As an example, A is the range [0, 0.25), and AC is the range [0.1875, 0.25).
Now to actually turn these ranges into a string of bits we choose the shortest binary representation that fits within the range. If we look at a decimal number:
The beauty of arithmetic coding is that after encoding/decoding any symbol we can arbitrarily change how we split up the range, giving rise to adaptive coding. Arithmetic coding can perfectly represent any data that forms a discrete string of symbols, including changes to our knowledge of data as we decode.
A Huffman tree for digits might assign 0-5 to 3 bits and 6-9 to 4 bits. Encoding three digits will use on average slightly more than 10 bits. Using AC will let you give the same amount of space to each possibility, so that encoding three digits always uses less than 10 bits.
"0" = 0.0b = 0 falls in the range [0,0.25) so it's a valid encoding for "A"; but isn't it also a valid encoding for "AA", "AAA", etc.?
AA = [0,0.25) * [0, 0.25) = [0, 0.125), and so on.
It seems that adding "A"s to a string in general doesn't change its encoding.
It's the equivalent to pretending a Huffman stream never ends and is padded with infinite 0s.