When I was interested enough in compression to read about it, most of the information I could find seemed highly theoretical, and it made me think Huffman was a good first step. So I managed to write a simple static/dynamic Huffman compressor/decompressor, but it didn't work out so well: it was "fast enough", but the compression ratio was disappointing. When I discovered LZ the difference was amazing: an extremely simple algorithm, requiring about an order of magnitude less code to implement, that compressed much better than Huffman on data that intuitively should be compressible.
In retrospect, it's quite clear that Huffman and other variable-length-code techniques only work because of nonuniformity in symbol probabilities, so they are ineffective on data with repeating patterns but uniform symbol frequencies. E.g. take a string 256 bytes long containing every byte value 0, 1, 2, ..., 253, 254, 255, and repeat that string many times in a file. LZ can easily compress that pattern, but Huffman can't. (Does LZ rely on probability nonuniformity? It doesn't look like it.) The fact that many common methods like DEFLATE use Huffman only to compress the output of an LZ pass over the input probably reflects this, but I don't think I've seen such an intuitive explanation for it before.
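A quick way to check this is with Python's zlib (just a sketch: the Z_HUFFMAN_ONLY strategy disables DEFLATE's LZ77 match stage, leaving only the Huffman coding):

```python
import zlib

# 0..255 repeated 64 times: every byte value is equally frequent,
# but the data is extremely repetitive.
data = bytes(range(256)) * 64

def deflate(payload, strategy):
    c = zlib.compressobj(level=9, strategy=strategy)
    return c.compress(payload) + c.flush()

huffman_only = deflate(data, zlib.Z_HUFFMAN_ONLY)         # entropy coding only
lz_plus_huffman = deflate(data, zlib.Z_DEFAULT_STRATEGY)  # LZ77 matches + Huffman

print(len(data), len(huffman_only), len(lz_plus_huffman))
```

The Huffman-only output stays close to the original 16 KiB because all 256 symbols are equally likely, while the default LZ77+Huffman path collapses the repeats to a small fraction of that.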
Recently I've been following the progress of the Finite State Entropy algorithm. From the README on the GitHub project:
> FSE is a new kind of Entropy encoder, based on ANS theory, from Jarek Duda.
> It is designed to compete with Huffman encoder and Arithmetic ones.
> While huffman is fast but can only represent probabilities in power of 2 (50%, 25%, etc.) arithmetic coding can represent probabilities with much better accuracy, but requires multiplications and divisions.
> FSE solve this dilemma by providing precise probabilities, like arithmetic does, but using only additions, masks and shifts.
> This makes FSE faster, on par with Huffman speed, and even suitable for low-power CPU environment.
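To make the quoted power-of-2 limitation concrete, here's a small sketch with an assumed skewed three-symbol source: Huffman has to assign whole-bit code lengths, so its average code length can sit well above the Shannon entropy that arithmetic coding or ANS can approach.

```python
import heapq, math

probs = [0.9, 0.05, 0.05]  # hypothetical skewed 3-symbol source

# Shannon entropy: the theoretical lower bound in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs)

# Huffman code lengths: repeatedly merge the two least probable nodes;
# every leaf inside a merged node gets one bit deeper.
depths = [0] * len(probs)
heap = [(p, i, [i]) for i, p in enumerate(probs)]
heapq.heapify(heap)
tiebreak = len(probs)
while len(heap) > 1:
    p1, _, leaves1 = heapq.heappop(heap)
    p2, _, leaves2 = heapq.heappop(heap)
    for leaf in leaves1 + leaves2:
        depths[leaf] += 1
    heapq.heappush(heap, (p1 + p2, tiebreak, leaves1 + leaves2))
    tiebreak += 1

huffman_avg = sum(p * d for p, d in zip(probs, depths))
print(f"entropy     = {entropy:.3f} bits/symbol")      # ~0.569
print(f"huffman avg = {huffman_avg:.3f} bits/symbol")  # 1.100
```

Here Huffman spends about 1.1 bits/symbol against an entropy of roughly 0.57 bits/symbol; the gap shrinks as the probabilities get closer to powers of two.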
This also highlights a fundamental issue: the coding scheme is tailored to the expected input data. That's why there's no one-size-fits-all compression algorithm, and you can achieve far greater efficiency if you have a good model of the shape of the input data. (The same applies to lossy compression.)
The MCR is also a nice way to demonstrate why random data can't be compressed. Let's say you have 256 symbols, each with probability 1/256 (independent, etc.). The Shannon entropy = -sum(P(x) * log2(P(x)) for x in 0..255) = 256 * (1/256) * 8 = 8 bits per symbol. No compression possible.
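The same sum in code, just to check the arithmetic:

```python
import math

p = 1 / 256
H = -sum(p * math.log2(p) for _ in range(256))
print(H)  # 8.0 bits per symbol: you already pay 8 bits per byte, so there is nothing to save
```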
Otherwise a great text; it's just that the presentation is a bit distracting (at least for my taste/eyes).
Although somewhat different from what you're describing, you might be interested in arithmetic coding, which encodes the data into a single numerical value that can later be recovered by an inverse process.
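For intuition, here's a toy, idealized arithmetic coder (a hypothetical sketch using exact fractions, not how production coders work; real ones use fixed-precision integers and renormalization). It narrows an interval once per symbol and emits a single number inside the final interval:

```python
from fractions import Fraction

# Model: fixed symbol probabilities (must sum to 1). Hypothetical example alphabet.
probs = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}

def cumulative(probs):
    """Assign each symbol a sub-interval [start, start + width) of [0, 1)."""
    intervals, start = {}, Fraction(0)
    for sym, p in probs.items():
        intervals[sym] = (start, p)
        start += p
    return intervals

def encode(message, probs):
    """Narrow [low, low + width) once per symbol; any point inside identifies the message."""
    intervals = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        start, p = intervals[sym]
        low += width * start
        width *= p
    return low + width / 2  # any value in the final interval works

def decode(value, probs, length):
    """Invert the process: find which sub-interval contains the value, then rescale."""
    intervals = cumulative(probs)
    out = []
    for _ in range(length):
        for sym, (start, p) in intervals.items():
            if start <= value < start + p:
                out.append(sym)
                value = (value - start) / p  # rescale back to [0, 1) and continue
                break
    return ''.join(out)

msg = 'abacab'
code = encode(msg, probs)
assert decode(code, probs, len(msg)) == msg
print(code)  # a single rational number representing the whole message
```

More probable symbols shrink the interval less, so they cost fewer bits of precision in the final number; that's where the compression comes from.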
Isn't LZ77 the most efficient text compression algorithm?