In the comments Jarek Duda discusses some of the more technical details of the implementation and of his ANS paper that enabled this writeup. One cool thing I noticed is this:
"Another advantage of ANS is that we can slightly perturb the initialization procedure using a pseudo-random number generator initialized with a cryptographic key (and, for example, also the block number): for instance, choosing between the lowest-weight symbol and the second-best one. This way we simultaneously get decent encryption for free."
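A minimal sketch of how that keyed perturbation could look for a tANS-style symbol spread. The spread heuristic, the SHA-256 seeding, and the 50/50 choice are my assumptions for illustration, not Duda's actual initialization:

    import hashlib
    import random

    def keyed_spread(freqs, table_size, key, block_index):
        # Seed a PRNG from the cryptographic key and the block number,
        # as the quoted comment suggests. (SHA-256 is my choice here.)
        assert sum(freqs.values()) == table_size
        seed = hashlib.sha256(key + block_index.to_bytes(8, "big")).digest()
        rng = random.Random(seed)

        remaining = dict(freqs)  # slots each symbol still needs
        spread = []
        for _ in range(table_size):
            # Rank symbols by how many slots they still need; a crude
            # stand-in for a real tANS spread priority, not the paper's rule.
            ranked = sorted(remaining, key=lambda s: -remaining[s])
            # The perturbation: let the keyed PRNG sometimes pick the
            # second-best symbol instead of the best one.
            if len(ranked) > 1 and rng.random() < 0.5:
                pick = ranked[1]
            else:
                pick = ranked[0]
            spread.append(pick)
            remaining[pick] -= 1
            if remaining[pick] == 0:
                del remaining[pick]
        return spread

    # Different keys (or block numbers) yield different tables:
    print(keyed_spread({"a": 4, "b": 2, "c": 2}, 8, b"key1", 0))
    print(keyed_spread({"a": 4, "b": 2, "c": 2}, 8, b"key2", 0))

Since decoding requires the exact same table, an attacker without the key would have to guess which spread was used.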
I toyed around with entropy coding at one point, based on a republished copy of an old Dr. Dobb's Journal article I found. I managed to cleanly reimplement the described algorithm in Python and reproduce the example from the article... but eventually I hit a problem I wasn't smart enough to debug. I'm eager to see whether this new algorithm is more approachable.
Didn't IBM's Q-Coder also make the same claims?
Here's another one: http://www.gvu.gatech.edu/people/official/jarek/courses/7491...
"The Q-Coder is a new form of adaptive binary arithmetic coding"
A lot of places seem to credit Rissanen, but I remember from when this was actually my field that Elias did most of the work in the early '60s (picking up a tangent Shannon had considered in the '50s), and that Rissanen's contribution was getting the finite-word-length arithmetic considerations properly done.
In fact, when I studied this at university (early '90s), it was introduced as "Elias (a.k.a. arithmetic) coding".
I've been meaning to get notes from his classes as well as some transform mathematics notes from one of David Huffman's classes online.
It builds a table of symbols based on a probability. If you feed it p = 0.5 you get a sequence whose symbol counts halve each time:

0000000011111223 (shortened example; the real total length is 4096, so you get 2048x0, 1024x1, ...)

This table is then used as the symbol distribution when building the random sequences in the test file.
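A rough sketch of that table construction, assuming each successive symbol takes a p-fraction of the remaining slots; the rounding at the tail is my guess, so the short output differs slightly from the string above:

    def build_table(p, total=4096):
        # Give symbol 0 a p-fraction of the slots, symbol 1 a p-fraction
        # of what remains, and so on; for p = 0.5 this halves each time.
        table = []
        symbol = 0
        remaining = total
        while remaining > 1:
            count = max(1, round(remaining * p))
            table.extend([symbol] * count)
            remaining -= count
            symbol += 1
        table.extend([symbol] * remaining)
        return table

    # total=16 mirrors the shortened example: 8x0, 4x1, 2x2, ...
    print("".join(map(str, build_table(0.5, total=16))))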
The algorithm scores better than Huffman for higher p, i.e. when the distribution is dominated by a few symbols. That means (statistically) you get a lot of words in your dictionary that resemble each other, and that's where fractional-probability coding plays its role (see the quick check below).
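To make the fractional-probability point concrete: Huffman has to spend at least one whole bit per symbol, while an arithmetic/ANS coder can approach the entropy -log2(p). A quick back-of-the-envelope comparison (my own numbers, not the benchmark's):

    import math

    # Two-symbol alphabet with one dominant symbol of probability p.
    for p in (0.6, 0.8, 0.9, 0.99):
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        # Huffman cannot go below 1 bit/symbol for a binary alphabet.
        print(f"p={p}: entropy {entropy:.3f} bits/symbol vs Huffman 1.0")

The gap widens as p grows: at p = 0.99 a Huffman code spends roughly twelve times the entropy per symbol.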
To try to answer your question: I'd guess this technique shows its muscle on data sets where the dictionary is large and consists of similar sequences. Examples: a CSV file with a lot of numbers, GIS or KML data sets, Gerber files, ...