
Understanding Asymmetric Numeral Systems - adamnemecek
https://ro-che.info/articles/2017-08-20-understanding-ans
======
terrelln
zstd uses ANS [1][2] to encode its (literal length, match length, offset)
triples.

[1]
[https://github.com/Cyan4973/FiniteStateEntropy](https://github.com/Cyan4973/FiniteStateEntropy)

[2]
[https://github.com/facebook/zstd/blob/dev/doc/zstd_compressi...](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#sequences-section)

------
gumby
> The algorithm tries to assign each list a unique integer so that the more
> probable lists get smaller integers.

This sounds like basic Huffman encoding. What am I not understanding?

~~~
gopalv
> This sounds like basic Huffman encoding. What am I not understanding?

The smallest item in a Huffman encoded tree is a single bit.

You can end up with fractional bits with ANS.

And this puzzled me too, till I ran into a fractional bit-shift problem in
some of my work.
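To make the contrast concrete, here is a quick illustration (my own, not from the article or the comment): Huffman must assign at least one whole bit to every symbol, so for a heavily skewed three-symbol source it cannot get below 1 bit/symbol, even though the entropy is much lower. ANS-style coders can approach the entropy because they effectively spend fractional bits per symbol.

```python
import math

# Skewed three-symbol source (probabilities chosen for illustration).
probs = {'a': 0.9, 'b': 0.05, 'c': 0.05}

# Shannon entropy: the theoretical lower bound in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs.values())

# Best Huffman code for 3 symbols: lengths 1, 2, 2.
huffman_avg = 0.9 * 1 + 0.05 * 2 + 0.05 * 2

print(f"entropy     ~ {entropy:.3f} bits/symbol")      # ~0.569
print(f"Huffman avg = {huffman_avg:.3f} bits/symbol")  # 1.100
```

The gap between 0.569 and 1.1 bits/symbol is exactly what fractional-bit coding recovers.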

We encounter fractional bits very often when dealing with number systems which
don't encode very well into binary.

When encoding a (yes, no, maybe) value into a binary structure, it uses up 2
full bits, even though we know it doesn't cover the range efficiently - 2 bits
can represent 4 values, so one of the four possible codes is wasted.

Basically, to encode tristate data into booleans you need to be able to shift
the input by log2(3) bits - about 1.58 bits - so that the leftover 0.42 bits
carry over into the next value, and so on, until the fractions accumulate into
a whole bit of real storage.

Shifting by 1.58 bits sounds rather odd, but that's effectively what a CPU
does when you multiply by 3 instead of by 4 (a 2-bit shift is a multiply by 4).

So instead of storing only 8 (yes, no, maybe) values in 16 bits with plain
2-bit boolean packing, you can store 10 of them by using 3 as the multiplier
instead of 4. Because 3^10 = 59049, the maximum encoding still comes in well
under 65536.
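The base-3 packing described above can be sketched in a few lines (my own illustration, assuming tristates are represented as the ints 0, 1, 2):

```python
def pack_tristates(values):
    """Pack up to 10 tristate values (ints in {0, 1, 2}) into one 16-bit word."""
    assert len(values) <= 10 and all(v in (0, 1, 2) for v in values)
    word = 0
    for v in reversed(values):
        word = word * 3 + v   # a "shift" by log2(3) ~ 1.58 bits
    assert word < 2**16       # 3**10 = 59049 < 65536, so it always fits
    return word

def unpack_tristates(word, n=10):
    """Recover n tristate values from a packed word."""
    out = []
    for _ in range(n):
        word, v = divmod(word, 3)
        out.append(v)
    return out
```

Boolean packing at 2 bits each would cap out at 8 values per 16-bit word; the multiply-by-3 loop fits 10.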

Anyway, I'm currently juggling with a fractional bit problem of my own - but
with a more familiar numbering system, Decimal.

I need to represent 38 digits and a sign within two 64-bit registers; however,
splitting the 38 digits into two independent decimal halves wastes fractional
bits.

To represent all 19-digit positive decimal values in binary, you need
log2(10^19) = 63.1166338029 bits.

So if you store 38 digits as two independent 19-digit halves, you end up
consuming 128 bits, because each representation rounds up to 64 bits.

But if you store all 38 digits as one straight binary number, you only need
126.233267606 bits = 127 bits rounded up, which leaves one bit available for
a sign bit.

Decimal(38,0) doesn't encode well as two longs each holding 19 decimal digits,
but it encodes nearly perfectly if you straddle the 127-bit magnitude across
the two values and, with careful encoding, squeeze the sign bit into the
higher one.
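A minimal sketch of that layout (my own illustration, not the commenter's actual code): the 127-bit magnitude of a Decimal(38,0) straddles two 64-bit words, with the sign tucked into the top bit of the high word.

```python
MAX_MAGNITUDE = 10**38 - 1  # 38 nines; 2**126 < 10**38 < 2**127, so 127 bits suffice

def encode(value):
    """Split a signed 38-digit integer into two 64-bit words (hi, lo)."""
    assert -MAX_MAGNITUDE <= value <= MAX_MAGNITUDE
    sign = 1 if value < 0 else 0
    mag = abs(value)
    lo = mag & (2**64 - 1)            # low 64 bits of the magnitude
    hi = (mag >> 64) | (sign << 63)   # high 63 bits, sign in the top bit
    return hi, lo

def decode(hi, lo):
    """Recover the signed integer from the two words."""
    sign = hi >> 63
    mag = ((hi & (2**63 - 1)) << 64) | lo
    return -mag if sign else mag
```

The key fact is that `(10**38 - 1) >> 64` fits in 63 bits, so the sign bit never collides with the magnitude; two independent 19-digit halves would have no spare bit anywhere.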

Anyway, fractional bits are super interesting because the topic runs from
engineering, through computer science, all the way to pure information theory
in mathematics.

PS: Genetics data is probably the next place where I'd look at this - ATGCU,
for instance

------
rmrfrmrf
Anyone else just stare at this stuff hoping to understand it someday?

~~~
thatcherc
For me at least, the barrier to understanding this was all the Haskell. I'm
familiar with functional programming (Scheme), but I couldn't understand
enough of the code to really grasp what was going on after the first section.

