Space-Efficient Huffman Codes Revisited (arxiv.org)
54 points by belter 35 days ago | 13 comments



"Canonical Huffman code is an optimal prefix-free compression code whose codewords enumerated in the lexicographical order form a list of binary words in non-decreasing lengths. Gagie et al. (2015) gave a representation of this coding capable to encode or decode a symbol in constant worst case time. It uses σlgℓmax+o(σ)+O(ℓ2max) bits of space, where σ and ℓmax are the alphabet size and maximum codeword length, respectively. We refine their representation to reduce the space complexity to σlgℓmax(1+o(1)) bits while preserving the constant encode and decode times. Our algorithmic idea can be applied to any canonical code."

https://arxiv.org/pdf/2108.05495.pdf
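
To make the "canonical" part concrete: the codewords can be reconstructed from the codeword lengths alone, which is what makes these tables so compact. A minimal Python sketch of that assignment (my own illustration with a made-up toy alphabet, not the paper's data structure):

```python
# Canonical Huffman assignment: given per-symbol codeword lengths (e.g. from a
# Huffman tree), codewords are handed out in lexicographic order so that
# lengths are non-decreasing. Only the lengths need to be stored or sent.
# Hypothetical toy alphabet; the lengths must satisfy Kraft's inequality.
lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}

def canonical_codes(lengths):
    codes = {}
    code = 0
    prev_len = 0
    # Sort by (length, symbol): shorter codes first, ties broken alphabetically.
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (ln - prev_len)   # append zeros whenever the length grows
        codes[sym] = format(code, '0{}b'.format(ln))
        code += 1
        prev_len = ln
    return codes

print(canonical_codes(lengths))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

The point is that a decoder only needs each symbol's codeword length; the actual bit patterns follow deterministically.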


> the problem we study becomes interesting if σ = ω(1), which is the setting of this article.

What does “ω(1)” mean in this instance?


Typically ω-notation means a strict asymptotic lower bound. Lower bounds on problems are generally hard to find.

f(n) is in ω(g(n)) if for all c > 0 there exists N > 0 such that c|g(n)| <= |f(n)| for all n >= N
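
Unfolding that definition with g(n) ≡ 1 answers the question above; a brief restatement of my own, in standard notation:

```latex
% f \in \omega(g) \iff \forall c > 0 \;\exists N \;\forall n \ge N : c\,|g(n)| \le |f(n)|
% Taking g(n) \equiv 1:
\sigma = \omega(1)
  \;\iff\; \forall c > 0 \;\exists N \;\forall n \ge N : \sigma(n) \ge c
  \;\iff\; \sigma \to \infty
```

In other words, the article's setting is simply that the alphabet size is not a constant but grows without bound.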


Aren't all problems solvable with Huffman codes solvable more efficiently with arithmetic codes?

Patents used to be the reason for choosing Huffman, but now even that reason is gone.


My "Introduction to Data Compression (5th Edition) 2017" from Khalid Sayood has a section named

"Comparison of Huffman and Arithmetic Coding" and the conclusion is around the following lines:

"...This means that for most sources we can get rates closer to the entropy using arithmetic coding than by using Huffman coding. The exceptions are sources whose probabilities are powers of two. In these cases, the single-letter Huffman code achieves the entropy; and we cannot do any better with arithmetic coding, no matter how long a sequence we pick..."

"...for Huffman codes, we are guaranteed to obtain rates within 0.086 + pmax of the entropy, where pmax is the probability of the most probable letter in the alphabet. If the alphabet size is relatively large and the probabilities are not too skewed, the maximum probability pmax is generally small. In these cases, the advantage of arithmetic coding over Huffman coding is small; and it might not be worth the extra complexity to use arithmetic coding rather than Huffman coding. However, there are many sources, such as facsimile, in which the alphabet size is small; and the probabilities are highly unbalanced. In these cases, the use of arithmetic coding is generally worth the added complexity...."


So practically, yes :) Arithmetic coding is dope.


ANS can achieve both arithmetic-coding-like compression efficiency and Huffman-like coding speed.

Many modern compression algorithms use ANS for entropy coding.

https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
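
For a feel of how ANS works, here is a minimal range-ANS (rANS) sketch of my own, without the renormalization and precomputed tables (tANS/FSE) that production coders such as Zstandard use; the frequencies are made up for illustration. It shows the single state-update rule that encodes a symbol into an integer state, and its exact inverse for decoding.

```python
# Minimal range-ANS (rANS) sketch, without renormalization or tabling.
# Toy frequencies (assumed, for illustration); they must sum to M.
freqs = {'a': 3, 'b': 1}
M = sum(freqs.values())
# Cumulative frequency: the start of each symbol's slot range within [0, M).
cum, c = {}, 0
for s in sorted(freqs):
    cum[s], c = c, c + freqs[s]

def encode(symbols):
    x = 1                                   # initial state
    for s in reversed(symbols):             # encode in reverse; decode pops in order
        f = freqs[s]
        x = (x // f) * M + (x % f) + cum[s]
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        r = x % M                           # the slot identifies the symbol
        s = next(t for t in freqs if cum[t] <= r < cum[t] + freqs[t])
        x = freqs[s] * (x // M) + r - cum[s]
        out.append(s)
    return ''.join(out)

msg = 'aababa'
state = encode(msg)
assert decode(state, len(msg)) == msg       # round-trips exactly
```

Each encoded symbol grows the state by roughly log2(M/f_s) bits, which is why a well-fitted frequency table gives near-entropy compression; real coders renormalize the state into an output stream and table the update to get Huffman-like speed.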


Both Huffman coding and arithmetic coding require a separate encoding of the input distribution, and it might be easier to encode the (canonical) Huffman tree than the full probability distribution. This is also why many uses of arithmetic coding are adaptive.


That sounds highly unlikely. To build the Huffman tree you would have to first find the full probability distribution, so arithmetic coding will pretty much always be less work.

The reason most arithmetic coding is adaptive is because that is often more efficient, and it is slow and cumbersome to do with Huffman coding.


> The reason most arithmetic coding is adaptive is because that is often more efficient, [...]

Adaptive coding can (technically) be thought of as a trade-off for stationary distributions, since it takes time for the current estimate to catch up with the target distribution. Yes, it is more efficient for non-stationary distributions and even bearable for stationary distributions, but that doesn't make stationary adaptive coding more useful than stationary Huffman coding.

As a practical example for my original reply, Zstandard does implement stationary coding for both FSE (its tANS-based entropy coder) and Huffman coding. The structure of the FSE decoding table is not simple, but the minimum number of bits required per symbol is initially 5. By comparison, the direct representation of Huffman tables initially requires 4 bits per symbol, with an additional compression scheme available. (Both numbers subsequently decrease as the number of possibilities diminishes.)


Aside from several typos (I was half-asleep at that time), I should have said "that doesn't make stationary arithmetic coding as useful, relative to stationary Huffman coding, as adaptive arithmetic coding is relative to adaptive Huffman coding" instead. I really meant that stationary arithmetic coding remains useful, but its usefulness is limited.


Huffman codes can be seen as a special/constrained case of arithmetic codes, where each symbol has a single canonical encoding as a bit sequence. That might still be desirable in many applications.


"Many"? I struggle to think of a single one.



