96 points by nemoto on Oct 28, 2012 | 8 comments

While I believe the author's approach is the correct one, I am very curious to see a proof that an arithmetic encoding will always fall within the 1MB limit. A counting argument guarantees an average of roughly eight bits per number, but it doesn't follow that a particular encoding will always consume nearly that many bits, especially for the worst-case input. Prefix codes are useful, but not that efficient when the probability density function is uniform and there are a lot of potential numbers to encode. Golomb coding works well on an exponential distribution, not a flat one. Does an adaptive encoding save the day here?
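To make the Golomb-coding point concrete, here is a minimal Golomb-Rice encoder sketch (Rice coding is the power-of-two special case of Golomb coding; the parameter k = 6 below is just an illustrative choice, not from the original discussion). It shows why such a prefix code suits an exponential-looking distribution but degrades badly on occasional large deltas:

```python
def rice_encode(delta, k):
    """Encode a non-negative delta with a Rice code (Golomb code, M = 2**k).

    The quotient delta >> k is written in unary (q zeros then a terminating 1),
    followed by the low k bits of the remainder.
    """
    q = delta >> k
    r = delta & ((1 << k) - 1)
    return "0" * q + "1" + format(r, "0{}b".format(k))

# With k = 6 (M = 64), a delta near the mean of 100 costs 8 bits:
print(rice_encode(100, 6))       # "01100100"
# But the cost grows linearly in the delta, so a flat or heavy-tailed
# spacing with occasional huge gaps blows up the unary quotient:
print(len(rice_encode(10000, 6)))  # 163 bits for a single delta
```

The linear blowup on large quotients is exactly why a fixed Golomb parameter struggles when the delta distribution is not close to geometric.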

jbri on Oct 28, 2012
Is the probability density actually uniform? Once all million numbers have been processed (which is the only case that matters - anything with <1M numbers has a strictly smaller encoding than it would if you tacked a bunch of zero deltas onto the end to reach 1M), you're guaranteed that the 'average' delta value will be <= 100. This means that even in a worst-case input set, "low" delta values will be substantially more common than "high" ones.
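The mean-delta bound can be checked empirically; this is a quick simulation sketch using the problem's parameters (10^6 distinct values drawn from [0, 10^8)). Note the mean bound holds for every legal input, while the "low deltas dominate" observation below is only demonstrated here for random input:

```python
import random

# Draw 1M distinct values from [0, 10^8), as in the original problem.
random.seed(0)
values = sorted(random.sample(range(10**8), 10**6))

# Deltas between consecutive values (plus the first value itself).
# Their sum telescopes to the largest value, which is below 10^8,
# so the mean delta is forced below 100 for any legal input.
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
mean = sum(deltas) / len(deltas)
low_fraction = sum(1 for d in deltas if d < 100) / len(deltas)
print(mean, low_fraction)  # mean is below 100; most deltas are "low"
```

For uniformly random input the deltas come out approximately geometric, so roughly 63% of them fall below the mean of 100.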
I think in the worst case it can be uniform: you can have an average delta value of 100 where the values range uniformly from 0 to 200. That case is fine by itself, since you can still encode values up to 200 in fewer than eight bits, but I'm worried that some worst-case number spacing will waste bits in any encoding scheme chosen, and that wastage will be enough to ruin the result. You have to pick an encoding scheme in advance, but you don't know the distribution. Say you encode using 8-bit words. When the high bit is not set, the word holds a value from 0-127. When the high bit is set, what follows is the full 27 bits. This will fail for certain distributions. For example, say a quarter of the numbers are spaced 257 apart and the rest are spaced 47 apart. You'll go over budget. Perhaps an adaptive scheme can be proven to work in all cases, but I don't know what that scheme is. Or maybe there's a simple scheme which can be proven for all distributions.
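The over-budget claim is easy to verify with arithmetic, assuming the escape case costs a full 8-bit word plus a 27-bit literal (2^27 > 10^8, so 27 bits cover any delta):

```python
# Byte-oriented scheme from the comment: deltas 0-127 cost one 8-bit word;
# anything larger costs an escape word plus a full 27-bit literal.
SMALL_BITS = 8
LARGE_BITS = 8 + 27

n = 10**6
large = n // 4         # a quarter of the deltas are 257 (need the escape)
small = n - large      # the rest are 47 (fit in one word)

# The mean delta is 0.25*257 + 0.75*47 = 99.5 <= 100, so the input is legal,
# yet the encoding still busts the budget:
total_bits = small * SMALL_BITS + large * LARGE_BITS
budget_bits = 8 * 2**20  # 1MB = 2^20 bytes
print(total_bits, budget_bits, total_bits > budget_bits)
```

The total comes to 14,750,000 bits against a budget of 8,388,608, so this fixed scheme overflows by well over 75%.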
jbri on Oct 28, 2012
Arithmetic coding avoids the "wastage" of entropy you're talking about entirely - there are no "waste bits" in the middle of the stream like there would be with a fixed-size encoding.
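One way to see this: an idealized arithmetic coder spends about -log2 P(d) bits per delta, with no rounding up to whole bits. Under a hypothetical geometric model with mean 100 (P(d) = (1/101)(100/101)^d - an assumed model, not one specified in the thread), even the adversarial 47/257 mix from the parent comment stays within the 2^20-byte budget:

```python
import math

def cost_bits(d, p_stop=1 / 101):
    # Idealized arithmetic-coding cost of delta d under a geometric
    # model with mean 100: P(d) = p_stop * (1 - p_stop)**d.
    return -math.log2(p_stop * (1 - p_stop) ** d)

n = 10**6
# Same adversarial input that broke the byte-oriented scheme:
total_bits = (n * 3 // 4) * cost_bits(47) + (n // 4) * cost_bits(257)
budget_bits = 8 * 2**20  # 1MB = 2^20 bytes
print(round(total_bits), budget_bits, total_bits < budget_bits)
```

This comes to roughly 8.09 million bits against a budget of about 8.39 million - tight, but under. A real coder also has to carry its model (or adapt it on the fly) and pays a few bits of termination overhead, so this sketch is a bound, not a proof.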
 [deleted]
 A sequence of 1M arbitrary numbers may take 3.32x10^6 bytes, but a sorted sequence of 1M arbitrary numbers is a different matter.
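Both figures follow from counting: an arbitrary sequence needs 10^6 * log2(10^8) bits, while a sorted sequence is just a multiset of 10^6 values from a range of 10^8, of which there are C(10^8 + 10^6 - 1, 10^6). A sketch of the calculation, using log-gamma to evaluate the huge binomial coefficient:

```python
import math

def log2_comb(n, k):
    # log2 of the binomial coefficient C(n, k), via log-gamma.
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

m, r = 10**6, 10**8

# Arbitrary sequence: ~26.6 bits per number -> ~3.32e6 bytes, as stated above.
raw_bytes = m * math.log2(r) / 8

# Sorted sequence = multiset of size m from r values: C(r + m - 1, m) choices.
sorted_bytes = log2_comb(r + m - 1, m) / 8
print(round(raw_bytes), round(sorted_bytes), 2**20)
```

The sorted-sequence bound lands near 1.01 million bytes - under the 1MB (2^20-byte) limit, but with only a few percent of headroom, which is why the worst-case behaviour of the encoder matters so much.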
 Point well taken. You are completely right.
 "The laws of mathematics." That's good! I've updated the previous post to use this wording, hope you don't mind :) http://preshing.com/20121025/heres-some-working-code-to-sort...