
1MB Sorting Explained - nemoto
http://preshing.com/20121026/1mb-sorting-explained
======
codex
While I believe the author's approach is the correct one, I am very curious to
see a proof that an arithmetic encoding will always fall within the 1MB limit.
The pigeonhole principle guarantees an average of slightly less than seven
bits per number, but it doesn't follow that a particular encoding will always
consume nearly the same number of bits, especially for the worst-case input.
Prefix codes are useful, but not that efficient when the probability density
function is uniform and there are a lot of potential numbers to encode. Golomb
coding works well on an exponential distribution, not a flat one. Does an
adaptive encoding save the day here?
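
One rough way to put numbers on that worry (my own back-of-the-envelope,
assuming a Rice/Golomb code with parameter k over the deltas, for the usual
setup of 10^6 numbers below 10^8): an adversary who makes every delta a
multiple of 2^k and spends the whole range can force roughly
N*(k+1) + 10^8/2^k bits in total, which overshoots the 1MB budget for every
choice of k.

    # Worst-case total size of a fixed Rice/Golomb code over the deltas,
    # for the standard setup: 10**6 numbers drawn from [0, 10**8).
    # My own estimate, not a figure from the article.
    N = 1_000_000
    MAX_DELTA_SUM = 100_000_000          # deltas telescope to max - min < 10**8
    BUDGET_BITS = 1024 * 1024 * 8        # 8,388,608 bits in 1MB

    for k in range(4, 9):
        # Rice code with parameter k: unary quotient (floor(d / 2**k) + 1 bits)
        # followed by a k-bit remainder.  An adversary whose deltas are all
        # multiples of 2**k pushes the quotient bits to MAX_DELTA_SUM // 2**k.
        worst_case = N * (k + 1) + MAX_DELTA_SUM // 2**k
        print(k, worst_case, worst_case > BUDGET_BITS)
    # Every parameter lands above 8,388,608 bits (the best, k=6, at ~8.56M),
    # so a fixed Rice code can't be proven to fit, even before bookkeeping.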

~~~
jbri
Is the probability density actually uniform? Once all million numbers have
been processed (which is the only case that matters: anything with <1M
numbers has a strictly smaller encoding than it would if you tacked a bunch
of zero deltas onto the end to reach 1M), you're guaranteed that the
'average' delta value will be <= 100.

This means that even in a worst-case input set, "low" delta values will be
substantially more common than "high" ones.
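
A quick way to see the <= 100 bound: the deltas of a sorted list telescope to
max minus min, which is below 10^8 here, so a million of them can't average
more than about 100 however they're spread out. Toy sanity check (the random
input is just for illustration):

    import random

    # Any sorted list of a million values below 10**8 works here; this
    # random one is just an example input, not data from the article.
    nums = sorted(random.randrange(100_000_000) for _ in range(1_000_000))
    deltas = [b - a for a, b in zip(nums, nums[1:])]

    # The deltas telescope: their sum is exactly max - min, so it can
    # never reach 10**8, and the average stays at roughly 100 or below.
    assert sum(deltas) == nums[-1] - nums[0] < 100_000_000
    print(sum(deltas) / len(deltas))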

~~~
codex
I think in the worst case it can be uniform: you can have an average delta
value of 100 where the values range uniformly from 0 to 200. I think that
case is fine, as you can still encode values up to 200 in less than eight
bits on average, but I'm worried that some worst-case number spacing will
waste bits in any encoding scheme chosen, and that wastage will be enough to
ruin the result. You have to pick an encoding scheme in advance, but you
don't know the distribution. Say you encode using 8-bit words. When the high
bit is not set, the remaining seven bits hold a delta from 0-127. When the
high bit is set, the next 27 bits hold the full delta (27 bits are enough,
since 10^8 < 2^27). This will fail for certain distributions. For example,
say a quarter of the numbers are spaced 257 apart and the rest are spaced 47
apart. You'll go over budget. Perhaps an adaptive scheme can be proven to
work in all cases, but I don't know what that scheme is. Or maybe there's a
simple scheme which can be proven to work for all distributions.
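
Putting numbers on that example (my own arithmetic, reading the escape as a
full 8-bit marker word followed by 27 raw bits; the conclusion holds either
way):

    N = 1_000_000
    BUDGET_BITS = 1024 * 1024 * 8                 # 8,388,608 bits

    big, small = N // 4, 3 * N // 4               # 250k deltas of 257, 750k of 47
    assert big * 257 + small * 47 < 100_000_000   # still a legal spacing overall

    # 47 fits in a plain 8-bit word; 257 needs the escape (8 + 27 = 35 bits
    # under my reading of the scheme; the exact escape layout isn't spelled
    # out above).
    total_bits = big * (8 + 27) + small * 8
    print(total_bits, BUDGET_BITS)                # 14,750,000 vs. 8,388,608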

~~~
jbri
Arithmetic coding avoids the "wastage" you're talking about entirely: it
doesn't round each symbol up to a whole number of bits, so there are no
"waste bits" in the middle of the stream like there would be with a
fixed-size encoding.
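
And the total can actually be bounded in advance. Here's a sketch of the
argument with my own numbers, using a fixed geometric model with r = 100/101
(the article may tune its model differently): each delta d costs about
-log2 q(d) bits, so the total depends only on how many deltas there are and
on their sum, which is capped at 10^8, not on how the individual deltas are
spread.

    from math import log2

    # Worst-case output size for an arithmetic coder driven by a *fixed*
    # geometric model q(d) = (1 - r) * r**d.  Each delta d costs about
    # -log2 q(d) bits, so the total is
    #     N * -log2(1 - r)  +  (sum of deltas) * -log2(r)
    # which depends only on N and the delta sum (at most 10**8), not on
    # how the deltas are spread.  r = 100/101 is my choice of model.
    N = 1_000_000
    MAX_DELTA_SUM = 100_000_000
    BUDGET_BITS = 1024 * 1024 * 8

    r = 100 / 101
    worst_case_bits = N * -log2(1 - r) + MAX_DELTA_SUM * -log2(r)
    print(f"{worst_case_bits:,.0f} of {BUDGET_BITS:,} bits")   # ~8.09M of 8.39M

That's roughly 8.09 million bits against an 8,388,608-bit budget, whatever
the input looks like, leaving a few hundred thousand bits of slack for the
coder's rounding/termination overhead and the program's other bookkeeping.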

