Hacker News new | comments | ask | show | jobs | submit login
1MB Sorting Explained (preshing.com)
96 points by nemoto on Oct 28, 2012 | hide | past | web | favorite | 8 comments

While I believe the author's approach is the correct one, I am very curious to see a proof that an arithmetic encoding will always fall within the 1MB limit. The pigeonhole principle guarantees an average of slightly less than seven bits per number, but it doesn't follow that a particular encoding will always consume nearly the same number of bits--especially for the worst case input. Prefix codes are useful, but not that efficient when the probability density function is uniform and there are a lot of potential numbers to encode. Golomb coding works well on an exponential distribution, not a flat one. Does an adaptive encoding save the day here?

<http://en.wikipedia.org/wiki/Combination#Number_of_combinati...; tells you there are N=10^8+10^6-1 over 10^6 different multisets of 10^6 8-digit numbers (including numbers starting with zero digit(s))

A quick and dirty Python program computing the logarithm in base 2 of N in Python gives me 8093729.48175.

So, one can encode a number in the range [0, N) in 8093730 bits, or 1011717 bytes (possibly a few more due to rounding, but that is far enough from 1MB=1048756 to not worry about that)

Arithmetic coding replaces that [0,N) integer by a floating point number in [0,1) by dividing by 2^(8MB), but that does not make a difference.

So, there is plenty of room to store one such number. That leaves the problem of replacing one such number for K eight-digit numbers by one for K+1 eight-digit numbers. I'll leave that as an exercise :-) because you did not ask for it and because I cannot think of an easy proof.

Is the probability density actually uniform? Once all million numbers have been processed (which is the only case that matters - anything with <1M numbers has a strictly smaller encoding than it would if you tacked a bunch of zero deltas on to the end to reach 1M) , you're guaranteed that the 'average' delta value will be <= 100.

This means that even in a worst-case input set, "low" delta values will be substantially more common than "high" ones.

I think in the worst case it can be uniform: you can have an average delta value of 100 where the values range uniformly from 0 to 200. I think that case is fine, as you can still encode 200 in less than eight bits, but I'm worried that some worst-case number spacing will waste bits in any encoding scheme chosen, and that wastage will be enough to ruin the result. You have to pick an encoding scheme in advance but you don't know the distribution. Say you encode using 8-bit words. When the high bit is not set, the word ranges from 0-127. When the high bit is set, what follows is the full 27 bits. This will fail for certain distributions. For example, say a quarter of the numbers are spaced 257 apart and the rest are spaced 47 apart. You'll go over budget. Perhaps an adaptive scheme can be proven to work in all cases, but I don't know what that scheme is. Or maybe there's a simple scheme which can be proven for all distributions.

Arithmetic coding avoids the "wastage" of entropy you're talking about entirely - there are no "waste bits" in the middle of the stream like there would be with a fixed-size encoding.


A sequence of 1M arbitrary numbers may take 3.32x10^6 bytes, but a sorted sequence of 1M arbitrary numbers is a different matter.

Point well taken. You are completely right.

"The laws of mathematics." That's good! I've updated the previous post to use this wording, hope you don't mind :) http://preshing.com/20121025/heres-some-working-code-to-sort...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact