

Sorting a million 32-bit integers in 2MB of RAM using Python - astrec
http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html

======
chmike
Yes, the question is bogus, since 1 million 32-bit integers require 4MB of storage.

Here is my solution. Select the 500,000 smallest values, sort them using an
in-place sorting algorithm and spit them out, then select the 500,000 biggest
values, sort them and spit them out.

The problem then boils down to selecting the n/2 smallest values of the list.
To do this I store the first n/2 values in the 2MB buffer. I then organize
them into a heap with the biggest value at the top. I then process the
remaining n/2 values. When one of these values is bigger than the heap's top
value, it is skipped. If it is smaller, the heap's biggest value is replaced
with the smaller value and the heap is restored so the biggest value is back
at the top. When all the remaining n/2 values have been processed, the 2MB
buffer contains the n/2 smallest values of the million-integer list.
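This selection step maps directly onto Python's standard heapq module. heapq
only provides a min-heap, so a sketch of the idea (function name mine, not
chmike's) keeps negated values to simulate the max-heap described above:

```python
import heapq
import random

def k_smallest(values, k):
    """Keep the k smallest items seen so far in a size-k max-heap.

    heapq is a min-heap, so values are negated to put the largest
    of the kept items at the top.
    """
    heap = [-v for v in values[:k]]
    heapq.heapify(heap)
    for v in values[k:]:
        # If v is smaller than the largest kept value, replace the
        # top and restore the heap property in one O(log k) step.
        if v < -heap[0]:
            heapq.heapreplace(heap, -v)
    return [-x for x in heap]

data = [random.getrandbits(32) for _ in range(10_000)]
smallest_half = k_smallest(data, 5_000)  # the 5,000 smallest values
```

In the real 2MB setting the buffer would be an `array('I', ...)` of unsigned
32-bit ints rather than a Python list, but the heap logic is the same.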

Since the buffer is already organized as a heap, one can use heapsort to sort
it. Swap the biggest value at the top with the last value of the list and
restore the heap to bring the next biggest value to the top. Swap it with the
last unsorted value and proceed until the heap length is reduced to one
element.
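The extraction phase described above is plain heapsort. A pure-Python sketch
of it (illustrative only; again a real solution would operate on an `array`
buffer):

```python
import random

def sift_down(a, start, end):
    """Restore the max-heap property for the subtree rooted at a[start],
    considering only a[start:end]."""
    root = start
    while 2 * root + 1 < end:
        child = 2 * root + 1
        # Pick the bigger of the two children.
        if child + 1 < end and a[child + 1] > a[child]:
            child += 1
        if a[root] >= a[child]:
            return
        a[root], a[child] = a[child], a[root]
        root = child

def heapsort_in_place(a):
    # Build the max-heap: biggest value ends up at index 0.
    for start in range(len(a) // 2 - 1, -1, -1):
        sift_down(a, start, len(a))
    # Repeatedly swap the top with the last unsorted slot and shrink the heap.
    for end in range(len(a) - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)

a = [random.getrandbits(32) for _ in range(1_000)]
heapsort_in_place(a)  # a is now sorted ascending, using no extra buffer
```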

Apply the same algorithm to the n/2 biggest values, but using a heap with the
smallest value at the top.

This algorithm requires two passes over the million integers, and uses
nothing more than the 2MB buffer.

The complexity is O(n log n).

~~~
eru
I am in the process of writing a program to sort one million 32-bit integers
in 2MB of RAM.

The trick is to notice that sorting the numbers loses information. One
million sorted numbers take (in theory) log_2(1000000!) fewer bits to store
than the same numbers shuffled randomly. That's around 18488884.82 bits.

So you get an ideal storage requirement of 32000000 - 18488884 = 13511116
bits. That's about 1.6MB (thus less than 2MB). If you're clever you'll stay
close enough to the theoretical limit in your implementation.
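This arithmetic can be checked with the standard library: math.lgamma gives
ln(n!) without computing the factorial itself.

```python
import math

# log2(1000000!) via the log-gamma function: lgamma(n + 1) == ln(n!)
bits_saved = math.lgamma(1_000_001) / math.log(2)
total_bits = 32 * 1_000_000
ideal_bits = total_bits - bits_saved

print(round(bits_saved, 2))      # ~18488884.82, the figure quoted above
print(ideal_bits / (8 * 2**20))  # ~1.61 MB, comfortably under the 2MB budget
```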

Strangely, bz2 seems to be able to compress my list of random numbers (in a
binary file) to 164KB. That's even less than what my theory says is possible.

At first I suspected my pseudo-random number generator (/dev/urandom on a
recent Ubuntu) might produce hidden patterns. But I have since tested it on
true random numbers from random.org. Same result.

I'll investigate.

~~~
slamb
I believe your ideal storage requirement, but what's the time complexity of
your algorithm, and could you describe it in more detail than "clever"?

I'm pretty sure random reads and writes on any compressed form will be worse
than O(1), and I'm pretty sure your sort will do O(n log n) of those, so I
believe your algorithm is asymptotically (and in practice for n = 1 million)
slower than chmike's two-pass algorithm.

~~~
FutureNerd
I was thinking along eru's lines. The specifics I came up with:

1) Compress sorted lists by encoding the differences between successive
numbers using a code that's shorter for smaller numbers.

2) Use in-memory merge sort. That way you don't need random access to the
lists, and it's overall O(n log n).

3) To merge large lists when memory is already almost full, divide the lists
into blocks, and allocate blocks for the merged list as you deallocate blocks
from the two source lists.

4) To do this in Python, use the "array" module.
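Point 1 can be sketched as a gap-plus-varint coder for a sorted list. This is
only one possible code fitting the description (FutureNerd doesn't specify an
exact one): 7 data bits per byte, with the high bit marking continuation, so
small gaps cost fewer bytes than a raw 32-bit word.

```python
import random

def encode_sorted(values):
    """Delta-encode a sorted list of non-negative ints, then
    varint-encode each gap (7 bits per byte, high bit = continuation)."""
    out = bytearray()
    prev = 0
    for v in values:
        gap = v - prev  # non-negative because the input is sorted
        prev = v
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)         # final byte of this gap
                break
    return bytes(out)

def decode_sorted(data):
    values, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur                  # undo the delta encoding
            values.append(prev)
            cur, shift = 0, 0
    return values

nums = sorted(random.getrandbits(32) for _ in range(100_000))
enc = encode_sorted(nums)
# Average gap is ~2**32 / 100000, so most gaps fit in 3 bytes instead of 4.
assert decode_sorted(enc) == nums
```

Merging two such lists (points 2 and 3) only ever reads each stream front to
back, which is exactly why sequential-access compression works here.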

