So imagine you have a bit field where 'one bits' indicate you have seen the number and 'zero bits' indicate you haven't. And you compress it with run-length encoding.
Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers), and then let's say you see the number 3,866,344, so your data structure becomes '3866343:0,1:1,96133655:0'. As you can see, the counts always alternate between runs of zero bits and runs of one bits, so you can just assume the odd positions represent zero bits and the even positions one bits. This becomes (3866343,1,96133655).
Anyway, as you get numbers you will split and coalesce the bit space until you've either seen all numbers (0, 99999999) or you run out of memory because a hacker has sent you only the even numbers from the space.
It's a clever 'math trick' which probes how you think about numbers (are they a thing or a representation), but I never thought it really gave a good indication of whether or not you would be a good candidate for the company.
With all the discussion I decided to code up a quick version in Perl. The code is pretty straightforward: start with the tuple (99999999,0) and on each number take one of three actions:
1) Break the number representing a string of zeros into two segments.
2) Extend the length of an existing region of one bits.
3) Collapse two regions of one bits separated by a single zero bit.
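The three actions above can be sketched on the alternating run-length list like so. This is a hypothetical Python version, not the original Perl; `set_bit` and the list layout (runs[0] counts leading zero bits, runs[1] the ones that follow, and so on) are my own naming.

```python
def set_bit(runs, k):
    """Return a new run list with bit k set to 1."""
    pos = 0
    for i, length in enumerate(runs):
        if k < pos + length:
            break
        pos += length
    if i % 2 == 1:                  # k already falls in a run of one bits
        return runs
    left = k - pos                  # zeros remaining before the new one bit
    right = runs[i] - left - 1      # zeros remaining after it
    split = runs[:i] + [left, 1, right] + runs[i + 1:]  # action 1: break a zero run
    # Collapse zero-length interior runs: this is what extends a
    # neighbouring one-run (action 2) or coalesces two of them (action 3).
    out = [split[0]]
    merge_next = False
    for r in split[1:]:
        if merge_next:
            out[-1] += r
            merge_next = False
        elif r == 0:
            merge_next = True       # empty run: its neighbours share a bit value
        else:
            out.append(r)
    return out

runs = [99999999]                   # 99,999,999 zero bits, nothing seen yet
print(set_bit(runs, 3866343))       # [3866343, 1, 96133655]
```

Setting two bits with a single zero between them and then setting that middle bit exercises the coalesce path: the three one-runs merge into one.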
Now the two key reasons it doesn't work are: one, it takes two 32-bit numbers to identify non-adjacent runs; and two, for a uniformly distributed random number generator, the ratio of a million numbers to a space of 100 million means that most of the numbers won't be adjacent. So each 1 bit costs 8 bytes to store, and a million numbers * 8 bytes is 8 million bytes (7.6MB if you use 2^20 as 1MB). It's fun watching the algorithm run, because it gets slower and slower (it's doing a linear search of segments to insert 'bits') and memory goes up and up, until you have about 50M uniques, and then it starts getting faster and faster again as it collapses more and more segments.
Storing it on disk as an uncompressed bitmap runs in O(n) time, and of course costs 12,500,000 bytes, just about 12MB (to count duplicates you need to multiply that by the number of bits you want to reserve per number), but doing it in memory only requires a better compression algorithm than simple run-length encoding.
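The bitmap approach can be sketched in a few lines; this is a minimal illustration assuming the 10^8 value space (`bitmap_sort` is a made-up name, and a small `space` is used in the demo so it runs quickly):

```python
def bitmap_sort(numbers, space=100_000_000):
    """One bit per possible value: space/8 bytes, one O(n) marking pass."""
    bitmap = bytearray(space // 8)        # 12.5 MB of zero bits at full size
    for n in numbers:
        bitmap[n >> 3] |= 1 << (n & 7)    # set bit n
    for i in range(space):                # emit the set bits in order
        if bitmap[i >> 3] & (1 << (i & 7)):
            yield i

print(list(bitmap_sort([5, 3, 3, 1], space=8)))   # [1, 3, 5]
```

Note that duplicates collapse to a single bit, which is exactly the limitation the thread discusses: counting them needs more than one bit per value.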
You rock. I wish more people at HN/SO/Google were like you.
Edit: the exact amount of bits necessary is ceil(log_2((1e10+1e6-1) choose 1e6)), which is ~1.756 megabytes.
Edit edit: see udiv's comment below: that 1e10 should be 1e8 of course.
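The corrected bound is easy to check numerically. A sketch using Stirling's approximation via `math.lgamma` (so the huge binomial never has to be built as an exact integer):

```python
import math

def log2_choose(n, k):
    """log2 of C(n, k) via log-gamma, accurate enough for this estimate."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

# Information content of a multiset of 10^6 values drawn from a space of 10^8:
bits = log2_choose(10**8 + 10**6 - 1, 10**6)
print(bits / 8 / 2**20)   # about 0.96 MiB
```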
The reason this was an interview question was that it let the interviewer see whether the candidate could see 'past' the requirement of sorting (which pre-loads your mental algorithm cache with all sorts of algorithms) to the somewhat more subtle concept of embedding partial computation in the data structure.
Early in my career at Sun I was tasked with putting mandatory file and record locking into UFS in the kernel; the challenge had always been: how do you avoid deadlock? One-level deadlock was fairly easy: if process A is holding a lock and waiting on process B, which is holding a lock elsewhere, and you are in the context of process B trying to get that lock, then you can see that you would deadlock A. But 'n'-level deadlock starts getting harder because the possibilities multiply. I discovered that you could create a doubly linked tree structure which had modest memory growth per process, but any process could walk it one way and look for 'itself' on the list to discover whether a deadlock was possible. Here the 'code' was: walk a list, and if you find you are already on it, you would deadlock if you took the lock.
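A toy version of that check (hypothetical Python, obviously nothing like the UFS kernel code): each blocked process records which process it waits on, and before blocking you walk that chain looking for yourself.

```python
def would_deadlock(waits_on, me, lock_holder):
    """waits_on maps a process to the process it is blocked on (or absent).

    Walk the wait chain starting from the holder of the lock we want;
    if we reach ourselves, taking the lock would close a cycle.
    """
    p = lock_holder
    while p is not None:
        if p == me:
            return True
        p = waits_on.get(p)
    return False

waits_on = {"A": "B", "B": "C"}                # A waits on B, B waits on C
print(would_deadlock(waits_on, "C", "A"))      # True: C would close the cycle
print(would_deadlock(waits_on, "D", "A"))      # False: D is not in the chain
```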
The number 'sorting' problem is very much like that: numbers may have different values, but those values are intrinsic, unlike strings, which consist of an arbitrary sequence of tokens. So the goal of the question was to identify people who could see the difference between 'sorting numbers' as a problem and 'general sorting.'
I think the parent's solution works. The typical separation between numbers is about 2^7: that works out to 8 bits per number -- 7 for the length of the '00...0', plus one length-one '1'. 8 bits * 10^6 is 10^6 bytes, 1 MB exactly. You need a variable-length encoding, so you can fit 2^7 in 7 bits while still allowing larger numbers to be encoded.
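A minimal sketch of such a variable-length encoding, using the common LEB128-style varint (7 payload bits per byte plus a continuation bit) rather than anything the parent specified: typical gaps near 100 cost exactly one byte, while larger gaps remain encodable.

```python
def encode_gap(gap):
    """Encode a non-negative gap as a varint: 7 bits per byte, high bit = more."""
    out = bytearray()
    while True:
        byte = gap & 0x7F
        gap >>= 7
        if gap:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_gap(100)))      # 1 byte for a typical gap
print(len(encode_gap(20000)))    # 3 bytes for a rare large one
```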
1) It splits a region (it puts a one bit into a string of zeros) (adds 8 bytes)
2) It coalesces two regions (it removes the last remaining zero between them) (subtracts 8 bytes)
I'm sure there are other solutions, and as was pointed out it doesn't really deal with duplicates. I suspect a multi-value bit field might cover that.
I simulated it with a Python program and you get about 10^4 coalesces. That doesn't even begin to make a dent in 10^6, and storing this will still take ~7 megabytes.
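A sketch of such a simulation, assuming 10^6 distinct uniform draws from the 10^8 space: a coalesce opportunity corresponds to a pair of adjacent values, and the expected count is roughly (10^6)^2 / 10^8 = 10^4.

```python
import random

random.seed(1)   # fixed seed so the run is repeatable
seen = sorted(random.sample(range(100_000_000), 1_000_000))
# Count adjacent pairs, i.e. the single-zero gaps the structure can collapse.
coalesces = sum(1 for a, b in zip(seen, seen[1:]) if b - a == 1)
print(coalesces)   # on the order of 10^4
```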
and you'll end up with 8 bytes x 10^6 = 7.6 megabytes.
Not if you encode cleverly. The successive differences will be on the scale of 10^8/10^6 = 100, which is a very small number. It takes 7 bits to store, or at least 8 bits in a variable-width encoding.
An argument that you need context info: if you don't use context info, each encoded value needs to be rounded up to a whole number of bits. That wastes about N/2 bits in total. According to my calculations, that would raise the bound from 0.96MiB to 1.02MiB, i.e. too much.
But really, it's even worse. Let's assume you can indeed define such a compression and you manage to use the context sufficiently to avoid wasting too many bits. Further, you manage to do so without using much temporary space at all.
That means you'll need to be updating your sorted list in place. Each update is likely to be a non-trivial task, since you need to find the place to update (with variable-width encoding and no space for luxuries like an index, that's probably a linear scan), and you need to take into account some amount of context - and then, because you've changed some data, recode possibly unbounded amounts (so N) of following context. That makes this sorting algorithm quadratic; and it'll have a high scaling factor too since the operations we're counting here aren't cheap: we're not moving bits around, we're decoding and recoding them all the time in a context-sensitive way and that means doing lots and lots of branches.
Assuming the 1MB limit is reasonably inspired, because the chip needs to be tiny, you're probably not dealing with the fastest chip in the world. As an estimate, I'd guess this algorithm (if we could make it work) would be about 1,000,000 times slower than a normal sort (assuming a 13-times-higher constant factor), and you'd be running it on something very, very slow. On a modern processor a plain sort takes about 75ms, making this thing 75 ks, or approximately 1 day. On that very limited 1MB proc...
So even if you manage to pull it off (very hard), it's almost certainly going to be very slow.
So that solution does not cover all possible cases. Lovely of Google to ask that question.
Edit: not only that, the running time of Google's "solution" would be terrible. Aren't there other possible data sets it doesn't cover, or am I missing something?
From my experience interviewing for Google, I beg to differ. Many other accounts seem to confirm this.
Or it might have just been some silly way to eliminate candidates for arbitrary reasons. The usual.
Anyway, sorting is a big deal, I think. If you can do it even a little faster than everyone else, that's a competitive advantage. Just my personal opinion.
Big problem #1: insertions for 1M integers would take ages.
Big problem #2: And some distributions can't be covered this way. For example, 1M integers with distances of 99 (i.e. +99 each). Now think of the same but with a random distance in the range 0..99. (Note 99999999/1000000 = 99.99..., so it's a possible input.)
Google's approach is nonsense, too.