
This was a Google interview question for a while (I know because I got it as one); I think it has since made it onto the banned-question list. The 'trick' was, as answer #2 alluded to, to use bits to store the fact that you had observed a number, and then you could dump them back out in sorted order.

So imagine you have a bit field where 'one' bits indicate numbers you have seen and 'zero' bits indicate numbers you haven't, and you compress it with run-length encoding.

Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers). Then let's say you see the number 3,866,344; your data structure becomes '3866343:0,1:1,96133655:0'. As you can see, the entries always alternate between runs of zero bits and runs of one bits, so you can drop the labels and let position tell you the bit value: odd-numbered positions hold counts of zero bits and even-numbered positions hold counts of one bits. This becomes (3866343,1,96133655).
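To make the representation concrete, a tiny Python sketch (assuming a 1-based bit field of 99,999,999 positions):

    runs = [99999999]              # all zeros: no numbers seen yet
    # after observing 3,866,344:
    runs = [3866343, 1, 96133655]  # zeros, a single '1' bit, zeros
    assert sum(runs) == 99999999   # the runs always tile the whole field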

Anyway, as you get numbers you will split and coalesce the bit space until you've either seen all numbers (0, 99999999) or you run out of memory because a hacker has sent you only the even numbers from the space.

It's a clever 'math trick' which explores how you think about numbers (are they a thing or a representation?). But I never thought it really gave a good indication of whether or not you would be a good candidate for the company.

So as it turns out, this is wrong. Several folks have pointed out that the original Google interview question was 'sort the numbers' / 'remove duplicates', and that the solution is to create a bitmap in a disk file to do a single-pass sort. This question mixes that up a bit, and my particular bitmap solution fails to meet the criterion of staying under a MB of memory.

With all the discussion I decided to code up a quick version in Perl. The code is pretty straightforward: start with the tuple (99999999,0) and on each number take one of three actions (there's a sketch of them after this list):

1) Break the number representing a string of zeros into two segments.

2) Extend the length of an existing region of one bits.

3) Collapse two regions of one bits separated by a single zero bit.
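Here's a rough Python sketch of those three actions (the original was Perl; this reconstruction and its names are mine). The split is explicit; extend and coalesce fall out of cleaning up zero-length runs afterwards:

    def insert(runs, n):
        """Mark bit n (0-based) as seen. runs alternates:
        even index = run of zeros, odd index = run of ones."""
        i, pos = 0, 0
        while pos + runs[i] <= n:          # find the run containing bit n
            pos += runs[i]
            i += 1
        if i % 2 == 1:                     # already a run of ones: duplicate
            return runs
        left = n - pos                     # action 1: split the zero-run
        right = runs[i] - left - 1
        runs = runs[:i] + [left, 1, right] + runs[i+1:]
        # Actions 2 and 3: drop zero-length runs and merge the
        # now-adjacent runs of one bits they exposed.
        pairs = [[j % 2, r] for j, r in enumerate(runs) if r > 0]
        merged = []
        for bit, r in pairs:
            if merged and merged[-1][0] == bit:
                merged[-1][1] += r
            else:
                merged.append([bit, r])
        if merged[0][0] == 1:              # keep the leading-zeros convention
            merged.insert(0, [0, 0])
        return [r for _, r in merged]

    runs = [99999999]                      # nothing seen yet
    for n in (3866343, 3866345, 3866344):  # 0-based bit indices
        runs = insert(runs, n)
    print(runs)                            # [3866343, 3, 96133653]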

Now the two key reasons it doesn't work are that, one, it takes two 32-bit numbers to identify non-adjacent numbers, and two, for a uniformly distributed random number generator the ratio of a million numbers to a space of 100 million means that most of the numbers won't be adjacent. So each '1' bit will cost 8 bytes to store. A million numbers * 8 bytes is 8 million bytes (7.6MB if you use 2^20 as 1MB). It's fun watching the algorithm run, because it gets slower and slower (it's doing a linear search of segments to insert 'bits') and memory goes up and up, until you have about 50M uniques, and then it starts getting faster and faster again as it collapses more and more segments.

Storing it on disk with an uncompressed bitmap runs in O(n) time and of course takes 12,500,000 bytes, just about 12MB (to count multiples you need to multiply that by the number of bits you want to reserve per number), but doing it in memory only requires a better compression algorithm than simple run-length encoding.
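For comparison, the on-disk bitmap version is only a few lines (a Python sketch; the scratch-file name and one-bit-per-number layout are my assumptions):

    import mmap

    SIZE = 100_000_000 // 8        # one bit per 8-digit number: 12,500,000 bytes
    with open("bitmap.bin", "wb") as f:   # hypothetical scratch file
        f.truncate(SIZE)

    f = open("bitmap.bin", "r+b")
    bits = mmap.mmap(f.fileno(), SIZE)

    def see(n):                    # O(1) per input number
        bits[n >> 3] |= 1 << (n & 7)

    def dump_sorted():             # one pass; indices come out in order
        for n in range(SIZE * 8):
            if bits[n >> 3] & (1 << (n & 7)):
                print(n)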

> So as it turns out, this is wrong. [...] I decided to code up a quick version

You rock. I wish more people at HN/SO/Google were like you.

It is extremely unlikely that you'll be able to sort a random sequence of 1 million 10-digit numbers like that, or with ANY method for that matter, unless the particular sequence is highly compressible. You won't even be able to store the answer in the average case. You'll need 10^6 * lg(10^10) bits to store the raw input, from which you'll be able to save about lg(10^6!) bits because it's sorted. That comes out to more than 1.7 megabytes.

Edit: the exact number of bits necessary is ceil(log_2((1e10+1e6-1) choose 1e6)), which is ~1.756 megabytes.

Edit edit: see udiv's comment below: that 1e10 should be 1e8 of course.
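If you want to check those figures without computing a gigantic binomial exactly, log-gamma gets you close enough (a Python sketch):

    from math import lgamma, log

    def log2_choose(n, k):
        # log2 of (n choose k) via lgamma; ample precision at this scale
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

    for space in (10**10, 10**8):
        bits = log2_choose(space + 10**6 - 1, 10**6)
        print(space, round(bits / 8 / 2**20, 3), "MiB")  # ~1.756, then ~0.96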

You've missed the second part, which is reading out the sorted list. By storing which numbers you've seen in a virtual bit field, you have 'codified' the seen numbers in the indices of the bit field. Thus to 'print the sorted list' you simply walk the bit field starting at index zero and print the index of every bit that is a 'one'. That is your sorted list of numbers.

The reason this was an interview question is that it let the interviewer see whether the candidate could see 'past' the requirement of sorting (which pre-loads your mental algorithm cache with all sorts of algorithms) to the somewhat more subtle concept of embedding partial computation in the data structure.

Early in my career at Sun I was tasked with putting mandatory file and record locking into UFS in the kernel; the challenge had always been how to avoid deadlock. One-level deadlock is fairly easy: if process A is holding a lock and waiting on process B, which is holding a lock elsewhere, and you are in the context of process B trying to take A's lock, then you can see that you would deadlock A. But 'n'-level deadlock starts getting harder because the possibilities multiply. I discovered that you could create a doubly linked tree structure with modest per-process memory growth, which any process could walk in one direction, looking for 'itself' on the list to discover whether a deadlock was possible. Here the 'code' was: walk the list, and if you find yourself already on it, you would deadlock by taking the lock.
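Something in the spirit of that walk, sketched in Python (purely illustrative; the real thing lived in kernel data structures, and all names here are hypothetical):

    def would_deadlock(me, lock, holder_of, waiting_on):
        """Walk the chain of holders and waiters; finding ourselves
        on it means blocking on `lock` would close a cycle."""
        p = holder_of.get(lock)
        while p is not None:
            if p == me:
                return True                 # we're already on the chain
            nxt = waiting_on.get(p)         # the lock p is blocked on, if any
            p = holder_of.get(nxt) if nxt is not None else None
        return False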

The number 'sorting' problem is very much like that: while numbers have different values, those values are intrinsic, unlike strings, which consist of an arbitrary sequence of tokens. So the goal of the question was to identify people who could see the difference between 'sorting numbers' as a problem and 'general sorting.'

Your solution still doesn't in any way fit into the required space...for a random sequence there will be hardly any coalescing going on, and you'll blow way past 1MB with the method as described.

They're 8-digit decimals, not 10, so it works out to 0.96 MiB.

I think the parent's solution works. The typical separation between numbers is about 2^7: that works out to 8 bits per number -- 7 for the length of the '00...0' run, plus one length-one '1'. 8 bits * 10^6 is 10^6 bytes, 1 MB exactly. You need a variable-length encoding, so that a typical gap around 2^7 fits in about 7 bits while still allowing larger gaps to be encoded.
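One concrete choice of variable-length code (a LEB128-style varint; my pick, the comment doesn't specify one) looks like this in Python:

    def encode_gap(gap):
        """Encode a gap in 7-bit groups; the high bit of each byte
        says whether another byte follows."""
        out = bytearray()
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte | 0x80)     # more groups to come
            else:
                out.append(byte)
                return bytes(out)

    assert len(encode_gap(100)) == 1        # a typical ~2^7 gap: one byte
    assert len(encode_gap(300)) == 2        # bigger gaps spill over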

There is a fairly intuitive way to reason about the storage requirement. Let's start with the case where you've seen no numbers in the range: you have the tuple (99999999, 0). That is two 32-bit numbers, or 8 bytes of storage. Now let's say you've seen all the numbers in the range: you've got (0, 99999999), again 8 bytes of storage, two 32-bit numbers. Since we're describing a linear bit field, the worst case is (1,1,1,1,1,1,1,1,1, ... 1,0): every odd number seen and none of the even numbers. So the memory to encode the virtual bit field will be (assuming 32-bit numbers throughout) 4 bytes * n, where 'n' is the number of distinct regions in the bit field. With a perfectly uniform random number generator, each generated number has one of two outcomes that matter for memory:

1) It splits a region (it causes a string of zeros to have a one in them) (adds 8 bytes)

2) It coalesces a region (it removes the last remaining zero between two regions) (subtracts 8 bytes)

I'm sure there are other solutions, and as was pointed out it doesn't really deal with duplicates. I suspect a multi-value bit field might cover that.

With 10^6 random numbers between 0 and 10^8 almost all operations will be splits, and you'll end up with 8 bytes * 10^6 = 7.6 megabytes. So indeed the chances of this working for a random sequence are extremely slim. You have to have a huge number of coalesces, which is just extremely unlikely when you choose 10^6 random numbers in a range of 10^8. So for almost all inputs this method doesn't work.

I simulated it with a Python program and you get about 10^4 coalesces. That doesn't even begin to make a dent in 10^6, and storing this will still take ~7 megabytes.
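Something like this reproduces that count (my reconstruction, not the original program):

    import random

    # Draw 10^6 distinct values from a space of 10^8 and count how
    # often a draw lands next to an earlier one -- the only way runs
    # ever extend or merge in the run-length scheme.
    seen = set()
    adjacent = 0
    for n in random.sample(range(10**8), 10**6):
        if (n - 1) in seen or (n + 1) in seen:
            adjacent += 1
        seen.add(n)
    print(adjacent)   # on the order of 10^4: no dent in ~10^6 splits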

> With 10^6 random numbers between 0 and 10^8 almost all operations will be splits, and you'll end up with 8 bytes * 10^6 = 7.6 megabytes.

Not if you encode cleverly. The successive differences will be on the scale of 10^8/10^6 = 100, which is a very small number. It takes 7 bits to store, or at least 8 bits in a variable-width encoding.

Sure, if you encode cleverly. That is the essential difficulty of the problem which he glosses over (note that he explicitly says that he will store the differences as 4 byte numbers). Getting it under 1.5MB (or even 1.1MB) is probably easy, but I have yet to see an encoding which is sufficiently compact to fit in 1MB.

You're right, I misread that. Still, that's an information theoretic bound, and I'd like to see anybody achieve that. That's not a lot of room to play with until you hit 1 megabyte. Unless you're doing something complicated across multiple numbers, that variable length encoding will already waste 1 bit at which point you're back to 8 bits and already way over budget. Additionally, if I generate a random array, sort it, and compute adjacent differences, then around 30% of the numbers are greater than 2^7.

Even though it's almost certainly possible and even practically achievable to compress the required data into the 1MB of space, doing so without using more than 1MB even in temporaries might be impossible. (To get this close to the bound you'll need to use "excess" entropy from one bit pattern to save space in the next; i.e. a delta value doesn't straightforwardly correspond to a bit pattern, but instead depends on context.) Clearly it would be hard, as you say.

An argument that you need context info: if you don't use context info, each bit pattern needs to be rounded up to the nearest whole number of bits. That wastes about N/2 bits. According to my calculations, that would raise the bound from 0.96MiB to 1.02MiB, i.e. too much.

But really, it's even worse. Let's assume you can indeed define such a compression and you manage to use the context sufficiently to avoid wasting too many bits. Further, you manage to do so without using much temporary space at all.

That means you'll need to be updating your sorted list in place. Each update is likely to be a non-trivial task, since you need to find the place to update (with variable-width encoding and no space for luxuries like an index, that's probably a linear scan), and you need to take into account some amount of context - and then, because you've changed some data, recode possibly unbounded amounts (so N) of following context. That makes this sorting algorithm quadratic; and it'll have a high scaling factor too since the operations we're counting here aren't cheap: we're not moving bits around, we're decoding and recoding them all the time in a context-sensitive way and that means doing lots and lots of branches.

Assuming the 1MB limit is reasonably inspired because the chip needs to be tiny, you're probably not dealing with the fastest chip in the world. As an estimate, I'd guess this algorithm (if we could make it) would be about 1,000,000 times slower than a normal sort (assuming a 13 times higher constant factor), and you'd be running it on something very, very slow. On a modern processor a plain sort takes about 75ms, making this thing 75 ks, or approximately 1 day. On that very limited 1MB proc...

So even if you manage to pull it off (very hard), it's almost certainly going to be very slow.

It is possible. Just a bit complicated but possible with proper encoding. Definitely not easy to code in an interview. Or in a day of work.

How does that handle duplicate elements?

It doesn't. And it looks easily broken the more I think about it. Anyway, running time would be unusable. This is not the right path. Don't drink the Google kool-aid ;)

Thanks for the context.

So that solution does not cover all possible cases. Lovely of Google to ask that question.

Edit: not only that, the running time of Google's "solution" would be terrible. And aren't there other possible data sets it doesn't cover, or am I missing something?

Actually it can. I'll leave it up to you to figure it out. Proving that the adjusted algorithm works in all cases gets into some interesting number theory (think Galois Fields).

I'd rather spend my time on a useful solution than on this path. Also, it's a BS question anyway.

Well in the case of Google's interviewing techniques they don't really care about the question per se, what they look for is how folks reason about things. You can start down one path, back up, head down another, and explore a variety of paths to possible solutions. The only bad answer is "That's a silly question, why would I ever need to do that?" or something similar. An unwillingness to engage in a discussion, even about 'impossible' problems, was the most common reason I saw on feedback where Google passed on the candidate.

> Well in the case of Google's interviewing techniques they don't really care about the question per se

From my experience interviewing for Google, I beg to differ. Many other accounts seem to confirm this.

Maybe they're just looking for improvements in the way they do sorting? Asking questions like these in interviews might be a way to find people who could teach their code monkeys new tricks?

Or it might have just been some silly way to eliminate candidates for arbitrary reasons. The usual.

Anyway, sorting is a big deal I think. If you can do it faster even by just a little bit than everyone else, that's a competitive advantage. Just my personal opinion.

I thought the popular Google version allowed you to use secondary storage, and the challenge was to be efficient in how often you swapped.

"The list of numbers may contain duplicates, which I must not discard."

I guess the Googlers would answer by having '0:1' entries for duplicates.

Big problem #1: insertions for 1M integers would take ages.

Big problem #2: some distributions can't be covered this way. For example, 1M integers spaced 99 apart (i.e. +99 each). Now think of the same thing but with random distances in the range 0:99. (Note 99999999/1000000 = 99.99..., so that's a possible input.)

Google's approach is nonsense, too.
