

A Fun Developer Interview Question - dangoldin
http://www.dangoldin.com/2011/01/08/fun-developer-interview-question/

======
davidst
I've done quite a bit of work with dictionaries going as far back as 8-bit
machines with tight memory constraints. Bear with me here as it will get
interesting.

A Bloom filter is a great data structure but probably not the best choice for
this problem. It's not as memory efficient as you think.

Here's why: Think of a Bloom filter as a collection of hash values where each
individual hash value is expanded by 44%. The cost of maintaining the same
false positive rate of an explicit log2(n) hash is to use 1.44log2(n) bits
(see <http://en.wikipedia.org/wiki/Bloom_filter> for more detail.) So, if you
want a false positive rate of 1 in 64k you could store a 16-bit hash in an
array or a 23-bit hash in a Bloom filter. That's a significant tax to pay when
memory is tight.
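To make that tax concrete, here is a minimal Bloom filter sketch. The sizes and the SHA-256-based position scheme are illustrative choices, not anything from this thread:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per word, no deletions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must be <= 8 with this digest scheme
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Carve k 4-byte slices out of one SHA-256 digest (illustrative only).
        digest = hashlib.sha256(word.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))
```

At the optimum, a Bloom filter spends about 1.44·log2(1/ε) bits per word for a false-positive rate of ε, which is exactly the 44% overhead described above.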

In this case it's better to keep a bit array of explicit hashes than to use a
Bloom filter.

An array of hashes gives you some options to make it smaller:

1. Sort the array by hash value and store only the differences using a delta
coding scheme. The downside is you have to search the entire array (or
maintain some indexes into it) to look up a single word.

2. Break the array up into sub-arrays where each sub-array is chosen so the
first N bits of the hash are the same (and discarded.) Each sub-array uses a
smaller hash value and can still be binary searched.
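A rough sketch of option #2 (the hash function, the 32-bit width, and the 8-bit prefix split are all illustrative assumptions):

```python
import bisect
import zlib

PREFIX_BITS = 8  # top bits select a bucket and are then discarded

def word_hash(word):
    # Any well-mixed 32-bit hash will do; crc32 is just convenient here.
    return zlib.crc32(word.encode()) & 0xFFFFFFFF

def build_buckets(words):
    buckets = [[] for _ in range(1 << PREFIX_BITS)]
    for w in words:
        h = word_hash(w)
        prefix = h >> (32 - PREFIX_BITS)
        suffix = h & ((1 << (32 - PREFIX_BITS)) - 1)
        buckets[prefix].append(suffix)  # only 24 bits stored per word
    for b in buckets:
        b.sort()  # sorted, so each bucket can be binary searched
    return buckets

def probably_contains(buckets, word):
    h = word_hash(word)
    prefix = h >> (32 - PREFIX_BITS)
    suffix = h & ((1 << (32 - PREFIX_BITS)) - 1)
    bucket = buckets[prefix]
    i = bisect.bisect_left(bucket, suffix)
    return i < len(bucket) and bucket[i] == suffix
```

The sorted buckets are also exactly where the delta coding of option #1 would apply if you wanted to shrink things further.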

Of course, strip prefixes and suffixes from the words so the number of hashes
that must be stored is reduced.

1.5 bytes per word is easy and with some work you can approach one byte per
word with a low error rate.

If you're allowed to do off-line preprocessing on the word list you can do
better.

Look up the history of Doug McIlroy's Unix spell checker. He got 75,000 words
to fit in about 52k (going from memory here and my copy of Bentley's
"Programming Pearls" is at home.) McIlroy traded away random access and had a
fairly high error rate to get that much compression. Essentially, what he did
was to create a large 1-bit Bloom filter and delta-encoded the differences
between set bits (similar to option #1 above.)

If you want to maintain fast random access for individual words you can do
this with a Directed Acyclic Word Graph (DAWG). A simple DAWG with fixed node
sizes and no stemming of any kind will get two bytes per word on English. Add
variable node sizes, stemming and a few other tricks and you can have fast
lookup with a near-zero error rate at a cost of about one byte per word.

~~~
dangoldin
This is a great response - thanks for the info and I will definitely read up a
bit more on it.

Just goes to show that there are almost always better solutions than you can
think of!

~~~
davidst
> Just goes to show that there are almost always better solutions than you
> can think of!

It's true. One of the characteristics I always looked for when interviewing
programmers was whether they were open to other ways of doing things or
whether they fixated on the first solution that presented itself. The best
programmers are always looking for a better way.

------
Sandman
It seems to me that this is a classic case of "guess the answer I'm looking
for". The interviewer is obviously impressed with the concept of a Bloom
filter, and I'm not saying that there's anything wrong with Bloom filters, but
that kind of thinking in an interview is dangerous. What if the interviewee
disagrees with your views, and has strong arguments to use a completely
different approach? Would you accept them, would you at least acknowledge them
as valid, or would you stick with your solution as 'the only right way to do
it'?

~~~
Mesmoria
I agree. I can only assume that the article is an example of one possible line
of reasoning toward a solution.

Given the performance characteristics of human typing speed I would use a
disc-based database. Simple to implement.

~~~
dangoldin
I forgot to mention that I do hear the disc-based solution quite a bit and
it's definitely a valid response given the problem description. But I would
still say "let's say you don't have any disk space, how would you go about the
problem?" while still acknowledging that the answer was valid.

~~~
wccrawford
So at all times, you continue to lead them to the answer you want by
continuing to add restrictions until you get there. And if they don't
understand your favorite new concept immediately, they aren't judged worthy?

This isn't good interview technique.

~~~
dangoldin
I'm being defensive now, but I think that's a pretty extreme view. Every
answer is part of the interview process; it's not about the final destination.
The more answers people have, the bigger the conversation and the more I can
understand how someone thinks, which is what you want to get out of an
interview.

The fact that someone quickly grasps the concept is a plus - why do you think
it's the wrong approach? I definitely adapt my approach as I do more
interviews, so any feedback helps.

~~~
Xk
First of all, let me say this is a significantly better question than many
I've seen. However, it disturbs me when someone is interviewing me and they
are looking for a particular answer and they consider everything else wrong.
Even when I find more efficient answers, if it's not theirs, it's wrong.

Think about it like those little kid riddles: "A man is found dead with a
noose around his neck and a puddle of water at his feet. How is it he was
killed?" Put on a stool and the stool was removed? Then add the restriction
"There are no stools." He was lifted with the rope? "The rope was tied to a
bar on the ceiling and couldn't be raised higher." Raised in the air by some
strong people and then let hang? "No one raised him from the air." You can
make guesses all day long and I can make up excuses about why you're wrong
until you finally guess what I wanted you to guess -- that he was standing on
a block of ice.

More specific to your question: when I read it the first thing that came to
mind was just to have an array of booleans and for each word in the table set
H(word) to true and then when looking up a word check if H(word) is true. I
don't know what Bloom filters are, so I'm not saying this answer is better
than them -- but after I suggest that instead of imposing some other
restriction, ask me to think about what is wrong with my answer. Talk to me
about the false-positives/false-negatives [or lack thereof]. Talk to me about
what hash function I could use. Or about why I picked that way. Or what
problems could happen if the hash function wasn't perfectly uniform.
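For what it's worth, the bit-array scheme described above (in effect a one-hash Bloom filter) might look like this; the table size and the hash function are made-up illustrations:

```python
NUM_BITS = 1 << 20  # one megabit of table; an arbitrary size for illustration

def h(word):
    # Simple 31-polynomial hash; a real spell checker would want
    # something better mixed.
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % NUM_BITS
    return value

def build(words):
    bits = bytearray(NUM_BITS // 8)
    for w in words:
        pos = h(w)
        bits[pos // 8] |= 1 << (pos % 8)
    return bits

def check(bits, word):
    # True for every word that was added (no false negatives); may also be
    # True for an absent word whose hash collides (a false positive).
    pos = h(word)
    return bool(bits[pos // 8] & (1 << (pos % 8)))
```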

~~~
dangoldin
Fair points, but I don't expect people to get to the Bloom filter on their own
- I view it as more of a process that both of us are driving towards. I also
enjoy hearing new solutions - there's always room for discussion and I will
try to dig deeper into answers if I sense something is wrong.

At the end, though, I do like to get to the Bloom filter piece, in order to
see what the response is as well as to talk about something that is new to
them.

------
gfodor
If your interview question's success or failure is highly correlated with the
person's ability to apply a fairly esoteric data structure it's probably a bad
question.

The person who has never come across Bloom filters before has a huge
disadvantage in your question, and the person who has seen them before has a
huge advantage. I'd guess that as soon as a candidate says "Bloom filter"
before you do in an interview, it's an instant "hire" for you, but this is a
bad sign since you will get lots of false negatives and false positives.

False negatives will be good devs who never needed a Bloom filter (there are
many) and false positives will be green devs that have read the wikipedia
article on Bloom filters just before the interview out of dumb luck, or had a
question where that data structure was introduced in the way you're
introducing it, etc.

~~~
hinathan
So you're saying this interview question is a probabilistic (with the
potential for false positives) tool for detecting membership in the set of
people who are equipped to understand the concept? Sounds like a Bloom Filter
to me.

------
harryh
I'd probably just keep a hash of each of the words in the first list rather
than all the actual words. Simple & easy to understand for whoever has to deal
w/ the code after me.

I'm not 100% convinced that "ability to understand bloom filters after 5
minutes of explanation on them" correlates very highly with what I tend to
look for when making a hire. Then again, hiring is such a crap shoot anyways.
Mostly you just want to get someone talking about programming for a while and
try to determine if they know their shit or not. The exact topic is close to
irrelevant.

~~~
dangoldin
You pretty much hit the nail on the head with the "get someone talking
about programming" piece but I try to start with one problem and dig deeper
into it rather than having many questions as if going down a checklist.

That's why you also have more than one person do the interview so you can get
multiple perspectives.

------
Adrock
My personal favorite solution to this problem is to use a trie.

When asked to reduce the memory usage, you can recognize that the trie is
actually a DFA and apply a DFA-minimization algorithm. There are several other
interesting approaches here:

<http://en.wikipedia.org/wiki/Trie#Compressing_tries>
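A minimal dict-based trie sketch (the end-of-word sentinel is an arbitrary choice); sharing suffixes via DFA minimization would be the next step:

```python
def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # "$" marks end-of-word (an arbitrary sentinel)

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```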

~~~
dangoldin
That's actually pretty neat - no one has really suggested this approach yet.

I don't think that this will work as memory keeps on decreasing - is there a
way to make a trie probabilistic?

~~~
Adrock
Instead of being probabilistically wrong, you can save memory by removing some
branches of the trie and becoming deterministically wrong.

You could remove the largest branches that correspond to the fewest words.
Alternatively, you could remove the branches that correspond to the least
queried words if you know something a priori about the query pattern.

------
hackinthebochs
The responses here are amazing. No wonder hiring is so broken for developers.
Someone actually comes up with a good question and he's practically attacked
for it.

~~~
dangoldin
Well it's an ongoing process to come up with the question and if I can get
better at interviewing it helps me as well.

I usually ask what people think of the question after the interview and the
responses tend to be pretty positive (although it may be due to the fact that
they are interviewing). It's a great feeling exposing a new concept to a
developer though and seeing them get excited about it - those are the exact
people you want to work with.

------
kqueue
The way I see it, you wrote an article to tell us you know about Bloom
filters and your interviewee doesn't.

------
alok-g
Here is a similar interesting problem that I was once posed. Not sure how
common or rare this is:

You get a stream of words (using functions like getFirst(), getNext(),
boolNoneRemaining(), etc.). Once you reach the end of this stream, you have to
output a random word uniformly chosen from this list.

The first answer is of course to store the entire list and maintain the size
of this list. The moment it finishes, you multiply the size by a random number
...

Then comes the twist. What if there isn't enough memory to hold the whole
list?

The thing that struck me first was that someone posing this problem at all
meant that there was probably a solution, which I could then think of and
program shortly.

Now frankly, when I interview, this is usually the last question on my list.
Most candidates do not reach this far. Believe it or not, about half of the
applicants are not able to draw a finite state machine that identifies whether
a given binary number is odd or even (when the input is a stream of bits with
the LSB last...).
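For the record, that machine is tiny: a binary number is odd exactly when its least significant bit is 1, and since the LSB arrives last, the state only needs to remember the most recent bit seen. A sketch:

```python
def is_odd(bits):
    """Two-state FSM over a bit stream (MSB first, LSB last)."""
    state = 0  # state "even"; an empty stream is treated as zero
    for b in bits:
        state = b  # the next state depends only on the incoming bit
    return state == 1
```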

