FWIW, Ning turned me down back in April 2008 so feel free to ignore my advice :)
I once had an interviewer give me the kobayashi maru without telling me. I thought the guy was a jerk. He could just have pulled a gun on me and demanded my wallet if he wanted to judge my personality. I declined the second round.
As well as, of course, to showcase the interviewer's vast wealth of experience and keen insight into the human condition that enables him/her to assess as much in 5 minutes from candidate's response to a completely contrived question.
I never worked in a place where I had to solve tons of small little puzzles all day, or have to memorize a set of obscure language features, and can't look them up online if I need to.
As you said, the best interviews are the ones by a white-board, working together to solve something resembling a real problem. Because, eventually that is what I will end up doing if I work there.
Doing this requires more effort and guts from the interviewer, but it's still easier than being on the other side of the desk. Finding a supply of suitable questions is probably the trickiest part - you could try looking up some problems from competitions just prior to the interview, or you could perhaps take a real problem you are trying to solve at that point in time (abstracted as required).
I can only recommend doing it with a candidate you are more selling to then evaluating, and whom you think will be a good fit.
A strong candidate will jump in and you'll have a great session (I had one in particular with a great young erlanger whom we wound up losing to someone else, but the interview left a lasting impression and we kept in touch long afterwards).
Essentially a high-level design session where they got to see my thought process and I got to ask lots of questions about the firm. The best part is that when I started they were pretty comfortable with my approach to certain projects as it was previously covered, and I knew what I was getting into. No big surprises for either side.
The tough part here is to choose a good question.
More often than not, however, people "design" interview questions to cause problems for prospective employees. You might think this is clever. You might think this helps weed out people who don't possess the ability to think quickly or react to feedback on their feet. All you really are doing is artificially limiting the prospective candidate pool.
Every dork with a chip on their shoulder can manufacture an a question to be unanswerable. Deep down inside what most of these bad interviewers really want is to design a question that creates one of two atmospheres in the interview room: (a) a submissive one, where the candidate is forced to placate the interviewer and reinforce their apparent intellectual superiority; or, (b) a confrontational one where the candidate and interviewer battle it out for dominance as alpha-geek. Either (a) or (b) can be seen as a positive depending on the type of interviewer that goes in. Ultimately, any interviewer that uses this technique is a bad one and any input they make into the hiring process should be seriously discounted.
The interviewer should adapt their interview style to the candidate based on reading their body language, communication skills, and overall demeanor in the interview room. Its the interviewer's responsibility to create the best interview experience possible. Some people are nervous; some don't think on their feet well; most rely on Google heavily to point them in the right direction; lots of them know more than they can demonstrate in the 5 minutes you give them to react to your stupid question.
Your job is find diamonds in the rough. People who are the best, not people who act, think, behave, or answer the way you think they should. Questions like the one in this blog post are tell-tale signs of alpha-geek interview dominance gone wild, and out of that you'll typically only get one style of candidate that makes it through the gauntlet, certainly guaranteeing you're not getting the best talent available.
The question presented is a simple, open-ended question with a number of acceptable solutions. It's not even remotely "unanswerable". It's at least mildly interesting as indicated by the number of responses at a similar question on StackOverflow: http://stackoverflow.com/questions/2573653/given-a-1tb-data-... . It's not far from the problems Ning solves. It's not an "aha" problem that you either get or you don't get. It's complex enough to reveal the candidate's thought process and problem solving methodology, but it's small enough for good candidates to get to a reasonable solution in the 45-60 minutes of an ordinary interview.
Nothing about the article indicates that the author has a chip on his shoulder, nor that he's a dork who wants to "battle it out for dominance as alpha-geek."
Nothing about the article indicates that author is looking for a specific solution reflective only of his particular prejudices in programming.
What in the world is the basis for your entire diatribe here?
My diatribe, as you so eloquently put it, is based on the article. The list of things that the author expects to see in the answer. The way the question is presented. The lead off with "interrogation". The "gotcha" follow-up regarding how things might fail. These questions are great at finding a very specific type of candidate, but the entire point of my comment is that this one candidate type is typically not indicative of the best engineer. Of course you'll disagree, I'm sure, seeing as you were part of this process.
Not one candidate I can think of has been either surprised by this question, or felt stuck, fwiw. Again, I may be completely self delusional, but I hope not!
If questions like the one posted find a type of candidate who can think well under mild pressure, who can quickly brainstorm sensible approaches to technical problems, and who can communicate their ideas and reasoning to their peers, then that sounds like a pretty good outcome to me.
I only participated in Ning's interviews as an interviewee. I worked for Ning for six weeks before the layoffs.
The only thing I'd notice when I blur my eyes a little bit is that Ning employees were better engineers than most I'd previously worked with. The same applies at my current employer, also known for its rigorous interviews. The similarities ended there.
You complain here about the things the author expects to see in an answer. He's looking to see what they think of that scale of problem and how their thoughts comport with their resume. He's looking to see what questions they ask to get more detail. He's looking to see that the solution is in reasonable bounds for hardware--that the candidate understands the limitations of the silicon we have to work with. He's looking to see whether they would reinvent the wheel or use proven technologies. He's looking to see what languages and technologies a candidate would choose to use. He's looking to see how well the candidate understands those tools. He's looking to see how cognizant candidates are of the failure modes of the system they're designing.
Maybe I'm dense, but I don't see any issue with the list of things the author expects to see expects to see. Can you be more specific, perhaps?
Likewise I don't understand the problem with the way the question is presented. He describes the data you have and the load you need to support. Is there something wrong with that presentation?
The crossed-out "interrogation" is obviously a joke. I'm certainly not known for my intuitive perceptiveness and I still saw that (and no, I never met brianm while working at Ning, to my knowledge).
How is asking about failure modes a "gotcha"? Every good engineer needs to consider how his designs and systems can fail. What in that question do you expect trips up good engineers?
The only "specific type of candidate" this question seems oriented toward finding is "good engineers." He's not asking for particular technologies. He's not even looking for a particular solution. He's just asking people an open-ended question about how they would design a system. All the best engineers I know would be able to answer that question in 30 minutes.
I don't disagree because I was part of the process, I disagree because I can't understand your objections. I hate to sink to personal attacks like you have, but have you considered that your opinion on questions like this may have been formed by your abilities? It seems a more likely possibility than that questions like this are fundamentally flawed in some way that you've as yet been unable to describe. Both companies I've worked with that have had rigorous interview questions of this sort have hired the highest quality talent I've ever had the pleasure of working with.
The author clearly maintains some expectations about the level of response from the candidate, but overall it's designed to be interactive. Chances are that unless the candidate gives a really superior answer, it's not going to be the only question asked.
Generally, I would say that any time an author takes the extra trouble to write in an extra word and cross it out, they are making some kind of meta-comment deliberately, rather than straightforwardly revealing that the crossed-out word is representative of their true state of mind. The crossed-out word is a meta-comment on what the author thinks about the shared idea-space of the author-plus-audience, usually to say "hey I know that we all know that this concept exists", usually to also say "and that this concept is an unmentionable", and often to also say "and I think that's a bit funny".
In this case, we all know that some people conduct interviews as interrogations (and some people think that indeed they should do so), and on the other hand some people think that this is offensive. And brianm wants you to know that he knows we all know it.
MAYBE it's on his mind because he's an aggressive domineering jerk who likes to interrogate people, but he's sensitive enough to be a little defensive about it. MAYBE it's on his mind because he's NOT aggressive or domineering, and he's frustrated by how easily interviews become interrogations, and his whole deal is trying to prevent that without making the whole interview meaningless. MAYBE something else. How're you going to find out? Read the article!
My point, tl;dr: If you see a word crossed-out, by an author you don't already know, and you think "Oh this author really means the crossed-out thing", I think that's a poor comprehension strategy.
Given that the author presents the exact question and includes a moderately lengthy discussion on how he uses the question, what is your evaluation in this specific case, rather than the general one?
Consider, for example, this comment from the article:
I’m looking for quite a few things as we go through the question. The first is their opinion of “a terabyte of data” and “5000 lookups per second.” Do they consider this to be a lot of data, or a fairly boring amount, same with the lookups per second. Leaning either way isn’t a failure, it is just information gathering for me, and referencing it against how you represented yourself in your resume, cover letter, and phone screen.
Your lip service about finding diamonds in the rough and catering to candidates who are nervous or can't think on their feet leads to just one comment: "Put up or shut up."
The OP has given us a very good, concrete example of his interview process, the traits it reveals in candidates, and how he tailors it to meet individual candidate reactions. Until you do the same your comment here is nothing but five paragraphs of whining about how some people don't interview well.
To which I reply, "Duh."
We all know the 45-60 minute interview is imperfect, and that both interviewer and candidate are just trying to make the best of a very contrived interaction to begin with. So, please, lose the attitude and give us something we can learn from rather than bashing the OP with vague generalities.
Do you consider this question to be unanswerable? As I stated, I'll accept pretty much any viable answer, my goal is to get folk to design something ~realistic (okay, this is a simplistic system, but we have finite time in an interview slot).
Honestly, I think the question I blogged about is pretty straightforward, and exists for purposes of framing a discussion about real systems and constraints. There is no answer I was fishing for at any point I used this question.
"Questions like the one in this blog post are tell-tale signs of alpha-geek interview dominance gone wild," is so far off base from what I am trying to convey that I think I utterly failed to convey my point, which was quite literally the opposite :-(
So yes, with your interviewing technique (orthogonal to the technical content of the question itself) you've succeeded in conveying a stance and mindset that is quite literally the opposite of what you intended.
Because all I got from the article is that he has lots and lots of follow-up directions. I.e., that he is prepared.
It provides insight, yes, but a certain cost: many candidates fine this approach simply alienating, and more to the point, it upsets their thought processes (such that they don't really get a chance to shine).
As a candidate you can train for this kind of confrontational approach, of course, but I think lrm242 nailed what's most alienating about this interviewing style -- that ultimately the candidate is put into a position that's basically uncomfortable to be in, where they have to _placate_ their interlocuter according to some hidden criteria set -- and mostly they just want that particular thread to be overwith as soon as possible.
I just think we can get the same insight into candidates from different ways, where things are more on an even keel, and there's more "flow." It takes more tact and creativity (and perhaps a bit more time), but it's not that hard, is more mutually enjoyable and (I think) ultimately provides a lot more insight.
 "flow" being of course a very important concept; e.g. in the sense of Murakami.
Yes. Given a little time I could come up with 20 or more so ways of doing this. As to which one I would use that's not really answerable in an interview. The problem with questions like this is two weeks is a fairly long period of time. If it's literally all your working on then researching a few options and then asking some more questions is the best place to start. aka Is the access pattern random or would caching be useful? Is the data private? etc.
This isn't an unanswerable question. I've asked and gotten great answers to a very similar question about handling data at scale. And what I'm looking for includes the opposite of a chip on the shoulder.
On the other hand, if you have a perfect hash function for the keyspace into a hash addressing space (you'll want something a bit larger than 2^32), you could simply choose the device with the first few bits, then do a single seek into a (large) file on disk with the position indicated by the remaining bits of the hash addressing space, shifted over appropriately for the 128-byte stride of values. You don't even need a perfect hash: with a merely good hash function, you can cheaply read in more than 128 bytes from disk (OS will probably be reading in 4K anyhow), and use open addressing with linear probing.
You'll need enough devices to spread the load to get good latency for random seeks; probably a combination of mirroring and what amounts to sharding, using a portion of the key hash to select the device. But that's a relatively cheap numbers game.
But I'm not really seeing what a DB buys you here, in this specific scenario.
EDIT: Further investigation of tc and cdb (as linked in the article) suggests that tc (Tokyo Cabinet) may work in its hash form, but I'd rule out cdb for using two seeks, which I would expect to double the number of devices required.
This is why interviews are so tricky - it's hard as a candidate to juggle answering the question and trying to decipher what traits the interviewer is really trying to expose with this question.
For example with this one: is the interviewer trying to see if I understand my data structures and the size of this data to the point where I could design such a system from the ground up? Or are they trying to see how practical I can be given real-world time constraints?
Ofttimes, though, candidates have started designing a general purpose on-disk hash database. That is fine and dandy, and we always need a better database, but it is pretty orthogonal to the immediate problem.
Actually, ofttimes engineers have implemented general purpose on-disk hash databases as part of j. random project, so that path certainly does occur on real projects :-)
You can think of two extreme answers to this as "buy oracle" and "buy some transistors". Those are both bad answers. The first is a bad answer because, unless the candidate can answer the follow-up question "how does oracle work?" it probably just amounts to wishful thinking and they almost certainly won't be able to say when it will fail. The second is a bad answer because you can't do that in two weeks by yourself.
Take for example, Foursquare's recent outage. From what I read online, it seems like they went with "use MongoDB" as a strategy without actually understanding what Mongo was doing under the hood. Presumably they didn't figure out how it would fail (the author's follow-up question, natch) and thus they discovered that the hard way. So they chose bad answer one. On the other hand, had they decided to implement their own database, they'd know all about its characteristics, but they wouldn't have launched yet.
I try to emphasize that I want to know what they would actually do, on the job. Other questions (typically other interviewers) will explore algorithmic stuff, and folks I work with have used this question with that intent -- I use it to see how they think about building systems.
Another thing I didn't mention in the blog post is that up front I tell them exactly what I am looking to learn. I am not going on a fishing expedition, or trying to trick anyone. I find that kind of "second guess the interviewer" thing to be asinine, so strive to avoid doing it myself.
In terms of randomly distributed requests:
I'm surprised the topic of "working set" is not part of the question. Even with huge amounts of data, if the working set can be stored in RAM and only rarely swap data in and out, its a much different problem than 5000 disk reads a second. Lots of databases are optimized to efficiently maintain cached data in memory and swap as little as possible and as efficiently as possible.
It's a problem when asking interview questions: there are many ways in which the question may be unclear and the interviewer is mislead that the applicant can't answer. Another problem may be that the interview interferes with the thinking style of the applicant: you try to discuss it with him while he needs time to think by himself, or just the opposite: he thinks better in a discussion and you stay quiet while he sweats in stress.
yeah, when I was interviewed for my university, they asked me to design an algorithm to work out the n'th term of the fibonnacci sequence. They repeatedly said I didn't need to use recursion or loops.
When I finally gave up, they said: well here's one way:
f(n) = f(n-1)+f(n-2)
I just looked at them and quietly said "well... yeah... I mean... obviously". By that point I was too depressed to say "but that's frickin' recursive, wtf?!?" but that's what I was screaming in my head.
Amazingly I still got offered a place.
edit: still bugs the crap out of me. Did I mis-hear them? Were they being deliberately misleading? Was it some kind of test to see how I reacted to an impossible problem? Were they really expecting me to come up with some mathematical algorithm to calculate the nth term without recursion or looping? Argh!!!
: actually it's not impossible: http://mathworld.wolfram.com/BinetsFibonacciNumberFormula.ht...
but I don't have a background in mathematics and they knew that.
Here's the problem it started from:
Eleven lily pads are numbered from 0 to 10. A frog starts on pad 0 and wants to get to pad 10. At each jump, the frog can move forward by one or two pads, so there are many ways it can get to pad 10. For example, it can make 10 jumps of one pad, 1111111111, or five jumps of two pads, 22222, or go 221212 or 221122, and so on. We'll call each of these ways different, even if the frog takes the same jumps in a different order. How many different ways are there of getting from 0 to 10?
I immediately started thinking about seek times and elevator algorithms and that kind of stuff, which really threw me off course.
Not necessarily off-course. If you have 32 GB of RAM and you don't mind an average latency of 10000000 ms, a single 1 TB disk works just fine,
Use the filesystem. It's what it's there for, particularly if the data never changes. Nice big cache, xfs, nested hierarchical datastore using folders based on keynames, 20 lines of code, job done. 5000 lookups a second is trivial throughput for a decent disk with a decent readahead cache.
If multiple storage devices (hard disks or otherwise) are allowed, RAID (or just splitting the data across multiple disks) would ameliorate the seek problem somewhat, since seeks can then happen in parallel. I wouldn't see this problem as requiring a distributed solution, unless the bottleneck is the bus or storage controller rather than the single-disk seek performance.
You could also try increasing the throughput by having an in memory cache. The index/hash table can also be in main memory. Depending on the keys, locality can play a crucial role in prefetching data.
Damn you, now you've got me doing it. Nerd sniper. ;-)
One of my favorite questions is - Give me a normalized database structure that you'll implement if you were to build gmail - incorporate conversations, messages, multiple message participants and labels.
Then, depending on the candidate, I build upon the question,and go into various optimizations possible, the ways caching would be implemented, sharding/splitting/de-normalization would be done with load, etc. etc. With good candidates, its always a very interesting discussion. And even the bad ones can leave the interview thinking they know something and don't need to feel bad :)
Good candidates will start recognizing and acknowledging trade-offs, whereas bad candidates will settle on one clear path.
Good candidates will speak about the ideas they're trying to get across, whereas bad candidates will drop into talking about specific brands of technologies.
One candidate when faced with some questions on how he'll handle load said he'll switch to Oracle :) Promptly asked to leave.
Good ones clearly show that they have the ability to deal with multiple tracks of ideas, know that trade-offs need to be taken care of, in general, have an approach that with some ingenuity, almost all problems are solvable. Thats the attitude I (and I guess you too) look for.
I'd like to not be embarrassed when asked this question. What advice would you give someone wanting to learn enough to answer this question?
Are there any projects, books or other resources one could undertake?
In the intervening months since the layoffs, there's also been some attrition (as there always is) in those who weren't laid off. I know of at least two people who weren't laid off, but left for Facebook shortly after the layoffs.
I also know that they've already hired back at least one of my former coworkers since the layoffs.
Don't guess, ask. Having asked questions like this before, I can assure you that a reasonable interviewer will actually be pleased that you understand the alternatives.
(And, of course, if you get an unreasonable interviewer, you probably don't want to work there anyway.)
The first thing I did was work out the actual bytes/sec (throughput) needed for this many requests. Also given the 2 week timeframe I would question really hard what the data set looks like. Since the requests aren't randomly distributed does that mean you can narrow it down to a small enough portion to fit inside RAM.
I think the question is a good one because it does touch on a lot of aspects of infrastructure and systems design.
Does this follow when the original article says:
"The keys are not distributed evenly within the keyspace."
"Requests are distributed evenly (and randomly) across the keys"
On each request do a binary search. It takes 32 seeks in the worst case.
Even current consumer level SSDs can handle 45K IOPS, so you'd get 1400 complete searches/sec from one SSD.
You might need to spread your data over 4 SSDs, but you can search in parallel, so it will even be faster.
If you want more speed, replicate your drives and load balance.
There are better ways (with perfect hash functions), but this is the easiest, and requires no additional storage.
I worked on it on average two hours per day. It's not that ridiculous to believe that it would be (near) production quality in two weeks of fulltime work.
The question is somewhat similiar to
"give me the fastest sorting algorithm and prove that it is"
Everyone with a CS-degree or something similar will know this prove and Qsort from his very first years and I claim almost everyone will have forgotten the prove.
(I can remember that you can prove this by looking at the "descision" tree passed on the 1on1 comparision which will have depth of about the needet size ;) - but this is it)
I think most people will get the prove (even without google or wiki) in a resonable time (some hours?) but within interview-time and in this situation? In my case no way.
Because of extreme primitiveness of this case, custom solution would work better here than general purpose db. Sort the pairs by key and place in the tree, with the root node containing 2^11 keys (512K), second level - 2^11 nodes of 2^11 keys each (512K each node size), 3rd level - 2^22 nodes of 2^10 keys (256K each node size). Keep root and all the second level nodes always in memory (1G RAM required). Given a key to find, look it up through the root and the second level nodes in memory - this would result in the index of the 3rd level 256K node to be read from disk. Time-wise reading a 256K block from disk is just one IO - pretty much the same time as reading 1 byte.
5000 lookups/sec at 1 IO/lookup - 20 disk array of 250 IO/sec disks - 15K ones would do it. 2 arrays of 400GB disks is still ~2 times cheaper than 1 TB SSD array. RAID5(6)-ing the disks will leave us with 6TB. Short-stroke the first 1TB (70Gb x 20) to get much better speed than 250 IO/sec (or just use less than 20 disks to start with) and enjoy the rest 5TB for non-IO intensive purposes.
First you write:
> a system that lets you look up the value for a given key
and then later on you write:
> if they think the problem of looking up a key by value has been solved before
Did you mean to say "looking up a value by key" or am I misunderstanding it?
(Given that no one else has commented on it makes me question my question but I'm trying to get over my concern of being wrong on the internet and asking to be sure. :) )
Also, what does the interviewer mean by this:
> I generally ask these folks if they think the problem of looking up a key by value has been solved before, especially given the two weeks to be live in production requirement.
Is that a typo? s/key by value/value by key/ ?
closely followed by
"Draw a picture of your job/Facebook/the future
For some people this may seem like a fun task and they'll get excited about it, some may not find it useful and argue about it and others (the ones you want to avoid) will frown upon it and say 'Fine. I'll do it anyway'.
Most of the people I've interviewed are well accomplished engineers, but that doesn't always mean I want to work with them.
Why do you think it is NOT useful to draw a picture?
I've been hiring game programmers for a long time now. I think there are plenty of other ways to evaluate the non-tech personality/team-player side of candidates. Start with talking to them.
I ask very blunt questions in this area. Will provide examples if folks are interested.