The question would go like this:
Suppose I have a column-oriented file and I want to print out a column in a reverse-sorted order. How could I go about it?
This question is among the most effective ever. First, it filters out the FizzBuzz failures right away and lets you see immediately how people think: does the candidate want to code it up, or do they understand that they could do cut | sort | head? It lets you explore the various aspects of sorting: numerical, alphabetical, different locales; within numerical you can have generic numerical sort, etc. Then what if the file is really large? Now a much better approach could be to split, sort, then merge-sort back into one file.
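For the "code it up" route, a minimal sketch of the numerical-versus-alphabetical distinction might look like this. The function name `reverse_sorted_column`, the tab separator, and the sample rows are all made up for illustration; nothing here is the one expected answer.

```python
import io


def reverse_sorted_column(lines, column, sep="\t", numeric=False):
    """Return one column of a delimited text stream, sorted in descending order."""
    # Naive split: assumes the separator never appears inside a field.
    values = [line.rstrip("\n").split(sep)[column] for line in lines if line.strip()]
    # Compare as numbers when asked, otherwise as plain strings.
    key = float if numeric else None
    return sorted(values, key=key, reverse=True)


# Hypothetical two-column, tab-separated input standing in for the file.
rows = io.StringIO("apple\t10\nbanana\t9\ncherry\t100\n").readlines()

as_text = reverse_sorted_column(rows, 1)
as_numbers = reverse_sorted_column(rows, 1, numeric=True)
```

The same column comes back in a different order depending on whether the values are compared as strings or as numbers, which is exactly the kind of follow-up the question invites.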
everyone with real work experience has a story about sorting.
but then you can move on, let's do it in your favorite programming language, then explore what happens if the data is "infinitely" long, a stream ... and so on
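One possible direction for the stream variant, sketched under the assumption that "reverse-sorted" gets reinterpreted as "the k largest values seen so far", since an endless stream can never be fully sorted. The name `running_top_k` and the sample stream are hypothetical.

```python
import heapq
import itertools


def running_top_k(stream, k):
    """After each new item, yield the k largest values seen so far, descending."""
    heap = []  # min-heap holding at most k items; heap[0] is the smallest kept
    for value in stream:
        if len(heap) < k:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # The new value beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, value)
        yield sorted(heap, reverse=True)


# An endless stream; only the first five snapshots are taken here.
endless = itertools.cycle([5, 1, 9, 3, 7])
snapshots = list(itertools.islice(running_top_k(endless, 3), 5))
```

Memory stays bounded by k regardless of how long the stream runs, at the cost of answering a narrower question than a full sort.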
it is a topic that can produce very interesting solutions, nobody is stressed out, and people that "fail" do understand why.
Edit: I will also say I feel that I can learn more about a person based on how they respond to easy questions. Are they cocky, are they showing off, are they rattled etc.
> does the candidate want to code it up or understands that they could do: cut | sort | head
Piping together commands is undoubtedly programming, just in ancient shell script. So in a sense you're really testing for bash expertise, which is maybe really relevant to you. But I wouldn't say you're really "avoiding" coding by knowing to use those commands together; you just decided to code it with obscure shell commands. I might fail your test by choosing to write that sort of thing in Python, because I could write it much more reliably in 10 minutes than deal with all the incredibly weird things that can happen in bash. I mean, the final product in bash might be slightly more elegant, but your terminal history is probably littered with "man cut" and various attempts at it.
But for your example, my answer might be: if you're querying this file once, you may be querying it again, so it's probably just way simpler to load the thing into SQLite instead of trying to imitate SQL with some janky Unix commands.
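A rough sketch of what that could look like with Python's standard `sqlite3` and `csv` modules. The table name `records`, the column names, and the in-memory sample data are assumptions made for the example, not anything from the original question.

```python
import csv
import io
import sqlite3

# Stand-in for the file on disk; a real run would open the actual path instead.
raw = io.StringIO("name,score\nalice,82\nbob,97\ncarol,75\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, score INTEGER)")

# Load every row once, converting the score to an integer on the way in.
rows = [(r["name"], int(r["score"])) for r in csv.DictReader(raw)]
conn.executemany("INSERT INTO records (name, score) VALUES (?, ?)", rows)

# The reverse-sorted column is then a single query, and later queries are free.
scores_desc = [row[0] for row in conn.execute(
    "SELECT score FROM records ORDER BY score DESC"
)]
conn.close()
```

The up-front load costs more than a one-off pipeline, which is the trade-off being argued about: it pays off only if the file really is queried again.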
> everyone with real work experience has a story about sorting.
I am 34 and an experienced coder and I literally have no stories about sorting, and I've never once wanted to sort a CSV file on the terminal.
what is not well received is the judgmental tone, passing judgment about me for things you cannot possibly know. no need for that either. simple questions also irritate some; very important to weed those people out too.
I expect you would fail the test because of the attitude. Even if this were a real job interview and you did your best to suppress it, it would come out.
But this is the irony: a job interview is a judgment. Why do you think feelings on this run so high?
How is this relevant? Now you're just taking cheap shots.
I would not recommend a candidate, who, when asked if they could do this with cut|sort|head would reply something like:
heh, what a pathetic question, I bet your history is full of "man cut"
it is not the right answer, it is needlessly obnoxious and indicates a person who can barely bottle up their emotions and quickly gives in under pressure. Usually not a good match for any team, unless they also bring in something massively beneficial.
"I wouldn't hire you with that attitude"
I'm getting more judgemental vibes from the interviewer than the interviewee.
there is some simple elegance and power to these old C tools
I'm a bit younger, but have done this dozens and dozens of times.
A lot of one off processes are way easier to handle with a bunch of terminal commands and pipes.
Not knowing Unix tools like cut and sort is a hard fail for a senior individual contributor in a data science role, as is using sqlite, which totally doesn't scale the way sort and cut do. Separates sheep from goats in data science land. You should really learn them if you're in the field and work with reasonably big data sets. Frankly you should learn them if you work with data at all, ever.
I've literally seen FAANG recommendation engines powered with these tools running nightly on someone's desktop.
But maybe that’s more about separating sheep from shepherds.
Learning sort and cut literally takes all of 10 minutes, so if it makes you pass over an otherwise qualified candidate you have your priorities completely backwards.
Generally speaking, people like this have never actually dealt with large data sets, never dealt with issues involved with installing "unapproved software" on a machine (ridiculously common in The Real World), have probably never cleaned a dirty data set (what do you do when your giant csv is formatted in a way that Wes McKinney didn't think of?), and will in a senior role be a long term liability for a data science team that works on serious problems. Sure, at one point I didn't know about them either: I wasn't a senior data scientist then. I submit that if you don't know about them and haven't actually used them, you aren't either.
Another good weeder for a person claiming to be senior: discuss how you would fix the performance of the default R naive Bayes implementation in e1071. It's numerically more or less correct, but written by deranged ape-men who don't understand how computers work (a problem in a lot of the R ecosystem; in the Python ecosystem, the problem is nobody has yet written algorithms for X, which ends up being a very similar problem: aka it's your job to code up sane algorithms).
OP is using knowledge of a specific technology as a heuristic for "has experience in role x"
But this always makes me wonder, couldn't you see that experience from a resume? If the candidate filled a data science role at somewhere reputable for 3 years, and you verify that they successfully filled that role, why rely on that heuristic?
As you say, testing for the specific technology, when it can be learnt in 10 minutes, does not seem logical.
> Generally speaking, people like this have never actually dealt with large data sets
If the data is on a filesystem, then sed, grep and cut pipelines will likely be your fastest option (Yahoo! processed petabytes of logfiles for decades that way.)
If the data is already inside a database table and indexed well, that could be fast enough. But generally speaking, the ETL is often a bottleneck. And DBAs are $$$$ compared to "the UNIX way."
This sort of utter nonsense question, heavily loaded to your "standard" experience, which is anything but, is even worse than the questions cited in the article.
All you're doing is filtering for people who are in your tribe, who followed the same path as you and think like you, use the same tools as you and the same OS as you.
Pretend all you want, but you're filtering not by "experience" but by trying to find people in your tribe, which is naturally heavily weighted.
I am not selecting for a tribe, I am selecting for a job. The questions are loaded, of course. Among the many duties, the jobs do require processing large files, sometimes with cut, Python or C. I want the candidate to use the most appropriate tool as needed. I'd rather not have people implement functionality that already exists in the 'comm' command.
Of course, I want the candidate to ask me what the column separator is. That's why the question is formulated that way.
The right answer will depend on the column separator. Proposing the UNIX cut if the file is CSV is not such a good answer, but for tab-separated files, it is just fine. If the file is CSV and they tell me about cut, my next question would be if that is a good universal solution for CSV files in general.
When someone knows about the pitfalls of using cut to parse CSV, it shows me they have indeed had experience with that.
Do you see why this question is the best? The possibilities are endless, and the rabbit hole is much deeper than it may seem.
TSV and CSV have the same limitations. A tab-separated file could still have tabs inside a field depending on the quoting convention. Either separator could be used with cut. I can't believe you are so confident in your partially truthful answers.
Oddly no one has heard of that, the only reason why I found out about it is because I had to read in punched tapes with 7 character ascii from an experiment done in the 80s during my undergrad.
Replacing CSV with flat JSON or parquet depending on the use case has been a good move for avoiding these issues. The risks of CSV are usually just too high.
The number of times I've had to write my own sorting algorithm in my career: 0.
The amount of grep/sed/awk I've used? Countless.
Someone who is familiar with how powerful and flexible these tools are is likely to accomplish something that can benefit from them quicker than someone who isn't aware of them.
Also in my experience software devs that shy away from the command line because they don't like it rarely pan out.
% echo 'a,"b,c",d' | cut -f 1-3 -d ","
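The cut command above splits on every comma, including the one inside the quoted field. As a small illustration of the difference, here is the same line handled by a plain split and by Python's standard `csv` module; the variable names are just for the example.

```python
import csv
import io

line = 'a,"b,c",d\n'

# Splitting on every comma, which is effectively what cut -d "," does.
naive_fields = line.rstrip("\n").split(",")

# A CSV-aware parser treats the quoted comma as part of the second field.
parsed_fields = next(csv.reader(io.StringIO(line)))
```

The naive split yields four pieces with stray quote characters, while the CSV reader yields the three intended fields.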
i.e. Be careful with your phrasing. That is a bias in itself.
I guess this is not what the questioner meant because they referred to using cut | sort | head as a solution. Though I don't understand why head would be at the end of either problem's solution, so maybe I'm missing something. head could be a useful way of peeling out the column you want in the column-oriented problem.
All tech interviews start with a need to legitimize and reinforce the interviewers as successful and talented ....
Even if we're basically all terrible
OP asked a very open-ended question and merely made a suggestion that it could also be done with some standard Unix programs (I grew up on Windows and even I know about cut and awk, because if you spend enough time anywhere in tech you will know these). Why it triggered you, not sure, but perhaps the question is doing its job after all.
I've watched a number of new grad hires pick up bash, vim, and version control from scratch in a month or two and go on to be very successful. For better or worse some good schools don't cover those sorts of ancillary skills, and not every good candidate will tinker with Linux as a hobby.
I didn't realize Linux Users counted as a race now :D
This is like complaining that you can't read the basics of a new programming language.
An example was a whiteboard problem requiring the BETWEEN syntax for SQL window functions, which is very uncommon. After I asked for a hint, the interviewer replied "You don't know the BETWEEN syntax for window functions? Everyone knows that."
I could tell I annoyed my interviewer when they told me I was wrong and I demurred, and politely asked them to look it up since there was some question about the facts. They did not look it up.
On a take-home test around that time, the question specifically said to use PostgreSQL just because the answer required a window function.
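For readers who have not met it, the syntax in question is presumably the frame clause of a window function. A hedged sketch using Python's `sqlite3` module follows; it assumes the bundled SQLite is version 3.25 or newer, which is when window functions arrived, and the table `daily` with its data is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

# ROWS BETWEEN ... AND ... limits the window to the previous row plus the
# current one, giving a rolling two-day total.
query = """
    SELECT day,
           SUM(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS two_day_total
    FROM daily
    ORDER BY day
"""
totals = conn.execute(query).fetchall()
conn.close()
```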
As a FAANG data scientist, I've never once wanted to use cut|sort|head nor have I wanted to work with CSVs. Everything is already sharded and encoded as a schema-enforced binary encoding like protobuf or thrift. The file is so large it's better to favor Apache Beam or equivalent to parallelize the aggregations of particular fields over very large amounts of data. But hopefully you just use some SQL-like interface such as BigQuery that, when pointed to sharded files, can easily do aggregations for you with a SQL-like language (which kicks off distributed computing jobs under the hood and is not truly relational). Unless you're streaming data, then that's another question.
Testing unix commands is narrow minded IMO. If you want to test divide and conquer plus streaming, then just ask a flavor of that Leetcode question.
My partner is learning about data science now so I asked them if I could try this question on them in the context of a data science interview, first thing in the morning and without coffee. They looked at me and said "being asked data science interview questions by your spouse right after waking up is the worst thing in the world but I dunno I would load it into pandas and put it in a data frame". And like honestly that's not how I would do it (I would do awk | sort | head because I always forget the cut column options) but the whole point is that the answer prompts further discussion. Now I know to ask about python and pandas (the thing the candidate uses and knows), and not, I dunno, scala and cascading/scalding or whatever (the thing that I know or use). Good questions investigate what the candidate knows. Bad questions investigate whether or not the candidate knows the things you already know.
People on this site are way too concerned with "being correct".
If a list is sorted, then you'd be able to return the largest value. Since that is impossible the correct answer is that it's impossible.
the purpose of the interview is to filter out people that do not know the answer or have pre-learned something and don't fully understand its applications. When you are in a dialog it is a very different dynamic.
people that would not ask a question because it is already posted on leetcode are the problem the OP complains about
This answer would make you fail the interview :-)
This sort of thing is sadly common in interviews, where the interviewer has some arbitrary answer in mind and expects you to read his mind, which is possible only some of the time.
By definition you can't sort an infinite list. You've conveniently turned the question, in your mind, into something like "how do you efficiently maintain an ordered list of incoming items?"
Hope I never pass your interview!
and it's literally called "merge sorted files" (page 175 of elements of programming interviews).
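For anyone who has not seen that problem, a minimal sketch of the merge step using the standard library's `heapq.merge` might look like the following. The three in-memory "files" and the helper `numbers` are hypothetical stand-ins for sorted chunks on disk.

```python
import heapq
import io

# Each chunk is already sorted, as it would be after a split-and-sort pass.
chunk_a = io.StringIO("1\n4\n9\n")
chunk_b = io.StringIO("2\n3\n10\n")
chunk_c = io.StringIO("5\n")


def numbers(handle):
    """Lazily yield one integer per line so a chunk is never loaded whole."""
    for line in handle:
        yield int(line)


# heapq.merge keeps only one pending item per input in memory at a time.
merged = list(heapq.merge(numbers(chunk_a), numbers(chunk_b), numbers(chunk_c)))
```

Collecting the result into a list is only for the example; on genuinely large inputs the merged iterator would be written out line by line instead.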
it is not so simple to regurgitate pre-learned answers when you alter a problem one small attribute at a time. in each case a different answer becomes optimal, thus you can see if the person understands what needs to be done or not
i feel very strongly this is a disingenuous claim. clinical psychologists go to school for a long time to learn how to assess people's abilities to think. why should i believe that you, a random software engineer, have any competency whatsoever? in reality this is exactly the reason standardized exams (or standardized interviews such as leetcode style problems) exist: because the average person can't accurately make that call.