Word. Totally off topic, but: I've worked in the field of information technology for more than 20 years now, more or less. Not always the same focus, not always full time, but always IT-related. I consider myself a good problem solver because of my self-learning and analytical skills.
I recently applied for a job as a BI developer. The interview consisted of 10 questions about SQL. I more or less answered them, getting just one or two wrong. Not wrong as in "not correct", but rather "not exactly what we expected" or "you did not see the little traps".
Turns out they didn't take me because of my lack of SQL skills. I do not understand how this kind of recruiting process helps anyone find skilled people, or how it is still common practice. It's frustrating for people like me, who do not have the complete SQL syntax in mind, but are flexible in choosing their problem-solving approaches. A couple of years ago I started at a big data company, having never heard of MongoDB before, with little skill in Bash. If they had just asked me questions about those, hiring me would have been a total no-go. They did hire me. I improved processes, measurably, and mastered MongoDB. Nothing that one could have predicted from a questionnaire.
A second interview, same outcome. They didn't even ask me detailed questions, just wanted to know what my SQL skills were. I answered: intermediate, but I'm good at learning. They did not take me either.
Although I understand that this kind of skill is hard to evaluate, I'm really frustrated when I face those "hiring techniques". Or maybe I'm just not good at SQL, and they anticipated it... ;)
Interviewers don't seem to realize that knowledge and fluency are a trade-off. If I'm amazing at SQL, I'll have gaps elsewhere, and vice versa. There's just too much to learn and to stay on top of.
My takeaway when I fail an interview due to nonsense like this is that these aren't places I would've been happy working at anyway so they did me a favor by not hiring me.
Unless the company's goal is to hire someone with the technical minutiae memorized whom they can pay the lowest wage possible :)
Good. You don't want to work there. It's not a good place to be in. That signal is very clear.
I was interviewing for a senior data science role and the other points of contact I had in the process were surprisingly non-technical conversations with a VP of data and another senior data scientist.
Alas, my skills in rotating blobs on a grid failed me that day so that was that.
I know a couple of developers there, and as far as I understood, all applicants have to take the IQ test. They also thought it was ridiculous, but the CEO really, really likes it, so it stays.
I call this the "game show style" interview. You know the answer, you get $10k. Next question. You didn't know the answer? You're out of the show. Next contestant.
To me this is very disrespectful of your time and your capabilities. Once I was pumped to go into an interview with a very well-known company in Sydney, but was dismayed to learn that they do this game show interview (with an IQ test, no less!). I wouldn't want to waste my most precious resource (i.e. time) on a company that doesn't respect me.
Success is often a trial-and-error process, involving slowly building a deep understanding of the problem you're trying to solve (business and technical), hitting lots of problems, and not giving up too easily (at least, not giving up due to surmountable technical hurdles).
This often means spending many hours banging your head into a brick wall; but in my experience these are often the times I'm learning quickest even if it doesn't feel like it at the time.
Having a background in acoustics, reservoir characterization, or telecom networks opens up clients, because you 'get it', or at least you work hard to get it, which improves the experts' buy-in to sit down with you and answer your questions. You did your homework.
If you don't and just storm in talking about something something neural nets, they'll see it as a waste of time, won't bother explaining nuances, will delay sending data you desperately need. You won't have their cooperation even if you have executive support. There's no data in CSV form or an API to hit in most real world projects, so you need their help getting data, and their expertise to understand it.
Another major point is specifying the metrics. The real world metrics, not AUC or F1 scores. You need collaboration to get there, too.
There's so much to be done before there's data to work with, let alone good data. And there's so much after the model building step.
It can drive people to quit. One reason is that when you storm in and consider that people are morons, you get frustrated rapidly.
>> Another major point is specifying the metrics. The real world metrics, not AUC or F1 scores. You need collaboration to get there, too.
>> you need their help getting data, and their expertise to understand it.
All this x100. And it's made worse because this humility isn't taught in schools and is rarely probed in data science interviews.
The previous person on the project, although brilliant technically, thought they were "idiots who didn't understand crypto", as if that were the end goal. All it took was to keep quiet for a second, listen to what they had to say, and let them talk about what was problematic, instead of snarking.
I guess the unsaid part of this is “... and curiosity AND are often able to leverage this in data-driven solutions to business problems”. Because nobody cares whether you’re curious or not if you don’t bring any value to the company. With that said, the part you mentioned almost seems like an innate ability, while the part I filled in could probably be trained.
Suppose you were still in college and wanted to become a data scientist. Based on your current knowledge and values, how would you go about it, being a student? Which skills would you hone and how?
If I could redo my twenties, I would tell myself to choose some topic I cared about, find publicly available data on that topic, and just start exploring the dataset with basic pivot tables and graphs. Ideally, write up what you find and publish it somewhere. As you find interesting things in your data, you'll naturally start asking more questions; you'll learn modeling techniques as a function of those questions, and writing about them will help you become a clear communicator (which matters far more than technical knowledge).
> Plot a histogram out of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.
This is commonly taught in undergrad stats, but you shouldn't do this. I'm of the opinion that normality testing in general is usually a red herring, but this is specifically not a productive way of doing it. Use the other methods. A visual test that relies on how much the histogram approximates a bell curve is very prone to error, because a sample from a variety of other distributions can look visually normal even though it isn't.
More broadly speaking, the reason I don't like this is because it's an example of the kind of formulaic, cargo-culted recipes that are often used in statistics without critical thinking. You should strive to obtain a deep understanding of your data and its distribution, and you should be deeply skeptical if the sample you happen to have looks normal. Nature abhors normality, and the central limit theorem can only promise a tendency towards normality as n approaches infinity. It says nothing about what size sample you'll practically need for your specific data to be able to treat it as normal.
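To make the pitfall concrete, here's a small sketch (my own toy example, not from the repo): a heavy-tailed Student's t sample whose histogram looks bell-shaped, while the sample kurtosis and a formal test give it away.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Student's t with 5 degrees of freedom: the histogram looks
# bell-shaped, but the tails are much heavier than a normal's.
sample = rng.standard_t(df=5, size=100_000)

# Excess kurtosis is 0 for a normal distribution; for t(5) it is 6.
k = stats.kurtosis(sample)

# The D'Agostino-Pearson test rejects normality decisively at this
# sample size, even though the histogram "looks normal".
stat, p = stats.normaltest(sample)
print(f"excess kurtosis = {k:.2f}, normality-test p = {p:.2e}")
```

At small n the same sample often sails through a visual check, which is exactly the trap the visual method sets.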
Although normality testing is useless in many situations, the parent comment somewhat overstates the degree of caution required. In many contexts the exact distribution doesn't matter; sort-of-normal is good enough. For example, the t-test is used ubiquitously. It assumes normality, so we would expect possible non-normality to be a major problem, right? Not so. The t-test is extremely robust to departures from normality given equal sample sizes [1,2,3]. Or you can use a so-called non-parametric test. Rather than investing great effort in specifying exactly what distribution you're dealing with, it's more productive to simply use a test that is robust against your unknowns and move on to pursuing your actual objectives.
It's true that if you are interested in predicting events in the tails of the distribution, you really do need to study the distribution in detail. Predicting rare events is very difficult. But if you're just interested in differences between group means, don't overthink it.
 Posten, H.O., Yeh, H.C., and Owen, D.B. (1977). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption. Communications in Statistics - Theory and Methods 11, 109–126.
 Posten, H.O. (1992). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption, part ii. Communications in Statistics - Theory and Methods 21, 2169–2184.
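The robustness claim is easy to sanity-check with a quick Monte Carlo sketch (my own toy setup; the seed and sample sizes are arbitrary): draw two equal-size samples from the same skewed distribution and see how often the t-test falsely rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw two equal-size samples from the SAME skewed (exponential)
# distribution; any rejection is a false positive. If the t-test
# were fragile here, this rate would drift far from the nominal 5%.
n_reps, n = 2000, 30
false_pos = 0
for _ in range(n_reps):
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_pos += 1

rate = false_pos / n_reps
print(f"empirical type I error at nominal 0.05: {rate:.3f}")
```

With equal sample sizes the empirical error stays close to the nominal level despite the strong skew, which is the point made above.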
That's overselling it. It's very sensitive to fat tails and skew. That's why robust testing and estimation is a thing. Wilcox's research would be a good near-contemporary place to start.
 Wilcox, Robustness of Standard Tests https://onlinelibrary.wiley.com/doi/abs/10.1002/978111844511... (paywalled)
Abstract: Conventional hypothesis-testing methods such as Student's t, the ANOVA F test, and methods based on the ordinary least squares regression estimator, are not robust to violations of assumptions. In fact, there are general conditions under which these methods can provide poor control over the probability of a type I error and inaccurate confidence intervals, no matter how large the sample sizes might be. Relatively poor power is yet another [...]

Conventional statistical methods have a very serious flaw. They routinely miss differences among groups or associations among variables that are detected by more modern techniques - even under very small departures from normality. Hundreds of journal articles have described the reasons standard techniques can be unsatisfactory, but simple, intuitive explanations are generally unavailable. Improved methods have been derived, but they are far from obvious or intuitive based on the training most researchers receive. Situations arise where even highly nonsignificant results become significant when analyzed with more modern methods. Without assuming any prior training in statistics, Part I of this book describes basic statistical principles from a point of view that makes their shortcomings intuitive and easy to [...]
Yeah, I shouldn't have said "extremely". It's much more robust than is commonly perceived, but that does not make it "extremely" robust. Thank you for the correction. But please note that I specified equal sample sizes, so the fact that "there are general conditions under which these methods can provide poor control over the probability of a type I error" is not really a refutation—I already implied that the t-test is not generally (= in all cases) robust. I also mentioned non-parametric tests as a potential alternative. I do not wish to imply that the t-test is always the right choice. But I stand by the assertion that approximate knowledge of the distribution, such as obtained by inspecting a histogram, is perfectly adequate to choose a test. The main point remains — list what you know about your data, and pick a test that tells you what you want to know with a tolerable error level for your application.
You can spend an awful lot of time picking the "correct" statistical test (if there is such a thing; tradeoffs exist) with little gain. Worse, making a lot of decisions about what test to use after looking at the data leads to p-hacking, potentially leaving you with more bias than if you naively used a slightly-wrong test from the start.
To elaborate on the t-test discussion:
From your ref , "If sampling is from nonnormal distributions that are absolutely identical, so in particular the variances are equal, the probability of a Type I error will not exceed 0.05 by very much, assuming the method is applied with the desired probability of a Type I error set at α = 0.05. These two results have been known for some time and have been verified in various studies conducted in more recent years" (page 79). I think Wilcox makes this statement under the equal sample size caveat, but the text is unclear on this point. This does not guarantee robustness under nonnormality, but it does refute the idea that nonnormality is always fatal to the t-test.
Yes, a t-test can be affected by skew even with equal sample sizes. Whether this is a problem depends on the application. The example in  uses a lognormal distribution and ends up with actual α = 0.15 for n = 20, desired α = 0.05. Which isn't great, but is somewhat tolerable. Also, a lognormal distribution is strongly skewed right and left-truncated; this is easily noticeable on a plot, and the variable can be transformed to a normal distribution. So this does not show that a sort-of-normal distribution is fatal to the t-test.
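A quick sketch of the "easily noticeable and transformable" point (toy numbers of my own choosing, not from the cited reference):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A lognormal sample is strongly right-skewed (obvious on any plot),
# but taking logs recovers an exactly normal sample.
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
s_raw = stats.skew(x)          # strongly positive
s_log = stats.skew(np.log(x))  # near zero: symmetric again
print(f"skew raw: {s_raw:.2f}, skew after log: {s_log:.2f}")
```

After the log transform, a plain t-test on the transformed values is back on comfortable ground.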
You can look up the potential α level and β level errors from violating the assumptions of a particular test—they're tabulated. E.g., applying the t-test to a Pearson distribution produces a typical α level error of 0.005 at desired α = 0.05 . That's definitely tolerable. This is why I stated that sort-of-normal is good enough for a t-test. It's also fairly straightforward to calculate the expected errors yourself for a particular situation, if reassurance is needed. If your statistics are intended to support a high-stakes decision this may be a good use of time. Working to validate your finding by other means is better, though.
I'm not sure what your ref  (the "Robustness of Standard Tests" book chapter) is arguing for or against from the abstract alone. I don't have a copy of that book. The abstract mentions most flavors of maximum likelihood ratio tests, which is too broad a set of topics to discuss effectively. The implication seems to be that null hypothesis statistical tests are bad and something else (Bayesian analysis? visual inspection?) is better, which I don't necessarily disagree with. If you could please clarify its contents and if it is worth tracking down, I would appreciate it.
 Wilcox, Robustness of Standard Tests
 Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
 Posten, H.O. (1984). Robustness of the Two-Sample T-Test. In Robustness of Statistical Methods and Nonparametric Statistics, D. Rasch, and M.L. Tiku, eds. (Springer, Dordrecht), pp. 92–99.
In the book Wilcox champions robust estimators and tests (a la Huber, Tukey) because efficiency of MLE is very brittle.
And how would you fit the data? There's no single, unique way to fit. And as you say, a small deviation can mean a lot: if you use the L2 distance, the errors in the outer regions of a normal distribution are probably dwarfed by any deviation that occurs closer to the center.
The way to address the problem you speak of is to think hard about the kinds of deviations from equality that would be most damaging to the application (were they to slip in undetected). Once you know the most damaging deviation, you can select a test that is very powerful at detecting that specific type of deviation.
For example, you mention the Kolmogorov(-Smirnov) test. It is practically blind at the tails of the distribution, so if you need to catch deviations in the tails, you can safely skip the KS test; it's no good there. On the other hand, the KS test is very powerful around the median; if your application requires high resolution there, KS is the test you want.
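A rough Monte Carlo sketch of this (my own toy setup, not from the thread): the two-sample KS test readily detects a small location shift, but barely notices a 5% heavy-tailed contamination whose effect lives almost entirely in the tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_power(draw_other, reps=300, n=150):
    """Fraction of runs where ks_2samp rejects at alpha = 0.05."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(size=n)
        if stats.ks_2samp(a, draw_other(n)).pvalue < 0.05:
            hits += 1
    return hits / reps

# Location shift: the CDFs differ most near the center.
power_shift = ks_power(lambda n: rng.normal(loc=0.4, size=n))

# Contaminated normal: 5% of points from N(0, 3^2), so the
# difference from N(0, 1) sits almost entirely in the tails.
def contaminated(n):
    outlier = rng.random(n) < 0.05
    return np.where(outlier, rng.normal(scale=3.0, size=n),
                    rng.normal(size=n))

power_tail = ks_power(contaminated)
print(f"power vs shift: {power_shift:.2f}, "
      f"vs tail contamination: {power_tail:.2f}")
```

The contaminated sample has a very different tail behaviour, yet the KS rejection rate barely rises above the nominal 5%.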
Another related SQL gotcha I've seen multiple times is finding the top n records of each group in a table. It's a know-it-or-you-don't implementation, and the interviewer can still be a jerk if they want, slamming the interviewee for including ties, or for not including them (RANK vs. ROW_NUMBER).
It's telling that there aren't any SQL window function examples in this repo.
Another fun aspect of SQL interviews is dialect-specific questions, particularly around how dates/times are handled. Years ago, a company famous/infamous for primarily using MySQL explicitly noted in a take-home assignment's problem definition that the database was PostgreSQL, which allowed them to ask the aforementioned window function problem, use the AT TIME ZONE syntax for filtering, and assume a specific definition of "beginning of week", which required me to download the database and test it manually.
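For anyone rusty on the ROW_NUMBER vs. RANK tie-handling gotcha mentioned above, here's a minimal sketch using SQLite's window functions (requires a SQLite build with window-function support, 3.25+, as bundled with recent Pythons; the table and column names are made up):

```python
import sqlite3

# Top-N per group, and the tie-handling gotcha.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scores (dept TEXT, name TEXT, score INT);
    INSERT INTO scores VALUES
        ('eng', 'alice', 90), ('eng', 'bob', 90), ('eng', 'carol', 80),
        ('ops', 'dan', 70), ('ops', 'erin', 60);
""")

rows = con.execute("""
    SELECT dept, name, score,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY score DESC) AS rn,
           RANK()       OVER (PARTITION BY dept ORDER BY score DESC) AS rk
    FROM scores
""").fetchall()

# "Top 1 per dept" via ROW_NUMBER silently drops one of the tied
# leaders in eng; via RANK it keeps both alice and bob.
top_by_rownum = [r for r in rows if r[3] == 1]
top_by_rank = [r for r in rows if r[4] == 1]
print(len(top_by_rownum), "vs", len(top_by_rank))  # 2 vs 3
```

Which of the two is "correct" depends entirely on the business question, which is exactly why it makes for an interview trap.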
What kind of companies make people jump through these nerd hoops? Do they actually have real work that needs doing or is this all just posturing by interviewers?
I was an interviewer for a business analyst & data science team recently, and we needed to hire several analysts & data scientists who had strong SQL skills because 90% of the data manipulation/processing used SQL. I was definitely aware that this limits who we hire, but we were very short-staffed and needed people, so it was easiest for us to hire people who had a good understanding of SQL. That said, we did not care about syntax subtleties, just more whether you could generally answer questions using SQL.
And for the data-scientist role there was a lot more than just SQL but it was a useful check.
But not once were the "Gauss-Markov conditions" mentioned. Frequentist theory was only marginally addressed. I taught myself some of that stuff from the Internet, such as hypothesis testing theory, p-values, t statistic, ANOVA, etc.
Also, I'd say I'm good with data structures and algorithms, complexity theory, graph theory etc.
I thought these skills would be a good fit for data science jobs, but I guess it's really such a wide umbrella term, that probably you're more looking for people trained in the frequentist, statistical side of it. What application field are you in, if it's no secret?
But having studied in 3 universities in different countries, CS just doesn't care about stats. Probability theory yes, but frequentist topics like statistical tests not really.
Further, despite having a fantastic reputation, my program only discussed frequentist ideas, with near-zero mention of Bayesian reasoning/methods (outside the same Bayes' rule questions asked in the first weeks of every stats class). Overall the education felt too traditional; I would have liked to have seen more modern methods like the bootstrap, and certainly some mention of Bayesian approaches.
Which is not unreasonable, considering that these things aren't really related to computers (although they happen to involve computation). I think it's just an artifact of history and how things happened that ML is associated with CS and EE departments, but really it's applied math, not a core CS topic like, say, compilers, formal languages, and complexity.
> What is regression? Which models can you use to solve a regression problem?
The current answer only lists some names that have "regression" in them, and its description of what regression is doesn't say anything that distinguishes it from classification.
It fails to mention that regression (in ML terminology) is the prediction of a continuous variable, and that almost any method can be used to do regression: k-NN, neural networks, random forests, SVMs.
If the other answers are of a similar nature, you might fail the interview.
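A tiny sketch of that point (synthetic data; a hand-rolled k-NN rather than any library's implementation): two very different model families predicting the same continuous target.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy linear data: y = 2x + 3 + noise, with x in [0, 10].
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 3.0 + rng.normal(0, 0.5, size=200)

# Family 1: least-squares linear regression (bias column appended).
A = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Family 2: a hand-rolled k-nearest-neighbours regressor.
def knn_predict(x0, k=5):
    """Average the targets of the k nearest training points."""
    idx = np.argsort(np.abs(X[:, 0] - x0))[:k]
    return y[idx].mean()

x_new = 4.0  # true value: 2*4 + 3 = 11
lin_pred = w[0] * x_new + w[1]
knn_pred = knn_predict(x_new)
print(f"linear: {lin_pred:.2f}, knn: {knn_pred:.2f}")
```

Both land near the true value despite having nothing in common internally, which is why "regression" names the task, not a particular algorithm.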
In the late 1800s, Sir Francis Galton noticed that extremely tall or short parents usually had children that were not as tall or short as themselves, i.e. the children's heights were regressing (returning) to the mean. He collected hundreds of data points, graphed them, and estimated a coefficient describing this relationship, thereby inventing "linear regression."
We call them "regression" models simply because the first linear regression model was created to demonstrate the concept of regression to the mean.
Classical regression techniques can be (and are correctly) used on binary, ordinal, or categorical dependent variables. I know we teach people learning ML that the two forms of supervised ML are classification and regression, but that does a disservice, done mostly to make visual teaching examples easier: frame every topic as a binary classification question and then say "oh yeah, this works for regression too".
Granted in an interview you'd probably want to use context in case the hiring people were trained on a specific vocab, but that maybe speaks more to the folly of using these dial-an-answer systems in place of actually learning the material.
> The answers are given by the community
> If you know how to answer a question — please create a PR with the answer
> If there's already an answer, but you can improve it — please create a PR with improvement suggestion
> If you see a mistake — please create a PR with a fix
It’s completely reasonable to assess the quality of informational content without assuming a duty to proofread and revise that content.
Especially when the issue that might be corrected isn’t a typo or malapropism but ‘this whole thing isn’t that great.’
I would be very suspicious of using this repo for studying.
- Interviewers will know "what is known" by every candidate (with the help of these pages) and harder questions will be asked
- If these questions are asked at above-junior levels, then RUN! The work will not satisfy you. The interview should be fun and show the candidate's creativity. These questions could be answered by anyone who read them a few times. I would not like to work with somebody who only knows the answers to these questions and nothing more
This probably won't be the case for a question as basic as, "what is regression?" But for any intermediate to advanced interview question involving regression, I would expect companies to jealously guard it.
If you're earnestly interested in building and testing your knowledge, I would recommend you read The Elements of Statistical Learning and Data Analysis Using Regression and Multilevel/Hierarchical Models. Also a good upper undergrad textbook in probability, like A First Course in Probability.
A First Course in Probability has a lot of problems (with solutions) and worked examples, but it’s light on intuition and pedagogy. It’s not an easy book to learn from, on its own. I highly recommend listening to Joe Blitzstein’s STAT 110 lectures and reviewing the wealth of problems/notes. The greater mastery of probability theory that you have, the easier studying ML and stats is. https://projects.iq.harvard.edu/stat110/home
Elements of Statistical Learning is a true textbook—a comprehensive bible that could occupy you for many thousands of hours. ISLR is the better book for a crash course: http://faculty.marshall.usc.edu/gareth-james/ISL/
There are also lectures and slides from the authors: https://www.dataschool.io/15-hours-of-expert-machine-learnin...
If the goal is a skill hire, typically someone who can maintain an existing, well-documented system, then having them know trivial details and banked information can be quite helpful. Talent hires, on the other hand, I would take in a different direction. If a candidate's only stand-out quality is a clear memorization of banked answers, I would wonder whether they could work from fundamentals.
For example, consider "What is regression?".
The answer given is one way to go, but if the job posting would involve causal analysis or more statistical know-how, for example, it would probably be insufficient. I would want the candidate to speak about linear projections, about sampling assumptions, about probability models underlying the process and so on.
On the other hand, I could imagine that it would not be a good strategy to start to lecture about minute details of regression analysis when applying for a standard data scientist position when sitting on front of "applied data scientists" or even HR folks.
Does anyone have any insights?
The general data science process is that when faced with a problem, you 1) select the appropriate algorithmic tool(s) for the problem and 2) apply the tool(s) to the data. One of the challenges in data science generally is that the tools can get pretty fucking complicated.
The point of theory interview questions in general is to assess the first point, whether or not a candidate has the capacity to pick the right tool for a given problem. They want to hear that you understand some of the standard tools and what sorts of problems they are good for. If you "get" the common tools, you'll likely be able to reason about the application of new/different tools for weird problems as you face them, or so the logic goes.
Everything I just said was more or less objective. This is my opinion: bad companies that do not know how to hire data scientists usually do what you're describing. They ask theory questions to assess whether or not the candidate already understands the specific tools they expect them to use. Good companies tend to pay more attention to whether the candidate is capable of understanding the universe of tools in general and are less worried about their specific application area.
I should note that this is less relevant when hiring consultants or "plug and play" senior people. Of course for those roles you want to know that the candidate has done something similar already and is primed for success.
Ultimately these types of questions like "What is feature selection" are more likely to be encapsulated into case studies where the answer to the question itself will be, using feature selection.
For example: "Let's say you have thousands of categorical features for an anonymized dataset involving human traits, how would you figure out which predictors are the most important?"
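One minimal sketch of an answer shape (synthetic numeric data and a simple correlation screen of my own; a real answer for categorical traits would reach for mutual information, L1 penalties, or tree-based importances):

```python
import numpy as np

rng = np.random.default_rng(7)

# 1000 rows, 20 anonymized features; only feature 3 drives the target.
n, p = 1000, 20
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 3] + rng.normal(size=n)

# Screen: rank features by absolute correlation with the target.
corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
ranking = np.argsort(corrs)[::-1]
print("most important feature:", ranking[0])
```

The screen is deliberately crude; the interview value is in discussing its blind spots (interactions, redundancy, nonlinearity) and what you'd use instead.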
My co-workers and I have been asked exactly that by top companies, although it was more for machine learning engineer / applied scientist positions. Many textbook questions asked one after the other. It wasn't the only interview type they did, but it definitely mattered and was often the first filter. So if you didn't answer well enough, you were out.
According to the author, by reading one hundred pages of the book (plus some extra bonus pages), you will be ready to build complex AI systems, pass an interview or start your own business.
It's a read-first book: buy it later if you think it's good enough.
The answer will most likely be no.