
Data science interview questions with answers - fagnerbrack
https://github.com/alexeygrigorev/data-science-interviews
======
ryndbfsrw
I’ve worked in the field for 7 years now, so not that long, but long enough to
build some heuristics. The best data scientists are just people who try to
understand the ins and outs of business processes and look at problems with
healthy suspicion and curiosity. The ability to explain the nuances of
manifolds in SVMs is not something that comes into it outside these contrived
interviews. I prefer to ask candidates how they would approach solving a
problem I’m facing at that moment rather than these cookie-cutter tests, which
are easy to game and tell me nothing.

~~~
y42
> I prefer to ask candidates how they would approach solving a problem

Word. Totally off topic, but: I have worked in information technology for more
than 20 years now, more or less. Not always the same focus, not always full
time, but always IT related. I consider myself a good problem solver because
of my self-learning and analytical skills.

I recently applied for a job as a BI developer. The interview consisted of 10
questions about SQL. I more or less answered them, just 1 or 2 wrong. Not
wrong as in "not correct" but rather "Not what we exactly expected" or "you
did not see the little traps".

Turns out they didn't take me because of my lack of SQL skills. I do not
understand how this kind of recruiting process helps anyone find skilled
people, or why it is still common practice. It's frustrating for people like
me, who do not have the complete SQL syntax in mind but are flexible in
choosing their problem-solving approaches. A couple of years ago I started at
a big data company, having never heard of MongoDB before and with little
skill in Bash. If they had just asked me questions about those, hiring me
would have been a total no-go. They did hire me. I measurably improved
processes and mastered MongoDB. Nothing that one could expect from a
questionnaire.

A second interview, same outcome. They didn't even ask me detailed questions,
just wanted to know what my SQL skills were. I answered: intermediate, but I'm
good at learning. They did not take me either.

Although I understand that it's hard to evaluate this kind of skill, I'm
really frustrated when I face those "hiring techniques". Or maybe I'm just
not good at SQL, and they anticipated it... ;)

~~~
noodlenotes
There's another possibility but only because you mentioned SQL specifically.
People often use SQL as shorthand for "understand how to manipulate data," and
if a new hire doesn't have this skill it can really set a data team back, so
interviewers are touchy about SQL. It would be helpful if they clarified if it
was the SQL syntax skills or the data manipulation skills they had a problem
with. But generally I agree with you that companies should prioritize problem
solving skills over technical minutiae.

~~~
kvn_95
> _But generally I agree with you that companies should prioritize problem
> solving skills over technical minutiae._

Unless the company's goal is to hire someone with technical minutiae whom
they can pay the lowest wage possible :)

------
fractionalhare
The quality and depth of answers here is pretty inconsistent. But this in
particular is a pet peeve of mine:

> _Plot a histogram out of the sampled data. If you can fit the bell-shaped
> "normal" curve to the histogram, then the hypothesis that the underlying
> random variable follows the normal distribution can not be rejected._

This is commonly taught in undergrad stats, but you shouldn't do this. I'm of
the opinion that normality testing in general is usually a red herring, but
this is specifically not a productive way of doing it. Use the other methods.
A visual test that relies on how much the histogram approximates a bell curve
is very prone to error, because a sample from a variety of other distributions
can look visually normal even though it isn't.
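
As an illustration of that point, the logistic distribution is symmetric and bell-shaped, yet its tails are heavier than a normal's. The sketch below uses only the standard library and compares sample excess kurtosis (about 0 for a normal, 1.2 in theory for a logistic). The sample size and seed are arbitrary choices for the example.

```python
import math
import random


def excess_kurtosis(xs):
    """Sample excess kurtosis: near 0 for normal data, positive for heavier tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def logistic_sample(rng, n):
    """Draw n values from the standard logistic distribution via its inverse CDF."""
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:  # guard against log(0)
            out.append(math.log(u / (1.0 - u)))
    return out


rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
logistic = logistic_sample(rng, 200_000)

print("normal   excess kurtosis:", round(excess_kurtosis(normal), 2))
print("logistic excess kurtosis:", round(excess_kurtosis(logistic), 2))
```

A histogram of a few hundred logistic draws is hard to tell apart from a normal one by eye, while a tail-sensitive summary like this one separates them once the sample is large enough.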

More broadly speaking, the reason I don't like this is because it's an example
of the kind of formulaic, cargo-culted recipes that are often used in
statistics without critical thinking. You should strive to obtain a deep
understanding of your data and its distribution, and you should be deeply
skeptical if the sample you happen to have looks normal. Nature abhors
normality, and the central limit theorem only promises that sample means
_tend_ towards normality as _n_ approaches infinity. It says nothing about what
sample size you'll practically need for your specific data before you can
treat it as normal.
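
A rough way to see how slowly that tendency can kick in is to simulate it. The sketch below assumes Exponential(1) data (an arbitrary skewed choice) and measures the skewness of the sampling distribution of the mean for a few sample sizes. In theory that skewness is 2/sqrt(n), so it shrinks but does not vanish quickly.

```python
import random


def skewness(xs):
    """Sample skewness: 0 for a symmetric distribution."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def skew_of_sample_means(n, reps=10_000, seed=0):
    """Skewness of the mean of n Exponential(1) draws, estimated over `reps` repeats."""
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    return skewness(means)


for n in (5, 30, 100):
    print(n, round(skew_of_sample_means(n), 2))
```

For more strongly skewed or heavy-tailed data the required n can be far larger, which is the practical point: the theorem gives the limit, not the rate for your data.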

~~~
jpeloquin
> You should strive to obtain a deep understanding of your data and its
> distribution, and you should be deeply skeptical if the sample you happen to
> have looks normal.

Although normality testing is useless in many situations, the parent comment
somewhat overstates the degree of caution required. In many contexts the exact
distribution doesn't matter; sort-of-normal is good enough. For example, the
t-test is used ubiquitously. It assumes normality, so we would expect possible
non-normality to be a major problem, right? Not so. The t-test is extremely
robust to departures from normality given equal sample sizes [1,2,3]. Or you
can use a so-called non-parametric test. Rather than investing great effort in
specifying exactly what distribution you're dealing with, it's more productive
to simply use a test that is robust against your unknowns and move on to
pursuing your actual objectives.

It's true that if you are interested in predicting events in the tails of the
distribution, you really do need to study the distribution in detail.
Predicting rare events is very difficult. But if you're just interested in
differences between group means, don't overthink it.

[1]
[http://www.jerrydallal.com/LHSP/student3.htm](http://www.jerrydallal.com/LHSP/student3.htm)
[2] Posten, H.O., Yeh, H.C., and Owen, D.B. (1982). Robustness of the two-
sample t-test under violations of the homogeneity of variance assumption.
Communications in Statistics - Theory and Methods 11, 109–126.
[3] Posten, H.O. (1992). Robustness of the two-sample t-test under violations
of the homogeneity of variance assumption, part ii. Communications in
Statistics - Theory and Methods 21, 2169–2184.
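
One example of the non-parametric route mentioned above is a permutation test on the difference in group means. This is a minimal sketch, not the only such test: the function name, the made-up data and the number of reshuffles are all assumptions for illustration.

```python
import random


def permutation_test(a, b, reps=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose absolute mean difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (reps + 1)


# Made-up, skewed measurements for two groups of equal size.
group_a = [1.2, 0.8, 1.5, 0.9, 1.1, 4.0, 1.3, 0.7]
group_b = [2.9, 3.4, 2.2, 8.5, 3.1, 2.7, 3.8, 2.5]
print("p-value:", permutation_test(group_a, group_b))
```

The test makes no normality assumption; it only assumes the group labels are exchangeable under the null hypothesis.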

~~~
srean
> The t-test is extremely robust to departures from normality given equal
> sample sizes

That's overselling it. It's very sensitive to fat tails and skew. That's the
reason robust testing and estimation are a thing. Wilcox's research would be a
good near-contemporary place to start [0][1].

[0] Wilcox, Robustness of Standard Tests
[https://onlinelibrary.wiley.com/doi/abs/10.1002/978111844511...](https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat06350)
(paywalled)

    
    
        Abstract: Conventional hypothesis‐testing methods such as
        student's t, the ANOVA F test, and methods based on the
        ordinary least squares regression estimator, are not
        robust to violations of assumptions. In fact, there are
        general conditions under which these methods can provide
        poor control over the probability of a type I error and
        inaccurate confidence intervals, no matter how large the
        sample sizes might be. Relatively poor power is yet another 
        concern.
    

[1] Fundamentals of Modern Statistical Methods: Substantially Improving Power
and Accuracy
[https://books.google.co.in/books?id=uUNGzhdxk0kC](https://books.google.co.in/books?id=uUNGzhdxk0kC)

    
    
       Conventional statistical methods have a very serious
       flaw. They routinely miss differences among groups or
       associations among variables that are detected by more
       modern techniques - even under very small departures from
       normality. Hundreds of journal articles have described
       the reasons standard techniques can be unsatisfactory,
       but simple, intuitive explanations are generally
       unavailable. Improved methods have been derived, but they
       are far from obvious or intuitive based on the training
       most researchers receive. Situations arise where even
       highly nonsignificant results become significant when
       analyzed with more modern methods. Without assuming any
       prior training in statistics, Part I of this book
       describes basic statistical principles from a point of
       view that makes their shortcomings intuitive and easy to
       understand.

~~~
jpeloquin
> It's very sensitive to fat tails and skew. That's the reason robust testing
> and estimation is a thing.

Yeah, I shouldn't have said "extremely". It's much more robust than is commonly
perceived, but that does not make it "extremely" robust. Thank you for the
correction. But please note that I specified _equal sample sizes_, so the
fact that "there are general conditions under which these methods can provide
poor control over the probability of a type I error" is not really a
refutation—I already implied that the t-test is not generally (= in all cases)
robust. I also mentioned non-parametric tests as a potential alternative. I do
not wish to imply that the t-test is always the right choice. But I stand by
the assertion that approximate knowledge of the distribution, such as obtained
by inspecting a histogram, is perfectly adequate to choose a test. The main
point remains — list what you know about your data, and pick a test that tells
you what you want to know with a tolerable error level for your application.

You can spend an awful lot of time picking the "correct" statistical test (if
there is such a thing; tradeoffs exist) with little gain. Worse, making a lot
of decisions about what test to use after looking at the data leads to
p-hacking, potentially leaving you with more bias than if you naively used a
slightly-wrong test from the start.

To elaborate on the t-test discussion:

From your ref [1], "If sampling is from nonnormal distributions that are
absolutely identical, so in particular the variances are equal, the
probability of a Type I error will not exceed 0.05 by very much, assuming the
method is applied with the desired probability of a Type I error set at α =
0.05. These two results have been known for some time and have been verified
in various studies conducted in more recent years" (page 79). I think Wilcox
makes this statement under the equal sample size caveat, but the text is
unclear on this point. This does not guarantee robustness under nonnormality,
but it does refute the idea that nonnormality is always fatal to the t-test.

Yes, a t-test can be affected by skew even with equal sample sizes. Whether
this is a problem depends on the application. The example in [1] uses a
lognormal distribution and ends up with actual α = 0.15 for n = 20, desired α
= 0.05. Which isn't great, but is somewhat tolerable. Also, a lognormal
distribution is strongly right-skewed and bounded on the left, which is easily
noticeable on a plot, and it can be transformed to a normal distribution. So
this does not refute the idea that a sort-of-normal distribution is good
enough for the t-test.

You can look up the potential α level and β level errors from violating the
assumptions of a particular test—they're tabulated. E.g., applying the t-test
to a Pearson distribution produces a typical α level error of 0.005 at desired
α = 0.05 [2]. That's definitely tolerable. This is why I stated that sort-of-
normal is good enough for a t-test. It's also fairly straightforward to
calculate the expected errors yourself for a particular situation, if
reassurance is needed. If your statistics are intended to support a high-
stakes decision this may be a good use of time. Working to validate your
finding by other means is better, though.
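
As a sketch of what "calculate the expected errors yourself" can look like, here is a small Monte Carlo estimate of the Type I error rate of the pooled two-sample t-test with equal group sizes of 20, when both groups are drawn from the same distribution. The lognormal parameters, repetition count and seed are assumptions chosen for illustration, and the critical value is hardcoded rather than computed.

```python
import math
import random

N = 20
# Two-sided 5% critical value of Student's t with 2 * N - 2 = 38 degrees of
# freedom; hardcoded to keep the sketch dependency-free (valid only for N = 20).
T_CRIT_DF38 = 2.024


def pooled_t(a, b):
    """Classic equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))


def type_i_rate(draw, reps=10_000, seed=0):
    """Share of false rejections when both groups come from the same `draw`."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [draw(rng) for _ in range(N)]
        b = [draw(rng) for _ in range(N)]
        if abs(pooled_t(a, b)) > T_CRIT_DF38:
            rejections += 1
    return rejections / reps


normal_rate = type_i_rate(lambda r: r.gauss(0.0, 1.0))
lognormal_rate = type_i_rate(lambda r: r.lognormvariate(0.0, 1.0))
print("normal   :", normal_rate)
print("lognormal:", lognormal_rate)
```

This only covers the identical-distributions case from the Wilcox quote above. Swapping in different shapes, unequal sizes or a one-sample test is a matter of changing `draw` and the statistic, and the answer can change a lot.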

I'm not sure what your ref [0] (the "Robustness of Standard Tests" book
chapter) is arguing for or against from the abstract alone. I don't have a
copy of that book. The abstract mentions most flavors of maximum likelihood
ratio tests, which is too broad a set of topics to discuss effectively. The
implication seems to be that null hypothesis statistical tests are bad and
something else (Bayesian analysis? visual inspection?) is better, which I
don't necessarily disagree with. If you could please clarify its contents and
if it is worth tracking down, I would appreciate it.

[0] Wilcox, Robustness of Standard Tests

[1] Fundamentals of Modern Statistical Methods: Substantially Improving Power
and Accuracy

[2] Posten, H.O. (1984). Robustness of the Two-Sample T-Test. In Robustness of
Statistical Methods and Nonparametric Statistics, D. Rasch, and M.L. Tiku,
eds. (Springer, Dordrecht), pp. 92–99.

~~~
srean
No worries, and I totally agree that t-tests are surprisingly robust to some
benign deviations from Gaussianity, a lot more than one would have thought. In
my line of work I have had to watch out for fat tails (ridiculously common)
and skew -- they can be potent t-test killers.

In the book Wilcox champions robust estimators and tests (a la Huber, Tukey)
because efficiency of MLE is very brittle.

------
minimaxir
The weirdest thing about data science interviews (when I was actively
interviewing) is SQL gotcha questions. Especially with window functions.
Here's a long HN thread from a few months ago about an annoying situation with
an interviewer asserting uncommon syntax is common:
[https://news.ycombinator.com/item?id=23053981](https://news.ycombinator.com/item?id=23053981)

Another related SQL gotcha I saw _multiple_ times is finding the top _n_
records of each group in a table. Which is a know-it-or-you-don't
implementation, and the interviewer can still be a jerk if they want by
slamming the interviewee if they include ties, or not (RANK vs. ROW_NUMBER).
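
For readers who have not seen the pattern, here is a minimal sketch of top-n-per-group and the ties issue, run through Python's built-in `sqlite3` module. The `scores` table and its rows are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (grp TEXT, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("a", "x", 10), ("a", "y", 10), ("a", "z", 5),
     ("b", "p", 7), ("b", "q", 3)],
)


def top_n(window_fn, n):
    """Top-n rows per group by score, ranked with the given window function."""
    # window_fn is interpolated directly, so only pass trusted function names.
    query = f"""
        SELECT grp, name FROM (
            SELECT grp, name,
                   {window_fn}() OVER (
                       PARTITION BY grp ORDER BY score DESC
                   ) AS pos
            FROM scores
        ) AS ranked
        WHERE pos <= ?
        ORDER BY grp, name
    """
    return conn.execute(query, (n,)).fetchall()


# ROW_NUMBER: exactly one row per group; which tied row wins is arbitrary.
print(top_n("ROW_NUMBER", 1))
# RANK: tied rows share rank 1, so group "a" returns both "x" and "y".
print(top_n("RANK", 1))
```

Neither result is wrong. Whether ties should be included is a requirements question, which is why it works so well as a gotcha.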

It's telling that there aren't any SQL window function examples in this repo.

Another fun aspect of SQL interviews is dialect-specific questions,
particularly with how date/times are handled. Years ago, a company
famous/infamous for primarily using MySQL explicitly noted in a take-home
assignment problem definition that the database was PostgreSQL, which allowed
them to ask the aforementioned window function problem, require the AT TIME
ZONE syntax for filtering, and use a specific definition of "beginning of
week", which required me to download the database and test it manually.

~~~
claudeganon
These kinds of stories seem insane to me. I currently have to do some data
science-y stuff in the more generalist consulting dev role I have. I used to
do a lot of SQL stuff years ago, but have forgotten most of the syntax beyond
the basics. That being said, all it took is some minor googling and reading of
a few blog posts to solve some middling hard problems for my client.

What kind of companies make people jump through these nerd hoops? Do they
actually have real work that needs doing or is this all just posturing by
interviewers?

~~~
gerbler
I think an element of it is that you can have unambiguous questions with SQL,
which taken to the extreme do fall into gotcha territory.

I was an interviewer for a business analyst & data science team recently and
we needed to hire several analysts & data-scientists who had strong SQL skills
because 90% of the data manipulation/processing used SQL. I was definitely
aware that this limits who we hire, but we were very short-staffed and needed
people so it was easiest for us to hire people who had a good understanding of
SQL. That said, we did not care about syntax subtleties, just whether you
could generally answer questions using SQL.

And for the data-scientist role there was a lot more than just SQL but it was
a useful check.

------
bobdosherman
Linear regression does not require errors to be iid, normal, and homoscedastic
for it to "work". One of the ways to separate candidates is to push on what
(and how) assumptions can be weakened, what the consequences are for
estimation and inference, and what sort of corrections can be incorporated for
maintaining consistency, improving efficiency, correcting biases, etc. An
entry-level candidate may not have (nor need to have) a complete understanding
of asymptotic theory, but they should know what the purpose of robust standard
errors is and how to use them.
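
To make the robust standard error point concrete, here is a minimal sketch for simple regression with an intercept, comparing the classical slope standard error with the HC0 (White) heteroscedasticity-robust one. The simulated data, where the error spread grows with |x|, is an assumption made for the example.

```python
import math
import random


def simple_ols(xs, ys):
    """Slope of y on x (with intercept), plus classical and HC0 robust SEs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    dx = [x - x_bar for x in xs]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (y - y_bar) for d, y in zip(dx, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
    # Classical SE assumes a single common error variance.
    s2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(s2 / sxx)
    # HC0 SE lets each observation carry its own squared residual.
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, classical_se, robust_se


rng = random.Random(0)
xs = [rng.uniform(-5.0, 5.0) for _ in range(5_000)]
# Error spread grows with |x|, so the errors are heteroscedastic.
ys = [1.0 + 2.0 * x + rng.gauss(0.0, abs(x)) for x in xs]

slope, classical_se, robust_se = simple_ols(xs, ys)
print("slope:", round(slope, 3))
print("classical SE:", round(classical_se, 4), "robust SE:", round(robust_se, 4))
```

The slope estimate is still fine here; what changes is the inference. In this setup the classical standard error understates the uncertainty, and the robust one corrects for that.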

~~~
bonoboTP
Interesting and perhaps shows the cultural differences between ML and stats
people. I took a machine learning course in my bachelor's and two more ML
courses in my master's (CS). These weren't some "deep learning lite", mess-
around-in-Keras courses, because DL wasn't even big back then. We covered lots
of stuff, Bayesian linear regression, Gaussian processes, Gibbs sampling,
Metropolis-Hastings, hierarchical Dirichlet processes, SVMs, multi-class SVM,
PCA, kernel PCA, perceptrons, CMAC neural nets, Hebbian learning, AdaBoost,
Fisher vectors, EM algorithms for various distributions, fuzzy logic,
optimization methods like conjugate gradients etc etc.

But not once were the "Gauss-Markov conditions" mentioned. Frequentist theory
was only marginally addressed. I taught myself some of that stuff from the
Internet, such as hypothesis testing theory, p-values, t statistic, ANOVA,
etc.

Also, I'd say I'm good with data structures and algorithms, complexity theory,
graph theory etc.

I thought these skills would be a good fit for data science jobs, but I guess
it's really such a wide umbrella term, that probably you're more looking for
people trained in the frequentist, statistical side of it. What application
field are you in, if it's no secret?

~~~
srean
By tribe I am firmly in the machine learning camp but I have serious doubts
that one can be a good hands-on data-science practitioner if one does not have
a good foundation in statistics.

~~~
bonoboTP
Statistics is not really in focus in most CS programs. Indeed I'm not sure
where it is. Perhaps in applied math programs. Stats is kinda too dirty and
real-world for pure math types, and in science programs and medicine it's
usually just taught as a bunch of magic formulas to memorize and rules of
thumb passed down from generation to generation without understanding. Perhaps
physics departments do have both the necessary math skills and the need for
stats so they may provide a good education in this.

But having studied in 3 universities in different countries, CS just doesn't
care about stats. Probability theory yes, but frequentist topics like
statistical tests not really.

~~~
jmt_
This was my experience at a high ranking engineering public state school in
the US. Stats is delegated to the applied math program usually (in fact, my
degree was titled "Applied Mathematics & Statistics"). You can choose to
concentrate in subjects like algorithms, operations research, fin stuff,
statistics, etc. CS as well as other sciences had one required intro to prob &
stats, but that's it outside electives.

Further, despite having a fantastic reputation, my program only discussed
frequentist ideas with near 0 mention of Bayesian reasoning/methods (outside
the same Bayes rule questions asked in the first weeks of every stats class).
Overall the education felt too traditional. I would have liked to have seen
mention of more modern methods like the bootstrap, and certainly of Bayesian
methods.

~~~
srean
> Stats is delegated to the applied math program usually

Which is not unreasonable considering that these things aren't really related
to computers (although they happen to involve computation). I think it's just
an artifact of history and how things happened that ML is associated with CS
and EE departments, but really it's applied math, not a core CS topic like, say,
compilers, formal languages and complexity.

------
KKPMW
I only read the first question in theory.md and think the answer is quite
weak.

> What is regression? Which models can you use to solve a regression problem?

The current answer only lists some names that have "regression" in them, and
the description of what a regression is doesn't say anything that distinguishes
it from classification.

It fails to mention that regression (in ML terminology) is prediction of a
continuous variable, and that almost any method can be used to do regression:
k-NN, neural networks, random forests, SVMs.
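
As a small illustration that a method without "regression" in its name can do regression, here is a sketch of k-nearest-neighbours predicting a continuous target. The single-feature setup and the data are made up for the example.

```python
def knn_regress(train, x, k=3):
    """Predict a continuous target as the mean of the k nearest neighbours.

    `train` is a list of (feature, target) pairs with one numeric feature.
    """
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up data: floor area in square metres -> price in thousands.
train = [(30, 100.0), (45, 150.0), (50, 160.0), (70, 230.0), (90, 300.0)]
print(knn_regress(train, 48, k=3))
```

Replacing the mean of the neighbours' targets with a majority vote over their labels would turn the same code into a classifier, which is exactly the distinction the answer should have drawn.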

If other answers are of similar nature you might fail the interview.

~~~
dgellow
From the README:

> The answers are given by the community

> If you know how to answer a question — please create a PR with the answer

> If there's already an answer, but you can improve it — please create a PR
> with improvement suggestion

> If you see a mistake — please create a PR with a fix

~~~
kristjansson
I don’t like this response.

It’s completely reasonable to assess the quality of informational content
without assuming a duty to proofread and revise that content.

Especially when the issue that might be corrected isn’t a typo or malapropism
but ‘this whole thing isn’t that great.’

~~~
dgellow
Sure, I’m not saying that you have to correct it. Just that the project
doesn’t assume that it has correct answers for everything and expects some
participation of “the community” at large.

------
latentdeepspace
Do people realize that these interview question collections do not help? I
think there are 2 things to address here:

- Interviewers will know "what is known" by every candidate (with the help of
these pages), and harder questions will be asked.

- If these questions are asked at >junior levels, then RUN! The work will not
satisfy you. The interview should be fun and show the creativity of the
candidate. These ones could be answered by anyone who has read them a few
times. I would not like to work with somebody who only knows the answers to
these questions and nothing more.

~~~
marcusabu
I think it's a bit far-fetched to assume that recruiters tailor interview
questions based on these GitHub repos. For me as a junior data scientist it's
really useful to test my own knowledge and highlight areas which I need to
study more.

~~~
fractionalhare
Lots of tech and finance companies (particularly those with standardized
interview processes) will blacklist questions if they're found online. Those
companies will constantly check GitHub, GeeksForGeeks and Leetcode to see if
their questions are listed there with solutions.

This probably won't be the case for a question as basic as, "what is
regression?" But for any intermediate to advanced interview question
_involving_ regression, I would expect companies to jealously guard it.

If you're earnestly interested in building and testing your knowledge, I would
recommend you read _The Elements of Statistical Learning_ and _Data Analysis
Using Regression and Multilevel/Hierarchical Models_. Also a good upper
undergrad textbook in probability, like _A First Course in Probability_.

~~~
jointpdf
A couple recommendations piggybacking off of yours:

 _A First Course in Probability_ has a lot of problems (with solutions) and
worked examples, but it’s light on intuition and pedagogy. It’s not an easy
book to learn from, on its own. I highly recommend listening to Joe
Blitzstein’s STAT 110 lectures and reviewing the wealth of problems/notes. The
greater mastery of probability theory that you have, the easier studying ML
and stats is.
[https://projects.iq.harvard.edu/stat110/home](https://projects.iq.harvard.edu/stat110/home)

 _Elements of Statistical Learning_ is a true textbook—a comprehensive bible
that could occupy you for many thousands of hours. ISLR is the better book for
a crash course:
[http://faculty.marshall.usc.edu/gareth-james/ISL/](http://faculty.marshall.usc.edu/gareth-james/ISL/)

There are also lectures and slides from the authors:
[https://www.dataschool.io/15-hours-of-expert-machine-learnin...](https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)

~~~
disgruntledphd2
Also, Regression and Other Stories is the new edition of the Regression with
Multilevel models book, and it's much, much better (especially for n00bs).

------
zwaps
Very interesting. I wonder in what type of interviews these answers would be
considered ideal. Do people at OpenAI ask these sorts of questions, or is this
more targeted toward other industries that require data scientists but are not
populated by "stats" experts?

For example, consider "What is regression?". The answer given is one way to
go, but if the job involved causal analysis or more statistical know-how, for
example, it would probably be insufficient. I would want the candidate to speak
about linear projections, about sampling assumptions, about probability models
underlying the process and so on.

On the other hand, I could imagine that it would not be a good strategy to
start lecturing about minute details of regression analysis when applying for
a standard data scientist position and sitting in front of "applied data
scientists" or even HR folks.

Does anyone have any insights?

~~~
smeeth
My experience is that interviews like these are for supporting role data
science jobs. E.g. company x has a product (tech or not), they have some data,
and they want to hire someone to make that data useful in improving their
product or selling more of it.

The general data science process is that when faced with a problem, you 1)
select the appropriate algorithmic tool(s) for the problem and 2) apply the
tool(s) to the data. One of the challenges in data science generally is that
the tools can get pretty fucking complicated.

The point of theory interview questions in general is to assess the first
point, whether or not a candidate has the capacity to pick the right tool for
a given problem. They want to hear that you understand some of the standard
tools and what sorts of problems they are good for. If you "get" the common
tools, you'll likely be able to reason about the application of new/different
tools for weird problems as you face them, or so the logic goes.

Everything I just said was more or less objective. This is my opinion: bad
companies that do not know how to hire data scientists usually do what you're
describing. They ask theory questions to assess whether or not the candidate
already understands the specific tools they expect them to use. Good companies
tend to pay more attention to whether the candidate is capable of
understanding the universe of tools in general and are less worried about
their specific application area.

I should note that this is less relevant when hiring consultants or "plug and
play" senior people. Of course for those roles you want to know that the
candidate has done something similar already and is primed for success.

------
data4lyfe
I have to agree that these interview questions function more as a cheatsheet
review than actually anything practical that would be seen in an interview.
Data science interviews don't function as a biology test where you're just
rattling off memorizations to how neural networks or linear models work.

Ultimately, questions like "What is feature selection" are more likely to be
encapsulated in case studies where the answer to the question itself will be
to use feature selection.

For example: "Let's say you have thousands of categorical features for an
anonymized dataset involving human traits, how would you figure out which
predictors are the most important?"

Source: [https://www.interviewquery.com/](https://www.interviewquery.com/)

~~~
marcinzm
> I have to agree that these interview questions function more as a cheatsheet
> review than actually anything practical that would be seen in an interview.
> Data science interviews don't function as a biology test where you're just
> rattling off memorizations to how neural networks or linear models work.

My co-workers and I have been asked exactly that by top companies, although it
was more for machine learning engineer/applied scientist positions. Many
textbook questions asked one after the other. It's not the only interview type
they did but it definitely mattered and was often the first filter. So if you
didn't answer well enough then you were out.

------
rhacker
These are at best machine learning questions, not data science. ML is
definitely a sub-category of data science, but if you see a job posting for
data science, you're 100% not going to be doing machine learning. That would
have been labeled as a machine learning job posting.

------
teleforce
For Machine Learning theory related to Data Science I'd highly recommend "The
Hundred-Page Machine Learning Book" by Andriy Burkov [1].

According to the author, by reading one hundred pages of the book (plus some
extra bonus pages), you will be ready to build complex AI systems, pass an
interview or start your own business.

It is a read-first, buy-later book: you pay if you think it is good enough.

[1] [http://themlbook.com/](http://themlbook.com/)

------
marcinzm
I've seen multiple companies ask candidates to write working code for machine
learning end-to-end from scratch. As in, write a stochastic gradient descent
logistic regression model with training, inference, etc. without any libraries
beyond pandas/numpy. If you're lucky they'll provide you the equations or let
you google them. So it's something to memorize, including the various
numpy/pandas gotchas.
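
For a sense of scale, a bare-bones version of that exercise might look like the sketch below. It uses plain Python lists instead of numpy so it runs anywhere, and the learning rate, epoch count and synthetic data are arbitrary assumptions, not what any particular interviewer expects.

```python
import math
import random


def sigmoid(z):
    # Clamp z to avoid overflow in exp for extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(xs, ys, lr=0.1, epochs=50, seed=0):
    """Fit weights and bias by SGD on the log loss, one example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            err = p - ys[i]  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Class label (0 or 1) at a 0.5 probability threshold."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Synthetic, linearly separable data: label is 1 when x1 + x2 > 0.
rng = random.Random(1)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
ys = [1 if x1 + x2 > 0 else 0 for x1, x2 in xs]

w, b = train_logreg(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)
print("training accuracy:", accuracy)
```

The part worth remembering is that the per-example gradient reduces to (prediction minus label) times the input; the rest is bookkeeping.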

~~~
conjectures
That's a great question. Because it lets you as a candidate screen out places
run by morons.

~~~
bitxbit
Years back someone asked me an interview question to explain sd and confidence
intervals. I just gave the wrong answers and left. Especially in nascent
fields such as ML and DS, it’s very important to work with people who actually
know what they are doing. Otherwise you will have wasted years of your
precious twenties doing absolutely nothing productive.

