A Student's Guide to Preparing for Data Science Interviews (acheronanalytics.com)
129 points by rogocopH on July 21, 2017 | 59 comments

Unfortunately there is very little here that isn't just general interview advice.

As a new grad that went through the hunt very recently - it was a messy process. Very few places will consider you without extensive experience, or a masters/Ph.D. Of course if you're hiring people to research machine learning algorithms that's justifiable, but plenty of the responsibilities people associate with data scientists don't require advanced degrees.

And the number of posts asking for 5, 7, even 10 years of experience... absolutely astounding.

As someone uninterested in going back to school, I've resigned myself to getting some work experience and doing personal projects for 1-2 yrs before trying again.

I recently graduated from the University of Pennsylvania with a master's in computer science, after 9 years' experience as a DBA and in GIS, plus 1 year working for a social media lab doing natural language processing (co-authored several papers) and 1 year working with big data (10 TB in size).

I applied to ~85 data science positions. I can't even get recruiters to call me for a phone screening, so don't feel down =)

Unfortunately most people posting data science jobs don't really have good priors on the necessary qualifications for those positions. As someone who has gotten multiple data science positions in startups and larger enterprises, I highly recommend applying to data science jobs through platforms like AngelList, which allow you to speak to employers directly and circumvent many of the bureaucratic processes that tend to discourage people from applying to jobs they are actually well qualified for.

I just graduated undergrad as well. The biggest factors assisting me in getting a DS position were:

1. Previous internship in data science
2. Experience developing R packages and putting them on GitHub
3. Really having statistical theory down pat

You don't need an advanced degree to be a data scientist, but you need a strong understanding of stats and how to work with data. Having an advanced degree is a good indicator that you can do that. But it's not a prerequisite for an undergrad: GitHub, internships, and TA-ships can make up for it.

I think one advantage is that while PhDs are typically very good at the research process and the techniques used in their research, undergrads could be more flexible and adaptive to different situations.

spoopy01 nailed it. Most of the people hiring data scientists don't really know what data science is, or probably even why they need it. So they inflate the hell out of the requirements to a. CYA and b. hopefully get somebody so experienced they can come in and make up for the lack of organizational understanding of data science. IOW, they want somebody who can "teach us what we don't know".

Teaching me what I don't know is what I want with all of my engineering hires. I want people better than me who will tell me why my architecture isn't ideal or whatever the case is.

Yes, but I'd argue that's on a different level from "teach us why we need an 'architecture'" or "teach us how to use this 'data science' stuff". Some people are trying to be buzzword compliant when they don't actually understand the buzzwords. Ya know?

Or maybe there's a better explanation for asking for a candidate with a Ph.D. in Statistics to create a linear regression model in Excel. Because truth is, for many companies that's all they need.

> a Ph.D. in Statistics to create a linear regression model in Excel

Linear regression, logistic regression and k-means clustering, if you can get a project into actual real-money production on one of those, you are already well ahead of 90% of data scientists. And these techniques are decades old!
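
To give a sense of how small the first of these techniques is, here is a minimal sketch of ordinary least squares with a single predictor, in plain Python. The `fit_line` name and the numbers are made up for illustration; this is not anyone's production code, and the hard part the comment points at (getting it into production) is not shown.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: five observations of one predictor and one response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```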

I get that it's difficult to get these gigs..

But also recall that there are a lot of people with general computer backgrounds and a whole lot more real-world work experience who could pick up the basics of data science stuff as well..

These jobs tend to sit in highly placed, strategic parts of a company, where the decisions made from the data are critical to overall company health. Some of what is being requested by requiring this level of credentials is real-world work experience and a certain level of maturity/professionalism that is often implied by it as well..

If the level of skill is basic enough that any other regular employee could learn it in short order, why wouldn't they, and then have the cushy 'sit around and play with numbers all day' job too? There needs to be enough 'there there' to placate internal politics as well..

Further, a more advanced person would also likely need to take leadership roles in picking and rolling out solutions with long-term costs attached, training other employees, etc., which also comes with the professionalism/experience part..

Most of these listings have over-inflated experience requirements. A lot of them will entertain you even if you have half the experience they ask for.

Good to know. As a new grad I couldn't really claim to have any experience, so that wouldn't have helped me. I was just commenting that these requirements were absurd, considering that the pool of people meeting them would be so small.

Definitely will take note of that in the future though.

So I've been interviewing for data analyst/science positions since leaving Apple in April.

I may do a postmortem on my search later, but speaking from my experience with many, many interviews over the past couple months, the TL;DR is that the conventional interview wisdom on Hacker News/the cscareerquestions subreddit/this article is wrong and out of date. Interviews for such positions require a different set of skills than just reading Cracking the Coding Interview (and ones that you can't get at a data bootcamp).

What kinds of technical questions do they ask in a data science interview?

On the stats side, often higher-level theory questions, such as "How does the k-means algorithm work?", "How do you select the best k for k-means?", "What is the curse of dimensionality?" which again would not be things covered at a data boot camp or in data science thought pieces on Medium.
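
As a rough illustration of the first two questions, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It assumes random initialisation from the data points and squared Euclidean distance; the `kmeans` function, its parameters, and the toy points are all made up for this example. It returns the inertia (within-cluster sum of squares), and one common heuristic for choosing k is to look for the "elbow" where inertia stops dropping sharply as k grows.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Initialise centroids as k distinct data points (a simple, common choice).
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, inertia = kmeans(points, k=2)

# Inertia for several values of k, as input to the elbow heuristic.
inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
print(centroids, inertia_by_k)
```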

On the technical side, there is often more-advanced SQL (nested JOINs + PostgreSQL window functions). On the big data side, there is often discussion of distributed systems (e.g. Spark clusters) and practical algorithmic complexity at scale (i.e. instant fail if you suggest anything loglinear or slower).
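
For readers who have not met window functions, here is a small sketch of the kind of query that might come up. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption for portability; it needs SQLite 3.25 or newer), and the `orders` table and its rows are invented. The `OVER (PARTITION BY ... ORDER BY ...)` syntax shown is standard SQL and also works in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 1, 10.0),
        ("ann", 2, 5.0),
        ("bob", 1, 7.0),
        ("ann", 3, 20.0),
        ("bob", 2, 3.0),
    ],
)

# Per customer: a running total over time, and each order's rank by amount.
rows = conn.execute(
    """
    SELECT customer, day, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY day) AS running_total,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS amount_rank
    FROM orders
    ORDER BY customer, day
    """
).fetchall()

for row in rows:
    print(row)
```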

I overheard some colleagues talking about a recent interview where a candidate with "stellar industry experience", i.e. Kaggle wins and previous ML experience at a valley company, couldn't explain Bayes rule to them, let alone rederive Naive Bayes. While books like the one below are extremely theoretical, anyone interviewing for these kinds of roles should spend at least a week or two just looking through this to see what kind of algorithms and properties are studied in theory.

Foundations of Data Science (Blum, Hopcroft, Kannan) https://www.cs.cornell.edu/jeh/book2016June9.pdf

This is like not hiring a [big name coding competition] winner because he didn't know radix sort.

I think it's more like not hiring a big name coding competition winner because they never bothered to learn how to use version control, or any coding best practice, or any language other than C.

Trying to do data science with zero knowledge of the fundamentals of probability is dangerous. Bayes rule isn't some kind of deep magic, it's covered within the first few lectures of an undergraduate probability course and it's absolutely necessary to understand the output of any machine learning model.
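
A worked example of why this matters when reading model output, as a minimal sketch: suppose a classifier flags a rare condition. The prevalence, true positive rate, and false positive rate below are made-up numbers chosen only to illustrate Bayes rule, P(A|B) = P(B|A) P(A) / P(B).

```python
from fractions import Fraction

# Hypothetical numbers, chosen for illustration only.
prior = Fraction(1, 100)                 # P(condition): 1% of cases are positive
true_positive_rate = Fraction(95, 100)   # P(flag | condition)
false_positive_rate = Fraction(5, 100)   # P(flag | no condition)

# Total probability of a flag, summed over both possibilities.
p_flag = true_positive_rate * prior + false_positive_rate * (1 - prior)

# Bayes rule: P(condition | flag) = P(flag | condition) * P(condition) / P(flag)
posterior = true_positive_rate * prior / p_flag

print(posterior, float(posterior))
```

Under these assumed rates, most flagged cases are false alarms even though the model is "95% accurate" on each class, which is the kind of base-rate effect that is easy to miss without the rule.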

>I think it's more like not hiring a big name coding competition winner because they never bothered to learn how to use version control, or any coding best practice, or any language other than C.

Depends on what you're hiring for, but I'll take "competition winner with no version control" over "average programmer with expert VC capabilities".

>Bayes rule isn't some kind of deep magic

Yes, it's largely conceptually obsolete.

The people jamming out weekly SOTA machine learning models on arxiv aren't sitting around meditating on conditional probabilities. They're making little tweaks to giant models that are basically impossible for a human to comprehend.

> Yes, it's largely conceptually obsolete.

I'm sorry, what? How did you arrive at a point where you believe this is true? This is like calling compilers "obsolete."

Is it because you believe deep learning has "taken over" or something?

Try to derive e.g. a face detector from Bayes' theorem. You immediately arrive at computationally intractable sums/integrals. Yet, we have super-human image classifiers. Therefore, Bayes' theorem is obsolete. Sure, you can try to retrofit Bayes' theorem on top of a neural net, but who cares?

> You immediately arrive at computationally intractable sums/integrals.

So we instead sample from that posterior.

Unless you think MCMC is also obsolete, in which case I’ll see myself out.
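
For anyone unfamiliar with what "sample from that posterior" means in practice, here is a minimal sketch of a random-walk Metropolis sampler in plain Python. The problem is deliberately tiny and invented for illustration: the bias of a coin after 7 heads and 3 tails under a uniform prior. The function names, step size, and burn-in length are assumptions of this sketch. The exact posterior here is Beta(8, 4) with mean 2/3, so the sampler can be checked against a known answer; the point of MCMC is that the same loop works when no closed form exists.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Unnormalised log posterior of a coin's bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero probability outside (0, 1)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian random-walk proposal.
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(heads=7, tails=3)
kept = samples[1000:]  # drop an initial burn-in period
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```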

You're right, but a) you have comp efficiency issues with MCMC, and b) just empirically MCMC models don't work as well as gradient descent + NN for many tasks.

And you don't have computational efficiency issues with NNs?

We're also ignoring the benefits of a posterior distribution, which is useful for understanding the data-generating process.

Yeah of course. I can't explain to you why NNs outperform Bayesian approaches; probably NNs are just capturing the correct type of prior for vision/language tasks. And yeah, Bayesian models are more interpretable, but when you have millions of latent variables I'm not sure interpretability is a thing.

Yep, we arrived at my larger point: if you care about interpretability, NNs are horrible and Bayesian techniques are pretty damn great.

Well certainly, but interpretability is obsolete.

Now you're just trolling. :)

I'm not sure I agree with that. I don't know any ML researchers that don't know about probability, but maybe they exist somewhere. Machine learning research isn't a good model for "data science" writ large.

Maybe there are some jobs and some problem spaces where you can just tweak big black box models and you don't ever need to think about what their output means. But if you're the kind of data scientist who helps make decisions with data -- you better believe statistics and probability is conceptually relevant. As soon as models meet the real world, you've got to understand probability in order to know what to expect.

> Yes, it's largely conceptually obsolete.


All the stats stuff you mentioned (and more) is laid out very nicely in the second chapter of "Elements of Statistical Learning" https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Having never been to a data boot-camp -- what _do_ they teach there, if those wouldn't be covered?

Galvanize, the one mentioned in the article and the one that markets the most, is vague on its curriculum but is more scikit-learn oriented. https://www.galvanize.com/san-francisco/data-science#curricu...

Metis, another known bootcamp, does explicitly mention things like k-means, which is something I didn't know: https://www.thisismetis.com/data-science-bootcamps

Having spoken to several students from the Galvanize data science program, as well as looking at their GitHub projects, I am fairly confident they cover those stats subjects in question.

Not to say that a data science boot camp is all you would ever need to know, but I would give them a little more credit in what they teach. That Galvanize link is not really their syllabus; they probably keep it vague to force you to sign up to their email list to get the real one.

I'm not sure which bootcamps you're referring to, but I went to one that went into the topics from your first sentence in depth. The topics from your second sentence were not as thoroughly explored, though.

The stat questions you mentioned are all answered in an introductory modeling book or class. I am surprised those subjects are not even talked about in a data boot camp.

Hi, minimaxir. If confidentiality doesn't prohibit it, I'd be curious to know what the most interesting roles you interviewed for were.

If you had to summarize into three points?

I have one notable observation:

On Hacker News, every time an interview thread pops up, there is a discussion decrying the use of technical screenings before an onsite, with commenters often suggesting a homework assignment that tests practical work instead (which this article does not discuss).

Most of the companies I've talked with for data analyst/science roles have given me both a homework assignment and a technical screen before the onsite. And often a prescreen test before both of them.

There have been a number of occasions where I easily passed the homework screen but failed the technical screen (without any feedback as to why). And it's beginning to get annoying.

Did you find another job yet? If not, what type of opportunity are you looking for? Data science is also a wide, somewhat poorly defined domain.

Still looking. Mostly for any role with a data analyst/data scientist title (i.e. I am not applying for the machine learning/NLP roles which require a PhD and the authorship of several papers since there is no point).

Analyzing the Analyzers, free eBook. Assuming the student is sharp on technical skills, this look at the human side could be helpful to prepare. https://www.amazon.com/Analyzing-Analyzers-Introspective-Sur...

How many data scientist jobs are actually out there? I can understand data scientist being a position at one of the big 5 tech companies, but are they really in demand elsewhere?

I've never actually met someone off the internet who calls themselves a data scientist.

> I've never actually met someone off the internet who calls themselves a data scientist.

You could probably say the same about a lot of job titles. I've never met a sanitation worker but my trash gets picked up once a week nonetheless.

I think the field is seeing a large amount of "title bloat". I've seen job postings for data scientist roles that are only looking for SQL and Microsoft Excel skills. I almost accepted a data scientist role at a small manufacturing company, but a more realistic job title would have been "Marketing Analyst"

Manufacturing, finance, aerospace, the automotive industry, and the medical industry are a few other places hiring data scientists.

We exist. I just presented at a firmwide Data Science Forum for a very large financial firm.

Well, there's a job post on HN right now looking for data scientists :)

Ad-tech, fin-tech, marketing analytics firms all hire data scientists in huge numbers. Media companies as well, increasingly. This is probably NYC-biased.

I think it's a dumb name for what I do so I call myself a programmer if anyone asks.

And in some consultancies too.

Technically anybody who uses Excel is a "data scientist". Just got to get the right buzzwords.

This is absurd and false. This person is an analyst of some sort.

Maybe this holds in the consulting world? It definitely does not hold in the tech world, IME.

> This is absurd and false. This person is an analyst of some sort.

And what is a data scientist then, if their work does not involve analysing data and presenting their analysis?

99.9% of "data science" is exactly what people used to do in tools like Excel, MATLAB, even SQL, just in Jupyter instead. On a Mac while sipping a latte.

> And what is a data scientist then, if their work does not involve analysing data and presenting their analysis?

This is a dead giveaway that you have no idea what you’re talking about. You’ve captured about 5% of my work.

The rest of the time, I’m writing software (ETL pipelines or real-time services, including tests), debugging some distributed system, collecting or cleaning data, or gathering requirements and developing feature specs with other folks.

Fortunately for you, you nailed the Mac-using latte-drinking part!

EDIT: Reading your comment history on regression and k-means, you _do_ know what you’re talking about. It is hard to get models into production, so I’m surprised to see your snark here. What gives? Do you have experience with DS who don’t deliver?

> What gives? Do you have experience with DS who don’t deliver?

I have experience of DS who define what they do by the tools they use, not the results they deliver, it's a pet peeve of mine :-)

Thanks for going back and making the edit!

The thing is: data science requires "scientific" rigor and thought process. A lot of people who hire forget that science is integral to data science: it's right there in the name.

I think this was posted earlier. But some companies really just want a statistician.

Very few companies are actually using their data scientists as scientists, in my experience, except for when I worked at a large hospital. We had a research board, and had to be CITI-certified to study humans. But beyond that..

> But some companies really just want a statistician

In what way is what statisticians do not "scientific"? Setting up and rejecting (or not) the null hypothesis is the very definition of the scientific process...

"Data science" is a field with so much conceptual churn and fads that interviewing for it is a completely ridiculous notion.
