
How To Hire A Data Scientist - mlmilleratmit
http://blog.bright.com/2012/11/13/how-to-hire-a-data-scientist/
======
kevinalexbrown
For recruiters who aren't sure where to find these mythical post-docs who do
science (with "Data"):

1\. Go to a scientific journal that involves serious computational work.

2\. Look at the last author on an article. This is usually the lab boss.

3\. Google "$LASTNAME lab webpage".

4\. Look for graduate student and post-doc profiles.

5\. Email them and offer more than the 38k they make for working 60+ hrs/wk.

6\. Now you have a "Data Scientist"

Alternatively, just Google "Argonne National Labs".

There is no shortage of Data Scientists with years of experience (every
scientist I've talked to has guffawed heartily and asked "what other kind of
scientist is there?"). Yes, I get that the term is just a very unfortunate
choice of words (on the order of General Linear Model vs General _ized_ Linear
Model), but the point is that no one knows to brand themselves this way.
Recruiters are probably just frustrated by the lack of linked-in-ness, and the
fact that most people competent at "Data Science" don't know what that means.

Here are things virtually every PhD in a computational discipline will have
done:

1\. Written code. It might be just Matlab, Python, R, etc.

2\. Written up and communicated the results with compelling visualizations,
both orally and written

3\. Published a paper at some point demonstrating they can do this.

4\. Dealt with failure and hunches that didn't work out after weeks of work.

On the hiring part, after you've found these mythical creatures:

As someone who does science (with Data, even!), here is the best question I
can think of to ascertain my competence: "Diagram an approach to answer a
particular question, with emphasis on ruling out competing explanations and
demonstrating whether a result is true. Sketch some hypothetical
visualizations you'd use that show what your result is, how large the effect
is, and how sure you are that you're right."

I should be able to do that. On the flip side, if an employer can reason with
a post-doc about that process in an intelligent fashion, they would be excited
about leaving academia. You might need to teach them to use a cluster, EC2,
whatever, but you will not have to teach them to ask and answer questions.

~~~
hessenwolf
Erm... isn't a 'Data Scientist' just a shitty name for a Statistician? Maybe I
am biased, but it seems like somebody just came up with an ass-hat new name
for my profession.

~~~
eshvk
There are enough people who come from top notch statistical backgrounds who
don't necessarily have a firm grasp in whatever fancy machine learning
heuristic is in flavor today (which is not a deal killer), or don't know how
to code (which probably is depending on the company)

------
pseut
I think I'm in the minority, but I really hate the term, "data scientist." It
seems usually to mean, "senior statistician, but with training and credentials
expected of an RA" (to clarify, that isn't meant as a comment on the original
article). I would be especially skeptical about hiring someone who _self-
identifies_ as a "data scientist," people are trained as Statisticians,
Biostatisticians, computer scientists, various subspecialties that end in
"-metrician" (e.g. Econometrician, Psychometrician, Cliometrician), etc; no
one is trained as a "data scientist." Unless you're hiring someone really
junior, you want the "data scientist" to have a specialty -- anyone good will
have one.

But the best way to find a good "data scientist" is probably the best way to
find a good programmer -- be one yourself; tap your professional network; and
hire people as consultants/freelancers on non-critical projects before making
a real commitment. Identifying someone with a deep skill that one doesn't
possess oneself is pretty much impossible. And on the flip side, I have
trouble imagining that someone who really knows what he or she is doing would
want to work for some unknown.

If you want someone to scrape and clean data with Perl and generate some
scatter plots and histograms, look for undergrads with good grades who worked
as Research Assistants, or recent grads working as RAs at consulting firms,
research centers, governmental agencies, or think tanks. They'll do great (by
and large), they've had some informal training from a more senior
researcher to help put everything in context, and faculty often steer their
best students into those sorts of jobs, so there's a pretty strong quality
screen. I'm sure there are other places to find people too.

~~~
rm999
>but I really hate the term, "data scientist."

I think most people do, but I've never heard a good term for the job. It's
like "we want someone who can take large amounts of data and do something
awesome with it". What do you call that?

>Unless you're hiring someone really junior, you want the "data scientist" to
have a specialty -- anyone good will have one.

Not sure I agree with this, I want people who are well-rounded. I think it's
great to find someone who specialized at something, but I'd want that person
to be able to grow the rest of his abilities up to par. Example: let's say you
specialized in machine learning. If you don't understand building scalable
systems, you can't take a holistic view of a project; how will you know if
your algorithms can scale to a production environment? Or, if you can't
program well, you can't write code to actually get your algorithms into place.
Or if you can't understand the business side of things, you won't be able to
build trust with the rest of the company, and hence you won't be able to
contribute.

~~~
retroafroman
>"we want someone who can take large amounts of data and do something awesome
with it". What do you call that?

Analyst

~~~
rm999
That's close to what I used to be called. It's a very overused and vague term,
I'd argue far worse than the already bad 'data scientist'. Search job postings
for 'analyst' and see the wide variety of jobs that turn up.

------
rfrey
"1. Tell me about some peer reviewed papers that you published as first
author?"

It could be that the author's first criterion is an important predictor. But
it seems to me that unless somebody is actually _in_ academia, publishing
papers is more akin to a hobby than a professional qualification, especially
given the inherent bias against unaffiliated authors.

Edit: On re-reading, the author (hardtke) writes the above when talking
_specifically about weeding through post-doc applicants_. So my quote is out
of context and my criticism unfair.

~~~
disgruntledphd2
Yeah, I fail that criterion hard (I've had five rejections, does that count?).
Nonetheless, while the author is probably showing his (or her) biases, there's
definitely a nugget of truth there. For a more coding focused candidate, an
equivalent question would be questions about software you have designed, built
and promoted (as that's essentially what the question is asking).

~~~
hardtke
Basically, I want to see if people can finish stuff. For software engineers,
tasks tend to be easily defined and of shorter duration. At least at Bright,
we have Data Scientists working on multiple simultaneous projects that take a
few days to several weeks to complete. Knowing when you are done with a
complicated, possibly open-ended, project is very difficult.

~~~
michaelochurch
_For software engineers, tasks tend to be easily defined and of shorter
duration._

For crappy ones doing commodity work, this is true. For good software
engineers, not so much.

I'm a data scientist by pedigree (before it was called that) who's spent the
past few years in "regular old" software engineering (and probably heading
back in the DS direction). Trust me that software engineering done right is as
subtle and talent-intensive as DS.

The problem is that SWE's are terrible at marketing themselves as a group and
generally get too little respect and autonomy to have architectural successes.

------
snorkel
Data Scientist simply means a developer who also knows statistics. Here's how
to sniff them out:

1\. What is your favorite programming language and why?

2\. Which is your least favorite programming language and why?

3\. Explain how Bayesian spam filtering works.

4\. How do you determine if a given data sample is statistically significant?

5\. Suppose we ran a brand advertising campaign on radio and on television.
Neither ad campaign uses special tracking codes or custom landing pages, both
ads simply mention the web site address. What tools and methodology would you
use to measure the response rate from radio ads vs. tv ads, and predict the
total response that will be generated from each ad?

~~~
a_bonobo
>4\. How do you determine if a given data sample is statistically significant?

That's a weird question - what am I to prove? Are there differences in the
dataset? What does the data look like? Is there a second dataset against which
to compare? What kind of data is it, ordinal or nominal? Is the dataset
normally distributed?

I can't just say "that data sample is statistically significant" without a
background to compare it against!
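
To make that concrete, here is a minimal sketch of one way to ask "is this significant?" once a comparison group exists: a two-sided permutation test on the difference in means. The function name `permutation_test` and the `control`/`treated` numbers are made up for illustration, and a permutation test is only one of several reasonable choices; the right one depends on exactly the questions above.

```python
import random

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative)."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle them and recompute the statistic.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical measurements; without the second group there is nothing to test.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]
p_value = permutation_test(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

Note that the function cannot even be called with a single sample, which is the whole point: "significant" only means something relative to a stated comparison.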

>3\. Explain how Bayesian spam filtering works.

I would say the majority of non-web developer scientists don't know how this
works, why should we?

~~~
tymekpavel
If you've had any exposure to Bayesian statistics, it's very easy to explain
the approach. You're basically tokenizing the email by word, and then
estimating the probability P(email is spam | token).

I would expect any data science candidate to have (at least) a basic
understanding of Bayesian statistics.
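
As a rough sketch of what that looks like in code (a toy naive Bayes classifier, not any particular production filter): count tokens per class, apply add-one smoothing, and combine the per-token evidence with Bayes' rule in log space. The names `tokenize`, `train` and `spam_probability` and the four training messages are made up for illustration, and the sketch assumes both classes appear in the training data.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization is an assumption; real filters do much more.
    return text.lower().split()

def train(messages):
    """messages: iterable of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(tokenize(text))
        n_docs[is_spam] += 1
    return counts, n_docs

def spam_probability(text, counts, n_docs):
    """Posterior P(spam | tokens) under the naive independence assumption."""
    vocab = set(counts[True]) | set(counts[False])
    log_score = {}
    for label in (True, False):
        total = sum(counts[label].values())
        # Log prior: fraction of training messages with this label.
        score = math.log(n_docs[label] / (n_docs[True] + n_docs[False]))
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # tokens never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Normalise in a numerically stable way.
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

# Hypothetical training data.
training = [
    ("win cash now", True),
    ("cheap pills win prize", True),
    ("meeting notes attached", False),
    ("lunch at noon tomorrow", False),
]
counts, n_docs = train(training)
print(spam_probability("win cash prize", counts, n_docs))
print(spam_probability("meeting tomorrow", counts, n_docs))
```

Nothing here is specific to email; swap "token" for any categorical feature and the same arithmetic applies.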

~~~
a_bonobo
It's just that an e-mail filter is the last thing most scientists working in
biology, chemistry, physics, medicine are actually working with, so it's (in
my opinion) a rather unfitting example, you might be selecting against these
people who fit the bill of "data scientist" probably better than the average
"I once did something in Excel"-web guy.

------
mlmilleratmit
"If you want to find a Data Scientist, find yourself a disgruntled postdoc
toiling away on brilliant scientific research, but failing to land a
professorship because … all the professor jobs are taken!"

That quote is more than just humorous, it points out one compelling answer to
the question in the title of the post -- Perhaps all scientists are data
scientists, you simply have to lure them into a new domain of study.

~~~
disgruntledphd2
Possibly, I have certainly made that point before (and it remains my most
upvoted post, so certainly other phd students and postdocs agree with me).

I do think that people need to stop expecting to get a physicist,
statistician, economist or applied mathematician (every graduate analysis job
I've seen had that wonderful qualification) as most of those people already
have really well paying jobs in finance (or satisfying careers in academia),
and open their eyes to the fact that for many data science roles, social
scientists are probably a better fit.

If you're dealing with numbers generated by people's interaction with a
website, a handy background to have is in some form of quantitative social
science. (I am of course horribly biased, by being a quantitatively trained
social scientist).

In any case, I expect data science to go through a dot.com like boom and then
a horrible crash, so it may be a good idea to get the skills (and possibly
qualifications) now, while the sun still shines and people are still
hypnotised by the promise of big data, rather than the tedium and slog of
extracting value from it.

~~~
michaelochurch
_most of those people already have really well paying jobs in finance (or
satisfying careers in academia)_

I've always taken the data scientist distinction to be the startup world's
answer to the "quant" designation. Quant jobs are a lot better than "just a"
programmer jobs, even at hot startups, so I see the "data scientist" title as
an answer to that. It's a startup quant.

~~~
rm999
Yeah, I've actually heard people use the terms interchangeably. Although
'quant' is almost exclusively used in the finance world. I interviewed for a
couple of those jobs and got caught up on stuff I've never heard of before,
like pricing bonds and combining interest rates.

------
rm999
Your article seems to be written from a phd-centric, low-industry-experience
perspective. I'd reconsider this bias if it's real, finding good data
scientists is hard and you may be eliminating good potential candidates.

I've been what you could call a data scientist for over five years now, and
worked with dozens of people you could also call data scientists with
different degrees and varying experience. From my sample, I don't think PhDs
add much, if any, value over masters degrees after a year or two of experience
(I'm biased here, I don't have a PhD). I think industry experience can add a
tremendous amount of value you can't get from a degree, but it comes at a cost
premium. Not related to your article: I've also found the best people have
physics or computer science + applied math backgrounds.

------
saosebastiao
One of the best data scientists I know doesn't have anything near a PhD. Just
a plain old Undergrad.

<http://www.youtube.com/watch?v=-3dw09N5_Aw>

I think there are too many ways to miss the mark when it comes to hiring data
scientists. Looking for PhDs only is just as dumb as looking for people with
10 years of experience with Hadoop. There are some important things that they
need to know, sure, but where they come from is next to meaningless.

------
micro_cam
One problem with hiring academics is that they can be far too focused on their
subdiscipline as this is necessary to break new ground. An astrophysicist
might be amazing at identifying peaks in tremendous amounts of data but have
no idea how to do basic analysis of a graph.

When we interview someone I like to start talking/asking about matrix
decomposition (eigenvectors, SVD, etc.) and see how excited they get.

I consider knowing about things like MDS and PageRank a bare minimum and if
someone can bring up a more recent or esoteric application (locally linear
embedding, graph partitioning) they stand out.
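
For anyone wondering how PageRank connects to the eigenvector question: the rank vector is the dominant eigenvector of the damped link matrix, and power iteration finds it. Below is a minimal sketch in plain Python; the function name `pagerank`, the damping value of 0.85 and the four-node `graph` are assumptions for illustration, not anything we actually use in interviews.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each node passes its rank to its targets in equal shares.
        for src, targets in links.items():
            for dst in targets:
                new[dst] += damping * rank[src] / len(targets)
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank

# Hypothetical link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

Each pass of the loop is one multiplication by the damped transition matrix, so this is power iteration written without ever forming the matrix.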

Asking about the nuances of estimating probability densities from data
(bin/histogram vs kde etc) is another good one and something that stops a lot
of the cooler theoretical statistics and information theory from getting used
(or used well) in the real world.
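
A small sketch of the kind of nuance I mean, in plain Python with made-up data: a histogram estimate at a point depends on where the bin edges happen to fall, while a Gaussian KDE trades that for a bandwidth choice. The function names, the `sample` values and the bandwidth of 0.5 are all assumptions for illustration.

```python
import math

def histogram_density(data, x, bin_width, origin=0.0):
    """Density estimate at x from a fixed-width histogram."""
    k = math.floor((x - origin) / bin_width)
    lo, hi = origin + k * bin_width, origin + (k + 1) * bin_width
    count = sum(1 for v in data if lo <= v < hi)
    return count / (len(data) * bin_width)

def gaussian_kde(data, x, bandwidth):
    """Kernel density estimate at x using a Gaussian kernel."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data) / norm

# Hypothetical sample.
sample = [1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 4.2]

# Same data, same bin width, different bin origin: the estimate at x = 3.0 moves.
print(histogram_density(sample, 3.0, bin_width=1.0, origin=0.0))
print(histogram_density(sample, 3.0, bin_width=1.0, origin=0.5))

# The KDE has no bin edges, but the answer now depends on the bandwidth.
print(gaussian_kde(sample, 3.0, bandwidth=0.5))
```

A candidate who can explain why the first two numbers differ, and what goes wrong when the bandwidth is too small or too large, usually understands the issue.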

Both of these questions get more at "do you understand the basic building
blocks that come up over and over again" more than "is your research
groundbreaking and new." Asking about techniques to do the above at scale also
ups the difficulty of the interview.

------
bane
"Data Scientist" is such a weird title for somebody who is basically a
statistician with good IT skills. I think in some market verticals they're
simply called "Quantitative Analysts" and make _very_ good money.

In a previous job I did lots of analysis on very large (at the time) data
volumes, millions of structured or unstructured records, homo and
heterogeneous datasets. Lots of aggregation, sifting, sorting, simplifying,
deduping, summarizing, etc. All in support of similar kinds of things that
"Data Scientist" positions seem to be intended to support. But the output was
not a statistical model, or a machine learning exercise or something similar.
It was the distillation of gigabytes of data into a handful of slides and a
report. Usually with a virtuous cycle of feedback directly into software
development to improve and expand the next go-around.

But almost no statistics. Very very little, and what I did was very basic
stuff.

What is that kind of job called? In my day we called it a "Data Analyst" but I
don't see that around much.

I'm going to make a prediction, "Data Scientist" as "Senior Statistician" is
going to be short-lived. I don't think they're going to provide the value
companies think they will in most cases. "Data Analyst" is much more general
purpose and useful cross-domains, except most Data Analysts don't have proper
statistical training.

A Data Analyst with statistical training would be a much more useful tool to
an organization seeking to make sense out of large volumes of data than a
Senior Statistician as they'll have a much wider variety of tools at their
disposal than just looking at the world through the statistics lens.

Bonus, jobs advertising "Data Analyst" can _demand_ things like machine
learning AND entity extraction AND automatic summarization AND data sanitation
AND automatic correlation analysis AND automatic colocation analysis etc.

Most of the jobs I've seen looking for Data Scientists are for companies that
are probably going to try and end up using them as high-priced Data Analysts,
except the job reqs are all wrong and the candidates that get hired are _way_
over qualified.

But this role is still evolving I suppose, IBM [1] views it as an evolution
from the business/data analyst. So they definitely seem to be on the side of
_not so much statistics_ and more _analysis_.

1 - <http://www-01.ibm.com/software/data/infosphere/data-scientist/>

~~~
hessenwolf
I'm going to throw it out there that a Statistician with poor IT skills
nowadays is like a carpenter with poor measuring, cutting and hammering
skills.

------
pfanner
I'm actually on the other side: currently writing my bachelor's thesis in
statistical physics, having a lot to do with probability stuff, statistics and
data. I'd like to take a year off and work in a company to get some real life
experience before I start my Master's degree, because I don't want to stay in
academia after that. But I have no idea where I can find companies which could
need my abilities and where I could work for 0.5-1 year. Any ideas? I live in
Germany.

~~~
mlmilleratmit
Most shops are willing to let you work remotely. Anyone that's down with stat.
mech. has a solid base for quantitatively attacking most problems. Luckily the
'domain' knowledge required is general human consumer behavior, which you'll
know quite a bit about already. And a surprising amount of that can be
reasonably modeled by a microcanonical ensemble.

------
cmckay
Are people hiring freelance/part time for data science work? I'm a physics
professor at a small liberal arts college, and would like to move my career in
the direction of data science. Picking up a client or two for smaller/short
term projects would really help, I think.

As an aside, it isn't just the postdocs who are disgruntled. I was awarded
tenure last year, and while there are aspects of the job I love, there are
others that push me toward making a change.

------
se85
JavaScript Ninja, Scrum master and now data scientists?

By their definition I'm a scientist because I've built a few products! I
certainly don't see myself as one.

