
What it's like to be on the data science job market - rrtwo
http://treycausey.com/data_science_interviews.html
======
randcraw
Outstanding post, especially the supplemental info from others. Most of the
opportunities I've seen in DS have also emphasized engineering over science.
(Maybe that's due to my job history.)

I've also wondered what fraction of DS employers use Hadoop but don't have enough
data to warrant it. Certainly the DJIA giant pharma where I work doesn't.

~~~
x0x0
That's bog standard -- every company uses hadoop. Then when you see the actual
datasets, they're _maybe_ a couple hundred gigs completely denormalized. Yet
you still have to use hadoop/hive/spark to access them, with all the
inefficiencies, complexity, and slowness those bring.

One very annoying thing Trey skipped (he covered only the first two) is that
the big divisions in the data science field are data scientist / analysis;
data scientist / builder; and data engineer / ETL. Data scientists' work sits
on top of a giant batch of data engineering, and companies often (imo
intentionally) try to hire data scientists by dangling interesting analysis or
implementation work, but when you dig deep enough or, worse, accept the job
offer, it's really 80%+ data engineering. (And they get pissy when you quit
two months in after discovering this. I quit both because that's not the work
I want to do and because relationships founded on lies tend not to work out
well for employees.)

The other very difficult thing you get is project tests; it's hard to test
something deeply in 5 hours. Even when companies claim to want to test
statistics knowledge, the tests almost always turn out to be dominated by data
ingestion/cleaning work. Or they're simply too much work, e.g. Stitchfix wanted
me to spend 10+ hours implementing an analysis after just speaking to a
recruiter, without even having spoken to one of their data scientists because
they were "too busy". The recruiter was grumpy when I stopped responding to
email.

~~~
mistermann
> Then when you see the actual datasets, they're maybe a couple hundred gigs
> completely denormalized. Yet you still have to use hadoop/hive/spark to
> access them, with all the inefficiencies, complexity, and slowness those
> bring.

I was always under the impression that one of the benefits of NoSQL was its
speed, but while watching a webcast the other day that queried a very small
dataset, I was shocked at how slow it was. This was in contrast to another
demo where a different query was mind-bogglingly fast compared to a
traditional SQL platform. (Yes, I know the particulars matter here and it's
not that good of a question without that specificity, but any light you could
shed on this would be appreciated.)

For data of "a couple hundred gigs", what platform would you say is more
appropriate?

~~~
x0x0
No, the benefit of nosql, at least for data science, is scalability, i.e. what
do you do when you can't fit the data on a single machine. This worked great
at a former employer, who really did have PB-scale datasets. The vast, vast
majority of companies do not have PB-scale datasets. Most don't have TB-scale
datasets.

As for what to do instead: postgres / mysql; pandas / R; or roll your own
code, depending on precisely what you need. You can rack a pretty beefy box
with 256 GB of RAM, 2 Xeons, and a ton of SSD + spindle disk for $10k. At
these data sizes, there's nothing nosql or hadoop or spark do that can't be
done more easily, written way faster, executed faster, and kept running more
easily on a single box or, even better, in a single process.

For example: at my current gig, I work on 20-40 GB raw datasets. Ingesting
them into pandas and externalizing the user agent strings drops that to 5 GB
or so. That process takes 30 to 60 minutes, but I do it once, cache the
results, and update incrementally.
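
A rough sketch of what that ingest step might look like, for anyone who hasn't
done it. This is an illustration, not the commenter's actual code: it assumes
"externalize" means replacing each repeated user agent string with a small
integer id plus a separate lookup table, and the column names, the sample rows,
and the helper `externalize_user_agents` are all made up. Pickle is used for
the cache only because it needs no extra dependencies.

```python
import io
import os
import tempfile

import pandas as pd


def externalize_user_agents(df, column="user_agent"):
    """Swap a repeated string column for int32 ids plus a lookup table.

    Hypothetical helper: returns (compact_df, lookup), where
    lookup.iloc[i] is the original string for id i.
    """
    codes, uniques = pd.factorize(df[column])
    compact = df.drop(columns=[column])
    compact[column + "_id"] = codes.astype("int32")
    lookup = pd.Series(uniques.to_numpy(), name=column)
    return compact, lookup


# Tiny stand-in for a raw log file (made-up columns and rows).
RAW = io.StringIO(
    "ts,user_agent\n"
    "1,Mozilla/5.0 (X11; Linux x86_64)\n"
    "2,Mozilla/5.0 (X11; Linux x86_64)\n"
    "3,curl/7.35.0\n"
    "4,Mozilla/5.0 (X11; Linux x86_64)\n"
)

raw = pd.read_csv(RAW)
compact, lookup = externalize_user_agents(raw)

# Do the slow step once and cache the result on disk.
cache_path = os.path.join(tempfile.mkdtemp(), "compact.pkl")
compact.to_pickle(cache_path)
reloaded = pd.read_pickle(cache_path)

# The original strings are still recoverable from the lookup table.
restored = lookup.iloc[reloaded["user_agent_id"]].to_numpy()
```

The savings come from storing each distinct string once; how much you save on
real logs depends on how repetitive the user agents are. An incremental update
would have to reuse the existing lookup table so ids stay stable, which this
sketch does not attempt.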

------
RogerL
For anyone who isn't a data scientist, much of this applies to any interview
experience, especially the latter part of the article.

Recruiters and companies tend to flip out when I push back, but you know what?
That is an excellent signal that this is the wrong job for you. I only have a
very brief time to form an impression of your company and your team; why come
to me with behavior that is irrational or unsupported by the data? I really
don't understand it. Everyone claims hiring is really hard, and then they do
everything they can to alienate the interviewee, then wonder why the offer is
turned down, or why the person failed to perform on demand some stupid coding
trick they last saw 20 years ago, maybe, in a classroom.

Make the interviewee like you and want to work for you. That shouldn't be hard
to understand. Then figure out what work you need to have done, and talk to
them about it. It'll be readily clear in most cases. If you are lucky and land
a live one, their mind will stray far beyond the constraints of your little
problem; they will have pretty much grasped your business and your problems,
and be full of ideas for how to improve them all. If not, you probably still
have a good worker (if they are able to do the work, i.e. didn't lie on their
resume).

I would add to this article - do what you can to see the source code[1]. If
you can't, often questions can expose what it is like. Most won't give good
answers, but if you are put through the normal wringer one of the 6-12 people
you talk to will be fairly open and honest. Every place has warts and
limitations - the question is whether these are due to unavoidable tradeoffs
(jump on board), or a horrible culture/infrastructure (run away unless you are
being very, very well compensated to fix the problem).

[1] trawl the github/bitbucket page of every engineer if you have to, or of
course the company's pages if they do open source. It's surprising how much
undocumented spaghetti is released by companies in 'support' of their
products. I'm mulling pulling it up on a laptop and doing a little code review
if the questions for me get silly. But realistically, I'll probably not accept
the offer to interview if it is really bad.

------
mikeskim
Has anyone tried asking technical questions back? E.g. a list of Putnam
questions.

~~~
rcpt
Can you fit a parabola of arc length 4 inside the unit circle?

