
Beware the data science pin factory: The power of the data science generalist - ericcolson
https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/
======
reilly3000
I really wish hiring managers read this. I am a data generalist, and have had
no traction with obtaining even an interview for a data science job. I’ve
set up a private JupyterHub where I run Python ETL, interactive models, and
dashboards. I deployed Metabase several times and have written hundreds of SQL
queries. I’ve used Tableau with gigantic datasets. I built a front end
serverless analytics pipeline from scratch with AWS that handles 30M
events/mo. I've demonstrably grown revenue and margins in multiple contexts
with my data products. I’m working on making a fully dynamic frontend for
content recommendations. I have self-taught all of these skills in the past 3
years after a decade in sales, marketing, and entrepreneurship. What I haven’t
done: a CS/math degree (mine was music), graduate work, or tech work at a
household name. Lived in the Bay Area. Gotten an interview for any data job.
Sigh.

~~~
apohn
Have you tried paid services/consulting arms of software or cloud companies?
Teams that bill customers at an hourly rate? They generally look for
generalists who can help customers tackle problems at different levels of the
stack. They aren't looking for PhDs in statistics.

When I interview people who have your type of background, I tend to get
confused by what exactly it is the person wants to do (Analyze Data? Build an
Analytics Pipeline/Architecture? Write Software/Services? Be an Analytics IT
person?). Sometimes they have to talk about all the "cool" stuff they've done
and lose focus on what they bring to the role. I also become skeptical because
it's really easy nowadays to follow a few tutorials on 50 different things and
then boast about how you did it all yourself.

Even reading your comment, you don't sound like somebody who wants to analyze
data.

As far as being a generalist, I definitely agree that it's good to have
somebody with skills in ETL, analyzing data, and _maybe_ building a software
service. But what happens is that all those things happen at different speeds
and then people get crushed. You're asked to investigate some data quickly
over 3 days, but suddenly the software service you built is having issues and
you need 2 weeks to dig in and fix it, and also your ETL job is overloading
the server, so you need to fix it yesterday, but you need help from somebody else
to figure it out. Oh, and that Metabase thing you installed is broken and the
VP was using it and has a big demo tomorrow.

~~~
rchaud
> I tend to get confused by what exactly it is the person wants to do (Analyze
> Data? Build an Analytics Pipeline/Architecture? Write Software/Services?

But this is the crux of the job-seeker's dilemma. If he/she is specific about
their interests when speaking to an interviewer, they might get a response
like "well, we're really looking for someone whose operational focus is
[something else]".

And if they're not super-specific (I doubt anyone does data analysis
exclusively without any other involvement in the project), but instead attempt
to give examples where they had demonstrable impact working across a number of
domains, you might hear a response like this:

> Even reading your comment, you don't sound like somebody who wants to
> analyze data.

~~~
apohn
I don't think a person has to be super specific or say "I am interested in X
and Y." What they do need is consistency in a resume so the
reviewer/interviewer can evaluate what their primary and secondary focus areas
are. Sometimes you have to leave stuff out. For example, Data Scientists
typically don't deploy and maintain a Data Science stack unless it's a really
small company. And a Data Science Infrastructure person at a bigger company
probably isn't analyzing data unless they are just playing around to validate
their stack.

But if you have a resume (or say this during an interview) that gives equal
weight to the data analysis and the stack deployment, it's just confusing to
the person reading it. Especially in Data Science, which is already confusing
from a skillset perspective. Lots of resumes look like the applicants just
thought 10 things with minimal overlap were cool and decided to put them on
their resume.

Even if you did work at a 5-person startup and had the unofficial title of
"Data Scientist, Data Engineer, Data DevOps, DB Admin, and Chief Data Officer"
I'd recommend you downplay some of those based on the jobs you are applying
for. Figure out what is essential and what is +1.

------
jonathankoren
I’m not sure this is entirely true. The author is arguing for full stack
scientists, and I prefer those people, but they’re hard to find, and even then
you don’t necessarily want them doing everything. Worse yet, if you put
someone in a full stack position, and they’re not already full stack, you need
to budget a lot of mentoring, because if you don’t, you’re going to get a big
pile of unmaintainable code.

The author kind of builds a strawman of super specialized data scientists that
constantly throw code over the wall to someone else. That doesn’t work, and
you simply can’t do that unless your headcount is in the thousands. You have
to have people that can productionize their work. At the same time, he’s
arguing that scientists should be maintaining their own data
infrastructure, but that’s not good either.

The best advice I was given was to hire people either to make you smarter, or
to make you stronger/faster. You hire data scientists and ML experts to make
you smarter. They should be working on problems that you can’t solve today.
Infrastructure on the other hand, isn’t your product. It’s overhead. It’s a
tool. Comparatively, it’s easier to hire people to build and maintain your
infrastructure. Hire people to do that. All the time your scientists are
dealing with infrastructure, is time they could be doing useful work.

All that said, know when you should just shove the infrapeople aside and do it
yourself.

~~~
gowld
Infrastructure isn't your product. Why build an infrastructure instead of
buying it?

~~~
perpetualpatzer
Not OP, but given the context, it seems OP is using infrastructure to mean
"all prerequisites to doing ML/data analysis work."

Some of that (e.g. datawarehousing, etc.) is easier to outsource; other parts
(data acquisition from your product, ETL design, etc.) are necessarily bespoke
to your company and thus not readily "buyable." I understand OP to be arguing
roughly "you can get a good DBA for much cheaper than you can get a good ML
Engineer (much less a good ML Engineer who's ALSO a good DBA), so there's no
sense in making Database management part of the Data Scientist role."

~~~
jonathankoren
You have correctly understood what I am saying.

------
sgt101
A very good article, but I think that there is a missing concept - which is
organisational maturity. In a fully mature data driven organisation (like...
errm Google I guess - reading Jeff Dean's papers anyway) there is a well
developed data fabric, polished processes for providing credentials and
authority, right sized resourcing pools and also substantial diversity of
specialisation coupled with experience and domain insight. Specialists can
flourish and deliver value out of proportion to their costs. In other, less
developed, organisations there's no chance this will happen and specialists
will be left floundering looking for the setting in which they can do their
thang.

~~~
zwieback
> A very good article, but I think that there is a missing concept - which is
> organisational maturity.

Maturity and also scale - I suppose a small or even one-man shop requiring a
generalist could be mature. Once you get to a certain size specialization
happens automatically.

~~~
sgt101
I agree that a one man shop can be "mature"; but there are many very large
scale operations that have cultures that absolutely preclude speciality.

------
mmsimanga
Article's sentiments are also true for Business Intelligence. The most
effective (I deliberately used the word effective) BI developers have the
following qualities: interested in the business, able to chat to clients
(emotional intelligence) and also able to code. The best BI people end up
being generalists. Talkative nerds who can converse with business types and
from the business end, you get the business people who are genuinely curious
and willing to learn some SQL.

Being able to communicate is key in BI because this enables you to focus on
the right business problems.

------
opportune
I agree and disagree with this post. I do think data scientists need to be
better at data processing and do more of it. But I still think you do need a
separation of labor between people setting up pipelines and people building
models from the data. The real issue is that there are a lot of data science
departments where they whittle away at their models in some notebook and then
they're "done" once the notebook is showing the right metrics. Data scientists
should be writing their models from the beginning so that they can
productionize them once they are finished. There shouldn't be frequent hand
off events requiring lots of communication between DS, pipelines, and data
engineering teams, there should be an integration process set up so the flow
of work continues to function without intervention.

------
thekhatribharat
Interestingly, the article doesn’t talk about the _scale of production_ and
its effects on _productivity_. When you produce _lots of pins_, division of
labor is a known way to increase productivity.

A data science generalist may work fine for a small data shop but as you grow
and expand data science in your organization, we know the next step to
increase productivity involves _specialization_ (AKA division of labor). It
happens not just in data science, but in all business functions and with all
business roles.

Marketing, Sales, Finance, Engineering, Operations - every business function
uses specialization to get productivity gains. So while _generalists_ may work
for you if you’re a small business or a large business spinning up a new
business function, specialization is a proven economic tool for productivity
gains as you grow.

Interestingly, as a business function grows, the communication costs and the
ensuing delays increase and this is a known side-effect of specialization
within that business function. This doesn’t mean one throws away
specialization and runs to the other extreme of the spectrum with their use of
_generalists_. There’s a tradeoff organizations make here and there’s been a
lot of experimentation done in this space, like _Amazon's two-pizza teams_
([https://zurb.com/word/two-pizza-team](https://zurb.com/word/two-pizza-team)),
_Spotify’s Squads_, etc. These organizational structures are not
universally applicable but they’re interesting developments to look at.

_Shameless Plug_ (on the current state of the data science market):
[https://medium.com/open-factory/state-of-the-m-art-big-data-...](https://medium.com/open-factory/state-of-the-m-art-big-data-analytics-2396c321d7b9)

------
natalyarostova
I generally agree with this article, and I am, and continue to aspire to be, a
strong generalist data scientist. However, I do still enjoy/need to have 1 or
2 really really strong quants/statistician types on my team, since they are
able to solve certain problems at a level of depth I can't reach. However, if
they aren't supported by generalists, they also struggle to make impact.

~~~
UncleEntity
Yes, indeed, the main issue TFA missed out on is comparative advantage.

Theoretically speaking...it's much more efficient to have the specialists
doing what they do best instead of trying to learn how to optimize SQL queries
or whatever.

------
bpyne
This sounds suspiciously like the battle software developers have been waging
with people who want to run software development in a manufacturing model. The
battle itself really sucks the love of making something right out of you.

------
mempko
The author points this out at the end but I want to highlight it. Adam Smith
also said that division of labor makes a person "as stupid and ignorant" as a
person can become.
[https://www.pitt.edu/~syd/ASIND.html](https://www.pitt.edu/~syd/ASIND.html)

------
lincpa
I'm a Financial Analyst, CPA, CIA, CTA, Statistician, Expert System Developer.

I independently developed a financial analysis expert system, with a strong
ability to innovate and execute.

All my expertise is entirely self-taught.

My Project:
[https://github.com/linpengcheng/fa](https://github.com/linpengcheng/fa)

My technology Blog:
[https://github.com/linpengcheng/PurefunctionPipelineDataflow](https://github.com/linpengcheng/PurefunctionPipelineDataflow)

~~~
lincpa
@pirocks, don't you think this is a case of a data science generalist creating
good products? What kind of psychology makes you give a downvote? Why delete
your comments again!

------
metakermit
Wow, this is a cool "interactive whitepaper" website :)

[https://algorithms-tour.stitchfix.com/](https://algorithms-tour.stitchfix.com/)

------
tomrod
This works in environments where infrastructure can support it. It can be
downright blissful!

------
mlthoughts2018
This article is terrible. You can’t make a case by putting a bunch of
unsupported assertions into section-heading fonts and then just filling in
paragraphs.

This reads like a desperate business person wrote it, who wishes that one
full-stack set of drives made sense and coexisted in a single person to make
that labor cheaper and more of a commodity, despite the reality that it’s simply
not true.

The person who spent the time to master web service frameworks, query
languages and product engineering _necessarily_ did not also master
professional level knowledge of deep learning or MCMC sampling or natural
language processing.

The two types of people need to coexist and work symbiotically, but it’s just
asinine wishful thinking to pretend like they are the same person, let alone
to write a baseless essay full of assertions that if they aren’t the same
person it somehow results in first principles economic inefficiency.

~~~
gk1
You'll have to do better than an _ad hominem_ + "the opposite is true."
Author is Chief Algorithms Officer at Stitchfix and former VP Data Science &
Engineering at Netflix.

~~~
mlthoughts2018
No, sorry. Argument from authority doesn’t mean the original article has a
cogent point.

There’s no burden on anyone to refute anything from this piece, as the piece
itself has not met any basic requirement of presenting facts or evidence in
the first place.

It’s merely a matter of fact to point out this deficiency of the article. The
premises of the article _could still be accurate_ (though I think that is
fleetingly unlikely), but even if so, _this article_ does not justify any of
those claims, so nobody could know one way or the other from _this article_.
Again, this is just a matter of observation of the justifications given.

This author would personally find it more convenient if the skillset of data
scientists and data platform engineers coexisted in one person who also
happened to have the drive to undertake employment spanning all those skill
sets, and wouldn’t become unhappy if the employer did not respect
specializations. So this author has decided to read tea leaves out of economic
principles and superimpose this wish as if it was justified by some first
principles analysis.

In fact, this wishful thinking seems exactly in line with the flawed
perspective that executives or director level employees will have. They don’t
want to have to care about motivation and intellectual curiosity required to
keep certain kinds of knowledge workers happy & productive, and spend lots of
time trying to justify how their business units embody corporate platitudes
about customer-driven passion. It’s quite easy to see why they would fall
victim to this sort of naive wishful thinking. It’s quite similar to CTOs
getting suckered by turn-key consulting solutions. It’s not even surprising
that VPs & C-suite executives would be very wrong about this type of work.

~~~
thekhatribharat
> _Author is Chief Algorithms Officer at Stitchfix and former VP Data Science
> & Engineering at Netflix._

This is what I was referring to when I said _argument from authority_ :)

From
[https://en.wikipedia.org/wiki/Argument_from_authority](https://en.wikipedia.org/wiki/Argument_from_authority)
: _a fallacy to cite an authority on the discussed topic as the primary means
of supporting an argument_

